Computational Phytochemistry [1 ed.] 0128123648, 9780128123645

Computational Phytochemistry explores how recent advances in computational techniques and methods have been embraced by

English Pages 364 [366] Year 2018

Computational Phytochemistry

Computational Phytochemistry

Edited by

Satyajit D. Sarker Lutfun Nahar

Elsevier
Radarweg 29, PO Box 211, 1000 AE Amsterdam, Netherlands
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

© 2018 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

ISBN: 978-0-12-812364-5

For information on all Elsevier publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Jon Fedor
Acquisition Editor: Anneka Hess
Editorial Project Manager: Michelle W. Fisher
Production Project Manager: Omer Mukthar
Cover Designer: Greg Harris
Typeset by SPi Global, India

Dedicated to our parents and teachers.

Contents

Contributors
Foreword
Preface

1. An Introduction to Computational Phytochemistry
Satyajit D. Sarker, Lutfun Nahar
  1.1 Introduction
  1.2 Computational Phytochemistry
  1.3 Techniques, Theories, and Applications of Computational Phytochemistry
    1.3.1 Kohonen-Based Self-Organizing Map
    1.3.2 Density Functional Theory
    1.3.3 Docking Experiments and Virtual Screening (In Silico Screening)
    1.3.4 Structure Prediction and Structure Determination
    1.3.5 Chemometrics and Principal Component Analysis
    1.3.6 Data Mining and Databases
    1.3.7 Response Surface Methodology in Optimization of Extraction of Phytochemicals
    1.3.8 Computation in Isolation of Phytochemicals
    1.3.9 Miscellaneous
  1.4 Conclusions
  References

2. Prediction of Medicinal Properties Using Mathematical Models and Computation, and Selection of Plant Materials
Sanjoy S. Ningthoujam, Anupam D. Talukdar, Satyajit D. Sarker, Lutfun Nahar, Manabendra D. Choudhury
  2.1 Introduction
  2.2 Mathematical Models
  2.3 Computational Models in Drug Discovery
    2.3.1 Structure-Based CADD
    2.3.2 Ligand-Based CADD
    2.3.3 Network Pharmacology
  2.4 Selection of Medicinal Plants
    2.4.1 Ethnobotany-Directed Drug Discovery
    2.4.2 Chemotaxonomic and Ecological Approach
    2.4.3 Random Approach
    2.4.4 Integrated Approach
  2.5 Role of Medicinal Plants Databases
  2.6 Tools and Techniques
  2.7 Role of Data Mining in Medicinal Plant Selection
  2.8 Safety Considerations
  2.9 Conclusion
  References

3. Optimization of Extraction Using Mathematical Models and Computation
Anup K. Das, Saikat Dewanjee
  3.1 Introduction
  3.2 Fundamentals of Design of Experiments
    3.2.1 Planning Phase
    3.2.2 Designing Phase
  3.3 DoE-Based Optimization of MAE Process
  3.4 DoE-Based Optimization of Supercritical Fluid Extraction Process
  3.5 DoE-Based Optimization of Accelerated Solvent Extraction Process
  3.6 Conclusions
  References

4. Application of Computational Methods in Isolation of Plant Secondary Metabolites
Mukhlesur Rahman
  4.1 Introduction
  4.2 Computational Methods in Natural Products Isolations
    4.2.1 Automated Flash Chromatography
    4.2.2 High-Performance/Pressure Liquid Chromatography
    4.2.3 Ultra-Pressure/Performance Liquid Chromatography
    4.2.4 Counter Current Chromatography
    4.2.5 Capillary Electrophoresis
    4.2.6 Hyphenated Techniques
  4.3 Conclusion
  References
  Further Reading

5. Application of Computation in Building Dereplicated Phytochemical Libraries
Lutfun Nahar, Satyajit D. Sarker
  5.1 Introduction
  5.2 Compound Library
    5.2.1 Combinatorial Library
    5.2.2 Phytochemical Library
  5.3 Dereplication
  5.4 Application of Computation in Building Dereplicated Phytochemical Libraries
  5.5 Conclusions
  References

6. High-Throughput Screening of Phytochemicals: Application of Computational Methods
Fyaz M.D. Ismail, Lutfun Nahar, Satyajit D. Sarker
  6.1 Introduction
  6.2 The Pre-HTS Era
  6.3 High-Throughput Screening
    6.3.1 Reaction Monitoring and Observation
    6.3.2 Advances in Monitoring In Vivo
    6.3.3 Location of Facilities
    6.3.4 Is There a Difference Between So-Called Leads and Drugs?
    6.3.5 Visualization of Data
    6.3.6 Dose–Response Analysis
    6.3.7 Examples of HTS Success
  6.4 HTS Platforms for Natural Products/Phytochemicals
    6.4.1 What is a Natural Product?
    6.4.2 Natural Products for Increasing Diversity
    6.4.3 Natural Products Sample Preparation
    6.4.4 Examples of HTS Platforms for Natural Products/Phytochemicals
  6.5 High-Content Screening
  6.6 Conclusions
  References

7. Prediction of Structure Based on Spectral Data Using Computational Techniques
Fyaz M.D. Ismail, Lutfun Nahar, Satyajit D. Sarker
  7.1 Introduction
    7.1.1 History of Spectroscopy
    7.1.2 Misassignments of Structures: A Rarity or More Common Than Expected?
  7.2 Structure Elucidation Strategies
  7.3 What is Density Functional Theory?
  7.4 Era of Assignment Versus Prediction
    7.4.1 Nuclear Magnetic Resonance
    7.4.2 Computational Mass Spectrometry
    7.4.3 Chiral Centres
    7.4.4 Structure by Calculations
    7.4.5 UV Spectroscopy
    7.4.6 Infrared (IR) Spectroscopy
    7.4.7 Database Search Algorithm
  7.5 Can Raman Be Used for Automated Assays and HTS?
  7.6 X-Ray Sponge Technique
  7.7 Conclusions
  References
  Further Reading

8. Application of Mathematical Models and Computation in Plant Metabolomics
Denis S. Willett, Caitlin C. Rering, Dominique A. Ardura, John J. Beck
  8.1 Introduction
  8.2 Create Clarity From Chaos—Mindset
  8.3 Analytical Tools
  8.4 Experimental Considerations
    8.4.1 Data Collection Considerations
    8.4.2 Instrumentation
    8.4.3 Sample Preparation
    8.4.4 Analysis Modalities
    8.4.5 Throughput in Plant Metabolomics
    8.4.6 Data Structures
  8.5 Analysis
    8.5.1 Data Processing
    8.5.2 Unsupervised Approach
    8.5.3 Supervised Approach
    8.5.4 Inference
  8.6 Metabolomics in Agriculture
  8.7 Conclusions
  References

9. Application of Computation in the Biosynthesis of Phytochemicals
Nilanjan Adhikari, Sk Abdul Amin, Tarun Jha, Achintya Saha
  9.1 Introduction
  9.2 Genome-Mining Tools
  9.3 Computational Tools and Databases for Identification and Analysis of BGCs and Secondary Metabolites
    9.3.1 BACTIBASE
    9.3.2 DoBISCUIT
    9.3.3 MIBiG
    9.3.4 IMG-ABC
    9.3.5 ClustScan Database
    9.3.6 ClusterMine360
    9.3.7 antiSMASH
    9.3.8 SMURF
    9.3.9 BAGEL
    9.3.10 NaPDos
    9.3.11 MultiGeneBlast
    9.3.12 eSNaPD
    9.3.13 NRPSpredictor
  9.4 Computational Tools for Metabolomics Study
    9.4.1 Cycloquest
    9.4.2 NRPquest
    9.4.3 RiPPquest
    9.4.4 Pep2Path
    9.4.5 GNPS
    9.4.6 Dereplicator
  9.5 Tools for Prediction of Biochemical Pathways
    9.5.1 From Metabolite to Metabolite
    9.5.2 Biochemical Network-Integrated Computational Explorer
    9.5.3 RetroPath
    9.5.4 DESHARKY
    9.5.5 Cho System Framework
  9.6 Chemical Compound Databases
    9.6.1 Dictionary of Natural Products
    9.6.2 StreptomeDB
    9.6.3 Norine
    9.6.4 ChEBI
    9.6.5 ChEMBL
    9.6.6 PubChem
    9.6.7 ChemSpider
  9.7 Overview and Conclusions
  References

10. Computational Aids for Assessing Bioactivities
Evelyn Wolfram, Adriana Trifan
  10.1 Introduction: Computational Aids in Science and Their Role in Bioactivity Studies of Natural Products
  10.2 Strategies for Separation and Identification of Bioactive Natural Compounds for Drug Discovery
  10.3 Bioactivity Assessment in Phytochemistry
    10.3.1 Protein-Based In Vitro Models
    10.3.2 In Vitro Cell Culture Models
    10.3.3 In Situ and ex vivo Models
    10.3.4 Animal Models
  10.4 Computational Tools for Data Analysis From Metabolomics and Bioactivity Assessment Data in Natural Product Research and Drug Discovery
  10.5 Data- and Text-Mining Strategies
  10.6 Virtual or In Silico Screening of Natural Products
  10.7 Application Example of an In Silico Assessment of Bioactivities on the Example of the Cannabinoid Receptor 2
  10.8 Overview of Software and Web-Tools for Bioactive Phytochemicals Research
  10.9 Conclusions
  References

11. Virtual Screening of Phytochemicals
Manabendra D. Choudhury, Walid A. Atteya, Keshav Dahal, Pankaj Chetia, Karabi D. Choudhury, Anant Paradkar
  11.1 Introduction
    11.1.1 Artificial Neural Networks (ANNs)
    11.1.2 Application of ANNs in Pharmaceutical Science
    11.1.3 ANNs in Predicting Bioactivity
    11.1.4 Gossypol and its Derivatives
  11.2 Materials and Methods
    11.2.1 Input and Output Vector Definition for Data
    11.2.2 Software and Hardware Environment
    11.2.3 Modelling Procedure
    11.2.4 Training and Test Data Set
    11.2.5 Experimental Data Set
    11.2.6 Docking Experiment
  11.3 Results and Discussion
  11.4 Conclusions
  Acknowledgements
  References

Index

Contributors

Numbers in parentheses indicate the pages on which the authors’ contributions begin.

Nilanjan Adhikari (255), Jadavpur University; University of Calcutta, Kolkata, India
Sk Abdul Amin (255), Jadavpur University, Kolkata, India
Dominique A. Ardura (231), Independent Scientist, Davis, CA, United States
Walid A. Atteya (301), University of Bradford, Bradford, United Kingdom
John J. Beck (231), U.S. Department of Agriculture, Gainesville, FL, United States
Pankaj Chetia (301), Assam University, Silchar, India
Manabendra D. Choudhury (43, 301), Assam University Silchar, Cachar, India; University of Bradford, Bradford, United Kingdom
Karabi D. Choudhury (301), Assam University, Silchar, India
Keshav Dahal (301), University of Bradford, Bradford, United Kingdom
Anup K. Das (75), ADAMAS University, Kolkata, India
Saikat Dewanjee (75), Jadavpur University, Kolkata, India
Fyaz M.D. Ismail (165, 193), Liverpool John Moores University, Liverpool, United Kingdom
Tarun Jha (255), Jadavpur University, Kolkata, India
Lutfun Nahar (1, 43, 141, 165, 193), Liverpool John Moores University, Liverpool, United Kingdom
Sanjoy S. Ningthoujam (43), Assam University Silchar, Cachar, India
Anant Paradkar (301), University of Bradford, Bradford, United Kingdom
Mukhlesur Rahman (107), University of East London, London, United Kingdom
Caitlin C. Rering (231), U.S. Department of Agriculture, Gainesville, FL, United States
Achintya Saha (255), University of Calcutta, Kolkata, India
Satyajit D. Sarker (1, 43, 141, 165, 193), Liverpool John Moores University, Liverpool, United Kingdom
Anupam D. Talukdar (43), Assam University Silchar, Cachar, India
Adriana Trifan (277), Grigore T. Popa University of Medicine and Pharmacy, Iaşi, Romania
Denis S. Willett (231), U.S. Department of Agriculture, Gainesville, FL, United States
Evelyn Wolfram (277), Zurich University of Applied Sciences (ZHAW), Wädenswil, Switzerland

Foreword

It is a great pleasure to write the foreword to Computational Phytochemistry, as the topics for this text have been carefully selected from key researchers in the area, are beautifully written and edited, and contain the current state of the art with respect to computational methods (CM) in natural product science. Professor Sarker and Dr Nahar have invited experts to contribute to 11 chapters covering the uses, methods, and expected developments in Computational Phytochemistry. This text uses the definition of Computational Phytochemistry to cover modelling, data interpretation and handling, storage and the current advances on experiments for virtual screening, structure prediction, principal component analysis, and even the use of computation as an aid in phytochemical isolation.

Of course, the starting point of phytochemical research has to be the appropriate selection of a target plant species so that the likelihood of the discovery of a drug is increased. Whilst serendipitous discovery has played an important historical part in natural product drug discovery, targeted approaches using CM will greatly increase the success rate of drug discovery; this has to be true as natural products offer greater chemical diversity, through chirality and functionality, than synthetic chemical libraries. Appropriate medicinal plant selection is key and even the ADME behaviour of drug molecules can be predicted by CM. Mathematical modelling and CM are increasingly being used in many areas of phytochemical research and herein are covered in considerable depth.

Once plants are selected, computational methods can also help to optimize extraction methods; failure to adequately extract biomass of any type is often the major reason for lack of expected activity from an ethnomedical plant species. Mathematical models in this area are vital as with poor extraction ‘the game is effectively over’ before it has begun.
Extraction errors from researchers using degrees of variability contribute to poor phytochemical capture, and the good design of experiments is well covered to minimize these issues. Computational uses in automation for isolation techniques have also greatly changed and benefitted phytochemical science. The subject of compound isolation is beautifully dissected, and then the application of computer techniques is described, with examples from many classic methods ranging from flash chromatography to UPLC.


One of the greatest issues faced by any natural product drug discovery program, past or present, is that of dereplication. How do we avoid reinventing the wheel by isolating a well-known pharmacologically promiscuous compound that is a poor drug candidate? The building of real and virtual dereplication libraries is how this can be achieved and the computational methods for this are obvious and in continuous development and improvement. High-throughput screening (HTS), whilst actually a relatively old technique, has also benefitted from computer-assisted monitoring, as have our in vivo assays, such as robotic systems to study zebrafish in the presence of drugs. These subjects are very well presented in this book. Our ability to visualize data and closely monitor dose responses in phytochemical assays has also benefitted and contributed to successes in HTS.

Hot topics of great current interest are also covered in this beautiful text. Computational prediction of structures based on spectral data has made great advances in the areas of NMR, MS, conventional spectroscopic techniques such as UV and IR spectroscopy, and even newer techniques such as X-ray sponge analysis. In the area of plant metabolomics, CM and mathematical models have helped enormously in the data curation, capability and rapidity of analyses, and lowering of costs of processing and procuring data, which is ever increasing in terms of volume. These methods will have a profound impact on food supply, particularly given climate changes and their impact on agriculture. Tools and techniques are provided here to answer questions in your discipline and to impart a way of thinking and adjustment of our mindset to problem solutions. This will have obvious benefit and impact on future research questions in many areas of phytochemistry and natural product science in general. Computational methods are described that allow prediction of biosynthetic capability based on a plethora of available tools.
We are at a point where the ease, rapidity, and cost in this area are much changed from 3 years ago. Genome mining enables natural product biosynthetic gene clusters to be identified and/or predicted. Even whole genome sequencing (WGS) benefits from the area of Computational Phytochemistry, and whilst we still need appropriate macroscopic identification of organisms and their apposite curation, which is invariably computer-driven, DNA extraction, WGS, and data storage are becoming the norm for all organisms as methods improve and costs lessen.

The speed at which we can now assess the bioactivities of extracts and compounds is astonishing, almost entirely due to computationally assisted advances. Computational aids have allowed scientists to guard and represent the results of scientific experiments and to support scientists’ cleverness in the discovery process by mathematical muscle and artificial intelligence. CMs have impacted so many areas: analysis of data, data mining, the capability to identify the active from a huge background noise of inactive natural products, virtual screening and identifying functional group importance in QSAR and, excitingly from many perspectives, in exemplis, herbal medicinal products; the investigation of the wonderful phenomenon of synergy.

Following on from this, virtual screening is a beautiful and potentially highly fruitful technique that will benefit from advances in CM. As more biological target information becomes available from crystal structures, WGS, and predicted enzyme structure, virtual screening will be able to rapidly and sensitively study small molecule-active site interactions. This will be possible under a variety of virtual conditions and even enable us to study allosteric modulation of enzymes by mixtures, with the need for enormous mathematical and computational power that will surely become available.

This is truly a beautiful contribution to the natural product discipline and the Authors and Editors are to be highly congratulated for their care, coverage, and vision for Computational Phytochemistry. I recommend this text to you without reservation and believe it will benefit young scientists starting their natural product research careers and older scientists like myself who wish to become current in this beautiful field.

Simon Gibbons
UCL School of Pharmacy, London, United Kingdom

Preface

Phytochemistry is the division of chemistry that deals with the chemistry of plants and especially incorporates the biosynthesis, extraction, isolation, structure determination, and bioactivities of various classes of plant secondary metabolites. During the past two decades, with the notable advancements in computational methods, phytochemical research has embraced and applied various computational aids and mathematical models to address several aspects of phytochemical research, e.g. extraction, isolation, structure determination, structure prediction, metabolomics, chemical fingerprinting, biosynthesis, dereplication, phytochemical library construction, and bioactivity testing of phytochemicals. While there are several papers published related to this area, where computational methods have been applied, there is no book available to date to capture exclusively this new development in the area of phytochemistry. Clearly, there is an obvious gap and the book Computational Phytochemistry will certainly fill in that gap.

Computational Phytochemistry comprises 11 chapters, contributed by experts in relevant areas. The first chapter provides an introduction to the title topic ‘computational phytochemistry’ and sets the scene for subsequent chapters covering various other areas.
There are chapters on: prediction of medicinal properties using mathematical models and computation, and selection of plant materials; optimization of extraction using mathematical models and computation; application of computational methods in isolation of plant secondary metabolites; application of computation in building dereplicated phytochemical libraries; high-throughput screening of phytochemicals; prediction of structure based on spectral data using computational techniques; application of mathematical models and computation in plant metabolomics; application of computation in the study of biosynthesis of phytochemicals; computational aids for assessing bioactivities; and virtual screening.

This book offers a comprehensive account of the computational aspects of phytochemical research, incorporating protocols for various computational and mathematical approaches. All chapters are presented with several appropriate figures, tables, and diagrams. The principal target audience of this book is researchers, both experienced and inexperienced, who work in the area of phytochemistry and apply various computational methods in their work. The secondary readership includes academicians who teach phytochemistry, pharmacognosy, and related topics in various undergraduate and postgraduate programmes, and the students in those programmes of study. We hope this book will become an indispensable reference for computational phytochemical research.

Satyajit D. Sarker, Lutfun Nahar

Chapter 1

An Introduction to Computational Phytochemistry
Satyajit D. Sarker, Lutfun Nahar
Liverpool John Moores University, Liverpool, United Kingdom

Chapter Outline
1.1. Introduction
1.2. Computational Phytochemistry
1.3. Techniques, Theories, and Applications of Computational Phytochemistry
  1.3.1 Kohonen-Based Self-Organizing Map
  1.3.2 Density Functional Theory
  1.3.3 Docking Experiments and Virtual Screening (In Silico Screening)
  1.3.4 Structure Prediction and Structure Determination
  1.3.5 Chemometrics and Principal Component Analysis
  1.3.6 Data Mining and Databases
  1.3.7 Response Surface Methodology in Optimization of Extraction of Phytochemicals
  1.3.8 Computation in Isolation of Phytochemicals
  1.3.9 Miscellaneous
1.4. Conclusions
References

Computational Phytochemistry. https://doi.org/10.1016/B978-0-12-812364-5.00001-8 © 2018 Elsevier Inc. All rights reserved.

1.1. INTRODUCTION

Computation is simply the act or process of computing, and the term ‘computational’ refers to any act or process relating to computation. Various degrees of computation are present all around us, and in almost everything we do. In fact, computation is no longer about machines, but about the contributions of these machines to our lives in a globalized and interconnected world. The tremendous advancement in computer science and its wide-ranging applications have influenced the way we carry out research (Sarker and Nahar, 2017). The computer has become an indispensable tool in research and development, as it is closely associated with analytical instrumentation and methods.


It also serves as a tool for acquiring data, for word processing, for handling electronic databases, and for overall laboratory management and communication. Chemistry has already incorporated computational techniques and mathematical modelling to address research questions and to develop new methods, which eventually gave birth to a recognized branch of chemistry known as ‘Computational Chemistry’. Put simply, Computational Chemistry can be defined as a branch of chemistry that uses computer simulations to assist in solving chemical problems. It applies many methods of theoretical chemistry, embedded in efficient computer programmes, mainly to calculate the structures and properties of molecules and solids, e.g., electronic structure determination, geometry optimizations, frequency calculations, transition structures, docking, electron and charge distribution, rate constants, and many more (Sarker and Nahar, 2017). Gaussian 94, GAMESS, MOPAC, Spartan, and Sybyl are just a few of the popular software packages used in Computational Chemistry. Phenomenal technological progress has led to massive growth in the amount of chemical data, which are typically multivariate and tangled in structure (Bushkov et al., 2016). Therefore, several computational approaches have predominantly addressed dimensionality reduction and easy representation of multi-dimensional datasets to establish the relationships between the observed activity and calculated parameters, commonly known as molecular descriptors. A molecular descriptor is the final result of a logical and mathematical procedure that transforms chemical information encoded within a symbolic representation of a molecule into a useful number or the result of some standardized experiment (Todeschini and Consonni, 2000).
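To make the idea of a molecular descriptor concrete, the following minimal Python sketch turns a symbolic representation (a molecular formula string) into a few numbers. It is an illustration only and is not taken from the chapter: the function names, the choice of descriptors, and the short table of rounded atomic masses are assumptions made for this example, and real descriptor software computes far richer quantities from full structures.

```python
import re

# Rounded standard atomic masses (g/mol) for a few common elements.
# The table is deliberately short; it is an assumption of this sketch.
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "S": 32.06}


def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'C15H10O7'.

    Brackets, charges, and isotopes are not handled in this sketch.
    """
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def constitutional_descriptors(formula):
    """Map a formula string to a few simple constitutional descriptors."""
    counts = parse_formula(formula)
    mass = sum(ATOMIC_MASS[element] * n for element, n in counts.items())
    heavy_atoms = sum(n for element, n in counts.items() if element != "H")
    return {
        "molecular_weight": round(mass, 3),
        "heavy_atoms": heavy_atoms,
        "oxygen_count": counts.get("O", 0),
    }


# Quercetin, a common plant flavonoid, has the molecular formula C15H10O7.
quercetin = constitutional_descriptors("C15H10O7")
print(quercetin)
```

Each value in the returned dictionary is a descriptor in the sense defined above: a number obtained by a fixed procedure from a symbolic representation of the molecule, which can then be related to an observed activity.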
This chapter presents an overview on Computational Phytochemistry, an emerging area of research, where computational tools, techniques and methods, artificial intelligence, and mathematical modelling are incorporated to address various issues in phytochemistry and phytochemical research.

1.2. COMPUTATIONAL PHYTOCHEMISTRY

Over the last few decades, a noticeable increase has been observed in the incorporation of computational techniques, artificial intelligence, and mathematical modelling in phytochemical research, especially in screening plant materials, plant metabolomics, chemical fingerprinting, chemical taxonomy, biosynthetic and phylogenetic studies, prediction of pharmacological and toxicological properties (virtual screening or in silico studies), and automated structure determination of phytochemicals based on spectroscopic data (Sarker and Nahar, 2017). Some of these aspects initially formed ‘Phytochemical Informatics’, which dealt with large amounts of data related to phytochemicals and/or their sources (Ehrman et al., 2010), and this was probably the starting point of a new avenue in phytochemical research, now known as ‘Computational Phytochemistry’.

Computational Phytochemistry may be defined as an emerging branch of phytochemistry, where computational techniques and mathematical and statistical models are used to deal efficiently with various aspects of phytochemical research. Computational Phytochemistry, a product of the digital age, uses mathematical algorithms, statistics, and large databases to integrate theories and modelling with experimental observations. Creation of models and simulations of physical processes involved in phytochemical protocols, and application of statistics and data analysis techniques to extract useful information from large bodies of data, are two fundamental building blocks of Computational Phytochemistry. In most cases, the introduction of computer-aided approaches saves time and money associated with phytochemical research, ranging from bioactive compound discovery to identifying the metabolomes (Sarker and Nahar, 2017). The overall impact of computational methods on phytochemical research is already visible in recent publications, and over the coming years this will steadily transform the way we perform phytochemical research today. Several articles have been published on the use of computational approaches to solve a number of issues in phytochemical research (Nuzillard and Massiot, 1991; Stortz and Cerezo, 1992; Sumner et al., 2003; Rollinger et al., 2005; Cape et al., 2006; Desai and Gore, 2011; Jeeshna and Paulsamy, 2011; Barlow et al., 2012; Castellano et al., 2014; Ningthoujam et al., 2014; Das et al., 2017; Mocan et al., 2017), and relevant theories, useful methodologies and techniques have been presented there.

1.3. TECHNIQUES, THEORIES, AND APPLICATIONS OF COMPUTATIONAL PHYTOCHEMISTRY

1.3.1 Kohonen-Based Self-Organizing Map

Bushkov et al. (2016) utilized an artificial neural network incorporating the Kohonen-based self-organizing map (SOM) to study plant growth regulators, and it was the first example of large-scale modelling in the field of agrochemistry. The Kohonen-based SOM was first introduced by the Finnish professor Teuvo Kohonen in the 1980s, and since then it has been applied in many fields, especially in those that handle high-dimensional data sets. The main objective of a SOM is to transform an incoming signal pattern of arbitrary dimension into a one- or two-dimensional discrete map and to perform this transformation adaptively in a topologically ordered fashion.

Any SOM process has four major components: initialization, competition, cooperation, and adaptation. In initialization, all connection weights are initialized with small random values. In competition, for each input pattern, the neurons compute their respective values of a discriminant function that provides the basis for competition, and the neuron with the smallest value of the discriminant function wins. In cooperation, the winning neuron determines the spatial location of a topological neighbourhood of excited neurons. In adaptation, the excited neurons decrease their individual values of the discriminant function in relation to the input pattern through suitable adjustment of the associated connection weights, so that the response of the winning neuron to the subsequent application of a similar input pattern is enhanced. At this stage, the outputs become self-organized and the feature map between inputs and outputs is formed.

SOM is one of the most popular neural network models, and it belongs to the category of competitive learning networks. Competitive learning is usually applied to a single-layer topology, but formulations using multi-layer topologies exist, where independent competition on each layer is employed. Competitive learning is unsupervised learning, and competition is by itself a non-linear process, which is difficult to treat mathematically. Because SOM is based on unsupervised learning, it does not require human intervention during learning, and little needs to be known about the characteristics of the input data. Unsupervised machine learning refers to machine-learning algorithms used to draw inferences from datasets consisting of input data without labelled responses. As the examples provided to the learner are unlabelled, there is no assessment of the accuracy of the structure that is output by the relevant algorithm, which distinguishes unsupervised learning from supervised learning and reinforcement learning. The most common unsupervised learning method is ‘cluster analysis’, which is applied in exploratory data analysis to find hidden patterns or groupings in data. Unsupervised learning methods are particularly useful in bioinformatics for sequence analysis and genetic clustering, in data mining for sequence and pattern mining, in medical imaging for image segmentation, and in computer vision for object recognition. SOM operates in two modes, training and mapping, and is useful for visualizing low-dimensional views of high-dimensional data, akin to multi-dimensional scaling.
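The four components described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in any of the cited studies: the squared Euclidean distance as the discriminant function, the Gaussian neighbourhood, the exponential decay schedule, the grid size, and the two synthetic clusters of three-dimensional points are all choices made for this example.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance, used here as the discriminant function."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def best_matching_unit(weights, x):
    """Competition: the neuron with the smallest discriminant value wins."""
    return min(weights, key=lambda node: sq_dist(weights[node], x))


def train_som(data, rows=4, cols=4, epochs=200, lr0=0.5, sigma0=2.0, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # Initialization: small random connection weights for every map neuron.
    weights = {(r, c): [rng.uniform(0.0, 0.1) for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        # Learning rate and neighbourhood width both shrink over time.
        decay = math.exp(-epoch / epochs)
        lr, sigma = lr0 * decay, sigma0 * decay
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            winner = best_matching_unit(weights, x)
            for node, w in weights.items():
                # Cooperation: Gaussian neighbourhood centred on the winner,
                # measured on the two-dimensional map grid.
                grid_d2 = (node[0] - winner[0]) ** 2 + (node[1] - winner[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                # Adaptation: move the weights of excited neurons toward x.
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights


# Synthetic demonstration data: two well-separated clusters of 3-D points.
rng = random.Random(1)
cluster_a = [[rng.gauss(0.1, 0.03) for _ in range(3)] for _ in range(10)]
cluster_b = [[rng.gauss(0.9, 0.03) for _ in range(3)] for _ in range(10)]
som = train_som(cluster_a + cluster_b)

# Mapping mode: each new pattern is assigned to its winning neuron.
bmu_a = best_matching_unit(som, [0.1, 0.1, 0.1])
bmu_b = best_matching_unit(som, [0.9, 0.9, 0.9])
print(bmu_a, bmu_b)
```

After training, patterns from the two clusters are mapped to different positions on the grid, which is the sense in which the map gives a low-dimensional view of the input data. In practice the inputs would be descriptor vectors for compounds rather than synthetic points.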
While SOM can be used for the clustering of genes in the medical field, the study of multi-media and web-based contents, and in the transportation industry, it has also been applied to phytochemical research (Ehrman et al., 2007a, b; Emerenciano et al., 2007; Scotti et al., 2012). Bushkov et al. (2016) used the Kohonen-based SOM technique to analyse the experimental data available in patents, scientific publications, and specific databases for various agro-chemicals. They investigated whether the developed in silico model could be applied to predict the agro-chemical activity of small molecules and to offer insights into the distinctive features of different agro-chemical categories. The preliminary external validation with several plant growth regulators showed a relatively high predictive power (67%) for the constructed model. A similar SOM, used as a data visualization technique, was applied in chemotaxonomic studies of the Asteraceae, in which flavonoid data were used to classify tribes of this family of flowering plants (Emerenciano et al., 2007). Flavonoids are regarded as chemotaxonomic markers for the Asteraceae, and the occurrences (4700) and oxidation patterns of about 800 flavonoids in the family were considered in an expert system developed for taxonomic purposes. SOM was applied to establish phylogenetic relationships among the subfamilies and tribes of the Asteraceae. The unsupervised training was performed by using


the second version of the SOM Toolbox for the Matlab computing environment (MathWorks, Inc.). The toolbox contains functions for the creation, visualization, and analysis of SOMs. Later, the same group reported SOMs of molecular descriptors for sesquiterpene lactones (1111 in total, from 658 species, 63 subtribes, and 15 tribes) and their application to the chemotaxonomy of the Asteraceae (Scotti et al., 2012). All of these sesquiterpene lactones were represented and registered in two dimensions in the in-house software system SISTEMATX and were associated with their botanical sources. Several classes of descriptors, namely constitutional, functional groups, atom-centred, 2D autocorrelations, topological, geometrical, BCUT (Burden Eigenvalues), RDF (Radial Distribution Function), 3D-MoRSE (3D-Molecule Representation of Structures based on Electron diffraction), GETAWAY (Geometry, Topology, and Atom-Weights Assembly), and WHIM (Weighted Holistic Invariant Molecular), were used as input data to separate the botanical occurrences through SOM, again with the SOM Toolbox 2.0. A similar approach had previously been used by da Costa et al. (2005), who studied the sesquiterpene lactone-based classification (chemosystematics) of three Asteraceae tribes using self-organizing neural networks. Random Forest (RF) and SOM were successfully applied to establish distribution patterns of 8411 compounds from 240 Chinese herbs in relation to the herbal categories of traditional Chinese medicine (TCM) (Ehrman et al., 2007a, b). RF is a meta-estimator that fits several decision tree classifiers on various sub-samples of the data set and averages their predictions to improve predictive accuracy and control over-fitting. In the usual bootstrap formulation, each sub-sample is drawn with replacement and has the same size as the original input sample.
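The bagging idea behind RF (bootstrap sub-samples of the same size as the input, one classifier per sub-sample, predictions aggregated across classifiers) can be illustrated with a deliberately small sketch. It uses one-feature decision stumps and majority voting rather than full decision trees with random feature selection, so it is a simplification of RF and not the procedure used by Ehrman et al.; all function names and data values are hypothetical.

```python
import random
from collections import Counter


def bootstrap(rows, rng):
    """Draw a sub-sample with replacement, the same size as the input."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows):
    """Fit a threshold classifier (t, lo, hi): predict hi when x > t, else lo."""
    best = None
    for t in sorted({x for x, _ in rows}):
        left = [y for x, y in rows if x <= t]
        right = [y for x, y in rows if x > t]
        if not left or not right:
            continue
        lo = Counter(left).most_common(1)[0][0]
        hi = Counter(right).most_common(1)[0][0]
        errors = sum(y != lo for y in left) + sum(y != hi for y in right)
        if best is None or errors < best[0]:
            best = (errors, t, lo, hi)
    if best is None:  # the sub-sample contained a single distinct x value
        label = Counter(y for _, y in rows).most_common(1)[0][0]
        return (float("inf"), label, label)
    return best[1:]


def fit_forest(rows, n_trees=25, seed=0):
    """Fit one stump per bootstrap sub-sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap(rows, rng)) for _ in range(n_trees)]


def predict(forest, x):
    """Aggregate the individual stumps by majority vote."""
    votes = [hi if x > t else lo for t, lo, hi in forest]
    return Counter(votes).most_common(1)[0][0]


# Made-up data: one descriptor value per compound and a category label.
rows = [(0.1, "A"), (0.2, "A"), (0.3, "A"), (0.4, "A"),
        (0.6, "B"), (0.7, "B"), (0.8, "B"), (0.9, "B")]
forest = fit_forest(rows)
print(predict(forest, 0.15), predict(forest, 0.85))
```

Individual stumps trained on unlucky sub-samples can be wrong, but the vote across many of them is stable, which is the sense in which aggregation controls over-fitting.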
RF was used to construct TCM profiles of all compounds, describing their affinities for 28 major herbal categories, while minimizing the level of noise associated with the complex chemical diversity that exists in herbs from each category (Ehrman et al., 2007a). The profiles were then reduced and visualized with SOM. The data used for this analysis were obtained from a database of Chinese herbal constituents. Earlier, the same group (Ehrman et al., 2007b) had built two databases to enhance applications of chemoinformatics and molecular modelling to medicinal plants, especially Chinese herbs. The first database held data on known chemical constituents of 240 commonly used Chinese herbs, and the second held information on target specificities of bioactive phytochemicals. In the Chinese herbal constituents’ database, further details such as trivial and systematic names, compound class and skeletal type, botanical and Chinese names of the associated herb(s), CAS registry number, chirality, pharmacological and toxicological information, and chemical references were incorporated. For the bioactive plant compounds database, details of molecular target(s), IC50 and related measures, and associated botanical species were added. For Chinese herbs, they listed approximately 7000 unique compounds. Applying a similar approach, Kim et al. (2015) constructed a database of medicinal materials and chemical compounds in Northeast Asian traditional medicine (TM-MC), for which medicinal materials were listed in the Korean, Chinese, and Japanese pharmacopoeias and


information on the compound names of medicinal materials could easily be confirmed online. An appropriate web interface was provided for this database to enhance its capabilities.

1.3.2  Density Functional Theory

Density functional theory (DFT) is a computational quantum mechanical modelling method that links the electron density to the wave function; its foundations were laid by Hohenberg and Kohn in 1964 (Fukui et al., 1952; Sholl and Steckel, 2009; Mendoza-Huizar and Rios-Reyes, 2011; Gopalakrishnan et al., 2014). It is one of the most widely used methods for ‘ab initio’ calculations of the structure of atoms, molecules, crystals, surfaces, and their interactions. DFT has long been used to predict properties such as ionization potentials, electron affinities, orbital energies, and molecular structures. It is currently probably the most widely applicable approach for computing the electronic structure of matter, and its applications range from atoms, molecules, and solids to nuclei and quantum and classical fluids. It helps predict molecular properties, e.g., molecular structures, vibrational frequencies, atomization energies, ionization energies, electric and magnetic properties, and reaction paths. A computational study on flavonoids isolated from Trifolium resupinatum, linking their hepatoprotective activity to their antioxidant properties, was carried out using DFT calculations with the Gaussian 09 package (Kamel et al., 2016). The geometries of the neutral flavonoids, radicals, anions, and cations were optimized without constraints using the B3LYP exchange-correlation functional and the 6-311G(d,p) basis set. Frequency calculations were performed at the same level to characterize the stationary points, to obtain zero-point energies, and to confirm that the ground states had no imaginary frequencies. Subsequently, single-point energy calculations were carried out with the 6-311G(d,p) basis set in the gas phase (ε = 1) and in water (ε = 78.4). For the solvation effect, the self-consistent reaction field method, incorporating the polarizable continuum model, was used.
The hydration enthalpies of the hydrogen atom (H•), proton (H+), and electron (e−) were taken from a previous study. As the antioxidant activity of T. resupinatum is generally attributed to the radical-scavenging activity of flavonoids, and H-abstraction from these compounds is thought to be responsible for this activity, three mechanisms were put forward: hydrogen atom transfer (HAT), single electron transfer followed by proton transfer (SET-PT), and sequential proton loss electron transfer (SPLET). These mechanisms could co-exist, and the thermodynamic balance among their individual steps might explain the antioxidant activity. The HAT mechanism is a homolytic breakage of the OH bond, characterized by the bond dissociation enthalpy, BDE (ΔH3). The SPLET mechanism is divided into two steps: a heterolytic dissociation of the OH bond (the proton affinity of the flavonoid [ΔH1]) followed by an ionization of the anion (ΔH4). The SET-PT mechanism is an ionization of the flavonoid (ΔH2) followed by a departure of the proton (ΔH5). The net results of the SET-PT and SPLET mechanisms are the same as that of HAT. The SET-PT


and SPLET mechanisms are preferred in polar media due to charge separation and are favourable for radicals with a higher electron affinity. According to these mechanisms, the structure–antioxidant-activity relationship of the isolated flavonoids could be assessed by evaluating ΔH1, ΔH2, ΔH3, ΔH4, and ΔH5 (Kamel et al., 2016). Al-Sehemi et al. (2016) applied DFT at the B3LYP/6-31G* level to optimize eight rotamers of 3′-methyl-quercetin, and the molecular structure and molecular properties of the most stable rotamers were studied at the same level of theory. Descriptors such as electronegativity (χ), hardness (η), electrophilicity (ω), softness (S), and electrophilicity index (ωi) were computed by the DFT approach. They computed the absorption spectrum by time-dependent density functional theory at the TD-B3LYP/6-31G* level of theory, explained the radical-scavenging activity on the basis of the bond dissociation enthalpy and the adiabatic ionization potential, and established two possible mechanisms: hydrogen atom transfer and one-electron transfer. DFT was also used to understand possible physicochemical and biological properties of constituents of the fruits of Cucumis trigonus Roxb. and C. sativus Linn. (Gopalakrishnan et al., 2014). The Gaussian 09W programme and the GaussView molecular visualization package were used on a personal computer. DFT calculations on the phytochemicals identified from these plants by GC-MS were carried out with the B3LYP functional and the 6-311++G(d,p) basis set. The optimized geometries of the identified compounds were assessed. Physicochemical properties, e.g., highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) energies, ionization potential, electron affinity, electronegativity, electrochemical potential, hardness, softness, electrophilicity, total energy, and dipole moment, were recorded.
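Several of the descriptors listed above are commonly derived from the frontier orbital energies alone. The sketch below shows one widely used set of definitions, assuming Koopmans-type approximations (ionization potential ≈ −E(HOMO), electron affinity ≈ −E(LUMO)) and the convention η = (IP − EA)/2 and S = 1/(2η). Other conventions exist (for example η = IP − EA), and the cited studies may have used different ones; the function name and the orbital energies are hypothetical.

```python
def reactivity_descriptors(e_homo, e_lumo):
    """Global reactivity descriptors (eV) from frontier orbital energies (eV)."""
    ionization_potential = -e_homo   # Koopmans-type approximation (assumed)
    electron_affinity = -e_lumo      # Koopmans-type approximation (assumed)
    electronegativity = (ionization_potential + electron_affinity) / 2.0
    hardness = (ionization_potential - electron_affinity) / 2.0
    chemical_potential = -electronegativity
    return {
        "homo_lumo_gap": e_lumo - e_homo,
        "ionization_potential": ionization_potential,
        "electron_affinity": electron_affinity,
        "electronegativity": electronegativity,
        "hardness": hardness,
        "softness": 1.0 / (2.0 * hardness),  # units of 1/eV
        "electrophilicity": chemical_potential ** 2 / (2.0 * hardness),
    }


# Hypothetical orbital energies in eV, not taken from any cited study.
d = reactivity_descriptors(e_homo=-6.0, e_lumo=-2.0)
for name, value in d.items():
    print(f"{name}: {value:.3f}")
```

Under these definitions a small HOMO–LUMO gap corresponds to a low hardness and a high softness, which is why the gap and the softness tend to be discussed together.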
Glycodeoxycholic acid and 2-(2-methylcyclohexylidene)-hydrazinecarboxamide were identified as the most promising drug candidates on the basis of their HOMO–LUMO energy gap and softness. Pistagremic acid (Fig. 1.1), a major component of the galls of Pistacia integerrima, is well known for its α-glucosidase and β-secretase enzyme inhibitory properties.

FIG. 1.1  Structure of diospyrin and pistagremic acid.


The geometric and electronic properties of this compound, e.g., ionization potential, electron affinities, and coefficients of the HOMO and LUMO, were studied by simulation at the B3LYP/6-31G(d,p) level of DFT (Ullah et al., 2014). Fazl-i-Sattar et al. (2015) conducted a similar study with diospyrin, isolated from Diospyros species (Fig. 1.1), using a suitable level of theory and correlating the experimental and theoretical data. The hybrid DFT method at the B3LYP/6-31G(d,p) level of theory was applied to obtain the electronic, spectroscopic, inter-molecular interaction, and thermodynamic properties of this compound. Its structure was confirmed by the agreement between theory and experiment. In both studies, all calculations were performed with the Gaussian 09 suite of programmes in the gas phase, except for the UV–vis calculations (chloroform medium). The results were visualized with GaussView (http://www.ch.cam.ac.uk/computing/software/gaussview) and Gabedit (https://sourceforge.net/projects/gabedit/) software. DFT approaches helped deduce the relative configuration of the flavonoid (+)-tephrodin (Fig. 1.2), isolated from Tephrosia species (Muiva-Mutisya et al., 2014). All geometries of this compound were optimized without any restrictions using Becke’s three-parameter functional (B3LYP) and the split-valence triple-zeta basis set 6-311G**, which includes polarization functions. The Gaussian 09 programme package was used, and the 3D structures were handled by the SYBYL 7.3 (2007) molecular modelling software (http://softadvice.informer.com/Sybyl_7.3.html). There are several other examples of configurational studies of various phytochemicals using DFT approaches (Li et al., 2011a; Zhao et al., 2011; Ebrahimi et al., 2013; Varmaghani et al., 2014; Farooq et al., 2015; Azevedo et al., 2016; Kamel et al., 2016). Wang et al.
(2012) applied DFT (B3LYP method) to study the interaction between thymine and luteolin by optimizing the geometries of luteolin, thymine, and luteolin-thymine complexes with the 6-31+G* basis set. The vibrational frequencies were analysed at the same level to study the complexes. This study revealed that strong hydrogen-bonding interactions were present in the luteolin-thymine complexes. A DFT-aided Raman spectroscopic analysis of botryococcene hydrocarbons from the green microalga Botryococcus braunii was reported, where DFT computations utilized the Gaussian 03 package to obtain

FIG. 1.2  Structure of (+)-tephrodin.


the calculated vibrational frequencies and to produce the computed Raman spectra; Raman spectroscopy is commonly used in chemistry to provide a structural fingerprint by which molecules can be identified. The B3LYP functional with the cc-pVTZ basis set was used. The computed spectra were produced using the GaussView 4.1.2 programme (http://downloads.informer.com/gaussview/4.1/). DFT calculations and vibrational analysis have recently been reported for smeathxanthone A (Fig. 1.3), using the M06 functional with the simple 6-311G basis set; the HOMO–LUMO energy gap and optimized geometry parameters were also computed (Lontsi and Alembert, 2017). The antioxidant potential of curcumin-related compounds was investigated by chemiluminescence kinetics, chain-breaking and radical-scavenging activity, and DFT calculations at the UB3LYP/6-31+G(d,p) level; the latter were applied to explain the structure–activity relationships (Slavova-Kazakova et al., 2015). A similar study using DFT with the B3LYP and BHandHLYP functionals was reported by Li et al. (2011b) for lespedezavirgatol, lespedezavirgatal, and lespedezacoumestan (Fig. 1.4) from Lespedeza virgata. Cuca-Suarez et al. (2013) performed molecular modelling of cadinane sesquiterpenes isolated from Nectandra amazonum, where a DFT study at the B3LYP level was separately performed on two new sesquiterpenes, rel-(4S,6S)-cadina-1(10),7(11)-diene and rel-(1R,4S,6S,10S)-cadina-7(11)-en-10-ol, starting from a NOESY-derived configuration in order

FIG. 1.3  Smeathxanthone A from Garcinia smeathmannii.

FIG. 1.4  Antioxidants from Lespedeza virgata: lespedezavirgatol, lespedezavirgatal, and lespedezacoumestan.


to support the assignments, given the conformation-dependent variability of sesquiterpenes. Boltzmann population analyses (based on the Boltzmann distribution) of the optimized structures of the new compounds revealed that the lowest-energy, most stable conformers exhibited a chair-conformed B-ring, which supported the NOESY correlations. Note that a Boltzmann distribution is a probability distribution of the particles in a system over its various possible states, in which the probability of a state depends on its energy and on the temperature.
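A Boltzmann population analysis of this kind reduces to a short calculation: each conformer i with relative energy E_i receives a weight exp(−E_i/RT), and the weights are normalized to sum to one. The sketch below assumes relative energies in kcal/mol and a temperature of 298.15 K; the energies, the per-conformer property values, and the function names are made up for illustration.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol K)


def boltzmann_populations(rel_energies_kcal, temperature=298.15):
    """Fractional population of each conformer from its relative energy."""
    factors = [math.exp(-e / (R_KCAL * temperature)) for e in rel_energies_kcal]
    total = sum(factors)
    return [f / total for f in factors]


def boltzmann_average(values, populations):
    """Population-weighted average of a per-conformer property."""
    return sum(v * p for v, p in zip(values, populations))


# Hypothetical relative energies of three conformers (kcal/mol).
energies = [0.0, 0.5, 1.5]
pops = boltzmann_populations(energies)
for e, p in zip(energies, pops):
    print(f"dE = {e:.1f} kcal/mol  population = {p:.3f}")

# Hypothetical per-conformer values of some computed property.
print(boltzmann_average([10.0, 12.0, 15.0], pops))
```

The same weights can be used to average any conformer-dependent computed property, so that the result reflects the conformers that are actually populated at the chosen temperature.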

1.3.3  Docking Experiments and Virtual Screening (In Silico Screening)

Recent remarkable advances in computational and structural bioinformatics offer ways to study the mechanism of action of various compounds, including bioactive phytochemicals. Many in silico techniques, including computational modelling, virtual screening (including virtual HTS or vHTS), and docking studies, are now routinely performed to evaluate the potential of lead molecules of either synthetic or natural origin. These computer-aided studies constitute the main territory of Computational Phytochemistry, and publications in these areas involving phytochemicals have increased sharply over the last decade or so (see Chapters 5, 10 and 11). Virtual High-Throughput Screening (vHTS) is an established methodology for identifying drug candidates from large compound libraries and requires careful implementation of each phase of the computational screening experiment, from target preparation to hit identification and lead optimization (Subramaniam et al., 2008). In molecular modelling, the term ‘docking’ is defined as a method that predicts the preferred orientation of one molecule relative to a second when the two are bound to each other to form a stable complex (Lengauer and Rarey, 1996). Knowledge of the preferred orientation helps predict the strength of association, or binding affinity, between two molecules using scoring functions, which are fast approximate mathematical methods for estimating the strength of the non-covalent interaction between two molecules after docking. Docking is the process by which two molecules fit together in three-dimensional (3D) space, and molecular docking is a tool in structural molecular biology, computer-assisted drug design, and bioinformatics. Put simply, docking is the computational simulation of a candidate ligand binding to a receptor.
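How scoring functions are used in practice can be shown with a small sketch. Some docking programs report scores in kcal/mol that are meant as estimates of the binding free energy, where a more negative value indicates stronger predicted binding. Under that assumption, ligands can be ranked by score, and a score can be translated into a rough dissociation constant through the thermodynamic relation ΔG = RT ln Kd. The ligand names and score values below are invented, and reading a docking score as a true free energy is itself a coarse approximation.

```python
import math

R_KCAL = 0.0019872041  # gas constant in kcal/(mol K)


def estimated_kd(delta_g_kcal, temperature=298.15):
    """Dissociation constant (mol/L) implied by dG = RT ln(Kd)."""
    return math.exp(delta_g_kcal / (R_KCAL * temperature))


def rank_hits(scores, top_n=3):
    """Return the top_n (ligand, score) pairs, most negative score first."""
    return sorted(scores.items(), key=lambda kv: kv[1])[:top_n]


# Hypothetical docking scores in kcal/mol for a small virtual library.
scores = {"ligand_A": -9.2, "ligand_B": -6.1, "ligand_C": -7.8, "ligand_D": -5.0}

hits = rank_hits(scores)
for name, score in hits:
    print(f"{name}: score = {score:.1f} kcal/mol, Kd ~ {estimated_kd(score):.2e} M")
```

Because of the exponential relation, a difference of about 1.4 kcal/mol at room temperature corresponds to roughly a tenfold change in the estimated Kd, which is comparable to the typical uncertainty of scoring functions. Rankings are therefore usually treated as a prioritization aid rather than as quantitative predictions.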
Docking studies on molecules, synthetic or natural, that reveal various degrees of binding affinity towards receptors (enzymes) often provide an insight into the prediction of bioactivity. Docking allows virtual screening of a database of compounds and prediction of the strongest binders on the basis of their scores. It explores the ways in which two molecules, a ligand and a receptor (enzyme), fit together and dock with each other. Molecules that bind to a receptor may inhibit its function and can thus act as potential drugs. In recent years, numerous docking studies involving phytochemicals and targeting particular types of bioactivity have been reported. Just a few selected examples of in silico analysis and molecular docking studies are outlined below.


Muhammad and Fatima (2015) analysed the angiotensin-converting enzyme (peptidyl-dipeptidase A) inhibitory action of quercetin glycosides (isolated from buckwheat and onions) by computational docking studies, using the AutoDock Vina option in PyRx, based on scoring functions. A molecular docking study was performed on triterpenoid phytochemicals, e.g., boswellic acid and ursolic acid (Fig. 1.5), to predict potential anticancer properties (Sanghani et al., 2012). In this study, the 3D structures of these two terpenoids were drawn with ChemDraw software. The docking analysis of these compounds with the human cyclin-dependent kinase 2 receptor was conducted using ArgusLab docking software. Boswellic acid and ursolic acid were docked against the same receptor using the default parameters in ArgusLab (www.arguslab.com), an electronic structure programme based on quantum mechanics. This freely available molecular modelling package, which runs under Windows, can predict potential energies and molecular structures, optimize geometries, and calculate vibrational frequencies, atomic coordinates, bond lengths, bond angles, and reaction pathways. Mocan et al. (2017) have reported the functional constituents of Lycium barbarum and their biological profile. They incorporated molecular modelling studies on chlorogenic acid (Fig. 1.6), one of the major bioactive constituents of the leaves of this plant, to confirm its role in the inhibition of the tested enzymes: butyrylcholinesterase, α-amylase, α-glucosidase, and tyrosinase. Docking experiments with chlorogenic acid were performed with those enzymes applying the Gold suite 6 software (http://www.ch.cam.ac.uk/computing/software/gold-suite) (Verdonk et al., 2003), using the scoring function ChemScore,

FIG. 1.5  Structure of boswellic acid and ursolic acid.

FIG. 1.6  Chlorogenic acid.


which is designed to work with metalloenzymes, as in the case of tyrosinase. The receptors were prepared with the software Maestro 11.0 (free academic licence; https://www.schrodinger.com/maestro), protonated at neutral pH, and all errors in the crystal structures were corrected manually. The docking grid was automatically calculated by Gold, centred on the crystallographic ligand (Mocan et al., 2017). To obtain additional information about the time-dependent behaviour of the chlorogenic acid-tyrosinase complex found in the docking experiments, a molecular dynamics experiment was carried out using the software Amber 14 (http://ambermd.org/) (Case et al., 2015). Yan et al. (2016) isolated 16-nor limonoids (Fig. 1.7) from Harrisonia perforata as selective 11β-HSD1 inhibitors. While the inhibitory activity was determined by the scintillation proximity assay (Glickman et al., 2008; Harder and Fotiadis, 2012), the structure–activity relationship study was performed by molecular docking simulation using co-crystal structures of the 11β-HSD1 enzyme (4K1L for human). AutoDock 4 (http://autodock.scripps.edu/) was used to quantify the parameters that are essential for high-affinity ligand binding. LigPlot+ was used for further analysis of the complex between the compound and the enzyme. In silico screening of phytochemicals targeting childhood absence epilepsy was reported by Sabeega-Begum et al. (2014). In that study, the sequences of three subunits (alpha 1, beta 2, and gamma 2) of the GABAA (γ-aminobutyric acid) receptor, with UniProtKB accession numbers P14867, P47870, and P18507, were obtained from NCBI (http://www.ncbi.nlm.nih.gov/) in FASTA format. FASTA is a text-based format for representing either nucleotide or peptide sequences, in which nucleotides or amino acids are represented by single-letter codes. A sequence in FASTA format begins with a single-line description, which starts with the ‘>’ character, followed by lines of sequence data.
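The FASTA layout just described is simple enough to parse by hand. The sketch below reads FASTA text into (description, sequence) pairs; it is a minimal reader for illustration, and the record names and sequences in the example are invented rather than taken from the entries cited above.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>example_1 hypothetical peptide
MKTAYIAKQR
QISFVKSHFS
>example_2 hypothetical nucleotide fragment
ATGGCGTACG
"""

records = parse_fasta(example)
for description, sequence in records:
    print(description, len(sequence))
```

For real work, an established parser such as the one in Biopython is usually preferable, since it handles the many variants of description lines found in public databases.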
The physicochemical characteristics of the normal and diseased gamma 2 subunit sequences were studied using the ProtParam tool (http://web.expasy.org/protparam/). The main secondary structure elements were calculated by the protein secondary structure prediction tool GOR IV (http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_gor4.html). Tertiary structure prediction

FIG. 1.7  Harperspinoids A and B from the aerial parts of Harrisonia perforata.


of the normal and mutated (R82Q) gamma 2 subunit was carried out with the help of the automated modelling server I-TASSER (http://zhanglab.ccmb.med.umich.edu/I-TASSER/); the model was validated by the SAVeS server (http://nihserver.mbi.ucla.edu/SAVES/). The pentamer model was built using the PatchDock Server (http://bioinfo3d.cs.tau.ac.il/PatchDock/). Molecular visualization tools were used to examine the structural changes in both normal and mutated subunits of the GABAA receptor models. Suitable antiepileptic and anticonvulsant phytochemicals and currently approved drugs were selected from the NCBI-PubChem (http://pubchem.ncbi.nlm.nih.gov/) chemical databases. Molecular properties and drug likeness were calculated by the Molinspiration software server (www.molinspiration.com). The prediction of ADMET was performed by the FAF-Drugs2 online tool (http://mobyle.rpbs.univ-parisdiderot.fr/cgi-bin/portal.py?form=FAF-Drugs2#forms::FAF-Drugs2). The Schrodinger Glide module (https://www.schrodinger.com/glide) was applied to analyse the docking mechanisms of the mutated gamma subunit of GABA receptors with the library of ligand molecules. Previously, a similar study was conducted by Jasmine and Vanaja (2013), aiming at the optimization of HMG-CoA reductase-inhibiting phytochemicals. The docking analysis of phytochemicals with HMG-CoA reductase was performed with LigandFit of Accelrys Discovery Studio 2.1 (Accelrys Software Inc.), which allowed virtual screening of a database of phytochemicals and prediction of the strongest binders based on various scoring functions. Enzyme-substrate complexes were identified via docking, and their relative stabilities were evaluated using their binding affinities. Das et al. (2017) have recently described in silico protocols for predicting the anti-Alzheimer’s activity of pyranoflavonoids targeting acetylcholinesterase. Three flavonoids (Fig.
1.8) isolated from Artocarpus anisophyllus were subjected to Lipinski filter, ADME/Tox screening, molecular docking, and quantitative structure–activity relationship (QSAR) in silico. In vitro activity was

FIG. 1.8  Flavonoids from Artocarpus anisophyllus.


evaluated by bioactivity staining based on Ellman’s method. To prepare the ligands (flavonoids from A. anisophyllus), the structures were drawn with the ChemDraw Ultra 8.0 software and converted with OpenBabel to ‘smiles’ strings and to 3D structures in ‘sdf’ format. The toxicity and drug likeness of the ligands were investigated using the Mobyle@rpbs online portal and the Molsoft LLC online portal (www.molsoft.com), respectively. Target selection was accomplished with the aid of PharmMapper. For the docking study, the 3D structure of the target protein was obtained from the Protein Data Bank (http://rcsb.org/pdb), and FlexX of BioSolveIT LeadIT was used. A separate docking was also performed with the target and known inhibitors to compare the efficacy of the selected ligand. Docking results, i.e., docking energy, docked amino acid residues, hydrogen bonds, and bond energy, were recorded using LeadIT. Finally, the QSAR analysis was carried out with known inhibitors. The QSAR descriptors, namely molar refractivity, index of refraction, surface tension, density, polarity, and logP, were generated for each of the molecules using ACD ChemSketch software. The activities were expressed as the negative logarithm (i.e., the logarithm of the inverse) of the 50% inhibitory concentration (IC50) values. The descriptors and their corresponding bioactivities were tabulated in MS Excel and loaded into Easy QSAR software for multiple linear regression analysis. From the regression, the QSAR equation was generated and the activity of the chosen ligand was predicted. Ehrman et al. (2010) described the applications of virtual screening and phytochemical data mining (phytochemical informatics) involving the identification of single- and multiple-target ligands through ‘target-fishing’ (Nettles et al., 2006; Wale and Karypis, 2009), a novel technique that seeks to identify multiple receptors to which a compound may bind.
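The QSAR step described above (activities expressed as the negative logarithm of IC50, followed by multiple linear regression on the descriptors) can be sketched without any specialized software. The example below is a minimal illustration that solves the least-squares normal equations directly. It is not the algorithm of Easy QSAR, and the descriptor values, IC50 values, and function names are all made up.

```python
import math


def pic50(ic50_molar):
    """Negative base-10 logarithm of an IC50 given in mol/L."""
    return -math.log10(ic50_molar)


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_mlr(descriptors, activities):
    """Least-squares coefficients [intercept, b1, b2, ...]."""
    rows = [[1.0] + list(d) for d in descriptors]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, activities)) for i in range(p)]
    return solve_linear(xtx, xty)


def predict(coeffs, d):
    """Evaluate the fitted QSAR equation for one descriptor vector."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], d))


# Made-up training set: (logP, molar refractivity) and IC50 in mol/L.
descriptors = [(1.0, 40.0), (2.0, 55.0), (3.0, 45.0), (2.5, 70.0), (4.0, 60.0)]
ic50 = [1e-5, 2e-6, 5e-6, 4e-7, 1e-7]
coeffs = fit_mlr(descriptors, [pic50(v) for v in ic50])
print(coeffs)
print(predict(coeffs, (3.5, 50.0)))  # predicted pIC50 of a new ligand
```

With only a handful of known inhibitors and several descriptors, a model of this kind is easily over-fitted, so predictions for a new ligand are best read as indicative only.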
They also highlighted the role of informatics in bridging the gap between TCM and biomedical science. Powers and Setzer (2015) studied estrogen mimics from dietary herbal supplements using a molecular docking approach. They docked 568 phytochemicals, found in 17 herbal supplements sold in the USA, with two isoforms of the estrogen receptor. Ligand structures were drawn using Spartan 14 (https://www.wavefun.com/products/windows/Spartan14/win_spartan.html), and conformational search and geometry optimization were performed using the MMFF (Merck Molecular Force Field). Receptor-ligand docking studies were performed on the basis of the crystal structures of human estrogen receptors. Molegro Virtual Docker version 6.0 (http://www.softpedia.com/get/ScienceCAD/Molegro-Virtual-Docker.shtml) was used to carry out the molecular docking calculations, and potential binding sites in the receptor were identified by the grid-based cavity prediction algorithm of this programme. Two popular docking programmes, AutoDock (http://autodock.scripps.edu/), used to gain insight into the binding of the compounds to the receptor, and Hex, used for protein-ligand docking calculations, were applied in a comparative docking study of compounds from Lavandula angustifolia, along with diazepam and amobarbital, with


GABAA receptors (Babahedari et al., 2014). In this study, the 3D structures of the ligands were built using the ArgusLab 4.0.1 molecular builder and then optimized with the Gaussian package using B3LYP with the 6-31G* basis set. Mohan et al. (2015) reported molecular docking studies of secondary metabolites isolated from Phyllanthus niruri against hepatitis B virus (HBV) DNA polymerase, which is considered to be essential for replication of the virus in the host and is regarded as one of the most important pharmacological targets for its inhibition. Their study involved homology modelling and molecular docking analysis of Phyllanthus niruri compounds and other nucleoside analogues against this enzyme using the software Discovery Studio 4.0 (available at: http://accelrys.com/resource-center/downloads/updates/discovery-studio/dstudio40/latest.html). Homology modelling (Cavasotto and Phatak, 2009) builds a 3D model of a target protein from the known structure of a homologous protein, exploiting the similarity of residue environments at topologically corresponding positions in the reference proteins. In the absence of experimental data, model building based on a known 3D structure of a homologous protein appears to be the only reliable method for obtaining structural information. Setzer and Ogungbe (2012) investigated the antitrypanosomal activity of 386 phytochemicals from 19 different Nigerian plants in silico using molecular docking with validated Trypanosoma brucei protein targets that were available from the Protein Data Bank (https://www.rcsb.org/pdb/home/home.do). The same group later reported docking studies on 352 antileishmanial phytochemicals (polyphenolic compounds) (Ogungbe et al., 2014). Ravichandran and Sundararajan (2017) have recently carried out in silico virtual drug screening and molecular docking analysis of phytochemical-derived compounds and FDA-approved drugs against the BRCA1 receptor.
Based on the drug screening scores and binding affinity scores of the test molecules, epigallocatechin gallate, a well-known phytochemical from green tea extract, and a commercial drug, doxorubicin hydrochloride, were found to be the best candidates in the docking study. Ahmed et al. (2017) have reported the application of molecular docking together with molecular dynamics simulations (MDSs) in the assessment of the inhibitory activities of phytochemicals against pancreatic lipase. MDSs, first introduced by Alder and Wainwright in the 1950s, are computational methods that calculate the time-dependent behaviour of a molecular system. MDSs are now routinely used to study the structure, dynamics, and thermodynamics of biological molecules and their complexes, and in the determination of structures from X-ray crystallography and NMR experiments (Hospital et al., 2015). Initially, Ahmed et al. (2017) docked 3770 phytochemicals against pancreatic lipase and ranked them by binding affinity. Molecular dynamics simulations were performed using AMBER16 (http://ambermd.org/). Further details on methodology and more examples of the use of computational techniques in molecular docking studies and virtual screening (in silico studies) of phytochemicals are available in Chapters 10 and 11.
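The core of any MDS is the numerical integration of Newton's equations of motion in small time steps. The sketch below illustrates this with the velocity Verlet scheme applied to a single particle in a one-dimensional harmonic potential, in reduced units. It is a toy model chosen so that the result can be checked against the analytical solution, and it makes no claim about how AMBER or any other package is implemented; the force constant, mass, time step, and function names are assumptions.

```python
import math


def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equations; return the trajectory as (x, v) pairs."""
    trajectory = [(x, v)]
    f = force(x)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # second half-step velocity update
        trajectory.append((x, v))
    return trajectory


K = 1.0      # force constant (reduced units, assumed)
MASS = 1.0   # particle mass (reduced units, assumed)


def harmonic_force(x):
    """Force for the potential U(x) = K x^2 / 2."""
    return -K * x


def total_energy(x, v):
    """Kinetic plus potential energy of the oscillator."""
    return 0.5 * MASS * v * v + 0.5 * K * x * x


traj = velocity_verlet(x=1.0, v=0.0, force=harmonic_force,
                       mass=MASS, dt=0.01, n_steps=1000)
e_start = total_energy(*traj[0])
e_end = total_energy(*traj[-1])
print(e_start, e_end)
print(traj[-1][0], math.cos(10.0))  # numerical vs analytical position at t = 10
```

Near-conservation of the total energy over the run is a common sanity check on the time step. In biomolecular simulations the same loop is applied to many thousands of atoms, with forces derived from an empirical force field.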


1.3.4  Structure Prediction and Structure Determination

Computational methods can play a significant role in the structure elucidation of phytochemicals based on spectroscopic data, and in predicting theoretical spectra of phytochemicals (see Chapter 7). Computer-assisted elucidation of the structures of phytochemicals has been documented in several publications (Massiot and Nuzillard, 1992; Schaller et al., 1996; Munk, 1998; Elyashberg et al., 2008, 2009). The potential of computer-assisted methods for the structure elucidation of organic molecules was first considered during the second half of the past century, and later a new area of investigation, known as Computer-Aided Structure Elucidation (CASE), emerged, which was based on the same general cognitive principles common to the properties of particles belonging to the atomic and sub-atomic world. CASE was applied initially to small molecules, as distinct from biological macromolecules and biopolymers. Since the inception of CASE methods, efforts have been directed to the creation of artificial intelligence or expert systems based on the analysis of 1D 1H and 13C NMR data in combination with MS and IR spectra (Elyashberg et al., 2009). Now that 2D NMR data can be generated routinely, even under automation, a multitude of data is available as input to CASE systems, e.g., from HSQC (HMQC), 1H-1H COSY (TOCSY), and HMBC methods. The present capabilities of 2D NMR expert systems to perform structure elucidation and verification were reviewed by Elyashberg et al. (2008). An example of such a system is Structure Elucidator (StrucEluc System). The aim of any expert system for structure elucidation is to extract the maximum amount of structural information from the available spectral data, and in principle, the structural information can be easily quantified. The main stages of any computer-assisted structure elucidation process, as initially introduced, are shown in Fig. 1.9.
Further details are available in the publication by Elyashberg et al. (2009).

1H and 13C NMR-based structure elucidation of phytochemicals using computational and mathematical approaches involves the following steps:
1. using molecular mechanics calculations (with, e.g., MacroModel) to generate a library of conformers;
2. applying density functional theory (DFT) calculations (with, e.g., Gaussian 09) to determine the optimal geometry, free energies, and chemical shifts for each conformer;
3. determining Boltzmann-weighted proton and carbon chemical shifts;
4. comparing the computed chemical shifts for two or more candidate structures with experimental data to determine the best fit.

Rychnovsky (2006) reported a computational method for predicting NMR spectra and, based on that model, revised the structure of hexacyclinol. The structure of an unusual halogenated diterpene (Fig. 1.10), isolated from Stypopodium flabelliforme, was deduced by NMR data analyses and computational studies involving DFT calculations (using the B3LYP and mPW1PW91 exchange-correlation functionals and the standard 6-31G(d) basis set) and 13C

An Introduction to Computational Phytochemistry  Chapter | 1  17

FIG. 1.9  Computer-assisted structure elucidation process. Stages shown: spectroscopic data and molecular formula; selection of fragments and creation of fragment sets; spectrum-structure correlations; structure creation from atoms and fragments; structural and spectral filtering of isomers; spectral prediction for candidate structures; choice of the most plausible structure. Adopted from Elyashberg, M., Blinov, K., Molodtsov, S., Smurnyy, Y., Williams, A.J., Churanova, T., 2009. Computer-assisted methods for structure elucidation: realizing a spectroscopist’s dream. J. Cheminform. 1, 3. https://doi.org/10.1186/1758-2946-1-3.

FIG. 1.10  Chlorinated diterpene from Stypopodium flabelliforme.

NMR chemical shift calculations by the gauge-invariant atomic orbital (GIAO) method at the same level (Wolinski et al., 1990). Solvent effects were assessed by carrying out single-point B3LYP/6-31G(d) and mPW1PW91/6-31G(d) calculations at the gas-phase stationary points involved in the reaction, using the polarizable continuum model (PCM) as outlined by Tomasi and Persico (1994). A dielectric constant ε = 4.9 was used for CHCl3, and relative chemical shifts were estimated with respect to the TMS shielding, calculated with the same theoretical model and using the PCM method to account for the solvent effect. All calculations were performed with the Gaussian 03 suite of programmes.

Yan et al. (2016) reported the isolation of two new unusual 16-nor limonoids, harperspinoids A and B (Fig. 1.7), from the aerial parts of Harrisonia perforata. The structures were elucidated using NMR spectroscopy, X-ray diffraction analysis, and computational modelling with the Gaussian 03 programme at the B3LYP/6-31G* level (Gaussian 03, revision D.01, Gaussian Inc., Pittsburgh).

The absolute configuration of phyllostin and scytolide, produced by Phyllosticta cirsii, and oxysporone, isolated from Diplodia africana, was assigned by computational analysis of their optical rotatory dispersion (ORD),


electronic circular dichroism (ECD), and vibrational circular dichroism (VCD) spectra (Fig. 1.11). A satisfactory agreement between experimental and calculated VCD spectra could be obtained only after taking solvent effects into account. Preliminary conformational analysis was carried out with the Spartan 02 package using the MMFF94s molecular mechanics force field and a Monte Carlo search on the chosen compounds. After assessing the conformational space, the geometries within a 10-kcal/mol energy window were subjected to ab initio energy minimization as implemented in the Gaussian 09 package.

A similar study, describing the determination of the absolute configuration of the phytotoxins (+)-inuloxin B and (−)-inuloxin C (Fig. 1.12), isolated from Inula viscosa, and their potential herbicidal activity for the management of parasitic plants, has recently been published (Evidente et al., 2016). The absolute configuration was assigned by time-dependent DFT (TDDFT) computational prediction of ECD and ORD spectra. Preliminary conformational analysis was performed with the Schrödinger package using the OPLS-2005 molecular mechanics force field and a Monte Carlo search on the inuloxins. After determining the conformational space, the geometries within a 10-kcal/mol energy window were subjected to ab initio energy minimization as implemented in the Gaussian 09 package. Calculations of ORD and ECD spectra were carried out at the TDDFT level of theory using either the B3LYP or the CAM-B3LYP functional and the aug-cc-pVDZ basis set. The use of the long-range corrected CAM-B3LYP functional afforded better ECD spectra simulation results than the more common B3LYP functional. The theoretical ORD and ECD spectra were obtained as averages weighted by Boltzmann populations. The ECD spectra, in particular, were obtained from calculated excitation energies and rotational strengths as a sum of Gaussian functions centred at the wavelength of each transition with a

FIG. 1.11  Structures of phyllostin, scytolide, and oxysporone.

FIG. 1.12  Structures of (+)-inuloxin B and (−)-inuloxin C.


parameter σ (width of the band at half height) of 0.3 or 0.4 eV and elaborated using the SpecDis v1.51 programme.

Psychotripine (Fig. 1.13), a pyrroloindoline from Psychotria pilifera, was characterized by spectroscopic data analysis as well as by quantum chemical calculations (Li et al., 2011a). The 13C NMR chemical shifts were computed at the B3LYP/6-311++G(2d,p)//B3LYP/6-31+G(d) level and compared with the experimental data to produce the relative errors. After a conformational search, two low-energy conformations were found, and the B3LYP/6-31+G(d)-optimized conformations were used in optical rotation (OR) computations at the B3LYP/6-311++G(2d,p) level. Its electronic circular dichroism (ECD) was investigated at the B3LYP/6-31+G(d,p) level. A half-width of 0.2 eV was used in the ECD simulations.

Naman et al. (2015) reported computer-assisted structure elucidation of black chokeberry (Aronia melanocarpa) fruit juice isolates with a new non-crystalline fused pentacyclic flavonoid skeleton with two contiguous hemiketals, e.g., melanodiol 4″-O-protocatechuate and melanodiol (Fig. 1.14). Because of significant hydrogen deficiency indices, their structures were difficult to determine based only on information obtained from conventional

FIG. 1.13  Structure of psychotripine from Psychotria pilifera.

FIG. 1.14  Fused pentacyclic flavonoid skeleton: melanodiol 4″-O-protocatechuate and melanodiol.


spectroscopic methods, but could be determined using computer-assisted structure elucidation software, ACD Structure Elucidator (http://www.acdlabs.com/products/com_iden/elucidation/struc_eluc/) (Moser et al., 2012). While Constantin et al. (2010) introduced a new module of the expert system SISTEMAT and applied it to the structure determination of neolignans, several other computer-aided structure elucidation protocols had been reported in the literature as early as 1981 (Abe et al., 1981; Fujiwara et al., 1981).

8-Epicordatin (Fig. 1.15) is a crystalline compound isolated from the bark and leaves of Croton palanostigma Klotzsch of the family Euphorbiaceae. While conventional data, e.g., NMR, MS, and X-ray diffraction, helped elucidate the structure of this compound, theoretical NMR calculations had to be performed at the B3PW91/DGDZVP level to confirm the assignment of the chemical shifts of the H-7α and H-7β hydrogens (Brasil et al., 2010). The geometry obtained from X-ray diffraction data was used as input for the full geometry optimization. This molecular conformation was optimized using the B3LYP hybrid functional together with the 6-31G(d,p) basis set in the Gaussian 03 molecular package.

A similar approach was used in the structure determination of epoxyroussoeone and epoxyroussoedione (Fig. 1.16), isolated from the fungal strain Roussoella japanensis KT1651 (Honmura et al., 2015). Although NMR spectra provided insufficient structural information, computation of the theoretical chemical shifts with DFT EDF2/6-31G* enabled elucidation of the planar structure, as well as the relative configuration. Their electronic circular dichroism (ECD) spectra suggested the absolute configurations, which were confirmed with time-dependent DFT calculations employing BHandHLYP/TZVP.

FIG. 1.15  Structure of 8-epicordatin.

FIG. 1.16  Epoxyroussoeone and epoxyroussoedione from Roussoella japanensis.
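The ECD simulations described earlier in this section (a sum of Gaussian functions centred at each transition, followed by Boltzmann averaging over conformers) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the function names, the transition energies, rotational strengths and populations are made up, the band width is treated as a full width at half height in eV, and unit prefactors are omitted, so the output is in arbitrary units. It is not the SpecDis implementation.

```python
import math

EV_NM = 1239.84193  # hc in eV*nm, converts wavelength (nm) to energy (eV)

def gaussian_band(energy_ev, center_ev, fwhm_ev):
    """Unit-height Gaussian whose full width at half height is fwhm_ev."""
    s = fwhm_ev / (2.0 * math.sqrt(2.0 * math.log(2.0)))
    return math.exp(-0.5 * ((energy_ev - center_ev) / s) ** 2)

def simulate_ecd(transitions, wavelengths_nm, fwhm_ev=0.3):
    """Sum one Gaussian per (excitation energy, rotational strength) pair."""
    spectrum = []
    for wavelength in wavelengths_nm:
        energy = EV_NM / wavelength
        spectrum.append(sum(strength * gaussian_band(energy, center, fwhm_ev)
                            for center, strength in transitions))
    return spectrum

def boltzmann_average(spectra, populations):
    """Population-weighted average of per-conformer spectra."""
    n_points = len(spectra[0])
    return [sum(w * s[i] for w, s in zip(populations, spectra))
            for i in range(n_points)]

# Hypothetical input: (excitation energy in eV, rotational strength) per conformer.
conformer_1 = [(4.2, 12.0), (5.1, -20.0)]
conformer_2 = [(4.3, 8.0), (5.0, -15.0)]
populations = [0.7, 0.3]  # assumed Boltzmann populations

wavelengths = list(range(200, 401, 5))
spectra = [simulate_ecd(c, wavelengths) for c in (conformer_1, conformer_2)]
averaged = boltzmann_average(spectra, populations)
```

In practice the simulated curve would be compared with the experimental spectrum, and with the mirror-image curve for the opposite enantiomer, to support an assignment of absolute configuration.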


Computational tools, especially NMR prediction using quantum chemical methods, can facilitate the confirmation, assignment, and reassignment of structures of phytochemicals. While computational methods for predicting 1H and 13C NMR chemical shifts are well established, they do not work in all cases, especially where readily exchangeable protons affect NMR chemical shifts. In these cases, NMR spectra are pH-sensitive, and such molecules tend to oligomerize, making spectral prediction extremely difficult. Further details on methodology and more examples of the use of computational techniques in structure determination and prediction and spectral interpretation are available in Chapter 7.
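Steps 3 and 4 of the NMR workflow outlined earlier in this section (Boltzmann weighting of conformer shifts, then comparison with experiment) can be sketched as follows. This is an illustration under stated assumptions rather than any published protocol: the energies and chemical shifts are made-up numbers, the function names are hypothetical, and the mean absolute error is only one of several possible measures of fit.

```python
import math

R_KCAL_PER_MOL_K = 0.0019872041  # gas constant in kcal/(mol*K)

def boltzmann_populations(relative_energies, temperature=298.15):
    """Populations of conformers from their relative free energies (kcal/mol)."""
    lowest = min(relative_energies)
    factors = [math.exp(-(e - lowest) / (R_KCAL_PER_MOL_K * temperature))
               for e in relative_energies]
    total = sum(factors)
    return [f / total for f in factors]

def weighted_shifts(shifts_per_conformer, populations):
    """Population-weighted chemical shift for each nucleus."""
    n_nuclei = len(shifts_per_conformer[0])
    return [sum(w * conf[i] for w, conf in zip(populations, shifts_per_conformer))
            for i in range(n_nuclei)]

def mean_absolute_error(calculated, experimental):
    return sum(abs(c - e) for c, e in zip(calculated, experimental)) / len(experimental)

# Hypothetical data: 13C shifts (ppm) for four carbons, three conformers per candidate.
experimental = [170.2, 77.4, 35.1, 21.0]
energies_a = [0.0, 0.6, 1.8]  # kcal/mol, assumed
energies_b = [0.0, 0.9, 1.2]
candidate_a = [[170.0, 77.0, 35.5, 21.3], [171.1, 78.2, 34.6, 20.5], [169.0, 76.0, 36.4, 22.0]]
candidate_b = [[175.3, 72.1, 39.8, 25.0], [174.0, 73.5, 38.9, 24.1], [176.2, 71.0, 40.5, 26.2]]

populations_a = boltzmann_populations(energies_a)
populations_b = boltzmann_populations(energies_b)
mae_a = mean_absolute_error(weighted_shifts(candidate_a, populations_a), experimental)
mae_b = mean_absolute_error(weighted_shifts(candidate_b, populations_b), experimental)

errors = {"A": mae_a, "B": mae_b}
best = min(errors, key=errors.get)  # candidate with the smallest error
```

The sketch assumes the shifts are already referenced (e.g., to TMS computed at the same level of theory); that referencing step is not shown.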

1.3.5  Chemometrics and Principal Component Analysis

Chemometrics is not a single tool, but a wide variety of methods including basic statistics, signal processing, factorial design, calibration, curve fitting, factor analysis, detection, pattern recognition, and neural networks (Kumar et al., 2014; Sarker and Nahar, 2015). The term chemometrics was first introduced by the Swedish scientist Svante Wold in 1971, and was presented simply as a collection of applications of mathematical and statistical techniques to retrieve more information from chromatographic data. The term has been officially defined as the science of relating measurements made on a chemical system or process to the state of the system via the application of mathematical or statistical methods (Sharaf et al., 1986; Massart et al., 1988; Kumar et al., 2014; Sarker and Nahar, 2015). Chemometrics, which is essentially a group of multivariate data analysis tools, can generally be applied for one or more of the following purposes:
1. to design or select optimal measurement procedures and experiments;
2. to obtain maximum chemical information by analysing chemical data;
3. to explore patterns of association in data;
4. to track properties of materials on a continuous basis; and
5. to prepare and use multivariate classification models.

Apart from the statistical–mathematical methods, chemometrics methods are also linked to problems of the computer-based laboratory, to methods for handling chemical or spectroscopic databases, and to methods of artificial intelligence. Chemometrics can be classified into two main categories: pattern recognition methods (unsupervised and supervised) when a qualitative evaluation is considered, and multivariate calibration for quantitative purposes (Brenton, 2003). Design of experiment, data pre-processing, classification, and calibration are the main practical steps involved in any chemometrics analysis (Sarker and Nahar, 2015).
Experimental design primarily screens factors that are important for the success of a process. Selection and implementation of the optimized conditions under which the process will be performed come next. Chemometrics


tools comprise, among many others, various pattern recognition methods, hierarchical cluster analysis, and multivariate regression, decomposition, and calibration methods, e.g., tri-linear regression calibration, multi-linear regression calibration, classical least squares, inverse least squares, partial least squares regression, principal component analysis (PCA), parallel factor analysis, soft independent modelling of class analogy, and linear discriminant analysis (Kumar et al., 2014). In chemometrics, patterns in the data are modelled. These models can then be routinely applied to future data in order to predict the same quality parameters. The chemometrics approach thus improves the efficiency of product quality assessment, and can lead to more efficient laboratory practices or automated quality control systems. The only requirements are appropriate computational tools, e.g., a PC and software to interpret the patterns in the data. Chemometrics is the bridge connecting the state of a chemical system to the measurements of the system. It has become an essential part of the modern chemical and biomedical industries. Chemometrics software has been widely used by product development scientists, process engineers, PAT specialists, and QA/QC scientists to build reliable models, ensure product quality, classify raw materials, and monitor process end points in real time. The science of chemometrics offers many efficient ways to solve the calibration problem for the analysis of spectral data. Chemometrics can be used to enhance method development and to make routine use of statistical models for data analysis.
All chemometrics procedures begin with taking a measurement and collecting data, and continue with employing mathematical and statistical methods to extract relevant information from the data, ensuring that the information is related to the chemical process, extracting knowledge about the system, and allowing comprehension of the system so that this understanding can facilitate decision-making. Therefore, the chemometrics progression involves measurement, data collection, information extraction, knowledge provision, and understanding. Chemometrics helps remove redundant data, reduce variation not related to the analytical signal, and build models. The applications of chemometrics in phytochemical research have grown over the last few decades, especially in relation to chemical fingerprinting of plant extracts and plant metabolomics (see Chapter 8). Details on the chemometrics techniques for the analysis of herbal products or phytochemicals are available in several excellent articles published in the last 5 years (Gad et al., 2012; Bansal et al., 2014; Kumar et al., 2014; Pawar and Kamat, 2014; Kowalczuk et al., 2015; Sarker and Nahar, 2015; Donno et al., 2016). Chemometrics tools have also been applied to the optimization of extraction processes for botanical materials (Das et al., 2014). Kumar et al. (2014) reviewed applications of chemometrics in analytical chemistry, including phytochemical analysis, and presented several specific examples, e.g., two-wavelength HPLC fingerprinting analysis for the quality assessment of several Cassia seed samples, simultaneous determination of triterpenes by the HPLC-DAD method in the fruits of Ziziphus jujuba, HPLC fingerprint analysis incorporating chemometrics methods for species differentiation, quality evaluation and consistency check of Radix Paeoniae collected from different sources, matching and discrimination of Artemisia selengensis and Rhizoma Coptidis samples, and classification of samples of Ganoderma lucidum.

PCA is an unsupervised pattern recognition technique used for handling multivariate data without prior knowledge about the samples under investigation (Jolliffe, 2002). The supervised classification procedure using soft independent modelling of class analogy (SIMCA), which is based on building PCA models of predefined classes and assigning unknown samples to those class models, is also used in chemometrics analysis (Bansal et al., 2014). The idea of PCA is mainly to reduce the dimensionality of a data set comprising a large number of interrelated variables while retaining as much as possible of the variation in the data set. PCA is usually used to evaluate the discrimination ability of common components using relative peak areas of common peaks as input data instead of the full fingerprint. Large dimensionality and unknown distributions are often met in plant biotechnology and phytochemistry investigations. PCA offers reduction in dimensionality, and the nonparametric Kruskal-Wallis ANOVA may allow separation of the influence of factors even if the distribution is unknown. PCA reduces a set of data into three new sets of variables: principal components, scores, and loadings, which can be used to develop and examine latent variations. The data are decomposed into scores and loadings; scores reveal information about inter-sample variations, and loadings establish which variables from within the original data contribute most to the scores. PCA has been used quite routinely in phytochemical analysis in recent years, and various commercial software packages are available to help with this analysis. Pandey et al.
(2015) carried out PCA, based on the content of 23 bioactive markers in eight collections of Ocimum sanctum representing different geographical origins within India using the software STATISTICA 7.0 (StatSoft Inc., Tulsa, OK, USA; http://statistica.software.informer.com/7.0/). A metabolomics approach incorporating NMR spectroscopy with multivariate data analysis for metabolic profiling of artichoke and cardoon varieties allowed determination of relevant differences in the relative content of the metabolites for the species analysed, and it was the first application of 1H NMR with multivariate statistics to provide a metabolomic fingerprinting of Cynara scolymus (De Falco et  al., 2016). The same software, STATISTICA 7.0, was also used in this study. Viacava and Roura (2015) performed principal component and hierarchical cluster analysis, using the software, PC-ORD, to select natural elicitors for enhancing phytochemical content and antioxidant activity of lettuce sprouts. Interrelationships based on phytochemical properties in strawberry cultivars were analysed by PCA (Samec et al., 2016). Preethi et al. (2014) reported the PCA and HPTLC fingerprint study involving Withania coagulans root extracts, where PCA established significant variance in phytochemical compositions in collected samples. XLSTAT (https://www.xlstat.com/en/) and NTSYSpc (http://en.freedownloadmanager.org/Windows-PC/NTSYSpc.html) software were used, respectively, for PCA and cluster analysis in this study.
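The decomposition of a data table into scores and loadings, as described above, can be illustrated with a short sketch. It uses the NIPALS iteration, which is an assumption made here for illustration: the source does not state which algorithm the cited software packages use. The data table (relative peak areas for five samples and three markers) and all function names are hypothetical.

```python
import math

def center_columns(data):
    """Subtract the column mean from every value (mean-centring)."""
    n_rows = len(data)
    means = [sum(column) / n_rows for column in zip(*data)]
    return [[value - mean for value, mean in zip(row, means)] for row in data]

def nipals_pca(data, n_components=2, tol=1e-10, max_iter=1000):
    """Return (scores, loadings, explained) for the first n_components."""
    residual = center_columns(data)
    n_vars = len(residual[0])
    total_ss = sum(v * v for row in residual for v in row)
    scores, loadings, explained = [], [], []
    for _ in range(n_components):
        # Start from the variable with the largest remaining sum of squares.
        column_ss = [sum(row[j] ** 2 for row in residual) for j in range(n_vars)]
        start = column_ss.index(max(column_ss))
        t = [row[start] for row in residual]
        for _ in range(max_iter):
            tt = sum(v * v for v in t)
            # Loadings: project the variables onto the current scores, then normalise.
            p = [sum(row[j] * ti for row, ti in zip(residual, t)) / tt
                 for j in range(n_vars)]
            norm = math.sqrt(sum(v * v for v in p))
            p = [v / norm for v in p]
            # Scores: project the samples onto the loadings.
            t_new = [sum(a * b for a, b in zip(row, p)) for row in residual]
            change = math.sqrt(sum((a - b) ** 2 for a, b in zip(t_new, t)))
            t = t_new
            if change < tol:
                break
        # Remove the extracted component before looking for the next one.
        residual = [[v - ti * pj for v, pj in zip(row, p)]
                    for row, ti in zip(residual, t)]
        scores.append(t)
        loadings.append(p)
        explained.append(sum(v * v for v in t) / total_ss)
    return scores, loadings, explained

# Hypothetical relative peak areas: rows are samples, columns are marker compounds.
samples = [
    [2.0, 1.0, 0.5],
    [2.2, 1.1, 0.4],
    [4.0, 2.1, 0.6],
    [4.1, 1.9, 0.5],
    [3.0, 1.5, 0.7],
]
scores, loadings, explained = nipals_pca(samples, n_components=2)
```

Here `scores[0]` holds the position of each sample along the first principal component, `loadings[0]` shows how strongly each marker contributes to it, and `explained` gives the fraction of the total variation captured by each component. The sketch assumes the data have at least as many independent directions of variation as components requested.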


Propolis samples, collected from various geographical origins, were analysed by a combination of PCA and 1H NMR (Watson et al., 2006). In this study, 1H NMR spectra were processed using the software application MestRe-C (version 4.5.9.1, Mestrelab Research, Santiago de Compostela, Spain; http://en.freedownloadmanager.org/Windows-PC/MestReC.html). The processed spectra were sliced into 0.1 ppm sections between 0 and 9 ppm, thus producing 90 discrete bucketed regions. The data sets were imported directly from MestRe-C into SIMCA-P (http://umetrics.com/kb/about-simca-pp-115) for PCA, and the first derivatives of the spectra were used. Dey et al. (2015) carried out comparative phytochemical profiling of Clerodendrum infortunatum using GC-MS and multivariate statistical analysis, e.g., PCA based on the correlation matrix. The results were further analysed by hierarchical cluster analysis (HCA) associated with a proximity score matrix represented as a proximity heat-map. PCA and HCA were performed using the IBM SPSS Statistics version 20.0 software package for Windows. Some common PCA software packages are listed in Table 1.1. In the analysis of herbal products, especially by HPLC followed by chemometrics analysis, pre-treatment of data is required because of unknown components, overlapped peaks, and drifted baselines in the HPLC chromatogram.
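The bucketing step described above for the propolis study (0.1 ppm sections between 0 and 9 ppm, giving 90 regions) can be sketched as follows. The spectrum is a handful of made-up points, the function names are hypothetical, and the normalisation to total intensity is an added assumption rather than a step reported for that study.

```python
import math

def bucket_spectrum(ppm, intensity, start=0.0, stop=9.0, width=0.1):
    """Sum intensities into equal-width chemical shift buckets."""
    n_buckets = round((stop - start) / width)
    buckets = [0.0] * n_buckets
    for shift, value in zip(ppm, intensity):
        if start <= shift < stop:
            # Small offset guards against floating-point error at bucket edges.
            index = math.floor((shift - start) / width + 1e-9)
            buckets[min(index, n_buckets - 1)] += value
    return buckets

def normalise_to_total(buckets):
    """Scale buckets so that they sum to one (assumes a non-zero total)."""
    total = sum(buckets)
    return [b / total for b in buckets]

# Hypothetical spectrum: chemical shifts (ppm) and intensities.
ppm = [0.05, 1.21, 1.27, 3.33, 8.95, 9.50]
intensity = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

buckets = bucket_spectrum(ppm, intensity)
normalised = normalise_to_total(buckets)
```

Each spectrum then becomes one row of 90 values, and a set of such rows forms the data table passed to PCA. The point at 9.50 ppm falls outside the bucketed range and is ignored.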

TABLE 1.1  List of PCA Software and Their Sources

| PCA software | Source |
|---|---|
| Analyse-it | https://analyse-it.com/landing/PCA-software |
| BioDiversity Pro | http://en.freedownloadmanager.org/Windows-PC/BioDiversity-Pro-FREE.html; http://biodiversity-pro2.software.informer.com/ |
| NTSYSpc | Department of Ecology and Evolution, State University of New York, Stony Brook, NY 11794–5245; http://www.exetersoftware.com/cat/ntsyspc/ntsyspc.html |
| PC-ORD 7.0 | MJM Software Design; https://www.pcord.com/ |
| Solo software | Eigenvector Research, Inc., 196 Hyacinth Road, Manson, WA 98831; http://www.eigenvector.com/software/index.htm |
| STATISTICA 7.0 | StatSoft Inc., Tulsa, OK, USA; http://statistica.io/ |
| Umetrics Suite (Simca) | http://umetrics.com/ |
| The Unscrambler X | http://www.camo.com/rt/Products/Unscrambler/unscrambler.html |
| XLSTAT | https://www.xlstat.com/en/download |


For the analysis of herbal product data for quality control and standardization, in addition to PCA as mentioned above, a number of other chemometrics techniques may also be applied, such as linear discriminant analysis (LDA), spectral correlative chromatography (SCC), information theory (IT), local least squares (LLS), heuristic evolving latent projections (HELP), and orthogonal projection analysis (OPA) (Bansal et al., 2014). The commonly used pre-treatments of data obtained from herbal product analysis are LLS and normalization.

Metabolomics is a data-driven approach with predictive power that aims to assess all measurable metabolites without any pre-conception or pre-selection (see Chapter 8). Metabolomics investigates the occurrence and change of concentrations of small molecular weight chemical compounds (metabolites or metabolomes) in organisms, organs, tissues, cells, and ultimately cell compartments in the context of environmental changes, disease, or other boundary conditions, by utilizing appropriate spectroscopic and chromatographic techniques and by observing at once not only a few but all compounds visible to the particular technique used. It is not only the third large omics area of research next to genomics and proteomics, but also a field at the interface between chemistry and biology, helping to answer biological questions using analytical chemistry and cheminformatics tools (Bartel et al., 2013). PCA is arguably the most widely used multivariate analysis method for metabolic fingerprinting and, in fact, for chemometrics in general (Worley and Powers, 2013). PCA and related multivariate approaches have been applied successfully to conduct meaningful plant metabolomics studies, which are essentially an extension of chemical fingerprinting methods with added insights into plant metabolomes (Miyagi et al., 2010; Tugizimana et al., 2013; Madala et al., 2014).
The primary and secondary metabolites in plant cells are the final recipients of biological information flow, and their levels influence gene expression and protein stability (Tugizimana et al., 2013). Measurement of these metabolites may reveal the cellular state under certain conditions and provide useful insights into the cellular processes associated with the biochemical phenotype of the cell, tissue, or whole organism. Plant metabolomics aims to map all metabolites in a plant both qualitatively and quantitatively using appropriate chromatographic and spectroscopic methods, e.g., HPLC, HPTLC, GC, NMR, and MS, and applying PCA. Booker et al. (2014) described such a plant metabolomics study involving Serenoa repens, where they utilized mainly 1H NMR (Analysis of Mixtures software v3.0, Bruker Biospin, Rheinstetten, Germany; TOPSPIN version 1.3 software for spectra acquisition and processing) and PCA tools (SIMCA Version 13.0). 1H NMR metabolomics analysis incorporating a multivariate data analysis technique was applied to Echinacea spp. (Frederich et al., 2010), where PCA was performed with the SIMCA-P software (v. 12.0, Umetrics, Umeå, Sweden) using the Pareto scaling method. There are numerous examples of plant metabolomics studies where PCA and/or other multivariate analyses were used. Chapter 8 has further details and examples of applications of computational and mathematical modelling in plant metabolomics studies.
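Pareto scaling, mentioned above as the pre-treatment used before PCA, mean-centres each variable and divides it by the square root of its standard deviation. The following minimal sketch uses a made-up intensity table and a hypothetical function name; it assumes that no variable is constant (a zero standard deviation would cause a division by zero) and uses the sample standard deviation.

```python
import math
import statistics

def pareto_scale(data):
    """Mean-centre each column and divide by the square root of its standard deviation."""
    columns = list(zip(*data))
    means = [statistics.mean(column) for column in columns]
    root_sds = [math.sqrt(statistics.stdev(column)) for column in columns]
    return [[(value - mean) / root_sd
             for value, mean, root_sd in zip(row, means, root_sds)]
            for row in data]

# Hypothetical table: rows are samples, columns are an intense and a weak signal.
intensities = [[100.0, 1.0], [200.0, 2.0], [300.0, 4.0]]
scaled = pareto_scale(intensities)
```

Compared with scaling to unit variance, this keeps large signals somewhat more influential than small ones while still reducing their dominance, which is a common reason for choosing it with NMR and MS data.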


1.3.6  Data Mining and Databases

Computational methodologies have been applied for the overall advancement of plant-based drug discovery through appropriate data mining and information-led bioprospecting, and have contributed significantly to the development of computational phytochemistry (see Chapters 2, 10 and 11). Simply put, data mining is the process of sorting through large data sets to identify patterns and establish relationships in order to solve problems through data analysis using various computational tools. An inclusive and holistic approach to data mining is essential for computer-based phytochemical work, starting right from the selection of plant materials to the isolation and identification of lead compounds. Data mining techniques can provide solutions for developing targeted bioprospecting and screening strategies (Sharma and Sarkar, 2012). Data mining can be described as a computing process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems, in order to transform large data into an understandable structure for further use. It incorporates database and data management components, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. Data mining in relation to phytochemical research begins with the digitization of historical texts and scripts; several initiatives have already been taken by various organizations to this effect. The Biodiversity Heritage Library (http://www.biodiversitylibrary.org/) and the China-US Million Book Digital Library Project (http://www.cadal.zju.edu.cn) are two such major initiatives.
Several electronic databases cataloguing information related to one or more aspects of medicinal plants, e.g., ethnobotany, taxonomy, bioactive phytochemicals, pharmacological uses, genomic or transcript-based information, and molecular targets of active ingredients, are now available. Some of these well-known databases are listed in Table 1.2. NAPRALERT is a popular relational database of natural products, which incorporates ethnomedical information, and pharmacological and biochemical information on extracts of organisms in vitro, in situ, in vivo, in human (case reports, non-clinical trials) and clinical studies. This database includes 200,000 scientific papers and reviews, representing organisms from all countries of the world, including marine organisms and microorganisms. It covers literature on clinical studies of natural products (including safety), natural products that affect sugar metabolism, mammalian reproduction, cancer growth, and plant growth, and those that possess analgesic, antiviral (including HIV/AIDS), anti-inflammatory, antitubercular, chemopreventive, insecticidal, and plant growth stimulatory and inhibiting activity. NAPRALERT also includes natural products and tropical diseases, ethnomedical information on more than 20,000 plant species, metabolism and pharmacokinetics, review articles on organisms at the genus and/or species levels, and reviews of secondary metabolites. However, like all other


TABLE 1.2  Medicinal Plant Databases

| Name of the database | Web address |
|---|---|
| The International Ethnobotany Database (ebDB) | http://ebdb.org |
| NAPRALERT | https://www.napralert.org/ |
| GRIN Databases | http://www.ars-grin.gov/ |
| Dr Duke’s Phytochemical and Ethnobotanical Databases | https://phytochem.nal.usda.gov/phytochem/search; https://data.nal.usda.gov/dataset/dr-dukes-phytochemical-and-ethnobotanical-databases_2719 |
| Flora Europaea (Euro + Med Plantbase) | http://ww2.bgbm.org/EuroPlusMed/query.asp |
| Traditional Chinese Medicine Information Database (TCM-ID) | http://www.megabionet.org/tcmid/ |
| Herb Ingredients’ Target (HIT) | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3013727/; http://lifecenter.sgst.cn/hit/ |
| Raintree | http://rain-tree.com/ethnic.htm |
| Medicinal Plants Genomics Resource | http://medicinalplantgenomics.msu.edu/ |

databases, this database is in no way complete or comprehensive. It should be used together with other databases to retrieve the required information when selecting plant materials for any phytochemical study (see Chapter 2).

The Germplasm Resources Information Network (GRIN) web server provides germplasm information about plants, animals, microbes, and invertebrates. This programme is within the U.S. Department of Agriculture’s Agricultural Research Service. On this server, the GRIN Taxonomy database (http://www.ars-grin.gov/~sbmljw/johnindex.html) is an excellent online resource for retrieving information not only on plant taxonomy, but also on economic plants, rare plants, and some uses. It incorporates information on 37,000 taxa and 14,000 genera. There are several other plant taxonomic databases, e.g., the International Plant Names Index, Index Nominum Genericorum, Database of North American Plants, Collaborative Floristic Efforts of North American Botanists, and the Nomenclatural and Specimen Database of the Missouri Botanical Garden, which could also be useful in phytochemical research.

Dr. Duke’s Phytochemical and Ethnobotanical Databases is another electronically retrievable web-based source of information useful for plant selection. From here, one can obtain a list of chemicals and activities for a specific plant of interest, using either its scientific or common name, download a list


of chemicals and their known activities in PDF or spreadsheet form, find plants with chemicals known for a specific biological activity, display a list of chemicals with their lethal dose toxicity data, find plants with potential cancer-preventing activity, display a list of plants for a given ethnobotanical use, and find out which plants have the highest levels of a specific chemical. Appropriate published references are also available.

Euro+Med Plantbase (Table 1.2) integrates and appraises information from Flora Europaea, Med-Checklist, the Flora of Macaronesia, and from regional and national floras and checklists from the area, as well as additional taxonomic and floristic literature. This is further enriched by the European taxa of several families taken from the World Checklist of Selected Plant Families and of the Leguminosae from the International Legume Database and Information Service (ILDIS). It provides access to almost 200 families, corresponding to about 95% of the European flora of vascular plants.

The Traditional Chinese Medicine Integrated Database (TCMID) (Table 1.2) offers information that bridges the gap between Traditional Chinese Medicine (TCM) and modern life sciences. Information on all aspects of TCM, including formulae, herbs, and herbal ingredients, is available in this database. Additionally, information on drugs and diseases that have been studied thoroughly by modern pharmacology and biomedical science is also available. It is a useful database, especially for those working with Chinese medicinal plants. This database can be complemented by a relatively new database, Herb Ingredients’ Target (HIT; http://lifecenter.sgst.cn/hit/), introduced by Ye et al. (2011), which is a comprehensive and fully curated database of herb ingredients’ targets.

The Raintree database (Table 1.2) was established in 1996 to provide accurate and fact-based information on plants from the Amazon Rainforest.
In this small database, the individual plant database files are linked through various menus and pages. Each plant database file contains taxonomy data, phytochemical information, ethnobotanical data, uses in traditional medicine systems, and clinical research. The Medicinal Plants Genomics Resource database (Table 1.2) is a relatively small database, but can be used, in conjunction with other databases, to retrieve useful information on plants. Sharing traditional knowledge and ethnomedicinal data could potentially lead to refinement of scientific approaches to phytochemical studies and conservation of endangered medicinal plants. However, to achieve this ideal situation, comprehensive integration of traditional knowledge and ethnomedicinal data, which are currently available in heterogeneous formats, with scientific data on a single electronic platform is essential. Various attempts have been made to integrate ethnomedicinal data from plants, based on a relational data model (RDM) as well as a few flat-file approaches. Earlier approaches to storing ethnopharmacological data in an RDM have certain limitations. Although an RDM is commonly used to store and query data in various forms, it requires the exact field structure of the data to be designed in advance for normalization (Ningthoujam et al., 2014). To address this issue, a flexible ‘Not only

An Introduction to Computational Phytochemistry  Chapter | 1  29

Structured Query Language (NoSQL)’ data model that could integrate both primary and curated ethnomedicinal plant data from multiple sources has recently been reported (Ningthoujam et al., 2014). The proposed model was based on MongoDB, a document-oriented NoSQL database developed in C++ that stores data as field and value pairs. In this model, data are stored in binary JavaScript object notation (BSON) format, and all documents are stored in collections having a shared common index. Appropriate modifications were made by Ningthoujam et al. (2014) to allow the model to incorporate both standard and customized ethnomedicinal plant data formats from different sources. The proposed model can scale to a high level of complexity as the database matures and is applicable to storing, retrieving, and sharing ethnomedicinal plant data. Chapter 2 covers the computational and mathematical aspects of predicting medicinal properties and selecting plant materials.
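The flexibility of a document-oriented model can be illustrated with a minimal sketch in plain Python that mimics field-and-value documents held in a collection, without requiring a running MongoDB server. The field names, the two records, the placeholder use labels, and the helper `find` are hypothetical illustrations and are not taken from Ningthoujam et al. (2014).

```python
import json

# Two hypothetical records, as if they came from sources with different
# field layouts. A document model accepts both without a pre-designed
# table schema. All values are placeholders, not real ethnobotanical data.
collection = [
    {
        "_id": 1,
        "species": "Ficus racemosa",
        "part_used": "leaves",
        "uses": ["example use A"],
        "source": "primary survey",
    },
    {
        "_id": 2,
        "species": "Zingiber officinale",
        "local_names": {"en": "ginger"},  # nested field absent from record 1
        "uses": ["example use A", "example use B"],
        "source": "curated literature",
    },
]


def find(docs, **criteria):
    """Return the documents whose fields match every criterion.

    A list-valued field matches when it contains the requested value,
    loosely imitating how document stores match array fields.
    """
    results = []
    for doc in docs:
        matched = True
        for field, wanted in criteria.items():
            value = doc.get(field)  # a missing field simply yields None
            if isinstance(value, list):
                matched = wanted in value
            else:
                matched = value == wanted
            if not matched:
                break
        if matched:
            results.append(doc)
    return results


# Documents serialize naturally to JSON; BSON is a binary encoding of
# the same field-and-value structure.
print(json.dumps(find(collection, uses="example use B"), indent=2))
```

The point of the sketch is that record 2 carries a nested `local_names` field and lacks `part_used`, yet both records sit in the same collection and answer the same query, which a normalized relational table would not allow without redesigning its columns.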

1.3.7  Response Surface Methodology in Optimization of Extraction of Phytochemicals

Chapter 3 explores various computational and mathematical models applicable to the extraction of phytochemicals. The choice of an appropriate extraction method is crucial for obtaining the desired compounds in good quantities from a plant matrix. It is even more important to choose the extraction parameters correctly to maximize the extraction output of the desired compounds. Thus, optimization of the parameters that drive an extraction process is fundamental to phytochemical research. Previously, optimization was achieved by trial and error through tedious processes that examined one factor at a time; now, by applying various computational and mathematical models, extraction can be optimized fairly quickly to give the maximum yield of an extract. Response surface methodology (RSM) is one such model. RSM, introduced by Box and Wilson in 1951, is a collection of mathematical and statistical techniques for empirical model building, aimed at the careful design of experiments to obtain an optimal response (output variable) that is influenced by several independent variables (input variables) (Box and Wilson, 1951). This methodology can be applied to maximize the production of a target substance by optimizing operational variables (Bezerra et al., 2008). Optimization means improving the performance of a system, a process, or a product in order to obtain the maximum benefit from it, by discovering the conditions under which a procedure produces the best possible response. The major RSM designs are the central composite design (CCD, also known as the Box–Wilson design), the Box–Behnken design (BBD), the Doehlert design, and the three-level full factorial design (Das et al., 2014). Das et al. (2013) described a detailed experimental design using response surface methodology to extract a known triterpenoid, lupeol (Fig. 1.17), efficiently from the leaves of Ficus racemosa by the microwave-assisted extraction

FIG. 1.17  Structure of lupeol, a triterpene from Ficus racemosa.

technique. In this study, RSM was applied in two stages: first, a Plackett–Burman design was used to identify the significant variables for the extraction of lupeol, and then those significant variables were optimized using a Box–Behnken design. The Plackett–Burman experimental design, developed in 1946 by statisticians Robin L. Plackett and J. P. Burman, is generally used to identify the most important factors early in the experimentation phase, when complete knowledge about the system is usually unavailable. It is an efficient screening method that identifies the active factors using as few experimental runs as possible. RSM has also been applied to study the influence of hydrothermal processing (blanching), especially the effect of processing time and temperature, on the phytochemical content, texture, and colour of the semi-dried edible Irish brown seaweed Himanthalia elongata. A central composite design was used with two factors, a hydrothermal processing time of 10–30 min and a temperature of 60–90°C. The predicted models were effective for total phenolic content, free-radical-scavenging activity, flavonoid content, total condensed tannins, texture, and colour. The predicted values for each of the responses were found to be in good agreement with the experimental values. In this study, the central composite design was applied using STATGRAPHICS Centurion XV (StatPoint Technologies Inc., Warrenton, VA, USA; http://en.freedownloadmanager.org/Windows-PC/STATGRAPHICS-Centurion.html). Altemimi et al. (2015) applied RSM to optimize the ultrasound-assisted extraction of lutein and β-carotene (Fig. 1.18) from spinach by manipulating three independent variables, i.e., extraction temperature, extraction power, and extraction time. A combination of TLC, densitometry, and a Box–Behnken design with RSM was employed. 
Design-Expert software (version 9) was used to analyse the experimental results of the response surface design (Stat-Ease Inc., Minneapolis, MN, USA; https://www.statease.com/software/dx10-trial.html). Several publications describe the use of RSM to optimize extraction from various plant materials (Dashtianeh et al., 2013; Patil et al., 2014; Turkyilmaz et al., 2014; Alam et al., 2015; Ghasemzadeh et al., 2015; Anne and Nithyannandam, 2016; Mašković et al., 2016; Tomaz et al., 2016).
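The core calculation behind such studies can be sketched in a few lines of standard-library Python: a second-order polynomial, the model commonly fitted in RSM, is estimated by least squares from a two-factor, face-centred central composite design, and its stationary point is then located. This is only an illustrative sketch. The responses are simulated from an assumed polynomial (`TRUE_B`), not measured data, and the decoding to 10–30 min and 60–90°C merely borrows the factor ranges quoted above; dedicated packages such as those cited in the text do far more (ANOVA, lack-of-fit tests, diagnostics).

```python
def solve(a, b):
    """Solve the linear system a.x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def model(b, x1, x2):
    """Second-order model: b0 + b1*x1 + b2*x2 + b11*x1^2 + b22*x2^2 + b12*x1*x2."""
    return b[0] + b[1] * x1 + b[2] * x2 + b[3] * x1 * x1 + b[4] * x2 * x2 + b[5] * x1 * x2


def fit_quadratic(points, y):
    """Least-squares estimate of the six coefficients via the normal equations."""
    rows = [[1.0, x1, x2, x1 * x1, x2 * x2, x1 * x2] for x1, x2 in points]
    p = 6
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def stationary_point(b):
    """Coded factor levels at which both partial derivatives are zero."""
    _, b1, b2, b11, b22, b12 = b
    return solve([[2 * b11, b12], [b12, 2 * b22]], [-b1, -b2])


# Face-centred CCD in coded units: 4 factorial, 4 axial, 3 centre runs.
design = [(-1, -1), (1, -1), (-1, 1), (1, 1),
          (-1, 0), (1, 0), (0, -1), (0, 1),
          (0, 0), (0, 0), (0, 0)]

# Hypothetical "true" surface used only to simulate responses (e.g. % yield).
TRUE_B = [80.0, 3.0, 2.0, -4.0, -2.0, 1.0]
responses = [model(TRUE_B, x1, x2) for x1, x2 in design]

coeffs = fit_quadratic(design, responses)
x_opt = stationary_point(coeffs)

# The stationary point is a maximum when the quadratic part is negative definite.
is_maximum = coeffs[3] < 0 and 4 * coeffs[3] * coeffs[4] - coeffs[5] ** 2 > 0

# Decode coded levels to natural units (assumed ranges: 10-30 min, 60-90 degrees C).
time_min = 20 + 10 * x_opt[0]
temp_c = 75 + 15 * x_opt[1]
print(f"optimum: {time_min:.1f} min, {temp_c:.1f} C, maximum={is_maximum}")
```

Because the simulated responses contain no noise, the fit recovers the assumed coefficients; with real measurements the same code would return estimates whose quality depends on experimental error and replication.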

FIG. 1.18  Structure of lutein and β-carotene.

1.3.8  Computation in Isolation of Phytochemicals

Chapter 4 covers the application of computational methods in the isolation of plant secondary metabolites. Before the advent of modern chromatographic techniques, e.g., gas chromatography (GC) and high-performance liquid chromatography (HPLC), and the phenomenal advancement in computational and mathematical modelling applications, most phytochemicals were isolated from plant extracts using classical preparative and analytical thin layer chromatography (TLC), column chromatography (CC), vacuum liquid chromatography (VLC), manual flash chromatography (FC), solid-phase extraction, and solvent partitioning, none of which involved any computational techniques. Even automated fraction collection in CC was merely electrical rather than electronic. Nowadays, most isolation protocols for phytochemicals involve preparative flash or HPLC methods (Sarker and Nahar, 2012), which can be controlled and automated to various levels by PC-based software. Software provided by the manufacturers can control every stage of the operation of a preparative flash or HPLC protocol, from sample injection (using an autosampler), through setting up and running the mobile phase (isocratic or gradient), to collection of samples (using an automated fraction collector). Beyond isolation, flash or HPLC software also allows automated analysis of various peak parameters. Some common preparative flash and/or HPLC manufacturers and their software are presented in Table 1.3.
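Two of the tasks such control software automates, running a mobile-phase programme and triggering fraction collection, can be sketched as follows. This is a simplified illustration under stated assumptions and does not reproduce the interface of any vendor package: the function names, the gradient table, the detector trace, and the threshold are all hypothetical, and the gradient times are assumed to be strictly increasing.

```python
def percent_b(t, program):
    """Percentage of solvent B at time t (min), by linear interpolation.

    program is a list of (time_min, percent_B) points with strictly
    increasing times; an isocratic run is simply a flat program.
    """
    if t <= program[0][0]:
        return program[0][1]
    for (t0, b0), (t1, b1) in zip(program, program[1:]):
        if t0 <= t <= t1:
            return b0 + (b1 - b0) * (t - t0) / (t1 - t0)
    return program[-1][1]  # hold the final composition after the last point


def collect_fractions(trace, threshold):
    """Return (start, end) time windows in which the signal is >= threshold.

    trace is a list of (time_min, detector_signal) readings.
    """
    windows = []
    start = None
    for t, signal in trace:
        if signal >= threshold and start is None:
            start = t                      # peak begins: open a fraction
        elif signal < threshold and start is not None:
            windows.append((start, t))     # peak ends: close the fraction
            start = None
    if start is not None:                  # run ended while still collecting
        windows.append((start, trace[-1][0]))
    return windows


# Hypothetical gradient: 10% B to 90% B over 20 min, then hold for 5 min.
gradient = [(0, 10), (20, 90), (25, 90)]
isocratic = [(0, 40), (30, 40)]

# Hypothetical detector readings (arbitrary units).
trace = [(0.0, 2), (0.5, 3), (1.0, 40), (1.5, 120),
         (2.0, 35), (2.5, 4), (3.0, 60), (3.5, 5)]

print(percent_b(10, gradient), percent_b(15, isocratic))
print(collect_fractions(trace, threshold=30))
```

Commercial systems add many refinements, such as slope-based peak detection and delay-volume correction between detector and collector, which this sketch leaves out.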

1.3.9  Miscellaneous

Castellano’s group classified various classes of phytochemicals, e.g., flavonoids, stilbenes, and terpenoids, using various mathematical models, e.g., entropy of information theory, chemical structural indicators, and entropy of artificial intelligence (Castellano et al., 2012, 2013, 2014). Cape et al. (2006) described the computation of the redox and protonation properties of quinones aiming at

TABLE 1.3  Preparative HPLC (and/or Flash) Manufacturers and Associated Software

Preparative System | Type | Manufacturer | Software
Agilent 1200 | HPLC | Agilent Technologies (www.agilent.com/chem/store) | ChemStation 4.01
AZURA Lab Prep LC 50 | HPLC | Knauer/Azura (http://www.knauer.net/systemssolutions/preparative-hplc/lab.html) | Chromeleon; ClarityChrom; Mobile Control
Dionex UltiMate 3000 | HPLC | Dionex—ThermoFisher (https://www.thermofisher.com/uk/en/home.html) | Chromeleon
LC-4000 | HPLC | Jasco, 18 Oak Industrial Park, Great Dunmow, Essex CM6 1XN, United Kingdom (http://www.jasco.co.uk/dunmow_map.php) | ChromNAV 2.0
LC-20AT, LC-20AR and LC-20AP |  | Shimadzu (http://www.shimadzu.com/an/index.html) | LabSolutions LC/GC
PuriFlash | Flash and HPLC | Interchim, 1536 West 25th Street, Suite 452, Los Angeles, CA 90732 | InterSoft 5.0
Sepacore flash system ×10/×50 | Flash | Buchi (https://www.buchi.com/gb-en) | SepacoreControl

prediction of redox cycling natural products. Redox cycling refers to repetitively coupled reduction and oxidation reactions, often involving oxygen and reactive oxygen species. The Automatic Interaction Detection (AID) algorithm (Cairns, 1981) and Artificial Neural Network (ANN) models were applied as alternatives to PCA, factor analysis, and other multivariate analytical techniques in order to identify the relevant phytochemical constituents for characterization and authentication of tomatoes (Suarez et al., 2015). AID analysis is a sequentially repeated one-way ANOVA. At each step, the algorithm identifies the variable that best divides the group. The fundamental advantage of the AID algorithm is that it avoids an arbitrary specification of output and provides an empirical technique that is particularly effective when dealing with a large volume of micro-data (Cairns, 1981). The partition among categories must maximize the

inter-group variance and minimize the intra-group variance. However, as the data of this study were affected by three factors (harvest date, agricultural practice, and cultivar), the original concept was adjusted accordingly. In recent years, with significant progress in metabolic engineering and synthetic biology, the biosynthesis of phytochemicals has often been studied using various computational techniques involving microbial production platforms (Lin et al., 2013, 2014; Wang et al., 2014; Jones et al., 2016; Pandey et al., 2016) (see Chapter 9). Escherichia coli (because of its fast growth and ease of genetic manipulation) and Saccharomyces cerevisiae (because of its Generally Recognized As Safe (GRAS) status and its ability to functionally express plant metabolic enzymes) are the two most commonly used platform organisms for de novo or semi-de novo production of almost all kinds of polyphenolic phytochemicals. Computational modelling is generally used to understand the interaction between transcription factors and the cell transcription machinery and to predict overexpression targets to improve yields. These biosynthesis studies are mainly aimed at renewable and enhanced production of high-value phytochemicals (Mora-Pale et al., 2014; Valdiani et al., 2014; Wang et al., 2014). An excellent example of such efforts is the use of an Escherichia coli co-culture for the efficient and significantly improved production of flavonoids in vivo (Jones et al., 2016). Computational tools and mathematical modelling were employed to optimize factors such as strain compatibility, carbon source, temperature, induction point, and inoculation ratio using MATLAB (http://download.cnet.com/Matlab/3000-2053_4-43768.html). An empirical model function, a four-dimensional scaled-Gaussian model based on the initial optimization data, was then implemented to predict the optimum point for the system. Pandey et al. (2016) have reviewed various aspects of microbial production of natural and non-natural flavonoids and highlighted the application of computational tools and mathematical modelling, particularly for cofactor or precursor enhancement and strain design in the biosynthesis process. Generally, computational strain design protocols deal with system-wide identification of intervention strategies for the enhanced production of biochemicals in microorganisms. Existing approaches that rely solely on stoichiometry and rudimentary constraint-based regulation overlook the effects of metabolite concentrations and substrate-level enzyme regulation when identifying metabolic interventions. Computational strain design methods can be divided into two main families: one based on the concept of flux balance analysis and the other on the concept of elementary mode analysis (Machado and Herrgard, 2015). Several computational tools, including algorithms, are available for studying the biosynthesis and production of phytochemicals, especially in microbial platforms. For example, BeReTa (Beneficial Regulator Targeting) is an algorithm for prioritizing transcriptional regulator (TR) manipulation targets that uses unintegrated network models; it identifies TR manipulation targets by evaluating regulatory strengths of interactions and beneficial effects of

reactions, and subsequently assigning beneficial scores for the TRs. This algorithm can predict both known and novel TR manipulation targets for enhanced production of various chemicals in E. coli. Biopathway Predictor is another example of a computational tool implemented in Genomatica’s SimPheny platform for detailing and evaluating networks of enzyme-catalysed reactions, with the goal of identifying novel pathways for producing a chemical of interest. Reactions in a BioPathway network are generated by a defined set of reaction rules that operate repeatedly on a specified starting material and its derivatives.
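The idea of generating a reaction network by applying a fixed set of rules repeatedly to a starting material and its derivatives can be sketched with a breadth-first expansion. The sketch below is a toy under loud assumptions: compounds are plain strings, each hypothetical rule rewrites one functional-group substring, and the rule set and starting material are invented for illustration. It is not the rule set or algorithm of Biopathway Predictor, which works on enzyme-catalysed transformations of molecular structures.

```python
from collections import deque

# Hypothetical rules: (rule name, group consumed, group produced).
RULES = [
    ("alcohol oxidation", "CH2OH", "CHO"),
    ("aldehyde oxidation", "CHO", "COOH"),
    ("aldehyde reduction", "CHO", "CH2OH"),
    ("methyl esterification", "COOH", "COOCH3"),
]


def expand(start, rules, max_steps):
    """Apply every rule to every compound, generation by generation.

    Returns a dict mapping each compound reached to (substrate, rule name),
    i.e. the first reaction found that produces it; the start maps to None.
    """
    parent = {start: None}
    queue = deque([(start, 0)])
    while queue:
        compound, depth = queue.popleft()
        if depth == max_steps:
            continue  # do not expand beyond the requested number of steps
        for name, group, product_group in rules:
            if group in compound:
                product = compound.replace(group, product_group, 1)
                if product not in parent:  # skip compounds already generated
                    parent[product] = (compound, name)
                    queue.append((product, depth + 1))
    return parent


def pathway(parent, target):
    """Trace the reactions leading from the start to target, or None."""
    if target not in parent:
        return None
    steps = []
    node = target
    while parent[node] is not None:
        substrate, rule = parent[node]
        steps.append((substrate, rule, node))
        node = substrate
    return list(reversed(steps))


network = expand("R-CH2OH", RULES, max_steps=3)
for substrate, rule, product in pathway(network, "R-COOCH3"):
    print(f"{substrate} --[{rule}]--> {product}")
```

Because the search is breadth-first, the pathway returned to any compound uses the fewest rule applications. Ranking candidate pathways by yield, thermodynamics, or enzyme availability, which is the evaluation step described above, is outside the scope of this sketch.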

1.4. CONCLUSIONS

Computational Phytochemistry is a rapidly evolving field of research that integrates phytochemistry with computational (and mathematical) tools. It spans all avenues of phytochemical analysis in which appropriate computational and mathematical approaches and tools can be employed effectively to enhance the quality of output, increase precision and reproducibility, and reduce experimental time and cost. Computational Phytochemistry has already started changing the way we perform phytochemical research and is expected to have an even greater influence on phytochemical research in the years to come.

REFERENCES

Abe, H., Yamasaki, T., Fujiwara, I., Sasaki, S., 1981. Computer-aided structure elucidation methods. Anal. Chim. Acta 133, 499–506.
Ahmed, B., Ashfaq, U.A., Mirza, M.U., 2017. Medicinal plant phytochemicals and their inhibitory activities against pancreatic lipase: molecular docking combined with molecular dynamics simulation approach. Nat. Prod. Res., 1–7. https://doi.org/10.1080/14786419.2017.1320786.
Alam, M.S., Damanhouri, Z.A., Ahmad, A., Abidin, L., Amir, M., Aqil, M., Khan, S.A., Mujeeb, M., 2015. Development of response surface methodology for optimization of extraction parameters and quantitative estimation of embelin from Embelia ribes Burm by high performance liquid chromatography. Pharmacogn. Mag. 11, 166–172.
Al-Sehemi, A.G., Irfan, A., Aljubiri, S.M., Shaker, K.H., 2016. Density functional theory investigations of radical scavenging activity of 3′-methyl-quercetin. J. Saudi Chem. Soc. 20, S21–S28.
Altemimi, A., Lightfoot, A.A., Kinsel, M., Watson, D., 2015. Employing response surface methodology for the optimization of ultrasound assisted extraction of lutein and β-carotene from spinach. Molecules 20, 6611–6625.
Anne, R., Nithyannandam, R., 2016. Optimization of extraction of bioactive compounds from medicinal herbs using response surface methodology. Int. Proc. Chem. Biol. Environ. Eng. 99, 76–85.
Azevedo, L., Faqueti, L., Kritsanida, M., Efstathiou, A., Smirlis, D., Franchi Jr, G.C., Genta-Jouve, G., Michel, S., Sandjo, L.P., Grougnet, R., Biavatti, M.W., 2016. Three new trixane glycosides obtained from the leaves of Jungia sellowii Less. using centrifugal partition chromatography. Beilstein J. Org. Chem. 12, 674–683.
Babahedari, A.K., Faraki, S., Soureshjani, E.H., 2014. A comparative molecular docking study of Lavandula angustifolia Mill’s compounds along diazepam and amobarbital with GABAA receptor. Int. J. Adv. Chem. Eng. Biol. Sci. 1, https://doi.org/10.15242/IJACEBS.C1113048.

Bansal, A., Chhabra, V., Rawal, R.K., Sharma, S., 2014. Chemometrics: a new scenario in herbal drug standardization. J. Pharm. Anal. 4, 223–233.
Barlow, D.J., Buriani, A., Ehrman, T., Bosisio, E., Eberini, I., Hylands, P.J., 2012. In silico studies in Chinese herbal medicines’ research: evaluation of in silico methodologies and phytochemical data sources, and a review of research to date. J. Ethnopharmacol. 140, 526–534.
Bartel, J., Krumsiek, J., Theis, F.J., 2013. Statistical methods for the analysis of the high-throughput metabolomics data. Comput. Struct. Biotechnol. J. 4, e201301009.
Bezerra, M.A., Santelli, R.E., Oliveira, E.P., Villar, L.S., Escaleira, L.A., 2008. Response surface methodology as a tool for optimization in analytical chemistry. Talanta 76, 965–977.
Booker, A., Suter, A., Krjic, A., Strassel, B., Zloh, M., Said, M., Heinrich, M., 2014. A phytochemical comparison of saw palmetto products using gas chromatography and 1H nuclear magnetic resonance spectroscopy metabolomics profiling. J. Pharm. Pharmacol. 66, 811–822.
Box, G.E.P., Wilson, K.B., 1951. On the experimental attainment of optimum conditions (with discussion). J. Royal Statist. Soc. Ser. B 13, 1–45.
Brasil, D.S.B., Muller, A.H., Guilhon, G.M.S.P., Alves, C.N., Peris, G., Llusar, R., Moliner, V., 2010. Isolation, X-ray crystal structure and theoretical calculations of the new compound 8-epicordatin and identification of others terpenes and steroids from the bark and leaves of Croton palanostigma Klotzsch. J. Braz. Chem. Soc. 21, 731–739.
Brenton, R.G., 2003. Chemometrics: Data Analysis for the Laboratory and Chemical Plant. John Wiley and Sons, Chichester.
Bushkov, N.A., Veselov, M.S., Chuprov-Netochin, R.N., Marusich, E.I., Majouga, A.G., Volynchuk, P.B., Shumilina, D.V., Leonov, S.V., Ivanekov, Y.A., 2016. Computational insight into the chemical space of plant growth regulators. Phytochemistry 122, 254–264.
Cairns, M.B., 1981. The automatic interaction detector algorithm and the measurement of transport output. J. Transport Econ. Policy 15, 277–282.
Cape, J.L., Bowman, M.K., Kramer, D.M., 2006. Computation of the redox and protonation properties of quinones: towards the prediction of redox cycling natural products. Phytochemistry 67, 1781–1788.
Case, D.A., Berryman, J.T., Betz, R.M., Cerutti, D.S., Cheatham III, Y.E., Darden, T.A., Duke, R.E., Giese, T.J., Gohlke, H., Goetz, A.W., Homeyer, N., Izadi, S., Janowski, P., Kaus, J., Kovalenko, A., Lee, T.S., LeGrand, S., Li, P., Luchko, T., Luo, R., Madei, B., Merz, K.M., Monard, G., Needham, P., Nguyen, H., Nguyen, H.T., Omelyan, I., Onufriev, A., Roe, D.R., Roitberg, A., Salomon-Ferrer, R., Simmerling, C.L., Smith, W., Swails, J., Walker, R.C., Wang, J., Wolf, R.M., Wu, X., Yourk, D.M., Kollman, P.A., 2015. Amber 2015. University of California, San Francisco.
Castellano, G., Tena, J., Torrens, F., 2012. Classification of polyphenolic compounds by chemical structural indicators and its relation to antioxidant properties of Posidonia oceanica (L.) Delile. MATCH Commun. Math. Comput. Chem. 67, 231–250.
Castellano, G., Gonzalez-Santander, J.L., Lara, A., Torrens, F., 2013. Classification of flavonoid compounds by using entropy of information theory. Phytochemistry 93, 182–191.
Castellano, G., Lara, A., Torrens, F., 2014. Classification of stilbenoid compounds by entropy of artificial intelligence. Phytochemistry 97, 62–69.
Constantin, M.B., Ferreira, M.J., Rodriguez, G.V., Emerenciano, V.P., 2010. Computer-aided structure elucidation of neolignans. Nat. Prod. Commun. 5, 755–762.
Cuca-Suarez, L.E., Ramos, C.A., Coy-Barrera, E.D., 2013. DFT molecular modelling of novel cadinane sesquiterpenes isolated from Nectandra amazonum. Planta Med. 79, PG1. https://doi.org/10.1055/s-0033-1352071.
Da Costa, F.B., Terfloth, L., Gasteiger, J., 2005. Sesquiterpene lactone-based classification of three Asteraceae tribes: a study based on self-organizing neural networks applied to chemosystematics. Phytochemistry 66, 345–353.

Das, A.K., Mandal, V., Mandal, S.C., 2013. Design of experiment approach for the process optimisation of microwave assisted extraction of lupeol from Ficus racemosa leaves using response surface methodology. Phytochem. Anal. 24, 230–247.
Das, A.K., Mandal, V., Mandal, S.C., 2014. A brief understanding of process optimisation in microwave-assisted extraction of botanical materials: options and opportunities with chemometric tools. Phytochem. Anal. 25, 1–12.
Das, S., Laskar, M.A., Sarker, S.D., Choudhury, M.D., Choudhury, P.R., Mitra, A., Jamil, S., Lathiff, S.M.A., Abdullah, S.A., Basar, N., Nahar, L., Talukdar, A.D., 2017. Prediction of anti-Alzheimer’s activity of flavonoids targeting acetylcholinesterase in silico. Phytochem. Anal. 28, 324–331. published online, https://doi.org/10.1002/pca.2679.
Dashtianeh, M., Vatanara, A., Fatemi, S., Sefidkon, F., 2013. Optimization of supercritical extraction of Pimpinella affinis Ledeb. using response surface methodology. J. CO2 Util. 3-4, 1–6.
De Falco, B., Incerti, G., Pepe, R., Amato, M., Lanzotti, V., 2016. Metabolomic fingerprinting of Romaneschi globe artichokes by NMR spectroscopy and multivariate data analysis. Phytochem. Anal. 27, 304–314.
Desai, N.S., Gore, M., 2011. Computer aided drug designing using phytochemicals—Bacoside A3 and myricetin and nitric oxide donors-S-nitroso-N-acetylpenicillamine and nitroglycerin as a potential treatment of pancreatic cancer. J. Comput. Sci. Syst. Biol. 5, 1–8.
Dey, P., Dutta, S., Chaudhuri, T.K., 2015. Comparative phytochemical profiling of Clerodendrum infortunatum L. using GC-MS method coupled with multivariate statistical approaches. Metabol. Open Access 5, 147–157.
Donno, D., Boggia, R., Zunin, P., Cerutti, A.K., Guido, M., Mellano, M.G., Prgomet, Z., Beccaro, G.L., 2016. Phytochemical fingerprint and chemometrics for natural food preparation pattern recognition: an innovative technique in food supplement quality control. J. Food Sci. Technol. 53, 1071–1083.
Ebrahimi, S., Farimani, M.M., Mirzania, F., Hamburger, M., 2013. New sesterterpenoids from Salvia mirzayanii—stereochemical characterization by computational electronic circular dichroism. Planta Med. 79, PG2.
Ehrman, T.M., Barlow, D.J., Haylands, P.J., 2007a. Phytochemical informatics of traditional Chinese medicine and therapeutic relevance. J. Chem. Inf. Model. 47, 254–263.
Ehrman, T.M., Barlow, D.J., Haylands, P.J., 2007b. Phytochemical databases of Chinese herbal constituents and bioactive plant compounds with known target specificities. J. Chem. Inf. Model. 47, 2316–2334.
Ehrman, T.M., Barlow, D.J., Haylands, P.J., 2010. Phytochemical informatics and virtual screening of herbs used in Chinese medicine. Curr. Pharm. Des. 16, 1785–1798.
Elyashberg, M., Williams, A.J., Martin, G.E., 2008. Computer-assisted structure verification and elucidation tools in NMR-based structure elucidation. Prog. NMR Spectrosc. 53, 1–104.
Elyashberg, M., Blinov, K., Molodtsov, S., Smurnyy, Y., Williamn, A.J., Chauranova, T., 2009. Computer-assisted methods for structure elucidation: realizing a spectroscopist’s dream. J. Chemoinform. 1, 3. https://doi.org/10.1186/1758-2946-1-3.
Emerenciano, V.P., Barbosa, K.O., Scotti, M.T., Ferreira, M.J.P., 2007. Self-organizing maps in chemotaxonomic studies of Asteraceae: a classification of tribes using flavonoid data. J. Braz. Chem. Soc. 18, 891–899.
Evidente, M., Santoro, E., Petrovic, A.G., Cimmino, A., Koshoubu, J., Evidente, A., Berova, N., Superchi, S., 2016. Absolute configurations of phytotoxic inuloxins B and C based on experimental and computational analysis of chiroptical properties. Phytochemistry 130, 328–334.
Farooq, U., Ayub, K., Hashmi, M.A., Sarwar, R., Khan, A., Ali, M., Ahmad, M., Khan, A., 2015. A new rosane-type diterpenoid from Stachys parviflora and its density function theory studies. Nat. Prod. Res. 29, 813–819.

Fazl-i-Sattar, Ullah, Z., Ata-ur-Rahman, Rauf, A., Tariq, M., Tahir, A.A., Ayub, K., Ullah, H., 2015. Phytochemical, spectroscopic and density functional theory study of diospyrin, and non-bonding interactions of diospyrin with atmospheric gases. Spectrochim. Acta A Mol. Biomol. Spectrosc. 141, 71–79.
Frederich, M., Jansen, C., Tullio, P.D., Tits, M., Demoulin, V., Angenot, L., 2010. Metabolomic analysis of Echinacea spp. by 1H nuclear magnetic resonance spectrometry and multivariate data analysis technique. Phytochem. Anal. 21, 61–65.
Fujiwara, I., Okuyama, T., Yamasaki, T., Abe, H., Sasaki, S., 1981. Computer-aided structure elucidation of organic compounds with the chemics system: removal of redundant candidates by 13C NMR prediction. Anal. Chim. Acta 133, 527–533.
Fukui, K., Yonezawa, T., Shingu, H., 1952. A molecular orbital theory of reactivity in aromatic hydrocarbons. J. Chem. Phys. 20, 722–725.
Gad, H.A., El-Ahmady, S.H., Abou-Shoer, M.I., Al-Azizi, M.M., 2012. Application of chemometrics in authentication of herbal medicines: a review. Phytochem. Anal. 24, 1–24.
Ghasemzadeh, A., Jaafar, H.Z.E., Rahmat, A., 2015. Optimization protocol for the extraction of 6-gingerol and 6-shogaol from Zingiber officinale Var. rubrum Theilade and improving antioxidant and anticancer activity using response surface methodology. BMC Complement. Altern. Med. 15, 258–268.
Glickman, J.F., Schmid, A., Ferrand, S., 2008. Scintillation proximity assays in high-throughput screening. Assay Drug Develop. Technol. 6, 433–455.
Gopalakrishnan, S.B., Kalaiarasi, T., Subramanian, R., 2014. Comparative DFT study of phytochemical constituents of the fruits of Cucumis trigonus Roxb. and Cucumis sativus Linn. J. Comput. Methods Phys. https://doi.org/10.1155/2014/623235.
Harder, D., Fotiadis, D., 2012. Measuring substrate binding and affinity of purified membrane transport proteins using the scintillation proximity assay. Nat. Protoc. 7, 1569–1578.
Honmura, Y., Takekawa, H., Tanaka, K., Maeda, H., Nehira, T., Hehre, W., 2015. Computation-assisted structural elucidation of epoxyroussoeone and epoxyroussoedione isolated from Roussoella japanensis KT1651. J. Nat. Prod. 78, 1505–1510.
Hospital, A., Gorii, J.R., Orozco, M., Gelpi, J.L., 2015. Molecular dynamics simulations: advances and applications. Adv. Appl. Bioinforma. Chem. 8, 37–47.
Jasmine, J.M., Vanaja, R., 2013. In silico analysis of phytochemical compounds for optimizing the inhibitors of HMG CoA reductase. J. Appl. Pharm. Sci. 3, 43–47.
Jeeshna, M.V., Paulsamy, S., 2011. Phytochemistry and bioinformatics approach for the evaluation of medicinal properties of the herb, Exacum bicolor Roxb. Int. Res. J. Pharm. 2, 163–168.
Jollife, I.T., 2002. Principal Component Analysis, second ed. Springer-Verlag, Germany.
Jones, J.A., Vernacchio, V.R., Sinkoe, A.L., Collins, S.M., Ibrahim, M.H.A., Lachance, D.M., Hahn, J., Koffas, M.A.G., 2016. Experimental and computational optimization of an Escherichia coli co-culture for the efficient production of flavonoids. Metab. Eng. 35, 55–63.
Kamel, E.M., Mahmoud, A.M., Ahmed, S.A., Lamsabhi, A.M., 2016. A phytochemical and computational study on flavonoids isolated from Trifolium resupinatum L. and their novel hepatoprotective activity. Food Funct. 7, 2094–2106.
Kavasotto, C.N., Phatak, S.S., 2009. Homology modeling in drug discovery: current trends and applications. Drug Discov. Today 14, 676–683.
Kim, S.-K., Nam, S., Jang, H., Kim, A., Lee, J.-J., 2015. TM-MC: a database of medicinal materials and chemical compounds in northeast Asian traditional medicine. BMC Complement. Altern. Med. 15, 218. https://doi.org/10.1186/s12906-015-0758-5.
Kowalczuk, A.P., Lozak, A., Kiljan, M., Metrak, K., Zjawiony, J.K., 2015. Application of chemometrics for identification of psychoactive plants. Acta Pol. Pharm. Drug Res. 72, 517–525.

Kumar, N., Bansal, A., Sarma, G.S., Rawal, R.K., 2014. Chemometrics tools used in analytical chemistry: an overview. Talanta 123, 186–199.
Lengauer, T., Rarey, M., 1996. Computational methods for biomolecular docking. Curr. Opin. Struct. Biol. 6, 402–406.
Li, X.-N., Zhang, Y., Cai, X.-H., Feng, T., Liu, Y.-P., Li, Y., Ren, J., Zhu, H.-J., Luo, X.-D., 2011a. Psychotripine: a new trimeric pyrroloindoline derivative from Psychotria pilifera. Org. Lett. 13, 5896–5899.
Li, M.-J., Zhang, L.-M., Liu, W.-X., Lu, W.-C., 2011b. DFT study on molecular structures and ROS scavenging mechanisms of novel antioxidants from Lespedeza vigrata. Chin. J. Chem. Phys. 24, 173.
Lin, Y., Sun, X., Yuan, Q., Yan, Y., 2013. Combinatorial biosynthesis of plant-specific coumarins in bacteria. Metab. Eng. 18, 69–77.
Lin, Y., Jain, R., Yan, Y., 2014. Microbial production of antioxidant ingredients via metabolic engineering. Curr. Opin. Biotechnol. 26, 71–78.
Lontsi, T.A., Alembert, T.T., 2017. A density functional theory (DTF) calculations and vibrational analysis of smeathxanthone A. Res. J. Chem. Sci. 7, 6–10.
Machado, D., Herrgard, M.J., 2015. Co-evolution of strain design methods based on flux balance and elementary mode analysis. Metabol. Eng. Commun. 2, 85–92.
Madala, N.E., Piater, L.A., Steenkamp, P.A., Dubery, I.A., 2014. Multivariate statistical models of metabolomics data reveals different metabolite distribution patterns in isonitrosoacetophenone-elicited Nicotiana tabacum and Sorghum bicolor cells. SpringerPlus 3, 254–264.
Mašković, P.Z., Diamanto, L.D., Cvetanović, A., Radojković, M., Spasojević, M.B., Zengin, G., 2016. Optimization of the extraction process of antioxidants from orange using response surface methodology. Food Anal. Methods 9, 1436–1443.
Massart, D.L., Vandeginste, B.G.M., Deming, S.N., Michotte, Y., Kaufmann, L., 1988. Chemometrics—A Textbook. Elsevier, Amsterdam.
Massiot, G., Nuzillard, J.M., 1992. Computer-assisted elucidation of structures of natural products. Phytochem. Anal. 3, 153–159.
Mendoza-Huizar, L.H., Rios-Reyes, C.H., 2011. Chemical reactivity of atrazine employing the Fukui function. J. Mex. Chem. Soc. 55, 142–147.
Miyagi, A., Takahashi, H., Takahar, K., Hirabyashi, T., Nishimura, Y., Tezuka, T., Kawai-Yamada, M., Uchimiya, H., 2010. Principal component and hierarchical clustering analysis of metabolites in destructive weeds; polygonaceous plants. Metabolomics 6, 146–155.
Mocan, A., Zengin, G., Simirgiotis, M., Schafberg, M., Mollica, A., Vodnar, D.C., Crisan, G., Rohn, S., 2017. Functional constituents of wild and cultivated Goji (L. barbarum L.) leaves: phytochemical characterisation, biological profile, and computational studies. J. Enzyme Inhibit. Med. Chem. 32, 153–168.
Mohan, M., James, P., Valsalan, R., Zazeem, P.A., 2015. Molecular docking studies of phytochemicals from Phyllanthus niruri against hepatitis B DNA polymerase. Bioinformation 11, 426–431.
Mora-Pale, M., Sanchez-Rodriguez, S.P., Linhardt, R.J., Dordick, J.S., Koffas, M.A.G., 2014. Biochemical strategies for enhancing the in vivo production of natural products with pharmaceutical potential. Curr. Opin. Biotechnol. 25, 86–94.
Moser, A., Elyashberg, M.E., Williams, A.J., Blinov, K.A., DiMartino, J.C., 2012. Blind trials of computer-assisted structure elucidation software. J. Chemom. 4, 5. https://doi.org/10.1186/1758-2946-4-5.
Muhammad, S.A., Fatima, N., 2015. In silico analysis and molecular docking studies of potential angiotensin-converting enzyme inhibitor using quercetin glycosides. Pharmacogn. Mag. 11, S123–S126.

Muiva-Mutisya, L., Macharia, B., Heydenreich, M., Koch, A., Akala, H.M., Derese, S., Omosa, L.K., Yusuf, A.O., Kamau, E., Yenesew, A., 2014. 6α-Hydroxy-α-toxicarol and (+)-tephrodin with antiplasmodial activities from Tephrosia species. Phytochem. Lett. 10, 179–183.
Munk, M.E., 1998. Computer-based structure determination: then and now. J. Chem. Inf. Model. 38, 997–1009.
Naman, C.B., Li, J., Moser, A., Hendrycks, J.M., Benatrehina, P.A., Chai, H., Yuan, C., Keller, W.J., Kinghorn, A.D., 2015. Computer-assisted structure elucidation of black chokeberry (Aronia melanocarpa) fruit juice isolates with a new fused pentacyclic flavonoid skeleton. Org. Lett. 17, 2988–2991.
Nettles, J.H., Jenkins, J.L., Bender, A., Deng, Z., Davies, J.W., Glick, M., 2006. Bridging chemical and biological space: “target fishing” using 2D and 3D molecular descriptors. J. Med. Chem. 49, 6802–6810.
Ningthoujam, S.S., Choudhury, M.D., Potsangbam, K.S., Chetia, P., Nahar, L., Sarker, S.D., Basar, N., Talukdar, A.D., 2014. NoSQL data model for semi-automatic integration of ethnomedicinal plant data from multiple sources. Phytochem. Anal. 25, 495–507.
Nuzillard, J.-M., Massiot, G., 1991. Computer-aided spectral assignment in nuclear magnetic resonance spectroscopy. Anal. Chim. Acta 242, 37–41.
Ogungbe, I.V., Erwin, W.R., Setzer, W.N., 2014. Antileishmanial phytochemical phenolics: molecular docking to potential protein targets. J. Mol. Graph. Model. 48, 105–117.
Pandey, R., Chandra, P., Srivastava, M., Mishra, D.K., Kumar, B., 2015. Simultaneous quantitative determination of multiple bioactive markers in Ocimum sanctum obtained from different locations and its marketed herbal formulations using UPLC-ESI-MS/MS combined with principal component analysis. Phytochem. Anal. 26, 383–394.
Pandey, R.P., Parajuli, P., Koffas, M.A.G., Sohng, J.K., 2016. Microbial production of natural and non-natural flavonoids: pathway engineering, directed evolution and system/synthetic biology. Biotechnol. Adv. 34, 634–662.
Patil, A.A., Sachin, B.S., Wakte, P.S., Shinde, D.B., 2014. Optimization of supercritical extraction and HPLC identification of wedelolactone from Wedelia calendulacea by orthogonal array design. J. Adv. Res. 5, 629–635.
Pawar, H.A., Kamat, S.R., 2014. Chemometrics and its application in pharmaceutical field. Phys. Chem. Biophys. 4, 169–171.
Powers, C.N., Setzer, W.N., 2015. A molecular docking study of phytochemical estrogen mimics from dietary herbal supplements. In Silico Pharmacol. 3, 1–63.
Preethi, M.P., Sangeetha, U., Pradeepa, D., Valizadeh, M., Kalaiselvi, S., 2014. Principal component analysis and HPTLC fingerprint of in vitro and field grown root extracts of Withania coagulans. Int. J. Pharm. Pharm. Sci. 6, 480–488.
Ravichandran, R., Sundararajan, R., 2017. In silico-based virtual drug screening and molecular docking analysis of phytochemical-derived compounds and FDA approved drugs against BRCA1 receptor. J. Cancer Prevent. Curr. Res. 8, 00268.
Rollinger, J.M., Bodensieck, A., Seger, C., Ellmerer, E.P., Bauer, R., Langer, T., Stuppner, H., 2005. Discovering COX-inhibiting constituents of Morus root bark: activity-guided versus computer-aided methods. Planta Med. 71, 399–405.
Rychnovsky, S.D., 2006. Predicting NMR spectra by computational methods: structure revision of hexacyclinol. Org. Lett. 8, 2895–2898.
Sabeega-Begum, S.B., Subashini, S., Hemalatha, P., Archana, P., Bharathi, N., Nagarajan, P., 2014. In silico screening of phytochemical compounds targeting childhood absence epilepsy (CAE). Int. J. Pharm. Pharm. Sci. 6 (5), 430–433.

Samec, D., Maretic, M., Lugaric, I., Mesic, A., Salopek-Sondi, B., Duralija, B., 2016. Assessment of the differences in the physical, chemical and phytochemical properties of four strawberry cultivars using principal component analysis. Food Chem. 194, 828–834.
Sanghani, H.V., Ganatra, S.H., Pande, R., 2012. Molecular docking studies of potent anticancer agent. J. Comput. Sci. Syst. Biol. 5, 12–15.
Sarker, S.D., Nahar, L., 2012. Natural Products Isolation, third ed. Humana Press/Springer-Verlag, USA.
Sarker, S.D., Nahar, L., 2015. Evidence-based validation of herbal medicine: farm to pharma. In: Mukherjee, P. (Ed.), Applications of High Performance Liquid Chromatography in the Analysis of Herbal Products. Elsevier, USA.
Sarker, S.D., Nahar, L., 2017. Computer-aided phytochemical research. Trends Phytochem. Res. 1, 1–2.
Schaller, R.B., Munk, M.E., Pretsch, E., 1996. Spectra estimation for computer-aided structure determination. J. Chem. Inf. Model. 36, 239–243.
Scotti, M.T., Emerenciano, V., Ferreira, M.J., Scotti, L., Stefani, R., da Silva, M.S., Junior, F.J.B.M., 2012. Self-organizing maps of molecular descriptors for sesquiterpene lactones and their application to the chemotaxonomy of the Asteraceae Family. Molecules 17, 4684–4702.
Setzer, W.N., Ogungbe, I.V., 2012. In silico investigation of antitrypanosomal phytochemicals from Nigerian medicinal plants. PLoS Negl. Trop. Dis. 6, e1727.
Sharaf, M.A., Illman, D.L., Kowalski, B.R., 1986. Chemometrics. Chemical Analysis Series, vol. 82. Wiley, USA.
Sharma, V., Sarkar, N., 2012. Bioinformatics opportunities for identification and study of medicinal plants. Brief. Bioinform. 14, 238–250.
Sholl, D., Steckel, J.A., 2009. Density Functional Theory: A Practical Introduction. Wiley, USA.
Slavova-Kazakova, A.K., Angelova, S.E., Veprintsev, T.L., Denev, P., Fabbri, D., Dettori, M.A., Kratchanova, M., Naumov, V.V., Trofimov, A.V., Vasiev, R.F., Delogu, G., Kancheva, V.D., 2015. Antioxidant potential of curcumin-related compounds studied by chemiluminescence kinetics, chain-breaking efficiencies, scavenging activity (ORAC) and DFT calculations. Beilstein J. Org. Chem. 11, 1398–1411.
Stortz, C.A., Cerezo, A.S., 1992. The 13C NMR spectroscopy of carrageenans: calculation of chemical shifts and computer-aided structural determination. Carbohydr. Polym. 18, 237–242.
Suarez, M.H., Dopazo, G.A., Lopez, D.L., Espinosa, F., 2015. Identification of relevant phytochemical constituents for characterization and authentication of tomatoes by general linear model linked to automatic interaction detection (GLM-AID) and artificial neural network models (ANNs). PLoS One. https://doi.org/10.1371/journal.pone.0128566.
Subramaniam, S., Mehrotra, M., Gupta, D., 2008. Virtual high throughput screening (vHTS): a perspective. Bioinformatics 3, 14–17.
Sumner, L.W., Mendes, P., Dixon, R.A., 2003. Plant metabolomics: large-scale phytochemistry in the functional genomics era. Phytochemistry 62, 817–836.
Todeschini, R., Consonni, V., 2000. Handbook of Molecular Descriptors. Wiley-VCH.
Tomasi, J., Persico, M., 1994. Molecular interactions in solution: an overview of methods based on continuous distributions of the solvent. Chem. Rev. 94, 2027–2094.
Tomaz, I., Maslov, L., Stupic, D., Preiner, D., Asperger, D., Kontic, J.K., 2016. Multi-response optimisation of ultrasound-assisted extraction for recovery of flavonoids from red grape skins using response surface methodology. Phytochem. Anal. 27, 13–22.
Tugizimana, F., Piater, L., Dubery, I., 2013. Plant metabolomics: a new frontier in phytochemical analysis. S. Afr. J. Sci. 109, 1–11.

Turkyilmaz, H., Kartal, T., Yigitarslan Yildiz, S., 2014. Optimization of lead adsorption of mordenite by response surface methodology: characterization and modification. J. Environ. Health Sci. Eng. 12 (5), 1–9.
Ullah, H., Rauf, A., Ullah, Z., Fazl-i-Sattar, Anwar, M., Shah, A.A., Uddin, G., Ayub, K., 2014. Density functional theory and phytochemical study of pistagremic acid. Spectrochim. Acta A Mol. Biomol. Spectrosc. 118, 210–214.
Valdiani, A., Talei, D., Tan, S.G., Kadir, M.A., Maziah, M., Rafii, M.Y., Sagineedu, S.R., 2014. A classical genetic solution to enhance the biosynthesis of anticancer phytochemicals in Andrographis paniculata Nees. PLoS One 9, e87034. https://doi.org/10.1371/journal.pone.0087034.
Varmaghani, Z., Monajjemi, M., Mollaamin, F., 2014. Discovery of active site of vinblastine as application of nanotechnology in medicine. Nanomed. J. 1, 162–170.
Verdonk, M.L., Cole, J.C., Hartshorn, M.J., Murray, C.W., Taylor, R.D., 2003. Improved protein–ligand docking using GOLD. Proteins 52, 609–623.
Viacava, G.E., Roura, S.I., 2015. Principal component and hierarchical cluster analysis to select natural elicitors for enhancing phytochemical content and antioxidant activity of lettuce sprouts. Sci. Hortic. 193, 13–21.
Wale, N., Karypis, G., 2009. Target fishing for chemical compounds using target-ligand activity data and ranking based methods. J. Chem. Inf. Model. 49, 2190–2201.
Wang, X., Cai, W., Zheng, Y., Li, L., Tian, A., 2012. Chin. J. Chem. 30, 727–732.
Wang, J., Guleria, S., Koffas, M.A.G., Yan, Y., 2014. Microbial production of value-added nutraceuticals. Curr. Opin. Biotechnol. 37, 97–104.
Watson, D.G., Peyfoon, E., Zheng, L., Lu, D., Seidel, V., Johnston, B., Parkinson, J.A., Fearnley, J., 2006. Application of principal components analysis to 1H-NMR data obtained from propolis samples of different geographical origin. Phytochem. Anal. 17, 323–331.
Wolinski, K., Hinton, J.F., Pulay, P., 1990. Efficient implementation of the gauge independent atomic orbital method for NMR chemical shift calculations. J. Am. Chem. Soc. 112, 8251–8260.
Worley, B., Powers, R., 2013. Multivariate analysis in metabolomics. Curr. Metabol. 1, 92–107.
Yan, X.-H., Yi, P., Cao, P., Yang, S.-Y., Fang, X., Zhang, Y., Wu, B., Leng, Y., Di, Y.-T., Lv, Y., Hao, X.-J., 2016. 16-nor limonoids from Harrisonia perforata as promising selective 11β-HSD1 inhibitors. Sci. Rep. https://doi.org/10.1038/srep36927.
Ye, H., Ye, L., Kang, H., Zhang, D., Tao, L., Tang, K., Liu, Z., Zhu, R., Liu, Q., Chen, Y.Z., Li, Y., Cao, Z., 2011. HIT: linking herbal active ingredients to targets. Nucleic Acids Res. 39, D1055–D1059.
Zhao, S.-D., Shen, L., Luo, D.-Q., Zhu, H.-J., 2011. Progression of absolute configuration determination in natural product chemistry using optical rotation (dispersion), matrix determinant and electronic circular dichroism methods. Curr. Org. Chem. 15, 1843–1862.


Chapter 2

Prediction of Medicinal Properties Using Mathematical Models and Computation, and Selection of Plant Materials

Sanjoy S. Ningthoujam*, Anupam D. Talukdar*, Satyajit D. Sarker†, Lutfun Nahar†, Manabendra D. Choudhury*

*Assam University Silchar, Cachar, India
†Liverpool John Moores University, Liverpool, United Kingdom

Chapter Outline
2.1. Introduction
2.2. Mathematical Models
2.3. Computational Models in Drug Discovery
2.3.1 Structure-Based CADD
2.3.2 Ligand-Based CADD
2.3.3 Network Pharmacology
2.4. Selection of Medicinal Plants
2.4.1 Ethnobotany-Directed Drug Discovery
2.4.2 Chemotaxonomic and Ecological Approach
2.4.3 Random Approach
2.4.4 Integrated Approach
2.5. Role of Medicinal Plants Databases
2.6. Tools and Techniques
2.7. Role of Data Mining in Medicinal Plant Selection
2.8. Safety Considerations
2.9. Conclusion
References

2.1. INTRODUCTION

The use of plant-based materials dates back to the early periods of human existence, and several ancient records provide formulations and evidence of phytotherapy. The knowledge derived from ancient and traditional systems of medicine has now carried over into the modern pharmaceutical industry. The focus of phytochemical research is often to discover new drugs or drug leads from medicinal plants. One of the important issues in medicinal plant research is the appropriate selection of target plant species that may provide leads to new drugs.

Computational Phytochemistry. https://doi.org/10.1016/B978-0-12-812364-5.00002-X © 2018 Elsevier Inc. All rights reserved.


Throughout the history of drug discovery from plants, serendipity has played a significant role (Kinghorn, 1994; Kubinyi, 1999). For example, the discovery of dicoumarol, following fatal cattle poisoning by Melilotus officinalis, was serendipitous (Kubinyi, 1999), and it led to the development of the well-known anticoagulant warfarin (Fig. 2.1). Conducting research without any working hypothesis may produce such unexpected discoveries, but the chances of success are much slimmer than with a targeted approach. Research on medicinal plants thus requires a thorough knowledge of their various properties, which may point to hitherto unknown medicinal properties. The challenges lie in devising appropriate methods to uncover existing or potential medicinal properties and in selecting the right plants that fulfil these criteria.

A plant is said to be ‘medicinal’ if it possesses medicinal or curative properties against any ailment or group of ailments. The efficacy of herbal medicine or phytotherapy in treating several diseases may sometimes be linked to placebo effects, but often involves active natural products, mostly of low molecular weight, that possess ‘drug-like’ properties. The earliest known investigation of bioactive plants dates back to 3000 BCE, with Egyptian scrolls detailing these plants along with their medicinal properties. The modern study of bioactive compounds isolated from living organisms for therapeutic purposes began around 200 years ago with the isolation of morphine by F.W. Serturner (Schmitz, 1985). Since then, phytochemical discovery has accelerated, driven by tremendous advances in phytochemical methods, medicinal chemistry, and allied disciplines. Plant-based medicines have contributed, and continue to contribute, to the advancement of modern medical treatments and the provision of new drug candidates.
However, progress in medicinal plant research has sometimes been negatively affected by modern technology-based developments in synthetic medicinal chemistry, e.g., combinatorial chemistry, and by the sheer competition from library-based high-throughput screening (HTS) in the modern drug discovery scene. Nevertheless, natural products offer inherent chemical diversity and structural novelty that are hard to match, and for this very reason drug discovery from plants remains one of the main sources of new drugs. One of the bottlenecks in phytochemical drug discovery is probably the arduous protocols; in addition, the overall cost of conventional plant drug discovery methodologies can sometimes be prohibitive for a drug discovery initiative. To mitigate some of these issues, the various stages of plant-based drug discovery programmes require a much smarter approach and the incorporation of new computational approaches coupled with mathematical models.

FIG. 2.1  Dicoumarol from Melilotus officinalis, and the well-known anticoagulant warfarin.

Thousands of structurally diverse bioactive compounds have evolved during plant development and evolutionary processes, some to offer the plant the necessary protection against herbivores and pathogens, and others to serve as signal compounds that facilitate reproduction, or as antioxidants and UV protectants. Isolation and analysis of potential bioactive phytochemicals may include generating a hypothesis about the target receptor for a particular disorder and then screening the in vitro and/or in vivo biological activities of the candidate drug. The major challenge in phytochemistry is to describe and understand the diversity of these molecules, their modes of action, and their natural combinations found in plants (Sarker and Nahar, 2012; Wink, 2015). However, right at the beginning of any plant-based drug discovery programme, probably the most important task is to select, from the vast array of plants available on earth, the medicinal plants that may possess the expected or desired bioactivity.

The success of any drug discovery programme often depends on accurate data on pharmacokinetics and metabolism. The introduction of absorption, distribution, metabolism, excretion, and toxicity screening has improved the success rate of compounds during clinical trials. Pharmacokinetic parameters inform future experiments involving animal models and clinical studies by guiding the selection of dose levels and frequency of administration.
Apart from these, various techniques and approaches have been used to predict the potential medicinal activity of plants. One such attempt is the application of phylogenetic methods and chemotaxonomic understanding to determine the pattern of evolution of various groups of specialized metabolites and to derive a correlation between phylogeny and biosynthetic pathways (Rønsted et al., 2012). There are also attempts to correlate the taste of medicinal plants with their ethnopharmacological activities (Gilca and Barbulescu, 2015). These novel or improvised methods increasingly use mathematical modelling and computational approaches such as regression analysis, data mining, and the analysis of structure-activity relationships (see Chapters 1 and 7). The fundamental aim of this chapter is to present an overview of the methods and processes involved in plant selection that make use of mathematical modelling and computational techniques.

2.2.  MATHEMATICAL MODELS

A mathematical model can be defined as a description of a system using mathematical concepts and language, built to explain the system, to study the effects of its different components, and to make predictions about patterns of behaviour (Abramowitz and Stegun, 1968). The process of constructing a mathematical model is often called mathematical modelling (Press et al., 1987). Mathematical modelling is known by various names, such as predictive modelling, simulation, or decision analysis. A traditional mathematical model comprises four major elements:
1. governing equations;
2. defining equations;
3. constitutive equations; and
4. constraints.

Mathematical models depend on advanced computational tools and can simulate medical outcomes under given parameters. Common methodologies include Markov chains and Monte Carlo simulation. Mathematical modelling can be applied to predict outcomes, and it is particularly helpful when limitations, such as the rarity of an event, prohibit repeating actual studies or expanding research on clinical trials. An innovative use of this technique is the estimation of missing data points. Common strategies replace missing values with a measure of central tendency (e.g., mean or median), but these methods usually have a cut-off criterion for the minimum allowable proportion and have technical limitations in preserving the variance.

Markov chains were used in the 1940s to model nuclear reactions (McKean, 1966). A Markov chain is a series of conditional probabilities in a fixed dependent order. The technique was later generalized from its limited early applications to different disciplines in which a single probability function could not be derived. A Markov process, named after the Russian mathematician Andrey Markov, is a stochastic process that satisfies the Markov property. Put simply, a process satisfies the Markov property if its future can be predicted from its present state alone just as well as it could be from its full history.
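The Markov property described above can be illustrated with a small discrete-time chain. The following is a minimal sketch only: the three states, the transition probabilities, and the function names are hypothetical and are not taken from any study cited in this chapter.

```python
import random

# Hypothetical three-state chain; each row of probabilities sums to 1.
TRANSITIONS = {
    "healthy":   {"healthy": 0.90, "ill": 0.10, "recovered": 0.00},
    "ill":       {"healthy": 0.00, "ill": 0.60, "recovered": 0.40},
    "recovered": {"healthy": 0.50, "ill": 0.00, "recovered": 0.50},
}

def step(state, rng):
    """Draw the next state from the current state only (Markov property)."""
    candidates = list(TRANSITIONS[state])
    weights = [TRANSITIONS[state][s] for s in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

def simulate(start, n_steps, seed=0):
    """Simulate one trajectory of the chain; the history is never consulted."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_steps):
        path.append(step(path[-1], rng))
    return path

def propagate(dist, n_steps):
    """Push a probability distribution over states forward by n_steps."""
    for _ in range(n_steps):
        new = {s: 0.0 for s in TRANSITIONS}
        for s, p in dist.items():
            for t, q in TRANSITIONS[s].items():
                new[t] += p * q
        dist = new
    return dist

if __name__ == "__main__":
    print(simulate("healthy", 10))
    print(propagate({"healthy": 1.0}, 10))
```

Note that `step` receives only the current state, which is exactly what the Markov property permits; `propagate` applies the same transition table to a whole distribution rather than to a single trajectory.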
A Markov chain is a type of Markov process that has either a discrete state space or a discrete index set (often representing time), but the precise definition of a Markov chain varies.

Monte Carlo simulation is a series of random draws that simulate an event within the known parameters of the event's probability distribution. It is a computerized mathematical technique that allows risk to be accounted for in quantitative analysis and decision-making. Monte Carlo simulation offers the decision-maker a range of possible outcomes, and the probabilities with which they will occur, for any choice of action. The technique came into use in the same period as Markov chain processes. In principle, Monte Carlo methods can be used to solve any problem that has a probabilistic interpretation. By the law of large numbers, integrals described by the expected value of a random variable can be approximated by taking the empirical mean of independent samples of that variable. When the probability distribution of the variable is parameterized, mathematicians often use a Markov Chain Monte Carlo (MCMC) sampler (Del Moral et al., 2006; Kroese et al., 2014). Values estimated by MCMC preserve the actual variance.

Monte Carlo simulation has several advantages over deterministic or ‘single-point estimate’ analysis:
1. Probabilistic results: Results display not only what could happen, but also how probable each outcome is.
2. Graphical results: Data generated by Monte Carlo simulation can easily be presented as graphs of the different outcomes and their chances of occurrence. This is particularly important for communicating findings to other stakeholders.
3. Sensitivity analysis: Deterministic analysis sometimes makes it difficult to see which variables influence the outcome the most. In Monte Carlo simulation, it is easy to observe which inputs have the biggest effect on bottom-line results.
4. Scenario analysis: In deterministic models, it is extremely difficult to model different combinations of values for different inputs to see the effects of truly different scenarios. Monte Carlo simulation shows which input values occurred together when certain outcomes were achieved.
5. Correlation of inputs: In Monte Carlo simulation, it is possible to model interdependent relationships between input variables.

An enhancement to Monte Carlo simulation is the use of Latin Hypercube sampling (LHS), which samples more accurately from the entire range of distribution functions. LHS, first introduced by McKay in 1979, is a statistical method for generating a near-random sample of parameter values from a multidimensional distribution (McKay et al., 1979; Tang, 1993). The sampling method is often used to construct computer experiments or for Monte Carlo integration. Many mathematical modelling approaches, including simulated data, have been applied in determining the medicinal properties of plants.
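The law-of-large-numbers estimate and Latin Hypercube sampling described above can be sketched as follows. This is an illustrative example under stated assumptions: the test function `model`, the sample sizes, and all function names are hypothetical, and the inputs are assumed to be independent and uniform on [0, 1).

```python
import random

def simple_sample(n, dims, rng):
    """Plain Monte Carlo: n independent uniform points in [0, 1)^dims."""
    return [[rng.random() for _ in range(dims)] for _ in range(n)]

def latin_hypercube(n, dims, rng):
    """Latin Hypercube sample: each axis has exactly one point per stratum."""
    columns = []
    for _ in range(dims):
        # One uniform draw inside each of the n equal-width strata.
        col = [(i + rng.random()) / n for i in range(n)]
        rng.shuffle(col)  # pair the strata randomly across dimensions
        columns.append(col)
    return [list(point) for point in zip(*columns)]

def mc_mean(f, points):
    """Empirical mean of f over the sample, approximating E[f(X)]."""
    return sum(f(p) for p in points) / len(points)

def model(p):
    """Hypothetical response; its exact mean on the unit square is 1/3 + 1/2."""
    x, y = p
    return x * x + y

if __name__ == "__main__":
    exact = 1 / 3 + 1 / 2
    plain = mc_mean(model, simple_sample(1000, 2, random.Random(1)))
    lhs = mc_mean(model, latin_hypercube(1000, 2, random.Random(1)))
    print("plain MC error:", abs(plain - exact))
    print("LHS error:     ", abs(lhs - exact))
```

Both estimators use the same empirical mean; only the way the points are drawn differs. Because LHS forces every stratum of every input to be visited once, it typically gives a lower-variance estimate for the same number of model evaluations.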
Some examples of the applications of mathematical modelling in predicting medicinal properties and plant selection are discussed in Section 2.4.

2.3.  COMPUTATIONAL MODELS IN DRUG DISCOVERY

A computational model is a mathematical model in computational science that requires extensive computational resources to study the behaviour of a complex system by computer simulation. Computational modelling thus refers to the use of computers to simulate and study the behaviour of complex systems using mathematics, physics, and computer science. A computational model may contain numerous variables that characterize the system under investigation.

Computer-aided drug discovery (CADD) methods contribute significantly to the development of therapeutically important small molecules, from either synthetic or natural sources (Song et al., 2009). CADD methods significantly decrease the number of compounds that need to be screened, while retaining the same level of lead compound discovery. Many compounds predicted to be inactive can be skipped, and those predicted to be active can be prioritized, thus reducing the cost and workload of a full HTS screen without compromising lead discovery. CADD methods increase the hit rate of novel drug compounds because they use a much more targeted search than traditional HTS and combinatorial chemistry. CADD not only aims to explain the molecular basis of therapeutic activity, but also helps to predict possible derivatives with improved activity. The methods can be broadly classified as structure-based or ligand-based (Sliwoski et al., 2014). Structure-based methods rely on knowledge of the target protein structure to estimate interaction energies for all compounds tested. Ligand-based CADD, on the other hand, uses knowledge of known active and inactive molecules through chemical similarity searches or the construction of quantitative structure-activity relationship (QSAR) models. Important tools, e.g., target/ligand databases, homology modelling, and ligand-fingerprint methods, are necessary for the successful implementation of computer-aided drug discovery/design methods in any modern drug discovery programme. Computational methods for toxicity prediction and for optimization of favourable physiologic properties are also part of modern drug discovery and design protocols. The various approaches to computer-aided drug design are summarized in Fig. 2.2 (Aparoy et al., 2012).

FIG. 2.2  Types of drug design. [Diagram labels: drug design; SBDD (structure-based drug design); LBDD (ligand-based drug design); de novo design; virtual screening; QSAR; 2D; 3D; scaffold hopping; pharmacophore modeling; pseudo receptors.]

Many mathematical modelling approaches, including simulated data, have been used to determine the medicinal properties of plants. Computational methods are a powerful knowledge-based approach that helps to select plant material or natural products with a high likelihood of biological activity. These methods can also help to rationalize the biological activity of natural products. In silico simulations can be used to propose protein ligand-binding characteristics for molecular structures, e.g., known constituents of a plant material. Compounds that perform well in in silico predictions can be used as promising starting materials for experimental work. Some examples of the applications of computational modelling in predicting medicinal properties and plant selection are discussed in Section 2.4.
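The chemical similarity searches and ligand-fingerprint methods mentioned in this section are commonly scored with the Tanimoto coefficient, the ratio of shared to total "on" bits in two fingerprints. The sketch below is illustrative only: the fingerprints are hypothetical sets of bit indices, the compound names are placeholders, and in practice the fingerprints would be generated by a cheminformatics toolkit.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def rank_by_similarity(query_fp, library):
    """Rank library compounds by decreasing similarity to the query."""
    scores = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Hypothetical fingerprints (placeholder names, not real compounds).
KNOWN_ACTIVE = {1, 4, 7, 9, 12}
PLANT_CONSTITUENTS = {
    "constituent_A": {1, 4, 7, 9, 13},
    "constituent_B": {2, 3, 5},
    "constituent_C": {1, 4, 8, 12, 20, 21},
}

if __name__ == "__main__":
    for name, score in rank_by_similarity(KNOWN_ACTIVE, PLANT_CONSTITUENTS):
        print(f"{name}: {score:.2f}")
```

Constituents that rank highest against a known active would be the first candidates to prioritize, in line with the targeted search that CADD is meant to provide.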

2.3.1  Structure-Based CADD

In principle, structure-based CADD is similar to HTS in that both target and ligand structure information is essential (Douguet et al., 2005). Structure-based approaches include ligand docking, pharmacophore, and ligand design methods. Structure-based CADD relies on knowledge of the target protein structure to calculate interaction energies for all compounds tested (Sliwoski et al., 2014). In structure-based drug discovery approaches, therapeutics are designed on the basis of knowledge of the target structure (Leelananda and Lindert, 2016). This approach depends on the ability to determine and analyze the three-dimensional structures of biological molecules. It is based on the hypothesis that a molecule's ability to interact with a specific protein and exert a desired biological effect depends on its ability to interact favourably with a particular binding site on that protein (Sliwoski et al., 2014). Molecules sharing favourable interactions will possess similar biological effects. Therefore, novel compounds can be identified through analysis of a protein's binding site. A prerequisite for this approach is structural information, which can be accessed from target databases.

One of the important requirements is the ability to rapidly determine potential binders to the target of biological interest. Computational models are applied for rapid screening of a large compound library and for determining potential binders through modelling, simulation, and visualization techniques. The ideal starting point for docking is a target structure that has been experimentally determined by X-ray crystallography or NMR techniques. The appropriate binding pocket is usually evaluated through the analysis of known target–ligand co-crystal structures. An alternative is to use in silico methods to identify novel binding sites.
When experimental structures are not available, computational models are used to predict the 3D structure of the target proteins. A target structure may be predicted from a template with a similar sequence by a process called comparative modelling. It is based on the observation that protein structure is better conserved than sequence; that is, proteins with similar sequences have similar structures. In essence, comparative modelling involves the following steps:
1. Identification of related proteins to serve as template structures
2. Sequence alignment of the target and template proteins
3. Copying coordinates for confidently aligned regions
4. Constructing missing atom coordinates of target structures
5. Model refinement and evaluation.

The process can be automated through computer programmes, e.g., PSIPRED, MODELLER, etc.
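Step 2 of the procedure above, sequence alignment of target and template, can be sketched with a basic global alignment (Needleman-Wunsch). This is a simplified illustration and not how the programmes named above work internally: the scoring scheme (match +1, mismatch -1, gap -1) and the toy sequences are assumptions, whereas real pipelines typically use substitution matrices and more elaborate gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of sequences a and b; returns (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Score for aligning residue a[i-1] with residue b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming table.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]

def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns holding identical residues."""
    if not aligned_a:
        return 0.0
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)

if __name__ == "__main__":
    # Toy sequences, purely illustrative.
    target, template = "MKTAYIAKQR", "MKTAYLAKR"
    al_t, al_p, s = needleman_wunsch(target, template)
    print(al_t)
    print(al_p)
    print("score:", s, "identity: %.1f%%" % percent_identity(al_t, al_p))
```

The percent identity of the resulting alignment is one simple way to judge whether a candidate template (step 1) is close enough to the target for its coordinates to be reused (step 3).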


One of the most significant approaches is homology modelling, in which the template and target proteins share an evolutionary origin. Homology modelling is a popular computational method for predicting the 3D coordinates of structures. Also known as comparative modelling of proteins, it refers to constructing an atomic-resolution model of the target protein from its amino acid sequence and an experimental 3D structure of a related homologous protein, commonly referred to as the template. The basis of this approach is the fact that evolutionarily related proteins often share similar structures; protein structure generally remains more conserved than sequence during evolution. As such, understanding structures whose amino acid sequences are similar to the target sequence of interest may assist in predicting the target structure, function, and possible binding and functional sites (Leelananda and Lindert, 2016). Homology modelling has emerged as the main alternative for obtaining a 3D representation of the target in the absence of crystal structures (Aparoy et al., 2012). A combination of homology modelling and docking studies has contributed to the identification of oxidosqualene cyclases associated with primary and secondary metabolism of Centella asiatica (Kumar et al., 2013) and to understanding the structure and function of the chalcone synthase protein from Coleus forskohlii.

Computational tools have become essential in binding site detection and characterization, which are fundamental to identifying the activity of any drug or bioactive molecule. Binding sites can be detected from co-crystal structures of the target or a closely related protein. In the absence of a co-crystal structure, mutational studies can be used to identify ligand-binding sites. Computational methods are used when no binding site is known or when new binding sites need to be identified.
Computational methods for binding site detection can be divided into three general groups:
1. Geometric algorithms that find concave invaginations in the shape of the target
2. Methods based on energetic considerations
3. Methods that consider the dynamics of protein structures

The optimal interaction of a ligand with a target can be identified through steric and electronic features derived from a pharmacophore model. Such models are usually defined by hydrogen bond acceptors, hydrogen bond donors, basic groups, acidic groups, partial charge, aliphatic hydrophobic moieties, and aromatic hydrophobic moieties. A pharmacophore model can be used to query databases for bioactive compounds as well as to guide the design of new compounds. Structure-based pharmacophore methods are performed by analysing the target binding site or by studying the structure of the target-ligand complex. Screening for natural product inhibitors of acetylcholinesterase and cyclooxygenase using protein-based pharmacophores led to the identification of scopoletin as a potent acetylcholinesterase (AChE) inhibitor and sanggenons as potential COX inhibitors (Fig. 2.3) (Barlow et al., 2012). Molecular docking studies established that sieboldigenin could bind to the active site of soybean lipoxygenase and reduce carrageenan-induced paw oedema. This sterol is found in several species of Smilax, which is traditionally used in arthritic and skin ailments (Barlow et al., 2012).

FIG. 2.3  Scopoletin, and sanggenon A.

In recent years, screening of bioactive compounds based on target recognition has become quite popular among researchers. However, these methods can hardly be effective in the direct screening of bioactive compounds from plant extracts, which are often complex mixtures of many compounds. General procedures follow the strategy of ‘isolation, structure identification, activity confirmation’. One novel protocol used NMR spectroscopy to determine structural information of bioactive compounds without experimentally isolating the ligand molecules (Tang et al., 2012).

2.3.2  Ligand-Based CADD

Ligand-based methods use only ligand information, predicting activity from a compound's similarity or dissimilarity to previously known active ligands. They include ligand-based pharmacophores, molecular descriptors, and quantitative structure-activity relationships. Ligand-based CADD exploits the knowledge of known active and inactive molecules through chemical similarity searches or the construction of predictive quantitative structure-activity relationship (QSAR) models (Acharya et al., 2011; Sliwoski et al., 2014). Ligand libraries are usually constructed by enriching for ligands with physicochemical properties suitable for the target of interest. Although various docking algorithms are available, docking millions of compounds requires considerable resources. As such, time can be saved by eliminating non-drug-like, unstable, or unfavourable compounds. One of the important parameters selected for study is drug likeness, which is commonly evaluated using Lipinski’s rule of five (Pfizer’s rule of five) (Lipinski et al., 2001). The rule generally states that an orally active drug should have no more than one violation of the following criteria, each based on a multiple of five:
1. a maximum of five hydrogen bond donors;
2. a maximum of 10 hydrogen bond acceptors (all oxygen and nitrogen atoms);
3. a molecular mass of less than 500 Da; and
4. an octanol-water partition coefficient (log P) of not greater than five.
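The four criteria listed above translate directly into a simple filter. The following is a minimal sketch, assuming the hydrogen bond counts, molecular mass, and log P of each compound have already been computed elsewhere; the function names and the two example property sets are hypothetical and purely illustrative, not measured data.

```python
def rule_of_five_violations(h_bond_donors, h_bond_acceptors, molecular_mass, log_p):
    """Return the rule-of-five criteria that a compound violates."""
    violations = []
    if h_bond_donors > 5:
        violations.append("more than 5 hydrogen bond donors")
    if h_bond_acceptors > 10:
        violations.append("more than 10 hydrogen bond acceptors")
    if molecular_mass >= 500:
        violations.append("molecular mass of 500 Da or more")
    if log_p > 5:
        violations.append("log P greater than 5")
    return violations


def is_drug_like(properties, max_violations=1):
    """Accept a compound with no more than one violation of the criteria."""
    return len(rule_of_five_violations(**properties)) <= max_violations


# Illustrative, made-up property values (assumptions, not experimental data).
small_phenolic = dict(h_bond_donors=1, h_bond_acceptors=4,
                      molecular_mass=192.2, log_p=1.5)
large_glycoside = dict(h_bond_donors=8, h_bond_acceptors=14,
                       molecular_mass=750.0, log_p=-1.0)

for label, props in (("small phenolic", small_phenolic),
                     ("large glycoside", large_glycoside)):
    print(label, is_drug_like(props), rule_of_five_violations(**props))
```

Returning the list of violated criteria, rather than a bare yes/no, makes it easy to see why a compound was removed from a library before docking.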

If two or more of the rule-of-five criteria are violated, absorption is likely to be compromised. To improve the prediction of drug likeness, the rules have seen many extensions, such as a polar surface area no greater than 140 Å², a molecular weight ranging from 180 to 500, a molar refractivity from 40 to 130, and a partition coefficient (log P) in the range −0.4 to +5.6. Before initial virtual HTS, molecules are filtered based on predicted ADMET properties. These predictions depend on statistical and machine-learning approaches, molecular descriptors, and experimental data to model biological processes such as oral bioavailability, intestinal absorption, permeability, half-life, and distribution in human blood plasma (Sliwoski et al., 2014).

In any drug discovery process, lipophilicity and molecular weight are often increased to improve the affinity and selectivity of the drug candidate. As a result, it is often difficult to maintain strict drug likeness as per the rule of five (RO5) during hit and lead optimization. It has therefore been proposed that members of libraries should be biased toward lower molecular weight and lipophilicity. The rule of five has been extended to the rule of three for defining lead-like compounds as per the following criteria (Congreve et al., 2003):
1. octanol-water partition coefficient (log P) not greater than 3;
2. molecular mass less than 300 Da;
3. not more than three hydrogen bond donors;
4. not more than three hydrogen bond acceptors; and
5. not more than three rotatable bonds.

Lipinski’s fifth rule states that, when considering ‘druggable chemical entities’, the original four rules apply neither to natural products nor to any molecule that is recognized by an active transport system (Newman and Cragg, 2012). Compound libraries (Wessjohann, 2000; Geysen et al., 2003) are usually enriched for a particular target or group of targets (see Chapter 5). Physicochemical filters determined from observed ligand-target complexes are used for enriching such libraries by searching for ligands that are similar to known active ligands. As molecules are flexible in a solvent environment, representing their conformational flexibility remains an important criterion for determining their potential. The conformations of proteins and ligands are usually precomputed using computational simulation or knowledge-based methods (Foloppe and Chen, 2009).

Ligand-based computer-aided drug design involves the analysis of ligands that can interact with a target molecule. The methods require a collection of reference structures from compounds known to interact with the target of interest. The objective is to represent these compounds by the physicochemical properties that determine the desired interactions. There are two main approaches to ligand-based drug design: (a) selection of compounds based on chemical similarity to known actives using some similarity measure and (b) construction of a QSAR model.

For these analyses, molecular properties are converted to numerical vectors of descriptors. The conversion ensures that the description of each molecule has a constant length, independent of the molecule's size. A molecular descriptor represents information encoded in the molecular structure as one or more numbers. Descriptors are used to establish quantitative relationships between structures and properties, biological activities, and other experimental observations. To date, more than 2000 molecular descriptors that encode molecular features have been reported. Molecular descriptors can be classified according to their dimensionality, i.e. the representation of the molecule from which descriptor values are computed:
1. One-dimensional (1D) descriptors capture bulk properties, e.g. molecular weight, molar refractivity, and log P (logarithm of the octanol/water partition coefficient).
2. Two-dimensional (2D) descriptors describe properties that can be computed from a two-dimensional representation of the molecule, such as the number of atoms, the number of bonds, and connectivity indices.
3. Three-dimensional (3D) descriptors depend on the conformation of the molecule, e.g. solvent-accessible surface area, principal moments of inertia, and van der Waals volume.

Some descriptors derived from 3D structures may require analysis of many molecular conformations, because the biologically active conformation is usually not known from previous experiments. Common 3D descriptors may include pharmacophore-type representations of molecules, in which features known or thought to be responsible for biological activity are mapped to positions in the molecule. Molecular descriptors may also be divided according to their ‘nature’ into:
1. constitutional (fragment additive, reflecting mostly the general properties of the compound);
2. topological (calculated by applying mathematical graph theory to the scheme of atom connections in the structure);
3. geometrical;
4. electronic; and
5. quantum-chemical (the last three are derived from the results of empirical schemes or molecular orbital calculations).

Among the various approaches of ligand-based CADD, quantitative structure-activity relationships (QSAR) have contributed significantly to the development of predictive models. QSAR methods are based on the assumption that molecular structure governs biological and other attributes in a way that can be understood quantitatively. The method tries to enumerate how a fragment or substructure could result in a certain activity. In many cases, SAR (structure-activity relationships), which relates the presence of a fragment or substructure to biological activity, and QSAR, which quantifies the relationship using descriptors, are collectively referred to as (Q)SAR (Puzyn et al., 2010). Successful creation of QSAR models demands fulfilment of the following conditions:
1. consensus data on the structures and biological activity of the studied compounds;
2. extraction of descriptors to represent the structures; and
3. a machine-learning method, such as multiple linear regression, neural networks, random forest, similarity, or support vector machine.

QSAR models developed from homogeneous data are known as local models and are traditionally used for the optimization of hit or lead compounds. QSAR models developed from heterogeneous data, on the other hand, are considered global models with a wide applicability domain. Global models may be used for virtual screening, prediction of biological activity, and target fishing. QSAR has great potential across industry, academia, and regulatory agencies. Potential uses include the identification of new leads with pharmacological, biocidal, or pesticidal activity, the prediction of toxicity, the rational design of desirable products, and the selection of compounds with optimal pharmacokinetic properties. If researchers wish to determine the potential targets of a new chemical entity, the following tools can be used for studying biological activity: (a) pairwise similarity with known compounds, e.g., the Tanimoto coefficient, (b) docking, e.g., INVDOCK, (c) pharmacophore-based virtual screening, and (d) classification based on Bayesian statistics and substructure descriptors or fingerprints.

Successful prediction of the properties of any chemical entity, including phytochemicals, depends on the data on which the model is based, the technique used to develop the model, and the overall quality of the information, including the endpoint to be modelled. Generally, a model requires two types of information (the effect to be modelled and descriptors of the chemicals) and one or more techniques to formulate the relationship between them.
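To make these ingredients concrete, the sketch below fits a multiple linear regression, one of the machine-learning methods named above, to a small activity vector and descriptor matrix. It is a minimal illustration under stated assumptions: the descriptor values and activities are invented, the helper names (`fit_mlr`, `solve_linear_system`, `predict`) are hypothetical, and the normal equations are solved in plain Python rather than with a statistics package.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("descriptors are collinear; the system is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_mlr(descriptor_rows, activity_values):
    """Least-squares fit of activity = b0 + b1*d1 + ... + bk*dk."""
    rows = [[1.0] + [float(v) for v in row] for row in descriptor_rows]
    k = len(rows[0])
    # Normal equations: (A^T A) b = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, activity_values)) for i in range(k)]
    return solve_linear_system(ata, aty)


def predict(coeffs, descriptor_row):
    """Predicted activity for one compound from its descriptor values."""
    return coeffs[0] + sum(c * v for c, v in zip(coeffs[1:], descriptor_row))


# Invented data: two descriptors per compound and one activity value each.
# The activities were generated as 1.0 + 2.0*d1 - 0.5*d2, so the fit is exact.
descriptor_matrix = [(1, 0), (0, 1), (2, 1), (1, 3), (3, 2)]
activities = [3.0, 0.5, 4.5, 1.5, 6.0]

coefficients = fit_mlr(descriptor_matrix, activities)
print(coefficients)
print(predict(coefficients, (2, 2)))
```

Real QSAR data are noisy and usually have many more descriptors than this; the singularity check hints at why strongly correlated descriptors are a problem for this method.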
The data to be modelled in QSAR may be denoted by the X-matrix and the descriptors by the Y-matrix (Table 2.1). Using this matrix, various types of relationship may be established by statistical and machine-learning techniques. A QSAR is based on a continuous endpoint, where activity (X) is a function of one or more descriptors (Y). The development of a SAR, in contrast, requires a firm basis for the relationship. If a compound is found to elicit a particular effect and the structural determinant is recognized, then the responsible structural fragment can be defined. It may be flagged as a ‘structural alert’ that can be coded into software. The greater the number of compounds with the same structural determinant that demonstrate the same effect, the greater the confidence that the flag is associated with that effect. A SAR model is usually more appropriate for a qualitative endpoint (such as yes/no, active/inactive, or presence/absence of toxicity). Successful implementation of QSAR depends on the selection of appropriate statistical and machine-learning algorithms supplemented with powerful computational tools.

TABLE 2.1  Typical Data Matrix for QSAR

                                     Property/Descriptor (Y)
Chemical Identifier   Activity (X)   Fragment 1   Fragment 2   Fragment 3   …   Fragment n
Molecule i            Xi             Y1i          Y2i          Y3i          …   Yni
Molecule ii           Xii            Y1ii         Y2ii         Y3ii         …   Ynii
Molecule iii          Xiii           Y1iii        Y2iii        Y3iii        …   Yniii
…                     …              …            …            …            …   …
Molecule n            Xn             Y1n          Y2n          Y3n          …   Ynn

For several decades, multiple linear regression (MLR) has been one of the popular methods for deriving a linear mapping. However, MLR suffers from multicollinearity and overfitting and cannot capture non-linear relationships, which has led researchers to look for alternative methods. As such, methods such as neural networks, genetic algorithms, support vector machines, and random forests are applied in QSAR analysis.

Over the last few decades, several QSAR models have attempted to explain or describe the potential of traditionally used medicinal plants. However, the application of QSAR methods to herbal formulae, particularly those used in Traditional Chinese Medicine and Ayurvedic systems, is somewhat limited, as the structure and composition of all compounds in these formulae are not completely known (Wang et al., 2006). Thus, QSAR methods cannot be applied directly to predict the bioactivity of polyherbal medicines. Nevertheless, variation in the biological activity of herbal medicines is associated with variation in their chemical composition. Considering this, the quantitative composition-activity relationship (QCAR) has been proposed to relate chemical composition to biological activity (Cheng et al., 2006). This method applies the same mathematical models used in QSAR studies to derive a quantitative relationship between the composition and the bioactivity of herbal components. One of its advantages is the ability to derive an optimal combination of herbal medicines (Wang et al., 2006).

The molecular fingerprint-based technique is more qualitative in nature than other LB-CADD approaches (Sliwoski et al., 2014). Molecular fingerprints are representations of molecular structure and properties encoded as binary bit strings whose settings produce a bit ‘pattern’ characteristic of a given molecule (Hert et al., 2004).
Fingerprints may encode different sets of molecular descriptors, structural fragments, possible connectivity pathways through a molecule, or different types of pharmacophores. There are several methodologies for representing chemical information in binary form, for instance, the path-based approach, key-based fingerprints, the dictionary approach, and SMARTS pattern matching. Molecular fingerprint-based techniques represent molecules in a way that allows rapid structural comparison of phytochemicals. These approaches depend more on chemical structure and are computationally less expensive than pharmacophore mapping or QSAR models. Fingerprint-based methods treat all parts of the molecule equally and avoid focusing only on the parts considered to play an important role in bioactivity. Screening of phytochemicals using a molecular fingerprint based on the HIV protease inhibitor saquinavir led to the discovery of a potential anti-HIV agent, leucovorin. Molecular dynamics studies revealed favourable binding of this compound to the protease active site (Barlow et al., 2012). The combination of a molecular fingerprint-based method with docking studies led to the discovery of aurantiamide acetate from Artemisia annua (Fig. 2.4) as an inhibitor of the severe acute respiratory syndrome coronavirus main proteinase (Wang et al., 2007).

In ligand-based CADD, machine-learning algorithms are trained to identify patterns in data and are then used to make predictions on test data sets. One of the common algorithms is the support vector machine (SVM), which is usually used for the classification of sets of biological data (Leelananda and Lindert, 2016). Other significant candidates are random forest (Svetnik et al., 2003) and artificial neural networks (Wang, 2003).
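The fingerprint comparison described in this section can be sketched in a few lines. The example below assumes each fingerprint is stored as the set of positions of its ‘on’ bits and ranks a small library against a query using the Tanimoto coefficient (shared bits divided by the bits set in either molecule). The bit positions and compound names are hypothetical placeholders, not fingerprints of real molecules.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def rank_by_similarity(query_fp, library):
    """Return (name, similarity) pairs, most similar to the query first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints: each set lists the bit positions that are 'on'.
query = {1, 4, 7, 9}
library = {
    "compound_A": {1, 4, 7, 9, 12},
    "compound_B": {2, 3, 5},
    "compound_C": {1, 4, 8},
}

ranking = rank_by_similarity(query, library)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Because every bit contributes equally to the score, the comparison treats all parts of the molecule alike, which is the property of fingerprint methods noted above.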

2.3.3  Network Pharmacology

Network pharmacology is a new paradigm in drug discovery and development (Hopkins, 2008) and offers an enhanced understanding of drug action. It applies network analysis to determine the set of proteins most critical in a disease, and then chemical biology to identify molecules capable of targeting that set of proteins. By addressing the true complexity of disease and by seeking to harness the ability of drugs to influence many different proteins, network pharmacology differs from conventional drug discovery approaches, which have usually been based on highly specific targeting of a single protein. Network pharmacology has the potential to provide new treatments for complex diseases for which conventional approaches have failed to deliver satisfactory therapies.

FIG. 2.4  Aurantiamide acetate from Artemisia annua.

With the advancement of bioinformatics, systems biology, and polypharmacology, network-based drug discovery holds promise for cost-effective drug discovery and development in traditional herbal medicine. Network analysis is the study of molecular interactions and is related to the mathematical field of graph theory, in which the assembly of pairwise connections (edges) between discrete objects (nodes) forms a network or graph (Arrell and Terzic, 2010). Biological networks can be derived:
1. de novo, through direct experimental interactions;
2. by applying known interactions to an -omic dataset; or
3. by reverse engineering, to generate ab initio a subset of networks that predict the dynamics under study.

In the past, designing selective ligands to avoid unwanted side effects was a major concern in drug discovery. As more complex drug action came to light, it was often found that many drugs may act on a single drug target and that a single drug may fit multiple drug targets (Hopkins, 2007). Network pharmacology tries to understand this complex relationship, while validating target combinations and optimizing multiple structure-activity relationships. Network pharmacology is a systems biology-based methodology that tries to exploit the pharmacological mechanism of drug action in biological networks. When network analysis is applied to herbal medicine, a network is a mathematical and computable representation of the various connections between herbal formulae and diseases in a complex biological system (Li and Zhang, 2014). The scope of this new approach includes the study of:
1. theories, algorithms, models, and software of network pharmacology;
2. network construction and prediction of interactions;
3. theories and methods on the dynamics, optimization, and control of pharmacological networks;
4. network analysis of pharmacological networks, including flow balance analysis, topological analysis, and network stability;
5. various pharmacological networks and interactions;
6. factors that affect drug metabolism;
7. network approaches for searching for targets and discovering medicines; and
8. big data analytics of network pharmacology (Zhang, 2016).

One of the important contributions of network pharmacology is the change of perspective from the ‘one target, one drug’ strategy to the ‘network target, multi-components’ strategy, which is well suited to Traditional Chinese Medicine (TCM) and to the Ayurvedic medicine system. In pioneering work in 1999, the Chinese researcher Li proposed a possible relationship between Traditional Chinese Medicine syndromes and molecular networks, and in 2007 established a network-based TCM research strategy (Li and Zhang, 2014). Subsequently, he also proposed the network target concept for research into herbal medicine.

In the biological network approach, a node represents (a) a gene, gene product, or any other biological entity in a biomolecular network, gene regulatory network, genetic interaction network, metabolic network, or signalling network; (b) an herb, herb ingredient, or drug; or (c) a clinical phenotype of a disease. An edge represents an association, interaction, or any other well-defined relationship. The degree of a node is the number of edges connected to it, while the betweenness of a node is the number of shortest paths that pass through it. Network parameters such as betweenness, degree, shortest path, and modules are used to identify key target proteins or protein interactions.

The potential of network pharmacology lies in its multidisciplinary approach, which integrates a large amount of information and combines computational and experimental methods to make new discoveries. The main computational approaches include graph theory, statistical methods, data mining, modelling, and information visualization. Network pharmacology can be used to identify active herbal ingredients and synergistic combinations, and it contributes to the rational design and optimization of drug discovery from herbal formulae. Mapping a wide array of drugs to their protein targets as a network, both before and after modelling with chemical drug-ligand interactions, helps in predicting new targets. It also enables identification of primary sites of action and of off-target proteins that explain well-known side effects, revealing new and unexpected drug binding across major categories of proteins unrelated by sequence or structure (Arrell and Terzic, 2010).
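The node degree and betweenness defined in this section can be computed on a toy network as follows. This is a minimal sketch: the network, its node names, and its edges are hypothetical, and betweenness is computed with Brandes' algorithm for unweighted, undirected graphs, in which each pair of nodes contributes the fraction of its shortest paths that pass through a node (identical to a simple count when shortest paths are unique, as they are here).

```python
from collections import deque


def degree(graph):
    """Number of edges attached to each node."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def betweenness(graph):
    """Betweenness of every node (Brandes' algorithm, unweighted graph)."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        order = []                          # nodes in order of discovery
        preds = {v: [] for v in graph}      # predecessors on shortest paths
        sigma = {v: 0 for v in graph}       # number of shortest paths
        dist = {v: -1 for v in graph}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:                        # breadth-first search
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):           # accumulate path dependencies
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each undirected pair was visited from both ends, so halve the totals.
    return {v: s / 2 for v, s in score.items()}


# Hypothetical network: two herb ingredients, two targets, one phenotype.
network = {
    "ingredient_1": ["target_A"],
    "ingredient_2": ["target_A"],
    "target_A": ["ingredient_1", "ingredient_2", "target_B"],
    "target_B": ["target_A", "phenotype"],
    "phenotype": ["target_B"],
}

node_degree = degree(network)
node_betweenness = betweenness(network)
print(node_degree)
print(node_betweenness)
```

In this toy network `target_A` has both the highest degree and the highest betweenness, which is the kind of signal used to flag key proteins.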

2.4.  SELECTION OF MEDICINAL PLANTS

Documentation and analysis of legacy knowledge about medicinal plants provide certain advantages in identifying and designing pharmacological products from plants. Identification and selection of medicinal plants for drug discovery studies is a challenging task. Which plants to select, and by what criteria, still remains an enigma for scientists in medicinal plant research. Conventionally, the targeted approach (Mann et al., 2000), in which certain medicinal plants are prioritized, is favoured. Although expensive, this approach proceeds by generating a working hypothesis and performing experimental studies on the basis of that hypothesis. Careful selection of the targeted botanical species, using the right criteria, usually determines the outcome. Considering the diversity of the higher plants, selecting candidate species for a bioprospecting programme is not an easy task. Out of the total number of higher plants (an estimated 400,000 angiosperms and 1000 gymnosperms), only about 6% have been screened for biological activity and about 15% for phytochemical properties (Rates, 2001). Research institutes and pharmaceutical industries have taken up different approaches that can be categorized into four broad groups: ethnobotany-directed, random selection, chemotaxonomic, and integrated. The ethnobotany-directed, chemotaxonomic, and integrated approaches can all be part of the targeted approach (Yea et al., 2016).

2.4.1  Ethnobotany-Directed Drug Discovery

As traditional medicine has been used for many centuries in different cultures, medicinal plants have attracted a lot of attention as a source of medicinal products. The ethnobotany-directed approach assumes that the traditional use of plants can provide strong clues to their biological activities (Cox and Balick, 1994). In this drug discovery procedure, plants with potential medicinal properties are recognized through ethnobotanical field studies or the literature. Such plants are then investigated in the laboratory for bioactive properties, with the aim of isolating pure compounds. The preparation procedure of a traditional recipe may indicate the best extraction protocol. The formulation method may provide preliminary information about pharmacological activity, optimal doses, and a suitable mode of application for the future drug (Rates, 2001). The ethno-directed approach was initiated in the 19th and early 20th centuries by scientists such as Louis Lewin, Carl Hartwich, Alexander Tschirch, and Richard Evans Schultes, who applied a molecular interpretation to pharmacologically active plants (Gertsch, 2009). During the early phase of the ethno-directed approach, anthropologists worked in tandem with chemists and pharmacologists, resulting in the isolation of various drug molecules such as caffeine and quinine (Fig. 2.5). However, in spite of the richness of ethnopharmacological surveys worldwide and the diversity of traditional knowledge, many of the collected data could not be translated successfully into bioprospecting programmes (Albuquerque et al., 2014). One classic example is Shaman Pharmaceuticals, a bioprospecting company established in 1987 that failed to deliver any blockbuster drugs in spite of a wealth of ethnopharmacological knowledge and subsequently went bankrupt in 2001. Only a few significant contributions have been made by the ethno-directed approach in the last few decades (Gertsch, 2009).
Research problems such as inadequate design of data collection, misinterpretation of the role that medicinal plants play in the traditional medicine system, unfavourable influence of sampling, and the application of irrelevant informant consensus indices have hampered ethnobotany-directed drug discovery programmes, individually or in combination (Albuquerque et al., 2014).

FIG. 2.5  Caffeine, curare, and quinine.

In conventional ethnobotanical research, the application of computational tools along with statistical analysis is a new approach and is still rare. Regression analysis of medicinal plant families is a method for predicting which plant families contain large numbers of medicinal plants. It was first introduced by Moerman in his classic work (Moerman, 1991), which used linear regression analysis to predict the families with the largest numbers of potential medicinal plants. Since then, the approach has been applied many times and has given rise to many derivatives, including a Bayesian approach that challenges the earlier regression methods (Bennett and Husby, 2008; Moerman, 2012). Nevertheless, regression analysis remains popular. It helped to identify the plant orders and families favoured by traditional healers among the ethnomedicinal plants of South Africa, showing that the use of these plants by traditional healers is not random (Douwes et al., 2008). There has also been an attempt to apply mathematical and logical methods to replace rare herbs and simplify a traditional Chinese medicine formula, and to assess the applicability of the result from the perspective of pathway enrichment analysis (Fang et al., 2013).
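The regression analysis of plant families mentioned above can be illustrated with a small sketch. It assumes the common reading of the method: the number of medicinal species in each family is regressed on the total number of species in that family, and families with large positive residuals are treated as over-represented in the medicinal flora. The family labels and species counts below are invented for illustration and do not come from any survey.

```python
def linear_fit(x, y):
    """Ordinary least-squares slope and intercept for y = intercept + slope*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def family_residuals(flora):
    """Residual (observed minus predicted medicinal species) per family,
    sorted so that the most over-represented family comes first."""
    names = list(flora)
    total = [flora[name][0] for name in names]
    medicinal = [flora[name][1] for name in names]
    slope, intercept = linear_fit(total, medicinal)
    residuals = {
        name: med - (intercept + slope * tot)
        for name, tot, med in zip(names, total, medicinal)
    }
    return dict(sorted(residuals.items(), key=lambda kv: kv[1], reverse=True))


# Invented counts: family -> (total species, species used medicinally).
flora = {
    "Family A": (100, 30),
    "Family B": (200, 20),
    "Family C": (300, 35),
    "Family D": (400, 40),
}

residuals = family_residuals(flora)
for family, residual in residuals.items():
    print(f"{family}: {residual:+.1f}")
```

A family with a strongly positive residual contains more medicinal species than its size alone would predict, and a strongly negative residual suggests the opposite.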

2.4.2  Chemotaxonomic and Ecological Approach

The knowledge that a particular group of plants contains a particular group of natural products may help in predicting the presence of similar or related compounds in phylogenetically related species (Rates, 2001); this is the chemotaxonomy-guided approach. Chemical plant taxonomy, or simply chemotaxonomy, is a method of biological classification based on similarities in the structure of certain compounds produced by the organisms in question. In plants, it focuses on classification by chemical composition, i.e. secondary metabolites. Chemotaxonomic studies have covered, for example, the chemical taxonomy of acetylenic compounds, the distribution of fatty acids in plant lipids, and the distribution of aliphatic polyols, cyclitols, plant glycosides, and alkaloids. As proteins are more closely controlled by the genes and less subject to natural selection than anatomical features, they are more reliable indicators of genetic relationships or phylogeny. This approach becomes significant when a particular compound class with known biological activity is desired. A good example is the targeting of Datura stramonium for tropane alkaloids, with the knowledge that Atropa belladonna contains the alkaloid hyoscyamine, which subsequently led to the discovery of the similar alkaloid hyoscine (Fig. 2.6) (Heinrich et al., 2012). Similarly, the discovery of the alkaloid febrifugine (Fig. 2.7) from Hydrangea macrophylla, a native Japanese plant, was the result of targeting this plant because of its taxonomic status as a member of the Hydrangeaceae family (Heinrich et al., 2012).

FIG. 2.6  Structures of hyoscyamine and hyoscine.

FIG. 2.7  Febrifugine from Hydrangea macrophylla.

The presence of desirable chemicals may also be inferred from knowledge of the toxicity of a particular plant (Rates, 2001), its ecology, or plant-pathogen relationships (Ottmann et al., 2012). In plant-pathogen interactions, the molecular frameworks of natural products play a significant role in host colonization or pathogen immunity. These molecules are the outcome of a co-evolutionary process and are enriched in biological activity. Investigating the mode of action of these natural product classes may yield novel molecules and novel concepts of how living systems can be manipulated with small molecules (Ottmann et al., 2012).

In the ecological approach, scientists select plants that occupy particular habitats or display characteristics indicating that they produce molecules with desirable properties. This approach is based on ecological plant defence theory (Coley et al., 2003) and relies on observing interactions between organisms and their environment that might lead to the production of bioactive natural compounds. For instance, the absence of predation might suggest the presence of toxic chemicals. Many phytochemicals that are toxic to insects also exhibit biological activity in humans and might be exploited for therapeutic applications. Observations of a plant's environment that reflect the toxicological properties of the plant have led to the isolation of many antibacterial drugs. The underlying hypothesis is that secondary metabolites, e.g., in plant species, possess ecological functions that may also have therapeutic potential for humans.
For example, metabolites involved in plant defence against microbial pathogens may be useful as antimicrobials in humans, or secondary products defending a plant against herbivores through neurotoxic activity could have beneficial effects in humans due to a putative central nervous system activity (Barbosa et al., 2012). Major potential limitation of this approach lies in the classification of medicinal plant use (Ernst et al., 2016). Before analysing the medicinal plants in a phylogenetic context, medicinal plant documented or collected need to be

62  Computational Phytochemistry

c­lassified according to the diseases used to treat in ethnomedicinal system. Some common widely used classifications are International Classification of Diseases (ICD) of the WHO and the classification of Cook developed as Economic Botany Data Collection Standards (Cook, 1995). These classification systems could not fully capture the complexity and idiosyncrasy of local healthcare systems. At the same time, these systems are based on categories reflecting systems of the human body or symptoms. These systems provide little information in disease etiology and potential underlying biological activity of the medicinal plants. In recent years, more extensive studies in cellular and molecular mechanisms underlying diseases have been carried out providing more information on disease etiology. Alternative approaches in phylogenetic systems emerge by using classification based on modulating the disease response (Ernst et al., 2016). For the taxonomic classification, the current default classification is usually the Angiosperm Phylogeny Group IV (APG IV, 2016). It is because categorization and assemblage of plant species within a particular category or sub-categories needs to be based on phylogenetic relationship and APG being the most common among the practicing taxonomists. Theoretically, of the various chemical compounds used, most reliable are the semantides (DNA, RNA, and proteins) that provide more reliable taxonomic information. However, in the practical application, the approach is far from perfection and many researchers are still trying on other compound types. One such instance is that the application of graph-clustering algorithm on the metabolite content of the plant led to the successful classification of 217 plants in Japan (Liu et  al., 2017). The approach provides successful result even in incomplete metabolite data by obtaining consistent relationship between plant clusters and known evolutional relationship of plants. 
This finding led to the use of metabolite content to predict medicinal properties in plants. Beyond establishing a correlation between plant groups and chemical properties, developing a reliable cluster analysis, with a dendrogram as its visual representation, remains the fundamental step. All these processes require the selection of appropriate clustering algorithms and the application of computational tools.
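As a rough illustration of how plants might be grouped by metabolite content, the sketch below computes Jaccard distances between metabolite sets and merges the closest groups step by step. It uses plain average-linkage hierarchical clustering, which is a simpler stand-in and not the graph-clustering algorithm of Liu et al. (2017). The plant labels, the metabolite assignments, and all function names are hypothetical.

```python
from itertools import combinations

# Hypothetical metabolite sets for four made-up plant labels (not real data).
METABOLITES = {
    "plant_A": {"quercetin", "kaempferol", "rutin"},
    "plant_B": {"quercetin", "kaempferol", "luteolin"},
    "plant_C": {"berberine", "palmatine"},
    "plant_D": {"berberine", "palmatine", "rutin"},
}


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; 0 means identical sets, 1 means disjoint."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_linkage(c1, c2, data):
    """Mean pairwise distance between the members of two clusters."""
    dists = [jaccard_distance(data[x], data[y]) for x in c1 for y in c2]
    return sum(dists) / len(dists)


def agglomerate(data):
    """Merge the two closest clusters repeatedly and record each merge.

    Returns a list of (cluster_1, cluster_2, distance) tuples, which is
    the information a dendrogram plot would be drawn from.
    """
    clusters = [(name,) for name in sorted(data)]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], data),
        )
        merges.append((c1, c2, average_linkage(c1, c2, data)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


MERGES = agglomerate(METABOLITES)
for left, right, dist in MERGES:
    print(left, right, round(dist, 3))
```

The merge heights recorded in `MERGES` are what a dendrogram would display. With real data, the choice of distance measure and linkage rule would need to be justified against known phylogeny.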

2.4.3  Random Approach

The random approach was popular in the 1960s, but with limited results. It does not require any computational or mathematical input. In this approach, plants are collected regardless of any previous knowledge of their phytochemistry or biological activity. It relies on the availability of plants and is purely serendipitous in nature (Heinrich et al., 2012). It requires a large investment of money and time, and a good deal of luck. Nevertheless, this approach has made effective contributions to the development of drugs for many diseases (Albuquerque et al., 2014). There are two approaches to random screening. In the first, plants are screened for a selected class of compounds, such as alkaloids, flavonoids, coumarins, or lignans. This approach usually does not provide any idea of biological efficacy. The second approach screens randomly selected plants in selected bioassays, through focused as well as general screening. The Central Drug Research Institute, India, started this approach three decades ago. Although the institute has screened about 2000 plants for biological efficacy, the screening has not yielded any new drug (Katiyar et al., 2012). In the United States, the National Cancer Institute of the National Institutes of Health screened about 35,000 plants for anticancer activity over two decades, from 1960 to 1982, resulting in the discovery of chemotypes including the taxanes and camptothecin (Fig. 2.8) (Cragg and Newman, 2005). Their development into clinically active agents spanned about 30 years.

FIG. 2.8  Taxol and camptothecin.

2.4.4  Integrated Approach

This approach, also called the knowledge-driven or information-driven approach, takes into consideration the ethnobotanical, random, and chemotaxonomic approaches when selecting medicinal plants (Katiyar et al., 2012; Lin et al., 2015). Computational and mathematical tools are extensively applied. Related information on a particular species is integrated into a database for prioritizing the screening process. Hypothesis generation and subsequent analysis require careful assembly, overlay, and comparison of data from divergent sources. The importance of database-driven information sharing in drug discovery can be illustrated by the large-scale production of taxol (Fig. 2.8). In 1962, a team of researchers at the National Cancer Institute in the US discovered that extracts of Pacific yew (Taxus brevifolia) had cytotoxic activity. In 1977, the team confirmed the bioactive component of the extract as paclitaxel, also known by its trade name taxol. After phase I clinical trials against a number of cancer types began in 1984, taxol was approved by the FDA for the treatment of ovarian cancer and breast cancer. However, the supply of paclitaxel was a major challenge, as the compound occurs in the thin bark of Taxus at extremely low concentration. The bark of a single tree could provide only a single dose for clinical trial, and harvesting it destroyed the whole plant. Large-scale production of taxol was made possible by developing several pathways starting from 10-deacetylbaccatin III, a non-cytotoxic precursor of taxol. This precursor was initially isolated in France (Institut de Chimie des Substances Naturelles, Gif-sur-Yvette) from the needles of the European yew, Taxus baccata (Raviña, 2011). One kilogram of fresh needles can provide 1 g of precursor, making large-scale supply of taxol possible. In this way, modern drug discovery approaches rely on various information resources. Curating and compiling information from different sources demands access to, and sharing of, data curation services and databases. Statistical processing of this varied information through multivariate analysis (e.g., Principal Component Analysis, Discriminant Function Analysis) offers the potential to understand the pattern of medicinal properties in several target species. The curation of medicinal plant data and its potential are discussed in the following sections.
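To make the multivariate step more concrete, the following sketch extracts the first principal component of a small table by power iteration on its covariance matrix. It is a minimal, standard-library illustration under stated assumptions: the four rows and two columns in `DATA` are invented numbers, and a real analysis would normally use an established numerical library and many more variables.

```python
import math

# Hypothetical descriptor table: rows are species, columns are two
# made-up numeric variables (for example, two assay readouts).
DATA = [
    [1.0, 2.0],
    [2.0, 4.1],
    [3.0, 5.9],
    [4.0, 8.0],
]


def centre(rows):
    """Subtract the column means so every variable has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centred rows."""
    n, p = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def first_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov by power iteration."""
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        nxt = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    eigenvalue = sum(
        vec[i] * sum(cov[i][j] * vec[j] for j in range(len(cov)))
        for i in range(len(cov))
    )
    return vec, eigenvalue


CENTRED = centre(DATA)
COV = covariance(CENTRED)
PC1, VAR1 = first_component(COV)
# Share of the total variance captured by the first component.
EXPLAINED = VAR1 / sum(COV[i][i] for i in range(len(COV)))
# Projection of each species onto the first component.
SCORES = [sum(x * w for x, w in zip(row, PC1)) for row in CENTRED]
print(PC1, round(EXPLAINED, 4), SCORES)
```

Because the two invented variables are almost perfectly correlated, nearly all the variance falls on the first component. Species with similar scores would be candidates for sharing a pattern of properties, subject to experimental confirmation.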

2.5.  ROLE OF MEDICINAL PLANTS DATABASES

Phytochemicals, and natural products in general, are recognized for their high chemical diversity, biochemical specificity, and other properties that make them favourable lead structures for drug discovery programmes. However, assessing this diverse chemical space efficiently and effectively is still impractical in terms of resources and time. Computational approaches for identifying bioactive phytochemicals are expected to accelerate drug discovery by exploring the chemical space covered by these molecules or by applying natural product (phytochemical) libraries (see Chapter 5) in ligand-based and target-based virtual screening. A major problem in in silico studies of the biological activity of natural products is the unavailability of large natural product or phytochemical databases with adequate structural and biological information. Medicinal plant databases curate information covering a large spectrum of plant properties, including the formulae of traditional medicine (Ningthoujam et al., 2012). Dozens of databases and other Internet resources providing various types of information have become available over the last decades. Development of an inclusive database with information about classification and activity, and with a ready-to-dock library of medicinal plant compounds, is essential for drug design using medicinal plant resources (Mumtaz et al., 2017). If a single database cannot provide a complete picture of a medicinal plant, data mining and sharing across different resources may be used. This requires a unique identification number for each plant or other entity. These databases are required to provide phytochemical and pharmacological information on medicinal plants.
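The role of a unique identifier when combining resources can be sketched as follows. The example assumes two hypothetical sources that share a made-up identifier scheme (`MP-0001` and so on); the field names and the `merge_by_id` helper are illustrative only and do not describe any existing database.

```python
# Two hypothetical sources describing plants under a shared, made-up
# identifier scheme ("MP-0001", ...). The records are illustrative only.
SOURCE_PHYTOCHEMISTRY = [
    {"plant_id": "MP-0001", "species": "Taxus brevifolia",
     "compounds": ["paclitaxel"]},
    {"plant_id": "MP-0002", "species": "Taxus baccata",
     "compounds": ["10-deacetylbaccatin III"]},
]
SOURCE_PHARMACOLOGY = [
    {"plant_id": "MP-0001", "activities": ["cytotoxic"]},
    {"plant_id": "MP-0003", "activities": ["antimicrobial"]},
]


def merge_by_id(*sources):
    """Combine records that share a plant_id into one entry per plant.

    List-valued fields are concatenated without duplicates; scalar
    fields keep the first value seen.
    """
    merged = {}
    for source in sources:
        for record in source:
            entry = merged.setdefault(record["plant_id"], {})
            for key, value in record.items():
                if isinstance(value, list):
                    target = entry.setdefault(key, [])
                    for item in value:
                        if item not in target:
                            target.append(item)
                else:
                    entry.setdefault(key, value)
    return merged


MERGED = merge_by_id(SOURCE_PHYTOCHEMISTRY, SOURCE_PHARMACOLOGY)
for plant_id, entry in sorted(MERGED.items()):
    print(plant_id, entry)
```

The entry for `MP-0003` ends up without a species name, which shows how gaps in one source stay visible after merging instead of being silently filled.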
The number of medicinal plant databases increases year by year, with specialized or comprehensive information giving opportunities for the study of plants and the utilization of traditional knowledge. For instance, Global Natural Products Social Molecular Networking curates and enables the sharing of raw, processed, or identified tandem mass spectrometry data (Wang et al., 2016). Another example of curated specialized data is plant protein interaction data, held in resources such as IntAct, The Arabidopsis Information Resource, and BioGRID (Lee et al., 2010). Data curated in medicinal plant databases need to be as comprehensive as possible to serve as important resources for drug discovery studies. Aggregation of these data allows researchers to carry out visualization, data mining, and further analysis to produce new insights. Aggregation would be unachievable if the data were dispersed in largely inaccessible formats (Rodriguez et al., 2009). Challenges that arise when aggregating data in different formats can be mitigated by ontological linking and by the introduction of a NoSQL data model (Ningthoujam et al., 2014). With information from analytical and exploratory activities expanding rapidly, the role of medicinal plant databases in housing it is also steadily increasing. The availability of comprehensive information about a particular plant species or plant group would accelerate the analysis and prediction of its medicinal properties.
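One way to picture ontological linking across differently shaped records is sketched below. Source-specific terms are mapped to shared concept labels before aggregation, and records are handled as schema-free dictionaries. This is only an assumed, simplified illustration and not the NoSQL model of Ningthoujam et al. (2014); the term mapping, the record shapes, and the species labels are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from source-specific use terms to shared concept
# labels; a real project would take these from a curated ontology.
TERM_TO_CONCEPT = {
    "fever": "pyrexia",
    "high temperature": "pyrexia",
    "febrile illness": "pyrexia",
    "loose motion": "diarrhoea",
    "diarrhea": "diarrhoea",
}

# Hypothetical records in two different shapes, as they might arrive
# from two databases with different schemas.
RECORDS = [
    {"species": "Species X", "use": "Fever"},
    {"taxon": "Species X", "indications": ["high temperature", "loose motion"]},
    {"taxon": "Species Y", "indications": ["diarrhea"]},
]


def normalise(record):
    """Return (species, set of shared concepts) for either record shape."""
    species = record.get("species") or record.get("taxon")
    terms = record.get("indications") or [record.get("use", "")]
    concepts = {
        TERM_TO_CONCEPT[t.lower()] for t in terms if t.lower() in TERM_TO_CONCEPT
    }
    return species, concepts


def aggregate(records):
    """Collect the shared concepts reported for each species."""
    by_species = defaultdict(set)
    for record in records:
        species, concepts = normalise(record)
        by_species[species] |= concepts
    return dict(by_species)


AGGREGATED = aggregate(RECORDS)
print(AGGREGATED)
```

Terms that are missing from the mapping are dropped silently in this sketch; a production pipeline would more likely log them for curation.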

2.6.  TOOLS AND TECHNIQUES

Various tools and techniques are used to explore medicinal properties, drawing on data curated in medicinal plant databases as well as on analytical methods such as QSAR, molecular modelling, and virtual screening (Lagunin et al., 2014). Software and tools that can be used for virtual screening and for identifying the potential mechanism of action of herbal constituents can be categorized as shown in Table 2.2 (Barlow et al., 2012).
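Ligand-based virtual screening is often introduced through similarity searching, which can be sketched with the Tanimoto coefficient on fingerprint bit sets. The fingerprints below are invented sets of bit positions, and `rank_library` and its threshold are hypothetical choices; in practice the fingerprints would be generated by a cheminformatics toolkit.

```python
# Hypothetical fingerprints: each compound is represented by the set of
# "on" bit positions. Real fingerprints would come from a
# cheminformatics toolkit; these numbers are made up.
KNOWN_ACTIVE = {1, 4, 7, 9, 15}
LIBRARY = {
    "compound_1": {1, 4, 7, 9, 15},
    "compound_2": {1, 4, 7, 20},
    "compound_3": {2, 3, 5},
}


def tanimoto(a, b):
    """Tanimoto coefficient |a & b| / |a | b| for two bit sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_library(query, library, threshold=0.4):
    """Return (name, similarity) pairs at or above threshold, best first."""
    scored = [(name, tanimoto(query, bits)) for name, bits in library.items()]
    hits = [item for item in scored if item[1] >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


HITS = rank_library(KNOWN_ACTIVE, LIBRARY)
print(HITS)
```

Compounds that score above the threshold are only putatively active. The threshold of 0.4 is arbitrary here and would need tuning against compounds of known activity.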

TABLE 2.2  Uses and Tools and/or Algorithms Important in Computational Methods of Herbal Medicine

Methods | Prerequisites | Use | Tools/Algorithms
Ligand-based screening | Knowledge of compounds with known activity | To identify putatively active compounds | Classification/regression trees (including Random Forest), linear discriminant analysis, artificial neural networks, support vector machines
Pharmacophore (ligand-based or target-based) | 3D structures of known ligands to chosen targets (ligand-based); known 3D structures of target protein(s), ideally 3D structure(s) of known complex(es) (target-based) | To identify putatively active compounds | LigandScout, Schrödinger's Phase program, Accelrys's Discovery Studio Catalyst, etc.
Docking | Known 3D structure(s) of target proteins | To 'dock' potential small molecule ligands into protein active sites, optimizing their topographical and chemical complementarity, and scoring their interaction | FlexX, Gold, Dock, Glide, MolDock, AutoDock, LigandFit, etc.
Pattern recognition | | Post-screening analyses (involving dimensionality reduction) | Principal component analysis (PCA), multidimensional scaling, self-organizing maps, various forms of cluster analysis, etc.
Proteomics and genomics data visualization and analysis | | | Application-specific programs for statistical processing and visualization of data output from DNA micro-array experiments, MS proteomics experiments, etc.

2.7.  ROLE OF DATA MINING IN MEDICINAL PLANT SELECTION

Data mining is the process of sorting through large datasets to identify patterns and establish relationships, and so to solve problems through data analysis using machine-learning and statistical methods (Afendi et al., 2013; Yea et al., 2016). Data mining methods use various kinds of information obtained from sources such as bibliographic literature, experimental data, clinical data, and curated data. The vast amount of data stored in these databases is screened to identify potential natural products. Among data mining approaches, the random selection approach does not consider taxonomic affinities, ethnomedicinal contexts, or other intrinsic qualities. However, random screening is associated with an extremely low probability of discovering useful compounds (Yea et al., 2016). Considering the limitations of random selection, some more advanced methods have been proposed. For instance, a simple scoring system was developed to search for local alternatives to Phytolacca dodecandra, and more complex methods have also been used. Despite the application of simple scoring, regression analysis, or logical formula methods as alternatives to random selection, successful mining of the vast body of information and knowledge pertaining to biology, medicine, and botany is far from complete. Even today, mining biomedical data to unearth knowledge or generate hypotheses is an active research field. One major innovation is the use of semantic information from the Medical Subject Headings (MeSH) thesaurus to cluster documents in the MEDLINE database (Yea et al., 2016). In this approach, three categories containing terms related to herbal compounds, efficacy, toxicity, and metabolic processes were selected and subjected to a similarity measurement method. This approach could predict herbs with similar efficacy 500% more accurately than random selection. Association rule mining is another powerful tool for deriving relationships between different factors and the properties of traditional Chinese medicines. As data mining aims at extracting structured information or discovering new knowledge from large data, one of its prerequisites is data availability (Lee, 2015). Data mining techniques are therefore closely tied to advances in technology and to the curation of medicinal plant databases.
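The two basic quantities behind association rule mining, support and confidence, can be illustrated with a few lines of code. The sketch below assumes a toy set of formulae in which herb labels and property labels are mixed in one item set; the labels and the rule being tested are invented for illustration and are not drawn from any prescription database.

```python
# Hypothetical "transactions": each formula lists made-up herb labels
# and property labels. None of this is real prescription data.
FORMULAE = [
    {"herb_A", "herb_B", "warming"},
    {"herb_A", "herb_C", "warming"},
    {"herb_A", "herb_B", "cooling"},
    {"herb_B", "herb_C", "cooling"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent plus consequent) divided by support(antecedent)."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(antecedent | consequent, transactions) / base


# Rule under test: "formulae containing herb_A tend to be warming".
RULE_SUPPORT = support({"herb_A", "warming"}, FORMULAE)
RULE_CONFIDENCE = confidence({"herb_A"}, {"warming"}, FORMULAE)
print(RULE_SUPPORT, round(RULE_CONFIDENCE, 3))
```

A full rule-mining algorithm would enumerate candidate item sets and keep only those above chosen support and confidence thresholds; this sketch evaluates a single rule to show what those thresholds measure.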

2.8.  SAFETY CONSIDERATIONS

Medicinal plants, though considered to be of lower risk, are not completely free from the possibility of toxicity or other adverse effects. Apart from inherent toxicity, adverse effects of herbal preparations may arise from contamination of products with toxic metals, adulteration, misidentification or substitution of herbal ingredients, and improper processing (Jordan et al., 2010). There may also be interactions with drugs, foods, and other herbal products taken concomitantly. In view of these aspects, concern about the safety assessment of herbs has increased, and various protocols and guidance documents have been issued. Documents issued by the International Life Sciences Institute, the International Union of Pure and Applied Chemistry, the European Medicines Agency, and the European Food Safety Authority discuss the safety assessment of herbs for use in foods and medicines. Safety assessment of herbal products, whether as pre-market assessment or post-market surveillance, is subject to challenges. Deficiencies in the quantity and quality of information are the major factors. Efficient assessment requires proper information on adverse reactions, product quality, the composition of herbal formulae, and the toxicity of the constituents. Integrating all these parameters can reduce uncertainty in decision-making, and is achievable if all the available information is held in a knowledge base. Developing such a knowledge base requires integrating various data sources and mapping different kinds of information (e.g., toxicity, bioassay, herbal formulae, and contraindications) arising from diverse domains.


Another dimension of safety consideration is the application of (computer-assisted) predictive toxicology to screen and assess the potential toxicity of chemicals. Predictive toxicology contributes to the aim of transforming toxicology testing from a primarily observational science into a truly predictive science, using mathematical and computational tools, for the benefit of drug development, chemical risk assessment, and food safety. It deals with the development of new non-animal tests that do not simply duplicate existing animal tests but offer a new scientific basis for safety testing. It reflects a significant shift away from observing adverse effects in experimental animals, sometimes at high doses, towards analysing the effects of chronic exposure to low concentrations on cells and organ systems. This approach offers the potential for reliable, reproducible, faster, and more cost-effective safety assessment, both in new product development (e.g., new drug development) and eventually in regulatory testing, and is advantageous when the number of individual chemicals to be screened far exceeds the capacity for assessment. Prediction of toxicological properties uses computational toxicology methods such as QSAR to assess environmental chemicals. Various QSAR models can predict LD50 in rats and the mutagenicity and carcinogenicity of chemicals. Interest in computational prediction of toxicological properties has increased. In silico methods have been used to predict the cytotoxic activity of sesquiterpene lactones in members of the Asteraceae family: Fernandes et al. (2008) used an artificial neural network to examine these compounds with regard to their cytotoxic potential. One landmark discovery was made by Valerio using a QSAR model for rodent carcinogenicity. Di Sotto et al. (2017) reported a genotoxicity assessment of piperitenone oxide, a natural flavouring agent also known as rotundifolone (Fig. 2.9), based on an integrated in vitro and in silico evaluation protocol. In the in silico part of the study, genotoxicity was predicted computationally using the Toxtree and VEGA tools. The computational prediction for piperitenone oxide agreed with the toxicological data and highlighted the epoxide function and the α,β-unsaturated carbonyl as possible structural alerts for DNA damage. However, it was felt that improved toxicological libraries for naturally occurring compounds were essential to extend the application of in silico models to toxicological prediction.
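In its simplest form, a QSAR model relates a numerical descriptor to a measured endpoint by regression. The sketch below fits a one-descriptor least-squares line and uses it to estimate the endpoint for an unseen compound. The descriptor values and endpoint values are invented, and the model is far simpler than the published models mentioned above; it only illustrates the fit-then-predict pattern.

```python
# Hypothetical training data: one made-up descriptor value per compound
# and a made-up toxicity endpoint (for example, a log-transformed dose).
DESCRIPTOR = [1.0, 2.0, 3.0, 4.0]
ENDPOINT = [2.1, 2.9, 4.2, 4.8]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys, slope, intercept):
    """Coefficient of determination of the fitted line on the data."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


SLOPE, INTERCEPT = fit_line(DESCRIPTOR, ENDPOINT)
R2 = r_squared(DESCRIPTOR, ENDPOINT, SLOPE, INTERCEPT)
PREDICTED = SLOPE * 2.5 + INTERCEPT  # estimate for an unseen compound
print(round(SLOPE, 3), round(INTERCEPT, 3), round(R2, 3), round(PREDICTED, 3))
```

A goodness-of-fit value computed on the training data says little about predictive power; real QSAR work would validate on compounds held out from fitting and would check that a new compound lies within the range of the training descriptors.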

FIG. 2.9  Piperitenone oxide, a naturally occurring flavouring agent.


2.9. CONCLUSION

Study of the pleiotropic pharmacological potential of natural products derived from medicinal plants may be possible with the availability of medicinal plant databases that store information on chemical structures and therapeutic uses. Modelling may provide answers to hitherto unsolved problems and greatly expand our knowledge base beyond actual study data. Initially, the scientific community suffered from a lack of large data sets, particularly of curated biological activity. These limitations have been overcome, to some extent, by open access initiatives such as PubChem, DrugBank, ChemBank, and ChemSpider. Nevertheless, problems persist, as there are limitations in managing high-capacity data in step with the big data being generated and in transforming these data into meaningful knowledge. Data integration from divergent sources on different platforms, coupled with increasingly complex multidisciplinary approaches, has increased the need for data analysis and interpretation.

REFERENCES

Abramowitz, M., Stegun, I.A., 1968. Handbook of Mathematical Functions. Dover Publications, New York.
Acharya, C., Coop, A., Polli, J.E., MacKerell, A.D., 2011. Recent advances in ligand-based drug design: relevance and utility of the conformationally sampled pharmacophore approach. Curr. Comput. Aid. Drug Design 7, 10–22.
Afendi, F.M., Ono, N., Nakamura, Y., Nakamura, K., Darusman, L.K., Kibinge, N., Morita, A.H., Tanaka, K., Horai, H., Altaf-Ul-Amin, M., Kanaya, S., 2013. Data mining methods for omics and knowledge of crude medicinal plants toward big data biology. Comput. Struct. Biotechnol. J. 4, e201301010.
Albuquerque, U.P., Medeiros, P.M.D., Ramos, M.A., Júnior, W.S.F., Nascimento, A.L.B., Avilez, W.M.T., Melo, J.G.D., 2014. Are ethnopharmacological surveys useful for the discovery and development of drugs from medicinal plants? Rev. Bras 24, 110–115.
Aparoy, P., Kumar Reddy, K., Reddanna, P., 2012. Structure and ligand based drug design strategies in the development of novel 5-LOX inhibitors. Curr. Med. Chem. 19, 3763–3778.
APG IV, 2016. An update of the angiosperm phylogeny group classification for the orders and families of flowering plants: APG IV. Bot. J. Linn. Soc. 181, 1–20.
Arrell, D.K., Terzic, A., 2010. Network systems biology for drug discovery. Clin. Pharmacol. Therap. 88, 120–125.
Barbosa, W.L.R., Do Nascimento, M.S., Do Nascimento Pinto, L., Maia, F.L.C., Sousa, A.J.a.D., Júnior, J.O.V.C.R.S., Monteiro, M.C.M., De Oliveira, D.R., 2012. Selecting medicinal plants for development of phytomedicine and use in primary health care. In: Bioactive Compounds in Phytomedicine. InTech, London, UK.
Barlow, D.J., Buriani, A., Ehrman, T., Bosisio, E., Eberini, I., Hylands, P.J., 2012. In-silico studies in Chinese herbal medicines' research: evaluation of in-silico methodologies and phytochemical data sources, and a review of research to date. J. Ethnopharmacol. 140, 526–534.
Bennett, B.C., Husby, C.E., 2008. Patterns of medicinal plant use: an examination of the Ecuadorian Shuar medicinal flora using contingency table and binomial analyses. J. Ethnopharmacol. 116, 422–430.

Cheng, Y., Wang, Y., Wang, X., 2006. A causal relationship discovery-based approach to identifying active components of herbal medicine. Comput. Biol. Chem. 30, 148–154.
Coley, P.D., Heller, M.V., Aizprua, R., Araúz, B., Flores, N., Correa, M., Gupta, M., Solis, P.N., Ortega-Barría, E., Romero, L.I., 2003. Using ecological criteria to design plant collection strategies for drug discovery. Front. Ecol. Environ. 1, 421–428.
Congreve, M., Carr, R., Murray, C., Jhoti, H., 2003. A 'rule of three' for fragment-based lead discovery? Drug Discov. Today 8, 876–877.
Cook, F.E.M., 1995. Economic Botany Data Collection Standard: Prepared for the International Working Group on Taxonomic Databases for Plant Sciences (TDWG). Royal Botanic Gardens, Kew.
Cox, P.A., Balick, M.J., 1994. The ethnobotanical approach to drug discovery. Sci. Am. 270, 82–87.
Cragg, G.M., Newman, D.J., 2005. Plants as a source of anti-cancer agents. J. Ethnopharmacol. 100, 72–79.
Del Moral, P., Doucet, A., Jasra, A., 2006. Sequential Monte Carlo samplers. J. Royal Stat. Soc. Ser. B (Stat. Method.) 68, 411–436.
Di Sotto, A., Di Giacomo, S., Abete, L., Bozovic, M., Parisi, O.A., Barile, F., Vitalone, A., Izzo, A.A., Ragno, R., Mazzanti, G., 2017. Genotoxicity assessment of piperitenone oxide: an in vitro and in silico evaluation. Food Chem. Toxicol. 106, 506–513.
Douguet, D., Munier-Lehmann, H., Labesse, G., Pochet, S., 2005. LEA3D: a computer-aided ligand design for structure-based drug design. J. Med. Chem. 48, 2457–2468.
Douwes, E., Crouch, N.R., Edwards, T.J., Mulholland, D.A., 2008. Regression analyses of southern African ethnomedicinal plants: informing the targeted selection of bioprospecting and pharmacological screening subjects. J. Ethnopharmacol. 119, 356–364.
Ernst, M., Saslis-Lagoudakis, C.H., Grace, O.M., Nilsson, N., Simonsen, H.T., Horn, J.W., Rønsted, N., 2016. Evolutionary prediction of medicinal properties in the genus Euphorbia L. Sci. Rep. 6, 30531.
Fang, Z., Zhang, M., Yi, Z., Wen, C., Qian, M., Shi, T., 2013. Replacements of rare herbs and simplifications of traditional Chinese medicine formulae based on attribute similarities and pathway enrichment analysis. Evid. Based Complement. Alternat. Med. 2013, 136732 (9 pages).
Fernandes, M.B., Scotti, M.T., Ferreira, M.J., Emerenciano, V.P., 2008. Use of self-organizing maps and molecular descriptors to predict the cytotoxic activity of sesquiterpene lactones. Eur. J. Med. Chem. 43, 2197–2205.
Foloppe, N., Chen, I.-J., 2009. Conformational sampling and energetics of drug-like molecules. Curr. Med. Chem. 16, 3381–3413.
Gertsch, J., 2009. How scientific is the science in ethnopharmacology? Historical perspectives and epistemological problems. J. Ethnopharmacol. 122, 177–183.
Geysen, H.M., Schoenen, F., Wagner, D., Wagner, R., 2003. Combinatorial compound libraries for drug discovery: an ongoing challenge. Nat. Rev. Drug Discov. 2, 222–230.
Gilca, M., Barbulescu, A., 2015. Taste of medicinal plants: a potential tool in predicting ethnopharmacological activities? J. Ethnopharmacol. 174, 464–473.
Heinrich, M., Barnes, J., Gibbons, S., Williamson, E.M., 2012. Fundamentals of Pharmacognosy and Phytotherapy. Elsevier Health Sciences, Edinburgh.
Hert, J., Willet, P., Wilton, D.J., 2004. Comparison of fingerprint-based methods for virtual screening using multiple bioactive reference structures. J. Chem. Inf. Model. 44, 1177–1185.
Hopkins, A.L., 2007. Network pharmacology. Nat. Biotechnol. 25, 1110–1111.
Hopkins, A.L., 2008. Network pharmacology: the next paradigm in drug discovery. Nat. Chem. Biol. 4, 682–690.

Jordan, S.A., Cunningham, D.G., Marles, R.J., 2010. Assessment of herbal medicinal products: challenges, and opportunities to increase the knowledge base for safety assessment. Toxicol. Appl. Pharmacol. 243, 198–216.
Katiyar, C., Gupta, A., Kanjilal, S., Katiyar, S., 2012. Drug discovery from plant sources: an integrated approach. AYU 33, 10.
Kinghorn, A.D., 1994. The discovery of drugs from higher plants. In: Gullo, V.P. (Ed.), Discovery of Novel Natural Products With Therapeutic Potential. Newnes, Boston.
Kroese, D.P., Brereton, T., Taimre, T., Botev, Z.I., 2014. Why the Monte Carlo method is so important today. Wiley Int. Rev. Comput. Stat. 6, 386–392.
Kubinyi, H., 1999. Chance favors the prepared mind: from serendipity to rational drug design. J. Recept. Sig. Transd. 19, 15–39.
Kumar, V., Kumar, C.S., Hari, G., Venugopal, N.K., Vijendra, P.D., B, G.B., 2013. Homology modeling and docking studies on oxidosqualene cyclases associated with primary and secondary metabolism of Centella asiatica. SpringerPlus 2, 189.
Lagunin, A.A., Goel, R.K., Gawande, D.Y., Pahwa, P., Gloriozova, T.A., Dmitriev, A.V., Ivanov, S.M., Rudik, A.V., Konova, V.I., Pogodin, P.V., Druzhilovsky, D.S., Poroikov, V.V., 2014. Chemo- and bioinformatics resources for in silico drug discovery from medicinal plants beyond their traditional use: a critical review. Nat. Prod. Rep. 31, 1585–1611.
Lee, S., 2015. Systems biology: a pivotal research methodology for understanding the mechanisms of traditional medicine. J. Pharm. 18, 11–18.
Lee, K., Thorneycroft, D., Achuthan, P., Hermjakob, H., Ideker, T., 2010. Mapping plant interactomes using literature curated and predicted protein–protein interaction data sets. Plant Cell 22, 997–1005.
Leelananda, S.P., Lindert, S., 2016. Computational methods in drug discovery. Beilstein J. Org. Chem. 12, 2694.
Li, S., Zhang, B., 2014. Traditional Chinese medicine network pharmacology: theory, methodology and application. Chin. J. Nat. Med. 11, 110–120.
Lin, W.-C., Wen, C.-C., Chen, Y.-H., Hsiao, P.-W., Liao, J.-W., Peng, C.-I., 2015. Integrative approach to analyze biodiversity and anti-inflammatory bioactivity of Wedelia medicinal plants. PLoS One 10, e0129067.
Lipinski, C.A., Lombardo, F., Dominy, B.W., Feeney, P.J., 2001. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 46, 3–26 (originally published in Adv. Drug Deliv. Rev. 23 (1997) 3–25).
Liu, K., Abdullah, A.A., Huang, M., Nishioka, T., Altaf-Ul-Amin, M., Kanaya, S., 2017. Novel approach to classify plants based on metabolite-content similarity. Biomed. Res. Int. 2017, 12.
Mann, D.R.A., Da Rocha, A.B., Scheartsmann, G., 2000. Anti-cancer drug discovery and development in Brazil: targeted plant collection as a rational strategy to acquire candidate anti-cancer compounds. Oncologist 5, 185–198.
McKay, M.D., Beckman, R.J., Conover, W.J., 1979. A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21, 239–245.
Mckean, H.P., 1966. A class of Markov processes associated with nonlinear parabolic equations. Proc. Natl. Acad. Sci. U. S. A. 56, 1907–1911.
Moerman, D.E., 1991. The medicinal flora of native North America: an analysis. J. Ethnopharmacol. 31, 1–42.
Moerman, D.E., 2012. Commentary: regression residual vs. Bayesian analysis of medicinal floras. J. Ethnopharmacol. 139, 693–694.

Mumtaz, A., Ashfaq, U.A., Ul Qamar, M.T., Anwar, F., Gulzar, F., Ali, M.A., Saari, N., Pervez, M.T., 2017. MPD3: a useful medicinal plants database for drug designing. Nat. Prod. Res. 31, 1228–1236.
Newman, D.J., Cragg, G.M., 2012. Natural products as sources of new drugs over the 30 years from 1981 to 2010. J. Nat. Prod. 75, 311–335.
Ningthoujam, S.S., Talukdar, A.D., Potsangbam, K.S., Choudhury, M.D., 2012. Challenges in developing medicinal plant databases for sharing ethnopharmacological knowledge. J. Ethnopharmacol. 141, 9–32.
Ningthoujam, S.S., Choudhury, M.D., Potsangbam, K.S., Chetia, P., Nahar, L., Sarker, S.D., Basar, N., Talukdar, A.D., 2014. NoSQL data model for semi-automatic integration of ethnomedicinal plant data from multiple sources. Phytochem. Anal. 25, 495–507.
Ottmann, C., Van Der Hoorn, R.A., Kaiser, M., 2012. The impact of plant-pathogen studies on medicinal drug discovery. Chem. Soc. Rev. 41, 3168–3178.
Press, W.H., Flannery, B.P., Teukolsky, S.A., Vetterling, W.T., 1987. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, Cambridge.
Puzyn, T., Leszczynski, J., Cronin, M.T., 2010. Recent Advances in QSAR Studies: Methods and Applications. Springer Science & Business Media.
Rates, S., 2001. Plants as source of drugs. Toxicon 39, 603–613.
Raviña, E., 2011. The Evolution of Drug Discovery: From Traditional Medicines to Modern Drugs. John Wiley & Sons.
Rodriguez, H., Snyder, M., Uhlén, M., Andrews, P., Beavis, R., Borchers, C., Chalkley, R.J., Cho, S.Y., Cottingham, K., Dunn, M., Dylag, T., Edgar, R., Hare, P., Heck, A.J.R., Hirsch, R.F., Kennedy, K., Kolar, P., Kraus, H.-J., Mallick, P., Nesvizhskii, A., Ping, P., Pontén, F., Yang, L., Yates, J.R., Stein, S.E., Hermjakob, H., Kinsinger, C.R., Apweiler, R., 2009. Recommendations from the 2008 international summit on proteomics data release and sharing policy: the Amsterdam principles. J. Proteome Res. 8, 3689–3692.
Rønsted, N., Symonds, M.R.E., Birkholm, T., Christensen, S.B., Meerow, A.W., Molander, M., Mølgaard, P., Petersen, G., Rasmussen, N., Van Staden, J., Stafford, G.I., Jäger, A.K., 2012. Can phylogeny predict chemical diversity and potential medicinal activity of plants? A case study of Amaryllidaceae. BMC Evol. Biol. 12, 182.
Sarker, S.D., Nahar, L., 2012. Natural Products Isolation, third ed. Humana Press, Springer-Verlag, USA.
Schmitz, R., 1985. Friedrich Wilhelm Sertürner and the discovery of morphine. Pharm. Hist. 27, 61–74.
Sliwoski, G., Kothiwale, S., Meiler, J., Lowe, E.W., 2014. Computational methods in drug discovery. Pharmacol. Rev. 66, 334–395.
Song, C.M., Lim, S.J., Tong, J.C., 2009. Recent advances in computer-aided drug design. Brief. Bioinform. 10, 579–591.
Svetnik, W., Liaw, A., Tong, C., Culberson, J.C., Sheridan, R.P., Feuston, B.P., 2003. Random forest: a classification and regression tool for compound classification and QSAR modeling. J. Chem. Info. Model. 43, 1947–1958.
Tang, B., 1993. Orthogonal array-based latin hypercubes. J. Am. Stat. Assoc. 88, 1392–1397.
Tang, Y., Shang, Q., Xiang, J., Yang, Q., Zhou, Q., Li, L., Zhang, H., Li, Q., Sun, H., Guan, A., Jiang, W., Gai, W., 2012. Integration of screening and identifying ligand(s) from medicinal plant extracts based on target recognition by using NMR spectroscopy. Protoc. Exchange. https://doi.org/10.1038/protex.2012.060.
Wang, S.-C., 2003. Artificial neural network. In: Interdisciplinary Computing in Java Programming, part of the Springer International Series in Engineering and Computer Science, vol. 743. Springer-Verlag, New York, NY, pp. 81–100.

Wang, Y., Wang, X., Cheng, Y., 2006. A computational approach to botanical drug design by modeling quantitative composition-activity relationship. Chem. Biol. Drug Des. 68, 166–172.
Wang, S.-Q., Du, Q.-S., Zhao, K., Li, A.-X., Wei, D.-Q., Chou, K.-C., 2007. Virtual screening for finding natural inhibitor against cathepsin-L for SARS therapy. Amino Acids 33, 129–135.
Wang, M., Carver, J.J., Phelan, V.V., Sanchez, L.M., Garg, N., Peng, Y., Nguyen, D.D., Watrous, J., Kapono, C.A., Luzzatto-Knaan, T., Porto, C., Bouslimani, A., Melnik, A.V., Meehan, M.J., Liu, W.-T., Crusemann, M., Boudreau, P.D., Esquenazi, E., Sandoval-Calderon, M., Kersten, R.D., Pace, L.A., Quinn, R.A., Duncan, K.R., Hsu, C.-C., Floros, D.J., Gavilan, R.G., Kleigrewe, K., Northen, T., Dutton, R.J., Parrot, D., Carlson, E.E., Aigle, B., Michelsen, C.F., Jelsbak, L., Sohlenkamp, C., Pevzner, P., Edlund, A., Mclean, J., Piel, J., Murphy, B.T., Gerwick, L., Liaw, C.-C., Yang, Y.-L., Humpf, H.-U., Maansson, M., Keyzers, R.A., Sims, A.C., Johnson, A.R., Sidebottom, A.M., Sedio, B.E., Klitgaard, A., Larson, C.B., Boya, P.C.A., Torres-Mendoza, D., Gonzalez, D.J., Silva, D.B., Marques, L.M., Demarque, D.P., Pociute, E., O'neill, E.C., Briand, E., Helfrich, E.J.N., Granatosky, E.A., Glukhov, E., Ryffel, F., Houson, H., Mohimani, H., Kharbush, J.J., Zeng, Y., Vorholt, J.A., Kurita, K.L., Charusanti, P., Mcphail, K.L., Nielsen, K.F., Vuong, L., Elfeki, M., Traxler, M.F., Engene, N., Koyama, N., Vining, O.B., Baric, R., Silva, R.R., Mascuch, S.J., Tomasi, S., Jenkins, S., Macherla, V., Hoffman, T., Agarwal, V., Williams, P.G., Dai, J., Neupane, R., Gurr, J., Rodriguez, A.M.C., Lamsa, A., Zhang, C., Dorrestein, K., Duggan, B.M., Almaliti, J., Allard, P.-M., Phapale, P., 2016. Sharing and community curation of mass spectrometry data with global natural products social molecular networking. Nat. Biotechnol. 34, 828–837.
Wessjohann, L.A., 2000. Synthesis of natural-product-based compound libraries. Curr. Opin. Chem. Biol. 4, 303–309.
Wink, M., 2015. Modes of action of herbal medicines and plant secondary metabolites. Medicines 2, 251–286.
Yea, S.J., Seong, B., Jang, Y., Kim, C., 2016. A data mining approach to selecting herbs with similar efficacy: targeted selection methods based on medical subject headings (MeSH). J. Ethnopharmacol. 182, 27–34.
Zhang, W., 2016. Network pharmacology: a further description. Network Pharmacol. 1, 1–14.

This page intentionally left blank

Chapter 3

Optimization of Extraction Using Mathematical Models and Computation

Anup K. Das*, Saikat Dewanjee†

*ADAMAS University, Kolkata, India; †Jadavpur University, Kolkata, India

Chapter Outline
3.1. Introduction
3.2. Fundamentals of Design of Experiments
3.2.1 Planning Phase
3.2.2 Designing Phase
3.3. DoE-Based Optimization of MAE Process
3.4. DoE-Based Optimization of Supercritical Fluid Extraction Process
3.5. DoE-Based Optimization of Accelerated Solvent Extraction Process
3.6. Conclusions
References

3.1. INTRODUCTION

Bioactive plant extracts, mainly from medicinal herbs, have been used to treat various ailments since time immemorial. Herbal drugs are used not only as single components (monoherbal) but also in combination as polyherbal formulations, which are complex mixtures of several herbs and chemical entities in a defined ratio (Budovsky et al., 2016). Developing a robust technique for isolating bioactive principles from crude drugs involves multiple objectives. In the past, such tasks were carried out by trial and error, guided by the prior experience, knowledge, and wisdom of the operator. However, systematic optimization of the extraction process is essential for maintaining the quality of any herbal product. Optimizing an extraction process consumes considerable time, energy, and resources. In return, systematic optimization can resolve specific problems arising from process intensification and also ensures an optimum outcome.

Extraction of phytochemicals is a multifactorial process that involves different mechanisms and unit operations with a high degree of variability. Such processes customarily comprise many parameters or variables, and as a result, the relationships between the input parameters and final product quality are difficult to ascertain. The task becomes more difficult still because of the complex chemical composition of the phytochemicals biosynthesized by plants. Unfortunately, in most cases, the extraction and isolation of phyto-drugs rely on the prior experience of the operator, and thus the effects of the various input factors on product quality are not well defined. Consequently, the attributes of the final products often suffer from high variability. Therefore, a thorough understanding of the relationships between the input factors and product quality is essential for improving extraction performance and delivering products of acceptable quality to end users. This chapter presents an overview of extraction techniques for botanicals that have been optimized using design of experiments (DoE). It also summarizes the main features of experimental design and the steps involved in its practice.

Computational Phytochemistry. https://doi.org/10.1016/B978-0-12-812364-5.00003-1 © 2018 Elsevier Inc. All rights reserved.

3.2. FUNDAMENTALS OF DESIGN OF EXPERIMENTS

In any manufacturing process, it is often of interest to explore the relationships between the main input factors and the output or quality characteristics. For example, in a phytochemical isolation operation, the nature of the solvent, the particle size of the crude drug, the duration of extraction, and the number of extraction cycles can be treated as input factors, and the yield of the product can be considered as an output characteristic. Process intensification in botanical extraction can be achieved when the relationship between the final product, i.e. the outcome (y), and all the sources of variation or input factors (x) in the manufacturing process is understood. The outcome (y) can be described as a function of the input factors, as represented in Fig. 3.1. Depending on the degree of predictability, this function can take different forms: a 'black box' with negligible predictive power, a 'gray box' with moderate predictive power, or a 'white box' with total predictive power (Acharya and Pandya, 2013). The outcome (y) is defined in Eq. (3.1):

$$y = f(x_1, x_2, \ldots, x_n) \qquad (3.1)$$

The crucial sources of variation (δ²) that may affect an isolation process can be attributed to the following: features of the raw material (RM), viz. raw material sourcing, raw material handling prior to extraction, plant part used, and adulteration of the raw material, if any; extraction-related factors (EF), e.g., temperature during extraction, energy consumed, etc.; machinery (MACH) used during extraction, e.g., Soxhlet, microwave, supercritical fluid extractor, ultrasound, etc.; scale (SC), e.g., preparative or analytical scale; environmental factors (EVN), e.g., moisture level, humidity, and temperature; and the human resources (HR) involved. The total variation can therefore be represented as shown in Eq. (3.2).

FIG. 3.1 Relationship between final outcome (y) and all the sources of variation or input factors (x) in the manufacturing process. [Schematic: input factors (x1, x2, …, xn) and uncontrolled variables feed a function f that gives the output or product quality (y); black-box, grey-box, and white-box models correspond to negligible, moderate, and total predictive power, respectively.]

$$\delta^2(\mathrm{Total}) = \delta^2(\mathrm{RM}) + \delta^2(\mathrm{EF}) + \delta^2(\mathrm{MACH}) + \delta^2(\mathrm{SC}) + \delta^2(\mathrm{EVN}) + \delta^2(\mathrm{HR}) \qquad (3.2)$$
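As a rough illustration of Eq. (3.2), the short Python sketch below adds up a set of variance components and reports each source's share of the total. The numerical values are hypothetical and chosen only for illustration, and the simple addition assumes that the sources of variation act independently of one another.

```python
# Hypothetical variance components (squared units of yield); illustrative only,
# not taken from the chapter. Keys follow the abbreviations used in Eq. (3.2).
variance_components = {
    "RM": 4.0,    # raw material
    "EF": 2.5,    # extraction-related factors
    "MACH": 1.0,  # machinery
    "SC": 0.5,    # scale
    "EVN": 1.5,   # environment
    "HR": 0.5,    # human resources
}

# Eq. (3.2): the total variance is the sum of the individual components
# (this assumes the sources are independent).
total_variance = sum(variance_components.values())

# Fraction of the total variance contributed by each source.
shares = {src: var / total_variance for src, var in variance_components.items()}

# The source with the largest share is the first candidate for tighter control.
largest_source = max(shares, key=shares.get)
```

With these invented numbers the raw material accounts for the largest share of the total variance, so it would be the first source to examine.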

Thus, the goal of extraction process development is primarily to predict how variations in the input factors (x) will influence the outcome (y) and, secondarily, to regulate these factors to improve final product quality. The challenge before any herbal industry is therefore to identify which input variables or factors affect the process. When analysing a process, designed experiments are often performed to evaluate which input factors have a substantial impact on the final outcome and what levels of those factors should be applied to achieve a desired output. The objectives of a designed experiment may include the following:

1. To determine the factors that are most influential on the outcome (y).
2. To determine where to set the influential input factors (x) so that y remains near the desired nominal value with minimum variation.
3. To determine where to set the influential input factors (x) to control the variation, if any, due to uncontrolled variables.

In general, the goal of the experimenter is to determine the impact of these factors on the output response. The general approach to planning and conducting the experiments is called the experimental strategy. One of the common strategies still used by many scientists in manufacturing is One-Variable-At-a-Time (OVAT), in which a single factor is varied while all other factors are kept constant. The success of this approach depends on the luck, experience, and intuition of the experimenter. In addition, it requires heavy financial investment to obtain limited information about the outcome. Such an approach is usually unreliable, inefficient, and time-consuming, and may indicate false optimal conditions for the process. The main drawback of the OVAT strategy is that it does not take into account possible interactions between factors. Interactions between factors are common, and when they are ignored, the conditions selected can lead to an undesirable outcome. Despite these drawbacks, OVAT experiments are still in use, although they are always less efficient than methods based on statistical designs.

Statistical methods play important roles in planning, conducting, analysing, and interpreting data from any experiment. When several variables influence a certain characteristic of a product, the best strategy is to design an experiment so that valid, reliable, and sound conclusions can be drawn effectively, efficiently, and economically. In a designed experiment, the experimenter makes intentional changes to the input factors and then determines the effect on the final outcome. It is important to note that not all factors affect the outcome in the same way. Some factors may have a significant impact on the output, some a moderate impact, and some none at all. Therefore, the goal of a designed experiment is to identify the important factors and to determine their optimal levels for an effective outcome. To draw sound conclusions from the experiment, simple yet powerful statistical methods must be applied. The success of any industrial designed experiment relies on well-defined planning, selection of an appropriate design, data analysis, and teamwork. Experimental design encompasses the planning, design, and analysis of experiments so that effective and objective conclusions can be drawn, as shown in Fig. 3.2.
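The weakness of OVAT in the presence of an interaction can be sketched with a small Python example. The yield table below is hypothetical and was invented only to contain a strong interaction between two coded factors; the function names are likewise illustrative and do not come from any particular software package.

```python
from itertools import product

# Hypothetical yields (arbitrary units) for two coded factors A and B.
# The numbers are invented to contain an interaction; they are not from the chapter.
yield_table = {(-1, -1): 5.0, (+1, -1): 4.0, (-1, +1): 4.0, (+1, +1): 9.0}


def ovat_search(table, start=(-1, -1)):
    """Vary one factor at a time, keeping a change only if the yield improves."""
    best = list(start)
    for i in range(len(best)):
        trial = best.copy()
        trial[i] = -trial[i]  # move factor i to its other level
        if table[tuple(trial)] > table[tuple(best)]:
            best = trial
    return tuple(best)


def factorial_search(table, k=2):
    """Evaluate every combination of the k two-level factors."""
    return max(product((-1, +1), repeat=k), key=lambda run: table[run])


ovat_best = ovat_search(yield_table)            # stays at (-1, -1), yield 5.0
factorial_best = factorial_search(yield_table)  # finds (+1, +1), yield 9.0
```

In this invented case, changing either factor alone lowers the yield, so OVAT stops at the starting conditions, whereas testing all combinations reveals that raising both factors together gives the highest yield.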

3.2.1 Planning Phase

The planning phase is made up of the following steps.

1. Problem identification: A clear understanding of the problem in hand helps to establish what needs to be done during manufacturing. The problem statement should contain measurable outcomes that add real economic value to the company. Issues related to manufacturing that can be resolved using experimental methods include development of new products or improvement of existing products, intensification of an existing process, and improvement of product performance in line with market demand. After determining the experimental objective, a team can be formed. The team may include experts, process engineers, quality engineers, a floor operator, and a nominee from the management.
2. Response or outcome selection: The choice of an appropriate response is critical for the success of any designed experiment. The response can essentially be a variable or an attribute. Variable responses such as duration of extraction, strength of the solvent used, particle size of the raw material, etc. generally provide more information than attribute responses such as good/bad, pass/fail, or yes/no.
3. Process parameter selection: The process parameters should be selected carefully after in-depth deliberation based on prior knowledge of the process, historical data, cause-and-effect analysis, and brainstorming sessions. This is an important step in designing the experiment. Running screening experiments in the first phase of any experimental survey is good practice for determining the most important design parameters or factors. These are discussed in depth in the subsequent sections.

FIG. 3.2 Different phases involved in Design of Experiment. [Flow chart: planning phase (identify the problem, response selection, selection of parameters); designing phase (screening designs: factorial designs, OVAT; optimizing designs: response surface methodology); conducting phase (planned experiments are carried out and the results are evaluated); analysing phase (results are interpreted so that valid and sound conclusions can be derived).]

3.2.2 Designing Phase

At this phase, the most suitable design for the experiment is selected. The size of the experiment depends on the number of factors and/or interactions to be studied, the number of levels per factor, and the budget and resources available for carrying out the experiment. Screening is used to reduce the number of process factors by identifying the significant ones that affect quality or process performance. This reduction helps to focus on the few important factors, i.e. to separate the vital few from the trivial many, as shown in Fig. 3.3. It is good practice to have the design ready before commencing the experiment. The design matrix usually shows all the settings of the factors at their different levels and the order in which the individual experiments are to be run. The purpose of a screening design is to identify and separate those factors that demand further investigation, i.e. those to be taken forward to optimization.

The designing phase is generally divided into two segments: the screening phase and the optimization phase. When novel extraction strategies are developed, two situations may arise in which experimental design becomes necessary. The first, called the screening phase, is to screen the vital few factors that are expected to have a significant effect on the final outcome. The second, the optimization phase, is to optimize systematically the selected significant factors to obtain optimal solutions.

FIG. 3.3 A typical representation of screening steps to screen out the vital few from trivial many. [Schematic: eight candidate factors (sample: solvent ratio, particle size of drug, solvent strength, soaking time, extraction time, solvent concentration, extraction temperature, instrument power) are screened down to four (sample: solvent ratio, solvent concentration, extraction time, instrument power).]
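A design matrix of the kind described above can be sketched in a few lines of Python. The factor names and levels below are hypothetical placeholders, and the randomization of the run order with a fixed seed is an assumption made for this sketch (a common precaution, but not a step prescribed by the text).

```python
import random
from itertools import product

# Hypothetical factors with their low (-1) and high (+1) settings; illustrative only.
factors = {
    "microwave_power_W": {-1: 10, +1: 50},
    "extraction_time_min": {-1: 5, +1: 10},
}


def build_run_sheet(factors, seed=0):
    """List every level combination and assign a randomized execution order."""
    names = list(factors)
    coded_runs = list(product((-1, +1), repeat=len(names)))
    rng = random.Random(seed)  # fixed seed so the sheet can be regenerated
    rng.shuffle(coded_runs)    # assumption: runs are executed in random order
    sheet = []
    for order, coded in enumerate(coded_runs, start=1):
        settings = {name: factors[name][level] for name, level in zip(names, coded)}
        sheet.append({"run_order": order, "coded": coded, "settings": settings})
    return sheet


run_sheet = build_run_sheet(factors)
```

Each entry of `run_sheet` records the order of execution, the coded levels, and the actual settings, which is the information a design matrix normally carries to the laboratory bench.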

3.2.2.1 Screening Phase

Initially, screening is carried out to determine the factors and interactions that have a significant influence on the final outcome. Based on these findings, optimization can proceed with the factors that were found to be significant during screening. Screening is performed using either a simple design (OVAT methodology) or a factorial design. In the former, one factor is tested at a time, whereas in the latter, all the factors are tested simultaneously. If k factors are studied at two levels [+1 (high level), −1 (low level)], a factorial design will consist of 2^k experiments, as shown in Box 3.1.

BOX 3.1 Number of experiments when k factors are studied at two levels

Factors (k)    No. of experiments (2^k)
2              4
3              8
4              16
5              32
6              64
10             1024

Let us consider a simple extraction example with two factors, viz. Factor A (XA): microwave power (Watts) at high level [+1] and low level [−1]; Factor B (XB): extraction time (min) at high level [+1] and low level [−1]. If two factors are taken at two different levels, four experiments (2^2 = 4) have to be carried out, giving four different outcomes (yield or extractive value). This means that an outcome (y) can be described as a function of the experimental factors; in a designed experiment, this function is called the transfer function (f), as mentioned earlier in Eq. (3.1). The four experiments to be conducted can be represented in the form of a general matrix containing all the level combinations, as shown in Box 3.2.

BOX 3.2 A general matrix containing all the level combinations of Factor A (XA) and Factor B (XB)

XA           XB           XA × XB
−1 (LOW)     +1 (HIGH)    −1
+1 (HIGH)    −1 (LOW)     −1
−1 (LOW)     −1 (LOW)     +1
+1 (HIGH)    +1 (HIGH)    +1

Each factor appears at its high level (+1) and at its low level (−1). Two types of effects can be deduced from the matrix in Box 3.2: the main effects of the factors (XA, XB) and their interaction effect (XA × XB). Each column of the matrix can be treated as a vector with four components (Box 3.3); for example, XA = (−1, +1, −1, +1). The columns are mutually orthogonal and hence linearly independent: if the element-wise products of two columns, e.g. XA and XB, are summed, the result is zero. The components of each individual column also sum to zero, as shown in Box 3.3. However, if each component of XA, XB, and XA × XB is squared (the concept behind the second-order model) and the squares are added up, a value of four (4) is obtained, as shown in Box 3.3. These basic properties of the factors (XA and XB) determine how their coefficients are estimated in the steps that follow.

BOX 3.3 Representation of XA, XB, XA × XB as column vectors

Component         XA    XB    XA × XB
1                 −1    +1    −1
2                 +1    −1    −1
3                 −1    −1    +1
4                 +1    +1    +1
Sum               0     0     0
Sum of squares    4     4     4

$$\sum_{i=1}^{4} X_{A,i} = 0, \qquad \sum_{i=1}^{4} X_{B,i} = 0, \qquad \sum_{i=1}^{4} X_{A,i} X_{B,i} = 0$$

$$\sum_{i=1}^{4} X_{A,i}^2 = 4, \qquad \sum_{i=1}^{4} X_{B,i}^2 = 4, \qquad \sum_{i=1}^{4} (X_{A,i} X_{B,i})^2 = 4$$
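The run counts of Box 3.1 and the column properties of Box 3.3 can be checked with a short Python sketch. The helper names `full_factorial` and `dot` are illustrative, and the runs are generated in a different order from Box 3.2, although the set of runs is the same.

```python
from itertools import product


def full_factorial(k):
    """Return all 2**k runs of a two-level design in coded units (-1, +1)."""
    return list(product((-1, +1), repeat=k))


def dot(u, v):
    """Sum of the element-wise products of two columns."""
    return sum(p * q for p, q in zip(u, v))


runs = full_factorial(2)             # four runs for two factors
x_a = [a for a, b in runs]           # column for factor A
x_b = [b for a, b in runs]           # column for factor B
x_ab = [a * b for a, b in runs]      # interaction column A x B

column_sums = [sum(x_a), sum(x_b), sum(x_ab)]                      # each is 0
cross_products = [dot(x_a, x_b), dot(x_a, x_ab), dot(x_b, x_ab)]   # each is 0
squares = [dot(x_a, x_a), dot(x_b, x_b), dot(x_ab, x_ab)]          # each is 4

# Number of runs for the factor counts listed in Box 3.1.
run_counts = {k: len(full_factorial(k)) for k in (2, 3, 4, 5, 6, 10)}
```

The zero cross-products express the orthogonality of the columns, which is what later allows each coefficient to be estimated independently of the others.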

Coefficients are the effects of each factor. By interpreting the results, it is possible to tell which factor or factor combination has more influence on the outcome. Let us now look into the relationship of the outcome (y) with XA and XB. Like XA, XB, and XA × XB, the outcome (y) has four components because four experiments are conducted. Therefore, the fitted relationship can support only four parameters in this context. The simplest relationship that can be considered is a linear one, in which the outcome (y) is represented as shown in Eq. (3.3):

$$y = \beta_0 + \beta_A X_A + \beta_B X_B + \beta_{AB} X_A X_B \qquad (3.3)$$

where β0 is the intercept (in this coded design, the mean of the four outcomes), βA is the coefficient of factor XA, βB is the coefficient of factor XB, and βAB is the coefficient of the interaction (XA XB).

The above equation represents a linear relationship between y and the factors XA, XB, and their interaction (XA × XB). The next step is to find the values of the intercept and of the coefficient of each factor. Let us take the example of a microwave-assisted botanical extraction process, where the yield is represented as y (mg/kg of raw material), the microwave power (Watts) as XA, and the extraction time (min) as XB. More factors could be included, but to keep things simple only two are considered here. The high and low values specified for each factor are shown in Box 3.4.

BOX 3.4 Representation of high values and low values for factors XA, XB

Factor                                    High (+1)    Low (−1)
Factor A (XA): instrumental power (W)     50 W         10 W
Factor B (XB): extraction time (min)      10 min       5 min

Model: y = b0 + bA XA + bB XB + bAB XA XB

XA    XB    y (yield)    Equation
−1    +1    y1 = 1.5     1.5 = b0 − bA + bB − bAB    Eq. (1)
+1    −1    y2 = 8.2     8.2 = b0 + bA − bB − bAB    Eq. (2)
−1    −1    y3 = 2.0     2.0 = b0 − bA − bB + bAB    Eq. (3)
+1    +1    y4 = 3.5     3.5 = b0 + bA + bB + bAB    Eq. (4)

Adding all four equations:
15.2 = 4b0, so b0 = 3.8

Changing the sign of Eq. (1) and Eq. (3) and then adding all four equations:
−1.5 = −b0 + bA − bB + bAB
 8.2 =  b0 + bA − bB − bAB
−2.0 = −b0 + bA + bB − bAB
 3.5 =  b0 + bA + bB + bAB
 8.2 = 4bA, so bA = 2.05

Likewise, by solving the matrix, the values of all the coefficients are obtained: b0 = 3.8, bA = 2.05, bB = −1.3, bAB = −1.05.
• Coefficients are the effects of each parameter on the final outcome (yield).
• The larger the coefficient value, the larger the influence of the factor on the final outcome.
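The calculation in Box 3.4 can be reproduced with a minimal Python sketch. It relies on the orthogonality of the coded columns, so that each coefficient is the sum of (column value × yield) divided by the number of runs; the function names `fit_two_factor_model` and `predict` are illustrative only.

```python
# Coded settings and yields copied from Box 3.4: (x_A, x_B, y).
runs = [(-1, +1, 1.5), (+1, -1, 8.2), (-1, -1, 2.0), (+1, +1, 3.5)]


def fit_two_factor_model(runs):
    """Estimate b0, bA, bB, and bAB for y = b0 + bA*xA + bB*xB + bAB*xA*xB.

    Because the coded columns are orthogonal and their squares sum to the
    number of runs, each coefficient is a signed average of the yields.
    """
    n = len(runs)
    b0 = sum(y for _, _, y in runs) / n
    b_a = sum(a * y for a, _, y in runs) / n
    b_b = sum(b * y for _, b, y in runs) / n
    b_ab = sum(a * b * y for a, b, y in runs) / n
    return b0, b_a, b_b, b_ab


def predict(coeffs, a, b):
    """Evaluate the fitted model at coded settings (a, b)."""
    b0, b_a, b_b, b_ab = coeffs
    return b0 + b_a * a + b_b * b + b_ab * a * b


coeffs = fit_two_factor_model(runs)
rounded = tuple(round(c, 2) for c in coeffs)  # (3.8, 2.05, -1.3, -1.05)
```

Since four coefficients are fitted to four runs, `predict` returns the measured yields exactly at the four design points; the model therefore carries no information about experimental error unless runs are replicated.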

Using the matrix shown in Box 3.4, four equations can be written from the linear relationship. Performing the four experiments gives four outcomes, y (yield values, for example y1 = 1.5, y2 = 8.2, y3 = 2.0, y4 = 3.5), as shown in Box 3.4. Solving these four equations gives the values of β0, βA, βB, and βAB, which are +3.8, +2.05, −1.3, and −1.05, respectively. The larger the absolute value of a coefficient, the larger is the influence of the corresponding factor on the final outcome. When only a general idea of the process in hand is required, a first-order polynomial model as shown in Eq. (3.3) can be applied. This model describes the response only around the current factor settings and gives no information about the zone of optimization. To gain knowledge about the optimum zone, it is necessary to introduce squared terms, as mentioned earlier; the preferred model is then a quadratic (second-order) model. The various types of screening designs generally used in herbal extraction are presented below.

Full Factorial Design (2^k)

In a full factorial design (FFD), the effects of all the factors and their interactions on the outcome(s) are investigated. A common experimental design is one in which all input factors are set at two levels each. These levels are termed high and low, or +1 and −1, respectively. A design with all possible high/low combinations of all the input factors is termed a full factorial design in two levels. If there are k factors, each at two levels, a full factorial design will have 2^k runs, as mentioned earlier.

FIG. 3.4 Representation of a 2^3 design as a cube. [Cube with axes X1, X2, and X3; the corners are numbered 1 to 8.]

As shown in Box 3.1, when the number of factors is more than five, a full factorial design requires a large number of experimental runs and is not efficient. Therefore, a fractional factorial design or a Plackett-Burman design (PBD) is a better choice for five or more factors; both are discussed in the following sections. A full factorial design for three input factors, each at two levels (a 2^3 design), has eight runs. Graphically, the 2^3 design can be denoted by a cube, as shown in Fig. 3.4. The arrows show the direction of increase of the factors. The numbers 1 through 8 at the corners of the design box represent the standard order of runs (Fig. 3.4). In tabular form, this design can be represented as shown in Table 3.1. The left-hand column of Table 3.1, numbered 1 to 8, gives the standard run order. These numbers are also depicted in Fig. 3.4. For example, run 8 is made at the high setting of all three factors.

TABLE 3.1 A 2^3 Two-Level, Full Factorial Design Table Showing Runs in 'Standard Order'

Run    Pattern    X1    X2    X3
1      − − −      −1    −1    −1
2      + − −      +1    −1    −1
3      − + −      −1    +1    −1
4      + + −      +1    +1    −1
5      − − +      −1    −1    +1
6      + − +      +1    −1    +1
7      − + +      −1    +1    +1
8      + + +      +1    +1    +1

Fractional Factorial Design (2^(k−p))

In its simplest form (p = 1), a fractional factorial design can be considered as one half of a full factorial design. Fractional factorial designs offer a reduced number of experiments without losing a lot of information. Let us put some observation values (y1 to y8) in Table 3.2 and see why not all eight runs are needed.

TABLE 3.2 A 2^3 Two-Level, Full Factorial Design Table Showing Runs in Standard Order Plus Observations (y)

Run    Pattern    X1    X2    X3    y
1      − − −      −1    −1    −1    y1 = 43
2      + − −      +1    −1    −1    y2 = 73
3      − + −      −1    +1    −1    y3 = 51
4      + + −      +1    +1    −1    y4 = 67
5      − − +      −1    −1    +1    y5 = 67
6      + − +      +1    −1    +1    y6 = 61
7      − + +      −1    +1    +1    y7 = 69
8      + + +      +1    +1    +1    y8 = 63

The right-most column of Table 3.2 lists y1 through y8, the outcomes measured for the experimental runs when listed in standard order. For example, y1 is the response observed when the three factors were all run at their 'low' setting (−1). The values in the 'y' column are used to calculate the main effects of the factors. From the entries in Table 3.2, it is possible to compute all effects, such as main effects, first-order interaction effects, etc. For example, the main effect estimate c1 of factor X1 is the average response of all runs with X1 at the high setting (+1), namely (1/4)(y2 + y4 + y6 + y8), minus the average response of all runs with X1 at the low setting (−1), namely (1/4)(y1 + y3 + y5 + y7). That is,

$$c_1 = \tfrac{1}{4}(y_2 + y_4 + y_6 + y_8) - \tfrac{1}{4}(y_1 + y_3 + y_5 + y_7) = \tfrac{1}{4}(73 + 67 + 61 + 63) - \tfrac{1}{4}(43 + 51 + 67 + 69) = 66 - 57.5 = 8.5$$
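The main-effect calculation above can be sketched in Python using the observations of Table 3.2. The function name `main_effect` is illustrative; the effects of X2 and X3 are obtained in exactly the same way as c1.

```python
from itertools import product

# Observations y1..y8 from Table 3.2, listed in standard order.
y = [43, 73, 51, 67, 67, 61, 69, 63]

# Standard order: X1 changes fastest and X3 slowest, as in Table 3.1.
design = [(x1, x2, x3) for x3, x2, x1 in product((-1, +1), repeat=3)]


def main_effect(design, y, factor):
    """Mean response at the high level minus mean response at the low level."""
    high = [obs for run, obs in zip(design, y) if run[factor] == +1]
    low = [obs for run, obs in zip(design, y) if run[factor] == -1]
    return sum(high) / len(high) - sum(low) / len(low)


# Main effects of X1, X2, and X3 (indices 0, 1, and 2).
effects = [main_effect(design, y, j) for j in range(3)]  # [8.5, 1.5, 6.5]
```

With these observations, X1 has the largest main effect (8.5), followed by X3 (6.5) and X2 (1.5).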

FIG. 3.5 Representation of unshaded corners of the design cube. [Cube with axes X1, X2, and X3; the corners are numbered 1 to 8.]

Suppose there are not enough resources in hand to perform eight runs. Is it still possible to estimate the main effect of X1? The answer is yes. For example, suppose only the four light (unshaded) corners of the design cube are selected, as shown in Fig. 3.5. Using these four runs (1, 4, 6, and 7), c1 can still be computed as follows:

$$c_1 = \tfrac{1}{2}(y_4 + y_6) - \tfrac{1}{2}(y_1 + y_7) = \tfrac{1}{2}(67 + 61) - \tfrac{1}{2}(43 + 69) = 64 - 56 = 8$$

In either case, a value for the main effect (c1) of X1, and likewise for the other factors, is obtained without losing much information: the two estimates, 8.5 and 8, are close to each other.

Plackett-Burman Design

PBD is a particular type of fractional factorial design, which assumes that interactions can be completely ignored so that the main effects can be calculated with a reduced number of experiments. Up to n factors can be screened in an 'n + 1'-run PBD. A distinctive feature is that the number of runs is a multiple of four rather than a power of two (4k observations, with k = 1, 2, …, n). PBD is typically proposed when more than seven factors are to be screened: designs with 8, 12, 16, 20, etc. runs are suitable for studying up to 7, 11, 15, 19, etc. factors, respectively. Such designs are also known as saturated designs; they allow efficient estimation of the main effects on the assumption that interaction effects are negligible. An example of PBD with 12 runs and 11 factors is presented in Table 3.3.

TABLE 3.3 An Example of a 12-Run Plackett-Burman Design

Trial    X1    X2    X3    X4    X5    X6    X7    X8    X9    X10    X11    Response
1        +     +     −     +     +     +     −     −     −     +      −      y1
2        +     −     +     +     +     −     −     −     +     −      +      y2
3        −     +     +     +     −     −     −     +     −     +      +      y3
4        +     +     +     −     −     −     +     −     +     +      −      y4
5        +     +     −     −     −     +     −     +     +     −      +      y5
6        +     −     −     −     +     −     +     +     −     +      +      y6
7        −     −     −     +     −     +     +     −     +     +      +      y7
8        −     −     +     −     +     +     −     +     +     +      −      y8
9        −     +     −     +     +     −     +     +     +     −      −      y9
10       +     −     +     +     −     +     +     +     −     −      −      y10
11       −     +     +     −     +     +     +     −     −     −      +      y11
12       −     −     −     −     −     −     −     −     −     −      −      y12

A case study by Das et al. (2013) is presented here, in which PBD was used to screen the factors that contribute significantly to an effective, rapid, and environmentally friendly microwave-assisted extraction (MAE) strategy for the industrial scale-up of lupeol using response surface methodology (RSM). Initially, a PBD matrix, as shown in Table 3.4, was used to determine the most significant extraction factors among microwave power, irradiation time, particle size, solvent: sample ratio, solvent strength, and soaking time. In addition, five dummy factors were used to estimate the experimental error in the design (Vander et al., 1995). The corresponding outcomes of the 12 experiments are shown in Table 3.4. The adequacy of the model was calculated, and the variables showing statistically significant effects were screened via regression analysis (Table 3.5). Among the six extraction parameters studied (microwave power, irradiation/extraction time, solvent composition, particle size, solvent: sample/loading ratio, and soaking time), three parameters (microwave power, irradiation/extraction time, and solvent: sample ratio) had a significant influence on lupeol extraction, as evidenced from the chart (Fig. 3.6) by their P values (P […]

[…] 95%);
8. standard spectroscopic methods (software-aided), MS, NMR, UV-Vis, and IR, for identification of purified compounds for the library. Various structure prediction and spectral data interpretation computational tools (high-throughput structure elucidation) can be used to accelerate the process of structure elucidation of purified compounds (see Chapter 7). A high-throughput structure elucidation protocol may include computer-assisted search of in-house databases, search of commercially available phytochemical databases, and comprehensive structure elucidation of new compounds (Bindseil et al., 2001);
9. labelling and storing of all purified compounds in the library for appropriate dispensing as required by the HTS.
To ensure reproducibility and traceability, appropriate and effective data and process management software has to be used (Chan and Hueso-Rodriguez, 2002). This can cover simple matters such as automated electronic labelling of all extracts, chromatographic fractions, and purified compounds, as well as full management of the processes involved (Fig. 5.6). It is also important to decide on the minimum amount of any individual compound to be included in the library, and the amount should be usable in any small- to medium-sized HTS operation. Ideally, a minimum amount of 10 mg of each compound with a purity of >95% could be used as a threshold. Compound library management is a difficult and demanding process involving the use of sophisticated equipment and databases, and extensive computational support. Proper management of a compound library is fundamental to the success of any HTS process leading to lead generation.

FIG. 5.6 A typical library management process. Based on Chan, J.A., Hueso-Rodriguez, J.A., 2002. Compound library management. Methods Mol. Biol. 190, 117–127. [Flow chart elements: registration of purified compounds; inclusion in the database/library; dispensing; stock plates (96-well); confirmation vials; screening plates (384- or 1536-well); dose response or secondary assays; HTS; pharmacology team.]

During the production of phytochemical libraries, various online searchable databases and software can help in the selection of plant materials (see Chapter 2). Optimization of extraction and isolation protocols is essential, and this process involves the use of mathematical modelling and computational approaches (see Chapters 1, 3, 4 and 7). Phytochemical databases may include partially or fully identified compounds with certain amounts of physical and spectroscopic data, e.g., UV–Vis, MS, and 1H NMR data, which also help in the dereplication process using various online databases. Martin et al. (2014) used a metabolomics approach based on mathematical modelling to evaluate solvent extraction systems and chemical fingerprinting on a set of botanical extracts, comparing the extraction efficiency of different solvents to inform the construction of phytochemical libraries. As in any other compound library, compounds in a phytochemical library can be stored in two different ways: as neats (generally as solids) and as solutions. If the isolated amounts allow, compounds should ideally be stored in both ways, i.e. as solids and as solutions.

158  Computational Phytochemistry

Phytochemicals in solutions (usually in DMSO) in plates are more suitable for HTS, despite the neats being more stable under long-term storage conditions. Solids in pre-tared bar-coded vials are normally stored at room temperature, usually in automated stores, but they can also be stored at a lower temperature, e.g., 4oC, to enhance stability. They should be easily retrievable for confirmation of activity, secondary testing, and structural verification. Compounds in solution are stored in plates or tubes, typically at two different concentrations. On the other hand, phytochemicals in solution (usually 5–10 mM) are stored in 96-well plates at low temperature (4°C to −20°C) for preparing lower concentration plates for HTS, long-range storage, and used as a repository and/or for dose-response experiments. For phytochemical libraries comprising standardized extracts or fractions, the process is quite similar as outlined above, just substituting neat or purified phytochemicals by dried extracts or fractions. In this case, instead of molar concentration, the concentration of extracts or fractions are generally expressed in mg/mL (normally, 1–10 mg/mL). Storage facilities used for phytochemical libraries can be quite variable, ranging from shelves in a cold room to individual fridge or freezer set at a constant temperature. Liquid handling or dispensation of phytochemical libraries in the formats demanded by any HTS platform is a critical component in the management process (Chan and Hueso-Rodriguez, 2002), so is the creation of an inventory database, managed by informatics tools. Among other trivial information, any inventory database should be linked to information on selection of plants and taxonomy, geographical origins, collection and processing of plant materials, extraction, isolation, identification, spectral data, internal identification number, and quality control (QC) data. 
The inventory database can help in tracking phytochemicals on a vial, plate, or well basis by assigning a barcode to each container. Throughout the whole process of phytochemical library generation and management, a high level of informatics and various computational tools have to be applied. Herrath et al. (2017) reported the screening of a small, well-curated natural product-based library of 400 compounds from plants, fungi, and marine organisms using a phenotypic assay, and identified two rotenoids with potent nematocidal properties. In this compound library, purchased from Compounds Australia (www.griffith.edu.au/science-aviation/compounds-australia), the compounds (purity: >95%) were supplied at a concentration of 5 mM in DMSO. Lin et al. (2017) produced a phytochemical library comprising 67 individual compounds (20 mM in DMSO), mainly of various phenolic types, e.g., flavonoids, isoflavonoids, and chalcones, following a straightforward extraction-isolation-identification protocol, assisted by various computational aids, from an ethyl acetate extract of the roots and rhizomes of Glycyrrhiza inflata. Virtual phytochemical library construction, however, does not involve any actual extraction, isolation, and structure determination process, but utilizes various published databases for data compilation. Pathania et al. (2012) reported


the construction of a phytochemical library, named Phytochemica, which is a structured compilation of 963 phytochemicals and includes their sources, chemical classification, IUPAC names, SMILES notations, physicochemical properties, and three-dimensional structures with associated references. It also offers a refined search option to explore the neighbouring chemical space against the ZINC database (Irwin and Shoichet, 2005) to identify analogues of phytochemicals. This database is intended for virtual screening and in silico studies, and its construction was based entirely on various phytochemical databases and the use of computational modelling to create a searchable library. To obtain an extensive collection of compounds from five selected medicinal plants, Atropa belladonna, Catharanthus roseus, Heliotropium indicum, Picrorhiza kurroa, and Podophyllum hexandrum, data were manually compiled from the literature and various web resources. Computational modelling was used to establish the database architecture and web interface. Four data tables were created to store the compiled data. Phytochemica used MySQL, a relational database management system, as its backend. The web browser interface, which connects to the MySQL backend, was created using HTML, CSS, Ajax, JavaScript, and jQuery. A JMol visualizer (http://www.jmol.org/) and the ZINC database (http://zinc.docking.org/) were incorporated in the Graphical User Interface (GUI) to provide 3D visualization and a percentage similarity search against ZINC, respectively. The GUI was designed to be user-friendly for data query and extraction and was tested in all major browsers (Chrome, Firefox, Safari, and Internet Explorer) and OS platforms. Byler et al. (2016) created an in-house virtual phytochemical library to perform in silico screening of phytochemicals for possible anti-Zika virus properties.
In this study, they generated structures of the ZIKV protease, methyltransferase, and RNA-dependent RNA polymerase using homology-modelling techniques and carried out molecular docking analyses of their in-house virtual phytochemical library of about 2300 plant secondary metabolites against these protein targets as well as against ZIKV helicase. A similar virtual screening for antiviral properties against dengue virus was reported by Powers and Setzer (2016), using a virtual phytochemical library composed of the structures of 290 alkaloids (68 indole alkaloids, 153 isoquinoline alkaloids, 5 quinoline alkaloids, 13 piperidine alkaloids, 14 steroidal alkaloids, and 37 miscellaneous alkaloids), 678 terpenoids (47 monoterpenoids, 169 sesquiterpenoids, 265 diterpenoids, 81 steroids, and 96 triterpenoids), 20 aurones, 81 chalcones, 349 flavonoids, 120 isoflavonoids, 74 lignans, 58 stilbenoids, 169 miscellaneous polyphenolic compounds, 100 coumarins, 28 xanthones, 67 quinones, and 160 miscellaneous phytochemicals. Again, the virtual phytochemical library was built by extraction of structural data from various established databases. A virtual library of 53 natural products derived from Clerodendrum indicum and C. serratum plants was constructed from the literature. Three-dimensional space analyses were carried out and the drug-likeness of this natural products library was established. A natural products–cancer network was built based on


docking. Apigenin 7-glucoside, hispidulin, scutellarein 7-O-β-D-glucuronate, acteoside, and verbascoside were predicted to be potential therapeutics that bind cancer target proteins. This study established an integrative approach, derived from network pharmacology, for identifying combinatorial drug actions against the cancer targets. Another virtual phytochemical library, of over 1000 compounds, was described by Tripathi et al. (2014); the library was screened virtually against novel targets in Haemophilus ducreyi for the putative treatment of chancroid, a sexually transmitted infection caused by H. ducreyi. As computer-aided drug discovery approaches are now routinely used to enhance the efficiency of the drug discovery and development pipeline, the role of a well-designed virtual phytochemical library has become pivotal. Ravichandran and Sundararajan (2017) reported the use of molecular docking to identify phytochemicals and commercial drugs for treating BRCA1-related breast cancer (the BRCA1 gene is located on the long arm of chromosome 17 at 17q21 and contains 24 coding exons spanning over 80 kb). A small phytochemical library comprising 2D and 3D structures of 18 phenolic compounds, selected based on literature information, was used. In addition, 26 conventional breast cancer drugs were included in the virtual screening.
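To make the idea of a searchable virtual library with a percentage similarity search more concrete, here is a minimal Python sketch. It is not the Phytochemica implementation: the record layout, the function names, and the toy fingerprint (overlapping three-character fragments of a SMILES string) are all assumptions made for illustration, and a real workflow would use a cheminformatics toolkit to compute proper structural fingerprints. The Tanimoto coefficient is used here as one common similarity measure, expressed as a percentage.

```python
from typing import Dict, List, Set, Tuple


def toy_fingerprint(smiles: str, size: int = 3) -> Set[str]:
    """Crude stand-in for a structural fingerprint: the set of overlapping
    substrings of a SMILES string. For illustration only."""
    if len(smiles) < size:
        return {smiles}
    return {smiles[i:i + size] for i in range(len(smiles) - size + 1)}


def tanimoto_percent(a: Set[str], b: Set[str]) -> float:
    """Tanimoto coefficient |A & B| / |A | B|, scaled to 0-100."""
    union = a | b
    if not union:
        return 0.0
    return 100.0 * len(a & b) / len(union)


def similarity_search(query: str, library: Dict[str, str],
                      cutoff: float) -> List[Tuple[str, float]]:
    """Return (name, percent similarity) for entries at or above the cutoff,
    most similar first."""
    q = toy_fingerprint(query)
    scored = [(name, tanimoto_percent(q, toy_fingerprint(smi)))
              for name, smi in library.items()]
    return sorted([s for s in scored if s[1] >= cutoff], key=lambda s: -s[1])


# Tiny hypothetical library: compound name -> SMILES.
library = {
    "phenol": "c1ccccc1O",
    "catechol": "c1ccc(O)c(O)c1",
    "ethanol": "CCO",
}
for name, score in similarity_search("c1ccccc1O", library, cutoff=20.0):
    print(f"{name}: {score:.0f}%")
```

Because the fingerprint is string-based, two different SMILES spellings of the same molecule would score below 100%; this is one reason production systems canonicalize structures and use substructure-aware fingerprints instead.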

5.5. CONCLUSIONS

Computation and computer-assisted operations, as well as mathematical modelling, have become an integral part of any endeavor to construct and manage a compound library, whether it comprises real synthetic compounds, phytochemicals or other natural products, or merely the 2D and/or 3D structures of known compounds (a virtual library) for in silico study in virtual HTS. For virtual compound libraries, the application of computation and computational tools, especially in relation to database creation and the extraction and compilation of the required structural information, is more apparent than for libraries of real compounds. However, every step involved in the extraction, isolation, and identification of dereplicated phytochemicals for phytochemical libraries also requires a high degree of automation and data handling involving computational aids of various sorts.

REFERENCES

Abdelmohsen, U.R., Cheng, C., Viegelmann, C., Zhang, T., Grkovic, T., Ahmed, S., Quinn, R.J., Hentschel, U., Edrada-Ebel, R., 2014. Dereplication strategies for targeted isolation of new antitrypanosomal actinosporins A and B from a marine sponge associated-Actinokineospora sp. EG49. Mar. Drugs 12, 1220–1244.
Allard, P.M., Peresse, T., Bisson, J., Gindro, K., Marcourt, L., Pham, V.C., Roussi, F., Litaudon, M., Wolfender, J.L., 2016. Integration of molecular networking and in-silico MS/MS fragmentation: a novel dereplication strategy in natural products chemistry. Planta Med. 81, S1–S381.
Bakiri, A., Plainchont, B., de Paulo Emerenciano, V., Reynaud, R., Hubert, J., Renault, J-H., Nuzillard, J-M., 2017. Computer-aided dereplication and structure elucidation of natural products at the University of Reims. Mol. Inf.
Barnes, E.C., Kumar, R., Davis, R.A., 2016. The use of isolated natural products as scaffolds for the generation of chemically diverse screening libraries for drug discovery. Nat. Prod. Rep. 33, 372–381.
Barot, K.P., Nikolova, S., Ivanov, I., Ghate, M.D., 2014. Liquid-phase combinatorial library synthesis: recent advances and future perspectives. Comb. Chem. High Throughput Screen. 17, 417–438.
Berdy, J., Kertesz, M., 1989. Bioactive natural products database: an aid for natural products identification. In: Collier, H.R. (Ed.), Chemical Information. Springer, Berlin, pp. 237–251.
Beutler, J.A., Alvarado, A.B., Schaufelberger, D.E., Andrews, P., McCloud, T.G., 1990. Dereplication of phorbol bioactives: Lyngbya majuscula and Croton cuneatus. J. Nat. Prod. 53, 867–874.
Bindseil, K.U., Jakupovic, J., Wolf, D., Lavayre, J., Leboul, J., van der Pyl, D., 2001. Pure compound libraries: a new perspective for natural product based drug discovery. Drug Discov. Today 6, 840–847.
Brkljaca, R., Goker, E., Urban, S., 2015. Dereplication and chemotaxonomical studies of marine algae of the Ochrophyta and Rhodophyta phyla. Mar. Drugs 13, 2714–2731.
Byler, K.G., Ogungbe, I.V., Setzer, W.N., 2016. In-silico screening for anti-Zika virus phytochemicals. J. Mol. Graph. Model. 69, 78–91.
Chan, J.A., Hueso-Rodriguez, J.A., 2002. Compound library management. Methods Mol. Biol. 190, 117–127.
Chang, Y.-P., Chu, Y.-H., 2014. Mixture-based combinatorial libraries from small individual peptide libraries: a case study on α1-antitrypsin deficiency. Molecules 19, 6330–6348.
Chervin, J., Stierhof, M., Tong, M.-H., Peace, D., Hansen, K.O., Urgast, D.S., Andersen, J.H., Yu, Y., Ebel, R., Kyeremeh, K., Paget, V., Cimpan, G., Wyk, A.V., Deng, H., Jaspars, M., Tabudravu, J.N., 2017. Targeted dereplication of microbial natural products by high resolution MS and predicted LC retention time. J. Nat. Prod. in press.
Corley, D.G., Durley, R.C., 1994. Strategies for database dereplication of natural products. J. Nat. Prod. 57, 1484–1490.
Cox, G., Sieron, A., King, A.M., De Pascale, G., Pawlowski, A.C., Koteva, K., Wright, G.D., 2017. A common platform for antibiotic dereplication and adjuvant discovery. Cell Chem. Biol. 24, 98–109.
English, L.B., 2002. Combinatorial Library: Methods and Protocols. Humana Press, USA.
Fox, R., Alexander, E., Alcover, C., Evanno, L., Maciuk, A., Litaudon, M., Duplais, C., Bernadat, G., Gallard, J.-F., Jullian, J.-C., Mouray, E., Grellier, P., Loiseau, P.M., Pomel, S., Poupon, E., Champy, P., Beniddir, M.A., 2017. Revisiting previously investigated plants: a molecular networking-based study of Geissospermum laeve. J. Nat. Prod. 80, 1007–1014.
Gaudêncio, S.P., Pereira, F., 2015. Dereplication: racing to speed up the natural products discovery process. Nat. Prod. Rep. 32, 779–810.
Herrath, H.M.P.D., Preston, S., Hofmann, A., Davis, R.A., Koehler, A.V., Chang, B.C.H., Jabbar, A., Gasser, R.B., 2017. Screening of a small, well-curated natural product-based library identifies two rotenoids with potent nematocidal activity against Haemonchus contortus. Vet. Parasitol. in press, https://doi.org/10.1016/j.vetpar.2017.07.005.
Hook, D.J., Pack, E.J., Yacobucci, J.J., Guss, J., 1997. Approaches to automating the dereplication of bioactive natural products: the key step in high throughput screening of bioactive materials from natural sources. J. Biomol. Screen. 2, 145–152.
Hubert, J., Nuzillard, J.-M., Renault, J.-H., 2017. Dereplication strategies in natural product research: how many tools and methodologies behind the same concept. Phytochem. Rev. 16, 55–95.
Irwin, J.J., Shoichet, B.K., 2005. ZINC: a free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. 45, 177–182.
Jung, G., 1999. Combinatorial Chemistry: Synthesis, Analysis and Screening. Wiley VCH, Weinheim, pp. 1–34.
Klitgaard, A., Iversen, A., Andersen, M.R., Larsen, T.O., Frisvad, J.C., Nielsen, K.F., 2014. Aggressive dereplication using UHPLC-DAD-QTOF: screening extracts for up to 3000 fungal secondary metabolites. Anal. Bioanal. Chem. 406, 1933–1943.
Lam, K.S., Lebl, M., Krchnak, V., 1997. The ‘one-bead-one-compound’ combinatorial library method. Chem. Rev. 97, 411–448.
Le Pogam, P., Le Lamer, A.C., Legouim, B., Boustie, J., Rondeau, D., 2016. In situ DART-MS as a versatile and rapid dereplication tool in lichenology: chemical fingerprinting of Ophioparma ventosa. Phytochem. Anal. 27, 354–363.
Lin, Y., Kuang, Y., Li, K., Wang, S., Song, W., Qiao, X., Sabir, G., Ye, M., 2017. Screening for bioactive natural products from a 67-compound library of Glycyrrhiza inflata. Bioorg. Med. Chem. 25, 3706–3713.
Ma, C., Liang, C., Wang, Y., Pan, M., Jiang, Q., Shi, C., 2017. Combinatorial library based on restriction enzyme-mediated modular assembly. Comb. Sci. 19, 351–355.
Martin, A.C., Pawlus, A.D., Jewett, E.M., Wyse, D.L., Angerhofer, C.K., Hegeman, A.D., 2014. Evaluating solvent extraction systems using metabolomics approaches. RSC Adv. 4, 26325–26334.
Mohamed, A., Nguyen, C.H., Mamitsuka, H., 2016. Current status and prospects of computational resources for natural product dereplication: a review. Brief. Bioinform. 17, 309–321.
Mohimani, H., Gurevich, A., Mikheenko, A., Garg, N., Nothias, L.-F., Ninomiya, A., Takado, K., Dorrestein, P.C., Pevzner, P.A., 2017. Dereplication of peptidic natural products through database search of mass spectra. Nat. Chem. Biol. 13, 30–37.
Nielsen, K.F., Månsson, M., Rank, C., Frisvad, J.C., Larsen, T.O., 2011. Dereplication of microbial natural products by LC-DAD-TOFMS. J. Nat. Prod. 74, 2338–2348.
Neto, F.C., Pilon, A.C., Selegato, D.M., Freire, R.T., Gu, H., Raftery, D., Lopes, N.P., Castro-Gamboa, I., 2016. Dereplication of natural products using GC-TOF mass spectrometry: improved metabolite identification by spectral deconvolution ratio analysis. Front. Mol. Biosci. 3, 1–13. article 59.
Paricharak, S., Mendez-Lucio, O., Rabindranath, A.C., Bender, A., IJzerman, A.P., van Westen, G.J.P., 2016. Data-driven approaches used for compound library design, hit triage and bioactivity modeling in high throughput screening. Brief. Bioinform., 1–9. https://doi.org/10.1093/bib/bbw105.
Pathania, S., Ramakrishnan, S., Bagler, G., 2012. Phytochemica: a platform to explore phytochemicals of medicinal plants. Database, 1–8. https://doi.org/10.1093/database/bav075.
Powers, C.N., Setzer, W.N., 2016. An in-silico investigation of phytochemicals as antiviral agents against dengue fever. Comb. Chem. High Throughput Screen. 19, 516–536.
Qiu, F., Fine, D.D., Wherritt, D.J., Lei, Z.T., Sumner, L.W., 2017. PlantMAT: a metabolomics tool for predicting the specialised metabolic potential of a system and for large-scale metabolite identification. Anal. Chem. 88, 11373–11383.
Quinn, R., 2012. Basics and principles for building natural product-based libraries for HTS. In: Chemical Genomics. Cambridge University Press, pp. 87–98.
Ravichandran, R., Sundararajan, R., 2017. In silico-based virtual drug screening and molecular docking of phytochemical-derived compounds and FDA approved drugs against BRCA1 receptor. J. Cancer Prev. Curr. Res. 8, 1–6. 00268.
Sarker, S.D., Nahar, L., 2012. Natural Products Isolation, third ed. Humana Press Springer-Verlag, USA.
Selegato, D.M., Freire, R.T., Tannus, A., Castro-Gamboa, I., 2016. New dereplication method applied to NMR-based metabolomics on different Fusarium species isolated from rhizosphere of Senna spectabilis. J. Braz. Chem. Soc. 27, 1421–1431.
Sepetov, N.F., Krchnak, V., Stankova, M., Wade, S., Lam, K.S., Lebl, M., 1995. Library of libraries: approach to synthetic combinatorial library design and screening of ‘pharmacophore’ motifs. Proc. Natl. Acad. Sci. U. S. A. 92, 5426–5430.
Shour, S., Iranshahy, M., Pham, N., Quinn, R.J., Iranshahi, M., 2017. Dereplication of cytotoxic compounds from different parts of Sophora pachycarpa using an integrated method of HPLC, LC-MS and 1H-NMR techniques. Nat. Prod. Res. 31, 1270–1276.
Spears, K.L., Brown, S.P., 2017. The evolution of library design: crafting smart compound collections for phenotypic screens. Drug Discov. Today Technol. 23, 61–67.
Stahura, F.L., Xue, L., Godden, J.W., Bajorath, J., 1999. Molecular scaffold-based design and comparison of combinatorial libraries focused on the ATP-binding site of protein kinases. J. Mol. Graph. Model. 17, 1–9.
Tripathi, P., Chaudhary, R., Singh, A., 2014. Virtual screening of phytochemicals to novel targets in Haemophilus ducreyi towards the treatment of Chancroid. Bioinformation 10, 502–506.
Weber, L., 2005. Current status of virtual combinatorial library design. Comb. Sci. 24, 809–823.


Chapter 6

High-Throughput Screening of Phytochemicals: Application of Computational Methods

Fyaz M.D. Ismail, Lutfun Nahar, Satyajit D. Sarker
Liverpool John Moores University, Liverpool, United Kingdom

Chapter Outline
6.1. Introduction
6.2. The Pre-HTS Era
6.3. High-Throughput Screening
6.3.1 Reaction Monitoring and Observation
6.3.2 Advances in Monitoring In Vivo
6.3.3 Location of Facilities
6.3.4 Is There a Difference Between So-Called Leads and Drugs?
6.3.5 Visualization of Data
6.3.6 Dose–Response Analysis
6.3.7 Examples of HTS Success
6.4. HTS Platforms for Natural Products/Phytochemicals
6.4.1 What is a Natural Product?
6.4.2 Natural Products for Increasing Diversity
6.4.3 Natural Products Sample Preparation
6.4.4 Examples of HTS Platforms for Natural Products/Phytochemicals
6.5. High-Content Screening
6.6. Conclusions
References

6.1. INTRODUCTION

Most accounts state that high-throughput screening (HTS), an essential tool in the drug discovery and development process, originated in the pharmaceutical and biotechnology sector, with subsequent adoption by academic and not-for-profit organizations. In reality, the foundations of HTS were laid by Dr Gyula Takátsy (1914–80) in response to an influenza outbreak. Severe shortages of glassware prompted him to fabricate a ‘microplate’ consisting of 6 rows of 12 wells in polymethyl methacrylate (PMMA; acrylic; Plexiglas, Acrylite, Lucite, Perspex), as well as micropipetting technologies using platinum loops (Takatsy, 1950, 1967; Sever, 1962). This chapter can only touch upon this vast subject, and the focus will be on the origin and recent applications to compounds of natural origin, but it should be noted that the strategies used in the pharmaceutical drug discovery process and in metabolomics rely on near-identical technology and approaches. The HTS process leading to the discovery of active molecules from natural products is summarized in Fig. 6.1.

[Fig. 6.1 is a flowchart. Box labels, in the order extracted: High-throughput screening of crude extracts; Determination of potency by dose response curves; Confirmation of primary hits in triplicate; Medium scale regrowth (100 mL) and activity confirmation; LC-MS analysis to identify known compounds; LC-MS/MS and NMR analysis to confirm novelty of active components; HPLC fractionation of active extracts; Large scale regrowth of extract(s) with active novel component(s) and bioassay-guided compound isolation; Active molecule.]

FIG. 6.1  HTS process leading to active molecule.

Computational Phytochemistry. https://doi.org/10.1016/B978-0-12-812364-5.00006-7 © 2018 Elsevier Inc. All rights reserved.

6.2.  THE PRE-HTS ERA

It is important to understand that pharmacological assays, which originated from thinking developed by Paul Ehrlich (Albert, 1973) for perfecting candidate drugs using bioassay-guided synthesis, were by their nature based on experimentation using models of disease expressed in vivo, and would now be described as low-throughput. Nevertheless, this drug discovery approach led to the development of the first anti-syphilitic, marketed as Salvarsan 606, whose structure was revised nearly a century later (Lloyd et al., 2005). Before technologies such as HTS could contribute to the drug discovery process of today, several technologies and concepts had to be perfected and integrated. These include the dose–response curve, developed in the late 19th and early 20th centuries by Hugo Schulz (1853–1932) (Calabrese, 2016); UV spectroscopy (c. 1941); polymer synthesis and technology (Perspex in 1934); robotics; machine vision; sufficient computational power; QSAR; statistical techniques; molecular modelling (MM2, CoMFA, quantum mechanics, pharmacophore maps); WIMP; electrospray mass spectrometry; and precision machine tooling, to name just a few. The microprocessor revolution underpinned HTS technology by integrating robotics and electronic computation/artificial intelligence systems with biochemical assays in vitro. Mass screening efforts to find effective battlefield replacements for quinine in treating malaria were hampered by low-throughput screening (Dascombe et al., 2007). A similar situation existed in the search for effective antibiotics for treating wartime injuries. Developments accelerated mostly in the inter-war period, but the situation had improved little by the end of the American-Vietnam conflict, when mefloquine was developed from quinine.


Before the introduction of HTS, traditional biochemical and pharmacological drug discovery methods required 1 mL reactions in individual vessels, usually test tubes, and weighed dry compounds were dissolved in relevant dosage vehicles to provide solutions for testing. It has been estimated that this approach limited testing to around 80–100 compounds per month. The choice of compounds also needs careful consideration to avoid problems of convergence and co-linearity when constructing an initial testing set of around 3000–4000 compounds representing structural diversity. Such sets required between 1 and 2 years for screening and the subsequent construction of suitable QSAR models.

6.3.  HIGH-THROUGHPUT SCREENING

1. HTS is a drug discovery approach that uses robotics, data processing/control software, liquid handling devices, and sensitive detectors to quickly conduct millions of chemical, genetic, or pharmacological tests and thereby rapidly identify active compounds, antibodies, or genes that modulate a particular biomolecular pathway. An HTS approach accelerates target interrogation, meaning that large-scale libraries can be screened both quickly and relatively cost-effectively (Dove, 2007; Janzen and Bernasconi, 2009). Full-scale HTS libraries generally extend into many thousands of constructs, and are initially divided by biological activity approach, e.g., small compound (including phytochemicals), siRNA, and shRNA. Each of these libraries is further subdivided into smaller libraries according to biological family or target specificity. All libraries are arrayed into micro-well plates, enabling screening in a miniaturized form in 96-, 384-, and in some cases 1536-well plates. There are multiple steps in any HTS experiment, and those steps can be generalized into three main categories: (1) sample preparation, (2) sample handling, and (3) readouts and data acquisition.

2. HTS provides a practical method to investigate large numbers of phytochemicals (or their semi-synthetic derivatives) in miniaturized assays in vitro, to identify those molecules capable of modulating activity at the biological target under consideration. One particular impetus for increasing throughput was the expression of DNA, which allowed the identification of new therapeutic targets in numbers beyond the capacity of traditional methods relying on hand pipetting and assaying (Fig. 6.2). For effective screening, several hundred thousand compounds had to be evaluated to keep up with the commercial rivals of Pfizer; the interview with Sills (1998) is most informative for understanding the history behind the campaign at Pfizer. Commercial sensitivity of these data meant that comparative approaches were not published until many years later (Beggs, 2000; Fox et al., 2001), prompting adoption by academic and not-for-profit organizations (such as the cancer research campaigns in the UK). A personal perspective on the origin of the HTS concept at Pfizer Global Research and Development is essential reading (Pereira and Williams, 2007). This is presented in three distinct phases: (a) development of the concept and philosophical choices with regard to implementation, (b) practical implementation and technology development, and finally, (c) the logical expansion to include related disciplines in drug development.
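The plate formats mentioned above lend themselves to a quick capacity calculation. The Python sketch below is illustrative only: it assumes the common 2:3 row-to-column layout for each format (8 × 12, 16 × 24, 32 × 48) and a hypothetical number of wells reserved per plate for controls, and estimates how many plates are needed to array a library of a given size.

```python
import math

# Rows x columns for the plate formats named in the text
# (2:3 layouts; assumed here, check against the plates actually used).
PLATE_FORMATS = {96: (8, 12), 384: (16, 24), 1536: (32, 48)}


def plates_needed(n_samples: int, wells: int, control_wells: int = 0) -> int:
    """Number of plates required when `control_wells` wells per plate
    are reserved for controls and each sample occupies one well."""
    if wells not in PLATE_FORMATS:
        raise ValueError(f"unsupported plate format: {wells}")
    rows, cols = PLATE_FORMATS[wells]
    usable = rows * cols - control_wells
    if usable <= 0:
        raise ValueError("no wells left for samples")
    return math.ceil(n_samples / usable)


# Hypothetical library of 10,000 samples, two columns of controls per plate.
for wells, (rows, cols) in PLATE_FORMATS.items():
    n = plates_needed(10_000, wells, control_wells=2 * rows)
    print(f"{wells}-well: {n} plates")
```

The library size and the two control columns are invented for the example; the point is only that moving to a denser format reduces the plate count roughly in proportion to the number of usable wells.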

[FIG. 6.2 is a timeline (1984–2001) tracing the HTS concept (natural product screening automation), philosophical choices, full-file screening, therapeutic target HTS, miniaturization, the HTS ADMET concept, and recent advances.]

FIG. 6.2  HTS as performed in Pfizer. (Redrawn from Pereira, D.A., Williams, J.A., 2007. Origin and evolution of high throughput screening. Br. J. Pharmacol. 152, 53–61.)

Since the pharmaceutical sector considered that HTS provided a competitive advantage over rivals, results went unpublished for many years other than in internal proposals and white papers. The authors have visited many such facilities in various large pharmaceutical companies, and each had variations in the strategies adopted. Readers interested in developing the modern infrastructure necessary to convert concept to reality are encouraged to read the handbook by Sittampalam et al. (2017), especially for understanding more recent key developments in automation, liquid handling, novel assay formats, image recognition, and informatics, all driven by a high level of computing. Several older accounts provide useful coverage of many aspects of HTS within drug discovery, which are equally applicable to phytochemical screening (Devlin, 1997; Seethala and Fernandes, 2001; Janzen, 2002; Minor, 2006). Using a combination of robotics, data control and processing software, liquid handling (pipetting) devices, and sensitive detection systems, HTS allows investigators to efficiently conduct hundreds to millions of chemical, genetic, pharmacological, and toxicological tests. Through this process, the investigators can rapidly identify active compounds, antibodies, or genes that modulate a particular biomolecular pathway. The results of these experiments provide rational starting points for drug design and lay the foundation for understanding the interaction or role of a particular natural product in modulating a biochemical process. Screening a large number of compounds, as exemplified by the library of compounds shown in Fig. 6.3, using a validated assay within a matrix of compartments yields data on the potential suitability of each compound to show the desired effect. If the target is an enzyme, various isoforms can be tested in parallel to determine selectivity and off-target effects.
In parallel, at least one cytotoxicity assay is conducted and the ratio of the two assays provides a crude measure of the relative therapeutic index. Only if a hit also has good drug metabolism and pharmacokinetics (DMPK) properties can it translate to an assay in vivo. By cross-correlating all the data, this process can be used to decide which hits are worth pursuing and also to identify known compounds and detect so-called false positive hits (Sink et al., 2010).

FIG. 6.3  Stack 'em up. Part of the compound library at the Sanofi-Aventis laboratory in Toulouse, France. More than 0.1 million compounds, stored in trays of vials, are kept there (Leeson and Springthorpe, 2007).

In practical terms, during HTS, a robot (usually equipped with an arm or manipulator) handles an assay plate under computer control (Zhang, 2011). The key reaction or testing vessel in HTS is the microtitre plate (Fig. 6.4). In its modern form, it is a small rectangular container, often constructed of machined or moulded disposable plastic, populated by a grid of miniaturized open depressions, most commonly referred to as ‘wells’. The origin of microtitre plates can be traced back to an influenza epidemic in Hungary in the early 1950s during a shortage of standard laboratory equipment. This unfortunate event led a Hungarian physician, Gyula Takátsy (1914–80), to develop machined Lucite micro-well plates in 1951 (6 × 12; 10 × 10 wells) and, in the following year, a 96-well V-bottom machined plate. A brief history of the microplate and its early development has been published (Manns, 1999). In general, modern (post-2013) microplates for HTS possess 384, 1536, or 3456 wells depending on the application. All are multiples of 96 (following the original 96-well microplates). Investigators are advised to consult commercial literature to assess the current availability of the various ranges of plates. Most of the wells contain either controls or test items, depending on the experimental protocol. The test items are most commonly purified compounds or, in the case of natural products, crude extracts (plant extracts) dissolved in an aqueous medium that also contains the supersolvent dimethyl sulfoxide (DMSO). Most often, the wells contain all the solubilized materials needed to conduct a particular assay, including proteins, DNA, cells, or even embryos (the remaining wells may be empty, hold replicates, or contain pure solvent or untreated samples intended for use as various experimental controls) (Chang et al., 2010; Wu, 2010).
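A plate layout of the kind just described can be sketched in a few lines of Python. The example assumes the conventional naming in which rows are lettered and columns numbered (A1, A2, ...), and it reserves the first and last columns for controls; that choice of control positions is hypothetical and varies between laboratories and assays. It also checks the remark that the higher-density formats are multiples of 96.

```python
import string
from typing import Dict


def row_label(index: int) -> str:
    """A..Z for the first 26 rows, then AA, AB, ... (valid for up to 52 rows)."""
    letters = string.ascii_uppercase
    return letters[index] if index < 26 else "A" + letters[index - 26]


def plate_layout(rows: int, cols: int) -> Dict[str, str]:
    """Map each well name to its role. First and last columns are treated as
    control wells here (a hypothetical convention); the rest hold test items."""
    layout = {}
    for r in range(rows):
        for c in range(1, cols + 1):
            role = "control" if c in (1, cols) else "sample"
            layout[f"{row_label(r)}{c}"] = role
    return layout


layout_384 = plate_layout(16, 24)
n_samples = sum(role == "sample" for role in layout_384.values())
print(len(layout_384), "wells,", n_samples, "sample wells")

# The higher-density formats named in the text are whole multiples of 96.
for wells in (384, 1536, 3456):
    print(wells, "=", wells // 96, "x 96", "(exact)" if wells % 96 == 0 else "")
```

Keeping the layout as a plain mapping from well name to role makes it easy to join against an inventory table or a reader output file keyed by well name.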


FIG. 6.4  Microtitre plates in HTS.

Plates are available for performing luminescence or fluorescence measurements for which a transparent bottom is required. Their contents can be viewed easily, and such microplates allow transmitted light measurements. In addition, such plates can be directly examined under a microscope, especially when using cell-based assays. The pigmented upper structure also assists in reducing any inter-well cross-talk during detection or observation. Microplates with transparent bottoms are available with (a) nontreated surfaces (both sterile and nonsterile), (b) immunoassay surfaces, and (c) cell culture surfaces of various grades.

6.3.1  Reaction Monitoring and Observation

An assay consists of filling each well of the plate with some biochemical or biological entity for experimentation, such as DNA, proteins (especially enzymes), or combinations thereof. Investigators are referred to the various volumes of ‘Methods in Enzymology’, which can be used to design suitable assays (Macarrón and Hertzberg, 2009). As indicated, cells (cloned or ex vivo) or even animal embryos have been used successfully in assays. After the desired incubation time has passed, the ability of the compound or mixture being tested to interact with the target (either actively or passively), eliciting a receptor-mediated response (or, conversely, failing to do so), is measured. Measurements are taken rapidly across all the plate's wells, either singly or at once, depending on the type of assay and the availability of detection devices. Normally, a specialized automated analysis robot/machine can run a number of parallel or sequential experiments in the wells (such as UV absorption, which can indicate binding). Measurements can be taken from above or below using the aforementioned clear-bottomed plates. The matrix of data thus obtained is often displayed mapped onto a graphic in a simplified format indicating hits or misses. Modern developments allow high-capacity measurements within minutes, thereby generating hundreds to thousands of experimental data points for subsequent evaluation by an operator. Selected hits are then re-assessed (often in greater detail) using new assay plates, and the contents of the well are dereplicated to determine the identity or identities of the compounds therein, usually starting with the most sensitive technique available, which is currently (low-resolution or nominal mass) electrospray ionization mass spectrometry (ESIMS). False positives must be eliminated, since certain commonly occurring compounds of nominal value may also prove ‘active’. Despite accelerated lead identification in experimental HTS, a large number of peculiar molecules have been identified. Although some have proved sufficiently interesting for further optimization, others turn out to be dead ends when attempts are made to optimize their activity, typically after a great deal of time and resources have been devoted to them. Such false positive hits remain a key problem in the field of HTS, and in the early stages of drug/phytochemical discovery in general. Many studies have been devoted to understanding the origins of false positives, and such findings have been incorporated into filters and methods that can predict and eliminate questionable molecules from further consideration (Sink et al., 2010). Natural products, including phytochemicals, may also suffer from the aforementioned problems (Henrich and Beutler, 2013).

6.3.2  Advances in Monitoring In Vivo

Manual measurements by a trained researcher are still often employed, for instance, when microscopy is used to detect changes or defects in embryonic development caused by the analyte; however, specialist robotic systems have been designed to detect hits that can then be reconfirmed manually. In the effort to bring new and useful therapeutics to patients, whole-organism drug discovery represents a complementary approach to in vitro HTS. Recent developments have detailed a true-HTS platform for cellular-resolution chemical and genetic screens on zebrafish larvae in vivo (White et al., 2016). The system automatically locates and loads zebrafish from either reservoirs or multi-well plates. Unlike manual systems, it rapidly positions and rotates the larvae for high-speed confocal imaging and laser manipulation (in under 20 seconds) without causing damage. One of the advantages of whole-organism screening is that it adequately accounts for biological complexity, i.e., it evaluates compound effects within intact disease models. Whole-organism drug discovery is performed from a target-agnostic perspective and is thus well suited to identifying new druggable targets; the molecular mechanism of action of ‘hit’ compounds can then be investigated to identify the molecular target(s) and/or signaling pathway(s) eliciting the desired effect (e.g., absence of pathology). In 2000, the first large-scale chemical screen of zebrafish embryos arrayed in multi-well plates was reported by Peterson et al. (2004). Since then, well over 60 successful zebrafish chemical screens have been published, covering a diverse array of research areas.

6.3.3  Location of Facilities

A screening facility is often located in a dedicated clean room with a stable, sterile, and climate-controlled environment. It typically curates a library of stock plates within an automated, coded, and catalogued carousel system accessible by robotic manipulators. These stock plates can be either bespoke or purchased from commercial vendors and serve as reservoirs or banks from which separate assay plates are created as required. An assay plate is thus simply a replicate of a stock plate, created by automatically pipetting a small amount of liquid (in nanolitres), either singly or in parallel, into a sterile empty plate so as to prevent microbial contamination. Analysis of data against poorly defined processes employs a so-called ‘extra-thermodynamic approach’, which does not require explicit knowledge of the drug target (Fujita, 1990; Herrmann and Franke, 2013). One strategy is to screen a library of molecules to see which, if any, reveal selectivity in an in vitro version of the assay or disease model. For instance, there may be a corresponding assay to determine the extent of ‘off-target effects’, allowing a candidate to be chosen that may be less potent but more worthy of development due to greater selectivity and a better ADME/toxicity profile. In medicinal chemistry, to maximize the chances of success, a collection of structurally diverse, drug-like molecules, such as those produced by diversity-oriented synthesis (DOS), is often used. Currently, compound libraries produced using combinatorial chemistry are the principal source for HTS programmes within many drug discovery campaigns, whether academic or commercial in nature. Correspondingly, nature has also provided an outstanding source of novel and innovative scaffolds, many of which show useful pharmacological activity.
An analogous approach can be taken with a library of diverse natural products, e.g., plant secondary metabolites, which can include semi-synthetic modifications to address drug metabolism and pharmacokinetics (DMPK) concerns. In general, small compounds often fail to bind to putative drug targets with sufficient specificity and potency. Although the focus has been on screening libraries of pure natural products or phytochemicals, such libraries are relatively few in number compared with those of purely synthetic or unnatural products. Further details on the availability of dereplicated phytochemical libraries suitable for HTS can be found in Chapter 5.

6.3.4  Is There a Difference Between So-Called Leads and Drugs?

As candidates for further drug development, lead structures should ideally display most, if not all, of the following attributes: (1) simple chemical features, tractable for chemistry-based optimization; (2) membership of an established QSAR series; (3) good absorption, distribution, metabolism, and excretion (ADME) properties, i.e., a good drug metabolism and pharmacokinetics (DMPK) profile; and, last but not least, (4) a favourable patent situation to ensure commercialization. Oprea et al. (2001) suggested that there are two distinct categories of leads: (a) those that inherently lack therapeutic utility (i.e., ‘pure’ leads) and (b) those that are marketed drugs themselves, but have been modified to provide novel drugs. Combinatorial techniques have also been applied to design ‘natural product-looking’ libraries covering a wide range of structural diversity, with varying degrees of success (Abreu and Branco, 2003); the need to develop compounds with more sp2 and sp3 character has been recognized.

6.3.5  Visualization of Data

The complexity of HTS data calls for efficient representation. One popular method is the heat map (or heatmap); software for this purpose can be obtained free of charge from the Mapline website at https://mapline.com/. A heat map is simply a representation of data in the form of a map or diagram in which data values are represented as colours. In its literal sense, a heat map is an image representing the varying temperature or infrared radiation recorded over an area or during a period. For HTS, the data are arranged in a matrix in which the individual values are represented by different colours or intensities of colour. The term ‘heat map’ was successfully trademarked by software designer Cormac Kinney in 1991, to describe a 2D colour display (depicting financial market information).
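As a minimal sketch of the idea, the following Python snippet maps a small matrix of well readings onto a coarse character ramp, the text-mode equivalent of a colour scale. The readings, the assumed 0–100 range, the five-glyph ramp, and the names `to_bin` and `render_heatmap` are all illustrative assumptions rather than values or tools taken from any screen discussed in this chapter.

```python
# Illustrative only: render a plate-like matrix as a text "heat map".
# Denser glyphs stand in for warmer colours; all readings are made up.

RAMP = ".:-=#"  # five intensity bins, low -> high


def to_bin(value, lo, hi, n_bins):
    """Map a value in [lo, hi] to an integer bin 0..n_bins-1 (clamped)."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    frac = (value - lo) / (hi - lo)
    frac = min(max(frac, 0.0), 1.0)          # clamp out-of-range readings
    return min(int(frac * n_bins), n_bins - 1)


def render_heatmap(matrix, lo=0.0, hi=100.0, ramp=RAMP):
    """Return one string per plate row, with one glyph per well."""
    return ["".join(ramp[to_bin(v, lo, hi, len(ramp))] for v in row)
            for row in matrix]


# Hypothetical percent-activity readings for a 3 x 4 block of wells.
plate = [
    [2.0, 15.0, 48.0, 97.0],
    [5.0, 61.0, 33.0, 8.0],
    [88.0, 12.0, 0.0, 100.0],
]

lines = render_heatmap(plate)
for row_label, line in zip("ABC", lines):
    print(row_label, line)
```

A graphical package would replace the glyph ramp with a colour map, but the binning step (value to colour index) is the same.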

6.3.6  Dose–Response Analysis

Subsequent to a primary HTS screen, in which compounds are screened at a single concentration, a subset of compounds is selected for a more robust quantitative follow-up assessment. The reselected molecules are tested at various concentrations, so that concentration may be plotted against the corresponding assay response. Such ‘dose–response’ or ‘concentration–response’ curves are generally defined by four parameters: bottom asymptote (baseline response), top asymptote (maximal response), slope (Hill slope or Hill coefficient), and the EC50 value (Weiss, 1997). The statistical methods used to treat screening data, especially approaches to dealing with data outliers, have been discussed by Goktug et al. (2013). Since statistical evaluation of assay performance can be the Achilles heel of HTS data analysis, various data analysis methods are available to (a) correct for plate-to-plate assay variability, (b) correct for systematic errors, and (c) assess assay quality. Statistical analysis is also critical in the ‘hit’ selection process, culling data from primary screens and subsequent re-validation screen(s). While some of these methods may be intuitively applied using spreadsheet programs (e.g., Microsoft Excel), others may require the development of computer programs using more advanced programming environments (e.g., R, Perl, C++, Java, MATLAB). Alongside commercially available analysis tools, a growing number of open-access software packages (Table 6.1) specially designed for HTS data management/analysis are available to investigators with little prior experience.
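The four curve parameters named above are commonly combined in the Hill equation (the four-parameter logistic model). The Python sketch below evaluates one common form of that model for a response that increases with concentration, and makes a crude EC50 estimate by log-linear interpolation at the half-maximal response. The function names and the synthetic titration are illustrative assumptions; a real analysis would estimate all four parameters by nonlinear regression rather than by interpolation.

```python
import math


def hill_response(conc, bottom, top, ec50, hill_slope):
    """Four-parameter logistic (Hill) model for a response rising with dose."""
    if conc <= 0 or ec50 <= 0:
        raise ValueError("concentration and EC50 must be positive")
    return bottom + (top - bottom) / (1.0 + (ec50 / conc) ** hill_slope)


def estimate_ec50(concs, responses):
    """Crude EC50: log-linear interpolation at the half-maximal response.

    The lowest and highest observed responses stand in for the bottom and
    top asymptotes, so this is only a rough first look at the data.
    """
    half = (min(responses) + max(responses)) / 2.0
    points = sorted(zip(concs, responses))
    for (c0, r0), (c1, r1) in zip(points, points[1:]):
        # Look for the pair of neighbouring doses that brackets 'half'.
        if r0 != r1 and (r0 - half) * (r1 - half) <= 0:
            t = (half - r0) / (r1 - r0)
            log_c = math.log10(c0) + t * (math.log10(c1) - math.log10(c0))
            return 10 ** log_c
    raise ValueError("responses never cross the half-maximal level")


# Synthetic titration generated from the model itself (not real assay data).
doses = [0.01, 0.1, 1.0, 10.0, 100.0]      # arbitrary concentration units
signal = [hill_response(c, 0.0, 100.0, 1.0, 1.0) for c in doses]
print(round(estimate_ec50(doses, signal), 3))
```

Interpolation is used here only to keep the example dependency-free; with SciPy available, `scipy.optimize.curve_fit` applied to `hill_response` would be the more usual route.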

TABLE 6.1  Examples of Open-Access Software Packages for Library Management and Statistical Analysis of HT Screening Data

Software Package | Features | Programming Language | References
Screensaver | Web-based laboratory information management system for management of library and screen information | Java | Tolopko et al. (2010)
MScreen | Web-based compound library and siRNA plate management, QC and dose–response fitting tools | PHP, Oracle/MySQL | Jacob et al. (2012)
NEXT-RNAi | Library design and evaluation tools for RNAi screens | Perl | Horn et al. (2010)
K-Screen | Analysis, visualization, management and mining of HT screening data including dose–response curve fitting | R, PHP, MySQL | Tai et al. (2011)
HTSCorrector | Statistical analysis, visualization and correction of systematic errors for all HT screens | C+ | Makarenkov et al. (2006)
Web cellHTS2 | Web-based analysis toolbox for normalization, QC, ‘hit’ selection and annotation for RNAi screens | R/Bioconductor project | Boutros et al. (2006) and Pelz et al. (2010)
RNAither | Automated pipeline for normalization, QC, ‘hit’ selection and pathway generation for RNAi screens | R/Bioconductor project | Rieber et al. (2009)
HTSanalyzeR | Gene set enrichment, network and gene set comparison analysis for post-processing of RNAi screening data | R/Bioconductor project | Wang et al. (2011)

6.3.7  Examples of HTS Success

African trypanosomiasis, leishmaniasis, and Chagas disease are three neglected diseases for which the available therapeutics have poor toxicological profiles and an efficacy that is being eroded by drug resistance. Application of HTS to synthetic (combinatorial) chemical libraries suffered from the aforementioned lack of the structural diversity required to find entirely novel and patentable chemotypes. Consequently, a subset (5976) of microbial extracts from the MEDINA Natural Products library, one of the largest natural product libraries, curating more than 130,000 microbial extracts, was screened against the three diseases in a validated phenotypic HTS platform. A total of 48 extracts contained potentially new and active compounds (Annang et al., 2015). Known active components included actinomycin D, bafilomycin B1, chromomycin A3, echinomycin, hygrolidin, and nonactins, among others (Fig. 6.5). To identify unknowns, extracts were progressed using tandem liquid chromatography–mass spectrometry for dereplication. Similarly, prompted by the low structural diversity within the set of antimalarial drugs currently available in the clinic and the increasing number of cases of resistance, the MEDINA natural products collection was screened, again using a phenotypic HTS, this time based on inhibition of Plasmodium lactate dehydrogenase, and three new active compounds were discovered. As is common in such screens, four known natural products also proved active against Plasmodia (Pérez-Moreno et al., 2016).

FIG. 6.5  Active compounds found in a validated HTS platform.

In another example, a sensitive and robust, yet low-tech and inexpensive, high-throughput metabolic screen for novel antibiotics was applied to 39,000 crude extracts derived from organisms originating in the diverse ecosystems of Costa Rica, and allowed identification of 49 extracts with reproducible antibacterial effects. An extract from an endophytic fungus was further characterized, which led to the discovery of three novel natural products, including mirandamycin, with broad-spectrum antibacterial activity against Escherichia coli, Pseudomonas aeruginosa, Vibrio cholerae, methicillin-resistant Staphylococcus aureus (MRSA), and Mycobacterium tuberculosis. During an HTS screen involving a repurposing strategy using currently approved drugs, several natural products, including the known microtubule inhibitors vinblastine, vincristine, and colchicine, were found to be active against Ebola virus (Kouznetsova et al., 2014). Recently, five indoline alkaloid-type synthetic compounds showed high potency (Filone et al., 2013).


Kato et al. (2016) recently identified numerous compounds with multistage activity against Plasmodia by growth-inhibition phenotypic HTS of >100,000 compounds prepared in advance using diversity-oriented synthesis (DOS). The DOS collection was synthesized using modern asymmetric organic chemistry to impart three-dimensional topographical features using the build–couple–pair strategy (Dancík et al., 2010). The success of this strategy in revealing novel therapeutic targets is illustrated by the discovery of small-molecule antimalarial inhibitors of known targets (phenylalanyl-tRNA synthetase, PI4K and cytochrome bc1 Qi) as well as others with as-yet unidentified targets. Subsequently, using HTS, a consortium involving the Broad Institute exploited this gap in natural product diversity to make natural product-like compounds, resulting in the design of a broad-spectrum antimalarial (Maetani et al., 2017). That work discussed the optimization of a potent series of antimalarial inhibitors consisting of azetidine-2-carbonitriles, which had previously been shown to target P. falciparum dihydroorotate dehydrogenase (DHODH) in a biochemical assay. The optimized compound BRD9185 showed in vitro activity against multidrug-resistant blood-stage parasites (EC50 = 0.016 μM) and was curative after just three doses in a P. berghei mouse model. BRD9185 had a long half-life (15 h) and low clearance in mice and represented a new structural class of DHODH inhibitors with potential as antimalarial drugs. Perhaps the most abundant azetidine-containing natural product is azetidine-2-carboxylic acid, a nonproteinogenic homolog of proline. Academic and not-for-profit HTS operations worldwide are better positioned than ever to screen natural products, either as extracts or as purified molecules. Initiatives of such a global nature require special collaborative agreements protecting the rights of all parties involved, especially those countries endowed with rich sources of biodiversity. 
Access to their resources necessitates fair and equitable recompense and ratification of the Nagoya Protocol, which requires signatures from 50 countries (Cragg et al., 2012). With advancements in artificial intelligence, many of the decisions on whether to take a particular hit forward may well be aided by increasingly sophisticated decision-making tools, streamlining the choice of which natural products to promote towards the clinic.

6.4.  HTS PLATFORMS FOR NATURAL PRODUCTS/PHYTOCHEMICALS

Although HTS protocols were initially developed for synthetic compound libraries, prepared by classical and/or combinatorial synthesis, to serve the modern drug discovery process through ‘lead’ compound generation, several HTS platforms are now available for screening purified natural products, including phytochemicals, as well as natural product extracts and fractions. Computation and mathematical modelling have revolutionized phytochemical and natural products research by facilitating the extraction, isolation, and structure elucidation processes and by reducing the overall time and effort needed to create dereplicated phytochemical libraries or other natural products libraries. Compared with synthetic molecules, natural products possess more structural diversity and complexity, more stereogenic centres, and drug-like properties, which are expected to contribute to the ability of natural products collections/libraries to provide hits even against the more difficult screening targets. The design of targeted or focused phytochemical libraries has its origin in combinatorial synthesis and computational chemistry. There are several reports of attempts to design phytochemical libraries that target specific types of compounds or specific therapeutic targets, resulting in the introduction of alkaloid-like and flavonoid-like libraries, for example. Molecular diversity can be obtained by careful selection of natural sources, e.g., bacteria, fungi, microalgae, and plants. The following subsections look into the meaning of the term natural product, the inherent molecular diversity and uniqueness represented by natural products, sample preparation processes, and a few specific examples of HTS platforms designed for natural products, which involve various levels of computation and mathematical modelling.

6.4.1  What is a Natural Product?

A natural product is commonly defined as a chemical compound (i.e., metabolite) biosynthesized by a living organism (Sarker and Nahar, 2012). In the current discussion, however, a metabolite is a participant in secondary or specialized metabolism (as opposed to a primary metabolite, i.e., a participant in the central metabolism common to all organisms, such as respiration). Numerous (>100,000) natural products, many of them phytochemicals, have been studied and, from the latter part of the 20th century, tested within drug discovery campaigns. Notably, numerous natural products appear to possess unique mechanisms of (drug) action (Harvey, 2008). Since natural products are a proven reservoir of drugs (Fig. 6.6) and novel prototypes (Newman and Cragg, 2016), they offer powerful solutions to the rise of multi-drug-resistant pathogens. Although phytochemicals and other natural products, despite their unique chemical diversity, were somewhat neglected by the pharmaceutical industry in relation to HTS programmes, the absence of any notable successes in drug discovery from the synthetic combinatorial library approach has rekindled interest in natural products drug discovery and led to the incorporation of dereplicated natural products libraries (phytochemical libraries) in HTS processes. An advantage of natural products is that they readily access the third dimension, whereas DOS libraries generally do not. As indicated before, natural sources of secondary metabolites include plants, animals, and various microorganisms, many of which offer a rich structural diversity that transcends that of many synthetic compounds, with perceived potential in drug discovery and development (Cragg et al., 1997; Grabley and Thiericke, 1999; Harvey, 2000; Newman et al., 2000; Newman and Cragg, 2016).

FIG. 6.6  Typical natural products with useful pharmacological activity.

A major limitation of existing synthetic leads is that they are often heterocyclic and generally of ‘reduced dimensionality’, i.e., flat in shape (Overing et al., 2009). Current evidence indicates that ‘smaller’ molecules are more likely than ‘larger’ ones to succeed as potential drugs in clinical trials (Leeson and Springthorpe, 2007), since the physico-chemical properties of the latter are often not sufficiently ‘drug-like’ (Bickerton et al., 2012; Björn et al., 2013). One often-cited guideline is Lipinski's rule of five for synthetic or designed drugs, but certain natural products that do not conform to such convenient rules, for instance taxol-based drugs, have proven sufficiently valuable that their DMPK problems have been solved by bespoke solutions (Lipinski et al., 2001).
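Lipinski's rule of five, as commonly stated, flags a compound when its molecular weight exceeds 500 Da, its logP exceeds 5, it has more than 5 hydrogen-bond donors, or it has more than 10 hydrogen-bond acceptors. The Python sketch below counts such violations for two compounds mentioned in this chapter. The logP values are those quoted in Section 6.4.3; the molecular weights and the donor and acceptor counts are approximate figures entered by hand for illustration and should be checked against a curated source before any real use. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    name: str
    mol_weight: float    # Da (approximate)
    log_p: float         # as quoted in Section 6.4.3
    h_donors: int        # OH + NH count (approximate, hand-entered)
    h_acceptors: int     # N + O atom count (approximate, hand-entered)


def lipinski_violations(d):
    """Return the rule-of-five criteria that a compound breaks."""
    checks = [
        ("molecular weight > 500", d.mol_weight > 500),
        ("logP > 5", d.log_p > 5),
        ("H-bond donors > 5", d.h_donors > 5),
        ("H-bond acceptors > 10", d.h_acceptors > 10),
    ]
    return [label for label, broken in checks if broken]


# Illustrative descriptor values only; see the caveats in the lead-in.
compounds = [
    Descriptors("paclitaxel", 853.9, 3.7, 4, 15),
    Descriptors("GABA", 103.1, -3.2, 3, 3),
]

for c in compounds:
    print(c.name, lipinski_violations(c))
```

With these inputs paclitaxel breaks two criteria (size and acceptor count) yet is a valuable drug, which is the point made above: such filters are guidelines for prioritization, not verdicts.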

6.4.2  Natural Products for Increasing Diversity

The utility of natural products can be extended by using them as feedstock for the synthesis of complex and diverse small molecules through ring-distortion reactions, an approach dubbed complexity-to-diversity (CtD). CtD has been applied in the synthesis of 16 diverse scaffolds and 65 total compounds from the alkaloid natural product sinomenine (Harvey et al., 2015). Chemoinformatic analysis confirms that these compounds possess complex ring systems and marked three-dimensionality. It is, therefore, vital to identify drug candidates that possess both potency at a biological target and acceptable physico-chemical properties, i.e., acceptable DMPK profiles. Although chemical diversity is one of the key strengths of phytochemicals and other natural products in drug-discovery operations, there are some challenges associated with phytochemical drug discovery. Both the viscosity and the consistency of natural product extracts are often incompatible with automated handling systems. Consequently, the widely accepted method of separating the mixtures is to fractionate the extract into three or more fractions according to the polarity of the compounds within the complex mixture (Sarker and Nahar, 2012). The most common treatment uses solvents of increasing polarity: for instance, n-hexane (HPLC grade, free of phytanes) followed by dichloromethane and, finally, methanol. Hence, the relative polarity is increased successively, in steps of roughly 0.2 to 0.5: n-hexane, 0.009; dichloromethane, 0.309; methanol, 0.762; water, 1.000. Water is normally reserved for either Clevenger or steam distillation apparatus (e.g., for isolation of volatile oils from plants). The values for relative polarity, normalized from measurements of solvent shifts of absorption spectra, are available (Reichardt, 2003).
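The stepwise increase in polarity can be checked with a few lines of Python. The sketch below orders the four solvents by the relative polarity values quoted above and prints the size of each step; the dictionary and function names are illustrative assumptions.

```python
# Relative polarity values as quoted in the text (Reichardt, 2003).
RELATIVE_POLARITY = {
    "methanol": 0.762,
    "n-hexane": 0.009,
    "water": 1.000,
    "dichloromethane": 0.309,
}


def extraction_order(polarity):
    """Solvents sorted from least to most polar."""
    return sorted(polarity, key=polarity.get)


def polarity_steps(polarity):
    """Increase in relative polarity between consecutive solvents."""
    order = extraction_order(polarity)
    return [round(polarity[b] - polarity[a], 3)
            for a, b in zip(order, order[1:])]


order = extraction_order(RELATIVE_POLARITY)
steps = polarity_steps(RELATIVE_POLARITY)
for solvent, step in zip(order[1:], steps):
    print(f"-> {solvent}: +{step}")
```

The steps are not uniform, which is why the sequence is better described as one of increasing polarity than as a fixed increment.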

6.4.3  Natural Products Sample Preparation

Various sample preparation techniques for natural products can be found in the book edited by Sarker and Nahar (2012). However, the manual sample preparation process for phytochemicals can be tedious and time-consuming. One solution is to use an automated sample preparation procedure based on solid-phase extraction (SPE), which can overcome some of these drawbacks (Fig. 6.7). Schmid et al. (1999) used modified Zymark (Hopkinton, MA) RapidTrace® SPE workstations to develop a simplified and effective fractionation method that generated high-quality samples from natural fauna and flora, allowing rapid integration into high-throughput drug discovery campaigns, as discussed by Pereira and Williams (2007).

FIG. 6.7  Automated sample preparation for natural products (Thurman and Snavely, 2000). [Panels: free disk; syringe-barrel disk cartridges (1-mL and 6-mL syringes); extraction disk microtiter plate; SPE pipette tip.]

Many commercial vendors offer solutions to this problem, and an exhaustive list is beyond the scope of the current discourse. For instance, SPEEDY® is an automatic system for SPE and filtration, which can handle filtration plates and SPE plates or cartridges. Independent probes condition the column, then dilute and dispense the samples. Reproducible and reliable results are achieved with software-controlled vacuum and/or positive pressure (nitrogen or air). The system automates the entire workflow for SPE and filtration; handles all commercially available filter plates, columns and cartridges (1, 3, and 6–10 mL), and filter disks; processes samples according to their consistency (plasma, sera, cell culture materials, viscous samples); and controls the extraction process, optional heating and cooling, automated vial closure, evaporation, and pH measurement and adjustment. Although natural products represent a reservoir of untapped putative drugs and large molecular diversity, the actual process of isolating and identifying active compounds is a bottleneck in HTS natural products drug discovery programmes. The rapid isolation and efficient identification of the desired bioactive component(s) of natural product mixtures from bioassay-guided fractionation have become important economic factors. The fields of metabolomics and natural products discovery evolved independently, and each accentuates different aspects of metabolite analysis. The former derives meaning from extraordinarily complex data sets, while the latter centres on identifying single bioactive metabolites of interest. In line with the anticipated utility of new discoveries, searches for natural products of this type are now known as bioprospecting. 
Both approaches focus on metabolite identification and employ identical analytical technologies (Robinette et al., 2012). Chromatographic separation followed by MS analysis, in concert with nuclear magnetic resonance (NMR), is the method of choice. Despite considerable technological advances in separation and analysis (HPLC, UPLC, ion mobility), the high variability and complexity of natural product structures and their corresponding physico-chemical behaviour remain a challenge. For instance, bioactive metabolites range in size from quite small (e.g., quinoline from the phasmid insect Oreophoetes peruana; Eisner et al., 1997) to maitotoxin (Nicolaou and Aversa, 2011), currently the largest and most complex natural product that is neither a polysaccharide nor a protein. Some are highly hydrophilic (e.g., γ-aminobutyric acid (GABA), a neurotransmitter with a logP of −3.2), while others are lipophilic (e.g., paclitaxel (taxol®), an anticancer diterpene with a logP of +3.7). The convergence of metabolomics and natural product discovery occurs at the stage when spectral databases are interrogated with physico-chemical parameters (e.g., relative retention time, mass-over-charge ratio, MS fragmentation patterns, arrival times, mobilograms, and/or NMR spectral peaks) determined for complex mixtures or isolated metabolites. The quality of peak annotation (assigning a chemical identity or unique identifier) in metabolomics experiments and of dereplication (i.e., recognizing and eliminating from further consideration metabolites with known structures) in natural product screening is entirely dependent on the accessibility of comprehensive spectral databases. Hence, the use of reliable spectral databases (see Chapter 7) in identification processes is desirable, if not indispensable. The position of 13C NMR peaks is less solvent-dependent than that of 1H NMR peaks, which has prompted the establishment of commercial and open-access 13C databases. For instance, NAPROC-13, a database for the dereplication of natural product mixtures in bioassay-guided protocols, contains 13C NMR spectral information for over 6000 natural compounds; it allows rapid identification of known compounds (present in the crude extracts) and provides some insight towards the structural elucidation of unknown compounds. Unlike NMRShiftDB, NAPROC-13 uses Cartesian coordinates for generating graphics, which improves structure representations and therefore eases the attribution of any stereocentres in accordance with IUPAC recommendations. As of 2014, the compound set had grown to 21,000 compounds (Saez-Rodriguez et al., 2014). Until recently, natural products researchers have relied mostly on paper-based libraries, then disc-based, and then online commercial databases (e.g., NIST Standard Reference MS Database and Aldrich Spectral Viewer® NMR Library). Alternatively, many pharmaceutical companies have developed custom, in-house databases with limited or no public access to ensure competitiveness. In contrast, the metabolomics field promotes open-access databases, data exchange, and annotation/comments for all aspects of the experimental process, from sample tracking, to data analysis algorithms and programs accessible online and in real time, and finally to metadata deposition and curation. 
Access-restricted commercial libraries are currently an important tool in natural products research, but they are rapidly being complemented and, if current trends continue, will be surpassed by open-access, not-for-profit platforms. A list of open-access metabolomics databases for natural product research has been compiled by Johnson and Lange (2015).
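To illustrate how a database is interrogated with a mass-over-charge ratio during dereplication, the Python sketch below computes monoisotopic [M+H]+ values from molecular formulae and matches an observed m/z within a ppm tolerance. The five-compound "library" simply reuses compounds named in this chapter and is not a real spectral database; the function names, the 10 ppm tolerance, and the observed m/z values are illustrative assumptions, and only C, H, N, and O are supported.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}
PROTON = 1.00727646  # mass added on forming [M+H]+

# Toy library: molecular formulae of compounds mentioned in this chapter.
LIBRARY = {
    "patulin": "C7H6O4",
    "resveratrol": "C14H12O3",
    "genistein": "C15H10O5",
    "baicalein": "C15H10O5",
    "colchicine": "C22H25NO6",
}


def monoisotopic_mass(formula):
    """Sum isotope masses for a simple formula such as 'C15H10O5'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def dereplicate(observed_mz, library=LIBRARY, tol_ppm=10.0):
    """Names whose calculated [M+H]+ lies within tol_ppm of observed_mz."""
    hits = []
    for name, formula in library.items():
        calc = monoisotopic_mass(formula) + PROTON
        if abs(observed_mz - calc) / calc * 1e6 <= tol_ppm:
            hits.append(name)
    return sorted(hits)


print(dereplicate(155.0339))   # hypothetical observed m/z
print(dereplicate(271.0601))   # hypothetical observed m/z
```

The second query returns two isomeric flavonoids with the same formula, which shows why accurate mass alone is rarely sufficient and why retention time, fragmentation, or NMR data are interrogated alongside it.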

6.4.4  Examples of HTS Platforms for Natural Products/Phytochemicals

The PhytoLogix platform is one of the earlier HTS-compatible botanical libraries and was utilized successfully to discover new compounds with various bioactivities, e.g., anti-inflammatory properties (Zhao et al., 2007). This library, further improved since its first introduction, is still available commercially (http://www.unigenusa.com/research/phytologix) and allows quick searches through tens of thousands of ethnomedicinal plants categorized by traditional and historic uses. It is a proprietary technology platform that incorporates bioprospecting, bioinformatics, high-throughput purification, and structural dereplication, and comprises an extensive collection of >8000 medicinal plants, an informatics database covering the traditional and historic usage of, and modern research on, the collected plants, and a natural product library with >9000 plant extracts and well over 200,000 HTS-ready natural product fractions. Another similar example is the Seed Compounds Exploratory Unit for Drug Discovery Platform (http://www.riken.jp/en/research/labs/csrs/drug_discov/seed_compds/), which specializes in HTS of seed compounds that recover yeast phenotypes induced by human genes (Yashiroda et al., 2010).

An automated, cell-based HTS platform for the rapid detection of novel androgen receptor modulators, including natural products, was reported by Larson and Banks (2012). In this platform, automation of the initial library screens was achieved by using easy-to-use, robust instrumentation and computations. The 506-compound Screen-Well® Natural Product Library was screened using the automated 384-well Human Androgen Receptor Assay. A multi-parameter, high-content HTS platform to identify natural compounds that modulate insulin and Pdx1 expression was reported by Hill et al. (2010). They utilized a library of marine natural product extracts and made use of automated fluorescence microscopy incorporating the Cellomics ArrayScan VTI. The Cellomics Target Activation algorithm was applied for the identification of cells. Further details on the protocols involved can be found in the publication (Hill et al., 2010).

Paytubi et al. (2017) have recently reported the first validated HTS platform for screening microbial natural product extracts for molecules with antibiofilm properties against Salmonella. First, they developed a new methodology for growing Salmonella biofilms that are suitable for HTS platforms. Resazurin and crystal violet staining were used to quantify the biomass associated with biofilm at the solid–liquid interface, detecting living cells and total biofilm mass, respectively. A subset of 1120 extracts from the Fundación MEDINA’s collection was studied to identify molecules with antibiofilm activity. 
Active extracts were subjected to further fractionation and purification of the active compounds, which led to the discovery, in one of those extracts, of patulin as a potent molecule with antimicrobial activity against both planktonic cells and cells within the biofilm. Details on the HTS assay conditions and procedures can be obtained from the publication (Paytubi et al., 2017). Data analysis was automated with computational tools: extract activities were calculated automatically using the Genedata Screener software (Genedata AG, Basel, Switzerland; https://www.genedata.com/), and the percentage inhibition of each sample (extract or compound) was determined by appropriate equations, depending on the type of staining protocol.

Fox et al. (2012) screened 4156 compounds from the collections of the National Toxicology Program (http://www.ncbi.nlm.nih.gov/sites/entrez?db=pcsubstance&term=NTPHTS) and Tocris Biosciences, predominantly phytochemicals, e.g., resveratrol, genistein, and baicalein, using a cell-based quantitative HTS assay, i.e., a high-throughput ATAD5-luciferase assay that detects genotoxic compounds. The assay, in a 1536-well plate-based format, was developed to identify compounds that enhance human ATAD5 protein levels, by creating an HEK293T cell line that stably expresses luciferase-tagged ATAD5. The luminescence intensity of the assay plates was quantified using a ViewLux CCD-based plate reader (PerkinElmer) after a 30-min incubation at room temperature. Raw plate reads for each titration point were normalized to MMS (0.7 mM = 100%) and DMSO (0%) controls, and then corrected by applying a pattern-correction algorithm using compound-free control plates (DMSO plates). Concentration–response titration points for each compound were then fitted to the Hill equation (Weiss, 1997).

In recent years, screening of virtual phytochemical libraries by virtual HTS has gained momentum. Virtual or in silico HTS (vHTS) relies on high-performance computing and a range of algorithms, and has become very popular for filtering out unpromising compounds at the very beginning of a screening campaign. One recent example can be found in the publication by Powers and Setzer (2016). They applied a virtual HTS platform to screen a library of over 2000 phytochemical structures against dengue virus protein targets using a molecular docking approach. The virtual library comprised 3D structures of 290 alkaloids, 20 aurones, 81 chalcones, 100 coumarins, 349 flavonoids, 120 isoflavonoids, 74 lignans, 169 miscellaneous polyphenolic compounds, 67 quinones, 58 stilbenoids, 678 terpenoids, 28 xanthones, and 160 various other phytochemicals.

A commercially available vHTS data management platform is KNIME (https://www.knime.com/knime-applications/virtual-high-throughput-screening), which is well suited to processing huge amounts of data. The results of predicting the activity of as yet untested compounds can be visualized, for example, with the Enrichment Plotter, which was specially developed for this purpose. Another useful tool for inspecting the data is the so-called neighborgram (https://www.knime.com/neighborgrams), which displays the neighbourhood of data points labelled as ‘active’.
The Neighborgram feature provides nodes for data visualization and for the construction of predictive models using neighborgrams (Berthold et al., 2005). The concept, protocols, and applications of vHTS in phytochemical screening are presented in detail in the literature (Subramaniam et al., 2008; Tripathi et al., 2009; Simmons et al., 2013; Binkowski et al., 2014; Kakarala and Jamil, 2014; Acharya et al., 2015; Pyzer-Knapp et al., 2015; Zhu et al., 2015; Powers and Setzer, 2016). The success of any vHTS campaign depends on careful implementation of each phase of the computational screening experiment, from target preparation to hit identification and lead optimization (Subramaniam et al., 2008). Kakarala and Jamil (2014) screened phytochemicals from the Naturally Occurring Plant-based Anticancer Compound-Activity-Target database (NPACT, http://crdd.osdd.net/raghava/npact/) against protease activated receptor 1 (PAR1) using the virtual screening workflow of the Schrödinger software, which can analyse pharmaceutically relevant properties using QikProp and calculate binding energy using Glide at three accuracy levels (high-throughput virtual screening, standard precision, and extra precision).
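The control-based normalization and Hill fitting described earlier for the ATAD5-luciferase assay can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used by Fox et al. (2012): the function names, the grid ranges, the fixed 0% and 100% asymptotes, and the example numbers are all assumptions, and the plate pattern-correction step is not covered.

```python
def percent_of_control(raw, neg_ctrl_mean, pos_ctrl_mean):
    """Scale a raw read so the negative control (e.g., DMSO) maps to 0%
    and the positive control (e.g., MMS) maps to 100%."""
    return 100.0 * (raw - neg_ctrl_mean) / (pos_ctrl_mean - neg_ctrl_mean)


def hill(conc, bottom, top, ac50, slope):
    """Four-parameter Hill equation; conc and ac50 share units and must be > 0."""
    return bottom + (top - bottom) / (1.0 + (ac50 / conc) ** slope)


def fit_hill_grid(concs, responses, bottom=0.0, top=100.0):
    """Coarse grid search for (ac50, slope) minimising the sum of squared errors.

    Fixing bottom/top and scanning a grid is a simplification for illustration;
    real pipelines typically use nonlinear least squares instead.
    """
    ac50_grid = [10 ** (k / 10) for k in range(-20, 31)]  # 0.01 to 1000
    slope_grid = [0.5 + 0.25 * i for i in range(15)]      # 0.5 to 4.0
    best = None
    for ac50 in ac50_grid:
        for slope in slope_grid:
            sse = sum((hill(c, bottom, top, ac50, slope) - r) ** 2
                      for c, r in zip(concs, responses))
            if best is None or sse < best[0]:
                best = (sse, ac50, slope)
    return best[1], best[2]


if __name__ == "__main__":
    # Hypothetical raw luminescence reads for one compound (arbitrary units).
    concs = [0.1, 1.0, 10.0, 100.0, 1000.0]           # assumed units: uM
    raw_reads = [1150.0, 1900.0, 5400.0, 8700.0, 9300.0]
    dmso_mean, mms_mean = 1000.0, 9500.0               # hypothetical control means
    responses = [percent_of_control(r, dmso_mean, mms_mean) for r in raw_reads]
    print(fit_hill_grid(concs, responses))
```

Fitting on a logarithmic concentration grid mirrors how titration series are usually spaced; the returned pair is the half-maximal activity concentration and the Hill slope under the stated assumptions.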


6.5. HIGH-CONTENT SCREENING

High-content screening (HCS), also known as high-content analysis (HCA) or cellomics, is a method for identifying substances, including small synthetic molecules, peptides, RNAi, or natural products, that alter the phenotype of a cell in a desired and reproducible manner. It is considered increasingly useful in both biological research and drug discovery (Haney, 2008; Giuliano and Haskins, 2010). The phenotypic changes detected in such a screen may include altered production of cellular products, especially proteins, and/or morphological changes (i.e., visual appearance) within the cell. HCS includes any method used to analyse complete cells and/or components of cells with simultaneous readout of several desired parameters, which explains the name ‘high-content screening’ (Gasparri, 2009). In HCS, cells are incubated with the substance for a defined time, and then structures/molecular components of the cells are analysed. Often, proteins are labelled with fluorescent tags and changes in cell phenotype are quantified using automated image analysis. By employing fluorescent tags with different absorption and emission maxima, several different cell components (cytoplasm, nucleus, mitochondria, etc.) can be monitored in parallel. Alternatively, label-free assays can also be used.

A collection of microbial natural product extracts was examined by HCS to identify various nuclear export inhibitors. Leptomycins (Fig. 6.8) are secondary metabolites produced by Streptomyces spp. Leptomycin B (LMB) was originally discovered as a potent antifungal antibiotic found to cause cell elongation of the fission yeast Schizosaccharomyces pombe. However, recent data show that leptomycin causes G1 cell cycle arrest in mammalian cells and is a potent antitumour agent against murine experimental tumours in combination therapy (Lu et al., 2012).
Usefully, leptomycin B blocks the nuclear export of many proteins, including HIV-1 Rev, MAPK/ERK, and NF-κB/IκB, at low nM concentrations, and also inhibits the inactivation of the tumour protein p53 homologue, the so-called ‘guardian of the genome’ (Hietanen et al., 2000). Disappointingly, it showed severe dose-limiting toxicity (malaise and profound anorexia) in a Phase I clinical trial (Newlands et al., 1996). Although there are currently no clinically useful inhibitors of nuclear export, several natural products as well as semi-synthetic/synthetic molecules are being developed for therapeutic use. Several image-based HCS assays have been developed to assist the efficient identification of inhibitors of nuclear export (Zanella et al., 2007, 2010).

FIG. 6.8  Structure of leptomycin B.

A recent report used cell-based phenotypic profiling and image-based HCS to study the mode of action and potential cellular targets of plants historically used in Saudi Arabia’s traditional medicine, including Juniperus phoenicea, Anastatica hierochuntica, and Citrullus colocynthis. Cluster analyses of their cytological profiles suggested the presence of putative topoisomerase inhibitors that may be effective in cancer treatment. Histone H2AX phosphorylation, used as a marker for DNA damage, suggested that the compounds induced double-strand DNA breaks. Interestingly, chemical analysis of the active fraction isolated from J. phoenicea revealed putative anticancer compounds. This study is one example of the utility of cell-based phenotypic screening of natural products to unravel biochemical modes of action (Hajjar et al., 2017).

For genetic analysis, simple genetic model systems including yeast (Bancos et al., 2013), Caenorhabditis elegans (Kinser and Pincus, 2017), the zebrafish (Danio rerio), and Drosophila melanogaster are useful. However, Drosophila, as an invertebrate not requiring an animal license, is clearly the system of choice for studying cardiac development, function, and ageing, owing to the presence of a fluid-pumping heart, which allows examination of the gene networks involved in organogenesis upon exposure to natural products.
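The comparison of cytological profiles mentioned earlier in this section can be illustrated with a small sketch: each extract is summarized as a vector of image-derived features, scaled against control wells, and matched to the most similar reference profile. This is not the pipeline of Hajjar et al. (2017); the feature values, the reference class names, and the choice of Pearson correlation as the similarity measure are assumptions made purely for illustration.

```python
from math import sqrt


def zscore_profile(values, ctrl_means, ctrl_sds):
    """Express each feature as the number of control SDs away from the control mean."""
    return [(v - m) / s for v, m, s in zip(values, ctrl_means, ctrl_sds)]


def pearson(a, b):
    """Pearson correlation between two equal-length feature profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def nearest_reference(profile, references):
    """Return (name, r) for the reference profile most correlated with `profile`."""
    scored = ((name, pearson(profile, ref)) for name, ref in references.items())
    return max(scored, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical z-scored profiles over four image features
    # (e.g., nuclear area, H2AX intensity, cell count, mitochondrial signal).
    references = {
        "topoisomerase_inhibitor": [2.0, 3.0, -1.5, 0.5],
        "tubulin_binder": [-1.0, 0.2, -2.0, 3.0],
    }
    extract = zscore_profile([14.0, 26.0, 7.0, 11.0],
                             [10.0, 10.0, 10.0, 10.0],
                             [2.0, 5.0, 2.0, 2.0])
    print(nearest_reference(extract, references))
```

A full cluster analysis would compare all profiles with each other (for example, by hierarchical clustering); nearest-reference matching is the simplest instance of the same idea.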

6.6. CONCLUSIONS

HTS was initially developed for screening large synthetic compound libraries, generated by combinatorial synthesis, using a high degree of automation and robotics, aided by various computational tools and mathematical modelling. At the beginning, phytochemicals and other natural products were not included in this operation because appropriate dereplicated natural product libraries were unavailable. Phenomenal progress in computation and mathematical modelling has improved the speed, efficiency, and precision of the tools used for the extraction, isolation, and identification of phytochemicals, resulting in the formation of phytochemical libraries of various sizes and specifications. Phytochemical libraries are now available commercially and are also produced, albeit to a lesser extent, in academia, and they are routinely subjected to HTS.

REFERENCES

Abreu, P.M., Branco, P.S., 2003. Natural product-like combinatorial libraries. J. Braz. Chem. Soc. 14, 675–712.
Acharya, B., Ghosh, S., Manikyam, H.K., 2015. Nature’s response to influenza: a high throughput screening strategy of Ayurvedic medicinal phytochemicals. Int. J. Pharm. Sci. Res. 7, 2699–2719.
Albert, A., 1973. Selective Toxicity: The Physico-Chemical Basis of Therapy, fifth ed. Springer, US. http://www.springer.com/gb/book/9781489971302.

Annang, F., Perez-Moreno, G., Garcia-Hernandez, R., Cordon-Obras, C., Martin, J., Tormo, J.R., Rodriguez, L., de Pedro, N., Gomez-Perez, V., Valente, M., Reyes, F., Genilloud, O., Vicente, F., Castanys, S., Ruiz-Perez, L.M., Navarro, M., Gamarro, F., Gonzalez-Pacanowska, D., 2015. High-throughput screening platform for natural product-based drug discovery against 3 neglected tropical diseases: human African trypanosomiasis, leishmaniasis, and Chagas disease. J. Biomol. Screen. 20, 82–91.
Bancos, I., Bida, J.P., Tian, D., Bundrick, M., John, K., Holte, M.N., Maher, L.J., 2013. High-throughput screening for growth inhibitors using a yeast model of familial paraganglioma. PLoS One 8, e56827.
Beggs, M., 2000. HTS – Where next. Drug Discov. World Summer 2000, 25–30.
Berthold, M.R., Wiswedel, B., Patterson, D.E., 2005. Interactive exploration of fuzzy clusters using neighborgrams. Fuzzy Sets Syst. 149, 21–37.
Binkowski, T.A., Jiang, W., Roux, B., Anderson, W.F., Joachimiak, A., 2014. Virtual high-throughput ligand screening. In: Structural Genomics and Drug Discovery. Springer, New York, pp. 251–261.
Bickerton, G.R., Paolini, G.V., Besnard, J., Muresan, S., Hopkins, A.L., 2012. Quantifying the chemical beauty of drugs. Nat. Chem. 4, 90–98.
Björn, O., Stefan, W., Christian, G., Yasushi, N., Steffen, R., Daniel, R., Herbert, W., 2013. Natural-product-derived fragments for fragment-based ligand discovery. Nat. Chem. 5, 21–28.
Boutros, M., Bras, L.P., Huber, W., 2006. Analysis of cell-based RNAi screens. Genome Biol. 7, R66.
Calabrese, E.J., 2016. The emergence of the dose–response concept in biology and medicine. Int. J. Mol. Sci. 17, 2034–2048.
Chang, T.-Y., Koo, B.K., Gilleland, C.L., Wasserman, S.C., Yanik, M.F., 2010. Nat. Methods 7, 634–636.
Cragg, G.M., Newman, D.J., Snader, K.M., 1997. Natural products in drug discovery and development. J. Nat. Prod. 60, 52–60.
Cragg, G.M., Newman, D.J., Rosenthal, J., 2012. The impact of the United Nations Convention on Biological Diversity on natural products research. Nat. Prod. Rep. 29, 1407–1423.
Dancík, V., Seiler, K.P., Young, D.W., Schreiber, S.L., Clemons, P.A., 2010. Distinct biological network properties between the targets of natural products and disease genes. J. Am. Chem. Soc. 132, 9259–9261.
Dascombe, M.J., Drew, M.G.B., Evans, P.G., Ismail, F.M.D., 2007. Rational design strategies for the development of synthetic quinoline and acridine based antimalarials. Front. Drug Des. Discov. 3, 559–609.
Devlin, J.P., 1997. High Throughput Screening: The Discovery of Bioactive Substances. Marcel Dekker Inc., New York.
Dove, A., 2007. High-throughput screening goes to school. Nat. Methods 4, 523–532.
Eisner, T., Morgan, R.C., Attygalle, A.B., Smedley, S.R., Herath, K.B., Meinwald, J., 1997. Defensive production of quinoline by a phasmid insect (Oreophoetes peruana). J. Exp. Biol. 200, 2493–2500.
Filone, C.M., Hodges, E.N., Honeyman, B., Bushkin, G.G., Boyd, K., Platt, A., Ni, F., Storm, K., Hensley, L., Snyder, J.K., Connor, J.H., 2013. Identification of a broad-spectrum inhibitor of viral RNA synthesis: validation of a prototype virus-based approach. Chem. Biol. 20, 424–433.
Fox, S., Wang, H., Sopchak, L., Khoury, R., 2001. Increasing the chances of lead discovery. Drug Discov. World Spring, 35–44.
Fox, J.T., Sakamuru, S., Huang, R., Teneva, N., Simmons, S.O., Xia, M., Tice, R.R., Austin, C.P., Myung, K., 2012. High-throughput genotoxicity assay identifies antioxidants as inducers of DNA damage response and cell death. Proc. Natl. Acad. Sci. U. S. A. 109, 5423–5438.
Fujita, T., 1990. The extra-thermodynamic approach to drug design. In: Ramsden, C.A. (Ed.), Comprehensive Medicinal Chemistry. Vol. 4. Pergamon Press, N.Y, pp. 540–544.
Gasparri, F., 2009. An overview of cell phenotypes in HCS: limitations and advantages. Expert Opin. Drug Discov. 4, 643–657.

Giuliano, K.A., Haskins, J.R., 2010. High Content Screening: A Powerful Approach to Systems Cell Biology and Drug Discovery. Humana Press, Totowa, NJ.
Goktug, A.N., Chai, S.C., Chen, T., 2013. In: El-Shemy, H.A. (Ed.), Drug Discovery. InTech, United Kingdom, pp. 201–226.
Grabley, S., Thiericke, R., 1999. Drug Discovery From Nature. Springer-Verlag, Berlin, Heidelberg.
Hajjar, D., Kremb, S., Sioud, S., Emwas, A.-H., Voolstra, C.R., Ravasi, T., 2017. Anti-cancer agents in Saudi Arabian herbals revealed by automated high-content imaging. PLoS One 12, e0177316. https://doi.org/10.1371/journal.pone.0177316.
Haney, S.A., 2008. High Content Screening: Science, Techniques and Applications. Wiley-Interscience, New York.
Harvey, A., 2000. Strategies for discovering drugs from previously unexplored natural products. Drug Discov. Today 5, 294–300.
Harvey, A., 2008. Natural products in drug discovery. Drug Discov. Today 13, 894–901.
Harvey, A.L., Edrada-Ebel, R., Quinn, R.J., 2015. The re-emergence of natural products for drug discovery in the genomics era. Nat. Rev. Drug Discov. 14, 111–129.
Henrich, C.J., Beutler, J.A., 2013. Matching the power of high throughput screening to the chemical diversity of natural products. Nat. Prod. Rep. 30, 1284–1298.
Herrmann, E.C., Franke, R., 2013. Computer Aided Drug Design in Industrial Research. Springer Science & Business Media, Springer, Berlin.
Hietanen, S., Lain, S., Krausz, E., Blattner, C., Lane, D.P., 2000. Activation of p53 in cervical carcinoma cells by small molecules. Proc. Natl. Acad. Sci. U. S. A. 97, 8501–8506.
Hill, J.A., Szabat, M., Hoesli, C.A., Gage, B.K., Yang, Y.H.C., Williams, D.E., Riedel, M.J., Luciani, D.S., Kalynyak, T.B., Tsai, K., Ao, Z., Andersen, R.J., Warnock, G.L., Johnson, J.G., 2010. A multi-parameter, high-content, high-throughput screening platform to identify natural compounds that modulate insulin and pdx1 expression. PLoS One 5, e12958. https://doi.org/10.1371/journal.pone.0012958.
Horn, T., Esther, S., Oliver, P., 2010. Design and evaluation of genome-wide libraries for RNA interference screens. Genome Biol. 11, R61.
Jacob, R.T., Larsen, M.J., Larsen, S.D., Kirchoff, P.D., Sherman, D.H., Neubig, R.R., 2012. MScreen: an integrated compound management and high-throughput screening data storage and analysis system. J. Biomol. Screen. 17, 1080–1087.
Janzen, W.P., 2002. High Throughput Screening: Methods and Protocols. Humana Press Scientific and Medical Publishers, Totowa, NJ.
Janzen, W.P., Bernasconi, P., 2009. High Throughput Screening: Methods and Protocols, second ed. Springer, New York.
Johnson, S.R., Lange, B.M., 2015. Open-access metabolomics databases for natural product research: present capabilities and future potential. Front. Bioeng. Biotechnol. 3, 22. https://doi.org/10.3389/fbioe.2015.00022.
Kakarala, K.K., Jamil, K., 2014. Screening of phytochemicals against protease activated receptor 1 (PAR1), a promising target for cancer. J. Recept. Signal Transduction 35, 26–45.
Kato, N., Comer, E., Sakata-Kato, T., Sharma, A., Sharma, M., Maetani, M., Bastien, J., Brancucci, N.M., Bittker, J.A., Corey, V., Clarke, D., Derbyshire, E.R., Dornan, G.L., Duffy, S., Eckley, S., Itoe, M.A., Koolen, K.M.J., Lewis, T.A., Lui, P.S., Lukens, A.K., Lund, E., March, S., Meibalan, E., Meier, B.C., McPhail, J.A., Mitasev, B., Moss, E.L., Sayes, M., Van Gessel, Y., Wawer, M.J., Yoshinaga, T., Zeeman, A.-M., Avery, V.M., Bhatia, S.N., Burke, J.E., Catteruccia, F., Clardy, J.C., Clemons, P.A., Dechering, K.J., Duvall, J.R., Foley, M.A., Gusovsky, F., Kocken, C.H.M., Marti, M., Morningstar, M.L., Munoz, B., Neafsey, D.E., Sharma, A., Winzeler, E.A., Wirth, D.F., Scherer, C.A., Schreiber, S.L., 2016. Diversity-oriented synthesis yields novel multistage antimalarial inhibitors. Nature 538, 344–349.

Kinser, H.E., Pincus, Z., 2017. High-throughput screening in the C. elegans nervous system. Mol. Cell. Neurosci. 80, 192–197.
Kouznetsova, J., Sun, W., Martínez-Romero, C., Tawa, G., Shinn, P., Chen, C.Z., Schimmer, A., Sanderson, P., McKew, J.C., Zheng, W., García-Sastre, A., 2014. Identification of 53 compounds that block Ebola virus-like particle entry via a repurposing screen of approved drugs. Emerg. Microbes. Infect. 3, e84. https://doi.org/10.1038/emi.2014.88.
Larson, B., Banks, P., 2012. An Automated, Cell-Based Platform for the Rapid Detection of Novel Androgen Receptor Modulators. Application Note: Drug Discovery, BioTek, https://www.biotek.com/assets/tech_resources/Indigo_AR_Assay__App_Note.pdf.
Leeson, P.D., Springthorpe, B., 2007. The influence of drug-like concepts on decision-making in medicinal chemistry. Nat. Rev. Drug Discov. 6, 881–890.
Lipinski, C.A., Lombardo, F., Dominy, B.W., Feeney, P.J., 2001. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 46, 3–26.
Lloyd, N.C., Morgan, H.W., Nicholson, B.K., Ronimus, R.S., 2005. The composition of Ehrlich’s salvarsan: resolution of a century-old debate. Angew. Chem. Int. Ed. 44, 941–944.
Lu, C., Shao, C., Cobos, E., Singh, K.P., Gao, W., 2012. Chemotherapeutic sensitization of leptomycin B resistant lung cancer cells by pretreatment with doxorubicin. PLoS One 7, e32895. https://doi.org/10.1371/journal.pone.0032895.
Macarrón, R., Hertzberg, R.P., 2009. In: Janzen, W.P., Bernasconi, P. (Eds.), High Throughput Screening: Methods and Protocols. second ed. Springer, N.Y.
Maetani, M., Kato, N., Valquiria, A.P., Jabor, F.A., Nonato, C.M.C., Scherer, C.A., Schreiber, S.L., 2017. Discovery of antimalarial azetidine-2-carbonitriles that inhibit P. falciparum dihydroorotate dehydrogenase. ACS Med. Chem. Lett. 8, 438–442.
Makarenkov, V., Kevorkov, D., Zentilli, P., Gagarin, A., Malo, N., Nadon, R., 2006. HTS-Corrector: software for the statistical analysis and correction of experimental high-throughput screening data. Bioinformatics 22, 1408–1409.
Manns, R., 1999. Microplate History 2nd Edition, presented at MipTec-ICAR’99, May 17–21, 1999, Montreux, Switzerland; Thoma, A., 2000. Recollections of early microplate automation. J. Assoc. Lab. Autom. 5, 30–31.
Minor, L.K., 2006. Handbook of Assay Development in Drug Discovery. CRC Press Taylor and Francis, Boca Raton, FL.
Newlands, E.S., Rustin, G.J., Brampton, M.H., 1996. Phase I trial of elactocin. Br. J. Cancer 74, 648–649.
Newman, D.J., Cragg, G.M., Snader, K.M., 2000. The influence of natural products upon drug discovery. Nat. Prod. Rep. 17, 215–234.
Newman, D.J., Cragg, G.M., 2016. Natural products as sources of new drugs from 1981 to 2014. J. Nat. Prod. 79, 629–661.
Nicolaou, K.C., Aversa, R.J., 2011. Maitotoxin: an inspiration for synthesis. Isr. J. Chem. 51, 359–377.
Opera, T.I., Davis, A.M., Teague, S.J., Leeson, P.D., 2001. Is there a difference between leads and drugs? A historical perspective. J. Chem. Inf. Comput. Sci. 41, 1308–1315.
Overing, F., Bikker, J., Humblet, C., 2009. Escape from flatland: increasing saturation as an approach to improving clinical success. Med. Chem. 52, 6752–6756.
Paytubi, S., de La Cruz, M., Tormo, J.R., Martin, J., Gonzalez, I., Gonzalez-Menendez, V., Reyes, F., Vicente, F., Madrid, C., Balsalobre, C., 2017. A high-throughput screening platform of microbial natural products for the discovery of molecules with antibiofilm properties against Salmonella. Front. Microbiol. 8, 326 (13 pages). https://doi.org/10.3389/fmicb.2017.00326.

Pelz, O., Gilsdorf, M., Boutros, M., 2010. web cellHTS2: a web-application for the analysis of high-throughput screening data. BMC Bioinf. 11, 185. https://doi.org/10.1186/1471-2105-11-185.
Pereira, D.A., Williams, J.A., 2007. Origin and evolution of high throughput screening. Br. J. Pharmacol. 152, 53–61.
Pérez-Moreno, G., Cantizani, J., Sánchez-Carrasco, P., Ruiz-Pérez, L.M., Martin, J., el Aouad, N., Pérez-Victoria, I., Tormo, J.R., González-Menendez, V., Gonzalez, I., de Pedro, N., Reyes, F., Genilloud, O., Vicente, F., González-Pacanowska, D., 2016. Discovery of new compounds active against Plasmodium falciparum by high throughput screening of microbial natural products. PLoS One 11, e0145812. https://doi.org/10.1371/journal.pone.0145812.
Peterson, R.T., Shaw, S.Y., Peterson, T.A., Milan, D.J., Zhong, T.P., Schreiber, S.L., MacRae, C.A., Fishman, M.C., 2004. Chemical suppression of a genetic mutation in a zebrafish model of aortic coarctation. Nat. Biotechnol. 22, 595–599.
Powers, C.N., Setzer, W.N., 2016. An in silico investigation of phytochemicals as antiviral agents against dengue fever. Comb. Chem. High Throughput Screen. 19, 516–536.
Pyzer-Knapp, E.O., Suh, C., Gomez-Bombarelli, R., Aguilera-Iparraguirre, J., Aspuru-Guzik, A., 2015. What is high-throughput virtual screening? A perspective from organic materials discovery. Annu. Rev. Mater. Res. 45, 195–216.
Reichardt, C., 2003. Solvents and Solvent Effects in Organic Chemistry, third ed. Wiley-VCH Publishers, Hoboken, New Jersey, USA.
Rieber, N., Knapp, B., Eils, R., Kaderali, L., 2009. RNAither, an automated pipeline for the statistical analysis of high-throughput RNAi screens. Bioinformatics 25, 678–679.
Robinette, S.L., Brüschweiler, R., Schroeder, F.C., Edison, A.S., 2012. NMR in metabolomics and natural products research: two sides of the same coin. Acc. Chem. Res. 45, 288–297.
Saez-Rodriguez, J., Rocha, M.P., Fdez-Riverola, F., De Paz, J.F., 2014. 8th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2014), Springer, 21 May 2014.
Sarker, S.D., Nahar, L., 2012. Natural Products Isolation, 3rd Edition. Humana Press–Springer-Verlag, USA.
Schmid, I., Sattler, I., Grabley, S., Thiericke, R., 1999. Origin and evolution of high throughput screening: natural products in high throughput screening – automated high-quality sample preparation. J. Biomol. Screen. 4, 15–25.
Seethala, R., Fernandes, P., 2001. Handbook of Drug Screening. Marcel Dekker Inc, New York.
Sever, J.L., 1962. Application of a microtechnique to viral serologic investigations. J. Immunol. 88, 320–329.
Sills, M.A., 1998. Future considerations in HTS: the acute effect of chronic dilemmas. Drug Discov. Today 3, 304–322.
Simmons, K.J., Gotfryd, K., Billesbolle, C.B., Loland, C.J., Gether, U., Fishwick, C.W., Johnson, A.P., 2013. A virtual high-throughput screening approach to the discovery of novel inhibitors of the bacterial leucine transporter, LeuT. Mol. Membr. Biol. 30, 184–194.
Sink, R., Gobec, S., Pecar, S., Zega, A., 2010. False positives in the early stages of drug discovery. Curr. Med. Chem. 17, 4231–4255.
Sittampalam, G.S., Coussens, N.P., Brimacombe, K., Grossman, A., Arkin, M., Auld, D., Austin, C., Baell, J., Bejcek, B., Chung, T.D.Y., Dahlin, J.L., Devanaryan, V., Foley, T.L., Glicksman, M., Hall, M.D., Hass, J.V., Inglese, J., Iversen, P.W., Kahl, S.D., Kales, S.C., Lal-Nag, M., Li, Z., McGee, J., McManus, O., Riss, T., Trask Jr., J., Weidner, J.R., Xia, M., Xu, X., 2017. Assay Guidance Manual. National Center for Advancing Translational Sciences (NCATS), 6701 Democracy Boulevard, Bethesda MD 20892-4874. Available online at: https://ncats.nih.gov/expertise/preclinical/agm.

Subramaniam, S., Mehrotra, M., Gupta, D., 2008. Virtual high throughput screening (vHTS) – a perspective. Bioinformation 3, 14–17.
Tai, D., Chaguturu, R., Fang, J., 2011. K-Screen: a free application for high throughput screening data analysis, visualization, and laboratory information management. Comb. Chem. High Throughput Screen. 14, 757–765.
Takatsy, G., 1950. Uj modszer sorozatos higitasok gyors es pontos elvegzesere (A rapid and accurate method for serial dilutions). Kiserl. Orvostud. 5, 393–397.
Takatsy, G., 1967. Use and fields of application of a modified microtitration apparatus. Hung. Sci. Instrum. 10, 10–18.
Thurman, E.M., Snavely, K., 2000. Advances in solid-phase extraction disks for environmental chemistry. TrAC Trends Anal. Chem. 19, 18–26. https://doi.org/10.1016/S0165-9936(99)00175-2.
Tolopko, A.N., Sullivan, J.P., Erickson, S.D., Wrobel, D., Chiang, S.L., Rudnicki, K., Rudnicki, S., Nale, J., Selfors, L.M., Greenhouse, D., Muhlich, J.L., Shamu, C.E., 2010. Screensaver: an open source lab information management system (LIMS) for high throughput screening facilities. BMC Bioinf. 11, 260. https://doi.org/10.1186/1471-2105-11-260.
Tripathi, P., Chaudhary, R., Singh, A., 2009. Virtual screening of phytochemicals to novel targets in Haemophilus ducreyi towards the treatment of chancroid. Bioinformation 10, 502–506.
Wang, X., Terfve, C., Rose, J.C., Markowetz, F., 2011. HTSanalyzeR: an R/Bioconductor package for integrated network analysis of high-throughput screens. Bioinformatics 27, 879–880.
Weiss, J.N., 1997. The Hill equation revisited: uses and misuses. FASEB J. 11, 835–841.
White, D.T., Eroglu, A.U., Wang, G., Zhang, L., Sengupta, S., Ding, D., Rajpurohit, S.K., Walker, S.L., Ji, H., Qian, J., Mumm, J.S., 2016. ARQiv-HTS, a versatile whole-organism screening platform enabling in vivo drug discovery at high-throughput rates. Nat. Protoc. 11, 2432–2453.
Wu, G., 2010. Assay Development: Fundamentals and Practices. John Wiley and Sons Inc., Hoboken, New Jersey.
Yashiroda, Y., Okamoto, R., Hatsugai, K., Takemoto, Y., Goshima, N., Hamamoto, M., Sugimoto, Y., Osada, H., Seimiya, H., Yoshida, M., 2010. A novel yeast cell-based screen identifies flavone as a tankyrase inhibitor. Biochem. Biophys. Res. Commun. 394, 569–573.
Zanella, F., Rosado, A., Blanco, F., Henderson, B.R., Carnero, A., Link, W., 2007. An HTS approach to screen for antagonists of the nuclear export machinery using high content cell-based assays. Assay Drug Dev. Technol. 5, 333–341.
Zanella, F., Lorens, J.B., Link, W., 2010. High content screening: seeing is believing. Trends Biotechnol. 28, 237–245.
Zhang, X.D., 2011. Optimal High-Throughput Screening, Practical Experimental Design and Data Analysis for Genome-Scale RNAi Research. Cambridge University Press, Cambridge.
Zhao, Y., Jia, Q., Hong, M., Zhao, J., O’Reilly, T., Ma, W., Abeysinghe, P., 2007. In: A flavonoid composition regulates the COX/LOX pathways and cytokine gene expression: a potential ingredient for topical inflammation. 68th Annual Meeting. The Society for Investigative Dermatology. Los Angeles, CA, May 9–12, 2007.
Zhu, J., Mishra, R.K., Schiltz, G.E., Makanji, Y., Schiedt, K.A., Mazar, A.P., Woodruff, T.K., 2015. Virtual high-throughput screening to identify novel activin antagonists. J. Med. Chem. 58, 5637–5648.

Chapter 7

Prediction of Structure Based on Spectral Data Using Computational Techniques

Fyaz M.D. Ismail, Lutfun Nahar, Satyajit D. Sarker
Liverpool John Moores University, Liverpool, United Kingdom

Chapter Outline
7.1. Introduction
7.1.1 History of Spectroscopy
7.1.2 Misassignments of Structures: A Rarity or More Common Than Expected?
7.2. Structure Elucidation Strategies
7.3. What is Density Functional Theory?
7.4. Era of Assignment Versus Prediction
7.4.1 Nuclear Magnetic Resonance
7.4.2 Computational Mass Spectrometry
7.4.3 Chiral Centres
7.4.4 Structure by Calculations
7.4.5 UV Spectroscopy
7.4.6 Infrared (IR) Spectroscopy
7.4.7 Database Search Algorithm
7.5. Can Raman Be Used for Automated Assays and HTS?
7.6. X-Ray Sponge Technique
7.7. Conclusions
References
Further Reading

Computational Phytochemistry. https://doi.org/10.1016/B978-0-12-812364-5.00007-9 © 2018 Elsevier Inc. All rights reserved.

7.1. INTRODUCTION

For phytochemistry research, absolute proof of molecular structure has been, from the outset, synonymous with unambiguous synthesis of the postulated compound (by either formal or total synthesis) (Nicolaou and Sorensen, 1996; Nicolaou et al., 1998), and this drove development of the pharmaceutical sector (Sneader, 1985). Every advance in spectroscopy has gone hand in hand with synthesis and/or elucidation of structure (Silverstein, 1991). However, this partnership took time to become established, and did so only once the commercial construction of instruments and their interfacing to computers became routine. Hence, it is necessary to place structural elucidation in its historical context before focusing on more interesting developments. There are two basic approaches: elucidating the structure of pure molecules purified from an extract, or retrieval (also called dereplication in metabolomics), which identifies a compound without actually isolating individual components from a (complex) mixture. An overview of the entire process has been presented by Gaudêncio and Pereira (2015). Selected advances since then are discussed in this chapter.

7.1.1  History of Spectroscopy

The invention and advancement of spectroscopy, in its current form, resulted from the interaction of scientists from several disciplines, including physics, astronomy, engineering, and chemistry. Thomas (1991) has written a general account of the early history of spectroscopy which, however, contains some inaccuracies, as pointed out by Naqvi (1993). A more erudite discourse on early developments in nuclear magnetic resonance (NMR) is available from Becker (1993), while general advances in mass spectrometry (MS) are briefly described by Griffiths (2008). Biemann at MIT was among the very first to apply MS to natural products of unknown structure (Griffiths, 2008). When investigating the structure of sarpagine (Fig. 7.1) from Rauwolfia serpentina, Biemann wrote presciently, ‘we feel this investigation points out some of the potentialities that MS has in this field’ (Biemann, 1960). Djerassi asked for help in setting up a natural products mass-spectrometry research group at Stanford, and Biemann duly obliged (Griffiths, 2008). Shortly thereafter, Djerassi rapidly extended the application of MS to natural products, developing it into a general elucidation tool (Budzikiewicz et al., 1964; Budzikiewicz, 2015). One of the earliest champions of spectroscopy in natural products analysis and synthesis was Woodward (at Harvard), whose team pushed such techniques to their limits (Morris, 2002). These influential pioneers applied spectroscopy and spectrometry to supersede tedious chemical degradation and proof by total synthesis as the means of determining molecular structure, and ensured that, within a decade or so, most affluent chemistry departments were using a combined spectroscopic and spectrometric approach. By the 1970s, useful compendia of theory, examples, and tables had appeared (Scheinmann, 1970; Maas, 1973). These efforts established MS as an essential investigative tool by formulating fragmentation pathways, in part through careful analysis of selectively deuterated analogues, at both exchangeable and nonexchangeable sites. This strategy also influenced NMR (Lindon et al., 2016) and IR studies (see Socrates, 2004, for leading references).

FIG. 7.1  Sarpagine. (A) Line drawing; (B) energy-minimized structure. Note the change in configuration of the alkene.

Interestingly, around this time, the term computer was often used for the mathematically adroit individuals required for checking ballistic calculations at NASA (McLennan and Gainer, 2012); they were all rapidly replaced by modern computers. Computers had progressed rapidly from military code-breaking machines such as Colossus to business machines (http://history-computer.com/index.html). Consequently, in the current context, computers are electromechanical machines used to perform calculations with a set programme or using artificial intelligence (AI) methods, including machine learning (e.g., neural networks and genetic algorithms). The personal computer (PC) revolution from the mid-1970s onward, with a concurrent decrease in price, first through self-build kits and then through the business/home market, has enabled integration of low-cost, non-mainframe computers into the laboratory environment. Although PCs are stable for short calculations, large parallel computers are less prone to crashing and employ robust computing languages. One further consideration is that data are now mainly acquired in digital rather than analogue formats, making acquisition, storage, processing, prediction, and comparison of data with digital computers much more robust (Hilbert and López, 2011).

By the early 1960s, Joshua Lederberg had developed an interest in AI as a means of studying hypothesis formation and discovery in science, including the deduction of unknown molecules by rule-based algorithms and the derivation of accurate molecular formulae, to help apply MS to problems in his exobiology research. He and his colleagues, based at Stanford, designed the programme Dendral (a portmanteau of ‘Dendritic Algorithm’). This first application of such techniques to structure elucidation problems in organic chemistry was instrumental in moving AI forward. Although successors to such programmes are routinely incorporated within proprietary and academic mass-spectrometry software, they are seldom explicitly referred to as AI applications (Beavis et al., 2006).

7.1.2  Misassignments of Structures: A Rarity or More Common Than Expected?

Occasionally, a synthesized sample of the postulated natural compound does not match the original spectral assignments, requiring revision(s) of the structure until the synthetic and natural materials match both chromatographically and spectrally (Yoo et al., 2016). Compounds originally isolated from natural sources, e.g., plants, before the spectroscopic/spectrometric/X-ray diffraction (SSXRD) era were rather challenging problems, cholesterol being a seminal example (Bernal et al., 1940) (Fig. 7.2). Wieland and Windaus received Nobel Prizes in 1927 and 1928, respectively, for their natural products work, including what was subsequently shown to be an incorrect structure. Bernal narrowly missed out on receiving a Nobel Prize (Brown, 2005) for not only revising this structure using single-crystal X-ray diffraction but also providing 80 examples from which he drew generalizations for the structural determination of steroids; these are as relevant today as on the day they were first published (Bernal et al., 1940).


FIG. 7.2  A notable misassignment: Wieland received the Nobel Prize in Chemistry in 1927 for his investigations of the constitution of the bile acids and related substances, and Windaus in 1928 for the services rendered through his research into the constitution of the sterols and their connection with the vitamins, and hence for deriving structures of various natural products, such as their proposed structure of cholesterol, C27H46O (Bernal, 1932). Note that the originally postulated structure incorporates a considerable amount of strain compared to the actual structure.

Often, new investigators assume a degree of infallibility in published spectroscopic data, especially NMR data. Since such structures are often in part assigned/interpreted, rather than entirely deduced de novo, misassignments can occur. Another example is shown in Fig. 7.3 (Elyashberg et al., 2010). A more recent, somewhat notorious, example is that of hexacyclinol, which was first isolated from Panus rudis as an antiproliferative agent (Schlegel et al., 2002); an initial total synthesis of this compound was subsequently reported by La Clair (2006, 2012) (Fig. 7.4). Later, however, Rychnovsky (2006) simulated the 13C NMR spectrum of hexacyclinol and found that it did not correspond to the spectrum of the compound synthesized by La Clair (2006); he proposed a different structure for hexacyclinol, which was confirmed by synthesis by Porco et al. (2006). In this series of events, it became evident that computer-based predictive spectrum generation, or simulation, and auto-interpretation of spectral data can

FIG. 7.3  Another example of incorrect structure elucidation and subsequent revision.


FIG. 7.4  Incorrect structure of hexacyclinol as proposed by Schlegel et al. (2002) and La Clair (2006), and the revised correct structure confirmed by Rychnovsky (2006) and synthesized by Porco et al. (2006).

be a useful tool for correcting the structures of natural products that have been misidentified. Importantly, the number of such misassignments is greater than expected, as reviewed by Nicolaou and Snyder (2005).

7.2.  STRUCTURE ELUCIDATION STRATEGIES

Structure elucidation of phytochemicals, especially completely unknown metabolites, has always been one of the most challenging tasks in phytochemical research. Before the spectroscopy era, most isolated phytochemicals were identified by tedious processes involving chemical synthesis, degradation, and various functional group tests. However, with the introduction of, and tremendous advancements in, various spectroscopic techniques, especially NMR and MS, and most recently the availability of several computational tools and mathematical modelling, structure elucidation of phytochemicals is no longer as difficult as it used to be. Figs 7.5 and 7.6 show flowcharts of the pre-1950s strategy and of modern approaches for the structure elucidation of natural products. From the 1960s onward, MS gained a foothold within natural products chemistry as commercial instruments replaced bespoke ones. MS was originally limited to electron ionization (EI) of molecules that vapourized intact and undecomposed into the high-vacuum system; the subsequent introduction of alternative, softer ionization techniques allowed unprecedented


FIG. 7.5  Flowchart illustrating pre-1950s strategy for structural elucidation of natural products.

FIG. 7.6  Modern methods of determining the structure of isolated natural products.

access to compounds with masses well over one million Daltons. Analysis can even be extended to intact viruses. MS underwent a paradigm shift when Barber et al. (1981) developed ionization of large, polar molecules by fast-atom bombardment (FAB). Early developments in the use of FABMS for natural products and compounds for biological studies are documented in various publications (Fenselau, 1984; Claeys and Claereboudt, 2017). A timeline of important advances in MS over the last century, including instrument developments, their applications, and general scientific achievements, especially in liquid chromatography–tandem MS (LC–MS/MS), is given by Yates (2011).


Introduction of ion mobility in the structure elucidation scene has empowered modern strategies. In fact, ion mobility spectrometry (IMS) is now recognized as an important analytical method for its speed, sensitivity, and unique separation mechanism. Hyphenated with MS, especially time-of-flight (TOF) MS, ion mobility MS (IMMS) provides a multidimensional technique for the analysis of complex samples, such as those encountered in metabolomics and proteomics research. HPLC can be easily interfaced to an atmospheric drift tube ion mobility TOF MS. The power of multidimensional separation has recently been shown using chilli pepper extracts (Liu et al., 2016). Hadamard transform (HT) ion mobility and TOF MS provided an additional dimension of separation for complex samples without increasing the analysis time compared with conventional HPLC. The addition of HPLC to IMMS improved sensitivity, because charge competition among the components of a complex mixture decreases when they are introduced after chromatographic separation rather than by direct-infusion ESI. McLafferty and Turecek (1993) described strategies, methods, and tools for computer identification of unknown mass spectra. Since then, several computational tools and mathematical modelling software packages have become available for interpreting MS data as well as predicting structures from MS data; these have revolutionized MS analysis of organic molecules, particularly complex natural products including phytochemicals, and plant metabolomics. The chart (Fig. 7.7) shows one combined approach for solving structures (using MS) to deduce the hybridization, connectivity and, if present, stereochemistry within a given molecule. In addition, MS provides information regarding five important aspects of natural products: isotope analysis, quantitation,

FIG.  7.7  Simple flowchart for computations of NMR and chiroptical properties of a molecular system. Abbreviations: MC, Monte Carlo; MD, molecular dynamics. (Elements adapted from Autschbach, J., 2009. Computing chiroptical properties with first-principles theoretical methods: background and illustrative examples. Chirality 21, E116–E152.)


fingerprinting, structure elucidation and, most recently, shape. The first two aspects are important for establishing biosynthetic pathways. The accurate (or nominal) mass, in conjunction with the isotope pattern, can affirm the molecular weight of purified compounds or mixtures (Budzikiewicz, 2015). Finally, shape is a further property that can now be determined using ion mobility mass spectrometers (Fernández-Maestre, 2012). These have been commercially available since 2006 and can be used in conjunction with computed values of the collision cross section, a parameter that can supplement structural information from molecular modelling, X-ray crystallography, and NMR (Zhou et al., 2016). Functional groups are commonly assigned using infrared (IR) spectroscopy in conjunction with published tables, but Raman spectroscopy is proving increasingly useful; both forms of vibrational spectroscopy can be compared with computed values. One popular online wizard by Thomas can be used to quickly interrogate an FT-IR spectrum. Fast computers, with almost limitless data storage capacities, have made possible automated structural studies and the analysis of complex mixtures. Ernst and Anderson (1966) described the application of Fourier-transform spectroscopy (FTS) to magnetic resonance. To appreciate the advances brought about by the application of computers, two papers on natural products, by Weigert et al. (1968) and Allerhand et al. (1971), are essential reading. The former showed the first practical, though unassigned, 13C NMR spectra of medium molecular weight natural products (squalene, a mixture of pinenes, sucrose, and cholesterol); in the latter, the authors reported the fully assigned proton-decoupled 13C FT-NMR spectra of cholesteryl chloride, sucrose, and adenosine 5′-monophosphate. During 1969–70, an enormous acceleration occurred in the investigation of natural products by 13C NMR.
Importantly, unlike 1H NMR, 13C NMR is particularly useful for the evaluation of natural products, since the resonances are mostly independent of solvent effects and are dispersed (i.e., spread out) over a relatively wide range of frequencies (Rychnovsky, 2006; Fattorusso et al., 2007). Extension of this principle to single molecules and now mixtures requires computational approaches for efficient solutions to the problem of structure elucidation and dereplication (see Chapter 5 for dereplication). Various tools are available for prediction of both 1H and 13C NMR spectra within proprietary programmes such as ChemDraw. However, caution is advised when applying such programmes to complex structures. Tools allowing calculation of the isotope ratio, as well as the putative identity of prominent mass ions, are now available at the click of a button using a computer interface to various internet-enabled databases. They should be considered a gateway to the more sophisticated calculations available in quantum chemistry programmes. After commercialization and widespread adoption of computers, quantum mechanics became easier to use, especially through the Quantum Chemistry Program Exchange (QCPE), which subsequently declined in the post-internet era (Boyd, 2013). With the help of successively improved and affordable computers, this culminated in the use first of computationally expensive ab initio methods and then of the more widely used density functional theory (DFT) approaches, which use less computer time.

7.3.  WHAT IS DENSITY FUNCTIONAL THEORY?

DFT is a computational quantum-mechanical modelling method used in chemistry, materials science, and physics to interrogate the electronic structure of many-body systems, in particular atoms, molecules, and the condensed phases (Parr and Yang, 1989). The history of the developments leading to this powerful approach has been discussed (Schlecht, 1998; Gavroglu and Simões, 2012). A brief account of DFT can also be found in Chapter 1. In DFT, calculations are commonly performed for the ground state, but have also been extended to include excited states. Using this theory, the properties of a many-electron system are determined by so-called functionals, i.e., functions of another function, which in this case is the spatially dependent electron density. Since both properties and chemistry are explicable in terms of the electronic configuration of molecules, the ability to accurately depict, i.e., model, such systems is of direct relevance. The name DFT arises from the use of functionals of the electron density. DFT remains one of the most popular and versatile methods available for all matter, but especially for organic compounds and natural products. Since the properties of molecules can be modelled (using empirical methods), it has been developed for predicting and assigning the spectra of sophisticated natural products without incurring a large computational penalty. DFT was first applied to solid-state physics in the 1970s; once the approximations used in the theory had been greatly refined to better model the exchange and correlation interactions, it could be applied to quantum chemistry in the 1990s. Consequently, computational costs are acceptably low when compared to traditional methods, including exchange-only Hartree–Fock theory and its descendants that include electron correlation.
Although limitations exist, especially when considering intermolecular interactions (in particular van der Waals forces, i.e., dispersion), charge-transfer excitations, transition states, etc., these do not restrict applications to the determination of natural product structure. However, where dispersion effects compete significantly with other effects, this remains problematic, especially for large compounds, since the work of Hunter has overemphasized the contribution of π–π stacking (Grimme, 2008). Another problem arises in making chemical shift predictions when heavier elements such as the halogens are present. Solutions to some of these problems are discussed later in this chapter.
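For orientation, the idea of an energy functional can be written out explicitly. The expression below is the standard Kohn–Sham form of the total energy in atomic units, given as a textbook summary rather than as a formula taken from this chapter; the symbols (T_s, v_ext, E_xc, and the orbitals φ_i) are introduced only for this illustration.

```latex
E[\rho] = T_s[\rho]
        + \int v_{\mathrm{ext}}(\mathbf{r})\,\rho(\mathbf{r})\,d\mathbf{r}
        + \frac{1}{2}\iint \frac{\rho(\mathbf{r})\,\rho(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}\,d\mathbf{r}\,d\mathbf{r}'
        + E_{\mathrm{xc}}[\rho],
\qquad
\rho(\mathbf{r}) = \sum_{i}^{\mathrm{occ}} |\varphi_i(\mathbf{r})|^2
```

Here T_s is the kinetic energy of a non-interacting reference system, the second and third terms are the interaction with the external (nuclear) potential and the classical electron–electron repulsion, the sum runs over occupied spin orbitals, and E_xc collects the exchange and correlation effects. The approximations referred to above concern this last term, which is why the choice of functional matters in practice.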

7.4.  ERA OF ASSIGNMENT VERSUS PREDICTION

Currently, more than 80 packages are available, including both proprietary and freeware packages. Once the domain of physicists, the use of such approaches and packages, e.g., DFT and GAMESS, has increased widely in chemistry, and more so within the natural product community. However, the approach is under constant modification, with variants being promoted on a regular basis. This makes it difficult to recommend a particular set of calculations, since they depend on the number, size, and level of sophistication of the calculations required, and since the presence of certain groups, such as heavy atoms, causes deviations between calculated and measured properties.

7.4.1  Nuclear Magnetic Resonance

Currently, the most versatile method for assigning a noncrystalline chemical structure in the solution state is NMR, which remains one of the most powerful analytical techniques in modern chemistry and, therefore, in natural product and phytochemical research. Information about the atomic connectivity (bonding), dynamic behaviour, and intra- and intermolecular interactions of a molecule can be established by NMR analysis (Becker, 1993, 1996; Breton and Reynolds, 2013). NMR was studied mainly by physicists until 1950, when Proctor and Yu discovered the phenomenon now known as 'chemical shift', i.e., that the immediate chemical environment surrounding a nucleus influences the frequency at which it resonates (Levine, 2001). Shortly thereafter, at Stanford University, Arnold et al. (1951) found that 1H nuclei within the same molecule resonate at different frequencies, demonstrating the enormous potential of NMR spectroscopy by investigating the natural product ethanol. Importantly, Gutowsky and McCall (1951) demonstrated that different spin-active nuclei in the same molecule interact with one another, generating fine structure in the NMR signals. Taken in concert, these findings allowed an unprecedented deduction of molecular connectivity and reconstruction of 3D structure that continues to develop to this day. Many advances, including the replacement of a continuous source of radiation by radio-frequency pulses, generalized the applications of NMR spectroscopy once the data could be processed, manipulated, and stored by computers. Notably, computers made possible Fourier-transform methods of data processing, which greatly improved sensitivity. The combination of such improvements has led to sophisticated multidimensional NMR experiments that have revolutionized, and continue to revolutionize, structural elucidation.
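The role of the computer in pulsed NMR can be illustrated with a small sketch: a time-domain free-induction decay (FID) is converted into a frequency-domain spectrum by a discrete Fourier transform. The example below is only an illustration under stated assumptions. The two resonance frequencies, the dwell time, and the decay constant are hypothetical, and the transform is written out naively so that the block needs no external library; real processing software uses a fast Fourier transform together with apodization, zero filling, and phasing.

```python
import cmath
import math


def synthetic_fid(components, n_points, dwell, t2):
    """Complex FID: a sum of rotating signals with a common exponential decay.

    components: list of (frequency_hz, amplitude) pairs (hypothetical values).
    dwell: time between sampled points in seconds.
    t2: decay constant in seconds.
    """
    fid = []
    for n in range(n_points):
        t = n * dwell
        decay = math.exp(-t / t2)
        value = sum(a * cmath.exp(2j * math.pi * f * t) for f, a in components)
        fid.append(value * decay)
    return fid


def dft_magnitude(signal):
    """Naive discrete Fourier transform; returns the magnitude of each bin."""
    n_points = len(signal)
    magnitudes = []
    for k in range(n_points):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * n / n_points)
            for n, x in enumerate(signal)
        )
        magnitudes.append(abs(total))
    return magnitudes


def strongest_frequency(signal, dwell):
    """Frequency (Hz) of the most intense bin.

    Only valid as written for positive frequencies below half the
    sampling rate; bins above that correspond to negative frequencies.
    """
    magnitudes = dft_magnitude(signal)
    k = max(range(len(magnitudes)), key=magnitudes.__getitem__)
    return k / (len(signal) * dwell)


if __name__ == "__main__":
    # Two hypothetical resonances, 125 Hz (strong) and 250 Hz (weak).
    fid = synthetic_fid([(125.0, 1.0), (250.0, 0.5)], 256, 0.001, 0.05)
    print(strongest_frequency(fid, 0.001))
```

Because every point of the FID contains information about all resonances at once, each pulse samples the whole spectrum, and repeated acquisitions can be summed before the transform; this is the origin of the sensitivity gain mentioned above.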
Ab initio methods of quantum chemistry were successfully applied to the study of various problems of chemical interest, including phytochemicals. An important aspect of these studies is the calculation of the electric, magnetic, and optical properties associated with the responses of a molecular electronic system to perturbations, such as externally applied electromagnetic fields and nuclear magnetic and electric moments. Specifically, molecular properties giving rise to generation of NMR spectra, i.e., nuclear magnetic shielding constants, indirect nuclear spin−spin coupling constants, were identified and analysed in terms of perturbation theory by Ramsey (1950, 1953, 1970). However, because of practical problems associated with the calculation of these properties, the early applications of ab initio methods to the study of NMR parameters were


not particularly successful. General improvements in ab initio techniques and computer technology, partly through the development of special methods and programmes for the calculation of NMR properties, overcame these obstacles. As a result, over the last 5−10 years, the calculation of NMR parameters by ab initio methods has developed into a useful and popular tool of computational quantum chemistry, provided that sufficiently powerful computational resources are available. It should be noted, however, that DFT methods have an inherent limitation: there is no systematic or consistent way to improve the accuracy of the results thus obtained (in contrast to ab initio methods). Nevertheless, DFT methods are the only ones that can be applied, at a reasonable computational cost, to molecular systems complex enough to pose challenging issues to the experimental NMR spectroscopist. Recently, the application of DFT to a complex structure has been described in a step-by-step protocol by Willoughby et al. (2014) to provide nonspecialists with a useful tool for validating putative structural assignments. Experimentally acquired 1H and/or 13C NMR spectral data for the compound of interest, properly interpreted, are required as a starting point. Their approach comprises the following steps: (i) using molecular mechanics (MM) calculations (with, e.g., the modelling package MacroModel) to generate a library of conformers (usually within 50 Kcal of the ground state); (ii) using DFT calculations (with, e.g., Gaussian 09) to determine the optimal low energy geometry (rather than a saddle point), free energies, and chemical shifts for each conformer; (iii) determining Boltzmann-weighted 1H and 13C NMR chemical shifts; and (iv) comparing the computed chemical shifts for two (or more) candidate structures with the experimentally determined data to determine the best fit.
The authors suggest that for a typical structure assignment of a small organic molecule (e.g., fewer than around 10 non-H atoms or up to about 180 a.m.u. and ∼20 conformers), this protocol could be completed in approximately 2 h of active effort over a 2-day period. For more complex molecules, including natural products (e.g., fewer than ∼30 non-H atoms or up to ∼500 a.m.u. and ∼50 conformers), the protocol could require around 3–6 h of active effort over a 2-week period. Their protocol is written in a step-wise tutorial fashion that makes the computation of chemical shifts tractable for chemists who possess only rudimentary computational knowledge. Single- and multiple-bond interactions between magnetically interacting nuclei are routinely used to elucidate the structure of a compound. Reynolds et al. (1997) compared the 13C NMR resolution and sensitivity of HSQC and HMQC sequences and showed that the application of HSQC-based sequences to total 1H and 13C NMR spectral assignment is a powerful approach to natural product elucidation. They demonstrated that HSQC could give superior 13C NMR resolution and sensitivity when compared to HMQC for CH2 functionality within natural products, especially when combined with linear prediction. Similarly, coupled HSQC spectra could provide a useful method for the determination of 1H multiplet structure and consequent assignment of individual CH2 protons as
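Steps (iii) and (iv) of this workflow are simple arithmetic once the conformer energies and shifts are in hand, and can be sketched as follows. This is a minimal illustration, not the published scripts of Willoughby et al.: the function names, the use of the mean absolute error as the goodness-of-fit measure, and all numerical values in the demonstration are assumptions made for the example.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def boltzmann_weights(free_energies_kcal, temperature=298.15):
    """Fractional population of each conformer from its free energy (kcal/mol)."""
    e_min = min(free_energies_kcal)
    factors = [
        math.exp(-(e - e_min) / (R_KCAL * temperature))
        for e in free_energies_kcal
    ]
    total = sum(factors)
    return [f / total for f in factors]


def weighted_shifts(conformer_shifts, weights):
    """Population-weighted average shift (ppm) for each nucleus.

    conformer_shifts: one list of shifts per conformer, same nucleus order.
    """
    n_nuclei = len(conformer_shifts[0])
    return [
        sum(w * shifts[i] for w, shifts in zip(weights, conformer_shifts))
        for i in range(n_nuclei)
    ]


def mean_absolute_error(calculated, experimental):
    """Average absolute deviation between calculated and measured shifts."""
    pairs = list(zip(calculated, experimental))
    return sum(abs(c - e) for c, e in pairs) / len(pairs)


def rank_candidates(candidates, experimental):
    """Sort candidate structures by how well their averaged shifts fit.

    candidates: dict name -> (free_energies_kcal, conformer_shifts).
    Returns (name, error) pairs, best fit first.
    """
    scored = []
    for name, (energies, shifts) in candidates.items():
        averaged = weighted_shifts(shifts, boltzmann_weights(energies))
        scored.append((name, mean_absolute_error(averaged, experimental)))
    return sorted(scored, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical 13C data for two candidate structures, two conformers each.
    experimental = [170.1, 72.4, 35.0]
    candidates = {
        "candidate A": ([0.0, 0.8], [[169.5, 73.0, 34.1], [171.0, 71.9, 36.2]]),
        "candidate B": ([0.0, 1.5], [[165.0, 78.2, 30.3], [166.1, 77.0, 31.5]]),
    }
    for name, error in rank_candidates(candidates, experimental):
        print(name, round(error, 2))
```

Subtracting the lowest energy before exponentiation does not change the populations but avoids numerical underflow when absolute free energies are large.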


FIG. 7.8  Structure of clionasterol.

either axial or equatorial (within fused cyclohexane rings). Such an approach was used to assign both the 1H and 13C NMR spectra of the marine sterol clionasterol (Fig. 7.8). The concomitant development of databases and web-based query tools, such as the Complex Mixture Analysis by NMR (COLMAR) database (Bingol et al., 2015), the Human Metabolome Database (HMDB; http://www.hmdb.ca/) (Wishart et al., 2013), and the Biological Magnetic Resonance Data Bank (BMRB), has further increased the usefulness of NMR spectroscopy and enabled the automated assignment of metabolites. Individual classes of natural products, for instance carbohydrates, have been reviewed (Toukacha and Ananikov, 2013). The authors showed that combining NMR spectroscopy with computational analysis of the structural information encoded in the NMR spectra provides a strategy for the automated elucidation of the structure of carbohydrates, which could also be applied to other classes of natural products.

7.4.2  Computational Mass Spectrometry

MS has played a pivotal role in the structure elucidation of phytochemicals, and the introduction of various interfaces and ionization modes, such as electron impact ionization (EI), electrospray ionization (ESI), chemical ionization (CI), and many others, has widened its application to the analysis of almost all types of organic molecules, including complex natural products. The efficiency of these methods and their applications in predictive structural analysis and automated structure elucidation have been further enhanced by recent advances in computational chemistry and the availability of computational tools, software, and mathematical modelling. However, the identification of small molecules remains a challenge in the interpretation of MS data. In recent years, it has been realized that one of the most important aspects of small-molecule MS is the automated processing of the resulting data, which can be aided by various computational tools. Scheubert et al. (2013) published an excellent review article encompassing the computational aspects of identifying small molecules, from the identification of a compound by searching a reference spectral library to the structural elucidation of unknowns. They also outlined the basic principles and pitfalls of searching MS reference libraries. Different methods for molecular formula


identification, focusing on isotope pattern analysis and automated methods to deal with MS data of compounds that are not present in spectral libraries, were also included. Finally, de novo analysis of fragmentation spectra using fragmentation trees and the reconstruction of metabolic networks using MS data were discussed. A comprehensive list of available software for the different steps of the analysis pipeline is available in this review. The first rule-based approaches for predicting fragmentation patterns in MS, as well as for explaining experimental mass spectra with the help of a molecular structure, were developed as part of the DENDRAL project in the mid-1960s (Kind and Fiehn, 2010; Scheubert et al., 2013), but the project failed in its main objective of automated structure elucidation using MS data. Since then, especially over the last couple of decades, several ground-breaking research works have established a new area of MS research, known as ‘computational mass spectrometry’, which deals with the development of computational methods for the automated analysis of MS data, covering data processing, metabolomics databases, laboratory information management systems, hyphenated techniques (e.g., LC–MS and GC–MS), MS-based chemical fingerprinting, isotope pattern simulations, and annotation and identification of small molecules from fragmentation spectra using database search as well as de novo interpretation techniques (Kind and Fiehn, 2010). To allow data-driven development of algorithms for small-molecule identification, MS reference datasets must be made publicly available via reference databases. Several publicly available databases, e.g., MassBank (http://www.massbank.jp/?lang=en), METLIN (http://enigma.lbl.gov/metlin/), Madison Metabolomics Consortium Database (MMCD; http://mmcd.nmrfam.wisc.edu/), Golm Metabolome Database (GMD; http://gmd.mpimp-golm.mpg.de/), the Platform for RIKEN Metabolomics (PRiMe; http://prime.psc.riken.jp/), and MeltDB (https://meltdb.cebitec.uni-bielefeld.de/cgi-bin/login.cgi), are now within easy reach of phytochemists. In addition to the publicly available free databases, there are a couple of excellent commercial databases, i.e., the National Institute of Standards and Technology (NIST) mass-spectral library (version 11) and the Wiley Registry (9th edition). The NIST MS library contains EI spectra of well over 200,000 compounds, and there are EI spectra of 600,000 unique molecules in the Wiley database. Computation plays a vital role in database search, which is fundamental to identifying a metabolite by looking it up in a spectral library. Database search needs a similarity or distance function for spectrum matching. The most fundamental scorings are the ‘peak count’ family of measures, which count the number of matching peaks. A slightly more complex variant takes the dot product of two spectra and thereby incorporates peak intensities (Scheubert et al., 2013). False negative identifications may occur if the spectrum of the query compound differs from the spectrum in the library, for example because the unknown is affected by contamination or noise, or was acquired at collision energies different from those of the reference library data. Mass-spectral data handling, aided by computational tools,


comprises the following steps: background and noise subtraction, adduct formation and detection, charge state deconvolution, accurate mass measurement, isotope abundance measurements and isotopic pattern calculation, elemental composition determination, algorithms for formula calculation from high-resolution MS/MS data, and complex data-dependent setups including maps and ion trees (Kind and Fiehn, 2010). Mass-spectral library search is the first step in any mass-spectral interpretation, and this can be performed with unit-mass and high-resolution mass spectra of all stages, involving MS, MS/MS, and MSn libraries and search algorithms, and mass-spectral trees combining multiple-stage mass spectra. In automated MS data interpretation, spectral mappings using dimension reduction methods such as principal component analysis (PCA) can be used. The availability of large public compound databases, such as PubChem and ChemSpider, or specialized drug and metabolism databases, such as KEGG, HMDB, ChEBI, DrugBank, MZedDB, and the Chemical Lookup Service, allows for a web-based search of molecular formulae or accurate masses (Kind and Fiehn, 2010). The DrugBank database is particularly useful as it possesses a search interface allowing an accurate mass search in positive or negative mode within the known human metabolite pool, and the results are presented with possible adducts and links to further database sources. Adequate computational technologies are now available to generate predictions for metabolites, alongside methods to predict MS spectra and score the quality of the match with experimental spectra. Stranz et al. (2008) developed metabolite predictions from molecular structure with a software product, MetaDrug.
They used an in vitro microsomal incubation to produce MS data that could be used to verify the predictions with the software tool, Apex (https://www.winstonslab.com/news/2017/07/03/interpreting-apex-matchpredictions/), which can predict the molecular ion spectrum and a fragmentation spectrum, automating the detailed examination of both MS and MS/MS spectra. To predict accurate mass fragments and their abundances, it is important to establish an in silico algorithm that is able to generate theoretical mass spectra to match with experimentally obtained mass spectra. Several such mass-spectral simulation algorithms have been published in the literature. The major bottleneck of many of these algorithms, however, is simulating or calculating peak abundances or intensities that reflect experimentally measured peak abundances (Kind and Fiehn, 2010). Thus, the success of MS-based structure elucidation approaches depends on instrument developments as well as on the development of efficient software algorithms. Instrument developments, or technological process improvements, must offer increased resolving power, better mass and isotopic abundance accuracy, and high data acquisition rates to enable a faster structure elucidation process. Accurate masses and high resolution for all multiple-stage mass spectra (MSn) will subsequently lead to the generation of new software tools that are fit for purpose.
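Several of the data-handling steps listed in this section (accurate mass, adduct m/z, and isotopic pattern calculation for a given elemental composition) can be sketched in a few lines. The example below is an illustration under stated assumptions: it covers only C, H, N, and O, uses rounded literature values for isotope masses and abundances, considers only the protonated adduct, and groups isotopologues by nominal mass rather than resolving their fine structure. The formula used in the demonstration is that of cholesterol, C27H46O.

```python
# Rounded literature values; sufficient for illustration only.
MONOISOTOPIC = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}
PROTON = 1.007276  # mass of H+ in unified atomic mass units

# For each element: nominal mass offset of the isotope -> natural abundance.
ISOTOPES = {
    "C": {0: 0.9893, 1: 0.0107},
    "H": {0: 0.999885, 1: 0.000115},
    "N": {0: 0.99636, 1: 0.00364},
    "O": {0: 0.99757, 1: 0.00038, 2: 0.00205},
}


def monoisotopic_mass(formula):
    """Mass of the all-light-isotope species; formula is e.g. {"C": 27, ...}."""
    return sum(MONOISOTOPIC[element] * count for element, count in formula.items())


def protonated_mz(formula):
    """m/z of the singly charged [M+H]+ ion."""
    return monoisotopic_mass(formula) + PROTON


def ppm_error(measured, theoretical):
    """Relative mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


def isotope_pattern(formula, max_offset=3):
    """Relative intensities (most intense peak = 100) of M, M+1, M+2, ...

    Built by convolving the isotope distribution of every atom in turn.
    Contributions beyond max_offset are discarded, which does not affect
    the retained peaks because offsets can only grow.
    """
    pattern = {0: 1.0}
    for element, count in formula.items():
        for _ in range(count):
            updated = {}
            for offset, prob in pattern.items():
                for shift, abundance in ISOTOPES[element].items():
                    new_offset = offset + shift
                    if new_offset <= max_offset:
                        updated[new_offset] = (
                            updated.get(new_offset, 0.0) + prob * abundance
                        )
            pattern = updated
    top = max(pattern.values())
    return {offset: 100.0 * p / top for offset, p in sorted(pattern.items())}


if __name__ == "__main__":
    cholesterol = {"C": 27, "H": 46, "O": 1}
    print(round(monoisotopic_mass(cholesterol), 4))
    print(round(protonated_mz(cholesterol), 4))
    for offset, intensity in isotope_pattern(cholesterol).items():
        print("M+%d" % offset, round(intensity, 1))
```

In a formula search, a candidate composition would typically be retained only if both the ppm error of its accurate mass and the deviation of its simulated isotope pattern from the measured one fall within the tolerances of the instrument.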


In silico generation of mass spectra is generally successful for molecules with certain structural scaffolds and reliable fragmentation patterns, e.g., lipids, oligosaccharides, glycans, and peptides. Neural networks are sometimes used to simulate electron ionization mass spectra (EIMS). An example of a rule-based spectral simulation system is MASSIS/MASSIMO. Part of MASSIMO is the Fragmentation and Rearrangement ANalyZer (FRANZ), which requires a set of structure–spectrum pairs as input. The MAss Spectrum SImulation System (MASSIS) combines cleavage knowledge (McLafferty rearrangement, retro-Diels–Alder reaction, neutral losses, oxygen migration), functional groups, small fragments (end-point and pseudo end-point fragments), and fragment–intensity relationships for simulating EI spectra (Scheubert et al., 2013). The publicly available MetFrag algorithm compares in silico mass spectra, obtained by a bond dissociation approach, with experimental mass spectra and assigns a score to all results. Computational mass spectrometry utilizes various chemoinformatics approaches and tools. Nowadays, MS-based approaches for structure elucidation require proper molecular structure handling. Many of the currently available software tools for structural elucidation, e.g., MassFrontier, ACD/MS Manager, NIST MS Search, and Sierra’s APEX, also have inbuilt structure handling capabilities to either allow substructure analysis or perform structure–spectra correlations. However, the ultimate success of structure elucidation of small molecules, e.g., phytochemicals, depends on fit-for-purpose software programmes and the development of sophisticated tools for data evaluation of high-resolution and accurate mass multiple-stage (MSn) mass-spectral data.
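Whether a query spectrum is compared with a library entry or with an in silico spectrum, the comparison comes down to a scoring function of the kind described earlier in this section: a peak count or a normalized dot product. The sketch below shows both on unit-mass binned spectra. It is a generic illustration, not the scoring used by MetFrag, NIST MS Search, or any other named tool; the bin width, the absence of intensity or mass weighting, and the example peak lists are all assumptions.

```python
import math


def bin_spectrum(peaks, bin_width=1.0):
    """Sum the intensities of (m/z, intensity) peaks falling in the same bin."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def peak_count(query, reference, bin_width=1.0):
    """Number of bins that contain a peak in both spectra."""
    q = bin_spectrum(query, bin_width)
    r = bin_spectrum(reference, bin_width)
    return len(set(q) & set(r))


def cosine_score(query, reference, bin_width=1.0):
    """Normalized dot product of two binned spectra (0 = no overlap, 1 = identical)."""
    q = bin_spectrum(query, bin_width)
    r = bin_spectrum(reference, bin_width)
    dot = sum(q[k] * r[k] for k in set(q) & set(r))
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in r.values())
    )
    return dot / norm if norm else 0.0


def best_match(query, library):
    """Return (name, score) of the library entry with the highest cosine score.

    library: dict name -> list of (m/z, intensity) peaks.
    """
    scored = [(name, cosine_score(query, peaks)) for name, peaks in library.items()]
    return max(scored, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical spectra, for illustration only.
    query = [(43.0, 100.0), (58.0, 30.0), (71.0, 12.0)]
    library = {
        "entry 1": [(43.0, 100.0), (58.0, 25.0), (71.0, 10.0)],
        "entry 2": [(41.0, 80.0), (55.0, 100.0), (69.0, 40.0)],
    }
    print(best_match(query, library))
```

The peak count ignores intensities entirely, so two spectra with the same fragments in very different proportions score as well as identical spectra; the dot product penalizes such cases, at the cost of being dominated by the most intense peaks unless some weighting is applied.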

7.4.3  Chiral Centres

Determining the absolute configuration (AC) within a phytochemical remains a challenging task. A number of techniques are available to determine the AC of natural products; these are either (a) direct (or absolute) methods, e.g., X-ray diffraction (XRD), electronic and vibrational circular dichroism (ECD and VCD), and Raman optical activity (ROA), or (b) indirect (or relative) methods employing a reference or a derivatizing agent with known AC. For instance, circular dichroism (CD) with empirical rules and NMR utilizing anisotropic effects of chiral derivatizing agents could be useful to determine AC. All such techniques possess limitations, and determination of the AC of compounds of any size, from small to large, can prove to be a daunting task during the structure elucidation of natural products (Barron, 2004; Buckingham and Dunn, 1971; Graham and Raab, 1990). The AC cannot be entirely deduced de novo using nOe experiments. The ab initio calculation of molecular chiroptical properties has been discussed (Pecul and Ruud, 2005; Crawford, 2006). The calculation of natural optical activity of molecules from first principles (Pecul and Ruud, 2005; Autschbach, 2009; Mukhopadhyay et al., 2009; Autschbach et al., 2011) has been reviewed (Srebro-Hooper and Autschbach, 2017).


For chiral compounds that contain appropriate chromophore(s), ECD is a robust approach for determining the AC of chiral centres within natural products. Currently, ECD calculations by time-dependent density functional theory (TDDFT) are proving increasingly useful. Such TDDFT calculation of ECD can aid the interpretation of any ECD–AC inter-relationship and is an increasingly promising tool for the AC determination of natural products with chiral centres. For instance, Nugroho and Morita (2014) have discussed a range of examples focusing on TDDFT-calculated ECD spectra for the AC determination of selected natural products. TDDFT calculations of ECD spectra have been applied successfully to AC determination in a variety of natural products, ranging from conformationally rigid to highly flexible compounds. The latter demand considerable computational time, which can make suitable solutions rather difficult to obtain. Further improvements both in computer system technologies and in TDDFT will no doubt make the TDDFT calculation of ECD an integral part of the AC determination of any complex natural product. Corrections may have to be applied for solvent effects, further complicating the analysis.

7.4.4  Structure by Calculations

Misassignment and subsequent revision of structures have been briefly discussed in Section 7.1.2. In this section, we look in more detail at the application of computation to structural revision. A classic example is hexacyclinol, which was assigned a structure by the Gräfe group (Schlegel et al., 2002) (Fig. 7.9). In the absence of a crystal structure, they illustrated the paper with a molecular 3D graphic.

FIG. 7.9  (A) Hexacyclinol; (B) its synthetic and computational revision, and (C) Artemisinin.


The presence of a postulated peroxide group should have been confirmed by Raman spectroscopy. For instance, for artemisinin, the third mode of the peroxide bridge was calculated at 943 cm−1 and observed at 929 (Raman)/928 (IR) cm−1 (Moroni et al., 2008). Since the detailed structure of a molecule has a large influence on its vibrational spectrum, any modification in the skeleton or in side chains may lead to marked changes in the spectral features; hence, for peroxides, it is important to calculate the vibrational spectrum for each case. A total synthesis by La Clair then appeared, apparently corroborating the assignment, which prompted further investigation of the structure by DFT calculation, revealing that an electron would attack the epoxide rather than the peroxide functionality. The calculated structure could not be matched with the postulate. Sometime later, this structure generated intense interest in the community (Tantillo, 2013), as endoperoxides have been prominent in antimalarial research since the discovery of artemisinin, qinghao su (Pirrung and Morehead, 1997). Rychnovsky (2006) revised the structure, and the revision corresponded with the experimental data for hexacyclinol. The HMBC and COSY data were consistent, with the proviso that several cross-peaks with very close 1H NMR shifts suggested incorrect assignments in the original isolation study (Fig. 7.9). He concluded that a diepoxide (which may be an isolation artefact derived from exposure of panepophenanthrin to acid and methanol) was the correct structure. This work demonstrated that the prediction of 13C chemical shifts by DFT calculation at the mPW1PW91/6-31G(d,p) level, using HF/3-21G minima, is an impressive tool for validating proposed structures.
The dramatic increase in computing power, coupled with rapid advances in relatively low-cost software, has made it possible to include sophisticated calculations in the undergraduate curriculum, as DFT B3LYP-GIAO calculations enhance student understanding. Coupling these calculations with experimental measurements provides insights into a system that cannot be readily obtained from experimental measurements alone, while the calculations: (a) avoid potential hazards; (b) avoid the expense of chemicals; (c) require none of the instruments needed to make the measurements; and (d) can be performed for any system. Specifically, comparison of experimental 13C NMR data with a Boltzmann-weighted average of 13C NMR chemical shifts, calculated by an ab initio DFT method, supported the stereochemical assignment. As spectroscopy became more established, sets of tables in various books and compendia became available for various fragments or spin systems, mostly deduced by rigorous comparative analysis of compounds within a family. Charts showing where groups commonly resonate, often in the form of horizontal bars, are most commonly used and memorized by practitioners. In modern routine 13C NMR studies, decoupled spectra are acquired that are deliberately devoid of connectivity information, with each carbon presented as a single peak. The corresponding proton experiment has been more difficult to acquire, but progress has been made in this decade by the Morris group at Manchester. Suppressing multiplet structure in 1H NMR spectra significantly enhances spectral resolution, with every multiplet reducing to a single line centred at the appropriate chemical shift. The resolution improvement is equivalent to using a 5 GHz rather than a 500 MHz spectrometer. The pulse sequence for a 1D pure shift experiment using a modified Zangger–Sterk method can be downloaded from the Morris group website, as can the pulse sequence for the 1D pure shift experiment using Pure Shift Yielded by CHirp Excitation (PSYCHE). Partial spectra of estradiol in DMSO-d6, obtained by normal 1H NMR spectroscopy and by PSYCHE, are shown for comparison by Foroozandeh et al. (2014). Shift positions can also be calculated using other methods, such as the Charge program (Abraham and Mobli, 2008; http://www.modgraph.co.uk/). In 1H NMR prediction (NMRPredict), predictions are based on functional groups, which have been parameterized by Abraham and Mobli (2008). The programme generates 3D conformers from a 2D structure using force field calculations prior to prediction. Spectra are then predicted for each conformer and a weighted average spectrum is calculated. It also includes the substituent chemical shifts approach developed by Ernö Pretsch (ETH, Zürich). The programme then automatically selects a ‘Best’ proton prediction for each atom from the two prediction methods. Investigators can obtain a trial version free of charge. A more comprehensive suite comes from ACD/Labs with their Computer-Assisted Structure Elucidation (CASE) systems (Elyashberg et al., 2010; Nahar and Sarker, 2017). Another (supposedly) simple example, involving the seeds of the custard apple tree Annona squamosa (the sugar apple), is worth considering since it illustrates elements of classical and modern structural elucidation in determining the correct structure of an alkaloid (http://www.haraldfischerverlag.de/hfv/cd_rom/florasinensis_engl.php).
Other chemical constituents of the plant have been well investigated; within the root, the diterpenoid alkaloid atisine is most abundant (Fig. 7.10). Other constituents of A. squamosa include the alkaloids isocorydine and N-methylcorydaldine (Yadav et al., 2011) and oxophoebine and reticuline (Dholvitayakhun et al., 2013). The flavonoid quercetin 3-O-glucoside is also present (Gajalakshmi et al., 2011). Finally, various types of acetogenins have been isolated from the seeds (Chen et al., 2012), bark (Li et al., 1990), and leaves (Gajalakshmi et al., 2011). Notably, Bayer AG has patented the extraction process and molecular identity of the annonaceous acetogenin annonin, and its use as a biopesticide (Moeschler and Pfluger, 1987). The cytotoxic alkaloid, which seems unrelated to existing natural products in this tree, was given the trivial name samoquasine A and identified as A (using 2D-HMBC experiments), but the structure was then revised to that of a known compound, perlolidine (B) (Fig. 7.11). Subsequently, Yang et al. (2003) proposed benzo[f]phthalazin-4(3H)-one (C) as a possible structure consistent with the spectroscopic data (Fig. 7.12). Monsieurs et al. (2007) synthesized yet another structure (D), which was spectroscopically different from the natural isolate (Fig. 7.12). Notably, A, B, and D were previously synthesized by unambiguous

FIG. 7.10  Range of compounds found within the sugar apple.

FIG.  7.11  Structures (A–D) claimed previously as samoquasine A. (A) benzo[h]quinazolin4(3H)-one; (B) benzo[c][2,7]naphthyridin-4(3H)-one; (C) benzo[f]phthalazin-4(3H)-one; and (D) pyridazino[4,5-c]quinolin-1(2H)-one.

FIG. 7.12  From left to right (A) Assignment according to Morita et al. (2000); (A) Yang et al. (2003), (B and C). The latter was suggested on the basis of analysis by 2D NMR (COSY, ROESY, HMQC, and HMBC).


routes, a tried and trusted method harking back to the prespectroscopic era. Timmons and Wipf (2008) applied a computational NMR prediction approach, using GIAO-based 13C NMR chemical shifts from DFT calculations [B3LYP/6-311+G(2d,p) level]. By computationally investigating a total of 48 isomeric structures, they concluded that structure B was consistent with the spectroscopic and physical properties of samoquasine A. A more difficult problem is that of the nor-caryophyllane derivative artarborol (Fig. 7.13), which was isolated from wormwood (Artemisia arborescens); its stereostructure was established using a combination of chemical derivatization, NMR data, molecular modelling, and quantum-mechanical calculations. Another example shows how the presence of heavy atoms can complicate predictions. Four brominated sesquiterpenes, aldingenins A–D, were isolated from the red alga Laurencia aldingensis, and their structures were elucidated by spectroscopic methods, including NMR (de Carvalho et al., 2006). Doubts about these brominated sesquiterpenoids led to a structural revision, which necessitated construction by an unambiguous synthetic route: Takahashi et al. (2014) employed a brief synthesis of the (initially) proposed structure for aldingenin C from trans-limonene oxide. Unfortunately, the spectral data of the synthetic compound failed to match those of the reported natural product. Careful re-examination of the reported NMR data using the CAST/CNMR Structure Elucidator suggested that the structure of aldingenin C required revision, and Takahashi et al. (2014) proposed aldingenins C and D to be caespitol and 5-(S)-acetoxycaespitol (Fig. 7.14). Further computational evidence, based on computed proton spin–spin coupling constants and 13C NMR chemical shifts, led to the conclusion that the remaining two aldingenins, A and B, were also halogenated

FIG. 7.13  Structure of artarborol and corresponding energy minimized structure.

FIG. 7.14  Structural revision of aldingenins C.


sesquiterpenes of the same caespitol family. Aldingenin A was then assigned the structure of 5-(S)-hydroxycaespitol, and aldingenin B that of the hemiacetal of a related 8-oxo compound (Gu and Lin, 2015). However, these reassignments, even with computed NMR methods, presented difficulties in chemical shift calculations for carbon atoms bearing heavy elements, for example, halogens (Morishima et al., 1973). These calculations were either inaccurate or computationally expensive for large organic molecules/natural products due to significant (and hard to calculate) spin–orbit effects of heavy atoms on the magnetic shielding of directly attached carbons. However, Kutateladze and Reddy (2017), by examining over 100 structures of halogenated terpenoids (and other natural products) with a new parametric approach, have demonstrated that the accuracy of the combined method is sufficient to avoid misassignments. Using their approach, 16 structures have so far been revised (Kutateladze and Reddy, 2017). Since 1D 1H and 13C NMR data are routinely used as the initial step in solution structure elucidation, their fast and efficient two-criterion method (nuclear spin–spin coupling and 13C chemical shifts), dubbed DU8+, is suggested as a vital step in structure assignment and validation. Finally, it is a matter of concern that, at the time of writing, the PubChem database still lists the original misassigned structure of aldingenin C; this should serve as a warning against the automated use of noncurated databases. The most recent quantum-chemistry-based protocol, termed MOSS-DFT (Hoffmann et al., 2017), allows prediction of 1H and 13C NMR chemical shifts of a wide range of organic molecules in aqueous solution, including metabolites. Molecular motif-specific linear scaling parameters are reported for five different DFT methods (B97-2/pcS-1, B97-2/pcS-2, B97-2/pcS-3, B3LYP/pcS-2, and BLYP/pcS-2), which have been applied to a large set of 176 metabolite molecules.
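The idea of motif-specific linear scaling can be sketched as follows. This is an illustrative least-squares fit under stated assumptions, not the published MOSS-DFT parameters or code: the function names, motif labels, and numerical values are hypothetical, and each motif simply receives its own slope and intercept relating computed isotropic shieldings to experimental shifts.

```python
# Minimal sketch, assuming one straight line (shift = slope * shielding +
# intercept) per structural motif; values below are synthetic placeholders.
import math


def fit_linear_scaling(shieldings, exp_shifts):
    """Ordinary least-squares slope and intercept for one motif."""
    n = len(shieldings)
    mean_x = sum(shieldings) / n
    mean_y = sum(exp_shifts) / n
    sxx = sum((x - mean_x) ** 2 for x in shieldings)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(shieldings, exp_shifts))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_per_motif(training):
    """training: {motif: (shieldings, experimental shifts)} -> {motif: (slope, intercept)}."""
    return {motif: fit_linear_scaling(x, y) for motif, (x, y) in training.items()}


def rmsd(predicted, experimental):
    """Root-mean-square deviation between predicted and experimental shifts (ppm)."""
    return math.sqrt(
        sum((p - e) ** 2 for p, e in zip(predicted, experimental)) / len(experimental)
    )


# Synthetic training data (not real metabolite values).
training = {
    "methyl": ([170.0, 165.0, 160.0], [15.0, 20.0, 25.0]),
    "aromatic": ([60.0, 55.0, 50.0], [125.5, 130.75, 136.0]),
}
params = fit_per_motif(training)
slope, intercept = params["aromatic"]
predicted = [slope * s + intercept for s in training["aromatic"][0]]
error = rmsd(predicted, training["aromatic"][1])
```

Fitting each motif separately is what allows a motif-specific error estimate; a global scaling would use a single slope and intercept for all carbons.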
The chemical shift root-mean-square deviations (RMSD) for the best method, B97-2/pcS-3, are 1.93 and 0.154 ppm for 13C and 1H NMR chemical shifts, respectively. Excellent results are obtained for chemical shifts of methyl and aromatic 13C and 1H that are not directly bonded to a heteroatom (O, N, S, or P), with RMSD values of 1.15/0.079 and 1.31/0.118 ppm, respectively. This study not only demonstrates how these NMR chemical shift calculations in an aqueous environment improve on the commonly used global linear scaling approach, but also allows for motif-specific error estimates, resulting in improved chemical shift-based verification of unknown metabolite candidates in complex metabolomics samples. Currently, using a modern 600 MHz FT-NMR spectrometer equipped with a 1.7 mm cryogenic probe, and a 1 mg sample, it is possible to acquire within a day a broad series of 2D NMR spectra that rigorously characterize the complex structure of a natural product. When 2D NMR data are united with Computer-Assisted Structure Elucidation methods, the structure can be solved rapidly, often within a few seconds (Williams et al., 2015–2016; Jacobsen, 2017). The dramatic increase in computing power, coupled with rapid advances in relatively low-cost software, has made it possible to include sophisticated calculations, such as B3LYP-GIAO, in the undergraduate curriculum; moreover, it is now possible to use artificial neural network pattern recognition with inexpensive GIAO 13C NMR calculations to detect misassignments (Sarotti, 2013).
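The Boltzmann-weighted averaging of computed 13C shifts mentioned earlier in this section can be sketched in a few lines. This is a generic illustration, not the procedure of any study cited here; the function names, the temperature default, and the example numbers are assumptions, and relative conformer energies are taken to be in kJ/mol.

```python
# Minimal sketch: weight per-conformer computed shifts by Boltzmann
# populations derived from relative energies (assumed to be in kJ/mol).
import math

GAS_CONSTANT_KJ = 0.0083144626  # kJ mol^-1 K^-1


def boltzmann_weights(energies, temperature=298.15):
    """Fractional populations for conformers with the given energies."""
    lowest = min(energies)
    factors = [
        math.exp(-(e - lowest) / (GAS_CONSTANT_KJ * temperature)) for e in energies
    ]
    total = sum(factors)
    return [f / total for f in factors]


def averaged_shifts(energies, shifts_per_conformer, temperature=298.15):
    """Population-weighted shift for each atom.

    shifts_per_conformer: one list of shifts per conformer, same atom order.
    """
    weights = boltzmann_weights(energies, temperature)
    n_atoms = len(shifts_per_conformer[0])
    return [
        sum(w * conf[j] for w, conf in zip(weights, shifts_per_conformer))
        for j in range(n_atoms)
    ]


# Hypothetical two-conformer example with equal energies (equal populations).
example = averaged_shifts([0.0, 0.0], [[10.0, 100.0], [20.0, 110.0]])
```

The averaged list can then be compared with the experimental shifts for each candidate stereoisomer, for instance by mean absolute error.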

7.4.5  UV Spectroscopy

Absorption of a particular wavelength of light correlates with the π-electron system of a molecule. The greater the degree of conjugation (of the π-electron system) within the molecule, the longer the wavelength of light it can absorb. Originally, spectra were obtained using rather tedious and cumbersome apparatus. In contrast, the current generation of spectrophotometers has superior optics, solid-state detection devices, and, importantly, connection to a computer, which allows a continuous high-quality spectrum to be obtained in less than a minute. Although comprehensive correlation tables can be compiled for IR spectroscopy, the same cannot be said of UV-Visible spectroscopy. Nevertheless, a large volume of data exists spanning most structural types capable of absorption in the ultraviolet and/or visible range (200–900 nm), and within a particular class or family of compounds (such as polyenes), it is possible to correlate selected functional groups with structure. Robert Burns Woodward and Louis Fieser formulated a set of rules allowing empirical calculation of the wavelength of maximum absorption (λmax). These ‘rules’ are called the ‘Woodward–Fieser rules’ or ‘Woodward’s rules’ and are applicable to conjugated systems with four or fewer double bonds (Woodward, 1942; Woodward and Clifford, 1941; Silverstein et al., 2014; Kalsi, 2004; Glagovich, 2012). For molecules possessing more than four conjugated double bonds, the Fieser–Kuhn rules allow determination of λmax. Various online sites discuss examples involving the Fieser–Kuhn rules for calculating λmax of various polyenes. Hence, with selected classes of compound, it is possible to calculate, with varying degrees of accuracy, the absorption maxima (λmax) of certain structures, and this allows affirmation or rejection of a particular structure. Maas (1973) discussed the subject in some detail and included useful worked examples.
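As an illustration of such an empirical calculation, the sketch below encodes the Fieser–Kuhn relation in the form commonly quoted in textbooks, λmax = 114 + 5M + n(48.0 − 1.7n) − 16.5Rendo − 10Rexo, where M is the number of alkyl substituents on the conjugated system, n the number of conjugated double bonds, and Rendo and Rexo the numbers of rings with endocyclic and exocyclic double bonds. The function and parameter names are hypothetical, and the substituent counts used for the two carotenoids are those usually adopted in worked examples.

```python
# Minimal sketch of the Fieser-Kuhn relation as commonly quoted in textbooks.
def fieser_kuhn_lambda_max(n_alkyl, n_conjugated, n_endo_rings=0, n_exo_rings=0):
    """Estimated lambda-max (nm) for a polyene with more than four conjugated C=C."""
    return (
        114.0
        + 5.0 * n_alkyl
        + n_conjugated * (48.0 - 1.7 * n_conjugated)
        - 16.5 * n_endo_rings
        - 10.0 * n_exo_rings
    )


# Counts usually used in worked examples:
# beta-carotene: 10 alkyl substituents, 11 conjugated C=C, 2 endocyclic rings.
beta_carotene = round(fieser_kuhn_lambda_max(10, 11, n_endo_rings=2), 1)  # 453.3
# lycopene: 8 alkyl substituents, 11 conjugated C=C, no rings.
lycopene = round(fieser_kuhn_lambda_max(8, 11), 1)  # 476.3
```

The estimate is only as good as the substituent count, which is why such rules break down once unusual fragments are present.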
For spectra involving biochemical/natural products, investigators are referred to the two-volume work of Morton (1975). However, once a combination of unusual fragments (not covered by existing tables) is encountered, such ‘rules’ become self-limiting. Hence, a comparison of actual spectroscopy with computation using molecular modelling (ab initio or DFT) is simultaneously required to deal with natural products ranging from the simple to the complex.

7.4.6  Infrared (IR) Spectroscopy

Michelson, in 1891, was the first scientist to make extensive use of an interferometer. His device consisted of a half-silvered mirror that split an incoming light beam into two paths (Fig. 7.15).

FIG. 7.15  Interferometer used by Michelson in 1891. (Diagram labels: source, beamsplitter, fixed mirror, translating mirror, detector; beams I1/2 and I2/2.)

The new beams reflected off mirrors back to the half-silvered mirror, where they were recombined and directed towards a detector. One of the mirrors could be precisely moved to change the relative path lengths. Michelson then made precise measurements of the metre using the wavelength of light and laid the groundwork for Einstein’s special theory of relativity (Loewenstein, 1966). The first use of an interferometer to measure infrared radiation was by Rubens and Wood (1911), who employed quartz plates as mirrors and recorded the interferogram of the far-infrared spectrum of a Welsbach (gas) mantle (of the kind found in modern camping lanterns). However, the spectral components had to be guessed and trial spectra then iteratively matched with the recorded interferogram. Consequently, practical use of the interferometer for spectroscopy had to wait for the invention of the digital computer. The first discussion of numerically computed Fourier transform spectroscopy (FTS) was presented by Fellgett in 1951 (Loewenstein, 1966). Fellgett also noted that FTS provides a multiplex advantage over standard dispersion spectroscopy, increasing the signal-to-noise ratio by a factor of √N, where N is the number of spectral elements being sampled. In a prism or diffraction grating spectrometer (with a single detector), the majority of the energy (within the incoming light) is ignored at any given instant. However, with FTS, all of the energy from the light is utilized at all times (Strong and Vanasse, 1959). Ab initio and DFT predictions of infrared intensities and Raman activities are now possible for small to large molecules and have been discussed by Zvereva et al. (2011). Suitable IR spectra can be acquired from samples prepared as a Nujol mull, in solution, or dispersed in KBr and compressed into discs for measurement. More commonly, spectra of liquids and solids are acquired directly using a diamond ATR instrument.
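The numerical step that had to wait for the digital computer, recovering a spectrum from a recorded interferogram, can be sketched with a plain discrete Fourier transform. This is a toy illustration with a synthetic interferogram built from two cosine components; the function names and all values are hypothetical, and real instruments additionally apply apodization, phase correction, and a fast Fourier transform.

```python
# Minimal sketch: build a synthetic interferogram from two spectral lines and
# recover them with a naive discrete Fourier transform (standard library only).
import cmath
import math


def interferogram(n_points, lines):
    """Detector signal versus mirror position; lines = [(bin index, intensity)]."""
    return [
        sum(a * math.cos(2 * math.pi * k * x / n_points) for k, a in lines)
        for x in range(n_points)
    ]


def dft_magnitude(signal):
    """Magnitude spectrum for the first half of the frequency bins."""
    n = len(signal)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * x / n) for x, s in enumerate(signal)))
        for k in range(n // 2)
    ]


def strongest_bins(spectrum, count):
    """Indices of the `count` largest spectral magnitudes, in ascending order."""
    ranked = sorted(range(len(spectrum)), key=lambda k: spectrum[k], reverse=True)
    return sorted(ranked[:count])


signal = interferogram(64, [(5, 1.0), (12, 0.5)])
spectrum = dft_magnitude(signal)
recovered = strongest_bins(spectrum, 2)  # [5, 12]
```

Every sample of the interferogram contains contributions from all spectral elements at once, which is the origin of the multiplex advantage described above.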
Assignment of bands of interest is achieved using a set of good spectral tables, such as those published by Socrates (2004). Alternatively, a spectral code can be used by a relational database management system (RDBMS) as an index for IR spectral searches in a relational database (RDB). Spectral codes are constructed as spectral indexes for all spectra within the database, and three query strings are created for the query spectrum using the same scheme as that used to create the index codes. All spectral searches are then carried out in structured query language (SQL). Sequential application of this type of procedure can reduce an original library of thousands of spectra to a limited number of reference spectra for subsequent detailed comparison. Using AI and pattern matching, spectral search and structure interpretation can be achieved rapidly and the results cross-correlated with NMR and UV data (Li et al., 2003).

7.4.7  Database Search Algorithm

Typically, the workflow for NMR-based structure prediction uses a database search algorithm. First, chemical shifts are predicted for the known metabolites of the studied genus, allowing creation of a database. Next, the crude extract is analysed by 13C NMR with automatic peak picking. A search algorithm then compares the chemical shift values of the database records with those of the crude extract spectrum and produces a prioritized list of putative molecules present within the crude extract. Results are confirmed by experimental analysis (Bakiri et al., 2017). The power of the method has been demonstrated by evaluating a crude alkaloid leaf extract obtained from Peumus boldus. Successful identification of eight alkaloids, including isocorydine, rogersine, boldine, reticuline, coclaurine, laurotetanine, N-methylcoclaurine, and norisocorydine, was achieved. Considering the structural similarity between these compounds, the success of the algorithm is commendable. In addition, three monoterpenes, namely, p-cymene, eucalyptol, and α-terpinene, were identified. It is noteworthy that the latter three would be hard to identify by electrospray mass spectrometry (ESIMS), as they fail to give reliable mass ions. A comparison of the results with other methods, either involving a fractionation step before the chemical profiling process (or using mass-spectrometry detection in the infusion mode) or coupled to gas chromatography (GC), is given. Applying the database search algorithm just after a single 13C NMR analysis of an extract allows acceptable chemical profiling while saving time and consumables and giving useful information for decision making towards further investigations, including pharmacological or toxicological evaluation, sub-fractionation, and purification process development before isolation.
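The comparison and prioritization steps of this workflow can be sketched as follows. This is not the algorithm of Bakiri et al. (2017); the function names, the ±1.3 ppm tolerance, the scoring by fraction of matched carbons, and all shift values are hypothetical placeholders.

```python
# Minimal sketch: score each database record by the fraction of its predicted
# 13C shifts that find a distinct partner in the crude-extract peak list.
def match_fraction(predicted, observed, tol=1.3):
    """Fraction of predicted shifts matched within `tol` ppm (one peak per shift)."""
    unused = sorted(observed)
    hits = 0
    for shift in sorted(predicted):
        best = min(unused, key=lambda peak: abs(peak - shift), default=None)
        if best is not None and abs(best - shift) <= tol:
            hits += 1
            unused.remove(best)  # each observed peak may be used only once
    return hits / len(predicted)


def rank_candidates(database, observed, tol=1.3):
    """Candidates sorted from best to worst match score."""
    scores = {
        name: match_fraction(shifts, observed, tol) for name, shifts in database.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder records and peak list (not real compound data).
database = {
    "candidate_A": [20.0, 55.0, 128.0, 170.0],
    "candidate_B": [14.0, 30.0, 77.0, 200.0],
}
extract_peaks = [20.5, 54.2, 77.3, 127.6, 170.9]
ranking = rank_candidates(database, extract_peaks)
```

In practice the tolerance would reflect the accuracy of the shift predictor, and top-ranked candidates would still need experimental confirmation.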
Conversely, since the S/N ratios of all 13C NMR spectra obtained after fractionation are greater than that of the crude extract spectrum, minor constituents are more difficult to identify with certainty in the crude extract. However, this database search algorithm can be used on fractions of simplified composition enriched with the minor constituents of the extract, for instance, following Soxhlet extraction and preparative HPLC (Binoy et al., 2005). DFT prediction for matching of actual near-infrared Fourier-transform Raman and Fourier-transform infrared spectra of nodakenetin angelate (Fig. 7.16) gives information unavailable from other techniques. This component, extracted from


FIG. 7.16  Nodakenetin angelate.

seeds of Heracleum candolleanum, which is traditionally used as an anti-arthritic and nerve tonic, illustrates a cross-correlated approach that allows deductions about the geometry, including inter- and intramolecular interactions. Initially, the molecule was isolated and spectra were recorded and analysed (Binoy et al., 2005). Ab initio SCF Hartree–Fock computations were performed employing the 6-31G basis set for geometry optimization, for the prediction of IR and Raman spectral activities, and for wavenumber calculations. In this case, parameters initially optimized using AM1 calculations were used as the input for the ab initio computations. The computed results allowed interpretation of the vibrational spectra. For instance, the strong band at 1712 cm−1 and the medium-intensity band at 1731 cm−1, resulting from ester and lactone carbonyl vibrations, respectively, were identified in the Raman spectrum. The C=O stretching band in the IR spectrum is broadened around 1717 cm−1 owing to the overlapping of ester and lactone carbonyl vibrations. The lowering of the carbonyl stretching vibrations is due to conjugation. The computed values indicate a larger degree of conjugation for the ester group, partly confirming the postulated structure. The characteristic vibrations of the furanocoumarin ring were also confirmed. The CH stretching and bending vibrations of the methyl group of the ester functional group indicate the presence of hyperconjugation. This type of approach can allow deduction of specific intermolecular interactions such as hydrogen bonding. The large enhancement of in-plane ring stretching and ring breathing modes in the surface-enhanced Raman scattering spectrum revealed a ‘vertical’ configuration, with the lactone ring perpendicular to the silver surface and (probably) on the opposite side of the lactonic CO group. Such an analysis would be difficult, if not impossible, without this type of combined measurement and computational analysis (Matsuo et al., 2017).
Identifying compounds from unknown EIMS spectra acquired by hyphenated methods such as GC–MS is rather challenging in natural product chemistry, as well as in untargeted metabolomics and exposome research. Although the EIMS records deposited in publicly available (or proprietary) databases exceed 100,000, efficient use of such databases is still cumbersome. A ‘four-step’ strategy (Fig. 7.17) for the identification of biologically significant metabolites using an integrated cheminformatics approach has recently been advocated by Matsuo et al. (2017): (i) quality control calibration curve filtering to reduce background noise; (ii) variable selection by ‘hypothesis testing’ in PCA for the efficient selection of peaks of interest (target peaks); (iii) interrogation of the EIMS spectral database; and, importantly, (iv) retention index (RI) filtering coupled with


FIG. 7.17  Four-step strategy for the identification of unknown EIMS spectra. Step 1: After the raw MS data set was processed, the metabolome table was curated by QC curve filter, by removing artefacts/chromatographic noises. Step 2: Hypothesis testing in PCA was used to identify relevant chromatographic peaks. Step 3: EIMS database-oriented structure elucidation based on spectral similarity AI matching. Step 4: After retention index (RI) predictions were produced by multiple regression analysis; attempts are made to remove false positive candidates by RI filtering. Finally, commercially available and/or synthesized compounds were analysed to match and validate or exclude secondary metabolite annotations.

RI predictions. In their study, a new MS-FINDER spectral search engine is described. Its utility for searching EIMS databases (using mass spectral similarity by AI) was demonstrated, and it incorporates evaluation of the false discovery rate. In silico derivatization software, MetaboloDerivatizer, was developed to calculate the chemical properties of derivative compounds, including retention indexes. Notably, the strategy allowed identification of three novel metabolites (butane-1,2,3-triol, 3-deoxyglucosone, and palatinitol) using 64 GC–MS data files from the Chinese medicine Senkyu (in Japan, the dried root of Cnidium officinale Makino). Validation required authentic standard compounds. All tools, as well as curated publicly accessible EIMS databases, are freely available in the ‘Computational MS-based metabolomics’ section of the RIKEN PRIMe website (http://prime.psc.riken.jp). Lei et al. (2015) reported a plant natural product tandem mass-spectral library constructed using both authentic standards and purified compounds. At the time of writing, the library contained 1734 tandem mass spectra for 289 compounds, with the majority (76%) of the compounds being plant phenolics (including flavonoids, isoflavonoids, and phenylpropanoids). Tandem mass spectra and chromatographic retention data were acquired on a triple quadrupole mass spectrometer coupled to an ultra-high pressure liquid chromatograph (UHPLC) using six different collision energies (CEs) (10–60 eV). Some generalizations can be drawn, similar to those made in the early days of electron impact studies. For instance, comparative analyses of the tandem mass-spectral data show that the loss of ring substituents preceded C-ring opening during the fragmentation of both flavonoids and isoflavonoids. At lower CE (i.e., 10 and 20 eV), the flavonoid and isoflavonoid central ring structures typically remained intact, and fragmentation was characterized by the loss of substituents (e.g., methyl/glycosyl groups). At higher CE, the flavonoid and isoflavonoid core ring systems suffered C-ring cleavage and/or rearrangement, which was structure-dependent (influenced by hydroxylation patterns). In-source electrochemical oxidation is typical for phenolics, especially those with ortho-diphenol moieties (i.e., vicinal hydroxyl groups on the aromatic rings). Unsurprisingly, the ortho-diphenols were oxidized to ortho-quinones, often yielding an intense base peak corresponding to a [(M−2H)−H]− ion within their mass spectra. As the library also contains reverse-phase retention times, it allows the construction, validation, and testing of an artificial neural network for retention prediction of other flavonoids (and isoflavonoids) not contained within the original library training set. The library is freely available for nonprofit, academic use and can be downloaded by investigators. From the aforementioned examples, it is apparent that natural products research is changing rapidly through the adoption of cutting-edge tools that have radically transformed how extracts and small molecules are characterized.
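The spectral-similarity search and retention index filtering steps of the four-step strategy described above can be sketched together. This is not MS-FINDER code; cosine similarity is swapped in as a generic similarity measure, and the function names, thresholds, and library entries are hypothetical.

```python
# Minimal sketch: cosine similarity between centroided spectra, followed by a
# retention-index (RI) window to discard spectrally similar false positives.
import math


def cosine_similarity(spec_a, spec_b):
    """spec_*: {nominal m/z: intensity}; returns a value between 0 and 1."""
    dot = sum(inten * spec_b.get(mz, 0.0) for mz, inten in spec_a.items())
    norm_a = math.sqrt(sum(i * i for i in spec_a.values()))
    norm_b = math.sqrt(sum(i * i for i in spec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def annotate(query_spec, query_ri, library, min_similarity=0.8, ri_window=20.0):
    """Library hits passing both the similarity and the RI criteria.

    library: {name: (spectrum, retention index)}; thresholds are assumptions.
    """
    hits = []
    for name, (spectrum, ri) in library.items():
        similarity = cosine_similarity(query_spec, spectrum)
        if similarity >= min_similarity and abs(ri - query_ri) <= ri_window:
            hits.append((name, round(similarity, 3)))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical query and library entries.
query = {73: 100.0, 147: 40.0, 205: 20.0}
library = {
    "ref_1": ({73: 100.0, 147: 40.0, 205: 20.0}, 1502.0),  # similar, RI agrees
    "ref_2": ({73: 100.0, 147: 40.0, 205: 20.0}, 1700.0),  # similar, RI too far
    "ref_3": ({57: 100.0, 71: 60.0}, 1500.0),              # RI agrees, dissimilar
}
hits = annotate(query, 1500.0, library)
```

The second library entry shows why RI filtering matters: isomers can give nearly identical EI spectra and are separated only by their retention behaviour.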
With the innovations in metabolomics, early integration of deep metabolome annotation information allows efficient isolation of desirable natural products. One consequence is that massive metadata sets are generated in the study of given extracts, necessitating storage in computer-controlled databases. For instance, chemotaxonomy studies can allow common biosynthetic traits (among species) to be correlated, an approach often justified for drug discovery campaigns. Hence, most studies are combined with bioactivity studies on extracts. However, one of the major bottlenecks of such studies remains the level of accuracy at which natural products can be identified (Allard et al., 2017). This type of untargeted metabolomics commonly employs LC–MS to quantify abundances of metabolites; subsequently, tandem MS is used to derive information about individual compounds. One of the problems in this experimental setup is the interpretation of fragmentation spectra to accurately and efficiently identify compounds. Hence, fragmentation trees have become a powerful tool for the interpretation of tandem mass-spectrometry data of small molecules. These trees are determined from the data using combinatorial optimization and explain the experimental data through a series of fragmentation cascades. One advantage of this type of approach is that fragmentation tree computation is independent of both spectral and structural databases. To obtain biochemically meaningful trees, elaborate optimization (scoring) functions have been developed. A new scoring method for computing fragmentation trees transforms the combinatorial optimization into a Maximum A Posteriori estimator. Evaluation of the new scoring on two tasks, (a) de novo identification of the molecular formulas of unknown compounds and (b) searching a database for structurally similar compounds, shows that the method, dubbed SIRIUS 3, performs significantly better than the previous version and other methods for these tasks. Hence, SIRIUS 3 can be usefully integrated into untargeted metabolomics workflows, allowing dereplication using automated computational methods and subsequent annotation of extracts. Traditional natural products discovery, using a combination of live/dead screening followed by iterative bioassay-guided fractionation, affords no information about compound structure or mode of action until late in the discovery process (Kurita et al., 2015). One drawback of the approaches described above is that they all tend to provide high rates of rediscovery and, concurrently, low probabilities of finding compounds with unique exploitable biological and/or chemical properties (Kurita et al., 2015). By integrating image-based phenotypic screening in HeLa cells with high-resolution untargeted metabolomics analysis, Kurita et al. (2015) developed a new platform, termed compound activity mapping (CAM), which is capable of directly predicting the identities and modes of action of bioactive constituents. This has been achieved for complex natural product extract libraries. By rapidly identifying novel bioactive constituents, predictions of compound modes of action can be made directly from the primary screening data set.
Hence, this approach usefully inverts the natural products/drug discovery process from the existing so-called ‘grind and find’ model to a more nuanced, targeted, hypothesis-driven discovery paradigm. By detecting the chemical features and biological function of bioactive metabolites early in the screening workflow, lead compounds can be rationally selected based on biological and/or chemical novelty, which increases publication and/or patent potential. Kurita et al. (2015) demonstrated the utility of the CAM platform by combining 10,977 mass-spectral features with 58,032 biological measurements from a library of 234 natural product extracts. Notably, by integrating these two datasets, 13 clusters of fractions containing 11 known compound families and four new compounds could be identified. Using compound activity mapping, they discovered the quinocinnolinomycins, a family of natural products that possess a unique, cinnoline-containing carbon skeleton and cause endoplasmic reticulum stress. The structures were confirmed using multidimensional NMR methods to ‘prove’ the presence of this rather rare class of cinnoline natural products (Fig. 7.18). A new metabolomics database and query algorithm for the analysis of HSQC spectra (Bingol et al., 2015) allowed unification of NMR spectroscopic


FIG. 7.18  Structural elucidation of quinocinnolinomycins A–D (1–4). B shows key NMR correlations. COSY correlations are indicated by emboldened lines. Curved arrows show HMBC correlations.

information on 555 metabolites from both the Biological Magnetic Resonance Data Bank (BMRB) and the Human Metabolome Database (HMDB). The database, termed the Complex Mixture Analysis by NMR (COLMAR) HSQC database, could be queried through an interactive, intuitive web interface at http://spin.ccic.ohio-state.edu/index.php/hsqc/index. This HSQC database was claimed to treat slowly exchanging isomers of the same metabolite separately, which would permit improved query responses in cases where lowly populated isomers are below the HSQC detection limit. The performance of COLMAR and its query web server was claimed to compare favourably with an existing web server, especially for interrogating spectra of highly complex samples (e.g., Drosophila melanogaster and Escherichia coli). For such samples, the COLMAR web server had, on average, a 37% higher accuracy (true positive rate) and an 82% lower false positive rate, allowing prompt and accurate identification of metabolites from HSQC spectra (at natural abundance). This information can be combined and validated with NMR data from 2D TOCSY-type spectra, which provide complementary through-bond connectivity information not available within HSQC spectra. Smith et al. (2001) described a simple database of C-13/H-1-C-13 spectral lists for 11,673 natural products created in a standard commercial database format. More than half of the spectra were predicted using HOSE code descriptors, derived from the 50% of spectra with known experimental values. Prediction errors, obtained by predicting the experimental spectra and comparing the predictions with the measured values, showed an exponentially decaying dependence of the average absolute error on the depth of the matching HOSE codes. A subset of the library containing >1000 H-1-C-13 assigned experimental spectral lists was used to


test against eight alternative query data sets. These sets represented query data from various combinations of 1D-C-13, 1D-DEPT, and 2D-H-1-C-13 spectra. Simulated query lists were generated using random-walk (i.e., Monte Carlo) methods. As expected, queries based on 2D-H-1-C-13 data were more likely to find the correct match under unfavourable conditions. Absolute/relative configurational and conformational structural information is described by the CAST (CAnonical-representation of STereochemistry) coding method of Satoh et al. (2003). Notably, prediction of 13C NMR chemical shifts was achieved with CAST/CNMR, which also took into account any stereochemistry present. Consequently, it could distinguish both differences and similarities in the stereochemical environment around a specific carbon, which existing databases had not achieved at that point. Since CAST/CNMR employed a three-dimensional structural database together with a 13C NMR spectral database, it could be more useful than previous databases.
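The library-search test described above, in which simulated query peak lists are matched against stored spectral lists, can be sketched in a few lines. This is a toy illustration under stated assumptions: the three library entries, their shift values, the matching tolerances, the uniform noise model, and all function names are hypothetical and are not taken from Smith et al. (2001) or from any real database.

```python
import random

# Hypothetical library: compound -> list of (13C ppm, 1H ppm) cross-peaks.
# compound_A and compound_B share carbon shifts but differ in proton shifts.
LIBRARY = {
    "compound_A": [(20.1, 0.95), (35.4, 1.60), (72.8, 3.55), (128.3, 5.40)],
    "compound_B": [(20.1, 1.25), (35.4, 2.10), (72.8, 4.05), (128.3, 5.90)],
    "compound_C": [(14.0, 0.88), (55.2, 3.30), (101.7, 4.60), (170.5, 8.05)],
}
TOL_C, TOL_H = 0.5, 0.05  # assumed matching tolerances in ppm


def simulate_query(peaks, rng, noise_c=0.1, noise_h=0.01):
    """Perturb each peak with uniform noise to mimic a measured peak list."""
    return [(c + rng.uniform(-noise_c, noise_c),
             h + rng.uniform(-noise_h, noise_h)) for c, h in peaks]


def score(query, entry, use_proton):
    """Fraction of query peaks with a library peak inside the tolerance."""
    hits = 0
    for qc, qh in query:
        for lc, lh in entry:
            carbon_ok = abs(qc - lc) <= TOL_C
            proton_ok = (not use_proton) or abs(qh - lh) <= TOL_H
            if carbon_ok and proton_ok:
                hits += 1
                break
    return hits / len(query)


def best_match(query, use_proton):
    """Return the unique best-scoring compound, or None if the top score is tied."""
    ranked = sorted(((score(query, entry, use_proton), name)
                     for name, entry in LIBRARY.items()), reverse=True)
    if ranked[0][0] == ranked[1][0]:
        return None
    return ranked[0][1]


def hit_rate(use_proton, trials=200, seed=0):
    """Share of simulated queries whose true compound is the unique top hit."""
    rng = random.Random(seed)
    names = list(LIBRARY)
    correct = 0
    for _ in range(trials):
        truth = rng.choice(names)
        query = simulate_query(LIBRARY[truth], rng)
        if best_match(query, use_proton) == truth:
            correct += 1
    return correct / trials


rate_1d = hit_rate(use_proton=False)  # carbon shifts only
rate_2d = hit_rate(use_proton=True)   # carbon and proton shifts together
```

Because two of the toy compounds have identical carbon shifts, carbon-only queries cannot separate them, while queries that also use the proton dimension can. This mirrors, in a deliberately simplified way, the observation that 2D-H-1-C-13 queries are more likely to find the correct match.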

7.5.  CAN RAMAN BE USED FOR AUTOMATED ASSAYS AND HTS?

Since Raman spectroscopy is a noncontact, nondestructive technique, it can be used effectively for automated high-throughput screening (HTS) and assay measurements. Typical applications include analysis of liquids/powders in multiwell plates, crystal screening, and tablet content/uniformity assays with transmission Raman. HTS Raman systems use a combination of automated sample movement, autofocus devices, and automated data acquisition and analysis procedures to acquire spectra from hundreds of samples sequentially. HTS and automated measurements can even be integrated with full robot handling, removing the need for expertise and operator intervention. Raman spectroscopy is now used for automated screening in applications such as diamond-like carbon (DLC) coatings for computer hard discs and crystal and polymorph analysis in drug development, as well as in many other applications that simply require routine analysis of large numbers of samples.

7.6.  X-RAY SPONGE TECHNIQUE

The crystalline sponge method is a recently developed X-ray technique that does not require crystallization of the samples. The method relies on computational tools to varying degrees. A porous metal complex absorbs a target molecule into its pores, rendering the target molecule ordered and detectable by X-rays. In this method, it is presumed that the successful structural analysis of one parent compound promises the facile analysis of a series of its derivatives, owing to the capture of these derivatives at the same (or better) binding sites in the crystalline sponge. This method allows determination of absolute configuration (AC) and requires that components be separated cleanly from mixtures using preparative HPLC before absorption and crystallization (O’Brien et al., 2014; Vinogradova et al., 2014).


An example of the power of this ‘crystalline sponge’ method for X-ray structure determination is the study of oxidation reactions of the cyclic terpene humulene, which is not a crystalline solid but can be diffracted once it has soaked into a zinc-containing metal-organic framework (MOF). Treating the starting material with MCPBA or selenium dioxide gives semisynthetic epoxide and aldehyde oxidation products. If a parent structure fits into the MOF, its derivatives also tend to diffract. The ‘crystalline sponge’ technique has drawn criticism: protein crystallographers are generally accepting of the method, while small-molecule crystallographers accept the data more reluctantly, since solvent-exclusion routines and other data clean-up techniques are used. However, the Clardy group fully documents when crystallographic restraints had to be used during structure refinement. The crystal soaking conditions often have to be refined for each compound; interestingly, in some complexes, impurities such as phthalate plasticizer (the bane of natural product chemists) are also found ordered in the crystalline framework. Notably, Clardy’s procedure for making the crystals dispenses with nitrobenzene entirely, unlike earlier procedures.

7.7. CONCLUSIONS

Over the last several decades, especially with the remarkable progress in computation, applications of AI, mathematical modelling, and advanced calculations, several automated spectral data interpretation and structure elucidation software packages have become available to phytochemists. These have enhanced (and will continue to enhance) the quality and output of phytochemical research, whether in plant metabolomics or phytochemical drug discovery (Nahar and Sarker, 2017). The coupling of spectroscopy with computational methods can ensure that mythical structures that ‘existed as ideas that existed only in the mind of their investigators’ are replaced by structures that are ‘correct’ and not biased by the inherent variety and surprising nature of substances already proven to exist.

REFERENCES

Abraham, R.J., Mobli, M., 2008. Modelling 1H NMR Spectra of Organic Compounds: Theory, Applications and NMR Prediction Software. John Wiley & Sons Ltd, New York, United States.
Allard, P.-M., Genta-Jouve, G., Wolfender, J.-L., 2017. Deep metabolome annotation in natural products research: towards a virtuous cycle in metabolite identification. Curr. Opin. Chem. Biol. 36, 40–49.
Allerhand, A., Doddrell, D., Komoroski, R.J., 1971. Natural abundance carbon-13 partially relaxed Fourier transform nuclear magnetic resonance spectra of complex molecules. Chem. Phys. 55, 189–198.
Arnold, J.T., Dharmatti, S.S., Packard, M.E., 1951. Chemical effects on nuclear induction signals from organic compounds. J. Chem. Phys. 19, 507.
Autschbach, J., 2009. Computing chiroptical properties with first-principles theoretical methods: background and illustrative examples. Chirality 21, E116–E152.

Autschbach, J., Nitsch-Velasquez, L., Rudolph, M., 2011. Time-dependent density functional response theory for electronic chiroptical properties of chiral molecules. Top. Curr. Chem. 298, 1–98.
Bakiri, A., Hubert, J., Reynaud, R., Lanthony, S., Harakat, D., Renault, J.-H., Nuzillard, J.-M., 2017. J. Nat. Prod. 80, 1387–1396.
Barber, M., Bordoli, R.S., Sedgwick, R.D., Tyler, A.N., 1981. Fast atom bombardment of solids (FAB): a new ion source for mass spectrometry. J. Chem. Soc. Chem. Commun. (7), 325–327.
Barron, L.D., 2004. Molecular Light Scattering and Optical Activity, second ed. Cambridge University Press, Cambridge, UK.
Beavis, R.C., Colby, S.M., Goodacre, R., de Harrington, P.B., Reilly, J.P., Sokolow, S., Wilkerson, C.W., 2006. Artificial intelligence and expert systems in mass spectrometry. In: Encyclopedia of Analytical Chemistry. Wiley.
Becker, E.D., 1996. Magnetic resonance: an account of some key discoveries and their consequences. Appl. Spectrosc. 50, 16A–28A.
Becker, E.D., 1993. A brief history of nuclear magnetic resonance. Anal. Chem. 65, 295A–302A.
Bernal, J.D., 1932. Crystal structures of vitamin D and related compounds. Nature 129, 277–278.
Bernal, J.D., Crowfoot, D., Fankuchen, I., 1940. X-ray crystallography and the chemistry of the steroids. Part I. Phil. Trans. R. Soc. A 239, 135–182.
Biemann, K., 1960. The determination of carbon skeleton of sarpagine by mass spectrometry. Tetrahedron Lett. 1, 9–14.
Bingol, K., Li, D.-W., Bruschweiler-Li, L., Cabrera, O.A., Megraw, T., Zhang, F., Brüschweiler, R., 2015. Unified and isomer-specific NMR metabolomics database for the accurate analysis of (13)C-(1)H HSQC spectra. ACS Chem. Biol. 10, 452–459.
Binoy, J., Abraham, J.P., Hubert-Joe, I., George, V., Jayakumar, V.S., Aubard, J., Nielsen, O.F., 2005. Near-infrared Fourier transform Raman, surface-enhanced Raman scattering and Fourier transform infrared spectra and ab initio calculations of the natural product nodakenetin angelate. J. Raman Spectrosc. 36, 63–72.
Boyd, D.B., 2013. Quantum chemistry program exchange, facilitator of theoretical and computational chemistry in pre-internet history. In: Pioneers of Quantum Chemistry. ACS Symposium Series, Vol. 1122. American Chemical Society, pp. 221–273.
Breton, R.C., Reynolds, W.F., 2013. Using NMR to identify and characterize natural products. Nat. Prod. Rep. 30, 501–524.
Brown, A., 2005. J. D. Bernal: The Sage of Science. Oxford University Press, Oxford.
Buckingham, A.D., Dunn, M.B., 1971. Optical activity of oriented molecules. J. Chem. Soc. A 1988–1991.
Budzikiewicz, H., Djerassi, C., Williams, D.H., 1964. Structure Elucidation of Natural Products by Mass Spectrometry. Vols. I and II. Holden-Day, San Francisco, CA.
Budzikiewicz, H., 2015. Mass spectrometry in natural product structure elucidation. In: Kinghorn, A., Falk, H., Kobayashi, J. (Eds.), Progress in the Chemistry of Organic Natural Products. 100. Springer, Berlin, Germany, pp. 77–221.
Chen, Y., Xu, S.S., Chen, J.W., Wang, Y., Xu, H.Q., Fan, N.B., Li, X., 2012. Antitumor activity of Annona squamosa seeds extract containing annonaceous acetogenin compounds. J. Ethnopharmacol. 142, 462–466.
Claeys, M., Claereboudt, J., 2017. Fast atom bombardment ionization in mass spectrometry. In: Lindon, J., Tranter, G.E., Koppenaal, D. (Eds.), Encyclopedia of Spectroscopy and Spectrometry, third ed, pp. 581–587.
Crawford, T.D., 2006. Ab initio calculation of molecular chiroptical properties. Theor. Chem. Accounts 115, 227–245.

de Carvalho, L.R., Fujii, M.T., Roque, N.F., Lago, J.H.G., 2006. Aldingenin derivatives from the red alga Laurencia aldingensis. Phytochemistry 67, 1331–1335.
Dholvitayakhun, A., Trachoo, N., Sakee, U., 2013. Potential applications for Annona squamosa leaf extract in the treatment and prevention of foodborne bacterial disease. Nat. Prod. Commun. 8, 385–388.
Elyashberg, M., Williams, A.I., Blinov, K., 2010. Structural revisions of natural products by Computer-Assisted Structure Elucidation (CASE) systems. Nat. Prod. Rep. 27, 1296–1328.
Ernst, R.R., Anderson, W.A., 1966. Application of Fourier transform spectroscopy to magnetic resonance. Rev. Sci. Instrum. 37, 93. https://doi.org/10.1063/1.1719961.
Fattorusso, C., Stendardo, E., Appendino, G., Fattorusso, E., Luciano, P., Romano, A., Taglialatela-Scafati, O., 2007. Artarborol, a nor-caryophyllane sesquiterpene alcohol from Artemisia arborescens: stereostructure assignment through concurrence of NMR data and computational analysis. Org. Lett. 9, 2377–2380.
Fenselau, C., 1984. Fast atom bombardment and middle molecule mass spectrometry. J. Nat. Prod. 47, 215–225.
Fernández-Maestre, R., 2012. Ion mobility spectrometry: history, characteristics and applications. Revista UDCA Actualidad & Divulgación Científica 15, 467–479.
Foroozandeh, M., Adams, R.W., Nilsson, M., Morris, G.A., 2014. Ultrahigh-resolution total correlation NMR spectroscopy. J. Am. Chem. Soc. 136, 11867–11869.
Gajalakshmi, S., Divya, R., Divya, V., Deepika, P., Mythili, S., Sathiavelu, A., 2011. Pharmacological activities of Annona squamosa: A review. Int. J. Pharm. Sci. Rev. Res. 10, 24–29.
Gaudêncio, S.P., Pereira, F., 2015. Dereplication: racing to speed up the natural products discovery process. Nat. Prod. Rep. 32, 779–810.
Gavroglu, K., Simões, A., 2012. Neither Physics Nor Chemistry: A History of Quantum Chemistry. MIT Press, Cambridge, MA, London.
Glagovich, N., 2012. Woodward’s rules for conjugated carbonyl compounds, http://www.chemistry.ccsu.edu/glagovich/teaching/316/uvvis/conjugated.html (accessed 27.07.12).
Graham, E.B., Raab, R.E., 1990. Light propagation in cubic and other anisotropic crystals. Proc. R. Soc. Lond. A 430, 593–614.
Griffiths, J.A., 2008. Brief history of mass spectrometry. Anal. Chem. 80, 5678–5683.
Grimme, S., 2008. Do special noncovalent π–π stacking interactions really exist? Angew. Chem. Int. Ed. 47, 3430–3434.
Gu, B.-B., Lin, H.-W., 2015. Quantum chemical calculation of 1H and 13C chemical shifts and 1H–1H coupling constants in structural assignment of natural products. J. Int. Pharm. Res. 42, 706–712.
Gutowsky, H.S., McCall, D.W., 1951. Nuclear magnetic resonance fine structure in liquids. Phys. Rev. 82, 748–749.
Hilbert, M., López, P., 2011. The world’s technological capacity to store, communicate, and compute information. Science 332, 60–65.
Hoffmann, F., Li, D.-W., Sebastiani, D., Brüschweiler, R., 2017. Improved quantum chemical NMR chemical shift prediction of metabolites in aqueous solution toward the validation of unknowns. J. Phys. Chem. A 121, 3071–3078.
Jacobsen, N.E., 2017. NMR Data Interpretation Explained: Understanding 1D and 2D NMR Spectra of Organic Compounds and Natural Products, first ed. John Wiley & Sons, NJ, United States.
Kalsi, P.S., 2004. Spectroscopy of Organic Compounds, sixth ed. New Age International Publishers, New Delhi.
Kind, T., Fiehn, O., 2010. Advances in structure elucidation of small molecules using mass spectrometry. Bioanal. Rev. 2, 23–60.

Kurita, K.L., Glassey, K.L., Linington, E., 2015. Integration of high-content screening and untargeted metabolomics for comprehensive functional annotation of natural product libraries. Proc. Natl. Acad. Sci. U. S. A. 112, 11999–12004.
Kutateladze, A.G., Reddy, D.S., 2017. High-throughput in silico structure validation and revision of halogenated natural products is enabled by parametric corrections to DFT-computed 13C NMR chemical shifts and spin–spin coupling constants. J. Org. Chem. 82, 3368–3381.
La Clair, J.J., 2006. Total syntheses of hexacyclinol, 5-epi-hexacyclinol, and desoxohexacyclinol unveil an antimalarial prodrug motif. Angew. Chem. Int. Ed. 45, 2769–2773.
La Clair, J.J., 2012. Retraction: total syntheses of hexacyclinol, 5-epi-hexacyclinol, and desoxohexacyclinol unveil an antimalarial prodrug motif. Angew. Chem. Int. Ed. 51, 11647–11662.
Lei, Z., Jing, L., Qiu, F., Zhang, H., Huhman, D., Zhou, Z., Sumner, L.W., 2015. Construction of an ultrahigh pressure liquid chromatography-tandem mass spectral library of plant natural products and comparative spectral analyses. Anal. Chem. 87, 7373–7381.
Levine, S.G., 2001. A short history of the chemical shift. J. Chem. Educ. 78, 133.
Li, X.H., Hui, Y.H., Ripprecht, J.K., Liu, Y.M., Wood, K.V., Smith, D.L., Chang, C.J., McLaughlin, J.L., 1990. Bullatacin, bullatacinone, and squamone, a new bioactive acetogenin, from the bark of Annona squamosa. J. Nat. Prod. 53, 81–86.
Li, J.F., Fan, B.T., Doucet, J.-P., Panaye, A., 2003. Spectral Code Index (SPECOIND): a general infrared spectral database search method. Appl. Spectrosc. 57, 858–867.
Lindon, J., Tranter, G.E., Koppenaal, D., 2016. Encyclopedia of Spectroscopy and Spectrometry, third ed. Academic Press, London. ISBN: 9780128032244.
Liu, W., Zhang, X., Knochenmuss, R., Siems, W.F., Hill Jr., H.H., 2016. Multidimensional separation of natural products using liquid chromatography coupled to Hadamard transform ion mobility mass spectrometry. J. Am. Soc. Mass Spectrom. 27, 810–821.
Loewenstein, E.V., 1966. The history and current status of Fourier transform spectroscopy. Appl. Opt. 5, 845–854.
Maas, D.H., 1973. An introduction to ultraviolet spectroscopy with problems. In: Scheinmann, F. (Ed.), An Introduction to Spectroscopic Methods for the Identification of Organic Compounds. Volume Mass Spectrometry, Ultraviolet Spectroscopy, Electron Spin Resonance Spectroscopy, Nuclear Magnetic Resonance Spectroscopy (Recent Developments), Use of Various Spectral Methods Together, and Documentation of Molecular Spectra. Pergamon Press, Oxford, pp. 93–139.
McLafferty, F.W., Turecek, F., 1993. Computer identification of unknown mass spectra. In: Interpretation of Mass Spectra. University Science Books, California.
Matsuo, T., Tsugawa, H., Miyagawa, H., Fukusaki, E., 2017. Integrated strategy for unknown EI–MS identification using quality control calibration curve, multivariate analysis, EI–MS spectral database, and retention index prediction. Anal. Chem. 89, 6766–6773.
McLennan, S., Gainer, M., 2012. When the computer wore a skirt: Langley’s computers, 1935–1970. In: NASA History Program Office News & Notes, Vol. 29(1). First Quarter.
Moeschler, H.F., Pfluger, W., 1987. Insecticide US 4689232 A (retrieved 12.03.14).
Monsieurs, K., Tapolcsányi, P., Loones, K.T.J., Neumajer, G., De Ridder, J.A., Goubitz, K., Lemière, G.L.F., Dommisse, R.A., Mátyus, P., Maes, B.U.W., 2007. Is samoquasine A indeed benzo[f]phthalazin-4(3H)-one? Unambiguous, straightforward synthesis of benzo[f]phthalazin-4(3H)-one and its regioisomer benzo[f]phthalazin-1(2H)-one. Tetrahedron 63, 3870–3881.
Morishima, I., Endo, K., Yonezawa, T., 1973. Effect of the heavy atom on the nuclear shielding constant. I. Proton chemical shifts in hydrogen halides. J. Chem. Phys. 59, 3356–3364.

Morita, H., Sato, Y., Chan, K.L., Choo, C.Y., Itokawa, H., Takeya, K., Kobayashi, J., 2000. Samoquasine A, a benzoquinazoline alkaloid from the seeds of Annona squamosa. J. Nat. Prod. 63, 1707–1708. Erratum in: J. Nat. Prod. 65 (11), 1748 (2002).
Moroni, L., Gellini, C., Miranda, M.M., Salvi, P.R., Foresti, M.L., Innocenti, M., Loglio, F., Salvietti, E., 2008. Raman and infrared characterization of the vibrational properties of the antimalarial drug artemisinin. J. Raman Spectrosc. 39, 1097–4555.
Morris, P.J.T., 2002. In: From classical to modern chemistry: the instrumental revolution. From a conference on the history of chemical instrumentation: “From the Test-tube to the Autoanalyzer: the Development of Chemical Instrumentation in the Twentieth Century”, London, August 2000. Royal Society of Chemistry in assoc. with the Science Museum, 2002, Cambridge.
Morton, R.A., 1975. Biochemical Spectroscopy. Vol. 2. Adam Hilger, London.
Mukhopadhyay, P., Wipf, P., Beratan, D.N., 2009. Optical signatures of molecular dissymmetry: combining theory with experiments to address stereochemical puzzles. Acc. Chem. Res. 42, 809–819.
Nahar, L., Sarker, S.D., 2017. Automated structure elucidation of phytochemicals. Trends Phytochem. Res. 1, 109–110.
Naqvi, K.R., 1993. Historical inaccuracies. J. Chem. Educ. 70, 605. https://doi.org/10.1021/ed070p605.1.
Nicolaou, K.C., Sorensen, E.J., 1996. Classics in Total Synthesis: Targets, Strategies, Methods. Wiley-VCH, Weinheim, Germany, ISBN: 978-3-527-29231-8, p. 821.
Nicolaou, K.C., Sorensen, E.J., Winssinger, N., 1998. The art and science of organic and natural products synthesis. J. Chem. Educ. 75, 1225. https://doi.org/10.1021/ed075p1225.
Nicolaou, K.C., Snyder, S.A., 2005. Chasing molecules that were never there: misassigned natural products and the role of chemical synthesis in modern structure elucidation. Angew. Chem. Int. Ed. 44, 1012–1044.
Nugroho, A.E., Morita, H., 2014. Circular dichroism calculation for natural products. J. Nat. Med. 68, 1–10.
O’Brien, A.G., Maruyama, A., Inokuma, Y., Fujita, M., Baran, P.S., Blackmond, D.G., 2014. Angew. Chem. Int. Ed. 53, 11868–11871.
Parr, R.G., Yang, W., 1989. Density-Functional Theory of Atoms and Molecules. Oxford University Press, Oxford.
Pecul, M., Ruud, K., 2005. The ab initio calculation of optical rotation and electronic circular dichroism. In: Jensen, H.J.A., Olsen, J. (Eds.), Advances in Quantum Chemistry. Response Theory and Molecular Properties, Vol. 50. Elsevier, San Diego, CA, pp. 185–212.
Pirrung, M.C., Morehead Jr., A.T., 1997. A sesquidecade of sesquiterpenes, 1980–1994: Part A. Acyclic and monocyclic sesquiterpenes, Part 1. In: Goldsmith, D. (Ed.), The Total Synthesis of Natural Products. Vol. 10. John Wiley & Sons, New York, pp. 90–96.
Porco, J.A., Su, S., Xiaoguang, L., Bardhan, S., Rychnovsky, S.D., 2006. Total synthesis and structure assignment of (+)-hexacyclinol. Angew. Chem. Int. Ed. 45, 5790–5792.
Ramsey, N.F., 1950. Magnetic shielding of nuclei in molecules. Phys. Rev. 78, 699–703.
Ramsey, N.F., 1953. Electron coupled interactions between nuclear spins in molecules. Phys. Rev. 91, 303–307.
Ramsey, N.F., 1970. Possibility of field-dependent nuclear magnetic shielding. Phys. Rev. A 1, 1320–1322.
Reynolds, W.F., McLean, S., Tay, L.-L., Yu, M., Enriquez, R.G., Estwick, D.M., Pascoe, K.O., 1997. Comparison of 13C resolution and sensitivity of HSQC and HMQC sequences and application of HSQC-based sequences to the total 1H and 13C spectral assignment of clionasterol. Magn. Reson. Chem. 35, 455–462.

Rubens, H., Wood, R.W., 1911. Focal isolation of long heat-waves. Philos. Mag. 21, 249–261.
Rychnovsky, S.D., 2006. Predicting NMR spectra by computational methods: structure revision of hexacyclinol. Org. Lett. 8, 2895–2898.
Sarotti, A.M., 2013. Successful combination of computationally inexpensive GIAO 13C NMR calculations and artificial neural network pattern recognition: a new strategy for simple and rapid detection of structural misassignments. Org. Biomol. Chem. 11, 4847–4859.
Satoh, H., Koshino, H., Uzawa, J., Nakata, T., 2003. CAST/CNMR: Highly accurate 13C NMR chemical shift prediction system considering stereochemistry. Tetrahedron 59, 4539–4547.
Scheinmann, F., 1970. An Introduction to Spectroscopic Methods for the Identification of Organic Compounds: Nuclear Magnetic Resonance and Infrared Spectroscopy. Vol. 1. Pergamon Press, Oxford.
Schlecht, M.F., 1998. Molecular Modeling on the PC. Wiley-VCH, New York.
Schlegel, B., Hartl, A., Dahse, H.-M., Gollmick, F.A., Grafe, U., Dorfelt, H., Kappes, B., 2002. Hexacyclinol, a new antiproliferative metabolite of Panus rudis HKI 0254. J. Antibiot. 55, 814–817.
Scheubert, K., Hufsky, F., Bocker, S., 2013. Computational mass spectrometry for small molecules. J. Cheminform. 5, 12 (24 pages).
Silverstein, B.M., 1991. Spectroscopic Determination of Organic Compounds, fifth ed. John Wiley and Sons, NJ, United States.
Silverstein, R.N., Webster, F.X., Kiemle, D.J., Bryce, D.L., 2014. Spectrometric Identification of Organic Compounds, eighth ed. Wiley, United States, 464 pp. ISBN: 978-0-470-61637-6.
Smith, S.K., Cobleigh, J., Svetnik, V., 2001. Evaluation of a H-1–C-13 NMR spectral library. J. Chem. Inf. Comput. Sci. 41, 1463–1469.
Sneader, W., 1985. Drug Discovery: The Evolution of Modern Medicines. John Wiley & Sons, New York, ISBN: 0471-90471-6.
Socrates, G., 2004. Infrared and Raman Characteristic Group Frequencies: Tables and Charts, third ed. Wiley, ISBN: 978-0-470-09307-8.
Srebro-Hooper, M., Autschbach, J., 2017. Calculating natural optical activity of molecules from first principles. Annu. Rev. Phys. Chem. 68, 399–420.
Stranz, D.D., Miao, S., Campbell, S., Maydwell, G., Ekins, S., 2008. Combined computational metabolite prediction and automated structure-based analysis of mass spectrometric data. Toxicol. Mech. Methods 18, 243–250.
Strong, J., Vanasse, G.A., 1959. Interferometric spectroscopy in the far infrared. Annu. Rev. Phys. Chem. 49, 844–850.
Takahashi, S., Yasuda, M., Nakamura, T., Hatano, K., Matsuoka, K., Koshino, H., 2014. Synthesis and structural revision of a brominated sesquiterpenoid, aldingenin C. J. Org. Chem. 79, 9373–9380.
Tantillo, D.J., 2013. Walking in the woods with quantum chemistry – applications of quantum chemical calculations in natural products research. Nat. Prod. Rep. 30, 1079–1086.
Thomas, N.C., 1991. The early history of spectroscopy. J. Chem. Educ. 68, 632–634.
Timmons, C., Wipf, P., 2008. Density functional theory calculation of 13C NMR shifts of diazaphenanthrene alkaloids: reinvestigation of the structure of samoquasine A. J. Org. Chem. 73, 9168–9170.
Toukacha, F.V., Ananikov, V.P., 2013. Recent advances in computational predictions of NMR parameters for the structure elucidation of carbohydrates: methods and limitations. Chem. Soc. Rev. 42, 8376–8415.
Vinogradova, E.V., Müller, P., Buchwald, S.L., 2014. Structural reevaluation of the electrophilic hypervalent iodine reagent for trifluoromethylthiolation supported by the crystalline sponge method for X-ray analysis. Angew. Chem. Int. Ed. 53, 3125–3128.

Weigert, F.J., Jautelat, M., Roberts, J.D., 1968. Natural-abundance C13 nuclear magnetic resonance spectra of medium-molecular-weight organic compounds. Proc. Natl. Acad. Sci. U. S. A. 60, 1152–1155.
Williams, A.J., Martin, G.E., Rovnyak, D., 2015–2016. Modern NMR Approaches to the Structure Elucidation of Natural Products: Volume 1: Instrumentation and Software; Volume 2: Data Acquisition and Applications to Compound Classes, first ed. Royal Society of Chemistry, United Kingdom.
Willoughby, P.H., Jansma, M.J., Hoye, T.R., 2014. A guide to small-molecule structure assignment through computation of (1H and 13C) NMR chemical shifts. Nat. Protoc. 9, 643–660.
Wishart, D.S., Jewison, T., Guo, A.C., Wilson, M., Knox, C., Liu, Y., Djoumbou, Y., Mandal, R., Aziat, F., Dong, E., Bouatra, S., Sinelnikov, I., Arndt, D., Xia, J., Liu, P., Yallou, F., Bjorndahl, T., Perez-Pineiro, R., Eisner, R., Allen, F., Neveu, V., Greiner, R., Scalbert, A., 2013. HMDB 3.0 – The Human Metabolome Database in 2013. Nucleic Acids Res. 41, D801–D807.
Woodward, R.B., Clifford, A.F., 1941. Structure and absorption spectra. II. 3-Acetoxy-Δ5-(6)-norcholestene-7-carboxylic acid. J. Am. Chem. Soc. 63, 2727–2729.
Woodward, R.B., 1942. Structure and absorption spectra. IV. Further observations on α,β-unsaturated ketones. J. Am. Chem. Soc. 64, 76–77.
Yadav, D.K., Singh, N., Dev, K., Sharma, R., Sahai, M., Palit, G., Maurya, R., 2011. Anti-ulcer constituents of Annona squamosa twigs. Fitoterapia 82, 666–675.
Yang, Y.-L., Chang, F.-R., Wu, Y.-C., 2003. Total synthesis of 3,4-dihydrobenzo[h]quinazolin-4-one and structure elucidation of perlolidine and samoquasine A. Tetrahedron Lett. 44, 319–322.
Yates III, J.R., 2011. A century of mass spectrometry: from atoms to proteomes. Nat. Methods 8, 633–637.
Yoo, H.-D., Nam, S.-J., Chin, Y.W., 2016. Misassigned natural products and their revised structures. Arch. Pharm. Res. 39, 143–150.
Zhou, Z., Xiaotao, S., Jia, T., Zheng-Jiang, Z., 2016. Large-scale prediction of collision cross-section values for metabolites in ion mobility-mass spectrometry. Anal. Chem. 88, 11084–11091.
Zvereva, E.E., Shagidullin, A.R., Katsyuba, S.A., 2011. Ab initio and DFT predictions of infrared intensities and Raman activities. J. Phys. Chem. A 115, 63–69.

FURTHER READING

Autschbach, J., 2011. Time-dependent density functional theory for calculating origin-independent optical rotation and rotatory strength tensors. ChemPhysChem 12, 3224–3235.


Chapter 8

Application of Mathematical Models and Computation in Plant Metabolomics

Denis S. Willett*, Caitlin C. Rering*, Dominique A. Ardura†, John J. Beck*

*U.S. Department of Agriculture, Gainesville, FL, United States
†Independent Scientist, Davis, CA, United States

Chapter Outline
8.1. Introduction
8.2. Create Clarity From Chaos: Mindset
8.3. Analytical Tools
8.4. Experimental Considerations
  8.4.1 Data Collection Considerations
  8.4.2 Instrumentation
  8.4.3 Sample Preparation
  8.4.4 Analysis Modalities
  8.4.5 Throughput in Plant Metabolomics
  8.4.6 Data Structures
8.5. Analysis
  8.5.1 Data Processing
  8.5.2 Unsupervised Approach
  8.5.3 Supervised Approach
  8.5.4 Inference
8.6. Metabolomics in Agriculture
8.7. Conclusions
References

8.1. INTRODUCTION

In the past decade, data-science opportunities have exploded as instrumentation, data warehousing, and analytics capabilities have expanded to allow faster, cheaper collection and processing of ever-increasing amounts of data. This expansion in capabilities has engendered a revolution in the plant sciences: we can now ask and answer systems questions that are inherently complex, multivariate, and dynamic. Best of all, the answers to those questions prompt better questions and facilitate the design of agronomic solutions to challenges besetting our climate and imperiling our food supply.

Computational Phytochemistry. https://doi.org/10.1016/B978-0-12-812364-5.00008-0 © 2018 Elsevier Inc. All rights reserved.



Underpinning our ability to answer questions and design effective solutions is data, which is the focus of this chapter. Our aim is twofold:

1. We want to provide you the tools and techniques that empower you to ask and answer systems questions in your field. Whether you are investigating how the metabolome of cropping systems responds to climate change or the impact of pathogen infection on secondary metabolites, we want you to know what to do and how to do it.
2. We want to impart a mindset for thinking about systems problems. Tools and techniques are powerful and are immediately applicable in your work, but tools and techniques are also ephemeral: new and better tools and techniques are always around the corner. While we will discuss some of the imminent new tools (e.g., deep learning), we will also highlight ways of thinking about systems. This mental framework will prepare you for asking interesting systems questions far into the future.

While this may seem daunting for one book chapter, do not worry. If you are new to this sort of analysis, you will find concrete examples to guide you on your way. If you have experience with some (or many) of these techniques, you will likely find some novelty in the application and may want to check out our discussion of agricultural applications towards the end of this chapter.

While the previous paragraphs provide some insight into the nature of this chapter, we do not advocate blind acceptance of what we say as a path to success. Like you, we have a healthy dose of skepticism. We have worked in private industry, government, and academia doing basic and applied research in places ranging from corporate labs in Silicon Valley to corn fields in rural Brazil. We became interested in these computational methods as a way to improve our own research and we are enthusiastic about the opportunity to share what we have learned. It is our hope that the work we have performed and discussed herein can help you in your own research. Collectively, our world gets better when everyone understands how to work with and analyse data.

8.2.  CREATE CLARITY FROM CHAOS: MINDSET

The data available for analysis in plant metabolomics are growing at an exponential rate. Similarly, the number of research questions available for pursuit is limited only by the imagination of the researcher. In pursuing data-driven approaches to plant metabolomics, it is easy to become overwhelmed by the amount of data available and the different means of analysing them. In our experience, a few principles will help greatly in creating clarity from the chaos. In rough order of importance:

1. Utility: Utility is paramount. One of the most insightful statements about data science was written by George Box: ‘Essentially, all models are wrong, but some are useful’ (Box and Draper, 1987). In experimentation, the focus should be on developing useful models and information. While this may seem anathema to those engaged strictly in basic research for research’s sake, often a quick and simple analysis can provide you with information as useful as, or more useful than, a more complicated, complete, or ‘correct’ analysis. Thus, a focus on utility may result in greater project productivity and more insightful results.

2. Clarity: Clear questions produce clear analysis. Explicitly defining the research question upfront will allow you to determine the necessary steps for acquiring data to answer the question. Once you have a clear question and the necessary data, the question will often engender its solution.

3. Data-driven: Be data-driven. While this may seem like a tautology in a chapter about data, this point is included to emphasize an important distinction and to help you avoid falling into a common data analysis trap: qualitative comparisons and descriptive statistics. There is nothing inherently wrong with a qualitative or descriptive approach. The trap lies in stopping there. The true power of computation in plant metabolomics is in hard quantitative prediction and analysis. The tools available now allow you to make quantitative conclusions and predictions from your data. Do you think you have qualitative differences between treatments or levels of certain metabolites? Be data-driven and quantify them.

4. Trade-offs: Understand the bias-variance trade-off. For scientists, the relationship between accuracy and precision is second nature (Fig. 8.1). There are similar principles in modelling, where we have bias (model accuracy) and variance (model precision). Bias describes how far, on average, the model's predictions fall from the variable we seek to predict. Variance describes how much those predictions vary. While low bias and low variance are ideal, there is almost always a trade-off between the two. If you build a model to predict plant infection status based on metabolite profiles, the temptation will be to add complexity to the model (more metabolite information) to achieve more accurate predictions about plant infection status. This often leads to overfitting. The result is models that are not generalizable in the real world and, while potentially more descriptive of a specific scenario, not useful (see the point on utility above). Considering the bias-variance trade-off in the context of total error is a path to utility (Fortmann-Roe, 2012).

FIG. 8.1  Bias and variance in relation to accuracy and precision. Figure labels: high variance (low precision) versus low variance (high precision); high bias (low accuracy) versus low bias (high accuracy). In developing models in plant metabolomics, we strive for low bias and low variance: models that accurately predict what we are interested in with high precision. Often there is a trade-off between the two in terms of overall error and utility of the model.

5. Reproducibility: Science is useful because it is reproducible, but more importantly, the believability of science is due to its ability to be reproduced. Creating reproducible workflows is good science, and a critical component of reproducible workflows is reproducible analytics pipelines. In the case of analytics pipelines, reproducible work is a form of clear communication. While many researchers rely on Microsoft Excel or analytical suites with graphical user interfaces, these programmes are inherently not reproducible. Raw data can be manipulated inadvertently without documentation and data analysis can become a black box. Fortunately, there are three easy-to-implement ingredients of reproducible analytical pipelines:

• (Relatively) immutable raw data: There should be a common source of raw data upon which the analysis is conducted. These data should not be readily altered and the analysis should not write over them. This is an extremely common pitfall of Microsoft Excel, where data can be permanently lost through an ill-conceived sort or erroneous pasting of formulas. Better options include text files or databases, which can be accessed by scripting languages.

• Scripting language: A scripting language allows every step of the process to be completely documented and, if dependencies are properly managed, rerun by other researchers. If you have not learned a scripting language yet, please do, or work with someone who has. It will make your life easier and your work better. We will get into more specifics later, but some common languages that fit this bill (in order of utility) are R, Python, and SAS.

• Documentation: Documenting your code is extremely important for the next researcher who looks through your analysis (recall that the next researcher could be you in a few years). There are a number of literate programming paradigms available now that allow you to combine text and code blocks in a single document (Knuth, 1984; Kluyver et al., 2016; Team, 2017). At the very least, commenting your code is a good idea.
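The three ingredients above can be illustrated with a short script. The following is a minimal sketch in Python rather than the R used for the chapter's own workflow; the file names, the `sample` and `peak_area` columns, and both function names are hypothetical. It reads a raw CSV file without altering it, writes derived values to a separate file, and uses a SHA-256 checksum to confirm that the raw data were left untouched.

```python
import csv
import hashlib
import math
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def derive_log_table(raw_path: Path, out_path: Path) -> str:
    """Write log10 peak areas to a separate file; never overwrite raw data.

    Returns the checksum of the raw file so it can be recorded with the results.
    """
    checksum_before = sha256_of(raw_path)

    # The raw file is only ever opened for reading.
    with raw_path.open(newline="") as handle:
        rows = list(csv.DictReader(handle))

    # Derived values go to a different file.
    with out_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", "log10_peak_area"])
        for row in rows:
            value = math.log10(float(row["peak_area"]))
            writer.writerow([row["sample"], round(value, 4)])

    # Fail loudly if anything modified the raw data along the way.
    if sha256_of(raw_path) != checksum_before:
        raise RuntimeError("raw data changed during analysis")
    return checksum_before


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        raw = Path(folder) / "raw_peaks.csv"  # hypothetical raw export
        raw.write_text("sample,peak_area\nA,100\nB,1000\n")
        print(derive_log_table(raw, Path(folder) / "derived.csv"))
```

Recording the returned checksum alongside the results gives later readers a simple way to confirm that they are analysing the same raw file.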


While not a necessary ingredient for reproducible research, a useful fourth point is worth considering.

Version control: This is a life-saver for keeping track of developments in a project and is essential to working in groups. Check out Git for how to get started (Ram, 2013).

We would advocate that you consider adopting reproducible analysis practices. They will make your research better and be a boon to other researchers. We are all in this together. Let’s help each other out.
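Returning to the bias-variance trade-off (principle 4), a small simulation can make the idea concrete. The sketch below is illustrative Python, not part of the chapter's R workflow, and every name and parameter value in it is an assumption. It repeatedly simulates noisy data from a known curve and predicts the value at one point with a k-nearest-neighbour average. A very flexible model (k = 1) tends to show low bias but high variance, while a very smooth model (large k) tends to show the reverse. For squared error, expected error is the sum of squared bias, variance, and irreducible noise.

```python
import math
import random
import statistics


def true_signal(x: float) -> float:
    """The (normally unknown) relationship we are trying to learn."""
    return math.sin(2 * math.pi * x)


def knn_predict(xs, ys, x0, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def bias_variance(k, x0=0.25, n_train=30, noise_sd=0.3, n_sims=500, seed=1):
    """Estimate squared bias and variance of the prediction at x0."""
    rng = random.Random(seed)
    xs = [i / (n_train - 1) for i in range(n_train)]
    predictions = []
    for _ in range(n_sims):
        # A fresh noisy data set each time, as if the experiment were repeated.
        ys = [true_signal(x) + rng.gauss(0, noise_sd) for x in xs]
        predictions.append(knn_predict(xs, ys, x0, k))
    bias_squared = (statistics.mean(predictions) - true_signal(x0)) ** 2
    variance = statistics.pvariance(predictions)
    return bias_squared, variance


if __name__ == "__main__":
    for k in (1, 5, 15):
        b2, var = bias_variance(k)
        print(f"k={k:2d}  bias^2={b2:.4f}  variance={var:.4f}")
```

The value of k that minimizes the sum of the two terms is usually somewhere between the extremes, which is the practical meaning of the trade-off.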

8.3.  ANALYTICAL TOOLS

Now that you have been introduced to the principles of a data-science mindset, it is an opportune time to discuss how those principles are translated into practice. In the reproducibility section above, we discussed scripting languages and mentioned three popular choices.

SAS was developed by a private company and is commercial software (Littell et al., 1996). It has a graphical user interface, but limited data visualization capabilities.

Python is an open source, general-purpose programming language that has recently received a lot of data-science development, particularly in the area of machine learning (Van Rossum and Drake, 2003; Pedregosa et al., 2011). It is used extensively in production and development environments, but does not yet have the suite of statistical tools found in R. Importantly, though, there is extensive support for literate programming through the development of notebooks.

R is an open source statistical programming language that was designed from the ground up for data analysis (Ihaka and Gentleman, 1996; R Core Team, 2017). It has a plethora of actively developed statistical, machine learning, and data visualization tools. R tends to be preferred in academia because of its development pace for statistical tools, the availability of a robust development environment in RStudio (Team, 2017), and the sophisticated literate programming support provided in RMarkdown (Allaire et al., 2017) and knitr (Xie, 2015). We will use R for the example workflow included here.

8.4.  EXPERIMENTAL CONSIDERATIONS

Having been introduced to the tools for a data analysis mindset, we will next put those principles into an experimental design context. First, we will discuss where all plant metabolomics experiments start: the experimental design phase. We will highlight considerations for data collection, instrumentation differences, sample preparation, analysis modalities, throughput considerations, and output data structures. Following the data collection information, we will delve into the analysis: considerations on how to pre-process the data, initial explorations of the data, predictive power, and statistical inference.


8.4.1  Data Collection Considerations

While the goal of a metabolomics workflow is identification and quantitation of all compounds within a given sample matrix or experimental system, with current technologies, this may be only partially realized (Sumner et al., 2003; Tugizimana et al., 2013). Before discussing post-collection processing techniques, it is imperative to note that all data are inherently biased by the techniques adopted to collect them (Fiehn et al., 2007a,b, 2008; Goodacre et al., 2007; Sumner et al., 2007). Reasons for this include differences in compound polarity, volatility, size, thermal lability, ionization capacity, structure complexity, stability, isobaric species, isomeric species, equilibria, etc. Even the most advanced data processing cannot overcome flaws in experimental design or analytical limitations. Though we focus here on computational methods and analysis, an overview of sample preparation and relevant analytical approaches is important context for eventual data analysis.

8.4.2 Instrumentation

The most commonly adopted instrumental techniques for metabolomics studies are mass spectrometry (MS) and nuclear magnetic resonance spectroscopy (NMR) (Okazaki and Saito, 2012; Issaq et al., 2009; Pimenta et al., 2013; Putri et al., 2013; Gemperline et al., 2016; Jorge et al., 2016). Other detection techniques in use by metabolomics practitioners include ultraviolet-visible (UV-Vis) (Socaciu et al., 2009; Wehrens et al., 2013; Pop et al., 2014; Rambla et al., 2015; Kwon et al., 2016) and infrared spectroscopy (IR) (Allwood et al., 2006, 2010, 2015; Cozzolino, 2012).

NMR spectroscopy has many benefits for studies of plant metabolomics. These include complete structural information, absolute quantitation, unbiased analysis of compounds, and no requirement for sample preparation prior to analysis. On the other hand, NMR suffers from low sensitivity (it can detect compounds only at micromolar levels and above), and complex sample matrices can result in complex spectra with many overlapping peaks (2D-NMR methods can help with this). Additionally, NMR can be sensitive to the environment of the sample (e.g., pH), and its throughput can range from low to high (Sumner et al., 2003; Tugizimana et al., 2013).

Like NMR, mass spectrometry has many benefits for metabolomics studies, including medium-to-high sensitivity (depending in part on inlet choice and MS choice; at best pico- to femtomolar levels), medium-to-high throughput (again dependent on inlet method and acquisition method details), and high comprehensiveness for compounds analysed in a sample. Additionally, quantitation can be achieved using standards, and an abundance of structural information is obtained through analysis (Sumner et al., 2003; Tugizimana et al., 2013).

However, despite the utility of MS, only ionized compounds are detected. Isobaric species (compounds with the same nominal mass) are also challenging, particularly in cases where compounds are poorly separated in time by chromatography (ion mobility-MS is an analytical methodology that attempts to solve this problem by adding an extra dimension of separation). Another important consideration is that most analyses by MS require at least some, if not extensive, sample preparation (excluding direct analysis in real time, DART-MS). Because of its utility, popularity, and frequent use in high-throughput workflows, the remainder of this chapter focuses on MS-based plant metabolomics.

8.4.3  Sample Preparation

Within a biological context, sample matrix compositions are often complex and contain many small compounds of varied character, proteins, and other small to large cellular components. While cellular components and proteins can be removed in sample preparation, MS-based metabolomics studies are most often coupled to some type of separation technique to reduce the number of compounds reaching the MS at one time. These techniques include gas chromatography (GC), liquid chromatography (LC), and capillary electrophoresis (CE).

Methods using GC are amenable to smaller compounds that are also volatile and not thermally labile (although derivatization in sample preparation can expand the range of compounds analysed using this method). Experiments analysing the volatile headspace also benefit from only collecting and analysing volatile components, thus alleviating the need to separate the desired analytes from the tissue matrix. One requirement for LC methods is that compounds must be able to go into solution. LC methodologies are applicable to many varied classes of compounds, with different separation modalities (normal phase, reversed phase, hydrophilic interaction (HILIC), and ion-pairing) allowing more specific analysis of compounds with disparate chemical properties. Capillary electrophoresis methods allow for analysis of polar ionogenic compounds that are a challenge to measure using typical LC or GC approaches.

While separation techniques typically work with homogenized sample material, for certain metabolomics questions it may be desirable to understand the spatial distribution of compounds. For this type of question, mass spectrometry imaging (MSI) has come to be used in the field. In this technique, a native piece of sample material is introduced into a mass spectrometer, most commonly through matrix-assisted laser desorption/ionization (MALDI), secondary ion mass spectrometry (SIMS), desorption electrospray ionization (DESI), or laser ablation electrospray ionization (LAESI) (Gemperline et al., 2016). In these techniques, higher lateral resolution of the sample substrate introduced to the detector is highly sought after, and the achievable resolution often limits successful application of the approach. Additionally, data processing for MSI requires significant computing resources. Reviews on instrumentation selection in metabolomics are available (Rai et al., 2013).


Sample preparation must be performed thoughtfully to maintain the biochemical integrity of the sample at the time of harvest and to ensure good recovery of analytes of interest. Analysis may require an extensive extraction process using a variety of solvents and/or clean-up procedures to capture analytes with disparate physicochemical properties. Reviews have covered this topic more thoroughly (Ryan and Robards, 2006; Kim and Verpoorte, 2010; Ernst et al., 2014). General considerations in sample preparation include proper and adequate sampling, quenching, drying, storage, solvent characteristics, pH, ratio of solvents, extraction times and temperatures, etc. Some sample matrices have been extensively investigated using metabolomics approaches, and more standardized procedures exist for their extraction and sample preparation, e.g., grains, fresh vegetables, fresh fruit, seeds, flowers, and other plant-specific tissues (Gullberg et al., 2004; Oikawa et al., 2008; Allwood et al., 2011, 2014; Nadella et al., 2012; Roessner and Dias, 2013; Tohge et al., 2014; Shiratake and Suzuki, 2016; Zhu et al., 2016).

8.4.4  Analysis Modalities

Depending on the questions being asked of the biological system, metabolomics analyses generally fall into three categories: targeted, semi-targeted, and non-targeted (unknown, or metabolomics profiling) analysis.

As the name implies, targeted analysis refers to analyses in which the metabolites of interest are known to the researchers. In this case, commercially available standards allow for full quantitation.

Semi-targeted analysis involves the detection of compounds of known classes (e.g., neutral lipids, gibberellins, or sterols). Knowledge of an analyte’s physicochemical properties allows the researcher to select better sample extraction techniques and instrumentation. Pseudo-quantitation is often used for semi-targeted analysis, wherein a structurally related analyte’s response is used to calibrate and quantitate related compounds within a class (e.g., one or several triacylglycerols (TGs) in a lipidomics study, which may profile hundreds of individual TGs).

In contrast, unknown metabolomics profiling is a naïve approach, often incorporating a variety of sample preparation methods and sometimes requiring multiple instrumental techniques to fully characterize compounds of interest. In this case, analytes are limited to qualitative comparison or relative quantitation.
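The arithmetic behind pseudo-quantitation can be sketched briefly. The example below is illustrative Python with made-up numbers; the function names and the assumption of a linear detector response are hypothetical. A calibration line is fitted to the response of one standard, and that line is then used to estimate concentrations of structurally related compounds from their responses.

```python
def fit_calibration(concentrations, responses):
    """Ordinary least-squares line: response = slope * concentration + intercept."""
    n = len(concentrations)
    mean_x = sum(concentrations) / n
    mean_y = sum(responses) / n
    sxx = sum((x - mean_x) ** 2 for x in concentrations)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(concentrations, responses))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_concentration(response, slope, intercept):
    """Invert the calibration line for a related analyte's response."""
    return (response - intercept) / slope


if __name__ == "__main__":
    # Hypothetical dilution series of one standard (arbitrary units).
    standard_conc = [1.0, 2.0, 4.0, 8.0]
    standard_resp = [210.0, 410.0, 810.0, 1610.0]
    slope, intercept = fit_calibration(standard_conc, standard_resp)

    # Responses of related compounds in the same class (also hypothetical).
    for name, resp in {"related_analyte_1": 1010.0, "related_analyte_2": 330.0}.items():
        print(name, round(estimate_concentration(resp, slope, intercept), 3))
```

Because related compounds rarely ionize with exactly the same efficiency as the standard, values obtained this way are best read as estimates for comparison within a class rather than as absolute concentrations.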

8.4.5  Throughput in Plant Metabolomics

A high-throughput method has been defined in several ways throughout the literature (Habchi et al., 2016; De Raad et al., 2016). Often the definition focuses on short analysis time and the number of samples that may be acquired per day. Generally, there is a trade-off here: shorter analysis times reduce the number of metabolites that can be reliably investigated. This conception of high throughput comes from the ‘omics’ standard of large screening approaches and comprehensive study designs, including replication and multiple factors. It does not consider method development and data processing. Metabolomics studies generate large data sets, and post-analysis processing often requires the largest time investment. When considering the throughput of a metabolomics method for a large-scale study design with potentially thousands of samples, the following points are worth considering:

1. Time constraints: The time required for method development with sufficient resolution, method execution, analysis per sample, and post-data collection processing.
2. Robustness and reproducibility: Variability is inevitable, but may be minimized by the adoption of protocols that ensure sample integrity. Additionally, pre-purchasing sufficient materials (solvents, modifiers, analytical columns) from the same batch or lot may improve consistency.
3. Inter- and intra-batch normalization: The use of internal standards and quality control samples facilitates comparison between collected samples.

8.4.6  Data Structures

Mass spectrometry data collected in metabolomics studies typically take the form of a vendor-specific file containing retention times, mass-to-charge ratios (m/z), and ion intensities for each specified m/z value or m/z fragment, among other metadata related to the acquisition method. Most MS vendors have their own proprietary file formats (e.g., .RAW from Thermo Scientific, .D from Agilent, and .WIFF from SCIEX), and raw files from one vendor's software cannot be read by another's. Open MS data formats have been developed, such as netCDF (network common data form), mzXML (m/z extended markup language), mzData, and mzML, that are used to share and exchange data arrays produced in MS-based metabolomics or proteomics studies (Deutsch, 2012). Data files can readily be converted so that they can be read by other software packages, often with free conversion software. This allows an analyst to perform functions piecewise, using beneficial features provided by different software.
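Whatever the file format, the core content described above is a set of (retention time, m/z, intensity) records. The following is a minimal sketch in Python of that structure; the `Peak` class, the function name, and the example values are hypothetical, and no vendor or open-format parser is involved. It also shows one common use of the structure: summing intensities near a target m/z at each retention time to form an extracted ion chromatogram.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    """One detected signal in an MS run."""
    retention_time: float  # minutes
    mz: float              # mass-to-charge ratio
    intensity: float       # detector response, arbitrary units


def extracted_ion_chromatogram(peaks, target_mz, tolerance=0.01):
    """Sum intensity per retention time for peaks within tolerance of target_mz."""
    trace = defaultdict(float)
    for peak in peaks:
        if abs(peak.mz - target_mz) <= tolerance:
            trace[peak.retention_time] += peak.intensity
    return dict(sorted(trace.items()))


if __name__ == "__main__":
    run = [
        Peak(1.0, 180.063, 500.0),
        Peak(1.0, 180.066, 100.0),
        Peak(1.0, 250.100, 900.0),
        Peak(1.1, 180.064, 700.0),
    ]
    print(extracted_ion_chromatogram(run, target_mz=180.063, tolerance=0.005))
```

Tabular peak lists of the kind used later in the analysis are essentially this structure after peaks have been integrated and aligned across samples.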

8.5. ANALYSIS

Following collection of the data, it is time for the next adventure: analysis. In order to highlight the analysis steps in a concrete way, we will use freely available data published on the Metabolomics Workbench (Sud et al., 2015). The study we have chosen as an example is an investigation of root-knot nematode infection of Bermuda grass (Study ID: ST000353). The metabolomic profiles of three genotypes, either infected or not infected by root-knot nematode, were analysed by LC-MS. The raw data were used directly from the download. No additional adjustments to the data were made outside of the code seen here. A complete codebase is available at the GitHub repository.

8.5.1  Data Processing

Before doing any sort of statistical analysis, the data must be converted into a usable form. Raw data in proprietary file formats are in most cases completely inaccessible to analytics packages and must be reformatted for analysis. The first step in this process is converting raw data into structured data with the information necessary for the desired analysis. For most workflows, this tends to result in tabular data sets with peak areas associated with specific compounds (or retention times for unnamed compounds), as discussed in the data structures section above.

After tabular data are in hand, they must be loaded into the analysis package of choice. Please do not overlook the step of checking the result. It sounds simple, but a quick check that the data loaded correctly and are what you expect can save hours of later headache. An example inspection and testing framework can be found at the GitHub repository.
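A loading check of this kind might look like the following. This is an illustrative Python sketch and not the inspection framework from the chapter's repository; the function name, the `sample` identifier column, the `NA` missing-value marker, and the particular checks are all assumptions.

```python
import csv
import io


def load_peak_table(handle, id_column="sample", missing_markers=("", "NA")):
    """Load a tabular peak-area file and fail early if it looks wrong.

    Returns {sample_id: {compound: peak_area or None}}.
    """
    reader = csv.DictReader(handle)
    rows = list(reader)
    if not rows:
        raise ValueError("no data rows found")
    if id_column not in (reader.fieldnames or []):
        raise ValueError(f"missing identifier column: {id_column!r}")

    sample_ids = [row[id_column] for row in rows]
    if len(sample_ids) != len(set(sample_ids)):
        raise ValueError("duplicate sample identifiers")

    table = {}
    for row in rows:
        values = {}
        for compound, cell in row.items():
            if compound == id_column:
                continue
            if cell is None or cell.strip() in missing_markers:
                values[compound] = None  # keep missing values explicit
                continue
            area = float(cell)  # raises ValueError on non-numeric text
            if area < 0:
                raise ValueError(f"negative peak area for {compound!r}")
            values[compound] = area
        table[row[id_column]] = values
    return table


if __name__ == "__main__":
    text = "sample,sucrose,proline\ns1,10.5,NA\ns2,8.0,3.2\n"
    data = load_peak_table(io.StringIO(text))
    print(len(data), "samples;", len(next(iter(data.values()))), "compounds")
```

Printing the number of samples and compounds and comparing them with the experimental design is often enough to catch a truncated or mis-delimited file.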

8.5.1.1  Data Cleaning

After loading the data, they must be cleaned for analysis. This involves making sure compounds are sufficiently resolved and correctly annotated. The cleaning process ensures that entries in the final dataset that arise from the same compound through differences in derivatization or adduct formation are combined. Additionally, any contaminants (e.g., phthalates, column-related compounds) or non-biologically relevant compounds (e.g., internal standards) may be removed.

8.5.1.2  Missing Values

The next step in the process is examining the data for missing values. An inevitable part of data collection is that sometimes there will be missing data: either a compound was not recorded, or the values were not processed appropriately. In examining the missing values, check for magnitude and patterns in the missingness. In most cases, the magnitude will be small and the pattern random. In this case, imputation (estimating likely values for the missing data) can be done. There are a number of methods for imputation, ranging from simple substitution to modelling approaches. Simply substituting a measure of central tendency (e.g., mean or median) is quick and dirty, but may obscure detail necessary for your analysis. Modelling approaches usually do a better job of imputation (Armitage et al., 2015; Schmitt et al., 2015). Nearest neighbours and fuzzy k-means clustering are two approaches that tend to give decent performance (Armitage et al., 2015; Shah et al., 2015; Schmitt et al., 2015; Beretta and Santaniello, 2016). Both work by capturing patterns in the existing data and using those patterns to estimate missing values. For our example workflow with the nematode Bermuda grass data, we used a quick implementation of nearest neighbours for estimating missing values.
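The idea behind nearest-neighbour imputation can be sketched as follows. This is illustrative Python and not the implementation used in the chapter's workflow; the function name, the choice of distance, and the default of two neighbours are assumptions. Each missing value is replaced by the average of that compound's values in the most similar samples, where similarity is measured only over compounds observed in both samples.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries using the k most similar rows (samples).

    rows: list of equal-length lists, one per sample, with None for missing.
    Returns a new table; the input is left unchanged.
    """

    def distance(a, b):
        # Root-mean-square difference over compounds observed in both samples.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: other samples with this compound observed.
            donors = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )[:k]
            if not donors:
                raise ValueError(f"no observed values for column {j}")
            filled[i][j] = sum(v for _, v in donors) / len(donors)
    return filled


if __name__ == "__main__":
    table = [
        [1.0, 2.0, 3.0],
        [1.1, 2.1, None],  # third compound missing in this sample
        [1.2, 1.9, 3.2],
        [9.0, 9.0, 9.0],
    ]
    print(knn_impute(table, k=2)[1])
```

In practice the data are usually scaled first so that abundant compounds do not dominate the distance.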

8.5.1.3 Normalization

After imputation of missing values, you will want to normalize the data to facilitate comparisons between and among groups. In conducting experiments, there will always be variation between batches, runs, and platforms. Normalization is a way of removing that unwanted variation in order to focus on explaining the biologically relevant variation of interest (De Livera et al., 2012). In adjusting for the unwanted variation, normalization can often improve the performance of the resulting analysis and facilitate detection of patterns that may otherwise be obscured (Kohl et al., 2012; Li et al., 2016). As in imputation, the choice of normalization procedure can have important implications for the resulting analysis. For our example workflow, we chose variance stabilizing normalization to account for heteroscedasticity (unequal variances) and because of its performance relative to other normalization procedures (Li et al., 2016).
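As a rough illustration of what normalization does, the sketch below applies a much simpler stand-in than the variance stabilizing normalization used in the example workflow: it scales every sample to a common median and then applies an inverse hyperbolic sine transform, which behaves like a logarithm for large intensities but remains defined at zero. This is illustrative Python; the function names and the fixed `scale` parameter are assumptions, and the sketch does not estimate any parameters from the data as a full variance stabilizing procedure would.

```python
import math
import statistics


def median_scale(samples):
    """Rescale each sample so that all samples share the same median intensity.

    samples: {sample_name: [intensities]}
    """
    medians = {name: statistics.median(values) for name, values in samples.items()}
    target = statistics.median(medians.values())
    return {
        name: [x * target / medians[name] for x in values]
        for name, values in samples.items()
    }


def asinh_transform(values, scale=1.0):
    """Log-like transform that compresses large intensities and accepts zeros."""
    return [math.asinh(x / scale) for x in values]


if __name__ == "__main__":
    # Hypothetical peak areas: s2 was injected at twice the amount of s1.
    raw = {
        "s1": [10.0, 20.0, 30.0],
        "s2": [20.0, 40.0, 60.0],
        "s3": [5.0, 10.0, 15.0],
    }
    scaled = median_scale(raw)
    normalized = {name: asinh_transform(values) for name, values in scaled.items()}
    for name, values in normalized.items():
        print(name, [round(v, 3) for v in values])
```

After median scaling, differences that come only from the amount of material injected disappear, leaving differences in the relative pattern of compounds.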

8.5.2  Unsupervised Approach

As a next step in analysing our data, we will explore them for patterns. This exploratory approach is considered unsupervised. Unsupervised in this context means that we look for patterns without telling the algorithm anything about our experimental design. We will use the computer (pattern recognition algorithms) to detect patterns that we, as human observers, may not be able to tease out with the naked eye. Later, interest may turn to matching those observed patterns with our experimental design, but for now we will simply look for patterns in our data. This exploratory approach is a good opportunity to confirm assumptions about our data set and to determine whether there are any interesting anomalies.

8.5.2.1 Ordination

One of the best places to start in unsupervised exploratory analysis is with ordination techniques. Ordination is a technique in which multivariate data are re-projected in reduced dimensions. These techniques are particularly useful when the number of variables per sample is extremely large, which is exactly the case in metabolomics, where the number of compounds monitored far exceeds the number of samples. Common ordination methods are principal components analysis, correspondence analysis, and multidimensional scaling.

Principal components analysis (PCA) is the most popular ordination and dimension reduction technique. It re-projects data onto axes that orthogonally (i.e. at right angles to, and uncorrelated with, one another) capture variation through linear combinations of variables. This is extremely useful, as often a small number of linear combinations can capture the vast majority of the variation observed among an exceedingly large number of variables. PCA can be used for cursory visualization of high-dimensional datasets and as preparation for machine learning techniques, which will be discussed later. To learn more about PCA, there are a number of excellent resources available (Abdi and Williams, 2010; Bro and Smilde, 2014). Correspondence analysis is similar to PCA, but is applied to categorical variables (Greenacre, 2016). Various forms of discriminant analysis are often discussed in relation to PCA, but will be discussed in the supervised approach section below.

Multidimensional scaling is an oft-overlooked method particularly relevant for metabolomics analysis. Instead of re-projecting data onto linear combinations of variables that capture variation, multidimensional scaling re-projects dissimilarities between samples. It captures the multidimensional differences between samples (e.g., by calculating multivariate Euclidean distances), then re-projects those differences as faithfully as possible in a reduced dimension (Young, 2013). Usefully for us, multidimensional scaling can produce the best possible representation of multivariate differences in two easily understandable dimensions. Because multidimensional scaling re-projects differences and does not re-project captured orthogonal variation, it can often be better than PCA for initially exploring data through visual analysis.

In applying multidimensional scaling to our example workflow with the nematode Bermuda grass data, we used Euclidean distance as our measure of dissimilarity and projected into two dimensions with non-metric multidimensional scaling (Fig. 8.2). In this example, we see that the genotypes strongly separate and that, within genotype, there seem to be differences between plants infected with nematodes and those not infected.

FIG. 8.2  Non-metric multidimensional scaling of metabolite profiles of Bermuda grass infected or not with plant parasitic nematodes.
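To show what "re-projecting onto a linear combination of variables" means in practice, the sketch below finds the first principal component of a small table. It is illustrative Python and uses PCA rather than the non-metric multidimensional scaling behind Fig. 8.2; the function name and the use of power iteration (repeatedly multiplying a vector by the covariance matrix until it settles on the dominant direction) are choices made for this example only. Dedicated statistical packages use more efficient and more general routines.

```python
import math
import random


def first_principal_component(rows, n_iter=200, seed=0):
    """Return (loadings, scores, explained_fraction) for the first PC.

    rows: list of samples, each a list of variable values.
    """
    n, p = len(rows), len(rows[0])

    # Centre each variable on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in rows]

    # Sample covariance matrix (p x p).
    cov = [
        [sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(p)]
        for a in range(p)
    ]

    # Power iteration converges to the direction of greatest variance.
    rng = random.Random(seed)
    v = [rng.random() for _ in range(p)]
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]

    # Variance along that direction, relative to total variance.
    eigenvalue = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)
    )
    total_variance = sum(cov[a][a] for a in range(p))

    # Scores: position of each sample along the new axis.
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centred]
    return v, scores, eigenvalue / total_variance


if __name__ == "__main__":
    # Two hypothetical metabolites that rise and fall together.
    data = [[x, 2 * x + (0.1 if x % 2 == 0 else -0.1)] for x in range(10)]
    loadings, scores, explained = first_principal_component(data)
    print([round(l, 3) for l in loadings], round(explained, 4))
```

When two variables are strongly correlated, as here, a single component captures nearly all of the variation, which is why a few components can often summarize many metabolites.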


8.5.2.2 Clustering

As a next step in exploring our data, we will use a clustering approach to look at similarities in our data. Two common clustering approaches are k-means clustering and hierarchical clustering.

K-means clustering is related to PCA and is often applied together with it. It iteratively partitions data into clusters based on group means and is the prototypical, popular clustering approach (Jain, 2010). It can be extremely useful when there are explicit, distinct groups, but it does not handle non-linearity and relies to some extent on the user's choice of the number of clusters.

Hierarchical cluster analysis examines similarities (or dissimilarities) between groups, then assembles hierarchies based either on top-down separation by dissimilarity or bottom-up association by similarity. The result is a tree structure with similar items on the same branches and the height at which branches join reflecting how dissimilar they are. Hierarchical clustering allows you to quickly get a sense of how related different aspects of your metabolomic profiles are, even when you have a plethora of data.

In our example workflow with the nematode Bermuda grass data, we applied hierarchical cluster analysis to the named compounds in the metabolite profiles (Fig. 8.3). The named compounds were clustered using average-based agglomerative clustering (i.e. bottom-up grouping), with correlations in metabolite expression across treatment groups used as similarities. The result highlights compounds with similar levels across different nematode treatments and Bermuda grass genotypes.
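The bottom-up procedure described above can be sketched in a few lines. This is illustrative Python rather than the chapter's R code; the function names, the example compounds, and their values are hypothetical. Dissimilarity between two compounds is taken as one minus their Pearson correlation, and at each step the two clusters with the smallest average pairwise dissimilarity are merged.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def average_linkage(profiles):
    """Agglomerative clustering with average linkage.

    profiles: {name: [values across treatment groups]}
    Returns the merges in order as (cluster_a, cluster_b, dissimilarity).
    """
    names = list(profiles)
    dissimilarity = {
        (a, b): 1 - pearson(profiles[a], profiles[b])
        for a in names for b in names if a != b
    }
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average of all pairwise dissimilarities between members.
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(dissimilarity[p] for p in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for m, c in enumerate(clusters) if m not in (i, j)] + [merged]
    return merges


if __name__ == "__main__":
    example = {
        "sucrose": [1.0, 2.0, 3.0, 4.0],
        "glucose": [2.0, 4.0, 6.0, 8.5],
        "proline": [4.0, 3.0, 2.0, 1.0],
        "glycine": [8.0, 6.0, 4.0, 2.5],
    }
    for left, right, height in average_linkage(example):
        print(left, "+", right, "at", round(height, 4))
```

The recorded dissimilarities are the heights at which branches join in the resulting tree.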

8.5.3  Supervised Approach

Thus far, we have explored our data without providing the pattern recognition algorithms any of the information that our experimental design would provide. Our exploration has been unsupervised. Now, we transition into a supervised approach, where we use the information in our experimental design to make predictions and gather insights about our data. A common need in plant metabolomics is the identification of biomarkers for a certain condition: compounds that, when over- or under-expressed, denote a relevant change in the system. To identify these compounds, we often need to be able to predict plant status given a suite of metabolomic variables. Fortunately for us, there is an increasing array of classification and regression prediction algorithms available. Many are extremely powerful. For our purposes, we will discuss four broad categories of prediction approaches: regression, discriminant analyses, tree-based methods, and neural nets.

8.5.3.1  Linear Regression

If you are reading this, you are likely to be familiar with linear regression: the use of linear combinations of predictors to estimate a continuous response variable. What you may not know is that the world of regression is extremely vast, and adaptations of basic linear regression have been developed for almost any sort of continuous/categorical, multivariate or univariate situation conceivable. Regression shines when relationships are mostly linear and when there is a premium on interpretability; the construction of regression models makes them readily understood and communicated. While we will not dive deeply into regression here (other methods tend to be more applicable), there are numerous excellent regression resources (Draper and Smith, 2014).

FIG. 8.3  Hierarchical clustering of metabolites based on expression in three Bermuda grass genotypes infected or not with plant parasitic nematodes.

8.5.3.2  Discriminant Analysis Discriminant analysis is similar in some respects to PCA in that it relies on linear combinations of predictors to discriminate (i.e. separate) between categorical groups. These combinations of predictors are called latent variables, and while all forms of discriminant analysis rely upon them, the manner in which they are constructed differs. In linear discriminant analysis (LDA), latent variables are constructed to maximally separate the predicted response categories. In partial


least squares discriminant analysis (PLS-DA), latent variables are constructed to account for variation in the predicted response. In orthogonal partial least squares discriminant analysis (OPLS-DA), latent variables are constructed so that variation correlated with the response is separated from variation that is orthogonal (at right angles) to it. While these methods can be excellent prediction models in some circumstances, they do not handle non-linearity well and can rely upon assumptions of normality.
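To make the notion of a latent variable concrete, the following is a minimal sketch of a two-class Fisher linear discriminant for samples with exactly two features. It is not a full LDA implementation; the helper names (`fisher_direction`, `project`) and the toy abundances are assumptions made for illustration only.

```python
def fisher_direction(class_a, class_b):
    """Two-class Fisher discriminant for 2-feature samples.

    Returns w = S_W^-1 (mean_a - mean_b), the linear combination of the
    two predictors that best separates the groups.
    """
    def mean_vec(rows):
        n = len(rows)
        return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

    def scatter(rows, m):
        # Entries of the symmetric 2x2 within-class scatter matrix.
        s00 = sum((r[0] - m[0]) ** 2 for r in rows)
        s01 = sum((r[0] - m[0]) * (r[1] - m[1]) for r in rows)
        s11 = sum((r[1] - m[1]) ** 2 for r in rows)
        return s00, s01, s11

    m_a, m_b = mean_vec(class_a), mean_vec(class_b)
    a00, a01, a11 = scatter(class_a, m_a)
    b00, b01, b11 = scatter(class_b, m_b)
    s00, s01, s11 = a00 + b00, a01 + b01, a11 + b11

    det = s00 * s11 - s01 * s01  # assumes the pooled scatter is invertible
    d0, d1 = m_a[0] - m_b[0], m_a[1] - m_b[1]
    # Multiply the inverse of the 2x2 scatter matrix by the mean difference.
    return [(s11 * d0 - s01 * d1) / det, (-s01 * d0 + s00 * d1) / det]

def project(sample, w):
    """Score of one sample on the latent variable defined by w."""
    return sample[0] * w[0] + sample[1] * w[1]

# Hypothetical abundances of two metabolites in infected and control plants.
infected = [(4.0, 2.0), (5.0, 3.0), (6.0, 2.5), (5.5, 3.5)]
control = [(1.0, 2.5), (2.0, 3.0), (1.5, 2.0), (2.5, 3.5)]
w = fisher_direction(infected, control)
infected_scores = [project(s, w) for s in infected]
control_scores = [project(s, w) for s in control]
```

In this toy data set every infected sample scores higher on the latent variable than every control sample, so a single cutoff on the projected score separates the two groups.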

8.5.3.3  Tree-Based Methods

Tree-based methods rely upon classification and regression trees to make predictions. They do so by assembling decision trees of predictors. A decision tree is assembled by identifying and combining cutoffs for predictors. If, for example, we were trying to decide (i.e. predict) whether a fruit was an apple or a banana, a first item in our decision tree might be colour: if it were yellow, we would think banana. To handle more complicated situations, we could add more levels: if it were yellow and had a volume below 20 mL, we might have a special kind of apple. Tree-based methods for classification and regression can become extremely powerful by increasing tree depth (the number of successive splits in a tree) and by combining multiple trees. Random forest models do exactly this, combining many trees in such a way as to approach an optimal bias-variance trade-off (Breiman, 2001). Random forests combine largely uncorrelated trees, each trained on a random subset of the original data, through a process called bootstrap aggregation (bagging) (Breiman, 2001). Predictions from the individual trees in a forest are combined by taking the mode for classification of categorical data or the mean for regression of continuous data. While interpreting the structure of a forest of trees is more challenging than interpreting a linear regression, random forests are able to capture highly non-linear relationships as a result. This makes them extremely well suited to the non-linearities inherent in plant metabolomics work. Additionally, tree-based classification methods provide useful information about the predictors: variable importance. Depending on their position and prevalence in the trees of a random forest, predictors contribute more or less to the prediction outcome. This can be quantified and returned as an output of the model.
In our fruit classification example where we are separating apples and bananas, colour is an important predictor and would assume a place of prominence in the classification tree. In applying a random forest model to our nematode and Bermuda grass example, we can ask the model to predict nematode infection based on observed metabolite profiles. In this case, the most important variables are mostly unidentified compounds (Fig. 8.4). This highlights an important facet of metabolomics work: not all important metabolites are easily identified and not all easily identified compounds are important. Using the combination of known and unknown compounds in initial data exploration can help to identify interesting unknowns that may warrant the extra time needed for annotation/identification.
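The apple and banana example can be sketched in plain Python as bagged depth-one trees (stumps) combined by majority vote. This is a deliberately simplified illustration, not a full random forest (it omits the random selection of candidate predictors at each split), and the function names, the yellowness score, and the fruit measurements are all assumptions made for this example.

```python
import random
from collections import Counter

def fit_stump(rows, labels):
    """Fit a depth-one tree: the feature/cutoff pair with the fewest errors."""
    best = None
    for feature in range(len(rows[0])):
        for cutoff in sorted({r[feature] for r in rows}):
            left = [lab for r, lab in zip(rows, labels) if r[feature] <= cutoff]
            right = [lab for r, lab in zip(rows, labels) if r[feature] > cutoff]
            if not right:
                continue  # a cutoff at the maximum value splits nothing
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(lab != left_label for lab in left)
                      + sum(lab != right_label for lab in right))
            if best is None or errors < best[0]:
                best = (errors, feature, cutoff, left_label, right_label)
    if best is None:  # every sampled row was identical, so no split exists
        majority = Counter(labels).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]

def predict_stump(stump, row):
    feature, cutoff, left_label, right_label = stump
    return left_label if row[feature] <= cutoff else right_label

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Bagging: fit each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return forest

def predict_forest(forest, row):
    """Combine the individual predictions by taking the mode (majority vote)."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical fruit: (yellowness score from 0 to 1, volume in mL).
rows = [(0.10, 180), (0.20, 120), (0.15, 170), (0.30, 130),
        (0.90, 125), (0.80, 175), (0.95, 110), (0.85, 185)]
labels = ["apple"] * 4 + ["banana"] * 4
forest = fit_forest(rows, labels)
# How often each feature index was chosen for a split: a crude stand-in
# for the variable importance described above.
split_counts = Counter(stump[0] for stump in forest)
green_fruit = predict_forest(forest, (0.05, 150))
yellow_fruit = predict_forest(forest, (0.99, 150))
```

Here colour (feature index 0) separates the two fruits while volume does not, so the stumps split on colour, mirroring the place of prominence that colour takes in the classification tree.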


FIG. 8.4  Variable importance measures from random forests models of metabolites from Bermuda grass useful in predicting nematode infection.

8.5.3.4  Performance Considerations: Validation

An important consideration when working through predictions is quantifying performance. In our case, although our variables are important in our model, how good is our model? How useful is it? For continuous responses, good performance means low prediction error: small differences between predicted values and actual values. With categorical responses, we assess performance by examining the correct and erroneous predictions; we look at the relationship between the true positive rate and the false positive rate. In an ideal world, when predicting nematode infection, we would want a high true positive rate and a low false positive rate. The curve relating the two, and the area under that curve, provide a measure of model performance (Fig. 8.5). In determining the performance of our predictions, it is important to remember the bias-variance trade-off: we do not want to overfit our models to the data. We want our models to be useful, and to be useful they have to generalize well to other situations. To measure how well our models perform in new situations, we separate our data into training and test data. The model learns to recognize patterns in the training data and then makes predictions on test data it has never seen. This gives us an estimate of how well the model might do on never-before-seen data. In training the model, we want to prevent overfitting and prepare the model for later generalization. To do so, we can use a method called cross validation, which separates the training data into random subsets for training and evaluation within the overall training process. After doing this with our nematode data, we find that including unknown compounds increases performance and that many unknown compounds are


FIG. 8.5  Receiver operating characteristic curve for random forests models predicting infection status of Bermuda grass using either just named compounds or both named and unidentified compounds. AUC is area under the curve.

among the most important in predicting nematode infection of Bermuda grass (Figs. 8.4 and 8.5). This is an important lesson: examining only named compounds may not be sufficient for metabolomics pipelines. In many situations, it is useful to investigate unidentified compounds as well.
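The two validation ideas discussed in this section, the area under the receiver operating characteristic curve and cross validation, can be sketched with the standard library alone. The function names and the toy labels and scores below are assumptions for illustration; they are not the nematode data or the code used to produce Fig. 8.5.

```python
import random

def roc_auc(labels, scores):
    """Area under the ROC curve, computed by comparing every positive
    sample with every negative sample (ties count as half a win)."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, validation_idx) pairs for k-fold cross validation."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)  # random assignment of samples to folds
    folds = [order[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, held_out

# Hypothetical infection status (1 = infected) and predicted probabilities.
status = [1, 1, 1, 0, 0, 0]
predicted = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
auc = roc_auc(status, predicted)

# Five folds over ten training samples; each sample is held out exactly once.
splits = list(k_fold_indices(10, 5))
```

An AUC of 1.0 would mean that every infected sample was scored above every uninfected sample, while 0.5 corresponds to guessing. In this toy example one infected sample is scored below one uninfected sample, so eight of the nine pairs are ordered correctly.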

8.5.4 Inference

While prediction can provide valuable insights into biological processes, statistical inference is often the focus of academic works. Instead of focusing on predictive power, inference focuses on quantifying the patterns observed in our experiment. This is where we dive into the realm of p-values and statistical tests. In applying common statistical inference tools (e.g., t-tests, ANOVA) to metabolomics data, a few points are important to keep in mind:

1. Most common statistical tests are designed for normal, low-dimensional data. Most metabolomics data are non-normal and highly multivariate. Normalization procedures can correct for this to some extent, but double-checking is important.
2. The more tests you do, the more likely you are to find a significant difference where none exists. This is termed alpha-inflation, and correcting for it is important when doing many tests simultaneously, e.g. hundreds or thousands of compounds for each hypothesis test.


FIG.  8.6  Difference in unidentified compound RT 401.1206 between treatments by genotype. Points denote raw data values. Triangles and error bars denote mean and bootstrapped 95% confidence intervals, respectively.

To address non-normality, bootstrapping and permutation tests are helpful. To address alpha-inflation, corrections like the Bonferroni correction can be useful (Fig. 8.6).
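The remedies named above can be sketched in plain Python: a permutation test and a percentile bootstrap confidence interval, neither of which assumes normality, and a Bonferroni correction for multiple tests. The function names, the number of resamples, and the toy measurements are assumptions for illustration, not the analysis behind Fig. 8.6.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign the group labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Adding one to both counts keeps the estimate from being exactly zero.
    return (extreme + 1) / (n_permutations + 1)

def bootstrap_ci(values, n_resamples=2000, seed=0, level=0.95):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(mean(rng.choices(values, k=len(values)))
                   for _ in range(n_resamples))
    low = means[int((1 - level) / 2 * n_resamples)]
    high = means[int((1 + level) / 2 * n_resamples) - 1]
    return low, high

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical abundances of one compound in treated and control plants.
treated = [5.1, 5.3, 4.9, 5.2, 5.0]
control = [3.0, 3.2, 2.9, 3.1, 3.3]
p_value = permutation_p_value(treated, control)
ci_low, ci_high = bootstrap_ci(treated)

# Correcting this result together with two other made-up p-values.
adjusted = bonferroni([p_value, 0.04, 0.5])
```

The Bonferroni correction is simple but conservative: with thousands of compounds it can hide real differences, so it is best treated as one option among several multiple-testing corrections.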

8.6.  METABOLOMICS IN AGRICULTURE

Plant metabolomics approaches have been widely applied to address many challenges faced in agricultural systems. Paired ‘omics’ approaches, incorporating small metabolite detection with some combination of genomics, transcriptomics, and proteomics, have served to create more complete networks of interactions in a systems biology approach. Some examples where cutting-edge analytics techniques have been paired with plant metabolomics workflows for advancing agriculture are:

1. Many studies have investigated biotic and abiotic stress responses in the model organism Arabidopsis thaliana, e.g. to heat, freezing, drought, and salinity (Kaplan et al., 2004; Nakamura et al., 2009; Mao et al., 2013; Nakabayashi et al., 2014). Generally, these studies have provided greater understanding of the accumulation of specialized metabolites in response to and in mediation of stressors (Tian et al., 2016). Knowledge of these stress responses can be used to mitigate climate change impacts on agricultural production in at-risk areas.

2. Other successes in agricultural settings using plant metabolomics include plant phenotyping, measurement of genetic diversity, development of metabolic markers and metabolic signatures for plant growth, development and responses to stress, verification of metabolic effects following gene knock-out, and mechanisms of inheritance of biochemical traits and pathways (Nadella et al., 2012). This basic research can pay applied dividends in the future as metabolomics mapping can facilitate development of resistant and resilient agricultural crops.

3. There is concern among consumers and scientists over potential unforeseen, harmful alterations to plants as a result of traditional and modern plant breeding techniques. Historically, targeted approaches have been adopted to quantify major plant constituents (i.e. nutrients and toxicants). A targeted approach relies on fragmented knowledge of plant biosynthetic pathways, whereas metabolomics profiling seeks to quantify all plant constituents (Cellini et al., 2004). Understanding the complete metabolomics systems can facilitate optimization of desired plant characteristics, including nutritional profiles and developmental success. Efforts to breed a slow-ripening tomato variety, for example, revealed an unintended change in fruit composition (Noteborn et al., 2000). The slow-ripening mutant contained 2–4x more α-lycopene, an antioxidant believed to confer anticancer and heart disease benefits (Story et al., 2010), than the unedited variety.

4. A comparison between genetically modified and conventional potato crops found the varieties to be equivalent in composition, excluding the intentional targeted metabolic changes (in this case, upregulation of inulin-type fructans that provide resistance to digestive tract pathogens) (Catchpole et al., 2005). These metabolomics data are important for regulatory purposes and for assuaging consumer concerns.

5. Aliphatic glucosinolates are a class of insect defence compounds produced by plants from the family Brassicaceae (Kliebenstein, 2004). A double knock-out mutation of two transcription factors responsible for regulating glucosinolate production in Arabidopsis halts synthesis of these defence compounds, resulting in increased cabbage moth (Mamestra brassicae) larvae-induced damage (Beekwilder et al., 2008). Metabolomics knowledge of plant defence systems can be used to design plants able to resist pest pressures in the field.

6. A paired ‘omics’ approach introduced a new cucumber variety with improved flavour by revealing the importance of chromosome placement of the transgene on production of metabolites contributing to flavour and perceived sweetness in cucumber (Tagashira et al., 2005; Zawirska-Wojtasiak et al., 2009). Plant metabolomics used for improving taste is an immediate consumer success.

7. Glyphosate is the most commonly used herbicide in the world, yet the mechanisms underlying its mode of action and mode(s) of resistance are not fully understood. Metabolomics research in the model plant Arabidopsis has helped identify a protein kinase that facilitates the toxic action of the herbicide (Faus et al., 2015). Knock-out mutants for the enzyme exhibit lower inhibition of photosynthesis. Improving metabolomics knowledge of herbicide function can support more efficient commercial agriculture around the world.


8. Finally, an important field of agricultural chemical ecology research that involves plant metabolites is that of plant-insect, plant-microbe, and insect-microbe interactions and the associated chemical communications between these organisms (Beck and Vannette, 2017). The volatile profiles emitted from plants, insects, and microbes (Beck et al., 2017a,b) result from extremely complex ecological interactions, and their study benefits greatly from the application of statistical methods to help delineate important biomarkers and volatile signals.

8.7. CONCLUSIONS

Phytochemical analyses have undergone some extraordinary transformations over the years, with sustained interest in ‘metabolites’ from plants starting in the 1930s. It was not until the 1970s that publications (Scopus search ‘plant’ AND ‘metabolites’) started numbering in the 100s per year. In the early 2000s, the number of publications broke into the 1000s, with each of the last four years having more than 4000 publications. Interestingly, publications including the term ‘metabolomics’ made their debut in 2003, the same year that ‘plant’ and ‘metabolites’ exceeded 1000 publications per year. The increase in interest in metabolomics may be an obvious extension of where the field of phytochemical analysis was headed, given the increase in cross-disciplinary collaborations and the increase in instrument sensitivity and throughput. However, it could also be argued that this explosion of metabolomics-based research may be due in part to the availability and application of computational methods and their corresponding statistical analyses of metabolomics data. It is the intent of this chapter that the vast utility of computational methods for exploring the metabolomes of plants becomes more commonplace among researchers, and that these methods continue to have positive impacts on the quality, growth, and future directions of phytochemical analyses. Following good data analysis pipelines, including consistent workflow documentation with stable ontologies to describe known and unknown compounds and field-specific metadata, will greatly enhance our ability to collaborate and share research. This will lead to a decrease in the unknown chemical space of biology, filling the empty nodes and completing the network.

REFERENCES

Abdi, H., Williams, L.J., 2010. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2, 433–459. Allaire, J.J., Cheng, J., Xie, Y., McPherson, J., Chang, W., Allen, J., Wickham, H., Atkins, A., Hyndman, R., Arslan, R., 2017. Rmarkdown: dynamic documents for R. https://CRAN.R-project.org/package=rmarkdown. Allwood, J.W., Ellis, D.I., Heald, J.K., Goodacre, R., Mur, L.A.J., 2006. Metabolomic approaches reveal that phosphatidic and phosphatidyl glycerol phospholipids are major discriminatory non-polar metabolites in responses by Brachypodium distachyon to challenge by Magnaporthe grisea. Plant J. 46, 351–368.

Allwood, J.W., Clarke, A., Goodacre, R., Mur, L.A.J., 2010. Dual metabolomics: a novel approach to understanding plant–pathogen interactions. Phytochemistry 71, 590–597. Allwood, J.W., De Vos, R.C.H., Moing, A., Deborde, C., Erban, A., Kopka, J., Goodacre, R., Hall, R.D., 2011. Plant metabolomics and its potential for systems biology research: background concepts, technology, and methodology. Methods Enzymol. 500, 299–336. Allwood, J.W., Cheung, W., Xu, Y., Mumm, R., De Vos, R.C.H., Deborde, C., Biais, B., Maucourt, M., Berger, Y., Schaffer, A.A., Rolin, D., Moing, A., Hall, R.D., Goodacre, R., 2014. Metabolomics in melon: a new opportunity for aroma analysis. Phytochemistry 99, 61–72. Allwood, J.W., Chandra, S., Xu, Y., Dunn, W.B., Correa, E., Hopkins, L., Goodacre, R., Tobin, A.K., Bowsher, C.G., 2015. Profiling of spatial metabolite distributions in wheat leaves under normal and nitrate limiting conditions. Phytochemistry 115, 99–111. Armitage, E.G., Godzien, J., Alonso-Herranz, V., López-Gonzálvez, A., Barbas, C., 2015. Missing value imputation strategies for metabolomics data. Electrophoresis 36, 3050–3060. Beck, J.J., Vannette, R.L., 2017. Harnessing insect-microbe chemical communications to control insect pests of agricultural systems. J. Agric. Food Chem. 65, 23–28. Beck, J.J., Willett, D.S., Mahoney, N.E., Gee, W.S., 2017a. Silo-stored pistachios at varying humidity levels produce distinct volatile biomarkers. J. Agric. Food Chem. 65, 551–556. Beck, J.J., Torto, B., Vannette, R.L., 2017b. Eavesdropping on plant-insect-microbe chemical communications in agricultural ecology: a virtual issue on semiochemicals. J. Agric. Food Chem. 65, 5101–5103. Beekwilder, J., Van Leeuwen, W., Van Dam, N.M., Bertossi, M., Grandi, V., Mizzi, L., Soloviev, M., Szabados, L., Molthoff, J.W., Schipper, B., Verbocht, H., de Vos, R.C.H., Morandini, P., Aarts, M.G.M., Bovy, A., 2008.
The impact of the absence of aliphatic glucosinolates on insect herbivory in Arabidopsis. PLoS One 3, e2068. Beretta, L., Santaniello, A., 2016. Nearest neighbor imputation algorithms: a critical evaluation. BMC Med. Inf. Decis. Mak. 16, 74. Box, G.E.P., Draper, N.R., 1987. Empirical Model-Building and Response Surfaces. Wiley, Oxford, England. Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32. Bro, R., Smilde, A.K., 2014. Principal component analysis. Anal. Methods 6, 2812–2831. Catchpole, G.S., Beckmann, M., Enot, D.V., Mondhe, M., Zywicki, B., Taylor, J., Hardy, N., Smith, A., King, R.D., Kell, D.B., Fiehn, O., Draper, J., 2005. Hierarchical metabolomics demonstrates substantial compositional similarity between genetically modified and conventional potato crops. Proc. Natl. Acad. Sci. U. S. A. 102, 14458–14462. Cellini, F., Chesson, A., Colquhoun, I., Constable, A., Davies, H.V., Engel, K.H., Gatehouse, A.M., Kärenlampi, S., Kok, E.J., Leguay, J.J., Lehesranta, S., Noteborn, H.P., Pedersen, J., Smith, M., 2004. Unintended effects and their detection in genetically modified crops. Food Chem. Toxicol. 42, 1089–1125. Cozzolino, D., 2012. Benefits and limitations of infrared technologies in omics research and development of natural drugs and pharmaceutical products. Drug Dev. Res. 73, 504–512. De Livera, A.M., Dias, D.A., De Souza, D., Rupasinghe, T., Pyke, J., Tull, D., Roessner, U., McConville, M., Speed, T.P., 2012. Normalizing and integrating metabolomics data. Anal. Chem. 84, 10768–10776. De Raad, M., Fischer, C.R., Northen, T.R., 2016. High-throughput platforms for metabolomics. Curr. Opin. Chem. Biol. 30, 7–13. Deutsch, E.W., 2012. File formats commonly used in mass spectrometry proteomics. Mol. Cell. Proteomics 11, 1612–1621. Draper, N.R., Smith, H., 2014. Applied Regression Analysis. John Wiley & Sons, New Jersey, United States.

Ernst, M., Silva, D.B., Silva, R.R., Vêncio, R.Z., Lopes, N.P., 2014. Mass spectrometry in plant metabolomics strategies: from analytical platforms to data acquisition and processing. Nat. Prod. Rep. 31, 784–806. Faus, I., Zabalza, A., Santiago, J., Nebauer, S.G., Royuela, M., Serrano, R., Gadea, J., 2015. Protein kinase gcn2 mediates responses to glyphosate in Arabidopsis. BMC Plant Biol. 15, 14. Fiehn, O., Robertson, D., Griffin, J., van der Werf, M., Nikolau, B., Morrison, N., Sumner, L.W., Goodacre, R., Hardy, N.W., Taylor, C., Fostel, J., Kristal, B., Kaddurah-Daouk, R., Mendes, P., van Ommen, B., Lindon, J.C., Sansone, S.-A., 2007a. The metabolomics standards initiative (MSI). Metabolomics 3, 175–178. Fiehn, O., Sumner, L.W., Rhee, S.Y., Ward, J., Dickerson, J., Lange, B.M., Lane, G., Roessner, U., Last, R., Nikolau, B., 2007b. Minimum reporting standards for plant biology context information in metabolomic studies. Metabolomics 3, 195–201. Fiehn, O., Wohlgemuth, G., Scholz, M., Kind, T., Lee, D.Y., Lu, Y., Moon, S., Nikolau, B., 2008. Quality control for plant metabolomics: reporting MSI-compliant studies. Plant J. 53, 691–704. Fortmann-Roe, S., 2012. Understanding the bias-variance tradeoff. http://www.webcitation.org/6dQKoNqXb. Gemperline, E., Keller, C., Li, L., 2016. Mass spectrometry in plant-omics. Anal. Chem. 88, 3422–3434. Goodacre, R., Broadhurst, D., Smilde, A.K., Kristal, B.S., Baker, J.D., Beger, R., Bessant, C., Connor, S., Capuani, G., Graig, A., Ebbels, T., Kell, B.B., 2007. Proposed minimum reporting standards for data analysis in metabolomics. Metabolomics 3, 231–241. Greenacre, M., 2016. Correspondence Analysis in Practice, third ed. CRC Press, Florida, United States. Gullberg, J., Jonsson, P., Nordström, A., Sjöström, M., Moritz, T., 2004.
Design of experiments: an efficient strategy to identify factors influencing extraction and derivatization of Arabidopsis thaliana samples in metabolomic studies with gas chromatography/mass spectrometry. Anal. Biochem. 331, 283–295. Habchi, B., Alves, S., Paris, A., Rutledge, D.N., Rathahao-Paris, E., 2016. How to really perform high throughput metabolomic analyses efficiently? Trends Anal. Chem. 85, 128–139. Ihaka, R., Gentleman, R., 1996. R: a language for data analysis and graphics. J. Comput. Graph. Stat. 5, 299–314. Issaq, H.J., Van, Q.N., Waybright, T.J., Muschik, G.M., Veenstra, T.D., 2009. Analytical and statistical approaches to metabolomics research. J. Sep. Sci. 32, 2183–2199. Jain, A.K., 2010. Data clustering: 50 years beyond K-means. Pattern Recogn. Lett. 31, 651–666. Jorge, T.F., Mata, A.T., António, C., 2016. Mass spectrometry as a quantitative tool in plant metabolomics. Phil. Trans. R. Soc. A 374, 20150370. Kaplan, F., Kopka, J., Haskell, D.W., Zhao, W., Schiller, K.C., Gatzke, N., Sung, D.Y., Guy, C.L., 2004. Exploring the temperature-stress metabolome of Arabidopsis. Plant Physiol. 136, 4159–4168. Kim, H.K., Verpoorte, R., 2010. Sample preparation for plant metabolomics. Phytochem. Anal. 21, 4–13. Kliebenstein, D.J., 2004. Secondary metabolites and plant/environment interactions: a view through Arabidopsis thaliana tinged glasses. Plant Cell Environ. 27, 675–684. Kluyver, T., Ragan-Kelley, B., Pérez, F., Granger, B.E., Bussonnier, M., Frederic, J., Kelley, K., Hamrick, J., Grout, J., Corlay, S., Ivanov, P., Avila, D., Abdalla, S., Willing, C., 2016. In: Jupyter Notebooks—a publishing format for reproducible computational workflows. Positioning and Power in Academic Publishing: Players, Agents and Agendas. IOS Press, pp. 87–90. Knuth, D.E., 1984. Literate programming. Comput. J. 27, 97–111. Kohl, S.M., Klein, M.S., Hochrein, J., Oefner, P.J., Spang, R., Gronwald, W., 2012. 
State-of-the art data normalization methods improve NMR-based metabolomic analysis. Metabolomics 8, 146–160.

Kwon, H.N., Phan, H.-D., Xu, W.J., Ko, Y.J., Park, S., 2016. Application of a smartphone metabolomics platform to the authentication of Schisandra sinensis. Phytochem. Anal. 27, 199–205. Li, B., Tang, J., Yang, Q., Cui, X., Li, S., Chen, S., Cao, Q., Xue, W., Chen, N., Zhu, F., 2016. Performance evaluation and online realization of data-driven normalization methods used in LC/MS based untargeted metabolomics analysis. Sci. Rep. 6, 38881. Littell, R.C., Milliken, G.A., Stroup, W.W., Wolfinger, R.D., 1996. SAS System for Mixed Models. SAS Institute, North Carolina, United States. Mao, G., Seebeck, T., Schrenker, D., Yu, O., 2013. CYP709B3, a cytochrome P450 monooxygenase gene involved in salt tolerance in Arabidopsis thaliana. BMC Plant Biol. 13, 169. Nadella, K.D., Marla, S.S., Kumar, P.A., 2012. Metabolomics in agriculture. OMICS 16, 149–159. Nakabayashi, R., Yonekura-Sakakibara, K., Urano, K., Suzuki, M., Yamada, Y., Nishizawa, T., Matsuda, F., Kojima, M., Sakakibara, H., Shinozaki, K., Michael, A.J., Tohge, T., Yamazaki, M., Saito, K., 2014. Enhancement of oxidative and drought tolerance in Arabidopsis by overaccumulation of antioxidant flavonoids. Plant J. 77, 367–379. Nakamura, Y., Koizumi, R., Shui, G., Shimojima, M., Wenk, M.R., Ito, T., Ohta, H., 2009. Arabidopsis lipins mediate eukaryotic pathway of lipid metabolism and cope critically with phosphate starvation. Proc. Natl. Acad. Sci. 106, 20978–20983. Noteborn, H.P.J.M., Lommen, A., van der Jagt, R.C., Weseman, J.M., 2000. Chemical fingerprinting for the evaluation of unintended secondary metabolic changes in transgenic food crops. J. Biotechnol. 77, 103–114. Oikawa, A., Matsuda, F., Kusano, M., Okazaki, Y., Saito, K., 2008. Rice metabolomics. Rice 1, 63–71. Okazaki, Y., Saito, K., 2012. Recent advances of metabolomics in plant biotechnology. Plant Biotechnol. Rep. 6, 1–15.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, V., Passos, A., Cournapeau, D., 2011. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830. Pimenta, L.P., Kim, H.K., Verpoorte, R., Choi, Y.H., 2013. NMR-based metabolomics: a probe to utilize biodiversity. In: Metabolomics Tools for Natural Product Discovery: Methods and Protocols. Springer, Berlin, Germany, pp. 117–127. Pop, R.M., Buzoianu, A.D., Ioan, V.R., Socaciu, C., 2014. Untargeted metabolomics for Sea Buckthorn (Hippophae rhamnoides sp. Carpatica) berries and leaves: fourier transform infrared spectroscopy as a rapid approach for evaluation and discrimination. Notulae Botanicae Horti Agrobotanici Cluj-Napoca 42, 545–550. Putri, S.P., Nakayama, Y., Matsuda, F., Uchikata, T., Kobayashi, S., Matsubara, A., Fukusaki, E., 2013. Current metabolomics: practical applications. J. Biosci. Bioeng. 115, 579–589. R Core Team, 2017. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/. Rai, A., Umashankar, S., Swarup, S., 2013. Plant metabolomics: from experimental design to knowledge extraction. In: Legume Genomics: Methods and Protocols. Springer, pp. 279–312. Ram, K., 2013. Git can facilitate greater reproducibility and increased transparency in science. Source Code Biol. Med. 8, 7. Rambla, J.L., López-Gresa, M.P., Bellés, J.M., Granell, A., 2015. Metabolomic profiling of plant tissues. In: Plant Functional Genomics: Methods and Protocols. Springer, pp. 221–235. Roessner, U., Dias, D.A., 2013. Plant tissue extraction for metabolomics. In: Metabolomics Tools for Natural Product Discovery: Methods and Protocols. Springer, pp. 21–28. Ryan, D., Robards, K., 2006. Analytical chemistry considerations in plant metabolomics. Sep. Purif. Rev. 35, 319–356.

Schmitt, P., Mandel, J., Guedj, M., 2015. A comparison of six methods for missing data imputation. J. Biom. Biostat. 6, 224. Shah, J.S., Brock, G.N., Rai, S.N., 2015. Metabolomics data analysis and missing value issues with application to infarcted mouse hearts. BMC Bioinf. 16, P16. Shiratake, K., Suzuki, M., 2016. Omics studies of citrus, grape and rosaceae fruit trees. Breed. Sci. 66, 122–138. Socaciu, C., Ranga, F., Fetea, F., Leopold, L., Dulf, F., Parlog, R., 2009. Complementary advanced techniques applied for plant and food authentication. Czech J. Food Sci. 27, 70–75. Story, E.N., Kopec, R.E., Schwartz, S.J., Harris, G.K., 2010. An update on the health effects of tomato lycopene. Annu. Rev. Food Sci. Technol. 1, 189–210. Sud, M., Fahy, E., Cotter, D., Azam, K., Vadivelu, I., Burant, C., Edison, A., Fiehn, O., Higashi, R., Nair, K.S., Sumner, S., Subramaniam, S., 2015. Metabolomics workbench: an international repository for metabolomics data and metadata, metabolite standards, protocols, tutorials and training, and analysis tools. Nucleic Acids Res. 44, D463–D470. Sumner, L.W., Amberg, A., Barrett, D., Beale, M.H., Beger, R., Daykin, C.A., Fan, T.W.-M., Fiehn, O., Goodacre, R., Griffin, J.L., Hankemeier, T., Hard, N., Harnly, J., Higashi, R., Kopka, J., Lane, A.N., Lindon, J.C., Marriott, P., Nicholis, A.W., Reily, M.D., Thaden, J.J., Viant, M.R., 2007. Proposed minimum reporting standards for chemical analysis. Metabolomics 3, 211–221. Sumner, L.W., Mendes, P., Dixon, R.A., 2003. Plant metabolomics: large-scale phytochemistry in the functional genomics era. Phytochemistry 62, 817–836. Tagashira, N., Plader, W., Filipecki, M., Yin, Z., Wisniewska, A., Gaj, P., Szwacka, M., Fiehn, O., Hoshi, Y., Kondo, K., Malepszy, S., 2005. The metabolic profiles of transgenic cucumber lines vary with different chromosomal locations of the transgene. Cell. Mol. Biol. Lett. 10, 697–710. Team, R.S., 2017. RStudio: Integrated Development for R.
RStudio, Inc., Boston, MA. http://www.Rstudio.com. Tian, H., Lam, S.M., Shui, G., 2016. Metabolomics, a powerful tool for agricultural research. Int. J. Mol. Sci. 17, 1871. Tohge, T., Alseekh, S., Fernie, A.R., 2014. On the regulation and function of secondary metabolism during fruit development and ripening. J. Exp. Bot. 65, 4599–4611. Tugizimana, F., Piater, L., Dubery, I., 2013. Plant metabolomics: a new frontier in phytochemical analysis. S. Afr. J. Sci. 109, 1–11. Van Rossum, G., Drake, F.L., 2003. Python Language Reference Manual. Network Theory. Wehrens, R., Carvalho, E., Masuero, D., de Juan, A., Martens, S., 2013. High-throughput carotenoid profiling using multivariate curve resolution. Anal. Bioanal. Chem. 405, 5075–5086. Xie, Y., 2015. Dynamic Documents with R and Knitr, second ed. Chapman & Hall/CRC, Boca Raton, Florida. http://yihui.name/knitr/. Young, F.W., 2013. Multidimensional Scaling: History, Theory, and Applications. Psychology Press. Zawirska-Wojtasiak, R., Gośliński, M., Szwacka, M., Gajc-Wolska, J., Mildner-Szkudlarz, S., 2009. Aroma evaluation of transgenic, thaumatin II-producing cucumber fruits. J. Food Sci. 74, C204–C210. Zhu, M., Liu, L., Guo, M., 2016. Current advances in the metabolomics study on Lotus seeds. Front. Plant Sci. 7, 891.

Chapter 9

Application of Computation in the Biosynthesis of Phytochemicals

Nilanjan Adhikari*,†, Sk Abdul Amin*, Tarun Jha*, Achintya Saha†

*Jadavpur University, Kolkata, India
†University of Calcutta, Kolkata, India

Chapter Outline

9.1. Introduction
9.2. Genome-Mining Tools
9.3. Computational Tools and Databases for Identification and Analysis of BGCs and Secondary Metabolites
  9.3.1 BACTIBASE
  9.3.2 DoBISCUIT
  9.3.3 MIBiG
  9.3.4 IMG-ABC
  9.3.5 CluStscan Database
  9.3.6 ClusterMine360
  9.3.7 AntiSMASH
  9.3.8 SMURF
  9.3.9 BAGEL
  9.3.10 NaPDos
  9.3.11 MultiGeneBlast
  9.3.12 eSNaPD
  9.3.13 NRPSPredictor
9.4. Computational Tools for Metabolomics Study
  9.4.1 Cycloquest
  9.4.2 NRPquest
  9.4.3 RIPPquest
  9.4.4 Pep2Path
  9.4.5 GNPS
  9.4.6 DEREPLICATOR
9.5. Tools for Prediction of Biochemical Pathways
  9.5.1 From Metabolite to Metabolite
  9.5.2 Biochemical Network-Integrated Computational Explorer
  9.5.3 RetroPath
  9.5.4 DESHARKY
  9.5.5 Cho System Framework
9.6. Chemical Compound Databases
  9.6.1 Dictionary of Natural Products
  9.6.2 StreptomeDB
  9.6.3 NORINE
  9.6.4 ChEBI
  9.6.5 ChEMBL
  9.6.6 PubChem
  9.6.7 ChemSpider
9.7. Overview and Conclusions
References

Computational Phytochemistry. https://doi.org/10.1016/B978-0-12-812364-5.00009-2 © 2018 Elsevier Inc. All rights reserved.



9.1. INTRODUCTION

Plant secondary metabolites are an important source of diverse drug molecules, including antibiotics and anticancer, antimalarial, antihypertensive, lipid-lowering, anaesthetic, and immunomodulatory agents, among many others (Medema and Fischbach, 2015). These secondary metabolites comprise diverse chemical structures from a variety of chemical classes, including iridoids, limonoids, peptides, phenolics, terpenoids, and others. Without altering biodiversity, it is possible to improve the production of important plant metabolites by using ‘omics’ technologies, including genomics, metabolomics, transcriptomics, and proteomics (Rai et al., 2017). With the help of ‘omics’ technologies, unknown secondary metabolites can be identified effectively through proper identification of the biosynthetic gene clusters (BGCs) (Weber and Kim, 2016). The ‘omics’ approach has brought a renaissance in the field of natural products and the related drug discovery process, and may therefore be used industrially to produce rationally engineered secondary metabolites through biosynthetic design methodologies (Weber and Kim, 2016). Nowadays, there is great interest in identifying pharmaceutically and medicinally important natural secondary metabolites in fungi and bacteria through exploration of their biosynthetic mechanisms (Khater et al., 2016). Biosynthetically engineered secondary metabolites may be useful as potential drug molecules (Weissman and Leadlay, 2005; Helfrich et al., 2014; Xu et al., 2014). The structural and biosynthetic complexity of these natural compounds makes this process quite challenging. Therefore, understanding the biosynthetic mechanisms and subsequently systematizing them into a bioengineering framework may provide useful information for producing natural compounds of economic and medicinal importance (Du et al., 2011).
Exploration and cultivation of microorganisms may be crucial for natural product discovery strategies (Manivasagan et al., 2014; Khater et al., 2016). Moreover, the relation between secondary metabolites and BGCs helps to identify novel metabolites and to reconstruct existing biosynthetic pathways (Walsh and Fischbach, 2010; Medema and Fischbach, 2015). Knowledge of these BGCs, in turn, helps to enhance the production rate of secondary metabolites of particular interest. Forward and retro-biosynthetic methods are both used for this purpose. In the forward method, the structure of a secondary metabolite is predicted from the genomic information obtained from genes and gene clusters. The retro-biosynthetic method, in contrast, starts from a known secondary metabolite and helps to obtain information about the probable gene or gene clusters (Irschik et al., 2010). Exploration of the biosynthetic mechanisms to identify novel natural compounds has unveiled a new vista in the field of genomics-driven drug discovery (Deane and Mitchell, 2014; Helfrich et al., 2014; Milshteyn et al., 2014). In silico genomic data mining has been used successfully in the prediction and characterization of novel secondary metabolites by virtue of these biotechnological approaches (Boddy, 2014; Weber, 2014; Medema and Fischbach, 2015). Despite the huge structural diversity of natural products, the biosynthetic principles may be highly conserved. A number of enzymes are involved in the biosynthesis of numerous secondary metabolites. Therefore, information obtained from known genetic structures may be used in genome mining to understand the biosynthetic pathways of secondary metabolites (Weber and Kim, 2016). A variety of computational tools with different algorithms and design software may be used to reduce the time and cost of producing secondary plant metabolites. In this chapter, a number of computational tools, software packages, and databases are discussed in detail so that they can be used further for the identification, prediction, analysis, and biosynthetic process development of secondary metabolites. Different gene cluster tools and databases, as well as chemical library databases, have been used successfully, and linking these gene clusters with small-molecule structures through networking approaches may be used for predicting new molecules. These computational methodologies may also be helpful in the de novo prediction of biosynthetic pathways to enhance the production of secondary metabolites. Therefore, computational methodologies may contribute significantly to the processing of secondary phytochemicals in the field of synthetic biology (Fig. 9.1).

9.2. GENOME-MINING TOOLS

Genome mining may be used for the identification of biosynthetic enzymes. The tools combine different techniques, such as identification of the genomic regions of a microorganism responsible for the biosynthesis of secondary metabolites, identification and characterization of metabolic pathways, and prediction of the secondary metabolites produced through these pathways (Medema and Osbourn, 2016). Gene clusters may be identified by using BLAST or PSI-BLAST (Altschul et al., 1997; Weber and Kim, 2016). These tools use amino acid sequences of known proteins as queries. HMMer (Eddy, 2011) is another tool, in which an alignment of input query sequences is used to build hidden Markov models (HMMs). MultiGeneBlast (Medema et al., 2013), which is based on the BLAST technique, can be useful for the analysis of gene clusters. Both HMMer and BLAST may be used to identify biosynthetic genes for a variety of secondary metabolites, such as non-ribosomal peptides (NRPs) and polyketides (PKs).

[Figure 9.1: diagram with the labels Integration (Genomics, Transcriptomics, Proteomics, Metabolomics), Database analysis, Computational analysis, Data processing, Mathematical modeling, Generation of hypothesis, Hypothesis validation, Systems analysis, and Systemic biology.]

FIG. 9.1  Integrative in silico computational strategies in biosynthesis of secondary metabolites/phytochemicals.

9.3. COMPUTATIONAL TOOLS AND DATABASES FOR IDENTIFICATION AND ANALYSIS OF BGCS AND SECONDARY METABOLITES

Computational tools are useful in the identification of BGCs in genomic sequences. BLAST (Camacho et al., 2009) and HMMer (Eddy, 2011) can be used to search genomic sequences, though these are manual approaches. The available tools are based on two types of algorithm, i.e. high confidence/low novelty and low confidence/high novelty methods. The latter have been developed more recently to address the gaps left by the first category of tools (Medema and Fischbach, 2015). The first category includes several tools, such as antiSMASH (Medema et al., 2011a; Blin et al., 2013), CLUSEAN (Weber et al., 2009), SMURF (Khaldi et al., 2010), ClustScan (Starcevic et al., 2008), and NP.searcher (Li et al., 2009). These tools use well-defined queries, generated from multiple sequence alignments of known sequences, to identify signature genes similar to those of known classes of biosynthetic pathways. They identify the BGCs of a single strain from its genomic sequence with a low false-positive rate (Medema and Fischbach, 2015). Because the BGCs they detect encode known types of biosynthetic enzymes, these tools are effective for identifying BGCs of known biosynthetic pathways. However, identification of gene clusters from unknown classes should be given greater priority, as these may encode new chemical entities (Fischbach and Walsh, 2009). Identifying new classes of gene clusters requires more sophisticated computational tools. The ClusterFinder algorithm (Cimermancic et al., 2014; Medema and Fischbach, 2015) searches for broad genomic features rather than individual gene functions. It uses the Pfam database (Punta et al., 2012) to translate a genome into a long string of protein domains and then finds regions of that string whose domain content is similar to that of BGCs. Many natural biochemical pathways and systems, especially in prokaryotes and fungi, are encoded by genes that are located physically close to each other on the chromosome, in operons or gene clusters. In natural product drug discovery, the systematic exploration of large-scale genomic data is a potential discovery route.
Because resources are limited, this exploration of genomic data is largely carried out in a systematic way. Secondary metabolites of natural origin may serve as candidates for drug development. Polyketide synthases (PKSs), non-ribosomal peptide synthetases (NRPSs), and mixed clusters (containing both PKS and NRPS modules) have attracted the attention of researchers for their roles in constructing complex compounds. These modular biosynthetic clusters play a crucial role in combinatorial biosynthesis and in the heterologous expression of important products of biochemical and pharmaceutical interest. Various literature-based or knowledge-based databases describe gene clusters comprehensively.
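The idea of treating a genome as a string of protein domains and looking for stretches with BGC-like domain content can be illustrated with a deliberately simplified sketch. The code below is not the ClusterFinder algorithm: it replaces the published probabilistic model with a plain sliding window, and the domain labels, the set `BGC_DOMAINS`, the window size, and the threshold are all hypothetical choices made for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical labels standing in for Pfam domains that are typical of BGCs.
# Real Pfam identifiers and a trained model would be needed in practice.
BGC_DOMAINS: Set[str] = {"KS", "AT", "ACP", "C", "A", "PCP", "TE"}


def bgc_like_regions(domains: List[str], window: int = 4,
                     threshold: float = 0.75) -> List[Tuple[int, int]]:
    """Return merged (start, end) index ranges, end exclusive, covering every
    window in which at least `threshold` of the domains are BGC-associated."""
    regions: List[Tuple[int, int]] = []
    for start in range(len(domains) - window + 1):
        chunk = domains[start:start + window]
        fraction = sum(d in BGC_DOMAINS for d in chunk) / window
        if fraction >= threshold:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Overlapping or touching windows are merged into one region.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


# A toy "genome" written as an ordered string of domain labels.
genome = ["ABC_tran", "HTH", "KS", "AT", "ACP", "TE", "Kinase", "HTH",
          "Sigma70", "Ribosomal", "C", "A", "PCP", "TE", "HTH"]
print(bgc_like_regions(genome, window=4, threshold=1.0))
```

With a threshold of 1.0 only windows made up entirely of BGC-associated labels are reported; lowering the threshold widens the reported regions to include flanking domains, which mirrors the trade-off between confidence and novelty discussed above.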

9.3.1 BACTIBASE

BACTIBASE is a freely available database (http://bactibase.pfba-lab-tun.org) that contains sequences and other information on bacterial natural antimicrobial peptides, known as bacteriocins (Hammami et al., 2007, 2010). It has a web-based interface and requires the Java platform for visualization of phylogenetic trees of antimicrobial peptides. BACTIBASE contains calculated or predicted physicochemical properties of 230 bacteriocins produced by various Gram-positive and Gram-negative bacteria, and it is under continuous development. It provides comprehensive microbiological and physicochemical data that help in the detailed structural and functional analysis of bacteriocins and in the rapid prediction of structure/function relationships. The database includes a number of additional functions and provides curated annotation of antimicrobial peptide sequences. It also includes computational tools for homology searching, multiple sequence alignment, HMMs, and molecular modelling, as well as retrieval through a taxonomy browser (http://bactibase.pfba-lab-tun.org). Information from BACTIBASE makes it quicker and easier to predict structure/function or structure/activity relationships of peptides from target organisms, and the resulting knowledge of bioactivity may be used further in medical applications.

9.3.2 DoBISCUIT

Ichikawa et al. (2012) from the NITE Biological Resource Center (NBRC), National Institute of Technology and Evaluation (NITE), Japan, have developed DoBISCUIT (Database of BIoSynthesis clusters CUrated and InTegrated), a database of known PKS and NRPS gene clusters based on published scientific research articles (http://www.bio.nite.go.jp/pks/). This database serves as a useful tool for the analysis of secondary metabolites. Information on biosynthesis clusters and KS/A domain sequences is available from it. This easily accessible database also provides standardized gene/module/domain descriptions for these gene clusters.

9.3.3 MIBiG

Minimum Information about a Biosynthetic Gene cluster (MIBiG) is a Genomic Standards Consortium project (Medema et al., 2015) that builds on the Minimum Information about any Sequence (MIxS) framework (http://mibig.secondarymetabolites.org/index.html). MIBiG allows standardized deposition and retrieval of BGC data. It records experimental evidence and rich metadata components that guide research on the biosynthesis, chemistry, and ecology of bioactive secondary metabolites. The MIBiG data can be downloaded in a variety of formats. The sequences of the BGCs can be downloaded in GBK format, while the annotations/metadata of MIBiG entries are available in JSON format. The amino acid sequence translations of genes are available in FASTA format.
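Because several of the resources described in this chapter distribute sequences in FASTA format, a minimal reader for that format may be a useful starting point. The sketch below uses only the Python standard library; the record identifiers `gene_A` and `gene_B` and the sequences are invented for illustration and do not come from any MIBiG entry.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA-formatted lines into a {record_id: sequence} dictionary.
    The record id is the first whitespace-delimited token after '>'."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            records[current] = ""
        elif current is not None:
            # Sequence lines may be wrapped, so they are concatenated.
            records[current] += line
    return records


# Hypothetical two-record example with a wrapped sequence.
example = """>gene_A putative ketosynthase
MKTAYIAK
QRQISFVK
>gene_B putative adenylation domain
MSEQNNTE
"""
print(parse_fasta(example.splitlines()))
```

For production work an established parser (for example, from a bioinformatics library) would normally be preferred, since it also handles GBK files and malformed input.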

9.3.4 IMG-ABC

The Integrated Microbial Genomes-Atlas of Biosynthetic gene Clusters (IMG-ABC) (Hadjithomas et al., 2015, 2017) is a knowledge-based resource that integrates structural and functional genomics with annotated secondary metabolite BGCs (https://img.jgi.doe.gov/cgi-bin/abcpublic/main.cgi). This resource helps to predict biosynthetic clusters that putatively produce new secondary metabolites and provides a platform for identifying genomic elements that are important in the biosynthesis of secondary metabolites with novel structures.

9.3.5 ClustScan Database

The hierarchical organization of the ClustScan database (CSDB) allows easy extraction of the protein sequences of polypeptides as well as DNA sequences (Starcevic et al., 2008; Diminic et al., 2013). The recombinant ClustScan database (r-CSDB) contains information about predicted recombinants between PKS clusters. The recombinants are generated by modelling homologous recombination, and the model automatically produces the annotation and a prediction of the product chemistry for each of them. The database comprises more than 20,000 recombinants and is a useful resource for in silico approaches to detecting promising new compounds. Methods are available to construct the corresponding recombinants in the laboratory.

9.3.6 ClusterMine360

ClusterMine360 is an important database of microbial polyketide synthase and non-ribosomal peptide synthetase (PKS/NRPS) gene clusters (Conway and Boddy, 2012). This open-source database is available at http://www.clustermine360.ca/. ClusterMine360 stores more than 200 gene clusters from different compound families. It also provides a unique sequence repository containing more than 10,000 PKS/NRPS domains, and the sequences can be downloaded individually or as multiple sequences in FASTA format. ClusterMine360 is also useful for exploring polyketide and non-ribosomal peptide biosynthetic pathways.

9.3.7 antiSMASH

Antibiotics and Secondary Metabolite Analysis Shell (antiSMASH) is a comprehensive tool (http://antismash.secondarymetabolites.org) that identifies biosynthetic loci of known classes of secondary metabolites, such as aminoglycosides, aminocoumarins, bacteriocins, beta-lactams, butyrolactones, terpenes, polyketides, melanins, non-ribosomal peptides, indolocarbazoles, lantibiotics, nucleosides, and siderophores (Medema et al., 2011a; Weber et al., 2015; Blin et al., 2013, 2016, 2017). antiSMASH aligns the identified regions at the gene cluster level to their nearest neighbours from a database of known gene clusters. The tool also integrates or cross-links the available secondary metabolite-specific gene analysis methods.

9.3.8 SMURF

Secondary Metabolite Unique Regions Finder (SMURF) is a web-based computational tool that finds secondary metabolite biosynthesis genes and pathways in fungal genomes (Khaldi et al., 2010). It provides precomputed clusters for most sequenced fungal genomes. SMURF bases its predictions on each gene's chromosomal position and on its Pfam (protein families with their annotations and multiple sequence alignments, generated using HMMs) and TIGRFAM (protein families whose breadth is tuned to serve the needs of genome annotation) domain content.

9.3.9 BAGEL

BAGEL is a web-based bacteriocin-mining tool that combines direct mining for the gene with indirect mining via context genes (de Jong et al., 2006, 2010). It is freely accessible at http://bagel.molgenrug.nl/. The tool checks whether the genomic data of interest contain genes encoding bacteriocins or other bacterial ribosomally synthesized and post-translationally modified peptides (RiPPs). The latest version, BAGEL3, accepts DNA sequences in FASTA format (a format for representing either nucleotide or peptide sequences) or folders containing multiple FASTA-formatted files (van Heel et al., 2013). BAGEL3 has been extended to cover a broad range of post-translationally modified peptides and allows mining of prokaryotic genomes independently of open reading frame (ORF) predictions. It includes a new identification approach that combines direct mining for the gene with indirect mining via context genes. Overall, BAGEL3 provides versatility in fast genome mining for modified and non-modified bacteriocins as well as non-bactericidal RiPPs (van Heel et al., 2013).

9.3.10 NaPDoS

Natural Product Domain Seeker (NaPDoS) is used for the rapid detection and analysis of secondary metabolite genes (Ziemert et al., 2012). It detects and extracts C- and KS-domains from amino acid or DNA sequences. In NaPDoS, secondary metabolite domains are identified by sequence comparison with manually curated reference gene sets. The candidate gene sequences are extracted, trimmed, translated, and subjected to domain-specific phylogenetic clustering to predict their putative products.

9.3.11 MultiGeneBlast

MultiGeneBlast is used to identify homologues of multigene modules (e.g., operons and gene clusters). This open-source tool is based on a reformatting of the FASTA headers of National Center for Biotechnology Information (NCBI) GenBank protein entries (Medema et al., 2013). Both a graphical user interface and a command-line version are available. It runs Basic Local Alignment Search Tool (BLAST) searches with multiple predicted proteins/genes and subsequently maps their hits onto their parent nucleotide sequences. It characterizes all homologous genomic regions by compiling the results of single BlastP runs on each gene and sorting genomic regions from any GenBank entry by the number of hits, synteny conservation, and cumulative Blast bit score. MultiGeneBlast has a homology search mode and an architecture search mode. The homology search mode finds operons or gene clusters homologous to a known operon or gene cluster. In addition to this search, which is similar to the algorithm of the antiSMASH tool, MultiGeneBlast can perform an architecture search, which finds novel genomic loci that contain a certain user-specified combination of genes.
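The sorting step described above, in which candidate genomic regions are ordered by number of hits, synteny conservation, and cumulative bit score, can be sketched as a simple multi-key sort. This is an illustration only: the `RegionHit` record, its field names, and the strict lexicographic priority of the three criteria are assumptions made here, and the way MultiGeneBlast actually combines these quantities may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RegionHit:
    """Hypothetical summary of one genomic region after per-gene BlastP runs."""
    region_id: str
    n_hits: int                 # number of query genes with a hit in the region
    synteny_score: int          # assumed measure of conserved gene order
    cumulative_bitscore: float  # sum of the BLAST bit scores of the hits


def rank_regions(regions: List[RegionHit]) -> List[RegionHit]:
    """Sort regions best-first: more hits, then higher synteny, then bit score."""
    return sorted(
        regions,
        key=lambda r: (r.n_hits, r.synteny_score, r.cumulative_bitscore),
        reverse=True,
    )


# Invented example values for three candidate regions.
candidates = [
    RegionHit("region_1", n_hits=4, synteny_score=2, cumulative_bitscore=950.0),
    RegionHit("region_2", n_hits=5, synteny_score=1, cumulative_bitscore=800.0),
    RegionHit("region_3", n_hits=4, synteny_score=2, cumulative_bitscore=1200.0),
]
print([r.region_id for r in rank_regions(candidates)])
```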

9.3.12 eSNaPD

Environmental Surveyor of Natural Product Diversity (eSNaPD) is an automated analysis tool that estimates the diversity of the natural product gene clusters in a sample (Reddy et al., 2014). It is helpful for surveying secondary metabolite gene cluster diversity in crude metagenomic samples. By relating sequence information to a database of characterized known molecules (CKM), eSNaPD identifies and predicts gene clusters that may encode novel natural products. Moreover, eSNaPD allows a broader view of natural product diversity across different geochemical environments. The latest version of eSNaPD allows more detailed searches with any conserved biosynthetic domain sequence and all functionally characterized gene clusters, and provides dendrograms of the relationships of CKM domains to library sequences. In the current version of eSNaPD, the data are visualized through Molecule Explorer, Arrayed Library Explorer, New Clades Explorer, and Google world Map Explorer, the last of which shows the soil sample locations.

9.3.13 NRPSpredictor

NRPSpredictor is a web-based tool for the prediction of NRPS adenylation domain specificity (Rausch et al., 2005; Röttig et al., 2011). The specificity of an adenylation (A) domain can be predicted from its sequence or sequence signature by machine learning methods. The current version, NRPSpredictor2, has better prediction capability than the original NRPSpredictor. NRPSpredictor2 predicts A-domain specificity with support vector machines (SVMs) on four hierarchical levels, and it additionally predicts the specificity of fungal A-domains. The service can be accessed at http://nrps.informatik.uni-tuebingen.de/.

9.4. COMPUTATIONAL TOOLS FOR METABOLOMICS STUDY

Metabolomics (see Chapter 8) allows the identification and quantification of metabolites and has a variety of uses. The field is of paramount importance in natural products and drug discovery processes (Booth et al., 2013). A variety of computational tools and databases help in the analysis and interpretation of metabolomics data. Biochemical databases, linked to metabolomics tools, may be helpful in understanding biochemical pathways and other metabolomics studies. These computational tools support the characterization and ranking of biochemical pathways and an in-depth understanding of subsequent metabolite mapping studies, and thereby contribute to natural product drug discovery.

9.4.1 Cycloquest

Cycloquest is the first reported database search algorithm for the identification of cyclopeptides by mass spectrometry (MS). It is an open-source tool available at http://proteomics.ucsd.edu. Mohimani et al. (2011) designed the Cycloquest database search methodology and reported it in detail. To validate the search strategy, they used sporulation killing factor (SKF) from Bacillus subtilis, Rhesus θ-defensin (RTD-1) from Rhesus macaque, and sunflower trypsin inhibitor-1 (SFTI-1) and SFTI-like 1 (SFT-L1) from Helianthus annuus.
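A database search for cyclopeptides has to compare an observed spectrum with the masses expected from each candidate cyclic sequence. The sketch below shows one textbook way to enumerate such expected masses, namely the summed residue masses of all contiguous sub-peptides of a cyclic sequence, together with a naive shared-peak count. It is not the Cycloquest scoring scheme: nominal integer residue masses are used for readability, only a few amino acids are included, and real spectra involve charge states, neutral losses, and noise that are ignored here.

```python
from collections import Counter
from typing import List

# Nominal (integer) residue masses in Da for a small subset of amino acids.
NOMINAL_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "L": 113}


def cyclic_spectrum(peptide: str) -> List[int]:
    """Masses of every contiguous sub-peptide of a cyclic peptide,
    plus 0 and the mass of the whole ring, in ascending order."""
    n = len(peptide)
    masses = [NOMINAL_MASS[aa] for aa in peptide]
    spectrum = [0, sum(masses)]
    for length in range(1, n):
        for start in range(n):
            # Indices wrap around because the peptide is cyclic.
            spectrum.append(sum(masses[(start + k) % n] for k in range(length)))
    return sorted(spectrum)


def shared_peaks(observed: List[int], theoretical: List[int]) -> int:
    """Count peaks common to both lists (multiset intersection)."""
    return sum((Counter(observed) & Counter(theoretical)).values())


theoretical = cyclic_spectrum("GAS")
print(theoretical)
print(shared_peaks([57, 71, 128, 215, 300], theoretical))
```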

9.4.2 NRPquest

The non-ribosomal peptide (NRP) identification algorithm NRPquest is a genome-mining technique for non-ribosomal peptide discovery (Mohimani et al., 2014a). It integrates MS with genome mining, transforming the de novo sequencing of NRPs into an MS/MS database search in order to identify NRPs.

9.4.3 RiPPquest

RiPPquest is a tandem mass spectrometry database search tool for identifying ribosomally synthesized and post-translationally modified peptides (RiPPs), which are important bioactive natural products. It may also be applied in the discovery of lanthipeptides. The tool is available at www.cyclo.ucsd.edu. It uses genomics to restrict the search to the vicinity of RiPP biosynthetic genes and proteomics to search for extensive peptide modifications, and it then computes p-values for peptide-spectrum matches (PSMs).

9.4.4 Pep2Path

Pep2Path is a Python-based tool for connecting predicted peptide sequences (for RiPPs and NRPs), deduced from genomic data, to the peptide mass fragments observed in MS/MS studies.

9.4.5 GNPS

Global Natural Products Social Molecular Networking (GNPS) is an open-source, community-driven portal (http://gnps.ucsd.edu) for sharing raw, processed, or identified tandem mass spectrometry (MS/MS) data (Mohimani et al., 2014b). This knowledge-based tool allows compound dereplication (see Chapter 5) and identification based on MS/MS networking. It is developed and maintained by the Dorrestein lab and colleagues at the University of California, San Diego, La Jolla, California, USA.

9.4.6 Dereplicator

Dereplicator identifies known natural products from LC/MS data by using the dereplicator algorithm (Mohimani et al., 2017). This dereplication algorithm provides high-throughput identification of peptidic natural products (PNPs) and is suited to large-scale mass-spectrometry-based screening platforms.

9.5. TOOLS FOR PREDICTION OF BIOCHEMICAL PATHWAYS

For the commercial production of natural compounds, the synthetic design of biochemical pathways is of utmost importance (Walsh and Fischbach, 2010; Medema et al., 2011b; Medema et al., 2012). These pathways may be designed through in silico techniques and subsequently optimized and engineered in specific host systems (Prather and Martin, 2008; Martin et al., 2009; Medema et al., 2012). This approach may be beneficial for the cost-effective production of target secondary metabolites (Medema et al., 2012). The following sections describe the computational tools, software, and databases that have been used successfully for biochemical pathway prediction.

9.5.1 From Metabolite to Metabolite

From metabolite to metabolite (FMM) is a user-friendly web server (http://fmm.mbc.nctu.edu.tw/) for metabolic pathway identification. Through its search options, it identifies possible pathways between an input and an output compound (Cho et al., 2010). It combines the features of Kyoto Encyclopaedia of Genes and Genomes (KEGG) ligands and KEGG maps to identify specific genes that provide information about different pathways (Chou et al., 2009). It provides information about the corresponding genes and organisms, and the different pathways returned as output can be compared. The KEGG maps help to identify all the probable reactions along with the maps. The pathway comprising most of the paths is selected, and a pathway having only one reaction is discarded. In this way, FMM helps to design a metabolic pathway from different KEGG maps (Chou et al., 2009). FMM may be useful for producing an output assembled from different pathways, and it provides a possible metabolic route to a specific target compound (Mienda and Shamsir, 2015). However, the FMM web server has several disadvantages. The pathways characterized are restricted to the KEGG database (Kanehisa et al., 2012), and the server may not provide detailed insight into the thermodynamic feasibility of those pathways (Chou et al., 2009; Medema et al., 2012). It may therefore be useful for preliminary studies and for an overview of probable metabolic pathways, but the method requires further optimization and validation before it is applied to metabolic pathway identification.
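Finding possible routes between an input and an output compound can be pictured as path enumeration in a directed graph whose nodes are metabolites and whose edges are reactions. The sketch below is a generic depth-first enumeration of simple paths, not the FMM implementation; the metabolite labels `M1` to `M5`, the toy network, and the `max_steps` limit are hypothetical.

```python
from typing import Dict, List


def find_pathways(graph: Dict[str, List[str]], source: str, target: str,
                  max_steps: int = 6) -> List[List[str]]:
    """Enumerate all simple paths from source to target with at most
    `max_steps` reactions, shortest first."""
    paths: List[List[str]] = []

    def extend(path: List[str]) -> None:
        node = path[-1]
        if node == target:
            paths.append(list(path))
            return
        if len(path) - 1 >= max_steps:
            return  # reaction limit reached without finding the target
        for nxt in graph.get(node, []):
            if nxt not in path:  # avoid cycles
                path.append(nxt)
                extend(path)
                path.pop()

    extend([source])
    return sorted(paths, key=len)


# Toy reaction network: each key can be converted into the listed products.
network = {
    "M1": ["M2", "M3"],
    "M2": ["M4"],
    "M3": ["M4", "M5"],
    "M4": ["M5"],
}
for route in find_pathways(network, "M1", "M5"):
    print(" -> ".join(route))
```

In a real setting the edges would be derived from a reaction database, and, as noted above, path length alone says nothing about thermodynamic feasibility.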

9.5.2 Biochemical Network-Integrated Computational Explorer

The biochemical network-integrated computational explorer (BNICE) predicts novel metabolic pathways on the basis of enzyme classification (Hatzimanikatis et al., 2005). It predicts not only known pathways but also chemically possible unknown pathways, along with their thermodynamic properties (Hatzimanikatis et al., 2005). The generated pathways are not restricted to a single database. BNICE may be used to identify a number of metabolic pathways, including polyketide and isoprenoid pathways, for the production of a huge diversity of chemical compounds (Khosla and Keasling, 2003). It allows the identification of new routes to valuable biosynthetic compounds starting from known sources (Hatzimanikatis et al., 2005). Starting from the chosen starting material, it searches for reactions, using enzymes and their reaction mechanisms, in single or multiple pathways (Medema et al., 2012). The tool takes into account compounds included in a variety of chemical and biological databases and proposes novel biochemical pathways to compounds that have not yet been synthesized or identified through engineered pathways (Hatzimanikatis et al., 2005). A BNICE search for metabolic pathways is specified by the starting materials, the pathway length, and the range of reactions (Henry et al., 2010; Medema et al., 2012). With the BNICE framework, it is possible to search for a single known pathway including its enzyme reactions, for multiple pathways, or for the total metabolic route (Hatzimanikatis et al., 2005; Henry et al., 2010). It may be useful for selecting probable pathways, but further analysis is required to obtain meaningful results. Although it is an option for selecting efficient pathways, it is limited to analysis of the first-step reactions (Medema et al., 2012).
Moreover, the BNICE framework can be difficult to use, as it may predict more than 10,000 pathways for the biosynthesis and metabolic degradation of compounds simultaneously, which makes it difficult to choose the appropriate pathway. Ranking these pathways through pathway prioritization criteria may therefore be useful for obtaining a meaningful outcome (Mienda and Shamsir, 2015). Pathway prioritization is built into the BNICE framework through four different criteria: length of the pathway, thermodynamic feasibility of the pathway, maximum activity, and highest yield (Henry et al., 2010). BNICE is based on the molecular bond-electron matrix (BEM), in which the atoms of a molecule are represented by rows and columns, together with a graph-based matrix representation of enzyme reaction rules and biochemical compounds. The diagonal elements of a BEM give the non-bonded valence electrons of each atom, whereas the off-diagonal elements represent the connectivity between atoms and the bond orders (Hatzimanikatis et al., 2005). BNICE also encodes the enzymatic reactions of every enzyme class, represented by molecular fragments. A set of molecules is provided as input, and each molecule is then checked for whether it matches the reactions of particular enzyme classes. The enzymatic reaction matrix is added to the BEM of the substrate, and the resulting BEM describes the reaction products (Hatzimanikatis et al., 2005). BNICE may also be useful in other computational applications, including the retro-biosynthesis of metabolic pathways (Bachmann, 2010). The application of BNICE to the design of host strains is of enormous importance for the bioprocessing of chemicals (Mienda and Shamsir, 2015).
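The bond-electron matrix bookkeeping described above can be made concrete with a small worked example. The sketch below uses the textbook isomerization of hydrogen cyanide (H-C≡N) to hydrogen isocyanide (H-N≡C) purely to illustrate the formalism; it is not an enzymatic rule taken from BNICE, and the function names are illustrative. Atoms are ordered H, C, N; diagonal entries hold non-bonded valence electrons and off-diagonal entries hold bond orders.

```python
from typing import List

Matrix = List[List[int]]


def apply_reaction(bem: Matrix, reaction: Matrix) -> Matrix:
    """Add a reaction matrix to a substrate BEM and return the product BEM.
    The result must stay symmetric with no negative entries."""
    n = len(bem)
    product = [[bem[i][j] + reaction[i][j] for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            if product[i][j] < 0 or product[i][j] != product[j][i]:
                raise ValueError("reaction does not give a valid BEM")
    return product


def valence_electrons(bem: Matrix) -> int:
    """Total valence electrons: free electrons on the diagonal plus two
    electrons per unit of bond order (each bond appears twice off-diagonal)."""
    return sum(sum(row) for row in bem)


# Atom order: H, C, N.
hcn = [[0, 1, 0],   # H: single bond to C
       [1, 0, 3],   # C: single bond to H, triple bond to N
       [0, 3, 2]]   # N: triple bond to C, one lone pair

# Reaction matrix: break H-C, form H-N, move the lone pair from N to C.
rearrangement = [[0, -1, 1],
                 [-1, 2, 0],
                 [1, 0, -2]]

hnc = apply_reaction(hcn, rearrangement)
print(hnc)
print(valence_electrons(hcn), valence_electrons(hnc))
```

The entries of the reaction matrix sum to zero, so the total number of valence electrons is the same before and after the transformation.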

9.5.3 RetroPath

RetroPath is a web server that offers a retro-synthetic technique for identifying feasible biosynthetic pathways to a target chemical (Carbonell et al., 2011). It is considered more effective than BNICE because RetroPath retains only the top-ranking pathways. Starting from the target molecule, RetroPath applies reversed enzyme-catalyzed reactions to identify precursors that are endogenous to the host (Carbonell et al., 2011). It is unique in designing and characterizing metabolic pathways, reactions, and products. The molecular signature used by RetroPath can be computed at different heights, and varying this height may be used to reduce the number of reactions as well as the complexity (Carbonell et al., 2011). It is linked to different databases, such as KEGG (Kanehisa et al., 2012), which combines systemic and chemical information with genomic data. RetroPath may generate promising data for the heterologous pathway-guided design of new compounds by applying several criteria, including optimization of the desired biological system, standardization, and optimization of experimental validation. Through a streamlined methodology, retro-synthesis may successfully predict routes to higher-yield production along with the problems associated with them (Carbonell et al., 2013). Retro-synthetic heterologous pathway design follows several steps: host selection, model selection from databases, definition of the metabolic space, gene selection, and finally evaluation of the metabolic yield (Carbonell et al., 2013; Mienda and Shamsir, 2015). The COnstraint-Based Reconstruction and Analysis (COBRA) toolbox (Rocha et al., 2010) and the COmplex PAthway SImulator (COPASI) software (Hoops et al., 2006; Schaber, 2012) may be useful for estimating metabolic products.
COBRA is a MATLAB-based software package for the quantitative prediction of the behaviour of cellular and multicellular metabolic networks by constraint-based analysis (Becker et al., 2007). Moreover, the toxicity of metabolites in the retro-synthetic metabolic pathway may also be predicted (Planson et al., 2012). COPASI is used to simulate and analyse metabolic networks and their dynamics (Hoops et al., 2006).

9.5.4 DESHARKY

DESHARKY is a pathway prediction tool that explores enzymatic reactions with a unique search algorithm. It considers all the probable pathways and helps to connect them to the metabolism of the host organism. If the host organism has already been identified, DESHARKY may be the tool of choice for designing a metabolic pathway that produces a given compound in that host (Medema et al., 2012). It uses a Monte Carlo-based heuristic algorithm to identify probable routes linking the target compound to the host metabolism (Rodrigo et al., 2008), and it provides a biochemical route that encompasses host metabolism together with mathematical models of cellular metabolism (Rodrigo et al., 2008). It can also estimate the energetic cost associated with transcription and translation. The input is the target compound and the output is the engineered metabolic pathway, along with estimates for transcription and translation (Rodrigo et al., 2008). DESHARKY is also useful for designing strategies for the bioproduction and biodegradation of metabolites that have important applications in medicinal chemistry and the chemical industry, such as cancer chemotherapy, reduction of cholesterol, and production of synthetic polyesters and nylons (Rodrigo et al., 2008).

9.5.5 Cho System Framework

Drawing on integrated information on structural diversity and on the reaction mechanisms stored in a system database, the Cho system framework selects suitable enzymes for the biosynthesis of desired chemicals in host organisms (Cho et al., 2010). The framework ranks the enzymes through a unique prioritization-based scoring algorithm. It is useful for identifying the enzymatic reactions of a pathway that leads to the intended chemicals in microbial production hosts. The prioritization method uses five criteria: chemical similarity, specificity of the organism, pathway distance, thermodynamic favourability, and binding site covalence. On the basis of these criteria, the priority score of three groups may be estimated (Cho et al., 2010). This framework may be applied to improve the production of biofuels and other industrial compounds (Mienda and Shamsir, 2015).
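A prioritization scheme that ranks candidate enzymes on several criteria can be sketched as a weighted average of per-criterion scores. The code below is only an illustration of that general idea. The actual scoring algorithm of Cho et al. (2010) is not reproduced here; the assumption that every criterion is pre-scaled to the range 0 to 1, the equal default weights, and the enzyme names are all hypothetical.

```python
from typing import Dict, List, Optional

# The five criteria named in the text, used here as dictionary keys.
CRITERIA = (
    "chemical_similarity",
    "organism_specificity",
    "pathway_distance",
    "thermodynamic_favourability",
    "binding_site_covalence",
)


def priority_score(scores: Dict[str, float],
                   weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted mean of the criterion scores (each assumed to lie in [0, 1],
    with higher meaning more favourable)."""
    if weights is None:
        weights = {c: 1.0 for c in CRITERIA}  # assumed equal weighting
    total_weight = sum(weights[c] for c in CRITERIA)
    return sum(weights[c] * scores[c] for c in CRITERIA) / total_weight


def rank_enzymes(candidates: Dict[str, Dict[str, float]]) -> List[str]:
    """Return enzyme names ordered from highest to lowest priority score."""
    return sorted(candidates, key=lambda name: priority_score(candidates[name]),
                  reverse=True)


# Invented scores for two hypothetical candidate enzymes.
enzymes = {
    "enzyme_X": {c: 0.8 for c in CRITERIA},
    "enzyme_Y": {"chemical_similarity": 1.0, "organism_specificity": 0.5,
                 "pathway_distance": 0.5, "thermodynamic_favourability": 0.5,
                 "binding_site_covalence": 0.5},
}
print(rank_enzymes(enzymes))
```

Changing the weights shifts the ranking towards whichever criterion matters most for a given host or product, which is the main design choice in any scheme of this kind.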

9.6. CHEMICAL COMPOUND DATABASES
A chemical compound database stores a wide range of information on diverse chemicals. Apart from structural properties, this mainly comprises spectroscopic and crystallographic data, physicochemical properties, thermophysical data, reactions, and, where available, medicinal and toxicological properties. Chemical structures may be represented as adjacency matrices, in file formats such as MDL Molfile, PDB, and CML, or in line notations such as SMILES, SMARTS, SLN, and InChI. These chemical databases are of immense importance in the field of natural products and related drug discovery processes. The results of structural queries against these databases may be used to design synthetic or semisynthetic compounds related to naturally occurring secondary metabolites.
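As a small illustration of the adjacency-matrix representation mentioned above, the following Python sketch builds the matrix for the heavy atoms of ethanol (SMILES `CCO`) from a hand-written bond list. The function name and the bond-list layout are assumptions made for this example; cheminformatics toolkits derive such matrices directly from structure files.

```python
def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric matrix whose entry (i, j) holds the bond order, 0 if unbonded."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j, order in bonds:
        matrix[i][j] = order
        matrix[j][i] = order
    return matrix


# Heavy atoms of ethanol, written as SMILES "CCO": C0-C1-O2.
atoms = ["C", "C", "O"]
bonds = [(0, 1, 1), (1, 2, 1)]  # (atom index, atom index, bond order)

ethanol = adjacency_matrix(len(atoms), bonds)

# Number of bonded heavy-atom neighbours per atom, read off the matrix rows.
degrees = [sum(1 for entry in row if entry) for row in ethanol]

for symbol, row in zip(atoms, ethanol):
    print(symbol, row)
```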

9.6.1 Dictionary of Natural Products
The Dictionary of Natural Products is a comprehensive structured database of more than 290,000 compounds of natural origin and is one of the most useful and comprehensive sources of natural products information. It can be accessed online at http://dnp.chemnetbase.com. However, it is not free and requires purchase or subscription.

9.6.2 StreptomeDB
StreptomeDB is described as the first and largest database dedicated to natural products isolated from Streptomyces (Lucas et al., 2013). It provides the names, source organisms, molecular structures, biological activities, and synthetic routes of the compounds, together with references. Data can be accessed via http://www.pharmaceutical-bioinformatics.org/streptomedb by querying compound names or annotations such as structures and source organisms. The newer version, StreptomeDB 2.0 (Klementz et al., 2016), provides a scaffold-based navigation system that helps to explore the chemical diversity of the database. StreptomeDB 2.0 also includes a phylogenetic tree based on 16S rRNA sequences, covering about two thirds of the roughly 2500 host organisms, which allows visualization of the appearance, frequency, and persistence of compounds.

9.6.3 Norine
Norine is a platform built around a database of 1186 non-ribosomal peptides (NRPs) (Caboche et al., 2008), accessible at http://bioinfo.lifl.fr/norine/. It provides computational tools for the systematic study of NRPs across numerous species. For each peptide, the database provides structural information and annotations such as source organisms, biological properties, and references. Data can be retrieved by querying the annotations or the monomeric structures of the peptides.
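A query by monomeric structure can be pictured as a search over monomer compositions. The sketch below is a simplified, assumed model in Python: peptides are stored as plain monomer lists, and a query matches when the peptide contains at least the requested number of each monomer. The records are hypothetical and the code does not reproduce Norine's actual search, which also takes the peptide topology into account.

```python
from collections import Counter

# Hypothetical records: peptide name -> list of monomer codes.
PEPTIDES = {
    "peptide_A": ["Val", "Orn", "Leu", "D-Phe", "Pro"],
    "peptide_B": ["Leu", "Leu", "Val", "Thr"],
    "peptide_C": ["Orn", "Pro", "Thr"],
}


def contains_monomers(monomers, query):
    """True if the peptide has at least the requested count of each queried monomer."""
    have, need = Counter(monomers), Counter(query)
    return all(have[monomer] >= count for monomer, count in need.items())


def search(query):
    """Return the names of all peptides whose composition satisfies the query."""
    return sorted(
        name for name, monomers in PEPTIDES.items()
        if contains_monomers(monomers, query)
    )


print(search(["Leu", "Val"]))
print(search(["Leu", "Leu"]))
```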

9.6.4 ChEBI
Chemical Entities of Biological Interest (ChEBI) is a database of molecular entities of biological interest (Degtyarenko et al., 2008, 2009; Hastings et al., 2013, 2016). It is freely available through a web-based interface at http://www.ebi.ac.uk/chebi. Currently, more than 46,000 entries are included in this database. Each molecular entity is classified within the ontology and assigned annotations such as synonyms, chemical structure, and database cross-references. ChEBI includes an ontological classification in which the relationships between molecular entities or classes of entities and their ancestors and/or siblings are specified. Recently, a web-based enrichment analysis tool, BiNChE (http://www.ebi.ac.uk/chebi/tools/binche/BiNChE), and an ontology query tool, OntoQuery (http://www.ebi.ac.uk/chebi/tools/ontoquery), were added to the ChEBI platform.
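The notions of ancestors and siblings in an ontology can be illustrated with a small "is a" graph. The hierarchy in the following Python sketch is a simplified, hand-written assumption and does not reproduce actual ChEBI identifiers or relations; it only shows how the two kinds of relationship are derived from parent links.

```python
# Simplified, hypothetical "is_a" hierarchy: term -> list of direct parents.
IS_A = {
    "quercetin": ["flavonol"],
    "kaempferol": ["flavonol"],
    "flavonol": ["flavonoid"],
    "flavonoid": ["polyphenol"],
    "polyphenol": [],
}


def ancestors(term):
    """All classes reachable by following is_a links upwards from the term."""
    seen, stack = set(), list(IS_A.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(IS_A.get(parent, []))
    return seen


def siblings(term):
    """Other terms that share at least one direct parent with the term."""
    parents = set(IS_A.get(term, []))
    return {
        other for other, other_parents in IS_A.items()
        if other != term and parents & set(other_parents)
    }


print(sorted(ancestors("quercetin")))
print(sorted(siblings("quercetin")))
```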

9.6.5 ChEMBL
ChEMBL is a freely available database that contains binding, functional, and ADMET information for a large set of biologically active drug-like compounds (Gaulton et al., 2012; Bento et al., 2014; Davies et al., 2015). It is a well-established resource for medicinal chemistry research and drug discovery. The ChEMBL database curates and stores molecule names, their targets, and standardized bioactivity data from the primary medicinal chemistry literature. The platform can be accessed, and its data downloaded, from the web server at https://www.ebi.ac.uk/chembldb.

9.6.6 PubChem
PubChem is a well-known public repository of small molecules and their biological activities hosted by the US National Institutes of Health (NIH) (Wang et al., 2009; Li et al., 2010; Kim et al., 2016). It is a component of the NIH Molecular Libraries Roadmap Initiative. For the past 13 years, PubChem has served as a chemical information resource for the scientific communities in chemical biology, cheminformatics, medicinal chemistry, and drug discovery. Three interlinked databases, namely Substance, Compound, and BioAssay, are the key components of PubChem (Kim et al., 2016). The Substance database holds the chemical information deposited by data contributors, whereas the Compound database consists of the unique chemical structures derived from it. The BioAssay database stores the biological activity data of chemical substances (Kim et al., 2016). This web-based service enables rapid data retrieval and integration, the comparison of bioassay results to build structure-activity relationships, and the examination of target selectivity. It can be accessed at http://pubchem.ncbi.nlm.nih.gov/assay/.
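Besides the web pages, PubChem data can be retrieved programmatically. The Python sketch below only builds a request URL and sends nothing; it assumes the PUG REST pattern `compound/name/<name>/property/<properties>/<format>` and the property names `MolecularFormula` and `InChIKey`, which should be checked against the current PubChem documentation before use. The function name is hypothetical.

```python
from urllib.parse import quote

# Base address of PubChem's PUG REST service (assumed; verify before use).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def compound_property_url(name, properties, fmt="JSON"):
    """Build a URL that asks for selected properties of a compound given by name."""
    props = ",".join(properties)
    # quote() percent-encodes spaces and other unsafe characters in the name.
    return f"{PUG_REST}/compound/name/{quote(name)}/property/{props}/{fmt}"


url = compound_property_url("quercetin", ["MolecularFormula", "InChIKey"])
print(url)
```

The resulting URL can be passed to any HTTP client; the response format and the usage limits are defined by the service, not by this sketch.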

9.6.7 ChemSpider
ChemSpider is a chemical structure database containing information on more than 58 million chemicals (http://www.chemspider.com) (Pence, 2010). It is an important resource not only for teaching and learning, but also for research (Pence, 2010). It supports searching for chemical structures across numerous data sources. Compounds may be searched by name using different types of identifiers, including synonyms, trade names, systematic names, and database identifiers. They may also be searched through structure-based queries, with the structure drawn on the webpage or imported from other sources. ChemSpider compiles more than 2000 spectroscopic data records and is useful in predicting chemical reactions (Pence, 2010). Apart from the chemical structure, the physical properties, interactive spectra, chemical suppliers, and literature references for a particular compound may also be retrieved. ChemSpider web services allow programmatic access to many of these compounds and their properties through application programming interfaces (APIs), which also helps to enrich data workflow tools such as Pipeline Pilot and KNIME. Information gathered from ChemSpider may be used to identify or correlate known and unknown chemical entities, which may then be designed or synthesized as part of future drug discovery strategies.

9.7. OVERVIEW AND CONCLUSIONS
As far as the production of secondary metabolites is concerned, 'omics'-based techniques together with genome-mining strategies are of considerable value for natural product and phytochemical discovery. A number of bioinformatics tools and databases have enriched this drug discovery field and, in turn, provide a link between experimental and computational synthetic biology. These computational tools and databases may provide valuable information and shed light on the identification and characterization of desired secondary metabolites. Synthetic biology, combined with bioinformatics techniques, will therefore benefit greatly from the application of these computational tools and methodologies for the effective production of secondary metabolites of economic and therapeutic interest through engineered biosynthetic pathways. The gene clusters of natural products help not only in understanding their biosynthesis, but also in devising design strategies for natural product drug discovery. Computational tools and the related in silico applications are of immense importance for the identification and experimental validation of numerous secondary metabolites of therapeutic interest. These tools and methodologies use newer bioinformatics approaches to analyse complex enzymatic reaction mechanisms in silico, along with the chemical structures of phytochemical secondary metabolites. Moreover, these computational strategies may provide useful information about the correlation between the chemical and structural diversity of secondary metabolites obtained through in silico bioinformatics approaches, which will further help to rationalize phytochemical production experimentally.


REFERENCES
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402.
Bachmann, B.O., 2010. Biosynthesis: is it time to go retro? Nat. Chem. Biol. 6, 390–393.
Becker, S.A., Feist, A.M., Mo, M.L., Hannum, G., Palsson, B.Ø., Herrgard, M.J., 2007. Quantitative prediction of cellular metabolism with constraint-based models: the COBRA Toolbox. Nat. Protoc. 2, 727–738.
Bento, A.P., Gaulton, A., Hersey, A., Bellis, L.J., Chambers, J., Davies, M., Krüger, F.A., Light, Y., Mak, L., McGlinchey, S., Nowotka, M., Papadatos, G., Santos, R., Overington, J.P., 2014. The ChEMBL bioactivity database: an update. Nucleic Acids Res. 42, D1083–D1090.
Blin, K., Medema, M.H., Kazempour, D., Fischbach, M.A., Breitling, R., Takano, E., Weber, T., 2013. antiSMASH 2.0—a versatile platform for genome mining of secondary metabolite producers. Nucleic Acids Res. 41, W204–W212.
Blin, K., Medema, M.H., Kottmann, R., Lee, S.Y., Weber, T., 2016. The antiSMASH database, a comprehensive database of microbial secondary metabolite biosynthetic gene clusters. Nucleic Acids Res. 45 (D1), D555–D559.
Blin, K., Wolf, T., Chevrette, M.G., Lu, X., Schwalen, C.J., Kautsar, S.A., Duran, S., Hernando, G., de los Santos, E.L., Kim, H.U., Nave, M., 2017. antiSMASH 4.0-improvements in chemistry prediction and gene cluster boundary identification. Nucleic Acids Res. https://doi.org/10.1093/nar/gkx319.
Boddy, C.N., 2014. Bioinformatics tools for genome mining of polyketide and non-ribosomal peptides. J. Ind. Microbiol. Biotechnol. 41, 443–450.
Booth, S.C., Weljie, A.M., Turner, R.J., 2013. Computational tools for the secondary analysis of metabolomics experiments. Comput. Struct. Biotechnol. J. 4, e201301003.
Caboche, S., Pupin, M., Leclère, V., Fontaine, A., Jacques, P., Kucherov, G., 2008. NORINE: a database of nonribosomal peptides. Nucleic Acids Res. 36, D326–D331.
Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., Madden, T.L., 2009. BLAST+: architecture and applications. BMC Bioinf. 10, 421.
Carbonell, P., Planson, A.G., Fichera, D., Faulon, J.L., 2011. A retrosynthetic biology approach to metabolic pathway design for therapeutic production. BMC Syst. Biol. 5, 122.
Carbonell, P., Planson, A.G., Faulon, J.L., 2013. Retrosynthetic design of heterologous pathways. In: Methods in Molecular Biology. Springer Science Business Media, LLC, pp. 149–173.
Cho, A., Yun, H., Park, J.H., Lee, S.Y., Park, S., 2010. Prediction of novel synthetic pathways for the production of desired chemicals. BMC Syst. Biol. 4, 35.
Chou, C., Chang, W., Chiu, C., Huang, C., Huang, H., 2009. FMM: a web server for metabolic pathway reconstruction and comparative analysis. Nucleic Acids Res. 37, W129–W134.
Cimermancic, P., Medema, M.H., Claesen, J., Kurita, K., Brown, L.C.W., Mavrommatis, K., Pati, A., Godfrey, P.A., Koehrsen, M., Clardy, J., Birren, B.W., 2014. Insights into secondary metabolism from a global analysis of prokaryotic biosynthetic gene clusters. Cell 158, 412–421.
Conway, K.R., Boddy, C.N., 2012. ClusterMine360: a database of microbial PKS/NRPS biosynthesis. Nucleic Acids Res. 41 (D1), D402–D407.
Davies, M., Nowotka, M., Papadatos, G., Dedman, N., Gaulton, A., Atkinson, F., Bellis, L., Overington, J.P., 2015. ChEMBL web services: streamlining access to drug discovery data and utilities. Nucleic Acids Res. 43, W612–W620.
de Jong, A., van Heel, A.J., Kok, J., Kuipers, O.P., 2010. BAGEL2: mining for bacteriocins in genomic data. Nucleic Acids Res. 38 (Suppl. 2), W647–W651.

de Jong, A., van Hijum, S.A., Bijlsma, J.J., Kok, J., Kuipers, O.P., 2006. BAGEL: a web-based bacteriocin genome mining tool. Nucleic Acids Res. 34 (Suppl. 2), W273–W279.
Deane, C.D., Mitchell, D.A., 2014. Lessons learned from the transformation of natural product discovery to a genome-driven endeavour. J. Ind. Microbiol. Biotechnol. 41, 315–331.
Degtyarenko, K., de Matos, P., Ennis, M., Hastings, J., Zbinden, M., McNaught, A., Alcántara, R., Darsow, M., Guedj, M., Ashburner, M., 2008. ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res. 36, D344–D350.
Degtyarenko, K., Hastings, J., de Matos, P., Ennis, M., 2009. ChEBI: an open bioinformatics and cheminformatics resource. In: Current Protocols in Bioinformatics. Chapter 14: Unit 14.9.
Diminic, J., Zucko, J., Ruzic, I.T., Gacesa, R., Hranueli, D., Long, P.F., Cullum, J., Starcevic, A., 2013. Databases of the thiotemplate modular systems (CSDB) and their in silico recombinants (r-CSDB). J. Ind. Microbiol. Biotechnol. 40, 653–659.
Du, J., Shao, Z., Zhao, H., 2011. Engineering microbial factories for synthesis of value-added products. J. Ind. Microbiol. Biotechnol. 38, 873–890.
Eddy, S.R., 2011. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195.
Fischbach, M.A., Walsh, C.T., 2009. Antibiotics for emerging pathogens. Science 325, 1089–1093.
Gaulton, A., Bellis, L.J., Bento, A.P., Chambers, J., Davies, M., Hersey, A., Light, Y., McGlinchey, S., Michalovich, D., Al-Lazikani, B., Overington, J.P., 2012. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40, D1100–D1107.
Hadjithomas, M., Chen, I.M.A., Chu, K., Ratner, A., Palaniappan, K., Szeto, E., Huang, J., Reddy, T.B.K., Cimermančič, P., Fischbach, M.A., Ivanova, N.N., 2015. IMG-ABC: a knowledge base to fuel discovery of biosynthetic gene clusters and novel secondary metabolites. MBio 6 (4), e00932–e01015.
Hadjithomas, M., Chen, I.M.A., Chu, K., Huang, J., Ratner, A., Palaniappan, K., Andersen, E., Markowitz, V., Kyrpides, N.C., Ivanova, N.N., 2017. IMG-ABC: new features for bacterial secondary metabolism analysis and targeted biosynthetic gene cluster discovery in thousands of microbial genomes. Nucleic Acids Res. 45, D560–D565.
Hammami, R., Zouhir, A., Hamida, J.B., Fliss, I., 2007. BACTIBASE: a new web-accessible database for bacteriocin characterization. BMC Microbiol. 7, 89.
Hammami, R., Zouhir, A., Le Lay, C., Hamida, J.B., Fliss, I., 2010. BACTIBASE second release: a database and tool platform for bacteriocin characterization. BMC Microbiol. 10, 22.
Hastings, J., de Matos, P., Dekker, A., Ennis, M., Harsha, B., Kale, N., Muthukrishnan, V., Owen, G., Turner, S., Williams, M., Steinbeck, C., 2013. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res. 41, D456–D463.
Hastings, J., Owen, G., Dekker, A., Ennis, M., Kale, N., Muthukrishnan, V., Turner, S., Swainston, N., Mendes, P., Steinbeck, C., 2016. Improved services and an expanding collection of metabolites. Nucleic Acids Res. 44, D1214–D1219.
Hatzimanikatis, V., Li, C., Ionita, J.A., Henry, C.S., Jankowski, M.D., Broadbelt, L.J., 2005. Exploring the diversity of complex metabolic networks. Bioinformatics 21, 1603–1609.
Helfrich, E.J., Reiter, S., Piel, J., 2014. Recent advances in genome-based polyketide discovery. Curr. Opin. Biotechnol. 29, 107–115.
Henry, C.S., Broadbelt, L.J., Hatzimanikatis, V., 2010. Discovery and analysis of novel metabolic pathways for the biosynthesis of industrial chemicals: 3-Hydroxypropanoate. Biotechnol. Bioeng. 106, 462–473.
Hoops, S., Sahle, S., Gauges, R., Lee, C., Pahle, J., Simus, N., Singhal, M., Xu, L., Mendes, P., Kummer, U., 2006. COPASI—a COmplex PAthway SImulator. Bioinformatics 22, 3067–3074.
Ichikawa, N., Sasagawa, M., Yamamoto, M., Komaki, H., Yoshida, Y., Yamazaki, S., Fujita, N., 2012. DoBISCUIT: a database of secondary metabolite biosynthetic gene clusters. Nucleic Acids Res. 41, D408–D414.

Irschik, H., Kopp, M., Weissman, K.J., Buntin, K., Piel, J., Muller, R., 2010. Analysis of the sorangicin gene cluster reinforces the utility of a combined phylogenetic/retrobiosynthetic analysis for deciphering natural product assembly by trans-AT PKS. Chembiochem 11, 1840–1849.
Kanehisa, M., Goto, S., Sato, Y., Furumichi, M., Tanabe, M., 2012. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 40, D109–D114.
Khaldi, N., Seifuddin, F.T., Turner, G., Haft, D., Nierman, W.C., Wolfe, K.H., Fedorova, N.D., 2010. SMURF: genomic mapping of fungal secondary metabolite clusters. Fungal Genet. Biol. 47, 736–741.
Khater, S., Anand, S., Mohanty, D., 2016. In silico methods for linking genes and secondary metabolites: the way forward. Synth. Syst. Biol. 1, 80–88.
Khosla, C., Keasling, J.D., 2003. Metabolic engineering for drug discovery and development. Nat. Rev. Drug Discov. 2, 1019–1025.
Kim, S., Thiessen, P.A., Bolton, E.E., Chen, J., Fu, G., Gindulyte, A., Han, L., He, J., He, S., Shoemaker, B.A., Wang, J., Yu, B., Zhang, J., Bryant, S.H., 2016. PubChem Substance and Compound databases. Nucleic Acids Res. 44, D1202–D1213.
Klementz, D., Döring, K., Lucas, X., Telukunta, K.K., Erxleben, A., Deubel, D., Erber, A., Santillana, I., Thomas, O.S., Bechthold, A., Günther, S., 2016. StreptomeDB 2.0—an extended resource of natural products produced by streptomycetes. Nucleic Acids Res. 44, D509–D514.
Li, Q., Cheng, T., Wang, Y., Bryant, S.H., 2010. PubChem as a public resource for drug discovery. Drug Discov. Today 15, 1052–1057.
Li, M.H., Ung, P.M., Zajkowski, J., Garneau-Tsodikova, S., Sherman, D.H., 2009. Automated genome mining for natural products. BMC Bioinf. 10, 185.
Lucas, X., Senger, C., Erxleben, A., Grüning, B.A., Döring, K., Mosch, J., Flemming, S., Günther, S., 2013. StreptomeDB: a resource for natural compounds isolated from Streptomyces species. Nucleic Acids Res. 41, D1130–D1136.
Manivasagan, P., Kang, K.H., Sivakumar, K., Li-Chan, E.C., Oh, H.M., Kim, S.K., 2014. Marine actinobacteria: an important source of bioactive natural products. Environ. Toxicol. Pharmacol. 38, 172–188.
Martin, C.H., Nielsen, D.R., Solomon, K.V., Prather, K.L., 2009. Synthetic metabolism: engineering biology at the protein and pathway scales. Chem. Biol. 16, 277–286.
Medema, M.H., Blin, K., Cimermancic, P., de Jager, V., Zakrzewski, P., Fischbach, M.A., Weber, T., Takano, E., Breitling, R., 2011a. antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res. 39, W339–W346.
Medema, M.H., Breitling, R., Bovenberg, R., Takano, E., 2011b. Exploiting plug-and-play synthetic biology for drug discovery and production in microorganisms. Nat. Rev. Microbiol. 9, 131–137.
Medema, M.H., van Raaphorst, R., Takano, E., Breitling, R., 2012. Computational tools for the synthetic design of biochemical pathways. Nat. Rev. Microbiol. 10, 191–202.
Medema, M.H., Takano, E., Breitling, R., 2013. Detecting sequence homology at the gene cluster level with MultiGeneBlast. Mol. Biol. Evol. 30, 1218–1223.
Medema, M.H., Fischbach, M.A., 2015. Computational approaches to natural product discovery. Nat. Chem. Biol. 11, 639–648.
Medema, M.H., Kottmann, R., Yilmaz, P., Cummings, M., Biggins, J.B., Blin, K., De Bruijn, I., Chooi, Y.H., Claesen, J., Coates, R.C., Cruz-Morales, P., 2015. Minimum information about a biosynthetic gene cluster. Nat. Chem. Biol. 11, 625–631.
Medema, M.H., Osbourn, A., 2016. Computational genomic identification and functional reconstitution of plant natural product biosynthetic pathways. Nat. Prod. Rep. 33, 951–962.

Mienda, B.S., Shamsir, M.S., 2015. An overview of pathway prediction tools for synthetic design of microbial chemical factories. AIMS Bioeng. 2, 1–14.
Milshteyn, A., Schneider, J.S., Brady, S.F., 2014. Mining the metabiome: identifying novel natural products from microbial communities. Chem. Biol. 21, 1211–1223.
Mohimani, H., Liu, W.T., Mylne, J.S., Poth, A.G., Colgrave, M.L., Tran, D., Selsted, M.E., Dorrestein, P.C., Pevzner, P.A., 2011. Cycloquest: identification of cyclopeptides via database search of their mass spectra against genome databases. J. Proteome Res. 10, 4505–4512.
Mohimani, H., Kersten, R.D., Liu, W.T., Wang, M., Purvine, S.O., Wu, S., Brewer, H.M., Pasa-Tolic, L., Bandeira, N., Moore, B.S., Pevzner, P.A., Dorrestein, P.C., 2014a. Automated genome mining of ribosomal peptide natural products. ACS Chem. Biol. 9, 1545–1551.
Mohimani, H., Liu, W.T., Kersten, R.D., Moore, B.S., Dorrestein, P.C., Pevzner, P.A., 2014b. NRPquest: coupling mass spectrometry and genome-mining for non-ribosomal peptide discovery. J. Nat. Prod. 77, 1902–1909.
Mohimani, H., Gurevich, A., Mikheenko, A., Garg, N., Nothias, L.F., Ninomiya, A., Takada, K., Dorrestein, P.C., Pevzner, P.A., 2017. Dereplication of peptidic natural products through database search of mass spectra. Nat. Chem. Biol. 13, 30–37.
Pence, H.E., 2010. ChemSpider: an online chemical information resource. J. Chem. Educ. 87, 1123–1124.
Planson, A.G., Carbonell, P., Paillard, E., Pollet, N., Faulon, J.L., 2012. Compound toxicity screening and structure-activity relationship modeling in Escherichia coli. Biotechnol. Bioeng. 109, 846–850.
Prather, K.L., Martin, C.H., 2008. De novo biosynthetic pathways: rational design of microbial chemical factories. Curr. Opin. Biotechnol. 19, 468–474.
Punta, M., Coggill, P.C., Eberhardt, R.Y., Mistry, J., Tate, J., Boursnell, C., Pang, N., Forslund, K., Ceric, G., Clements, J., Heger, A., 2012. The Pfam protein families database. Nucleic Acids Res. 40, D290–D301.
Rai, A., Saito, K., Yamazaki, M., 2017. Integrated omics analysis of specialized metabolism in medicinal plants. Plant J. 90, 764–787.
Rausch, C., Weber, T., Kohlbacher, O., Wohlleben, W., Huson, D.H., 2005. Specificity prediction of adenylation domains in nonribosomal peptide synthetases (NRPS) using transductive support vector machines (TSVMs). Nucleic Acids Res. 33, 5799–5808.
Reddy, B.V.B., Milshteyn, A., Charlop-Powers, Z., Brady, S.F., 2014. eSNaPD: a versatile, web-based bioinformatics platform for surveying and mining natural product biosynthetic diversity from metagenomes. Chem. Biol. 21, 1023–1033.
Rocha, I., Maia, P., Evangelista, P., Vilaça, P., Soares, S., Pinto, J.P., Nielsen, J., Patil, K.R., Ferreira, E.C., Rocha, M., 2010. OptFlux: an open-source software platform for in silico metabolic engineering. BMC Syst. Biol. 4, 45.
Rodrigo, G., Carrera, J., Prather, K.J., Jaramillo, A., 2008. DESHARKY: automatic design of metabolic pathways for optimal cell growth. Bioinformatics 24, 2554–2556.
Röttig, M., Medema, M.H., Blin, K., Weber, T., Rausch, C., Kohlbacher, O., 2011. NRPSpredictor2—a web server for predicting NRPS adenylation domain specificity. Nucleic Acids Res. 39, W362–W367.
Schaber, J., 2012. Easy parameter identifiability analysis with COPASI. Biosystems 110, 183–185.
Starcevic, A., Zucko, J., Simunkovic, J., Long, P.F., Cullum, J., Hranueli, D., 2008. ClustScan: an integrated program package for the semi-automatic annotation of modular biosynthetic gene clusters and in silico prediction of novel chemical structures. Nucleic Acids Res. 36, 6882–6892.

van Heel, A.J., de Jong, A., Montalban-Lopez, M., Kok, J., Kuipers, O.P., 2013. BAGEL3: automated identification of genes encoding bacteriocins and (non-) bactericidal posttranslationally modified peptides. Nucleic Acids Res. 41, W448–W453.
Walsh, C.T., Fischbach, M.A., 2010. Natural products version 2.0: connecting genes to molecules. J. Am. Chem. Soc. 132, 2469–2493.
Wang, Y., Xiao, J., Suzek, T.O., Zhang, J., Wang, J., Bryant, S.H., 2009. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 37, W623–W633.
Weber, T., Rausch, C., Lopez, P., Hoof, I., Gaykova, V., Huson, D.H., Wohlleben, W., 2009. CLUSEAN: a computer-based framework for the automated analysis of bacterial secondary metabolite biosynthetic gene clusters. J. Biotechnol. 140, 13–17.
Weber, T., 2014. In silico tools for the analysis of antibiotic biosynthetic pathways. Int. J. Med. Microbiol. 304, 230–235.
Weber, T., Blin, K., Duddela, S., Krug, D., Kim, H.U., Bruccoleri, R., Lee, S.Y., Fischbach, M.A., Müller, R., Wohlleben, W., Breitling, R., 2015. antiSMASH 3.0—a comprehensive resource for the genome mining of biosynthetic gene clusters. Nucleic Acids Res. 43, W237–W243.
Weber, T., Kim, H.U., 2016. The secondary metabolite bioinformatics portal: Computational tools to facilitate synthetic biology of secondary metabolite production. Synth. Syst. Biol. 1, 69–79.
Weissman, K.J., Leadlay, P.F., 2005. Combinatorial biosynthesis of reduced polyketides. Nat. Rev. Microbiol. 3, 925–936.
Xu, Y., Zhou, T., Zhang, S., Xu, P., Zhou, T., Zhang, S., Espinosa-Artiles, P., Wang, L., Zhang, W., Lin, M., Gunatilaka, A.A.L., Zhan, J., Molnár, I., 2014. Diversity oriented combinatorial biosynthesis of benzenediol lactone scaffolds by subunit shuffling of fungal polyketide synthases. Proc. Natl. Acad. Sci. U. S. A. 111, 12354–12359.
Ziemert, N., Podell, S., Penn, K., Badger, J.H., Allen, E., Jensen, P.R., 2012. The natural product domain seeker NaPDoS: a phylogeny based bioinformatics tool to classify secondary metabolite gene diversity. PLoS One 7, e34064.

Chapter 10

Computational Aids for Assessing Bioactivities
Evelyn Wolfram*, Adriana Trifan†
*Zurich University of Applied Sciences (ZHAW), Wädenswil, Switzerland, †Grigore T. Popa University of Medicine and Pharmacy, Iași, Romania

Chapter Outline
10.1. Introduction: Computational Aids in Science and Their Role in Bioactivity Studies of Natural Products
10.2. Strategies for Separation and Identification of Bioactive Natural Compounds for Drug Discovery
10.3. Bioactivity Assessment in Phytochemistry
10.3.1 Protein-Based In Vitro Models
10.3.2 In Vitro Cell Culture Models
10.3.3 In Situ and ex vivo Models
10.3.4 Animal Models
10.4. Computational Tools for Data Analysis From Metabolomics and Bioactivity Assessment Data in Natural Product Research and Drug Discovery
10.5. Data- and Text-Mining Strategies
10.6. Virtual or In Silico Screening of Natural Products
10.7. Application Example of an In Silico Assessment of Bioactivities on the Example of the Cannabinoid Receptor 2
10.8. Overview of Software and Web-Tools for Bioactive Phytochemicals Research
10.9. Conclusions
References

Computational Phytochemistry. https://doi.org/10.1016/B978-0-12-812364-5.00010-9 © 2018 Elsevier Inc. All rights reserved.

10.1. INTRODUCTION: COMPUTATIONAL AIDS IN SCIENCE AND THEIR ROLE IN BIOACTIVITY STUDIES OF NATURAL PRODUCTS
Modern science is fundamentally linked to achievements in computer science and to technical frontiers that are continuously pushed forward. In an article at the intersection of science, philosophy, and computer science, Amigoni and Schiaffonati (2007) analysed the role of supporting computational technology throughout the history of scientific development and its role in today's science, which they characterize as 'E-Science'. They pointed out the tendency of scientists to extend their natural intellectual capacities with technical aids. A famous example illustrating the role of technical support in human discovery is Galileo Galilei's use of the telescope, which opened a new era by allowing him to acquire data from observations that could not have been made with the naked eye. Today, the Hubble telescope demonstrates how what was originally the technical support of a single scientist can evolve into a complex, multitechnological system. Beyond its purely physical function as a magnifier for observing ever more detail in space, it is now a collaborative technical and scientific entity whose complexity is handled by computational means. Computers operate the instrument, acquire the signals and transform them into data, and store and visualize those data; in addition, the possibility of copying and sharing the enormous amount of acquired information is of great value to science. It allows joint scientific progress in interpreting extra-terrestrial signals and their meaning for life on earth and for terrestrial natural science, and thus serves the scientific discovery process, which is the main goal of scientific activity. From this example, it can be concluded that the role of computational aids for scientists is twofold: on the one hand, to store and represent the results of scientific experiments, and on the other hand, to support the scientists' human intelligence in the discovery process with tools of artificial intelligence and mathematical power.
Transferring these general thoughts to computational aids in the bioactivity assessment of phytochemicals, the role of computers can be divided into the following areas:
1. Functional automation of laboratory instruments, data acquisition with these instruments, and storage of the data.
2. Data analysis and visualization, and the development of suitable reductionist models to interpret the results from a growing amount of data.
3. Finding and using the necessary knowledge from a vast pool of scientific reports and databases on the topic (text and data mining).
4. Development and/or application of models that represent previously acquired scientific knowledge. Computational aids can thus serve as prospecting tools that help to avoid unnecessary empirical work, e.g., molecular docking and modelling studies.
5. Computer-supported experimental design. In the empirical process, the stochastic power of computational means, combining statistics and probability theory, helps to design the experimental workflow differently from the intuitive human approach.
However, this is not all that computational tools can offer in the search for bioactive phytochemicals.
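As one illustration of computer-supported experimental design, the following Python sketch randomizes the assignment of samples and their replicates to the wells of an assay plate, so that plate position is not confounded with treatment. The plate dimensions, sample names, and function name are assumptions chosen for this example; it is not a substitute for a full statistical design.

```python
import random
from itertools import product


def randomized_layout(samples, replicates, rows="ABCD", cols=range(1, 7), seed=42):
    """Assign each sample replicate to a well in random order (seeded for reproducibility)."""
    wells = [f"{row}{col}" for row, col in product(rows, cols)]
    runs = [sample for sample in samples for _ in range(replicates)]
    if len(runs) > len(wells):
        raise ValueError("not enough wells for the requested design")
    rng = random.Random(seed)
    rng.shuffle(runs)  # randomize run order so position and treatment are independent
    return dict(zip(wells, runs))  # unused wells are simply left out


# Hypothetical samples for a bioactivity assay, three replicates each.
samples = ["extract_1", "extract_2", "vehicle", "positive_control"]
layout = randomized_layout(samples, replicates=3)

for well, sample in layout.items():
    print(well, sample)
```

Fixing the seed keeps the layout reproducible, which makes it possible to document the design alongside the raw data.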


The use of bioactive phytochemicals and natural products in human life and lifestyle has a long tradition. Before the development of chemical science and chromatographic techniques, phytochemicals and natural products, taken as whole plants or as extracts, were the only source of bioactive compounds. Nowadays, synthetic and medicinal chemistry compete in the discovery of bioactive compounds for pharmaceutical, health food, and cosmetic products as well as for pesticidal, herbicidal, and hygienic antimicrobial products. The main field interested in bioactive compounds is pharmaceutical drug discovery, whether in academic settings or in sophisticated high-throughput screening (HTS) programmes in industry. Until recently, natural sources predominated in the retrieval of promising bioactive lead compounds. As chemical science has evolved far enough to synthesize almost any conceivable organic structure, the use of natural sources for active pharmaceutical ingredients has decreased significantly (Li and Vederas, 2009). The limitations of purely synthetic chemistry have also been overcome by recent developments in biocatalysis and in microbial biotechnological processes that use genetically modified organisms as biofactories for the desired compounds. However, not all known compounds from natural sources can be produced economically by chemical or biotechnological processes, and many of them still need to be obtained directly from natural materials. It has been acknowledged in recent years that, in contrast to combinatorial chemistry approaches in drug discovery, natural products give a higher hit rate relative to the total number of compounds screened in high-throughput drug discovery experiments (Strohl, 2000; Li and Vederas, 2009). Especially in some areas of public healthcare, such as the antibiotic crisis and neglected diseases, there is a need for efficient bioprospecting tools.
The search for such novel potential natural actives, rather than for already known and evaluated compounds, justifies the application of computational aids as a complement to the common empirical efforts (Fig. 10.1). Overall, the aim of research and development in this field is not only the discovery of any bioactive compound, but also the discovery of compounds that fulfil the criteria of the Quality–Efficacy–Safety triangle. Computational tools, if fit for purpose and supplied with correct and abundant input, might be capable of exploiting the vast knowledge available on natural products by means of data- and text-mining programmes and combining it with empirical data for a targeted and rational discovery process for actives. The input data include mainly ethnopharmacological (traditional) knowledge, chemical and physical molecular details, bioavailability, pharmacodynamic and pharmacokinetic data, and toxicity data, to name only some of the available categories of knowledge. This chapter addresses computational aids applied to the discovery of bioactive phytochemicals, covering the following areas:
1. Statistical tools for analysis of the available multivariate data.
2. Data- and knowledge-mining tools.


[Figure 10.1 diagram: the Quality (content standardization and contaminant control), Safety (for humans and environment), and Efficacy (at biological target) triangle for computational aids for bioactivity assessment of natural products. One side shows empirical phytochemical information assessment (unknown compound mixture, fractionation and isolation, e.g., bioguided approach; unknown pure compound; known compound with structural and physicochemical data). The other side shows computational tools for phytochemical and bioactivity assessment (mining of existing data and knowledge; spreadsheet and database software, e.g., Excel, Oracle; graphic and statistical software, e.g., SPSS; mathematical data processing and modelling, e.g., R, Matlab). Outcome: application in food, pharma, personal care, hygiene, and agriculture.]

FIG. 10.1  The well-known triangle for Quality, Safety, and Efficacy for development of products from active compounds and the overview of contributions from empirical and computational aids in the development process.

3. Evaluation of data from bioactivity testing of multicompound mixtures and separated biochemicals to find the ‘needle in the haystack’, meaning the one bioactive compound responsible for the mechanism of action (MOA).
4. Virtual screening and assessment of (quantitative) structure–activity relationships ((Q)SAR).
5. Discovery of synergistic activities as a paradigm shift away from the single active compound.

10.2.  STRATEGIES FOR SEPARATION AND IDENTIFICATION OF BIOACTIVE NATURAL COMPOUNDS FOR DRUG DISCOVERY

It is estimated that, to date, just over 200,000 natural compounds have been reported in the literature (Rollinger et al., 2008), many of which are phytochemicals. Their role in nature must be beneficial to the biosynthesizing organisms. Evolution has conserved these compounds because of their functions for the organisms, especially in defence or symbiosis. Ethnobotanical knowledge of the traditional use of natural resources in healing (Leonti, 2011; Pak et al., 2016), hygiene, mental well-being, and even poisoning for hunting or defence (Heinrich and Jäger, 2016) delivers many hints for sources of potent natural bioactives. The main strategy in recent years for the discovery of novel bioactive natural products has been bioassay-guided fractionation and targeted isolation from complex mixtures to acquire pure active compounds and explain the desired bioactivity. This approach combines phytochemical characterization and multicompound separation with the identification of bioactive compounds.


Where a positive effect is detected in a separated multicomponent mixture, intensive subsequent phytochemical and analytical investigations are necessary to elucidate the identity and structure of the compound. This approach poses problems, which are summarized by Strohl (2000). Many of the potent natural compounds found are already known, and this is usually discovered quite late in the process, after tedious purification and structure elucidation work. The rediscovery of known compounds should be avoided, since it wastes time and other resources. In addition, some natural compounds are susceptible to degradation when separated from their biological environment, and the interaction of compounds in the mixture with each other and with the assay components might produce artefacts. However, appropriate dereplication processes can be put in place to avoid isolating known compounds, without embarking on the isolation process (see Chapter 5). In recent years, the plant metabolomics approach, enabled by the enormous technical advancement in analytical chemistry and the parallel evolution of molecular structure databases of metabolomics data, has somewhat overtaken the bioassay-guided fractionation strategy. Metabolomics can be defined as the analytical study of low molecular weight compounds of biological origin (Misra et al., 2017) (see Chapter 8). Experimental platforms used to study plant metabolomes include chromatography and spectroscopy. Chromatographic techniques (gas and liquid chromatography) are employed to separate the metabolites, which are then quantified and identified with various detectors, such as mass spectrometry (MS) for a targeted approach, while nuclear magnetic resonance (NMR) and MS are used without the chromatographic steps as untargeted analytical tools.
Technologies for metabolic fingerprinting or profiling have already been applied to drug discovery from natural sources, quality control of herbal material/preparations, and putative identification of new lead compounds (van der Kooy et al., 2009; Turi et al., 2015; Donno et al., 2016). Computational means for data analysis in metabolomics have been reviewed by Misra et al. (2017), detailing current tools for data analysis mainly in GC-MS, LC-MS, MS imaging, and MS/MS as well as NMR techniques. In separation science, computational tools are used to speed up and rationalize the experimental development process. The mathematical foundations for predicting the retention of a given molecule in a given chromatographic setting were laid piece by piece throughout the development of chromatographic techniques. However, the combination of the known equations for modelling and prediction was pioneered by R. Kaliszan, starting in 1977 with the coining of the term Quantitative Structure Retention Relationship (QSRR) (Héberger, 2007; Kaliszan, 2007). Since then, several dedicated tools for the rational development and prediction of analytical or preparative chromatographic separation methods have been developed. One widely used software package is DryLab for liquid chromatography, which combines physicochemical parameters of the chromatographic system with a Design of Experiments approach, training the model with data from experimental input runs at the extremes of the parameter field (Molnar, 2002). The mathematical basis of such tools was outlined by Nikitas and Pappa-Louisi (2009).
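The QSRR idea described above can be illustrated with a very small numerical sketch: retention (here log k) is regressed on a molecular descriptor, and the fitted relationship is then used to predict the retention of a compound that has not yet been run. The example below is only a minimal illustration under stated assumptions. It uses a single lipophilicity descriptor, ordinary least squares, and invented calibration values; the function names `fit_line` and `predict_log_k` are hypothetical, and the sketch does not reproduce the models of Kaliszan or of any commercial software.

```python
from statistics import mean

# Hypothetical calibration data, invented for illustration only:
# one lipophilicity descriptor (logP) per compound and its "measured" log k.
# The values were generated from log k = -0.20 + 0.50 * logP.
log_p = [0.5, 1.0, 2.0, 3.0, 4.0]
log_k = [0.05, 0.30, 0.80, 1.30, 1.80]


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict_log_k(intercept, slope, descriptor):
    """Predict retention (log k) for a compound from its descriptor value."""
    return intercept + slope * descriptor


intercept, slope = fit_line(log_p, log_k)
print(f"fitted model: log k = {intercept:.2f} + {slope:.2f} * logP")
print(f"predicted log k at logP = 2.5: {predict_log_k(intercept, slope, 2.5):.2f}")
```

In practice, several descriptors and a much larger, experimentally measured calibration set would be needed, and the model would have to be validated on compounds that were not used for fitting.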


The QSRR concept was expanded to the Quantitative Retention Activity Relationship (QRAR) (Kaliszan, 2007). Since lipophilicity is one of the crucial parameters for drug-like properties in the well-known Lipinski rule of five (Lipinski et al., 2001), chromatography might be a complementary approach to conventional bioactivity screening, by efficiently delivering separated drug-like natural substances from the mixture. The application of QRAR is visible in studies on the prediction of bioavailability in pharmacological development, with a recent application reported for subcutaneous absorption of medicinal plant constituents for skin care (Stepnik and Malinowska, 2017). A summary of how analytical and bioactivity information can be combined to assess the chemical and bioactive diversity of natural products was presented by Potterat and Hamburger (2013), structured according to separation techniques and their hyphenation with bioassays. They also mentioned a simple, rapid, and cost-effective tool known as ‘bioautography’, which combines thin-layer chromatographic fingerprint separation with bioassays. This approach has also been reviewed in relation to different target assays (Choma and Grzelak, 2011; Marston, 2011; Favre-Godal et al., 2013; Bräm and Wolfram, 2017). After a first crude screening of an extract, in which the visual fingerprint is obtained as a TLC photograph, a time-consuming in-depth analysis of the fingerprint against references and controls is required to evaluate the potential ‘hits’. In particular, the exclusion of artefacts by controls or even parallel control assays is a must. A special issue of Phytochemical Analysis addressed recent advances in combining chromatographic separation and bioactivity assays in natural product research (Bucar and Wolfram, 2017).
If such hyphenated chromatographic bioassays were combined with the above-mentioned QRAR approach, an integrated strategy for simultaneous separation of natural product mixtures, evaluation of their drug-like properties, and in vitro bioactivity assessment in combination with bioavailability predictions might be achieved in one method. To implement such a strategy, computational data analysis and drug-likeness prediction tools need to be linked with the empirical investigations in the lab.
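As an illustration of the kind of drug-likeness prediction referred to above, the Lipinski criteria can be written as a simple filter. The sketch below assumes that the four descriptors have already been computed elsewhere (for instance with a cheminformatics toolkit) and applies the commonly quoted thresholds: molecular weight not above 500, logP not above 5, no more than 5 hydrogen-bond donors, and no more than 10 hydrogen-bond acceptors. The class and function names are hypothetical, the quercetin values are approximate and given for illustration only, and the second compound is entirely invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Descriptors:
    """Precomputed descriptors for one compound (supplied by the user)."""
    name: str
    mol_weight: float       # g/mol
    log_p: float            # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> List[str]:
    """Return the list of rule-of-five criteria that the compound violates."""
    violations = []
    if d.mol_weight > 500:
        violations.append("molecular weight > 500")
    if d.log_p > 5:
        violations.append("logP > 5")
    if d.h_bond_donors > 5:
        violations.append("H-bond donors > 5")
    if d.h_bond_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    return violations


def is_drug_like(d: Descriptors, max_violations: int = 1) -> bool:
    """Treat a compound as drug-like if it has at most `max_violations`."""
    return len(lipinski_violations(d)) <= max_violations


# Approximate values for quercetin, for illustration only.
quercetin = Descriptors("quercetin", 302.2, 1.5, 5, 7)
# Entirely hypothetical glycoside with invented descriptor values.
glycoside = Descriptors("hypothetical glycoside", 780.0, 0.5, 9, 16)

for compound in (quercetin, glycoside):
    print(compound.name, is_drug_like(compound), lipinski_violations(compound))
```

Many natural products, glycosides in particular, fall outside these criteria while still being bioactive, so a filter of this kind is better used for prioritization than for exclusion.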

10.3.  BIOACTIVITY ASSESSMENT IN PHYTOCHEMISTRY

Discovery of active ingredients from natural sources requires a multidisciplinary approach in which the rate of success depends on a well-chosen set of in vitro and in vivo assays. Based on the research objectives, screening models that differ in complexity and throughput capacity are chosen and applied. The bioactivity of natural products can be evaluated by in vitro receptor/enzyme models with purified proteins, cell culture systems, models with isolated tissues/organs, and in vivo preclinical animal models (Wang et al., 2011). Selection of appropriate preclinical models may assure the clinical efficacy of novel drug candidates and could reduce the rate of failure during drug development campaigns (Kubinyi, 2003). Investigation of bioactive natural compounds relies on either a ‘forward’ or a ‘reverse’ pharmacology approach, both using the above-mentioned bioassays, but in a different order (Fig. 10.2). Based on hints from traditional medicine and on biorational and phylogenetic criteria, the ‘forward pharmacology’ approach first determines the biological activity using in vivo animal models or organ or tissue models, followed by in vitro investigation of the molecular target. The ‘reverse pharmacology’ approach, in contrast, starts by screening libraries of plant-derived compounds against specific targets, thus identifying potential ‘hits’ with the desired activity, which are further validated in selected in vivo models (Rollinger et al., 2006; Schenone et al., 2013; Atanasov et al., 2015; Leonti et al., 2017). Bioactivity assessment methods based on simple chemical reactions, such as some commonly used assays to determine in vitro antioxidant properties [e.g., the 2,2-diphenyl-1-picrylhydrazyl (DPPH) radical scavenging assay, 2,2'-azinobis (3-ethylbenzothiazoline-6-sulfonic acid) diammonium salt (ABTS) radical cation scavenging assay, reducing power or metal ion chelating assay, etc.], will not be discussed in this chapter.

[Figure 10.2 diagram labels: functional activity assessment (in vivo and in vitro models); lead compounds identification; molecular target identification (in silico tools, in vitro, and in vivo models); forward pharmacology approach; reverse pharmacology approach.]

FIG. 10.2  Screening assays for drug discovery from natural products: paradigm shift from a forward pharmacology to dereplication by a reversed pharmacology approach.

10.3.1  Protein-Based In Vitro Models

High sensitivity and specificity can be obtained from classical protein-based in vitro assays for screening drug candidates from natural sources. This class of assays relies on binding of the test compound to the active centre of the target protein or on influencing its functional activity by inhibiting a protein–protein interaction (Rask-Andersen et al., 2011). Being associated with different physiological and pathological in vivo processes, proteins are considered the main targets of drug action. Therefore, purified protein targets, such as enzymes, receptors, transport proteins, and ion channels, have been employed in numerous drug discovery programmes, e.g., the discovery of G protein-coupled receptors involved in asthma, pain, heart diseases, and peptic ulcers (Wise et al., 2002). These assays are well suited to high-throughput screening (HTS) and can be performed in the laboratory without the need for cell culture or animal facilities (Imming et al., 2006). However, although protein-based in vitro models reveal the mechanism of action of a drug candidate, they may not guarantee its functionality, which is further tested in cell-based in vitro assays or in vivo animal models (Butterweck and Nahrstedt, 2012).

10.3.2  In Vitro Cell Culture Models

Assays using cultured cells can mimic the physiopathological state of different diseases and can thus be simple and inexpensive tools for the initial assessment of the pharmacological activity and toxicological profile of phytochemicals. Cell-based assays provide useful information about bioactivity at the cellular level and are used in both academic and pharmaceutical industrial research, primarily as a link between compound screening and in vivo pharmacology (Rausch, 2006). Cell culture systems can be used to confirm the results from in vitro protein-based models, giving additional information on the interaction of the test compound with specific protein functions and on changes in cellular phenotype such as cell growth and proliferation, migration, or apoptosis and senescence (Lee and Bogyo, 2013). In vitro cell assays mostly employ mammalian cells, but yeast cells are also used for drug target deconvolution, understanding of the mechanism of action, and identification of target pathways (Hoon et al., 2008).

10.3.3  In Situ and ex vivo Models

Assays with tissues and isolated organs bridge the gap between in vitro and in vivo models, offering high physiopathological relevance and providing some information about the absorption and metabolism of test compounds. Thus, the efficacy of drug candidates can be tested under conditions resembling the in vivo situation, with the advantage of reduced sample amounts and labour, but with several disadvantages, mostly related to the short half-life of tissues/isolated organs and ethical concerns about the use of animals (Wang et al., 2011; Luo et al., 2013; Atanasov et al., 2015). Curtis et al. (2008) described the so-called ex vivo Metrics technology, which uses intact human organs ethically donated for research purposes, thus avoiding potential species differences and providing an environment similar to the human system prior to clinical trials. Recently, 3D in vitro systems have significantly advanced the drug screening process, as 3D tissue and organ models can mimic native tissues and also, to a lesser extent, the physiological response to drugs. Bioprinting is a promising technology that enables the growth of cells into native-like organizations and holds significant potential in drug testing, disease modelling, and HTS (Peng et al., 2016). Moreover, the derived concept of the ‘human organ-on-chip’ could revolutionize drug discovery programmes, as models that mimic normal and pathological human physiology are under development and are considered promising substitutes for animal testing (Luni et al., 2014; Kizawa et al., 2017; Yi et al., 2017).

10.3.4  Animal Models

Animal testing remains crucial for drug evaluation and validation, as it provides important data on the efficacy, bioavailability, adverse drug reactions, and toxicity of a test compound in an organism. It is widely used for pharmacokinetic and safety studies, a prerequisite for human clinical trials (Butterweck and Nahrstedt, 2012). However, in recent years, with the introduction of predictive toxicology (see Chapter 2), which involves computer-aided in silico toxicity studies of drugs or any bioactive compounds, some animal-based toxicity models can be avoided. Mouse and rat are indispensable species in the drug discovery process; a number of parameters need to be considered in this type of animal model, such as route of administration, dose of the test compound, experimental design, and use of a well-known positive control, when available (Vogel and Vogel, 2002; Martin and Novali, 2011). Nonrodent species (e.g., rabbits, dogs, swine, and monkeys) are also used in pharmacokinetic studies and for pharmacological safety assessment, mostly because the regulatory guidelines of international medicines authorities require safety testing in at least two mammalian species, one of them nonrodent, before human trials are authorized (Parasuraman, 2011). Recently, genetically modified rodent models (with gene knockout, downregulation (knockdown), or overexpression of certain proteins) have been successfully employed for assessing in vivo efficacy prior to human trials, for target identification, or for the discovery of pharmacological mechanisms of action (Whiteside et al., 2013; Barrett and McGonigle, 2017; Munro et al., 2017). Nevertheless, the use of traditional vertebrate animal models for drug discovery implies technical and financial challenges, which may explain the relative scarcity of in vivo studies used to identify new treatments.
Thus, alternative nonmammalian animal models, such as nematodes (Caenorhabditis elegans) and zebrafish (Danio rerio), are particularly well suited to high-throughput drug screening in intact living organisms. These species offer the advantage of genetic interrogation and automated techniques and can therefore provide fast phenotype-based screens that identify relevant therapies and simultaneously evaluate drug toxicity (Wood et al., 2011; O'Reilly et al., 2014; Deveau et al., 2017).

10.4.  COMPUTATIONAL TOOLS FOR DATA ANALYSIS FROM METABOLOMICS AND BIOACTIVITY ASSESSMENT DATA IN NATURAL PRODUCT RESEARCH AND DRUG DISCOVERY

The most common analytical platforms used in metabolomics studies generate large amounts of increasingly complex data. Nowadays, bioanalytical instruments are linked to computers, which have become an indispensable tool for data acquisition, database searches, and statistical analysis of the results (Roussel et al., 2014). Chemometrics, also referred to more generally as multivariate statistical analysis, applies mathematical and statistical methods to visualize, mathematically manipulate, and analyse chemical data and to extract significant information from the experimental sets (Trygg et al., 2006; Pirhadi et al., 2016). Since metabolomics and bioanalytical data are linked to each other in bioactivity assessment, the same data analysis tools as in conventional chemometrics can be applied. According to Eliasson et al. (2011), the process of handling metabolomics data can be divided into three major steps. The first is the data processing step, in which signal processing methods (e.g., filtration and feature detection), alignment, and normalization procedures are applied to the raw data files. The second is the data analysis and visualization step, in which various regression, projection, or machine learning methods are used, such as principal and independent component analysis, multidimensional scaling, clustering techniques, and discriminant function analysis. Among the multivariate data analysis methods, principal component analysis (PCA) and projection to latent structures by partial least squares (PLS) are routine processing steps, both reducing the complexity of the data set. PCA is an excellent tool for extracting and displaying the systematic variation in a data matrix. PLS is a regression extension of PCA, designed to reveal the relationship/correlation between two kinds of data sets. The third step employs statistical validation procedures such as test set validation, cross-validation, and resampling methods, which are used to assess the significance of the results (Steuer, 2006; van der Kooy et al., 2009; Eliasson et al., 2011).
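The first two steps described above, normalization followed by PCA, can be sketched in a few lines. The example below is a minimal illustration, assuming NumPy is available; it builds a small synthetic data matrix (12 samples, 5 features, invented for this purpose), autoscales each feature, and obtains scores, loadings, and explained variance from a singular value decomposition. The function names `autoscale` and `pca` are hypothetical, and dedicated chemometrics packages offer far more (PLS, cross-validation, diagnostics) than is shown here.

```python
import numpy as np

# Synthetic data for illustration only: 12 samples x 5 "metabolite" features,
# mostly driven by a single hidden factor plus a little random noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(12, 1))
true_loadings = np.array([[1.0, 0.8, -0.6, 0.1, 0.0]])
data = latent @ true_loadings + 0.05 * rng.normal(size=(12, 5))


def autoscale(x):
    """Mean-centre each column and scale it to unit variance."""
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)


def pca(x, n_components=2):
    """PCA of a column-centred matrix via singular value decomposition."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]   # sample coordinates
    loadings = vt[:n_components].T                     # feature contributions
    explained = (s ** 2) / np.sum(s ** 2)              # variance fractions
    return scores, loadings, explained[:n_components]


scaled = autoscale(data)
scores, pcs, explained = pca(scaled)
print("explained variance ratio:", np.round(explained, 3))
print("scores of first sample:", np.round(scores[0], 3))
```

A scores plot of the first two components is the usual starting point for spotting groupings or outliers among samples, while the loadings indicate which features drive the separation.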
Interpretation of the analytical data relies crucially on these computational tools, as they build upon interdependencies between metabolites; chemometric technologies are therefore expected to improve model interpretation and biochemical relevance in the area of plant metabolome research (Bansal et al., 2014; Boccard and Rudaz, 2014). Jansen et al. (2010) presented a remarkable basic treatise on the fundamental theory and practical applications of multivariate data analysis in the phytochemical metabolomics context. Using a greenhouse experiment and a photographer for data acquisition, the authors provide a helpful analogy for understanding the basic principles of preparing the raw data, transforming them mathematically (e.g., normalization), and applying principal component analysis (PCA) as a tool for analysing multivariate data sets.

The storage and structured management of the acquired data require suitable database software. An Electronic Lab Notebook (ELN) (Beato et al., 2011) and Laboratory Information Management Systems (LIMS) (Hunter et al., 2017) are important computational tools. While many academic researchers rely on paper lab notebooks and common office software tools, which they set up individually for their daily needs, such professional computational data management tools are applied in corporate research and development departments. Data acquisition, storage, and protection from manipulation are crucial topics in pharmaceutical quality assurance and are part of the Good Manufacturing Practice (GMP) and Good Laboratory Practice (GLP) guidelines, whose implementation is audited by regulatory bodies. The principal rule is that data need to be recorded immediately when they are acquired and not later, which is often the reason for paper-based notebooks. Beato et al. (2011) reported their experiences as a professional bioanalytical research lab with an ELN called E-WorkBook Suite, a commercial tool connected to an Oracle database. Besides its attractiveness for corporate and regulatory needs, the authors also addressed its application in basic research labs. A relational database allows chemical structures to be linked with bioanalytical results and the corresponding scientific literature, and offers full searchability for keywords in all linked data files. A LIMS includes all the features of an ELN but is a more extensive tool, which might include much other related information on lab and project management, such as instrument maintenance, chemical stock management, batch and shelf-life information, and supplier management. A recently presented LIMS tool, MASTR-MS, claims to be the first tailored to metabolomics laboratories and is attractive to academics, since it is web-based and allows remote collaboration by many researchers from different institutions on the same project. Besides an ELN feature for project management, this tool includes sample traceability, data capture, storage, and protection as well as accessibility of, and collaboration on, the same data by the authorized users within a project. In contrast to the above-mentioned tools with their broad range of features, Höck and Riedl (2012) presented CyBy2, a small data management tool tailored to structure-related data handling and storage as the basis for structure-bioactivity-driven research applications in medicinal and synthetic organic chemistry. However, it is also applicable to natural product extracts and isolated molecules and their related acquired or published data, as a basis for data mining or subsequent virtual screening approaches.
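The role of a relational database in linking structures, bioanalytical results, and literature, as described above, can be illustrated with a small in-memory example. The sketch below uses Python's built-in `sqlite3` module; the three-table schema, the function `search_records`, and all records are hypothetical placeholders invented for illustration, and they do not reflect the data model of any of the ELN or LIMS products mentioned.

```python
import sqlite3

# In-memory database with a hypothetical three-table schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compound (
    id INTEGER PRIMARY KEY, name TEXT NOT NULL, smiles TEXT);
CREATE TABLE assay_result (
    id INTEGER PRIMARY KEY,
    compound_id INTEGER NOT NULL REFERENCES compound(id),
    assay TEXT NOT NULL, value REAL, unit TEXT);
CREATE TABLE literature (
    id INTEGER PRIMARY KEY,
    compound_id INTEGER NOT NULL REFERENCES compound(id),
    citation TEXT NOT NULL);
""")

# Placeholder records, invented for illustration only.
a_id = conn.execute(
    "INSERT INTO compound (name, smiles) VALUES (?, ?)",
    ("compound A", "CCO")).lastrowid
conn.execute(
    "INSERT INTO compound (name, smiles) VALUES (?, ?)", ("compound B", "CCN"))
conn.execute(
    "INSERT INTO assay_result (compound_id, assay, value, unit) "
    "VALUES (?, ?, ?, ?)",
    (a_id, "hypothetical enzyme inhibition", 12.5, "µM"))
conn.execute(
    "INSERT INTO literature (compound_id, citation) VALUES (?, ?)",
    (a_id, "placeholder citation for compound A"))
conn.commit()


def search_records(connection, keyword):
    """Keyword search across compound names, assays, and citations."""
    pattern = f"%{keyword}%"
    query = """
        SELECT c.name, a.assay, a.value, a.unit, l.citation
        FROM compound c
        LEFT JOIN assay_result a ON a.compound_id = c.id
        LEFT JOIN literature l ON l.compound_id = c.id
        WHERE c.name LIKE ? OR a.assay LIKE ? OR l.citation LIKE ?
        ORDER BY c.name, a.assay
    """
    return connection.execute(query, (pattern, pattern, pattern)).fetchall()


for row in search_records(conn, "inhibition"):
    print(row)
```

Because every assay result and citation carries the identifier of its compound, a single query returns the structure together with its linked results and references, which is the property that makes such databases a convenient basis for later data mining.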

10.5.  DATA- AND TEXT-MINING STRATEGIES

The basic concept of all computational approaches is the systematic search, also called mining, of a large volume of data, e.g., compound databases, in order to filter for potent compound candidates for the disease under study. According to Rollinger et al. (2008), the term ‘data-mining’ was first used in 1996 by Morell et al. The underlying idea was to use computational tools to search and evaluate knowledge from a large and heterogeneous set of data and to use this for model building to explain observed phenomena and to predict or extrapolate outcomes in the future or in other settings (Gasteiger et al., 2003). The biological activities of natural product extracts might result from synergistic, additive, or antagonistic effects. It is possible that a change in the composition and proportion of the active substances might result in a different outcome of the bioactivity assessment. Thus, tools are needed to understand the relationships between the chemical components and their pharmacological effects, as proposed by Wang et al. (2008). An algorithm was developed, described in detail as formulas and programming code, and implemented in Matlab; it was suitable for identifying the compounds active on MCF-7 cells among nine ginsenosides from 28 Panax ginseng extracts. With this preliminary study, the authors aimed to demonstrate and evaluate the informatics data-mining approach for the assessment of bioactive natural product mixtures.

Text-mining is a recent strategy to save time on tedious and sometimes expensive literature review work. Since a vast variety of reports on traditional knowledge have been published, computational algorithms that are also used in other applications of linguistic analysis are applied to the exploration of published abstracts and manuscripts. Choi et al. (2016) presented a so-called corpus, which in the field of linguistics means a set of structured text elements, for plant-chemical relationships. The algorithm was able to mine PubMed abstracts for plant names, the chemicals they contain, and even the solvents that yielded the phytochemicals, and annotated each parameter to the relevant data set. A web-based tool resulting from a text-mining development links plant-based foods with small-molecule compounds and human disease phenotypes (Jensen et al., 2015). In this tool, only the structure of a molecule has to be entered, and the result, covering activity after oral consumption in food, is given in the form of text citations.
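The basic principle behind mining abstracts for plant-chemical relationships can be illustrated with a simple dictionary-based co-occurrence count. The sketch below is a deliberately naive illustration and not the method of Choi et al. (2016) or Jensen et al. (2015): the two small vocabularies, the example sentences, and the function names are invented for this purpose, and real systems rely on curated dictionaries, annotated corpora, and more robust linguistic processing.

```python
import re
from collections import Counter
from itertools import product

# Tiny illustrative vocabularies; a real system would use curated dictionaries.
PLANTS = {"Panax ginseng", "Curcuma longa"}
COMPOUNDS = {"ginsenoside", "curcumin"}

# Invented example sentences standing in for abstract text.
abstracts = [
    "Extracts of Panax ginseng were analysed and the ginsenoside content "
    "was quantified.",
    "Curcumin was isolated from Curcuma longa rhizomes using ethanol.",
    "Panax ginseng roots yielded a ginsenoside-rich fraction.",
]


def find_terms(text, vocabulary):
    """Return the vocabulary terms that occur as whole words in the text."""
    found = set()
    for term in vocabulary:
        if re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE):
            found.add(term)
    return found


def plant_compound_pairs(texts):
    """Count plant-compound pairs that are mentioned in the same sentence."""
    counts = Counter()
    for text in texts:
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            plants = find_terms(sentence, PLANTS)
            compounds = find_terms(sentence, COMPOUNDS)
            counts.update(product(sorted(plants), sorted(compounds)))
    return counts


pairs = plant_compound_pairs(abstracts)
for (plant, compound), n in pairs.most_common():
    print(f"{plant} - {compound}: {n} sentence(s)")
```

Co-occurrence in a sentence is only a weak indicator of a true relationship, which is why published approaches add annotation and validation steps on top of this kind of matching.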

10.6.  VIRTUAL OR IN SILICO SCREENING OF NATURAL PRODUCTS

To overcome the gap between known bioactives and the discovery of new structures, computational tools could help to perform rational discovery investigations using the available knowledge. Rollinger et al. (2008) pointed out that the link between bioinformatics (gene and protein information) and chemoinformatics (molecular and physicochemical information) is a crucial opportunity and challenge for computational tools in drug discovery. A growing number of databases of 3D protein structures (e.g., the Protein Data Bank) as well as growing electronic databases of structures of potential ligands provide great opportunities for a rationalized concept of drug discovery by in silico studies (see Chapter 5). Some tools that are remarkable in the natural product context, from the authors’ perspective, are highlighted below.

ChemGPS-NP (Larsson et al., 2007) and its web-based version ChemGPS-NPweb (Rosén et al., 2009) are tools for evaluating the drug-like properties of natural products by analysing their physicochemical characteristics in eight dimensions. Large compound library data sets can be compared, and visual cluster analysis helps in selecting and prioritizing compounds for further drug discovery or other detailed bioactivity studies. In a follow-up study, ChemGPS-NP was applied to compare marine natural product, terrestrial natural product, and synthetic compound databases. It was demonstrated that, despite overlapping regions, there were differences in drug-like properties between the data sets, which revealed that marine natural products could cover regions of the physicochemical property space that might not be easy to populate by synthesis (Muigg et al., 2013).

PASS (Prediction of Activity Spectra for Substances) is another example of a virtual screening tool; it has quite broad applications and can be operated online (Sadym et al., 2003). A compound structure file (SDF or MOL format) can be submitted online, and a large set of bioactivity spectra can be obtained according to parameters set by the user in the web form. On the website, a large set of tools can be accessed to predict pharmacological effects, mechanisms of action, toxicity, and many other properties relevant to bioactivity screening and drug discovery. Among the many publications involving PASS, a few examples of application to natural products can be found. For example, Goel et al. (2011) presented a study on selected Ayurvedic medicinal plants and their main known secondary metabolites: Withania somnifera (withanolide A), Curcuma longa (curcumin), Boerhaavia diffusa (liriodendrin), Piper longum (piperine), and Allium sativum (allicin). The aim of the study was to compare the PASS predictions with the existing empirical reports on the compounds’ activities in order to evaluate the relevance of the model. It was concluded that the PASS predictions for the selected phytochemicals corresponded well to the activities reported in the literature. PASS-predicted activities with a high prediction score that have not yet been studied offer unexplored potential for studying new indications of known phytochemicals and their source plants. LigandScout, a commercial pharmacophore modelling tool for drug discovery (Wolber and Langer, 2005), has been used several times since its creation for molecular docking studies involving natural products.
Of the 43 publications that can be found in PubMed with the keyword ‘LigandScout’, eight are in silico bioactivity studies with natural compounds. However, there are more studies with natural products that mention the software name only in the text and cannot easily be found in PubMed. Among them, the study by Rollinger et al. (2009) exemplifies the applicability of the tool to modern natural product bioactivity research, combining an in silico prestudy with a wet-lab study of the predicted hits. The authors used LigandScout for 3D pharmacophore-based parallel screening to find potential bioactivity targets, an approach called ‘target fishing’, for a selected medicinal plant, in this case Ruta graveolens and its known, isolated constituents. In this approach, the target is not fixed with a suitable ligand to be fitted to it; instead, the ligands are known and the targets that fit them are to be identified. In phytochemical bioactivity research, such a reversed in silico approach can be a complementary strategy when, in parallel, the wet-lab approach involves metabolomics dereplication experiments.

Most recently, Das et al. (2017) presented a comprehensive example of an in silico study to verify bioactivity at a specific target. In this study, a set of chosen ligands was docked to an Alzheimer-related enzyme target. The use of different tools for ADME and toxicology, drug likeness, target selection, and virtual docking led to a QSAR model. This broad in silico approach, using different computational tools in one study, allowed the identification of potential AChE-inhibiting actives against Alzheimer’s disease, with subsequent verification in vitro.

Extracts, as natural multicompound mixtures, might exert biological activities at different targets at the same time in an organism. Thus, they might contribute to multiple pharmacological activities in the form of network effects within complex biochemical pathways. The study of such additive or synergistic effects is known as ‘Network Pharmacology’. Cytoscape is a tool that was designed to visualize such network relations, for example in biochemical pathways or in the network activity of multicompound active ingredients (Shannon et al., 2013).

Generally, many common and often free molecular docking tools that are not specifically designed for natural products can of course be used as well. In Table 10.1, we list, besides commercial tools, various web-based or freely available tools, e.g., Autodock. It is impossible to treat this topic both concisely and comprehensively. Therefore, we have decided to give an example based on the CB2 receptor, because of the current revival of interest in cannabis as a source of natural bioactive compounds for therapeutic, and not only recreational, uses.
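The network view of multicompound extracts described above can be illustrated with a small sketch. It assumes a purely hypothetical set of compound-to-target annotations (the names `compound_A`, `target_1`, etc. are placeholders, not data from any study cited here) and shows how such annotations can be inverted to find targets addressed by several constituents at once, and flattened into an edge list of the kind that network tools such as Cytoscape can typically import as a table.

```python
from collections import defaultdict

# Hypothetical compound -> target annotations for one extract.
# All names are illustrative placeholders, not experimental data.
compound_targets = {
    "compound_A": {"target_1", "target_2"},
    "compound_B": {"target_2", "target_3"},
    "compound_C": {"target_2"},
}


def invert(mapping):
    """Build the target -> compounds view of a compound -> targets mapping."""
    inverted = defaultdict(set)
    for compound, targets in mapping.items():
        for target in targets:
            inverted[target].add(compound)
    return dict(inverted)


def shared_targets(mapping, min_compounds=2):
    """Return targets addressed by at least `min_compounds` constituents."""
    return {
        target: sorted(compounds)
        for target, compounds in invert(mapping).items()
        if len(compounds) >= min_compounds
    }


def edge_list(mapping):
    """Return sorted (compound, target) pairs, one per network edge."""
    return sorted(
        (compound, target)
        for compound, targets in mapping.items()
        for target in targets
    )


if __name__ == "__main__":
    print(shared_targets(compound_targets))
    for compound, target in edge_list(compound_targets):
        print(f"{compound}\t{target}")
```

Targets hit by several constituents are candidates for additive or synergistic effects, but the sketch says nothing about the direction or strength of any interaction; those would have to come from experimental or predicted activity data.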

10.7.  APPLICATION EXAMPLE OF AN IN SILICO ASSESSMENT OF BIOACTIVITIES ON THE EXAMPLE OF THE CANNABINOID RECEPTOR 2

The cannabinoid type-2 (CB2) receptor belongs to the class A rhodopsin-like G protein-coupled receptor (GPCR) family, which also includes the cannabinoid type-1 (CB1) receptor; both are the main targets of endogenous cannabinoids, which consist of arachidonic acid linked to a polar head group, such as N-arachidonoylethanolamide and 2-arachidonoylglycerol. The two cannabinoid receptors show an overall sequence identity of 48%, with 68% identity in the ligand-binding domain of the transmembrane-spanning regions. Upon activation, both receptors modulate various intracellular signal transduction pathways, such as inhibition of adenylate cyclase and stimulation of phospholipase C, and, to a lesser extent, those involving β-arrestin, ceramide, and ion channel modulation (Pertwee, 2006; Svízenská et al., 2008; Dhopeshwarkar and Mackie, 2014). While CB1 receptors are located mainly in the central nervous system, CB2 receptor expression is higher in immune cells. However, a strict separation of cannabinoid receptors into central (CB1) and peripheral (CB2) receptors is not possible, since immune cells also express CB1 receptors and CB2 receptors are present in the central nervous system (Gertsch et al., 2006). The homology between the two receptors and the higher tissue expression of CB1 make it more challenging to develop selective ligands that target only CB2 receptors. Moreover, high selectivity is required to determine the precise role of each receptor in various pathological and physiological processes and to avoid the psychotropic effects related to modulation of CB1 receptors (Pacher et al., 2006; Soethoudt et al., 2017). Thus, the development of selective CB2 receptor ligands as potential drug candidates is mandatory for the treatment of various diseases, such as chronic and inflammatory pain, pruritus, inflammatory bowel disease, diabetic neuropathy and nephropathy, neurodegenerative diseases such as Alzheimer’s and Parkinson’s disease, multiple sclerosis, liver cirrhosis, and atherosclerosis (Cabral and Griffin-Thomas, 2009; Sharma et al., 2015; Carnovali et al., 2016; Gómez-Gálvez et al., 2016; Laprairie et al., 2017).

Nowadays, in silico tools are frequently used in drug discovery campaigns to complement experimental high-throughput screening in the identification and selection of the most promising candidates for clinical development (Rollinger et al., 2009; Schuster, 2010). In natural product research, virtual screening for selective CB2 receptor ligands is a promising in silico approach for the identification of novel active molecules and the pharmacological profiling of plant extracts (Poso and Huffman, 2008; Schuster, 2010; Brogi et al., 2014). Herein, we aim to give a short overview of different in silico approaches for the identification of natural products selectively targeting the CB2 receptor.

Members of the N-alkyl amides from Echinacea sp. that bind CB2 receptors were discovered by molecular modelling, thus providing a first insight into the mechanism of action for the pharmacological effects commonly associated with Echinacea extracts. The CB2 receptor model used rhodopsin as a template, together with SYBYL for the generation of 3D pharmacophore structures and docking experiments. Bioactivity screening using radioligand displacement assays on the CB2 receptor revealed that dodeca-2E,4E,8Z,10Z-tetraenoic acid N-isobutyl amide and dodeca-2E,4E-dienoic acid N-isobutyl amide selectively interact with the CB2 receptor (Ki ∼ 60 nM) via hydrogen bonding and π–π interactions in the solvent-accessible cavity of the receptor (Raduner et al., 2006).
A virtual screening approach using docking experiments into the CB2 receptor was applied to constituents of the aerial parts and roots of Otanthus maritimus L. (Asteraceae), an aromatic plant native to the Mediterranean coasts. Theoretical 3D models of the most active compounds were built with the MAESTRO GUI. GLIDE XP was applied for the docking study; a 3D model of the human CB2 receptor was generated by homology modelling using the human A2A adenosine receptor as template. Among the 16 identified constituents, one novel alkylamide, 1-((2E,4E,8Z)-tetradecatrienoyl)piperidine, was discovered as a potent binder of CB2 receptors, with a Ki value of 0.16 μM (Ruiu et al., 2013).

β-Caryophyllene, a bicyclic sesquiterpene commonly found in spices and also a major constituent of Cannabis sativa L., has been shown to selectively target the CB2 receptor in a radioligand displacement assay (Ki = 155 nM). Molecular docking was used to assess the specificity of β-caryophyllene for the cannabinoid receptor, showing π–π stacking interactions with the hydrophobic region of the solvent-accessible cavity of the receptor. In this study, a rhodopsin-based homology model was used for the CB2 receptor, and SYBYL was used for the generation of 3D pharmacophore structures and docking studies (Gertsch et al., 2008).

Markt et al. (2009) developed a 3D structure-based pharmacophore model with LIGANDSCOUT for the discovery of novel scaffolds acting as CB2 receptor ligands by virtual screening of large chemical databases. CATALYST was used for the generation of ligand-based pharmacophore models and pharmacophore-based virtual screening; the model was based on a CB2 training set comprising five selective agonists: AM1241, GW405833, HU-308, JWH-133, and JWH-267. The workflow resulted in the selection of 14 compounds for biological testing. The CB2 receptor-selective pyridine tetrahydrocannabinol analogue exhibited modulating activity with low-μM Ki values and was identified as a CB2 partial agonist. Moreover, two acetamide derivatives were identified as new scaffolds for CB2 receptor-selective antagonists and inverse agonists, respectively.

Rollinger et al. (2009) applied a parallel screening approach to identify potential targets for 16 secondary metabolites isolated from the aerial parts of the medicinal plant Ruta graveolens L. (Rutaceae). Low-energy conformational models were generated with LIGANDSCOUT, and their 3D structures were virtually screened (using CATALYST) against 2208 in-house pharmacophore models. The hitting model for CB2 ligands was a ligand-based model that had previously been developed and applied for virtual screening of natural compound databases by Markt et al. (2009). Only one virtual CB2 ligand was identified from Rutae herba, namely the coumarin derivative rutamarin, which showed significant selectivity for the CB2 receptor (Ki of 7.4 μM) in a radioligand displacement assay. These results showed that parallel screening is a promising in silico tool for target fishing and bioactivity profiling of extracts and natural compounds (Rollinger et al., 2009).

All of the above-mentioned in silico methods have been applied to the identification of novel bioactives targeting the CB2 receptor, but the suitability of these tools appears to depend on several factors (e.g., the available databases, the correct setting of model parameters, and the availability of data on the properties of the target and its binding ligands).
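The Ki values quoted above are given in mixed units (nM and μM), which makes them awkward to compare at a glance. As a minimal sketch, the values can be converted to mol/L and expressed as pKi = −log10(Ki), where a larger pKi means tighter binding. The dictionary below simply restates the Ki values from the text; because they come from different studies and assays, the resulting ranking should be read as indicative only, not as a head-to-head comparison.

```python
import math

# Ki values as quoted in the text, converted to mol/L.
reported_ki_molar = {
    "Echinacea N-isobutyl amides (Raduner et al., 2006)": 60e-9,   # ~60 nM
    "1-((2E,4E,8Z)-tetradecatrienoyl)piperidine (Ruiu et al., 2013)": 0.16e-6,
    "beta-caryophyllene (Gertsch et al., 2008)": 155e-9,
    "rutamarin (Rollinger et al., 2009)": 7.4e-6,
}


def pki(ki_molar):
    """Return pKi = -log10(Ki in mol/L); larger values mean tighter binding."""
    if ki_molar <= 0:
        raise ValueError("Ki must be positive")
    return -math.log10(ki_molar)


# Sort from the lowest Ki (tightest reported binding) to the highest.
ranked = sorted(reported_ki_molar.items(), key=lambda item: item[1])

if __name__ == "__main__":
    for name, ki in ranked:
        print(f"{name}: Ki = {ki * 1e9:.0f} nM, pKi = {pki(ki):.2f}")
```

Working on the logarithmic scale also makes clear that the three alkylamide and sesquiterpene ligands lie within about half a log unit of each other, whereas rutamarin is roughly two log units weaker.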
Poor target validation of the CB2 receptor seems to be the main cause of unsatisfactory results, with the lack of translation of its putative modulators into clinical studies. Therefore, in order to study the role of CB2 receptors in physiological and pathological processes, a consensus has been reached regarding recommended selective agonists for in silico screening of natural bioactives. In a joint effort between multiple academic and industrial laboratories, these selective CB2 agonists were found to be HU910, HU308, and JWH133, compounds that are devoid of in vivo CB1 effects when tested in the mouse cannabinoid triad (antinociception, catalepsy, and hypothermia) (Soethoudt et al., 2017). In summary, the above examples demonstrate how in silico tools are integrated into the drug development process, and how a rational selection of screening methods can be a strategy for the successful identification of natural bioactives targeting CB2 receptors.
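The parallel screening (‘target fishing’) idea used by Rollinger et al. (2009), in which known ligands are screened against many target models rather than many ligands against one target, can be sketched in a few lines. The sketch below is a deliberately crude stand-in: it treats each pharmacophore model as a set of required feature counts and ignores 3D geometry and conformers entirely, which real tools such as LIGANDSCOUT or CATALYST do not. All feature codes, model names, and ligand names are hypothetical.

```python
# Hypothetical pharmacophore models: required feature counts per target.
# Feature codes (HBA, HBD, HYD, AR) and all names are illustrative only;
# real pharmacophore models also encode 3D distances and tolerances.
models = {
    "target_X": {"HBA": 2, "HYD": 1},
    "target_Y": {"HBD": 1, "AR": 2},
}

# Hypothetical feature counts for known, isolated constituents.
ligands = {
    "constituent_1": {"HBA": 3, "HYD": 2, "AR": 1},
    "constituent_2": {"HBD": 1, "AR": 2},
}


def matches(ligand_features, model_features):
    """True if the ligand offers at least every feature the model requires."""
    return all(
        ligand_features.get(feature, 0) >= count
        for feature, count in model_features.items()
    )


def target_fishing(ligands, models):
    """For each known ligand, list the models (targets) that it hits."""
    return {
        name: sorted(
            target for target, model in models.items() if matches(features, model)
        )
        for name, features in ligands.items()
    }


hits = target_fishing(ligands, models)

if __name__ == "__main__":
    for ligand, targets in hits.items():
        print(ligand, "->", targets or "no hit")
```

The point of the sketch is only the inversion of the usual loop: the ligands are fixed and the output is a list of candidate targets per ligand, which would then need experimental confirmation, as in the radioligand displacement assay reported for rutamarin.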

10.8.  OVERVIEW OF SOFTWARE AND WEB-TOOLS FOR BIOACTIVE PHYTOCHEMICALS RESEARCH

Several computational tools are available, either commercially or as open-source projects. To summarize the computational tools mentioned in this chapter, we provide a sorted list in Table 10.1.

TABLE 10.1  A Sorted List of Computational Tools Covered in the Text

Name | Content | URL | Citation If Applicable

Software for data analysis and data management
Matlab | Mathematical Programming | https://ch.mathworks.com/products/matlab.html | -
R | Freeware Mathematical Programming | https://www.r-project.org/about.html | -
EWorkBook Suite 2 | Electronic laboratory notebook (ELN) | https://www.idbs.com/discover-e-workbook/ | Beato et al. (2011)
CyBy | Structure-based data management tool for chemical and biological data | https://blogs.oracle.com/geertjan/cyby2:-storevisualize-chemical-biological-data ; https://www.youtube.com/watch?v=5aZxh2SoiWs | Höck and Riedl (2012)
MASTR-MS | Web-based collaborative LIMS for metabolomics | https://muccg.github.io/mastr-ms/ | Hunter et al. (2017)
Cytoscape | Network pharmacology tool and more | http://www.cytoscape.org/ | Shannon et al. (2013)
Image J | Java-based image analysis tool provided by the NIH | https://imagej.nih.gov/ij/ | -
rTLC | Image analysis and multivariate data analysis of TLC fingerprints | http://shinyapps.ernaehrung.uni-giessen.de/rtlc | Fichou et al. (2016)

Software for molecular docking and in silico screening
SYBYL-X by Certara, formerly Tripos Inc. | Drug design, safety, off target pharmacology; Ligand-based or structure-based virtual screening, and chemical library design; QSAR | https://www.certara.com/wp-content/uploads/Resources/Brochures/BR_SYBYL-X_MolecularModeling.pdf | Markt et al. (2009); Ruiu et al. (2013)
CATALYST (Tool in Biovia Discovery Studio by Accelrys Inc.) | Small molecules affinity modelling | http://accelrys.com/products/collaborative-science/biovia-discovery-studio/ | Markt et al. (2009) and Laprairie et al. (2017)
CHEMGPSNP(web) | Analysis and comparison of natural product compound libraries by eight dimensional cluster analysis physicochemical characteristics | http://chemgps.bmc.uu.se | Larsson et al. (2007) and Rosén et al. (2009)
LIGANDSCOUT | Use of Pharmacophore models for virtual screening | http://www.inteligand.com/ligandscout3 | Wolber and Langer (2005), Poso and Huffman (2008), and Markt et al. (2009)
PASS | Predicts currently 4000 different bioactivities for a compound | http://www.way2drug.com/PASSOnline/ | Goel et al. (2011)
AUTODOCK/AUTODOCK Vina | One of the most frequently used docking programs | http://vina.scripps.edu/ | Chen (2015)
MOE | Different Drug Discovery Software Packages | https://www.chemcomp.com/MOE-Molecular_Operating_Environment.htm | Markt et al. (2009)
Molsoft L.L.C. | Online portal for drug likeness | http://www.molsoft.com/ | Jensen et al. (2015)
Mobyle@RPBS | Online portal for in silico ADME and toxicity evaluation | http://mobyle.rpbs.univ-paris-diderot.fr | Jensen et al. (2015)

For a recent review of open-source molecular modelling tools, see Pirhadi et al. (2016). For an extensive recent list of docking software tools, see the article ‘Beware of docking!’ by Chen (2015).

10.9. CONCLUSIONS

Drug discovery from natural sources requires a multidisciplinary approach whose success depends on a well-chosen set of techniques for separating complex natural mixtures and on in vitro and in vivo assays with adequate controls. Computational tools are available at all levels of this discovery process. The proper and attentive application of such tools requires interdisciplinary collaboration between phytochemists and experts in various related fields, especially with regard to the acquisition of the necessary data and to mathematical and computational techniques. In silico tools are frequently used in drug discovery campaigns to complement experimental high-throughput screening (HTS) in the identification and selection of the most promising candidates for clinical development. In natural product research, virtual screening is a promising tool for the identification of novel active molecules and the pharmacological profiling of plant extracts. ‘Reverse pharmacology’ is used to screen libraries of plant-derived compounds against specific targets, thus identifying potential ‘hits’ with the desired activity that can be further validated in selected in vivo models. The limitations lie in the quality of the models; controlling that quality is as important as the form, representative number, and content quality of the data sets provided as model input, so as to avoid a ‘rubbish in, rubbish out’ situation in the research and development project. As bioanalytical platforms used in plant metabolomics studies generate increasingly complex data, advances in chemometrics are key to improving model interpretation and biochemical relevance in this area of research.

REFERENCES

Amigoni, F., Schiaffonati, V., 2007. The multiagent technology and paradigm within scientific discovery. Int. J. Artif. Intell. Tools 16 (2), 219–242. https://doi.org/10.1142/S0218213007003291.
Atanasov, A.G., Waltenberger, B., Pferschy-Wenzig, E.M., Linder, T., Wawrosch, C., Uhrin, P., Temml, V., Wang, L., Schwaiger, S., Heiss, E.H., Rollinger, J.M., Schuster, D., Breuss, J.M., Bochkov, V., Mihovilovic, M.D., Kopp, B., Bauer, R., Dirsch, V.M., Stuppner, H., 2015. Discovery and resupply of pharmacologically active plant-derived natural products: a review. Biotechnol. Adv. 33, 1582–1614. https://doi.org/10.1016/j.biotechadv.2015.08.001.
Bansal, A., Chhabra, V., Ravindra, K., Rawal, R.K., Sharma, S., 2014. Chemometrics: a new scenario in herbal drug standardization. J. Pharm. Anal. 4, 223–233. https://doi.org/10.1016/j.jpha.2013.12.001.
Barrett, J.E., McGonigle, P., 2017. Rodent models for Alzheimer’s disease in drug discovery. In: Adejare, A. (Ed.), Drug Discovery Approaches for the Treatment of Neurodegenerative Disorders. Elsevier, Amsterdam, pp. 235–247.
Beato, B., Pisek, A., White, J., Grever, T., Engel, B., Pugh, M., Schneider, M., Carel, B., Branstrator, L., Shoup, R., 2011. Going paperless: implementing an electronic laboratory notebook in a bioanalytical laboratory. Bioanalysis 3, 1457–1470. https://doi.org/10.4155/BIO.11.117.

Boccard, J., Rudaz, S., 2014. Harnessing the complexity of metabolomics data with chemometrics. J. Chemometr. 28, 1–9. https://doi.org/10.1002/cem.2567.
Bräm, S., Wolfram, E., 2017. Recent advances in effect-directed enzyme assays based on thin-layer chromatography. Phytochem. Anal. 28, 74–78. https://doi.org/10.1002/pca.2669.
Brogi, S., Taf, A., Désaubry, L., Nebigil, C.G., 2014. Discovery of GPCR ligands for probing signal transduction pathways. Front. Pharmacol. 5, 255. https://doi.org/10.3389/fphar.2014.00255.
Bucar, F., Wolfram, E., 2017. Bioassay-coupled chromatography: challenges and applications in natural product research. Phytochem. Anal. 28, 73. https://doi.org/10.1002/pca.2675.
Butterweck, V., Nahrstedt, A., 2012. What is the best strategy for preclinical testing of botanicals? A critical perspective. Planta Med. 78, 747–754. https://doi.org/10.1055/s-0031-1298434.
Cabral, G.A., Griffin-Thomas, L., 2009. Emerging role of the cannabinoid receptor CB2 in immune regulation: therapeutic prospects for neuroinflammation. Expert Rev. Mol. Med. 11, e3. https://doi.org/10.1017/S1462399409000957.
Carnovali, M., Ottria, R., Pasqualetti, S., Banfi, G., Ciuffreda, P., Mariotti, M., 2016. Effects of bioactive fatty acid amide derivatives in zebrafish scale model of bone metabolism and disease. Pharmacol. Res. 104, 1–8. https://doi.org/10.1016/j.phrs.2015.12.009.
Chen, Y.C., 2015. Beware of docking! Trends Pharmacol. Sci. 36, 78–95. https://doi.org/10.1016/j.tips.2014.12.001.
Choi, W., Kim, B., Cho, H., Lee, D., Lee, H., 2016. A corpus for plant-chemical relationships in the biomedical domain. BMC Bioinf. 17, 386–400. https://doi.org/10.1186/s12859-016-1249-5.
Choma, I.M., Grzelak, E.M., 2011. Bioautography detection in thin-layer chromatography. J. Chromatogr. A 1218, 2684–2691.
Curtis, C.G., Bilyard, K., Stephenson, H., 2008. Ex vivo metrics, a preclinical tool in new drug development. J. Transl. Med. 6, 5. https://doi.org/10.1186/1479-5876-6-5.
Das, S., Laskar, M.A., Sarker, S.D., Choudhury, M.D., Choudhury, P.R., Mitra, A., Jamil, S., Lathiff, S.M.A., Abdullah, S.A., Basar, N., Nahar, L., Talukdara, A.D., 2017. Prediction of anti-Alzheimer’s activity of flavonoids targeting acetylcholinesterase in silico. Phytochem. Anal. 28, 324–331. https://doi.org/10.1002/pca.2679.
Deveau, A.P., Bentley, V.L., Berman, J.N., 2017. Using zebrafish models of leukemia to streamline drug screening and discovery. Exp. Hematol. 45, 1–9. https://doi.org/10.1016/j.exphem.2016.09.012.
Dhopeshwarkar, A., Mackie, K., 2014. CB2 Cannabinoid receptors as a therapeutic target-what does the future hold? Mol. Pharmacol. 86, 430–437. https://doi.org/10.1124/mol.114.094649.
Donno, D., Beccaro, G.L., Carlen, C., Ançay, A., Cerutti, A.K., Mellano, M.G., Bounous, G., 2016. Analytical fingerprint and chemometrics as phytochemical composition control tools in food supplement analysis: characterization of raspberry bud preparations of different cultivars. J. Sci. Food Agric. 96, 3157–3168. https://doi.org/10.1002/jsfa.7494.
Eliasson, M., Rännar, S., Trygg, J., 2011. From data processing to multivariate validation - essential steps in extracting interpretable information from metabolomics data. Curr. Pharm. Biotechnol. 12, 996–1004. https://doi.org/10.2174/138920111795909041.
Favre-Godal, Q., Queiroz, E.F., Wolfender, J.L., 2013. Latest developments in assessing antifungal activity using TLC-bioautography: a review. J. AOAC Int. 96, 1175–1188. https://doi.org/10.5740/jaoacint.SGEFavre-Godal.
Fichou, D., Ristivojević, P., Morlock, G.E., 2016. Proof-of-principle of rTLC, an open-source software developed for Image evaluation and multivariate analysis of planar chromatograms. Anal. Chem. 88, 12494–12501. https://doi.org/10.1021/acs.analchem.6b04017.
Gasteiger, J., Teckentrup, A., Terfloth, L., Spycher, S., 2003. Neural networks as data mining tools in drug design. J. Phys. Org. Chem. 16, 232–245. https://doi.org/10.1002/poc.597.

Gertsch, J., Raduner, S., Altmann, K.H., 2006. New natural noncannabinoid ligands for cannabinoid type-2 (CB2) receptors. J. Recept. Signal Transduct. Res. 26, 709–730. https://doi.org/10.1080/10799890600942674.
Gertsch, J., Leonti, M., Raduner, S., Racz, I., Chen, J.Z., Xie, X.Q., Altmann, K.H., Karsak, M., Zimmer, A., 2008. Beta-caryophyllene is a dietary cannabinoid. Proc. Natl. Acad. Sci. U. S. A. 105, 9099–9104.
Goel, R.K., Singh, D., Lagunin, A., Poroikov, V., 2011. PASS-assisted exploration of new therapeutic potential of natural products. Med. Chem. Res. 20, 1509–1514. https://doi.org/10.1007/s00044-010-9398-y.
Gómez-Gálvez, Y., Palomo-Garo, C., Fernández-Ruiz, J., García, C., 2016. Potential of the cannabinoid CB(2) receptor as a pharmacological target against inflammation in Parkinson’s disease. Prog. Neuropsychopharmacol. Biol. Psychiatry 64, 200–208. https://doi.org/10.1016/j.pnpbp.2015.03.017.
Heinrich, M., Jäger, A.K., 2016. Ethnopharmacology, ULLA Series in Pharmaceutical Sciences. Wiley-Blackwell, Chichester.
Héberger, K., 2007. Quantitative structure–(chromatographic) retention relationships. J. Chromatogr. A 1158, 273–305. https://doi.org/10.1016/j.chroma.2007.03.108.
Höck, S., Riedl, R., 2012. CyBy2: a structure-based data management tool for chemical and biological data. Chimia 66, 132–134. https://doi.org/10.2533/chimia.2012.132.
Hoon, S., St Onge, R.P., Giaever, G., Nislow, C., 2008. Yeast chemical genomics and drug discovery: an update. Trends Pharmacol. Sci. 29, 499–504. https://doi.org/10.1016/j.tips.2008.07.006.
Hunter, A., Dayalan, S., De Souza, D., Power, B., Lorrimar, R., Szabo, T., Nguyen, T., O’Callaghan, S., Hack, J., Pyke, J., Nahid, A., Barrero, R., Roessner, U., Likic, V., Tull, D., Bacic, A., McConville, M., Bellgard, M., 2017. MASTR-MS: a web-based collaborative laboratory information management system (LIMS) for metabolomics. Metabolomics 13, 14. https://doi.org/10.1007/s11306-016-1142-2.
Imming, P., Sinning, C., Meyer, A., 2006. Drugs, their targets and the nature and number of drug targets. Nat. Rev. Drug Discov. 5, 821–834. https://doi.org/10.1038/nrd2132.
Jansen, J.J., Smit, S., Hoefsloot, H.C.J., Smilde, A.K., 2010. The photographer and the greenhouse: how to analyse plant metabolomics data. Phytochem. Anal. 21, 48–60. https://doi.org/10.1002/pca.1181.
Jensen, K., Panagiotou, G., Kouskoumvekaki, I., 2015. NutriChem: a systems chemical biology resource to explore the medicinal value of plant-based foods. Nucleic Acids Res. 43, D940–D945. https://doi.org/10.1093/nar/gku724.
Kaliszan, R., 2007. Quantitative structure—chromatographic retention relationships. Chem. Rev. 107, 3212–3246. https://doi.org/10.1021/cr068412z.
Kizawa, H., Nagao, E., Shimamura, M., Zhang, G., Torii, H., 2017. Scaffold-free 3D bio-printed human liver tissue stably maintains metabolic functions useful for drug discovery. Biochem. Biophys. Rep. 10, 186–191. https://doi.org/10.1016/j.bbrep.2017.04.004.
Kubinyi, H., 2003. Drug research: myths, hype and reality. Nat. Rev. Drug Discov. 2, 665–668. https://doi.org/10.1038/nrd1156.
Laprairie, R.B., Bagher, A.M., Denovan-Wright, E.M., 2017. Cannabinoid receptor ligand bias: implications in the central nervous system. Curr. Opin. Pharmacol. 32, 32–43. https://doi.org/10.1016/j.coph.2016.10.005.
Larsson, J., Gottfries, J., Muresan, S., Backlund, A., 2007. ChemGPS-NP: tuned for navigation in biologically relevant chemical space. J. Nat. Prod. 70, 789–794. https://doi.org/10.1021/np070002y.
Lee, J., Bogyo, M., 2013. Target deconvolution techniques in modern phenotypic profiling. Curr. Opin. Chem. Biol. 17, 118–126. https://doi.org/10.1016/j.cbpa.2012.12.022.

Leonti, M., 2011. The future is written: impact of scripts on the cognition, selection, knowledge and transmission of medicinal plant use and its implications for ethnobotany and ethnopharmacology. J. Ethnopharmacol. 134, 542–555. https://doi.org/10.1016/j.jep.2011.01.017.
Leonti, M., Stafford, G.I., Cero, M.D., Cabras, S., Castellanos, M.E., Casu, L., Weckerle, C.S., 2017. Reverse ethnopharmacology and drug discovery. J. Ethnopharmacol. 198, 417–431. https://doi.org/10.1016/j.jep.2016.12.044.
Li, J.W.H., Vederas, J.C., 2009. Drug discovery and natural products: end of an era or an endless frontier? Science 325, 161–165. https://doi.org/10.1126/science.1168243.
Lipinski, C.A., Lombardo, F., Dominy, B.W., Feeney, P.J., 2001. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 46, 3–26. https://doi.org/10.1016/j.addr.2012.09.019.
Luni, C., Serena, E., Elvassore, N., 2014. Human-on-chip for therapy development and fundamental science. Curr. Opin. Biotechnol. 25, 45–50. https://doi.org/10.1016/j.copbio.2013.08.015.
Luo, Z., Liu, Y., Zhao, B., Tang, M., Dong, H., Zhang, L., Lv, B., Wei, L., 2013. Ex vivo and in situ approaches used to study intestinal absorption. J. Pharmacol. Toxicol. Methods 68, 208–216. https://doi.org/10.1016/j.vascn.2013.06.001.
Markt, P., Feldmann, C., Rollinger, J.M., Raduner, S., Schuster, D., Kirchmair, J., Distinto, S., Spitzer, G.M., Wolber, G., Laggner, C., Altmann, K.H., Langer, T., Gertsch, J., 2009. Discovery of novel CB2 receptor ligands by a pharmacophore-based virtual screening workflow. J. Med. Chem. 52, 369–378. https://doi.org/10.1021/jm801044g.
Marston, A., 2011. Thin-layer chromatography with biological detection in phytochemistry. J. Chromatogr. A 1218, 2676–2683. https://doi.org/10.1016/j.chroma.2010.12.068.
Martin, J.G., Novali, M., 2011. Small animals models for drug discovery. Pulm. Pharmacol. Ther. 24, 513–524. https://doi.org/10.1016/j.pupt.2011.05.002.
Misra, B.B., Fahrmann, J.F., Grapov, D., 2017. Review of emerging metabolomics tools and resources: 2015-2016. Electrophoresis. (in press) https://doi.org/10.1002/elps.201700110.
Molnar, I., 2002. Computerized design of separation strategies by reversed-phase liquid chromatography: development of DryLab software. J. Chromatogr. A 965, 175–194. https://doi.org/10.1016/S0021-9673(02)00731-8.
Muigg, P., Rosén, J., Bohlin, L., Backlund, A., 2013. In silico comparison of marine, terrestrial and synthetic compounds using ChemGPS-NP for navigating chemical space. Phytochem. Rev. 12, 449–457. https://doi.org/10.1007/s11101-012-9256-2.
Munro, G., Jansen-Olesen, I., Olesen, J., 2017. Animal models of pain and migraine in drug discovery. Drug Discov. Today 22, 1103–1111. https://doi.org/10.1016/j.drudis.2017.04.016.
Nikitas, P., Pappa-Louisi, A., 2009. Retention models for isocratic and gradient elution in reversed-phase liquid chromatography. J. Chromatogr. A 1216, 1737–1755. https://doi.org/10.1016/j.chroma.2008.09.051.
O’Reilly, L.P., Luke, C.J., Perlmutter, D.H., Silverman, G.A., Pak, S.C., 2014. C. elegans in high-throughput drug discovery. Adv. Drug Deliv. Rev. 69–70, 247–253. https://doi.org/10.1016/j.addr.2013.12.001.
Pacher, P., Bátkai, S., Kunos, G., 2006. The endocannabinoid system as an emerging target of pharmacotherapy. Pharmacol. Rev. 58, 389–462. https://doi.org/10.1124/pr.58.3.2.
Pak, M.E., Kim, Y.R., Kim, H.K., Ahn, S.M., Shin, H.K., Baek, J.U., Choi, B.T., 2016. Studies on medicinal herbs for cognitive enhancement based on the text mining of Dongeuibogam and preliminary evaluation of its effects. J. Ethnopharmacol. 179, 383–390. https://doi.org/10.1016/j.jep.2016.01.006.
Parasuraman, S., 2011. Toxicological screening. J. Pharmacol. Pharmacother. 2, 74–79. https://doi.org/10.4103/0976-500X.81895.
Peng, W., Unutmaz, D., Ozbolat, I.T., 2016. Bioprinting towards physiologically relevant tissue models for pharmaceutics. Trends Biotechnol. 34, 722–732. https://doi.org/10.1016/j.tibtech.2016.05.013.

Pertwee, R.G., 2006. The pharmacology of cannabinoid receptors and their ligands: an overview. Int. J. Obes. (Lond) 30 (Suppl 1), S13–18. https://doi.org/10.1038/sj.ijo.0803272.
Pirhadi, S., Sunseri, J., Koes, D.R., 2016. Open source molecular modeling. J. Mol. Graph. Model. 69, 127–143. https://doi.org/10.1016/j.jmgm.2016.07.008.
Poso, A., Huffman, J.W., 2008. Targeting the cannabinoid CB2 receptor: modelling and structural determinants of CB2 selective ligands. Br. J. Pharmacol. 153, 335–346. https://doi.org/10.1038/sj.bjp.0707617.
Potterat, O., Hamburger, M., 2013. Concepts and technologies for tracking bioactive compounds in natural product extracts: generation of libraries, and hyphenation of analytical processes with bioassays. Nat. Prod. Rep. 30, 546–564. https://doi.org/10.1039/C3NP20094A.
Raduner, S., Majewska, A., Chen, J.Z., Xie, X.Q., Hamon, J., Faller, B., Altmann, K.H., Gertsch, J., 2006. Alkylamides from Echinacea are a new class of cannabinomimetics. J. Biol. Chem. 281, 14192–14206. https://doi.org/10.1074/jbc.M601074200.
Rask-Andersen, M., Almen, M.S., Schioth, H.B., 2011. Trends in the exploitation of novel drug targets. Nat. Rev. Drug Discov. 10, 579–590. https://doi.org/10.1038/nrd3478.
Rausch, O., 2006. High content cellular screening. Curr. Opin. Chem. Biol. 10, 316–320. https://doi.org/10.1016/j.cbpa.2006.06.004.
Rollinger, J.M., Langer, T., Stuppner, H., 2006. Integrated in silico tools for exploiting the natural products’ bioactivity. Planta Med. 72, 671–678. https://doi.org/10.1055/s-2006-941506.
Rollinger, J.M., Stuppner, H., Langer, T., 2008. Virtual screening for the discovery of bioactive natural products. In: Petersen, F., Amstutz, R. (Eds.), Natural Compounds as Drugs Volume I. Progress in Drug Research Book Series, vol. 65. Birkhäuser Verlag AG, Basel/Boston, MA/Berlin, pp. 211–249.
Rollinger, J.M., Schuster, D., Danzl, B., Schwaiger, S., Markt, P., Schmidtke, M., Gertsch, J., Raduner, S., Wolber, G., Langer, T., Stuppner, H., 2009. In silico target fishing for rationalized ligand discovery exemplified on constituents of Ruta graveolens. Planta Med. 75, 195–204. https://doi.org/10.1055/s-0028-1088397.
Rosén, J., Lövgren, A., Kogej, T., Muresan, S., Gottfries, J., Backlund, A., 2009. ChemGPS-NPWeb: chemical space navigation online. J. Comput. Aided Mol. Des. 23, 253–259. https://doi.org/10.1007/s10822-008-9255-y.
Roussel, S., Preys, S., Chauchard, F., Lallemand, J., 2014. Multivariate data analysis (chemometrics). In: O’Donnell, C., Fagan, C., Cullen, P. (Eds.), Process Analytical Technology for the Food Industry. Food Engineering Series. Springer, New York, NY, pp. 7–14.
Ruiu, S., Anzani, N., Orrù, A., Floris, C., Caboni, P., Maccioni, E., Distinto, S., Alcaro, S., Cottiglia, F., 2013. N-Alkyl dien- and trienamides from the roots of Otanthus maritimus with binding affinity for opioid and cannabinoid receptors. Bioorg. Med. Chem. 21, 7074–7082. https://doi.org/10.1016/j.bmc.2013.09.017.
Sadym, A., Lagunin, A., Filimonov, D., Poroikov, V., 2003. Prediction of biological activity spectra via the internet. SAR QSAR Environ. Res. 14, https://doi.org/10.1080/10629360310001623935.
Schenone, M., Dancik, V., Wagner, B.K., Clemons, P.A., 2013. Target identification and mechanism of action in chemical biology and drug discovery. Nat. Chem. Biol. 9, 232–240. https://doi.org/10.1038/nchembio.1199.
Schuster, D., 2010. 3D pharmacophores as tools for activity profiling. Drug Discov. Today Technol. 7, e205–211. https://doi.org/10.1016/j.ddtec.2010.11.006.
Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B., Ideker, T., 2013. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504. https://doi.org/10.1101/gr.1239303.
Sharma, C., Sadek, B., Goyal, S.N., Sinha, S., Kamal, M.A., Ojha, S., 2015. Small molecules from nature targeting G-protein coupled cannabinoid receptors: potential leads for drug discovery and development. Evid. Based Complement. Alternat. Med. 238482, https://doi.org/10.1155/2015/238482.

300  Computational Phytochemistry Soethoudt, M., Grether, U., Fingerle, J., Grim, T.W., Fezza, F., de Petrocellis, L., Ullmer, C., Rothenhausler, B., Perret, C., van Gils, N., Finlay, D., MacDonald, C., Chicca, A., Gens, M.D., Stuart, J., de Vries, H., Mastrangelo, N., Xia, L., Alachouzos, G., Baggelaar, M.P., Martella, A., Mock, E.D., Deng, H., Heitman, L.H., Connor, M., Di Marzo, V., Gertsch, J., Lichtman, A.H., Maccarrone, M., Pacher, P., Glass, M., van der Stelt, M., 2017. Cannabinoid CB2 receptor ligand profiling reveals biased signalling and off-target activity. Nat. Commun. 8, 13958. https://doi.org/10.1038/ncomms13958. Stepnik, K., Malinowska, I., 2017. Skin-mimetic chromatography for prediction of human percutaneous absorption of biologically active compounds occurring in medicinal plant extracts. Biomed. Chromatogr. 31, 1–10. https://doi.org/10.1002/bmc.3922. Steuer, R., 2006. On the analysis and interpretation of correlations in metabolomic data. Brief. Bioinform. 7, 151–158. https://doi.org/10.1093/bib/bbl009. Strohl, W.R., 2000. The role of natural products in a modern drug discovery program. Drug Discov. Today 5, 39–41. Svízenská, I., Dubový, P., Sulcová, A., 2008. Cannabinoid receptors 1 and 2 (CB1 and CB2), their distribution, ligands and functional involvement in nervous system structures-a short review. Pharmacol. Biochem. Behav. 90, 501–511. https://doi.org/10.1016/j.pbb.2008.05.010. Trygg, J., Gullberg, J., Johansson, A.I., Jonsson, P., Moritz, T., 2006. Chemometrics in metabolomics—an introduction. In: Saito, K., Dixon, R.A., Willmitzer, L. (Eds.), Plant Metabolomics. Biotechnology in Agriculture and Forestryvol. 57. Springer, Berlin/Heidelberg, pp. 117–128. Turi, C.E., Finley, J., Shipley, P.R., Murch, S.J., Brown, P.N., Metabolomics for phytochemical discovery: development of statistical approaches using a cranberry model system. J. Nat. Prod. 78, 2015. 953–966. https://doi.org/10.1021/np500667z. 
van der Kooy, F., Maltese, F., Choi, Y.H., Kim, H.K., Verpoorte, R., 2009. Quality control of herbal material and phytopharmaceuticals with MS and NMR based metabolic fingerprinting. Planta Med. 75, 763–775. https://doi.org/10.1055/s-0029-1185450. Vogel, H.G., Vogel, W., 2002. Drug Discovery and Evaluation. Springer, New York. Wang, B., Deng, J., Gao, Y., Zhu, L., He, R., Xu, Y., 2011. The screening toolbox of bioactive substances from natural products: a review. Fitoterapia 82, 1141–1151. https://doi.org/10.1016/j. fitote.2011.08.007. Wang, Y., Jin, Y., Zhou, C., Qu, H., Cheng, Y., 2008. Discovering active compounds from mixture of natural products by data mining approach. Med. Biol. Eng. Comput. 46, 605–611. https://doi. org/10.1007/s11517-008-0323-1. Whiteside, G.T., Pomonis, J.D., Kennedy, J.D., 2013. An industry perspective on the role and utility of animal models of pain in drug discovery. Neurosci. Lett. 557, 65–72. https://doi. org/10.1016/j.neulet.2013.08.033. Wise, A., Gearing, K., Rees, S., 2002. Target validation of G-protein coupled receptors. Drug Discov. Today 7, 235–246. https://doi.org/10.1016/S1359-6446(01)02131-6. Wolber, G., Langer, T., 2005. LigandScout: 3-D pharmacophores derived from protein-bound ligands and their use as virtual screening filters. J. Chem. Inf. Model. 45, 160–169. https://doi. org/10.1021/ci049885e. Wood, A.J., Lo, T.W., Zeitler, B., Pickle, C.S., Ralston, E.J., Lee, A.H., Amora, R., Miller, J.C., Leung, E., Meng, X., Zhang, L., Rebar, E.J., Gregory, P.D., Urnov, F.D., Meyer, B.J., 2011. Targeted genome editing across species using ZFNs and TALENs. Science 333, 307. https:// doi.org/10.1126/science.1207773. Yi, H.G., Lee, H., Cho, D.W., 2017. 3D printing of organs-on-chips. Bioengineering 4, 10. https:// doi.org/10.3390/bioengineering4010010.

Chapter 11

Virtual Screening of Phytochemicals

Manabendra D. Choudhury*, Walid A. Atteya*, Keshav Dahal*, Pankaj Chetia†, Karabi D. Choudhury†, Anant Paradkar*

*University of Bradford, Bradford, United Kingdom
†Assam University, Silchar, India



Chapter Outline
11.1. Introduction
  11.1.1 Artificial Neural Networks (ANNs)
  11.1.2 Application of ANNs in Pharmaceutical Science
  11.1.3 ANNs in Predicting Bioactivity
  11.1.4 Gossypol and its Derivatives
11.2. Materials and Methods
  11.2.1 Input and Output Vector Definition for Data
  11.2.2 Software and Hardware Environment
  11.2.3 Modelling Procedure
  11.2.4 Training and Test Data Set
  11.2.5 Experimental Data Set
  11.2.6 Docking Experiment
11.3. Results and Discussion
11.4. Conclusions
Acknowledgements
References

Computational Phytochemistry. https://doi.org/10.1016/B978-0-12-812364-5.00011-0 © 2018 Elsevier Inc. All rights reserved.

11.1. INTRODUCTION

The fundamental objective of any drug discovery process is to reduce the time and cost of bringing an effective drug to market, and the techniques used in drug discovery are aimed at shortening the time needed to identify drug candidates. Although the Quantitative Structure Activity Relationship (QSAR) technique (Perkins et al., 2003; Ishikawa et al., 2012) has long been applied in drug discovery, it rarely has a high impact on lead discovery because it mainly influences later stages of drug development, particularly the prediction of IC50 and a few other parameters. Conventional methods of testing the bioactivity of plant materials have changed radically with the advent of computers and computational techniques. Computer-aided tools have significantly reduced the time necessary for bioactivity assessments, and a quick and cost-effective prediction of the biological activity of plant materials from their physicochemical properties is now possible. Computer-Aided Drug Discovery (CADD) (Zhang, 2011; Sliwoski et al., 2014) has emerged as a broad subject in medicinal plant research, within which computational molecular docking and computational target fishing (Wang and Xie, 2014; Katsila et al., 2016) are specific techniques. Above all, Artificial Neural Networks (ANNs) (Dayhoff and DeLeo, 2001) have proved to be a promising technique for direct prediction of a specific property of specific biomolecules. Using a specific example, this chapter describes how ANNs can be used to predict the possible biological activity of a few drug molecules.

11.1.1  Artificial Neural Networks (ANNs)

ANNs are biologically inspired computer programmes designed to simulate the way in which the human brain processes information (Dayhoff and DeLeo, 2001). ANNs gather their knowledge by detecting patterns and relationships in data; they learn (or are trained) through experience rather than from explicit programming, and this is the basic difference between ANNs and classical computer programmes. Another significant difference is that the algorithms used for data analysis are flexible and can be changed at any time during the analysis. A distinctive feature of ANNs is their ability to deal effectively with multidimensional problems involving several thousand features. An ANN is formed from hundreds of single units (artificial neurons, or processing elements) that are organized in layers and connected by coefficients (weights), which together constitute the neural structure. The power of neural computation comes from connecting neurons in a network: the better the neurons are connected, the better the predicted output. The behaviour of a neural network is determined by the transfer functions of its neurons, by the learning rule, and by the architecture itself. A successful ANN study depends on minimizing the prediction error by optimizing the interunit connections during training, which proceeds by trial and error until the network reaches the specified level of accuracy. Once the network has been trained to a minimum prediction error and tested, it can be given new input information to predict the output. The information in an ANN is encoded in the strength of the network's ‘synaptic’ connections (Zupan and Gasteiger, 1993; Kaliszan et al., 2003). Recent studies on ANNs have mainly centred on designing new network types by changing the transfer functions of the neurons, the learning rule, and the connection formula.
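The description above (neurons organized in layers, connected by weights, each applying a transfer function) can be illustrated with a minimal sketch. The network size, the weights, and the choice of tanh and identity as transfer functions are all illustrative assumptions, not values taken from this chapter.

```python
import math

def neuron(inputs, weights, bias, transfer):
    """One processing element: weighted sum of the inputs, then a transfer function."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return transfer(weighted_sum)

def layer(inputs, weight_rows, biases, transfer):
    """A layer is a list of neurons that all see the same inputs."""
    return [neuron(inputs, w, b, transfer) for w, b in zip(weight_rows, biases)]

def identity(value):
    """Linear transfer function for the output neuron."""
    return value

# Hypothetical weights: 3 inputs -> 2 hidden neurons (tanh) -> 1 output (linear).
HIDDEN_W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
HIDDEN_B = [0.0, 0.1]
OUTPUT_W = [[1.0, -1.0]]
OUTPUT_B = [0.5]

def forward(inputs):
    """Propagate the inputs through the hidden layer and then the output layer."""
    hidden = layer(inputs, HIDDEN_W, HIDDEN_B, math.tanh)
    return layer(hidden, OUTPUT_W, OUTPUT_B, identity)[0]

prediction = forward([1.0, 2.0, 3.0])
print(prediction)
```

Training, in this picture, amounts to adjusting the numbers in the weight and bias lists until the predictions match known targets closely enough.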


11.1.2  Application of ANNs in Pharmaceutical Science

The use of ANNs in drug discovery is relatively recent: it began at the end of the 1980s, when they were applied to various chemical problems, including the study of Quantitative Structure Activity Relationships (QSAR). ANNs were found to be useful in compound classification, modelling of structure activity relationships, identification of potential drug targets, and localization of structural and functional features of biopolymers (Isu et al., 1994). Hussain et al. (1991) were the first to introduce ANNs to the field of pharmaceutical technology, pointing to the possible advantages of a guided search for the optimal pharmaceutical formulation. Various dosage forms have been the subject of neural analysis: tablets (Bourquin et al., 1998a,b,c), pellets (Peh et al., 2000), capsules (Mendyk et al., 2007), emulsions (Gašperlin et al., 1998, 2000), microemulsions (Agatonovic-Kustrin and Alany, 2001), hydrogels (Takayama et al., 1999, 2003), and transdermal delivery systems (Kandimalla et al., 1999). ANNs have also been reported for pharmacological classification of drugs (Buciński et al., 2000), optimization of HPLC separations of bioactive compounds (Buciński and Bączek, 2002), in vitro permeability determination in Caco-2 cells (Paixão et al., 2010), and prediction of drug release and formulation (Chen et al., 1999; Petrović et al., 2009). Competitive adsorption of phenol and resorcinol from aqueous environments onto carbonaceous adsorbents has also been modelled using ANNs (Aghav et al., 2011).

11.1.3  ANNs in Predicting Bioactivity

The use of ANNs to predict bioactivity classes from the physicochemical parameters of agents was demonstrated for dihydrofolate reductase inhibitors (So and Richards, 1992). Antitumour activity could also be predicted by ANNs (Zupan and Gasteiger, 1993, 1999). ANNs have been proposed as decision support systems in dentistry (Brickley et al., 1998) and urology (Snow et al., 1999) and in the assessment of HIV/AIDS-related health performance (Lee and Park, 2001). The antioxidant capacity of cruciferous sprouts has also been predicted using a neural network (Buciński et al., 2004). Specific bioactivities of series of molecules, such as anti-HIV, anticancer, and psychometric activity, have been predicted using ANNs (Vanyúr et al., 2003; Naik and Patel, 2009; Haghdadi and Fatemi, 2010).

11.1.4  Gossypol and its Derivatives

Gossypol (Fig. 11.1) is a yellow phenolic aldehyde found in cottonseeds (Gossypium herbaceum). It can permeate cells, acts as an inhibitor of several dehydrogenase enzymes, and has been tested as a male oral contraceptive in China (Coutinho, 2002). In addition to its contraceptive properties, it also possesses antimalarial properties (Keshmiri-Neghab and Goliaei, 2014). The IUPAC name of the compound is 2,2′-bis-(formyl-1,6,7-trihydroxy-5-isopropyl-3-methylnaphthalene), and 14 of its derivatives are available in the NCBI PubChem Compound database. Although the parent compound is known for its male contraceptive property, the biological activities of its derivatives have yet to be tested experimentally. The work described below uses gossypol and its derivatives to demonstrate ANNs as a tool for predicting the biological activity of these compounds, with the intention of suggesting possible new drug leads. The ANN predictions are cross-validated by an in silico search for receptors of the chosen ligands.

FIG. 11.1  Gossypol from Gossypium herbaceum.

11.2.  MATERIALS AND METHODS

11.2.1  Input and Output Vector Definition for Data

Physicochemical data for 117 molecules with well-known biological activities were collected from the NCBI PubChem Compound database. Twenty-two descriptors were recorded for each of the 117 molecules. The physicochemical properties of each compound were taken as the input, and its biological activity as the output. Similarly, 22 descriptors were recorded for each of six compounds whose biological properties have not been tested; these form the experimental data set. The physicochemical descriptors used to prepare the training, test, and experimental data sets are shown in Table 11.1.
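As a minimal sketch of what such an input vector might look like in code, the snippet below places the 22 descriptors of Table 11.1 in a fixed order and converts one compound record into a numeric vector. The function name `to_input_vector` and the dictionary layout are hypothetical; the example values are those listed for Compound-1 in Table 11.3.

```python
# Descriptor order follows the serial numbers 1-22 of Table 11.1.
DESCRIPTORS = [
    "Molecular weight", "XLogP3", "H-bond donor", "H-bond acceptor",
    "Rotatable bond count", "Exact mass", "Mono-isotopic mass",
    "Topological polar surface area", "Heavy atom count", "Formal charge",
    "Complexity", "Isotope atom count", "Defined atom stereo centre count",
    "Undefined atom stereo centre count", "Defined bond stereo centre count",
    "Undefined bond stereo centre count", "Covalently bonded unit count",
    "Feature 3D acceptor count", "Feature 3D ring count",
    "Effective rotor count", "Conformer sampling RMSD", "CID conformer count",
]

def to_input_vector(record):
    """Return the 22 descriptor values in fixed order; fail if any is missing."""
    missing = [name for name in DESCRIPTORS if name not in record]
    if missing:
        raise KeyError(f"missing descriptors: {missing}")
    return [float(record[name]) for name in DESCRIPTORS]

# Values for Compound-1 (CID 198041) as listed in Table 11.3.
compound_1 = dict(zip(DESCRIPTORS, [
    548.5837, 5.6, 8, 10, 5, 548.2159, 548.2159, 156, 40, 0, 848,
    0, 0, 0, 0, 0, 1, 2, 8, 5, 1, 2,
]))

x = to_input_vector(compound_1)
print(len(x), x[:4])
```

For a training or test compound, the output paired with such a vector would be its numeric biological activity code.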

11.2.2  Software and Hardware Environment

The test bed used to implement the ANNs was an Intel Core 2 Duo 2.93 GHz processor with 4 GB of RAM running 64-bit Windows 7 Enterprise Edition, Service Pack 1. The ANNs were implemented in MATLAB (Matrix Laboratory), version 7.12.0.635, Release 2011a, 64-bit, for Windows. The MATLAB Neural Network Toolbox, together with some in-house MATLAB code, was used to design, implement, visualize, and simulate the neural networks. This allowed flexible implementation of the neural network algorithm and plotting of the required functions and data. The common and well-known back propagation algorithm for training neural networks was chosen to minimize the objective function.

TABLE 11.1  Physicochemical Descriptors Used for Preparing Training, Test, and Experimental Dataset

Serial | Module Descriptor | Description
1 | Molecular weight | Weight of all atoms in a compound
2 | XLogP3 | Partition coefficient
3 | H-bond donor | Number of hydrogen bond donors, i.e. hydrogen atoms attached to a relatively electronegative atom
4 | H-bond acceptor | Number of groups that can share electrons to accept an H-bond, e.g., O, N, etc.
5 | Rotatable bond count | Number of rotatable bonds in a molecule
6 | Exact mass | Sum of the masses of the individual isotopes of a molecule
7 | Mono-isotopic mass | Sum of the masses of the atoms in a molecule, using the most abundant isotope of each element
8 | Topological polar surface area | Topological polar surface area (TPSA) is computed from functional group contributions derived from a large database of structures; it is a convenient measure of polar surface area that avoids the need to calculate the 3D structure of the ligand or to decide which conformation or conformations are biologically relevant
9 | Heavy atom count | Number of non-hydrogen atoms in a molecule
10 | Formal charge | Charge assigned to an atom in a molecule
11 | Complexity | Rough estimate of how complicated the structure is, based on the elements it contains and its structural features
12 | Isotope atom count | Number of atoms with the same number of protons but differing numbers of neutrons in a molecule
13 | Defined atom stereo centre count | Count of atom stereocentres whose configuration is specified
14 | Undefined atom stereo centre count | Count of atom stereocentres whose configuration is not specified
15 | Defined bond stereo centre count | Count of nonrotatable (stereogenic) bonds whose configuration is specified
16 | Undefined bond stereo centre count | Count of nonrotatable (stereogenic) bonds whose configuration is not specified
17 | Covalently bonded unit count | Count of groups bound through covalent bonds
18 | Feature 3D acceptor count | Count of acceptor features in the 3D description of the molecule
19 | Feature 3D ring count | Number of aromatic rings present in a molecule
20 | Effective rotor count | Measure of the flexibility of the molecule
21 | Conformer sampling RMSD | Root-mean-square deviation (RMSD) among the sampled conformers
22 | CID conformer count | Number of different conformers of a molecule
23 | Biological activity | Activity of the compounds as drugs

11.2.3  Modelling Procedure

To build a model that can predict the biological activity of unknown molecules from their physicochemical descriptors, a four-layered feed-forward neural network was developed. A sensitivity analysis was carried out by varying the number of network layers and the number of hidden nodes within the layers to obtain the best prediction accuracy. Two other parameters had a significant effect on the error rate of the proposed model: the back propagation training function and the hidden layer functions. Initially, commonly used neural network parameters were applied (Buciński et al., 2004): a learning coefficient of 0.02, a momentum of 0.6, and a maximum of 3000 epochs. Once the desired goal had been achieved, these parameters were changed to reduce the time needed to build the model without affecting the goal. The final ANN model used in the present study is shown in Fig. 11.2.
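To make the roles of the learning coefficient, momentum, and epoch limit concrete, the following is a minimal sketch of gradient descent with momentum on a single linear neuron. It is not the MATLAB toolbox procedure used in this chapter: the toy data, the error goal, and the function name `train` are assumptions made purely for illustration; only the three parameter values come from the text.

```python
LEARNING_RATE = 0.02   # learning coefficient quoted in the text
MOMENTUM = 0.6         # momentum quoted in the text
MAX_EPOCHS = 3000      # epoch limit quoted in the text

def train(samples, goal=1e-6):
    """Fit y = w*x + b by batch gradient descent with momentum.

    Stops when the mean squared error falls below `goal` (a hypothetical
    threshold) or when MAX_EPOCHS is reached.
    """
    w = b = 0.0
    dw_prev = db_prev = 0.0
    n = len(samples)
    mse = float("inf")
    for epoch in range(1, MAX_EPOCHS + 1):
        errors = [(w * x + b) - y for x, y in samples]
        mse = sum(e * e for e in errors) / n
        if mse <= goal:
            return w, b, mse, epoch
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, (x, _) in zip(errors, samples)) / n
        grad_b = 2.0 * sum(errors) / n
        # Momentum reuses a fraction of the previous update.
        dw = -LEARNING_RATE * grad_w + MOMENTUM * dw_prev
        db = -LEARNING_RATE * grad_b + MOMENTUM * db_prev
        w, b = w + dw, b + db
        dw_prev, db_prev = dw, db
    return w, b, mse, MAX_EPOCHS

# Toy data generated from y = 2x + 1 (not data from the chapter).
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b, mse, epochs = train(samples)
print(w, b, mse, epochs)
```

A larger learning coefficient or momentum generally shortens training but risks overshooting, which is why such values are typically tuned after a first working model is obtained.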

11.2.4  Training and Test Data Set

After redundant entries had been removed, the data recorded from the NCBI PubChem Compound database were divided into a training set (70%) and a test set (30%) for internal validation. Before the split, the biological activity attribute was encoded as shown in Table 11.2, so that each type of bioactivity had a numerical representation.
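The encoding and the 70/30 split can be sketched as follows. The activity codes are those of Table 11.2; the random shuffle, the seed, and the placeholder records are assumptions, since the chapter does not state how compounds were assigned to the two sets.

```python
import random

# Numeric codes for each activity class, as given in Table 11.2.
ACTIVITY_CODES = {
    "Anabolic agent": 10, "Analgesic": 20, "Narcotic analgesic": 30,
    "Antiinflammatory": 40, "Anticonvulsant": 50, "Antineoplastic": 60,
    "Cardioprotection": 70, "Vasodilator": 80, "Antianxiety agent": 90,
    "Hypnotic": 100, "Contraceptive": 110, "Atherogenic": 120,
}

def encode(records):
    """Replace each activity name by its numeric code."""
    return [(descriptors, ACTIVITY_CODES[activity])
            for descriptors, activity in records]

def split(records, train_fraction=0.7, seed=0):
    """Shuffle (assumed) and split into training and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# 117 placeholder records with made-up descriptor values (not the chapter's data).
names = list(ACTIVITY_CODES)
records = [([float(i)] * 22, names[i % len(names)]) for i in range(117)]

train_set, test_set = split(encode(records))
print(len(train_set), len(test_set))
```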

FIG. 11.2  Artificial Neural Network model set out finally for present study. Input nodes: the 22 descriptors of Table 11.1 (molecular weight to CID conformer count); output node: BA_code.

TABLE 11.2  Biological Activity Attributes

Biological Activity Code | Biological Activity Description/Name
10 | Anabolic agent
20 | Analgesic
30 | Narcotic analgesic
40 | Antiinflammatory
50 | Anticonvulsant
60 | Antineoplastic
70 | Cardioprotection
80 | Vasodilator
90 | Antianxiety agent
100 | Hypnotic
110 | Contraceptive
120 | Atherogenic

11.2.5  Experimental Data Set

Six gossypol derivatives were selected for prediction of their bioactivity: diaminogossypol, CID: 198041 (Compound-1); mono-aldehyde gossypol, CID: 195071 (Compound-2); gossylic lactone, CID: 5479154 (Compound-3); gossypol tetraacetic acid, CID: 130831 (Compound-4); ethyl gossypol, CID: 374353 (Compound-5); and gossypol-6,6′-dimethyl ether, CID: 25200979 (Compound-6). The descriptors of these six compounds, together with their compound identifier (CID) numbers, were downloaded from the NCBI PubChem Compound database. These compounds were selected because their biological activities have not yet been tested, either in silico or in vivo, whereas the parent compound, from which they are derived by chemical group substitution, is known for its oral male contraceptive property. The physical and chemical descriptors of the experimental compounds are given in Table 11.3.

11.2.6  Docking Experiment

Once the ANN predictions had been obtained, the chosen compounds were treated as ligands, and docking experiments were performed to search for suitable targets/receptors in order to validate the ANN predictions. Because the predicted activity of the chosen ligands is contraceptive, and the parent compound gossypol is known for its male contraceptive property, the ligands were docked against acrosin and hyaluronidase, two vital enzymes of human spermatozoa. As 3D structures of acrosin and hyaluronidase could not be obtained from the RCSB Protein Data Bank, the structures were predicted by homology modelling with Modeller 9v7. The active sites of the receptors (enzymes) were predicted using Q-SiteFinder (Laurie and Jackson, 2005). The structures of the chosen ligands were obtained from the NCBI PubChem database in SDF format, and docking was carried out using BioSolveIT FlexX 1.3.0 (Stahl, 2000). As a control, the parent gossypol was docked with the same enzymes using the same software.

11.3.  RESULTS AND DISCUSSION

The predicted outputs for the experimental compounds (Table 11.4) indicate that compounds 3, 4, and 6 are contraceptive, compounds 1 and 2 are an antianxiety agent and a vasodilator, respectively, and compound 5 is hypnotic. As the parent compound gossypol has contraceptive properties, it was expected that some of its derivatives would share this property.

TABLE 11.3  Physicochemical Descriptors of the Gossypol Derivatives

Descriptors | Compound 1 (CID:198041) | Compound 2 (CID:195071) | Compound 3 (CID:5479154) | Compound 4 (CID:130831) | Compound 5 (CID:374353) | Compound 6 (CID:25200979)
Molecular weight | 548.5837 | 490.5443 | 514.5226 | 686.7011 | 518.5544 | 546.6076
XLogP3 | 5.6 | 6.9 | 7.7 | 6.3 | 6.7 | 7.6
H-bond donor | 8 | 6 | 4 | 2 | 6 | 4
H-bond acceptor | 10 | 7 | 8 | 12 | 8 | 8
Rotatable bond count | 5 | 4 | 3 | 13 | 5 | 7
Exact mass | 548.2159 | 490.1992 | 514.1628 | 686.2363 | 518.1941 | 546.2254
Mono-isotopic mass | 548.2159 | 490.1992 | 514.1628 | 686.2363 | 518.1941 | 546.2254
Topological polar surface area | 156 | 208 | 138 | 289 | 180 | 192
Heavy atom count | 40 | 36 | 38 | 50 | 38 | 40
Formal charge | 0 | 0 | 0 | 0 | 0 | 0
Complexity | 848 | 773 | 890 | 1180 | 808 | 879
Isotope atom count | 0 | 0 | 0 | 0 | 0 | 0
Defined atom stereo centre count | 0 | 0 | 0 | 0 | 0 | 0
Undefined atom stereo centre count | 0 | 0 | 0 | 0 | 0 | 0
Defined bond stereo centre count | 0 | 0 | 0 | 0 | 0 | 0
Undefined bond stereo centre count | 0 | 0 | 0 | 0 | 0 | 0
Covalently bonded unit count | 1 | 1 | 1 | 1 | 1 | 1
Feature 3D acceptor count | 2 | 1 | 2 | 6 | 2 | 4
Feature 3D ring count | 8 | 6 | 4 | 2 | 6 | 4
Effective rotor count | 5 | 4 | 3.4 | 13 | 5 | 7
Conformer sampling RMSD | 1 | 0.8 | 0.8 | 1.4 | 0.8 | 1
CID conformer count | 2 | 2 | 2 | 2 | 4 | 3

TABLE 11.4  ANNs Predicted Output of the Gossypol Derivatives

Predicted Output | Compound 1 | Compound 2 | Compound 3 | Compound 4 | Compound 5 | Compound 6
Biological activity code | 90.0424 | 79.9101 | 109.9587 | 109.7493 | 100.0998 | 109.6576
Biological activity name | Antianxiety agent | Vasodilator | Contraceptive | Contraceptive | Hypnotic | Contraceptive

While training the model, many attempts were made to obtain the best prediction accuracy by changing the neural network settings: the number of network layers, the number of hidden nodes within the layers, and the layer propagation functions. The effect of the neural network parameters on the learning rate depends entirely on the data set. Our goal was to optimize these settings until the error was minimized and reached the required level of accuracy. Two quantities are very important in the learning process: the mean squared error (MSE) and the regression value. The MSE represents the learning accuracy of the neural network and is the average squared difference between outputs and targets. Lower values are better, and zero means no error. The pattern of learning accuracy of the neural network in the present study is shown in Fig. 11.3. The MSE recorded during learning was 0.02, which indicates that the learning of the ANN in the present study was accurate. The MSE was calculated as follows:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} \left(T_i - O_i\right)^2$$

where MSE is the mean squared error, N is the number of input samples, T_i are the target values, and O_i are the output values from the neural network model. The regression factor (R) measures the correlation between outputs and targets and indicates the prediction accuracy: a value of 1 means a close relationship, and 0 a random relationship. The regression factor of the neural network used in the present study is shown in Fig. 11.4. Its value was 0.99, which signifies the accuracy of the predictions. During the work, the data were processed into a format suitable for MATLAB, and the input and target data attributes were defined. Many experiments were run, tuning the neural network parameters to obtain the best accuracy. Fig. 11.5 compares the neural network output data with the target data; the corresponding values are listed in Table 11.5 and show how close the outputs of the model are to the targets. In the present study, the best results were obtained with a four-layered feed-forward multilayer perceptron neural network, comprising one input layer, three hidden layers, and one output layer. There were ten artificial neurons in the first hidden layer, three each in the second and third hidden layers, and one artificial neuron in the output layer.

FIG. 11.3  Pattern of accuracy of the learning of the neural network for the present study.

FIG. 11.4  Regression factor of the proposed neural network used in the present study (plot of Output against Target showing the data points, the fitted line, and the line Y = T; R = 0.99179; fit: Output ≈ 0.97 × Target + 0.019).

FIG. 11.5  Comparison between the neural network output data and the target data.
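The two quality measures discussed above can be sketched in a few lines. The snippet assumes that the regression factor corresponds to the Pearson correlation coefficient between targets and outputs, which the chapter does not state explicitly. The five output/target pairs are the first five rows of Table 11.5; because they are raw activity codes rather than whatever scaling the authors used, the numbers printed here are not expected to reproduce the reported MSE of 0.02.

```python
import math

def mean_squared_error(targets, outputs):
    """Average squared difference between targets and outputs."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)

def pearson_r(targets, outputs):
    """Pearson correlation coefficient (assumed meaning of the regression factor)."""
    n = len(targets)
    mean_t = sum(targets) / n
    mean_o = sum(outputs) / n
    cov = sum((t - mean_t) * (o - mean_o) for t, o in zip(targets, outputs))
    var_t = sum((t - mean_t) ** 2 for t in targets)
    var_o = sum((o - mean_o) ** 2 for o in outputs)
    return cov / math.sqrt(var_t * var_o)

# First five output/target pairs of Table 11.5.
outputs = [69.37214, 39.70692, 50.63383, 25.8101, 39.70048]
targets = [70.0, 40.0, 50.0, 30.0, 40.0]

mse = mean_squared_error(targets, outputs)
r = pearson_r(targets, outputs)
print(mse, r)
```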

TABLE 11.5  Predicted Output of Training and Test Set With Reference to Biological Activity Attribute (Target)

Output | Target
69.37214 | 70
39.70692 | 40
50.63383 | 50
25.8101 | 30
39.70048 | 40
40.06018 | 40
31.64394 | 30
39.57521 | 40
41.42172 | 40
40.02308 | 40
101.2782 | 100
40.05743 | 40
90.04237 | 90
18.92682 | 20
119.8577 | 120
29.47375 | 30
48.97252 | 50
28.35779 | 30
20.13945 | 20
88.79931 | 90
69.33511 | 70
80.70457 | 80
109.9587 | 110
61.64463 | 60
109.7493 | 110
80.10604 | 80
28.92282 | 30
20.35135 | 20
39.17767 | 40
9.11467 | 10
119.8899 | 120
9.632499 | 10
39.26161 | 40
79.91012 | 80
39.63317 | 40
18.29689 | 20
79.47068 | 80
39.09463 | 40
79.90057 | 80
120.63 | 120
90.56023 | 90
59.46885 | 60
59.87975 | 60
40.29981 | 40
40.22298 | 40
50.05615 | 50
110.3123 | 110
39.84036 | 40
39.74226 | 40
79.92715 | 80
20.22006 | 20
20.95283 | 20
39.32984 | 40
39.37376 | 40
38.44877 | 40
70.20587 | 70
10.16076 | 10
10.71323 | 10
20.37211 | 20
79.0657 | 80
39.44648 | 40
10.48536 | 10
19.84768 | 20
11.41296 | 10
90.43699 | 90
117.9819 | 120
18.36777 | 20
109.5418 | 110
39.51115 | 40
19.79686 | 20
41.86343 | 40
79.70558 | 80
39.39451 | 40
119.3081 | 120
30.23937 | 30
39.81784 | 40
79.94202 | 80
30.67188 | 30
69.83762 | 70
118.5441 | 120
22.49832 | 20
18.79895 | 20
20.71337 | 20
40.40256 | 40
18.81931 | 20
21.7158 | 20
20.12916 | 20
40.36983 | 40
40.05577 | 40
41.04731 | 40
39.6325 | 40
109.6576 | 110
37.64742 | 40
99.48608 | 100
37.87494 | 40
41.51219 | 40
110.2016 | 110
42.77141 | 40
119.1196 | 120
37.1825 | 40
29.26267 | 30
21.32417 | 20
79.47948 | 80
32.78027 | 30
120.059 | 120
19.59192 | 20
39.44699 | 40
99.15079 | 100
20.47612 | 20
69.74487 | 70
28.65926 | 30
39.48187 | 40
18.90024 | 20
118.4873 | 120
119.0782 | 120
100.0998 | 100
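The network outputs in Tables 11.4 and 11.5 are continuous values lying close to the integer codes of Table 11.2. A minimal sketch of how such an output could be mapped back to an activity name is shown below, assuming a nearest-code rule; the chapter does not state the rule it used, although this assumption reproduces the names listed in Table 11.4.

```python
# Activity names keyed by their numeric code (Table 11.2).
ACTIVITY_NAMES = {
    10: "Anabolic agent", 20: "Analgesic", 30: "Narcotic analgesic",
    40: "Antiinflammatory", 50: "Anticonvulsant", 60: "Antineoplastic",
    70: "Cardioprotection", 80: "Vasodilator", 90: "Antianxiety agent",
    100: "Hypnotic", 110: "Contraceptive", 120: "Atherogenic",
}

def decode(output):
    """Return (code, name) of the activity whose code is closest to the output."""
    code = min(ACTIVITY_NAMES, key=lambda c: abs(c - output))
    return code, ACTIVITY_NAMES[code]

# Predicted outputs for compounds 1-6 as listed in Table 11.4.
predicted = [90.0424, 79.9101, 109.9587, 109.7493, 100.0998, 109.6576]
labels = [decode(value)[1] for value in predicted]
print(labels)
```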

Different transfer functions, including ‘tansig’ and ‘purelin’, were tested in the different layers to find those that best fit the model. The best input layer function was ‘tansig’, the best hidden layer function was ‘purelin’, and the best back propagation training function was ‘trainlm’.

Docking of gossypol with the human sperm enzymes acrosin and hyaluronidase showed that it could bind successfully, with scores of −25.6376 and −24.2222 kcal/mol, respectively. Docking experiments with the compounds predicted by the ANNs to have contraceptive activity (compounds 3, 4, and 6) showed that the affinity of gossypol-6,6′-dimethyl ether for acrosin and hyaluronidase was comparable to that of gossypol. All three derivatives showed significant binding capability with the two enzymes, except that the affinity of gossypol tetraacetic acid for hyaluronidase was lower (Tables 11.6, 11.7A–11.7H, and Fig. 11.6).

TABLE 11.6  Docking Score of the Chosen Ligands and Gossypol Against Acrosin and Hyaluronidase

Ligand | Enzyme | H-Bond Forming Amino Acids | Docking Score (kcal/mol)
Gossypol (CID 3503) | Acrosin | Tyr39, Val70, His71, Arg74, and Val99 | −25.6376
Gossypol (CID 3503) | Hyaluronidase | Tyr179, His187, Ser222, Asn226, Thr227, and Gln228 | −24.2222
Gossylic lactone (CID 5479154) | Acrosin | His71, Trp73, Val99, and Glu100 | −17.9569
Gossylic lactone (CID 5479154) | Hyaluronidase | Tyr179, His187, Thr227, and Gln228 | −15.1068
Gossypol tetraacetic acid (CID 130831) | Acrosin | Lys21, Ala23, and Trp156 | −12.1150
Gossypol tetraacetic acid (CID 130831) | Hyaluronidase | Tyr357 and His359 | −5.1035
Gossypol-6,6′-dimethyl ether (CID 25200979) | Acrosin | Trp30 and Cys136 | −20.7042
Gossypol-6,6′-dimethyl ether (CID 25200979) | Hyaluronidase | Asp279, Asn353, Ser355, and Tyr357 | −19.5150

Hyaluronidase and acrosin are two vital enzymes of human spermatozoa. Hyaluronidase is responsible for digestion of hyaluronan in the corona radiata, enabling conception (Alberts, 2008), and acrosin enables sperm to penetrate the ovum by lysis of the zona pellucida through the acrosome reaction, thus facilitating penetration of the sperm through the innermost glycoprotein layers (Honda et al., 2002). Inhibition of either enzyme by the gossypol derivatives in question would therefore adversely affect the ability of spermatozoa to enable conception. The majority of mammalian ova are covered in a layer of granulosa cells intertwined in an extracellular matrix that contains a high concentration of hyaluronan. When a capacitated sperm reaches the ovum, it is able to penetrate this layer with the assistance of hyaluronidase enzymes present on the surface of the sperm. Once this has occurred, the sperm can bind to the zona pellucida, and the acrosome reaction can then take place with the help of the enzyme acrosin. The lysis resulting from the acrosome reaction enables spermatozoa to reach the innermost glycoprotein layers of the ovum to effect conception (Alberts, 2008). Alteration of the activity of these two enzymes is, therefore, directly linked with male contraceptive activity. The docking experiments thus, on the one hand, validated the ANN predictions of contraceptive action for compounds 3, 4, and 6 by showing their binding potential with two vital enzymes of human spermatozoa and, on the other hand, suggested possible molecular pathways by which these compounds may effect contraception.
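The comparison with the gossypol control can be made explicit by expressing each derivative's docking score as a fraction of the gossypol score for the same enzyme. This is only an illustrative way of reading Table 11.6, not an analysis performed in the chapter; the function name and the ranking by summed ratio are assumptions.

```python
# (acrosin, hyaluronidase) docking scores in kcal/mol, from Table 11.6.
SCORES = {
    "Gossypol": (-25.6376, -24.2222),
    "Gossylic lactone": (-17.9569, -15.1068),
    "Gossypol tetraacetic acid": (-12.1150, -5.1035),
    "Gossypol-6,6'-dimethyl ether": (-20.7042, -19.5150),
}

def relative_to_reference(scores, reference="Gossypol"):
    """Score of each ligand divided by the reference score, per enzyme."""
    ref = scores[reference]
    return {
        name: tuple(s / r for s, r in zip(values, ref))
        for name, values in scores.items()
        if name != reference
    }

ratios = relative_to_reference(SCORES)
# Rank derivatives from closest to gossypol to furthest (assumed criterion).
ranked = sorted(ratios, key=lambda name: sum(ratios[name]), reverse=True)
for name in ranked:
    acrosin, hyaluronidase = ratios[name]
    print(f"{name}: acrosin {acrosin:.2f}, hyaluronidase {hyaluronidase:.2f}")
```

Ratios near 1 indicate a score comparable to that of gossypol, while small ratios flag weak predicted binding.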

TABLE 11.7A  Docking Parameters of Gossylic Lactone [CID 5479154] Against Acrosin

# | Ligand | Structure | Rank | Score | Match | Lipo | Ambig | Clash | Rot | RMSD | Simil | #Match
1 | (1) 5479154 | [structure drawing] | 1 | −17.9569 | −17.3358 | −10.1318 | −11.1563 | 6.8671 | 8.4000 | | | 16

TABLE 11.7B  Docking Parameters of Gossypol Tetraacetic Acid [CID 130831] Against Acrosin

| # | Ligand | Structure | Rank | Score | Match | Lipo | Ambig | Clash | Rot | RMSD | Simil | #Match |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | (1) 130831 | [structure drawing] | 1 | −12.1150 | −17.3531 | −11.2893 | −9.5479 | 3.8753 | 16.8000 | | | 11 |

TABLE 11.7C  Docking Parameters of Gossypol-6,6'-Dimethyl Ether [CID 25200979] Against Acrosin

| # | Ligand | Structure | Rank | Score | Match | Lipo | Ambig | Clash | Rot | RMSD | Simil | #Match |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | (1) 25200979 | [structure drawing] | 1 | −20.7042 | −23.6019 | −10.0613 | −6.9311 | 3.2901 | 11.2000 | | | 15 |

TABLE 11.7D  Docking Parameters of Gossylic Lactone [CID 5479154] Against Hyaluronidase

| # | Ligand | Structure | Rank | Score | Match | Lipo | Ambig | Clash | Rot | RMSD | Simil | #Match |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | (1) 5479154 | [structure drawing] | 1 | −15.1068 | −19.3266 | −5.5109 | −7.5996 | 3.5302 | 8.4000 | | | 14 |

TABLE 11.7E  Docking Parameters of Gossypol Tetraacetic Acid [CID 130831] Against Hyaluronidase

| # | Ligand | Structure | Rank | Score | Match | Lipo | Ambig | Clash | Rot | RMSD | Simil | #Match |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | (1) 130831 | [structure drawing] | 1 | −5.1035 | −16.2112 | −10.8139 | −7.8980 | 7.6196 | 16.8000 | | | 14 |

TABLE 11.7F  Docking Parameters of Gossypol-6,6'-Dimethyl Ether [CID 25200979] Against Hyaluronidase

| # | Ligand | Structure | Rank | Score | Match | Lipo | Ambig | Clash | Rot | RMSD | Simil | #Match |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | (1) 25200979 | [structure drawing] | 1 | −19.5150 | −27.1367 | −8.1126 | −7.4221 | 6.5564 | 11.2000 | | | 17 |

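TABLE 11.7G  Docking Parameters of Gossypol [CID 3503] Against Acrosin

| # | Ligand | Structure | Rank | Score | Match | Lipo | Ambig | Clash | Rot | RMSD | Simil | #Match |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | (1) 3503 | [structure drawing] | 1 | −25.6376 | −27.8142 | −12.0931 | −9.7579 | 7.4276 | 11.2000 | | | 21 |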

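TABLE 11.7H  Docking Parameters of Gossypol [CID 3503] Against Hyaluronidase

| # | Ligand | Structure | Rank | Score | Match | Lipo | Ambig | Clash | Rot | RMSD | Simil | #Match |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | (1) 3503 | [structure drawing] | 1 | −24.2222 | −31.7353 | −3.2238 | −6.9964 | 1.1334 | 11.2000 | | | 13 |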
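The per-term columns in Tables 11.7A–11.7H can be checked against the reported totals. In the tabulated numbers, each Score appears to equal the sum of the Match, Lipo, Ambig, Clash, and Rot terms plus a constant of about 5.4. That offset is inferred here from the data and is not stated in the text, so the sketch below treats it as an assumption; the names `ROWS`, `OFFSET`, and `reconstructed_score` are illustrative.

```python
# (Score, Match, Lipo, Ambig, Clash, Rot) copied from Tables 11.7A-11.7H.
ROWS = {
    "11.7A gossylic lactone / acrosin": (-17.9569, -17.3358, -10.1318, -11.1563, 6.8671, 8.4000),
    "11.7B tetraacetic acid / acrosin": (-12.1150, -17.3531, -11.2893, -9.5479, 3.8753, 16.8000),
    "11.7C dimethyl ether / acrosin": (-20.7042, -23.6019, -10.0613, -6.9311, 3.2901, 11.2000),
    "11.7D gossylic lactone / hyaluronidase": (-15.1068, -19.3266, -5.5109, -7.5996, 3.5302, 8.4000),
    "11.7E tetraacetic acid / hyaluronidase": (-5.1035, -16.2112, -10.8139, -7.8980, 7.6196, 16.8000),
    "11.7F dimethyl ether / hyaluronidase": (-19.5150, -27.1367, -8.1126, -7.4221, 6.5564, 11.2000),
    "11.7G gossypol / acrosin": (-25.6376, -27.8142, -12.0931, -9.7579, 7.4276, 11.2000),
    "11.7H gossypol / hyaluronidase": (-24.2222, -31.7353, -3.2238, -6.9964, 1.1334, 11.2000),
}

# Assumption: constant offset inferred from the tables, not given in the chapter.
OFFSET = 5.4


def reconstructed_score(match, lipo, ambig, clash, rot, offset=OFFSET):
    """Sum the five tabulated contributions and add the assumed constant."""
    return match + lipo + ambig + clash + rot + offset


def max_abs_residual(rows):
    """Largest absolute gap between a reported Score and its reconstruction."""
    return max(abs(score - reconstructed_score(*terms)) for score, *terms in rows.values())


if __name__ == "__main__":
    for label, (score, *terms) in ROWS.items():
        print(f"{label:42s} reported {score:9.4f}  rebuilt {reconstructed_score(*terms):9.4f}")
    print(f"largest residual: {max_abs_residual(ROWS):.4f}")
```

Under this assumption the eight reported scores are reproduced to within rounding in the fourth decimal place, which is a quick way to catch transcription errors in the flattened tables.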

FIG. 11.6  Bonding pattern of the chosen ligands and gossypol with their receptors. (A) 2D view of gossylic lactone and acrosin bonding. (B) 3D view of gossylic lactone and acrosin bonding. Residues labelled in the panels: Phe37, Val70, His71, Trp73, Arg74, Tyr98, Val99, Glu100, and Thr123.

FIG. 11.6, CONT’D  (C) 2D view of gossypol tetraacetic acid and acrosin bonding. (D) 3D view of gossypol tetraacetic acid and acrosin bonding. Residues labelled in the panels: Lys21, Ala22, Ala23, Lys80, Glu81, Trp156, and Ile159.

FIG. 11.6, CONT’D  (E) 2D view of gossypol-6,6'-dimethyl ether and acrosin bonding. (F) 3D view of gossypol-6,6'-dimethyl ether and acrosin bonding. Residues labelled in the panels: Trp28, Trp30, Cys136, Leu137, and Trp152.

FIG. 11.6, CONT’D  (G) 2D view of gossylic lactone and hyaluronidase bonding. (H) 3D view of gossylic lactone and hyaluronidase bonding. Residues labelled in the panels: Tyr179, Leu180, His187, Tyr224, Asn226, Thr227, Gln228, and Asp270.

FIG. 11.6, CONT’D  (I) 2D view of gossypol tetraacetic acid and hyaluronidase bonding. (J) 3D view of gossypol tetraacetic acid and hyaluronidase bonding. Residues labelled in the panels: Val282, Tyr283, Ser355, Tyr357, His359, Pro362, and Lys392.

FIG. 11.6, CONT’D  (K) 2D view of gossypol-6,6'-dimethyl ether and hyaluronidase bonding. (L) 3D view of gossypol-6,6'-dimethyl ether and hyaluronidase bonding. Residues labelled in the panels: Asp279, Val282, Tyr283, Asn353, Ser355, Asp356, Tyr357, and Lys392.

FIG. 11.6, CONT’D  (M) 2D view of gossypol and acrosin bonding. (N) 3D view of gossypol and acrosin bonding. Residues labelled in the panels: Phe37, Tyr39, Val70, His71, Arg74, Tyr98, and Val99.

FIG. 11.6, CONT’D  (O) 2D view of gossypol and hyaluronidase bonding. (P) 3D view of gossypol and hyaluronidase bonding. Residues labelled in the panels: Tyr179, His187, Ser222, Tyr224, Asn226, Thr227, and Gln228.
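The bonding patterns in Fig. 11.6 can also be compared by asking which hydrogen-bonding residues each derivative shares with gossypol. The following is a rough sketch using only the H-bond lists in Table 11.6; the data layout and the helper name `shared_with_reference` are illustrative assumptions, and an empty overlap says nothing about non-hydrogen-bond contacts.

```python
# H-bond forming residues copied from Table 11.6, keyed by enzyme and ligand.
H_BOND_RESIDUES = {
    "acrosin": {
        "gossypol": {"Tyr39", "Val70", "His71", "Arg74", "Val99"},
        "gossylic lactone": {"His71", "Trp73", "Val99", "Glu100"},
        "gossypol tetraacetic acid": {"Lys21", "Ala23", "Trp156"},
        "gossypol-6,6'-dimethyl ether": {"Trp30", "Cys136"},
    },
    "hyaluronidase": {
        "gossypol": {"Tyr179", "His187", "Ser222", "Asn226", "Thr227", "Gln228"},
        "gossylic lactone": {"Tyr179", "His187", "Thr227", "Gln228"},
        "gossypol tetraacetic acid": {"Tyr357", "His359"},
        "gossypol-6,6'-dimethyl ether": {"Asp279", "Asn353", "Ser355", "Tyr357"},
    },
}


def shared_with_reference(residues, reference="gossypol"):
    """For each enzyme, list the residues each ligand shares with the reference ligand."""
    shared = {}
    for enzyme, by_ligand in residues.items():
        ref = by_ligand[reference]
        shared[enzyme] = {
            ligand: sorted(res & ref)  # set intersection, sorted for stable output
            for ligand, res in by_ligand.items()
            if ligand != reference
        }
    return shared


if __name__ == "__main__":
    for enzyme, by_ligand in shared_with_reference(H_BOND_RESIDUES).items():
        for ligand, common in by_ligand.items():
            print(f"{enzyme:14s} {ligand:30s} {', '.join(common) or 'none'}")
```

On these lists, gossylic lactone shares hydrogen-bonding residues with gossypol on both enzymes, whereas gossypol tetraacetic acid and gossypol-6,6'-dimethyl ether share none, which suggests that their docked poses rely on different hydrogen-bond partners.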


11.4. CONCLUSIONS

Of the six gossypol derivatives studied, gossylic lactone, gossypol tetraacetic acid, and gossypol-6,6'-dimethyl ether were predicted by the ANNs to be contraceptives. Diaminogossypol and mono-aldehyde gossypol were predicted to have antianxiety and vasodilation activity, respectively, and ethyl gossypol to be hypnotic. Cross-validation of the ANN predictions of contraceptive action for gossylic lactone, gossypol tetraacetic acid, and gossypol-6,6'-dimethyl ether by docking experiments confirmed the ANN results. These three compounds are proposed to exert male contraceptive action by inhibiting the acrosin and hyaluronidase enzymes of human spermatozoa.

ACKNOWLEDGEMENTS

Manabendra D. Choudhury sincerely acknowledges the Commonwealth Scholarship Commission for awarding him the Commonwealth Academic Staff Fellowship at the University of Bradford, United Kingdom. The support of the Bioinformatics Centre, Assam University, Silchar, India, in carrying out the docking part of the work is sincerely acknowledged.

REFERENCES

Agatonovic-Kustrin, S., Alany, R.G., 2001. Role of genetic algorithms and artificial neural networks in predicting the phase behavior of colloidal delivery systems. Pharm. Res. 18, 1049–1055.
Aghav, R.M., Kumar, S., Mukherjee, S.N., 2011. Artificial neural network modelling in competitive adsorption of phenol and resorcinol from water environment using some carbonaceous adsorbents. J. Hazard. Mater. 188, 67–77.
Alberts, B., 2008. Molecular Biology of the Cell. Garland Science, New York, p. 1298. ISBN 0-8153-4105-9.
Brickley, M.R., Shepherd, J.P., Armstrong, R.A., 1998. Neural networks. J. Dent. 26, 305–309.
Bourquin, J., Shmidli, H., Hoogevest, P., Van Leuenberger, H., 1998a. Advantages of Artificial Neural Networks (ANNs) as alternative modeling technique for data sets showing non-linear relationship using data from a galenical study on a solid dosage form. Eur. J. Pharm. Sci. 7, 5–16.
Bourquin, J., Shmidli, H., Hoogevest, P., Van Leuenberger, H., 1998b. Comparison of artificial neural networks (ANN) with classical modeling techniques using different experimental designs and data from a galenical study on a solid dosage form. Eur. J. Pharm. Sci. 6, 287–300.
Bourquin, J., Shmidli, H., Hoogevest, P., Van Leuenberger, H., 1998c. Pitfalls of artificial neural networks (ANN) modeling technique for data sets containing outlier measurements using a study on mixture properties of a direct compressed dosage form. Eur. J. Pharm. Sci. 7, 17–28.
Buciński, A., Bączek, T., 2002. Optimization of HPLC separations of flavonoids with the use of artificial neural networks. Pol. J. Food Nutr. Sci. 4, 47–51.
Buciński, A., Nasal, A., Kaliszan, R., 2000. Pharmacological classification of drugs based on neural network processing of molecular modeling data. Comb. Chem. High Throughput Screen. 3, 525–533.
Buciński, A., Zieliński, H., Kozłowska, H., 2004. Artificial neural networks for prediction of antioxidant capacity of cruciferous sprouts. Trends Food Sci. Technol. 15, 161–169.
Chen, Y., McCall, T.W., Baichwal, A.R., Meyer, M.C., 1999. The application of an artificial neural network and pharmacokinetic simulations in the design of controlled-release dosage forms. J. Control. Release 59, 33–41.
Coutinho, E.M., 2002. Gossypol: a contraceptive for men. Contraception 65, 259–263.
Dayhoff, J.E., DeLeo, J.M., 2001. Artificial neural networks: opening the black box. Cancer 91, 1615–1635.
Gašperlin, M., Tušar, L., Tušar, M., Kristl, J., Šmid-Korbar, J., 1998. Lipophilic semisolid emulsion systems: viscoelastic behavior and prediction of physical stability by neural network modeling. Int. J. Pharm. 168, 243–254.
Gašperlin, M., Tušar, L., Tušar, M., Šmid-Korbar, J., Zupan, J., Kristl, J., 2000. Viscosity prediction of lipophilic semisolid emulsion systems by neural network modeling. Int. J. Pharm. 196, 37–50.
Haghdadi, M., Fatemi, M.H., 2010. Artificial neural network prediction of the psychometric activities of phenylalkylamines using DFT-calculated molecular descriptors. J. Serb. Chem. Soc. 75, 1391–1404.
Honda, A., Siruntawineti, J., Baba, T., 2002. Role of acrosomal matrix proteases in sperm-zona pellucida interactions. Hum. Reprod. Update 8, 405–412.
Hussain, A.S., Yu, X., Johnson, R.D., 1991. Application of neural computing in pharmaceutical product development. Pharm. Res. 8, 1248–1252.
Ishikawa, T., Hirano, H., Saito, H., Sano, K., Ikegami, Y., Yamaotsu, N., Hirono, S., 2012. Quantitative structure-activity relationship (QSAR) analysis to predict drug-drug interactions of ABC transporter ABCG2. Mini Rev. Med. Chem. 12, 505–514.
Isu, Y., Nagashima, U., Hosoya, H., Aoyma, T., 1994. Development of neural network simulator for structure-activity correlation of molecules. J. Chem. Softw. 2, 76–95.
Kaliszan, R., Bączek, T., Buciński, A., Buszewski, B., Sztupecka, M., 2003. Prediction of gradient retention from the linear solvent strength (LSS), quantitative structure-retention relationships (QSRR), and artificial neural networks (ANN). J. Sep. Sci. 26, 271–282.
Kandimalla, K.K., Kanikkannan, N., Singh, M., 1999. Optimization of a vehicle mixture for the transdermal delivery of melatonin using artificial neural networks and response surface method. J. Control. Release 61, 71–82.
Katsila, T., Spyroulias, G.A., Patrinos, G.P., Matsoukas, M.-T., 2016. Computational approaches in target identification and drug discovery. Comput. Struct. Biotechnol. J. 14, 177–184.
Keshmiri-Neghab, H., Goliaei, B., 2014. Therapeutic potential of gossypol: an overview. Pharm. Biol. 52, 124–128.
Laurie, A., Jackson, R., 2005. Q-SiteFinder: an energy-based method for the prediction of protein–ligand binding sites. Bioinformatics 21, 1908–1916.
Lee, C.W., Park, J.A., 2001. Assessment of HIV/AIDS-related health performance using an artificial neural network. Inf. Manage. 38, 231–238.
Mendyk, A., Dorożyński, P., Jachowicz, R., 2007. Proceedings of International Joint Conference on Neural Networks, Orlando, FL, 12–17 August 2007. IEEE catalog number: 07CH37922C, ISBN: 1-04244-1380-X, ISSN: 1098-7576.
Naik, P.K., Patel, A., 2009. Prediction of anticancer/non-anticancer drugs based on comparative molecular moment descriptor using Artificial Neural Network and support vector machine. Dig. J. Nanomater. Biostruct. 4, 19–43.
Paixão, P., Gouveia, L.F., Morais, J.A.G., 2010. Prediction of the in vitro permeability determined in Caco-2 cells by using artificial neural networks. Eur. J. Pharm. Sci. 41, 107–117.
Peh, K.K., Lim, C.P., Quek, S.S., Khoh, K.H., 2000. Use of artificial neural networks to predict drug dissolution profiles and evaluation of network performance using similarity factor. Pharm. Res. 17, 1384–1388.
Perkins, R., Fang, H., Tong, W., Welsh, W.J., 2003. Quantitative structure-activity relationship methods: perspectives on drug discovery and toxicology. Environ. Toxicol. Chem. 22, 1666–1679.
Petrović, J., Ibrić, S., Betz, G., Jelena Parojčić, Đ.Z., 2009. Application of dynamic neural networks in the modeling of drug release from polyethylene oxide matrix tablets. Eur. J. Pharm. Sci. 38, 172–180.
Sliwoski, G., Kothiwale, S., Meiler, J., Lowe Jr., E.W., 2014. Computational methods in drug discovery. Pharmacol. Rev. 66, 334–395.
Snow, P.B., Rodvold, D.M., Brandt, M.J., 1999. Artificial neural networks in clinical urology. Urology 54, 787–790.
So, S.S., Richards, W.G., 1992. Application of neural networks: quantitative structure-activity relationships of the derivatives of 2,4-diamino-5-(substituted-benzyl)pyrimidines as DHFR inhibitors. J. Med. Chem. 35, 3201–3207.
STAHL, 2000. Stahlbau, 69, 672. https://doi.org/10.1002/stab.200002470.
Takayama, K., Fujikawa, M., Nagai, T., 1999. Artificial neural network as a novel method to optimize pharmaceutical formulation. Pharm. Res. 16, 1–6.
Takayama, K., Fujikawa, M., Obata, Y., Morishita, M., 2003. Neural network based optimization of drug formulations. Adv. Drug Deliv. Rev. 55, 1217–1231.
Vanyúr, R., Héberger, K., Jakus, J., 2003. Prediction of anti-HIV-1 activity of a series of tetrapyrrole molecules. J. Chem. Inf. Comput. Sci. 43, 1829–1836.
Wang, L., Xie, X.-Q., 2014. Computational target fishing: what should chemogenomics researchers expect for the future of in silico drug design and discovery? Future Med. Chem. 6, 247–249.
Zhang, S., 2011. Computer-aided drug discovery and development. Methods Mol. Biol. 716, 23–38.
Zupan, J., Gasteiger, J., 1993. Neural Networks for Chemists. An Introduction. Wiley-VCH, Weinheim.
Zupan, J., Gasteiger, J., 1999. Neural Networks in Chemistry and Drug Design, second ed. Wiley-VCH, Weinheim.

Index Note: Page numbers followed by f indicate figures, t indicate tables, and b indicate boxes.

A

Ab initio methods, 202–203 Absolute configuration (AC), 207 geissolaevine, 151–153, 153f of phyllostin, 15, 18f of phytotoxins, 18–19, 18f of scytolide, 15, 18f Accelerated solvent extraction (ASE), 101, 105t Acrosin gossylic lactone, 320t gossypol, 323t gossypol-6,6 dimethyl ether, 321t gossypol tetraacetic acid, 320t Actinosporins, 151, 152f AFC. See Automated flash chromatography (AFC) Affinity capillary electrophoresis (ACE), 126 Aldingenins, 212–213, 212f Aliphatic glucosinolates, 249 Angiotensin-converting enzyme, 11 Animal models, 285 Annona squamosa, 210, 211f ANNs. See Artificial neural networks (ANNs) Antibiotic resistance platform (ARP), 151 Antibiotics and Secondary Metabolite Analysis Shell (antiSMASH), 261 Antioxidants, Lespedeza vigrata, 9, 9f Apex, 206 APExBIO, 143–146 ArgusLab docking software, 11 Artarborol, 212, 212f Artemisia annua, 55–56, 56f Artemisinin, 208f, 209 Artificial intelligence (AI), 16, 194–195 Artificial neural networks (ANNs), 31–33 data, input and output vector, 304 docking experiment, 308 experimental data set, 308, 309–310t gossypol derivatives, 303–304, 304f, 308, 309–310t

docking, 318, 319–323t from Gossypium herbaceum, 303–304, 304f mean squared error, 308–312 modelling procedure, 306, 307f output data vs. target data, 312–314, 313f, 314–318t in pharmaceutical science, 303 physicochemical descriptors, 304, 305–306t, 309–310t in predicting bioactivity, 303 regression factor, 312–314, 313f software and hardware environment, 304–306 training and test data set, 306, 307t Artocarpus anisophyllus, 13–14, 13f Asteraceae, 4–5 Atmospheric pressure chemical ionization (APCT), 131 Aurantiamide acetate, 55–56, 56f Aurora Compound Library, 143–146 AutoDock 4, 12 Automated flash chromatography (AFC) advantages, 111 Biotage, 109 flavonoids, 111, 112f heteroclitin D from Kadsurae caulis, 113 indigotin, 111 indirubin, 111 instrumentation, 109 isoflavonoids, 111, 112f in natural product chemistry, 111 principle, 109 reversed-phase, 111 solvents, 110t 9-tetrahydrocannabinolic acid A (THCA), 113 types, 109 Automated Mass Spectral Deconvolution and Identification Software (AMDIS), 150–151, 154

335

336  Index Automatic Interaction Detection (AID) algorithm, 31–33 Azetidine-2-carbonitriles, 178 Azetidine-2-carboxylic acid, 178

B

Bacteriocins, 259–260 BACTIBASE, 259–260 BAGEL, 262 Basic Local Alignment Search Tool (BLAST), 257, 262–263 BBD. See Box-Behnken design (BBD) Beneficial regulator targeting (BeReTa), 33–34 Bermuda grass, nematode infection of heirarchical clustering, 243, 244f non-metric multidimensional scaling, 242, 242f random forest model, 245, 246–247f β-carotene, 30, 31f β-caryophyllene, 291 Bioactivity assessment cannabinoid receptor 2, in silico assessment, 290–292 computational aids, 277–280 computers, role of, 278 data and text-mining strategies, 287–288 natural compounds, separation and identification of, 280–282 in natural product research and drug discovery, 285–287 in phytochemistry animal models, 285 ex vivo models, 284–285 in situ models, 284–285 in vitro cell culture models, 284 protein-based in vitro models, 283–284 software, 292–295, 293–294t virtual/in silico screening, 288–290 web-tools, 292–295, 293–294t Bioassay-guided fractionation, 280–281 Bioautography, 282 Biochemical network-integrated computational explorer (BNICE), 266–267 Biochemical pathways biochemical network-integrated computational explorer, 266–267 Cho system framework, 268 DESHARKY, 268 from metabolite to metabolite, 265–266 RetroPath, 267 Biological Magnetic Resonance Data Bank (BMRB), 204, 220–221

Biological networks, 56–58 Bioprinting, 284–285 Bioprospecting, 182, 279 Biotage, 109 Blocking, 97 Boltzmann distribution, 9–10 Bond-electron matrix (BEM), 266–267 Boswellic acid, 11, 11f Box-Behnken design (BBD), 100, 100f BRD9185, 178

C

CADD. See Computer-aided drug discovery (CADD) Caffeine, 59, 59f Camptothecin, 62–63, 63f Cannabinoid type-1 (CB1) receptors, 290–291 Cannabinoid type-2 (CB2) receptors, 290–292 Capillary electrochromatography (CEC), 125 Capillary electrophoresis (CE), 237 instrumentation, 124–125 microemulsion electrokinetic chromatography, 126 modes, 125 from Nelumbo nucifera, 126, 126f non-aqueous capillary electrophoresis, 125 in phytochemical analysis, 126, 126f, 127–128t principle, 124 Capillary gel electrophoresis (CGE), 125 Capillary isoelectric focusing (CIEF), 125 Capillary isotachophoresis (CITP), 125 Capillary zone electrophoresis (CZE), 125 CARAMEL, 154 CASE. See Computer-aided structure elucidation (CASE) CCC. See Counter current chromatography (CCC) CCD. See Central composite design (CCD) CE. See Capillary electrophoresis (CE) Central composite design (CCD), 99–100, 100f Charles River Compound Library, 143–146 ChEMBL, 270 ChemBridge Compound Library, 143–146 ChemDraw, 200–201 ChemGPS-NP, 288–289

Index 337 Chemical compound databases ChEBI, 269–270 ChEMBL, 270 ChemSpider, 270–271 natural products, 269 Norine, 269 PubChem, 270 StreptomeDB, 269 Chemical Entities of Biological Interest (ChEBI), 269–270 Chemical shift, 16–17, 202–203, 212–213 Chemoinformatic analysis, 180–181 Chemometrics, 285–286 classification, 21–22 defined, 21 pattern recognition methods, 21–22 in phytochemical research, 22–23 Chemotaxonomy, 60 Chemscore, 11–12 ChemSpider, 270–271 Chiral centres, 207–208 Chlorinated diterpene, 16–17, 17f Chlorogenic acid, 11–12, 11f Cholesterol, 195, 196f Cho system framework, 268 ChromaDex CRSTM High Purity Phytochemical Library, 155 Clionasterol, 203–204, 204f Cluster analysis, 4 ClusterMine360, 261 ClustScan database (CSDB), 261 Combinatorial library, 148 backbone-based, 148 scaffold-based, 148 COMET, 150–151 Competitive learning, 4 Complexity-to-diversity (CtD), 180–181 Complex Mixture Analysis by NMR (COLMAR) database, 204, 220–221 COmplex PAthway SImulator (COPASI) software, 267 Compound library, 52 APExBIO, 143–146 Aurora, 143–146 biological descriptors, 146 Charles River Compound Libraries, 143–146 ChemBridge, 143–146 class, 147 combinatorial library, 148 diversity-based, 146 drug-like molecule, 142–143

focused libraries, 146–147 for high-throughput screening, 142 ligand-centric approaches, 147 morphine, 2D and 3D structural representation of, 143, 143f phytochemical library, 149 trends in library design, 142–143, 142f Computational chemistry, 2 Computational mass spectrometry, 204–207 Computational methods, in natural products isolations automated flash chromatography, 108–113, 110t, 112f capillary electrophoresis, 124–126, 126f, 127–128t counter current chromatography, 120–124, 122–123t high-performance/pressure liquid chromatography, 110t, 113–116, 114t, 117–118t, 119f hyphenated technique GC-MS, 129, 130t LC-MS, 131, 132t LC-NMR, 131–133 ultra-performance liquid chromatography, 116–120 Computational strain design methods, 33 Computer-aided drug discovery (CADD), 301–302 drug design, 48f ligand-based, 51–56 network pharmacology, 56–58 structure-based, 49–51 Computer-aided structure elucidation (CASE), 16, 17f, 154 COnstaint-Based Reconstruction and Analysis (COBRA) toolbox, 267 Counter current chromatography (CCC) advantages, 120 high-speed counter current chromatography, 121 in phytochemical analysis, 121, 122–123t principle, 120 Critical micelle concentration (CMC), 125 C. sativus Linn., 7 Cucumis trigonus Roxb., 7 Cycloquest, 264 Cynara scolymus, 23 Cytoscape, 290 Cytotoxicity assay, 169

338  Index

D

Database of BIoSynthesis clusters CUrated and InTegrated (DoBISCUT), 260 Databases, 26–29 Database search algorithm chemical shifts, 216 compound identification, 217–218 electrospray mass spectrometry, 216 four-step strategy, 217–218, 218f MetaboloDerivatizer, 218 nodakenetin angelate, 216–217, 217f Data mining, 26–29, 287–288 in medicinal plant, 65–67 Data processing data cleaning, 240 missing values, 240–241 normalization, 241 Density functional theory (DFT), 201 ab initio calculations, 6 antioxidants, 9, 9f Becke’s three-parameter functional, 8 C. sativus Linn., 7 Cucumis trigonus Roxb., 7 diospyrin, 7f, 8 flavonoids, 6–8 HAT mechanism, 6–7 highest occupied molecular orbital, 7 pistagremic acid, 7–8, 7f SET-PT mechanism, 6–7 smeathxanthone A, 8–9, 9f SPLET mechanism, 6–7 (+)-tephrodin, 8, 8f Dereplicated phytochemical library application, 155–160, 157f compound library, 142–149, 142f, 144–145t dereplication, 149–155, 152–153f, 152t Dereplication process AMDIS, 150–151 antibiotic resistance platform, 151 cluster analyses, 150–151 COMET, 150–151 computer-aided, 154–155 databases, 150–151, 152t data-mining strategies, 150–151 defined, 149–150 DEREP1, 150 DEREP2, 150 DEREP3, 150 DEREP4, 150 DEREP5, 150 GC-TOF-MS-based, 154 geissolaevine, 151–153, 153f

legonmaleimides, 153, 153f pipeline approach, 153 Plant Metabolite Annotation Toolbox, 151–153 Q-DIS/MARLIN, 150–151 X-hitting algorithm, 150–151 Dereplicator, 265 DESHARKY, 268 Design cube, 86, 86f Design of Experiment (DoE) of accelerated solvent extraction, 101, 105t designing phase fractional factorial design, 84–86, 85t, 86f full factorial design, 83–84, 84f, 85t Plackett-Burman design, 86–88, 87t, 89–91t, 92f screening, 80–92, 81–83b Taguchi design, 88–92, 93t of MAE process, 101, 102–103t optimization phase Box-Behnken design, 100, 100f central composite design, 99–100, 100f response surface methodology, 99 terminology, 96–99 planning phase, 78–79 statistical methods, 78 of supercritical fluid extraction, 101, 104t variations sources, 76 DFT. See Density functional theory (DFT) Diamond-like carbon (DLC) coatings, 222 Diospyrin, 8, 7f Diospyros, 8, 7f Diplodia africana, 17–18, 18f Discriminant analysis, 244–245 Diversity-oriented synthesis (DOS), 178 Dose–response analysis, 174–175, 175–176t Drosophilia, 187 DrugBank database, 205–206 Drug metabolism and pharmacokinetics (DMPK), 169

E

Echinacea sp., 291 Ecological approach, 61 Ecological plant defence theory, 61 Economic Botany Data Collection Standards, 61–62 Electronic circular dichroism (ECD), 17–19 Electronic Lab-Notebook (ELN), 286–287 Electron impact ionization (EI), 129 Electro-osmotic flow (EOF), 124 Electrospray (ESI), 131

Index 339 Electrospray mass spectrometry (ESIMS), 172, 216 Ellman’s method, 13–14 Environmental Surveyor of Natural Product Diversity (eSNaPD), 263 8-Epicordatin, 20, 20f Epoxyroussoedione, 20, 20f Epoxyroussoeone, 20, 20f Ethnobotany, 59–60 Euro+Med Plantbase, 28 E-WorkBook Suite, 286–287 Ex vivo models, 284

F

FASTA format, 12–13 Fast-atom-bombardment (FAB), 197–198 Febrifugine, 60–61, 61f Ficus racemose lupeol, 29–30, 30f triterpene, 29–30, 30f Fieser–Kuhn rules, 214 Fingerprint chemical, 22–23, 25 high-throughput screening, 146 molecular, 55–56 visual, 282 Flavanols, 111 Flavonoids Artocarpus anisophyllus, 13–14, 13f automated flash chromatography, 111, 112f density functional theory, 6–7 by preparative HPLC, 116, 119f selforganizing map, 4–5 Focused-libraries, 146–147 Fourier-transform spectroscopy (FTS), 200 Fractional factorial design, 84–86, 85t, 86f Fragmentation and Rearrangement ANalyZer (FRANZ), 206 From metabolite to metabolite (FMM), 265–266 Full factorial design (FFD), 83–84, 84f, 85t, 88 Fused pentacyclic flavonoid skeleton, 19, 19f

G

GABAA receptor, 12–13 Gas chromatography (GC), 237 Gauge invariant atomic orbital (GIAO) method, 16–17 GC-MS, 129, 130t Geissolaevine, 151–153, 153f Geissospermum laeve, 151–153, 153f

Genedata Screener software, 184 Genome-mining technique, 257 Germplasm Resources Information Network, 27 Global Natural Products Social Molecular Networking (GNPS), 64–65, 264–265 Glyphosate, 249 Good Laboratory Practice (GLP), 286–287 Good Manufacturing Practice (GMP), 286–287 Gossypol acrosin, 320–321t, 323t bonding pattern, 318, 324–331f derivatives, 303–304, 304f, 308, 309–310t docking, 318, 319–323t from Gossypium herbaceum, 303–304, 304f hyaluronidase, 321–323t G protein-coupled receptors (GPCRs), 146–147, 283–284, 290–291 GRIN Taxonomy database (GRIN), 27

H 1

H and 13C NMR-based structure elucidation, 16 Harperspinoids, 12, 12f Harrisonia perforate, 12, 12f HAT mechanism, 6–7 Heat map, 174 Herbal medicine, tools and techniques, 65, 66t Heteroclitin D, 113 Hexacyclinol, 208, 208f 13 C NMR spectrum of, 196–197 incorrect structure of, 197f from Panus rudis, 196–197 Hierarchical cluster analysis, 243, 244f High-content screening (HCS), 186–187 High-performance/pressure liquid chromatography (HPLC) analytical, 115–116 detectors, 115 normal phase, 114 in phytochemical analyses, 113, 116 preparative, 115–116 reversed-phase, 115 semi-preparative, 115–116 separation mode, 114 sorbents, 114, 114t High-speed counter current chromatography (HSCCC), 121 High-throughput ATAD5-luciferase assay, 184–185

340  Index High-throughput screening (HTS), 44–45 active compounds, 176, 177f compound library, 142, 169, 170f data visualization, 174 dose–response analysis, 174–175, 175–176t drug discovery method, 167 leads and drugs, 174 microtitre plates in, 169–170, 171f monitoring in vivo, 172–173 natural products defined, 179–180 for increasing diversity, 180–181 sample preparation, 181–183, 181f in Pfizer, 167, 168f pre-HTS era, 166–167 reaction monitoring and observation, 171–172 screening facility, 173–174 virtual, 147 HMG-CoA reductase, 13 HMMer, 257 Homology modelling, 15, 50, 159, 291 HPLC. See High-performance/pressure liquid chromatography (HPLC) HTS. See High-throughput screening (HTS) Human Metabolome Database (HMDB), 204, 220–221 Human organ-on-chip, 284–285 Hyaluronidase gossylic lactone, 321t gossypol, 323t gossypol-6,6 dimethyl ether, 322t gossypol tetraacetic acid, 322t Hydrangea macrophylla, 60–61, 61f Hyoscine, 60, 61f Hyoscyamine, 60, 61f Hypericum perforatum, 133 Hyphenated technique GC-MS, 129, 130t LC-MS, 131, 132t LC-NMR, 131–133

I

Image-based phenotypic screening, in HeLa cells, 220 Indigotin, 111 Indirubin, 111 Inference, 247–248, 248f Infrared (IR) spectroscopy, 214–216 In silico screening childhood absence epilepsy, 12–13 FASTA format, 12–13

GABAA receptor, 12–13 HMG-CoA reductase, 13 Molinspiration software server, 12–13 pyranoflavonoids, anti-Alzheimer’s activity of, 13–14 Schrodinger Glide module, 12–13 Trypanosoma brucei, 15 Integrated approach, 63–64 Integrated Microbial Genomes-Atlas of Biosynthetic gene Clusters (IMGABC), 260–261 Interaction model, 98 Interferometer, 214–215, 215f International Classification of Diseases (ICD), 61–62 Inula viscosa, 18–19, 18f (+)-Inuloxin B, 18–19, 18f (–)-Inuloxin C, 18–19, 18f In vitro cell culture models, 284 Ion mobility spectrometry (IMS), 199 Isoflavanones, 116, 119f Isoflavonoids, 111, 112f Isolation–structure identification–activity confirmation, 51

K

Kadsurae caulis, 113 Kohonen-based selforganizing map (SOM), 3 Kyoto Encyclopaedia of Genes and Genomes (KEGG), 265–266

L

Laboratory Information Management Systems (LIMS), 286–287 Latent variables, 244–245 Latin Hypercube sampling (LHS), 47 LC-MS, 131, 132t LC-NMR, 131–133 Leads, 174 Legonmaleimides, 153, 153f Leptomycin B (LMB), 186–187, 186f Lespedeza vigrata, 9, 9f Ligand-based CADD, 51–56 Ligand-centric approaches, 147 Ligand Fit of Accelrys Discovery studio 2.1, 13 Ligand library, 51 LigandScout, 289, 291–292 Linear model, 98 Linear regression, 243–244 Lipinski’s rule of five, 51

Index 341 Liquid chromatography (LC), 237 Lupeol, 29–30, 30f Lutein, 30, 31f

M

Markov chain, 46 Markov chain Monte Carlo (MCMC) sampler, 46–47 Mass-spectral library search, 205–206 MAss Spectrum SImulation System (MASSIS), 206 Mathematical models, 45–47 Medical Subject Headings (MeSH), 65–67 Medicinal plants chemotaxonomy, 60–61 databases, 26, 27t, 64–65 data mining in, 65–67 ecological approach, 61 ethnobotany-directed drug discovery, 59–60 integrated approach, 63–64 random approach, 62–63 safety considerations, 67–68 Melilotus officinalis, 43–44, 44f MetaDrug, 206 Metal-organic framework (MOF), 223 Micellar electrokinetic chromatography (MEKC), 125 Microbial Screening Technology, 143–146, 150–151 Microemulsion electrokinetic chromatography (MEEKC), 126 Microtubule inhibitors, 176–177 Microwave-assisted extraction (MAE), 82, 86–88 DoE-based optimization, 101, 102–103t Minimum Information about a Biosynthetic Gene cluster (MIBiG), 260 Misassignments, structures, 195–197, 196–197f Molecular descriptors, 2, 4–5, 53 Molecular docking, 10 AutoDock 4, 12 boswellic acid, 11, 11f chemscore, 11–12 chlorogenic acid, 11–12, 11f gossypol, 318, 319–323t Phyllanthis niruri, 15 ursolic acid, 11, 11f Molecular dynamics simulations (MDSs), 15 Molecular fingerprint-based technique, 55–56 Molegro Virtual Docker version 6.0, 14–15 Molinspiration software server, 12–13 Monte Carlo simulation, 46–47

Morphine, 2D and 3D structural representation, 143, 143f Multidimensional scaling, 242 MultiGeneBlast, 257, 262–263 Multiple linear regression (MLR), 54–55

N

N-alkyl amides, 291 NAPRALERT, 26–27 NAPROC-13, 183 National Center for Biotechnology Information (NCBI), 262–263 Natural Product Domain Seeker (NaPDos), 262 Natural products, HTS process automated sample preparation, 181–182, 181f defined, 179–180 for increasing diversity, 180–181 with pharmacological activity, 179, 180f sample preparation, 181–183, 181f Neighborgram, 185 Network pharmacology, 56–58, 290 New chemical entries (NCEs), 107–108 Noise factors, 97 Non-aqueous capillary electrophoresis (NACE), 125 Non-metric multidimensional scaling, 242, 242f Non-ribosomal peptide (NRP) identification algorithm (NRPquest), 264 Non-ribosomal peptide synthetases (NRPSs), 259 Norine, 269 Normal phase HPLC (NP-HPLC), 114 NRPSpredictor, 263 Nuclear magnetic resonance (NMR), 131–133, 202–204

O

One-Variable-At-a-Time (OVAT), 77–78 Open-access software packages, 174–175, 175–176t Optical rotatory dispersion (ORD), 17–18 Orthogonal arrays, 88–92, 93t Oxidosqualene cyclases, 50 Oxysporone, 17–18, 18f

P

Paclitaxel, 63–64 Pareto plot chart, 88, 92f Pattern recognition methods, 21–22 PBD. See Plackett–Burman design (PBD)

342  Index PCA. See Principal component analysis (PCA) Pep2Path, 264 Pharmacokinetic parameters, 45 Phyllanthis niruri, 15 Phyllosticta cirsii, 17–18, 18f Phyllostin, 17–18, 18f Physicochemical descriptors, gossypol, 304, 305–306t, 308, 309–311t Phytochemical library, 149 PhytoLogix, 183–184 Piperitenone oxide, 68, 68f Pistacia integerrima, 7–8, 7f Pistagremic acid, 7–8, 7f Plackett–Burman design (PBD), 29–30, 83–84, 86–88, 87t, 89–90t ANOVA and regression analysis, 88, 91t Plant Metabolite Annotation Toolbox (PlantMAT), 151–153 Plant metabolomics in agriculture, 248–250 analysis modalities, 238 analytical tools, 235 bias and variance, 233–234, 233f clarity, 233 Cycloquest, 264 data collection considerations, 236 data-driven approaches, 232–233 data processing data cleaning, 240 missing values, 240–241 normalization, 241 data structures, 239 dereplicator, 265 GNPS, 264–265 high-throughput method, 238–239 inference, 247–248 instrumentation, 236–237 NRPquest, 264 Pep2Path, 264 reproducibility, 234–235 RiPPquest, 264 sample preparation, 237–238 supervised approach discriminant analysis, 244–245 linear regression, 243–244 performance considerations, 246–247 tree-based methods, 245 trade-offs, 233–234 unsupervised approach clustering, 243 ordination, 241–242 utility, 232–233

Polarizable continuum model (PCM), 16–17 Polyketide synthases (PKSs), 259 Polymethoxyflavones, 111 Prediction errors, 221–222 Prediction of activity spectra for substances (PASS), 289 Predictive toxicology, 68 Principal component analysis (PCA), 116, 205–206, 241–242, 286 dimensionality, reduction in, 23 metabolomics, 25 multivariate data, 23 in phytochemical analysis, 22–23 software, 24, 24t Protein-based in vitro models, 283–284 Pseudo-quantitation, 238 Psychotria pilifera, 19, 19f Psychotripine, 19, 19f PubChem, 270 Python, 235

Q

Q-DIS/MARLIN, 150–151 QSAR. See Quantitative structure-activity relationship (QSAR) QSAR descriptors, 13–14 Quadratic model, 98–99 Quality–efficacy–safety triangle, 279, 280f Quantitative composition activity relationship (QCAR), 55 Quantitative retention activity relationship (QRAR), 282 Quantitative structure-activity relationship (QSAR), 48, 53–54 data matrix for, 54, 55t in herbal formulae, 55 multiple linear regression, 54–55 Quantitative structure retention relationship (QSRR), 281–282 Quantum Chemistry Exchange Program (QCEP), 200–201 Quercetin glycosides, 11 Quinine, 59, 59f

R

Raintree database, 28 Raman spectroscopy, 222 Random approach, 62–63 Random forest (RF) model, 5–6, 245 Randomization, 97 Random selection approach, 65–67

Receptor-ligand docking, 14–15 Regression analysis, 60 Relational Database Management Systems (RDBMS), 215–216 Relational data model (RDM), 28–29 Replication, 97 Response surface methodology (RSM), 29–30, 94, 94–96f, 99 RetroPath, 267 Reversed pharmacology approach, 283, 283f Reversed-phase HPLC (RP-HPLC), 115 RiPPquest, 264 Root-mean-square deviations (RMSD), 213 Roussoella japanensis epoxyroussoedione, 20, 20f epoxyroussoeone, 20, 20f RSM. See Response surface methodology (RSM)

S

Safety considerations, 67–68 Salmonella, 184 Sanggenon, 50–51 Sarpagine, 194, 194f Saturated designs. See Plackett–Burman design (PBD) Schrödinger Glide module, 12–13 Scopoletin, 50–51, 51f Scytolide, 17–18, 18f Secondary metabolites, 107–108 biochemical pathways biochemical network-integrated computational explorer, 266–267 Cho system framework, 268 DESHARKY, 268 from metabolite to metabolite, 265–266 RetroPath, 267 biosynthetic gene clusters antiSMASH, 261 BACTIBASE, 259–260 BAGEL, 262 ClusterMine360, 261 ClustScan Database, 261 DoBISCUIT, 260 eSNaPD, 263 IMG-ABC, 260–261 MIBiG, 260 MultiGeneBlast, 262–263 NaPDos, 262 NRPSpredictor, 263 SMURF, 261–262 chemical compound databases ChEBI, 269–270 ChEMBL, 270 ChemSpider, 270–271 natural products, 269 Norine, 269 PubChem, 270 StreptomeDB, 269 genome-mining technique, 257 metabolomics study Cycloquest, 264 dereplicator, 265 GNPS, 264–265 NRPquest, 264 Pep2Path, 264 RiPPquest, 264 Secondary Metabolite Unique Regions Finder (SMURF), 261–262 Self-organizing map (SOM) Asteraceae, 4–5 cluster analysis, 4 competitive learning, 4 components, 3–4 flavonoids, 4–5 modes, 4 molecular descriptors, 4–5 sesquiterpene lactones, 4–5 unsupervised learning, 4 Sesquiterpene lactones, 4–5 SET-PT mechanism, 6–7 Smeathxanthone A, 8–9, 9f Soft independent modelling of class analogy (SIMCA), 23 Solid-phase extraction (SPE), 181–182 Spectral data chiral centres, 207–208 computational mass spectrometry, 204–207 database search algorithm, 216–222 density functional theory, 201 era of assignment vs. prediction, 201–222 history, 194–195 infrared spectroscopy, 214–216 misassignments of structures, 195–197, 196–197f nuclear magnetic resonance, 202–204 Raman spectroscopy, 222 structure by calculations, 208–214 structure elucidation, 197–201, 198–199f UV spectroscopy, 214 X-ray sponge technique, 222–223 SPEEDY®, 181–182 SPLET mechanism, 6–7 Standard run order, 84 StreptomeDB, 269 Streptomyces albus, 153, 153f

Structure-activity relationships (SAR), 53–54 Structure-based CADDs, 49–51 Structured query language (SQL), 215–216 Structure elucidation ChemDraw, 200–201 fast-atom-bombardment, 197–198 Fourier-transform spectroscopy, 200 ion mobility in, 199 of natural products, 197, 198f, 199–200 of phytochemicals, 197 of quinocinnolinomycins, 220, 221f Stypopodium flabelliforme, 16–17, 17f Supercritical fluid extraction (SFE), 101, 104t Supervised approach discriminant analysis, 244–245 linear regression, 243–244 performance considerations, 246–247 tree-based methods, 245

T

Taguchi design, 88–92 Target-fishing, 14–15, 289 Taxol, 62–63, 63f (+)-Tephrodin, 8, 8f Text-mining, 288 Time-dependent density functional theory (TDDFT), 207–208 Time of flight (TOF) MS, 199 Traditional Chinese medicine (TCM), 5–6 Traditional Chinese Medicine Integrated Database (TCMID), 28 Transcriptional regulator (TR) manipulation, 33–34 Tree-based methods, 245 Trifolium resupinatum antioxidant activity, 6–7 flavonoids, 6–7 Triterpene, 29–30, 30f Trypanosoma brucei, 15

U

Ultra-performance liquid chromatography (UPLC), 116–120 United Nations Convention on Biological Diversity (CBD), 156 Universal Functional-group Activity Coefficients (UNIFAC) model, 124 Unsupervised approach clustering, 243 ordination, 241–242 Unsupervised learning, 4 Ursolic acid, 11, 11f US National Institutes of Health (NIH), 270 UV spectroscopy, 214

V

Vibrational circular dichroism (VCD), 17–18 Virtual high-throughput-screening (vHTS), 10, 185 Virtual/in silico screening, 288–290 Virtual phytochemical library, 158–160

W

Woodward–Fieser rules, 214

X

X-hitting algorithm, 150–151 X-ray sponge technique, 222–223