Computational Vaccine Design 1071632388, 9781071632383

English Pages 512 [513] Year 2023

Methods in Molecular Biology 2673

Pedro A. Reche Editor

Computational Vaccine Design

METHODS IN MOLECULAR BIOLOGY

Series Editor John M. Walker School of Life and Medical Sciences University of Hertfordshire Hatfield, Hertfordshire, UK

For further volumes: http://www.springer.com/series/7651

For over 35 years, biological scientists have come to rely on the research protocols and methodologies in the critically acclaimed Methods in Molecular Biology series. The series was the first to introduce the step-by-step protocols approach that has become the standard in all biomedical protocol publishing. Each protocol is provided in readily reproducible step-by-step fashion, opening with an introductory overview, a list of the materials and reagents needed to complete the experiment, and followed by a detailed procedure that is supported with a helpful notes section offering tips and tricks of the trade as well as troubleshooting advice. These hallmark features were introduced by series editor Dr. John Walker and constitute the key ingredient in each and every volume of the Methods in Molecular Biology series. Tested and trusted, comprehensive and reliable, all protocols from the series are indexed in PubMed.

Computational Vaccine Design Edited by

Pedro A. Reche School of Medicine, Department of Immunology, Complutense University of Madrid, Madrid, Spain

Editor Pedro A. Reche School of Medicine, Department of Immunology Complutense University of Madrid Madrid, Spain

ISSN 1064-3745 ISSN 1940-6029 (electronic) Methods in Molecular Biology ISBN 978-1-0716-3238-3 ISBN 978-1-0716-3239-0 (eBook) https://doi.org/10.1007/978-1-0716-3239-0 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Humana imprint is published by the registered company Springer Science+Business Media, LLC, part of Springer Nature. The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.

Preface

Vaccines are the most successful and cost-effective medical intervention humankind has ever developed to combat infectious diseases. Vaccines have saved millions of lives and, unlike any other medicines, a single dose or a few doses can keep us disease free for an entire lifetime. However, the number of available vaccines is still limited, and vaccine development has often been neglected in favor of small-molecule drugs. Fortunately, interest in vaccines and vaccine development has seen a renaissance in the last few years, not only for infectious diseases but also for cancer, allergy, and autoimmune diseases. Fueled by advances in technology and knowledge, computational vaccinology has emerged in this context as a potent instrumental discipline that can save time and cost in vaccine development.

Computational vaccinology is somewhat similar to reverse vaccinology, but there are differences. Both computational and reverse vaccinology rely on computational methods and tools to identify vaccine candidates from genomic/proteomic data. These vaccine candidates include antigens that are likely to induce protective responses or the precise antigen regions, epitopes, recognized by the immune system. However, computational vaccinology can go beyond the step of vaccine candidate selection and arrive at a rational vaccine design that can also be tested in silico through computational simulations.

This book is about computational vaccine design and the technologies that support it. The book is divided into four parts, representing fundamental pillars of computational vaccine design. Part I, Immunomics and System Immunology, comprises chapters dedicated to technologies and methodologies providing the basic data (e.g., epitope data) and fundamental knowledge required for vaccine design. In addition, a chapter describing a computational method for grafting epitopes and another on manufacturing nano-scale vaccine platforms are included in this part.
Part II, Databases, includes chapters describing important immunological databases that are tuned for vaccine design. Part III, Prediction of Antigenicity and Immunogenicity: Tools and Protocols, presents immunoinformatic tools for vaccine design and for predicting antigens and antigenic determinants. It also includes tools and protocols to predict the steps of antigen processing and recognition that determine the immunogenicity of T cell epitopes, as well as novel methods to identify T cell epitopes capable of inducing the production of specific cytokines. Finally, Part IV, Computational Vaccinology Applications and Protocols, contains chapters describing full protocols for computational vaccine design as well as specific applications, including various viruses, protozoa, and allergens. Overall, these chapters reflect how the rigorous and imaginative use of computational technologies can catalyze future efforts to improve global public health through the development of a broad range of novel vaccines.

We tried to make this book as comprehensive as possible, but we realize the difficulty of the task and apologize for its many shortcomings and omissions. We wish to thank all the authors for their valuable contributions, which made this book possible. We also thank Prof. John Walker, Editor-in-Chief of the Methods in Molecular Biology series, for all his attention, encouragement, and help. Likewise, we thank the staff at Humana for their guidance and support throughout the entire process of getting this book published. Finally, we wish to thank the readers of this book and hope that they can gather some wisdom and practical knowledge from it.

Madrid, Spain

Pedro A. Reche

Contents

Preface
Contributors

1 Vaccine Design: An Introduction
Tara Fiyouzi and Pedro A. Reche

PART I IMMUNOMICS AND SYSTEM IMMUNOLOGY

2 Epitope Binning of Monoclonal and Polyclonal Antibodies by Biolayer Interferometry
Kaito Nagashima and Jarrod J. Mousa

3 Clustering and Annotation of T Cell Receptor Repertoires
Sebastiaan Valkiers, Sofie Gielis, Vincent M. L. Van Deuren, Kris Laukens, and Pieter Meysman

4 Protocol for Classification Single-Cell PBMC Types from Pathological Samples Using Supervised Machine Learning
Minjie Lyu, Lin Xin, Huan Jin, Lou T. Chitkushev, Guanglan Zhang, Derin B. Keskin, and Vladimir Brusic

5 Unbiased, High-Throughput Identification of T Cell Epitopes by ELISPOT
Paul V. Lehmann, Diana R. Roen, and Alexander A. Lehmann

6 CD4+ T Cell Epitope Identification from Complex Parasite Antigen Mixtures
Miriam Bertazzon, Miguel Álvaro-Benito, Friederike Ebner, and Eliot Morrison

7 Computational Grafting of Epitopes
Manish Manish, Smriti Mishra, Monika Pahuja, Ayush Anand, Naidu Subbarao, and Ram Samudrala

8 Manufacture of Mesoporous Silicon Microparticles (MSMPs) as Adjuvants for Vaccine Delivery
Ana Lopez-Gomez, Irene Real-Arévalo, Raúl Martín-Palma, Eduardo Martínez-Naves, and Manuel Gómez del Moral

PART II DATABASES

9 IEDB and CEDAR: Two Sibling Databases to Serve the Global Scientific Community
Nina Blazeska, Zeynep Kosaloglu-Yalcin, Randi Vita, Bjoern Peters, and Alessandro Sette

10 Updates on Databases of Allergens and Allergen-Epitopes
Rajat Kanti Sarkar, Nandini Ghosh, Gaurab Sircar, and Sudipto Saha

11 TSNAD and TSNAdb: The Useful Toolkit for Clinical Application of Tumor-Specific Neoantigens
Jingcheng Wu and Zhan Zhou

12 EPIPOX: A Resource Facilitating Epitope-Vaccine Design Against Human Pathogenic Orthopoxviruses
Laura Ballesteros-Sanabria, Hector F. Pelaez-Prestel, Pedro A. Reche, and Esther M. Lafuente

PART III PREDICTION OF ANTIGENICITY AND IMMUNOGENICITY: TOOLS AND PROTOCOLS

13 Prediction of Linear B Cell Epitopes in Proteins
Juan R. de los Toyos

14 Design of Linear B Cell Epitopes and Evaluation of Their Antigenicity, Allergenicity, and Toxicity: An Immunoinformatics Approach
Vijaya Sai Ayyagari

15 NetCleave: An Open-Source Algorithm for Predicting C-Terminal Antigen Processing for MHC-I and MHC-II
Roc Farriol-Duran, Marina Vallejo-Vallés, Pep Amengual-Rigo, Martin Floor, and Víctor Guallar

16 Prediction of TAP Transport of Peptides with Variable Length Using TAPREG
Hector F. Pelaez-Prestel, Sara Alonso Fernandez, Laura Ballesteros-Sanabria, and Pedro A. Reche

17 Docking-Based Prediction of Peptide Binding to MHC Proteins
Mariyana Atanasova and Irini Doytchinova

18 The PANDORA Software for Anchor-Restrained Peptide:MHC Modeling
Dario F. Marzella, Giulia Crocioni, Farzaneh M. Parizi, and Li C. Xue

19 Prediction of Peptide and TCR CDR3 Loops in Formation of Class I MHC-Peptide-TCR Complexes Using Molecular Models with Solvation
Nairuti Milan Mehta, Yuhui Li, Vini Patel, Wanning Li, Noam Morningstar-Kywi, Mateusz Pospiech, Houda Alachkar, and Ian S. Haworth

20 Prediction of Bacterial Immunogenicity by Machine Learning Methods
Ivan Dimitrov and Irini Doytchinova

21 Vaxi-DL: An Artificial Intelligence-Enabled Platform for Vaccine Development
P. Preeti, Swarsat Kaushik Nath, Nevidita Arambam, Trapti Sharma, Priyanka Ray Choudhury, Alakto Choudhury, Vrinda Khanna, Ulrich Strych, Peter J. Hotez, Maria Elena Bottazzi, and Kamal Rawal

22 A Web-Based Method for the Identification of IL6-Based Immunotoxicity in Vaccine Candidates
Anjali Dhall, Sumeet Patiyal, Neelam Sharma, Salman Sadullah Usmani, and Gajendra P. S. Raghava

23 In Silico Tool for Identification, Designing, and Searching of IL13-Inducing Peptides in Antigens
Shipra Jain, Anjali Dhall, Sumeet Patiyal, and Gajendra P. S. Raghava

PART IV COMPUTATIONAL VACCINOLOGY APPLICATIONS AND PROTOCOLS

24 A Lean Reverse Vaccinology Pipeline with Publicly Available Bioinformatic Tools
Bart Cuypers, Rino Rappuoli, and Alessandro Brozzi

25 Immunoinformatics Protocol to Design Multi-Epitope Subunit Vaccines
Parismita Kalita, Aditya K. Padhi, and Timir Tripathi

26 In Silico Structure-Based Vaccine Design
Sakshi Piplani, David Winkler, Yoshikazu Honda-Okubo, Varun Khanna, and Nikolai Petrovsky

27 Reverse Vaccinology for Influenza A Virus: From Genome Sequencing to Vaccine Design
Valentina Di Salvatore, Giulia Russo, and Francesco Pappalardo

28 Immunoinformatics Vaccine Design for Zika Virus
Ana Clara Antonelli, Vinnycius Pereira Almeida, and Simone Gonçalves da Fonseca

29 Immunoinformatics Approaches in Designing Vaccines Against COVID-19
Ankita Chakraborty, Jagadeesh Bayry, and Suprabhat Mukherjee

30 A Sample Guideline for Reverse Vaccinology Approach for the Development of Subunit Vaccine Using Varicella Zoster as a Model Disease
Elif Cireli and Levent Çavaş

31 Computational Vaccine Design for Poxviridae Family Viruses
Abbas Khan, Dong-Qing Wei, and Muhammad Suleman

32 Computational Prediction of Trypanosoma cruzi Epitopes Toward the Generation of an Epitope-Based Vaccine Against Chagas Disease
Albert Ros-Lucas, David Rioja-Soto, Joaquim Gascón, and Julio Alonso-Padilla

33 Computational Vaccine Design for Common Allergens
Nandini Ghosh, Gaurab Sircar, and Sudipto Saha

Index

Contributors

HOUDA ALACHKAR • Titus Family Department of Clinical Pharmacy, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA, USA
VINNYCIUS PEREIRA ALMEIDA • Department of Bioscience and Technology, Institute of Tropical Pathology and Public Health, Federal University of Goiás, Goiânia, Goiás, Brazil
JULIO ALONSO-PADILLA • Barcelona Institute for Global Health (ISGlobal), Hospital Clinic – University of Barcelona, Barcelona, Spain; CIBERINFEC, ISCIII – CIBER de Enfermedades Infecciosas, Instituto de Salud Carlos III, Madrid, Spain
MIGUEL ÁLVARO-BENITO • Laboratory of Protein Biochemistry, Institute of Chemistry and Biochemistry, Department of Biology, Chemistry and Pharmacy, Freie Universität Berlin, Berlin, Germany
PEP AMENGUAL-RIGO • Barcelona Supercomputing Center (BSC), Barcelona, Spain
AYUSH ANAND • BP Koirala Institute of Health Sciences, Dharan, Nepal
ANA CLARA BARBOSA ANTONELLI • Department of Bioscience and Technology, Institute of Tropical Pathology and Public Health, Federal University of Goiás, Goiânia, Goiás, Brazil
NEVIDITA ARAMBAM • Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
MARIYANA ATANASOVA • Faculty of Pharmacy, Medical University – Sofia, Sofia, Bulgaria
VIJAYA SAI AYYAGARI • Department of Biotechnology, School of Biotechnology & Pharmaceutical Sciences, Vignan’s Foundation for Science, Technology & Research (Deemed to be University), Vadlamudi, Guntur, Andhra Pradesh, India
LAURA BALLESTEROS-SANABRIA • School of Medicine, Department of Immunology, Complutense University of Madrid, Madrid, Spain
JAGADEESH BAYRY • Department of Biological Sciences & Engineering, Indian Institute of Technology Palakkad, Palakkad, Kerala, India
MIRIAM BERTAZZON • Laboratory of Protein Biochemistry, Institute of Chemistry and Biochemistry, Department of Biology, Chemistry and Pharmacy, Freie Universität Berlin, Berlin, Germany
NINA BLAZESKA • Center for Infectious Disease and Vaccine Research, La Jolla Institute for Immunology, La Jolla, CA, USA
MARIA ELENA BOTTAZZI • Department of Pediatrics, Division of Tropical Medicine, Baylor College of Medicine, Houston, TX, USA; Texas Children’s Hospital Center for Vaccine Development, Houston, TX, USA; Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, TX, USA; Department of Biology, Baylor University, Waco, TX, USA
ALESSANDRO BROZZI • GSK, Siena, Italy
VLADIMIR BRUSIC • School of Computer Science, University of Nottingham, Ningbo, Zhejiang, China
LEVENT ÇAVAŞ • Dokuz Eylül University, Faculty of Science, Department of Chemistry (Biochemistry Division), İzmir, Turkey
ANKITA CHAKRABORTY • Integrative Biochemistry and Immunology Laboratory, Department of Animal Science, Kazi Nazrul University, Asansol, West Bengal, India

LOU T. CHITKUSHEV • Department of Computer Science, Metropolitan College, Boston University, Boston, MA, USA
ALAKTO CHOUDHURY • Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
PRIYANKA RAY CHOUDHURY • Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
ELIF CIRELI • Department of Life Sciences and Chemistry, Constructor University Bremen, Bremen, Germany
GIULIA CROCIONI • Netherlands eScience Center, Amsterdam, The Netherlands
BART CUYPERS • Adrem Data Lab, Department of Computer Science, University of Antwerp, Antwerp, Belgium; Antwerp Unit for Data Analysis and Computation in Immunology and Sequencing (AUDACIS), Antwerp, Belgium
VINCENT M. L. VAN DEUREN • Adrem Data Lab, Department of Computer Science, University of Antwerp, Antwerp, Belgium; AUDACIS, Antwerp Unit for Data Analysis and Computation in Immunology and Sequencing, University of Antwerp, Antwerp, Belgium
ANJALI DHALL • Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
IVAN DIMITROV • Faculty of Pharmacy, Medical University – Sofia, Sofia, Bulgaria
IRINI DOYTCHINOVA • Faculty of Pharmacy, Medical University – Sofia, Sofia, Bulgaria
FRIEDERIKE EBNER • Division of Infection Pathogenesis, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
ROC FARRIOL-DURAN • Barcelona Supercomputing Center (BSC), Barcelona, Spain
SARA ALONSO FERNANDEZ • School of Medicine, Department of Immunology, Complutense University of Madrid, Madrid, Spain
TARA FIYOUZI • School of Medicine, Department of Immunology, Complutense University of Madrid, Madrid, Spain
MARTIN FLOOR • Barcelona Supercomputing Center (BSC), Barcelona, Spain
SIMONE GONÇALVES DA FONSECA • Department of Bioscience and Technology, Institute of Tropical Pathology and Public Health, Federal University of Goiás, Goiânia, Goiás, Brazil
JOAQUIM GASCÓN • Barcelona Institute for Global Health (ISGlobal), Hospital Clinic – University of Barcelona, Barcelona, Spain; CIBERINFEC, ISCIII – CIBER de Enfermedades Infecciosas, Instituto de Salud Carlos III, Madrid, Spain
NANDINI GHOSH • Department of Microbiology, Vidyasagar University, Midnapore, West Bengal, India
SOFIE GIELIS • Adrem Data Lab, Department of Computer Science, University of Antwerp, Antwerp, Belgium; AUDACIS, Antwerp Unit for Data Analysis and Computation in Immunology and Sequencing, University of Antwerp, Antwerp, Belgium
VÍCTOR GUALLAR • Barcelona Supercomputing Center (BSC), Barcelona, Spain; Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
IAN S. HAWORTH • Department of Pharmacology and Pharmaceutical Sciences, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA, USA
YOSHIKAZU HONDA-OKUBO • Vaxine Pty Ltd, Adelaide, SA, Australia
PETER J. HOTEZ • Department of Pediatrics, Division of Tropical Medicine, Baylor College of Medicine, Houston, TX, USA; Texas Children’s Hospital Center for Vaccine Development, Houston, TX, USA; Department of Molecular Virology and Microbiology, Baylor College of Medicine, Houston, TX, USA; Department of Biology, Baylor University, Waco, TX, USA

SHIPRA JAIN • Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
HUAN JIN • School of Computer Science, University of Nottingham, Ningbo, Zhejiang, China
PARISMITA KALITA • Molecular and Structural Biophysics Laboratory, Department of Biochemistry, North-Eastern Hill University, Shillong, Meghalaya, India
DERIN B. KESKIN • Translational Immuno-Genomics Lab, Dana-Farber Cancer Institute, Boston, MA, USA
ABBAS KHAN • Department of Bioinformatics and Biological Statistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, People’s Republic of China; Zhongjing Research and Industrialization Institute of Chinese Medicine, Zhongguancun Scientific Park, Nanyang, Henan, People’s Republic of China
VARUN KHANNA • Vaxine Pty Ltd, Adelaide, SA, Australia
VRINDA KHANNA • Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
ZEYNEP KOSALOGLU-YALCIN • Center for Infectious Disease and Vaccine Research, La Jolla Institute for Immunology, La Jolla, CA, USA
ESTHER M. LAFUENTE • School of Medicine, Department of Immunology, Complutense University of Madrid, Madrid, Spain
KRIS LAUKENS • Adrem Data Lab, Department of Computer Science, University of Antwerp, Antwerp, Belgium; AUDACIS, Antwerp Unit for Data Analysis and Computation in Immunology and Sequencing, University of Antwerp, Antwerp, Belgium
ALEXANDER A. LEHMANN • Research & Development Department, Cellular Technology Limited, Shaker Heights, OH, USA
PAUL V. LEHMANN • Research & Development Department, Cellular Technology Limited, Shaker Heights, OH, USA
WANNING LI • Titus Family Department of Clinical Pharmacy, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA, USA
YUHUI LI • Department of Pharmacology and Pharmaceutical Sciences, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA, USA
ANA LÓPEZ-GOMEZ • School of Medicine, Department of Cell Biology, Complutense University of Madrid, Madrid, Spain; School of Medicine, Department of Immunology, Complutense University of Madrid, Madrid, Spain
MINJIE LYU • School of Computer Science, University of Nottingham, Ningbo, Zhejiang, China
MANISH MANISH • School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, India
EDUARDO MARTÍNEZ-NAVES • School of Medicine, Department of Immunology, Complutense University of Madrid, Madrid, Spain
RAÚL MARTÍN-PALMA • School of Science, Department of Applied Physics, Autonoma University of Madrid, Madrid, Spain
DARIO F. MARZELLA • Center for Molecular and Biomolecular Informatics, Radboud Institute for Molecular Life Sciences, Radboudumc, Nijmegen, The Netherlands
NAIRUTI MILAN MEHTA • Department of Pharmacology and Pharmaceutical Sciences, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA, USA

PIETER MEYSMAN • Adrem Data Lab, Department of Computer Science, University of Antwerp, Antwerp, Belgium; AUDACIS, Antwerp Unit for Data Analysis and Computation in Immunology and Sequencing, University of Antwerp, Antwerp, Belgium
SMRITI MISHRA • School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, India
MANUEL GÓMEZ DEL MORAL • School of Medicine, Department of Cell Biology, Complutense University of Madrid, Madrid, Spain
NOAM MORNINGSTAR-KYWI • Department of Pharmacology and Pharmaceutical Sciences, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA, USA
ELIOT MORRISON • Laboratory of Protein Biochemistry, Institute of Chemistry and Biochemistry, Department of Biology, Chemistry and Pharmacy, Freie Universität Berlin, Berlin, Germany
JARROD J. MOUSA • Center for Vaccines and Immunology, College of Veterinary Medicine, University of Georgia, Athens, GA, USA; Department of Infectious Diseases, College of Veterinary Medicine, University of Georgia, Athens, GA, USA; Department of Biochemistry and Molecular Biology, Franklin College of Arts and Sciences, University of Georgia, Athens, GA, USA
SUPRABHAT MUKHERJEE • Integrative Biochemistry and Immunology Laboratory, Department of Animal Science, Kazi Nazrul University, Asansol, West Bengal, India
KAITO NAGASHIMA • Center for Vaccines and Immunology, College of Veterinary Medicine, University of Georgia, Athens, GA, USA; Department of Infectious Diseases, College of Veterinary Medicine, University of Georgia, Athens, GA, USA
SWARSAT KAUSHIK NATH • Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
ADITYA K. PADHI • Laboratory for Computational Biology & Biomolecular Design, School of Biochemical Engineering, Indian Institute of Technology (BHU), Varanasi, Uttar Pradesh, India
MONIKA PAHUJA • Indian Council of Medical Research, New Delhi, India
FRANCESCO PAPPALARDO • Department of Health and Drug Sciences, Università degli Studi di Catania (IT), Catania, Italy
FARZANEH M. PARIZI • Center for Molecular and Biomolecular Informatics, Radboud Institute for Molecular Life Sciences, Radboudumc, Nijmegen, The Netherlands
VINI PATEL • Department of Pharmacology and Pharmaceutical Sciences, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA, USA
SUMEET PATIYAL • Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
HECTOR F. PELAEZ-PRESTEL • School of Medicine, Department of Immunology, Complutense University of Madrid, Madrid, Spain
BJOERN PETERS • Center for Infectious Disease and Vaccine Research, La Jolla Institute for Immunology, La Jolla, CA, USA; Department of Medicine, University of California San Diego, La Jolla, CA, USA
NIKOLAI PETROVSKY • Vaxine Pty Ltd, Adelaide, SA, Australia
SAKSHI PIPLANI • Vaxine Pty Ltd, Adelaide, SA, Australia
MATEUSZ POSPIECH • Titus Family Department of Clinical Pharmacy, Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA, USA

P. PREETI • Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
GAJENDRA P. S. RAGHAVA • Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
RINO RAPPUOLI • Fondazione Biotecnopolo, Siena, Italy
KAMAL RAWAL • Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
IRENE REAL-ARÉVALO • School of Medicine, Department of Cell Biology, Complutense University of Madrid, Madrid, Spain
PEDRO A. RECHE • School of Medicine, Department of Immunology, Complutense University of Madrid, Madrid, Spain
DAVID RIOJA-SOTO • Barcelona Institute for Global Health (ISGlobal), Hospital Clinic – University of Barcelona, Barcelona, Spain
DIANA R. ROEN • Research & Development Department, Cellular Technology Limited, Shaker Heights, OH, USA
ALBERT ROS-LUCAS • Barcelona Institute for Global Health (ISGlobal), Hospital Clinic – University of Barcelona, Barcelona, Spain; CIBERINFEC, ISCIII – CIBER de Enfermedades Infecciosas, Instituto de Salud Carlos III, Madrid, Spain
GIULIA RUSSO • Department of Health and Drug Sciences, Università degli Studi di Catania (IT), Catania, Italy
SUDIPTO SAHA • Division of Bioinformatics, Bose Institute, Unified Campus Salt Lake, College More, Kolkata, West Bengal, India
VALENTINA DI SALVATORE • Department of Health and Drug Sciences, Università degli Studi di Catania (IT), Catania, Italy
RAM SAMUDRALA • Department of Biomedical Informatics, Jacobs School of Medicine and Biomedical Sciences, University at Buffalo, Buffalo, NY, USA
RAJAT KANTI SARKAR • Department of Genetics and Plant Breeding, Visva-Bharati University, Bolpur, West Bengal, India; Institute of Health Sciences, Presidency University (Newtown Campus), Kolkata, West Bengal, India
ALESSANDRO SETTE • Center for Infectious Disease and Vaccine Research, La Jolla Institute for Immunology, La Jolla, CA, USA; Department of Medicine, University of California San Diego, La Jolla, CA, USA
NEELAM SHARMA • Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
TRAPTI SHARMA • Centre for Computational Biology and Bioinformatics, AIB, Amity University, Noida, Uttar Pradesh, India
GAURAB SIRCAR • Institute of Health Sciences, Presidency University (Newtown Campus), Kolkata, West Bengal, India
ULRICH STRYCH • Department of Pediatrics, Division of Tropical Medicine, Baylor College of Medicine, Houston, TX, USA; Texas Children’s Hospital Center for Vaccine Development, Houston, TX, USA
NAIDU SUBBARAO • School of Computational and Integrative Sciences, Jawaharlal Nehru University, New Delhi, India
MUHAMMAD SULEMAN • Centre for Biotechnology and Microbiology, University of Swat, Khyber Pakhtunkhwa, Pakistan

JUAN R. DE LOS TOYOS • Área de Inmunología, Facultad de Medicina y Ciencias de la Salud, Universidad de Oviedo, Oviedo, Spain
TIMIR TRIPATHI • Molecular and Structural Biophysics Laboratory, Department of Biochemistry, North-Eastern Hill University, Shillong, Meghalaya, India; Regional Director’s Office, Indira Gandhi National Open University, Regional Centre Kohima, Kohima, Nagaland, India
SALMAN SADULLAH USMANI • Department of Computational Biology, Indraprastha Institute of Information Technology, New Delhi, India
SEBASTIAAN VALKIERS • Adrem Data Lab, Department of Computer Science, University of Antwerp, Antwerp, Belgium; AUDACIS, Antwerp Unit for Data Analysis and Computation in Immunology and Sequencing, University of Antwerp, Antwerp, Belgium
MARINA VALLEJO-VALLÉS • Barcelona Supercomputing Center (BSC), Barcelona, Spain
RANDI VITA • Center for Infectious Disease and Vaccine Research, La Jolla Institute for Immunology, La Jolla, CA, USA
DONG-QING WEI • Department of Bioinformatics and Biological Statistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, People’s Republic of China; Zhongjing Research and Industrialization Institute of Chinese Medicine, Zhongguancun Scientific Park, Nanyang, Henan, People’s Republic of China; State Key Laboratory of Microbial Metabolism, Shanghai-Islamabad-Belgrade Joint Innovation Center on Antibacterial Resistances, Joint Laboratory of International Laboratory of Metabolic and Developmental Sciences, Ministry of Education and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, People’s Republic of China
DAVID WINKLER • School of Pharmacy, University of Nottingham, Nottingham, UK; Monash Institute of Pharmaceutical Sciences, Monash University, Parkville, VIC, Australia; Department of Biochemistry and Chemistry, La Trobe Institute for Molecular Science, La Trobe University, Bundoora, VIC, Australia
JINGCHENG WU • Innovative Institute for Artificial Intelligence in Medicine and Zhejiang Provincial Key Laboratory of Anti-Cancer Drug Research, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang, China
LIN XIN • School of Computer Science, University of Nottingham, Ningbo, Zhejiang, China
LI C. XUE • Center for Molecular and Biomolecular Informatics, Radboud Institute for Molecular Life Sciences, Radboudumc, Nijmegen, The Netherlands
GUANGLAN ZHANG • Department of Computer Science, Metropolitan College, Boston University, Boston, MA, USA
ZHAN ZHOU • Innovative Institute for Artificial Intelligence in Medicine and Zhejiang Provincial Key Laboratory of Anti-Cancer Drug Research, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang, China

Chapter 1

Vaccine Design: An Introduction

Tara Fiyouzi and Pedro A. Reche

Abstract

Vaccines are the most successful and cost-effective medical interventions available to fight infectious diseases. They consist of biological preparations that are capable of stimulating the immune system to confer protective immunity against a particular harmful pathogen/agent. Vaccine design and development have evolved through the years. Early vaccines were obtained with little use of technology and in the absence of fundamental knowledge, representing a pure feat of human ingenuity. In contrast, modern vaccine development takes advantage of advances in technology and of our enhanced understanding of the immune system and host-pathogen interactions. Moreover, vaccine design has found novel applications beyond the prophylactic arena, and there is an increasing interest in designing vaccines to treat human ailments like cancer and chronic inflammatory diseases. In this chapter, we focus on prophylactic vaccines against infectious diseases, providing an overview of the immunological principles underlying immunization and of how vaccines work and are designed.

Key words Innate immunity, Adaptive immunity, Vaccine, Immune recognition, Antigen, Epitope, Adjuvant, Vaccine design

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_1, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

Vaccines and vaccination arose as a means to protect (immunize) humans from infectious diseases, and they have done so. Thanks to vaccines, the terrifying smallpox was eradicated [1], and the incidence of diseases like diphtheria, tetanus, whooping cough, measles, mumps, rubella, and poliomyelitis has been greatly reduced [2].

The beginnings of vaccination are lost in time and rooted in the observation that deadly diseases like smallpox never infected the same person twice. There is some evidence that as early as the eleventh century, the Chinese may have used smallpox scab insufflations to immunize against the disease. Moreover, in eighteenth-century Europe, it was common practice to inoculate material obtained from smallpox pustules in order to protect the population from the disease [3]. This procedure, known as variolation, was not without risks; up to 2–3% of people got infected with smallpox and died from it. The invention of vaccination is, however, credited to Edward Jenner, who used material isolated from pustules of people infected with cowpox to immunize against smallpox, introducing the term vaccine (from the Latin vacca, cow) [4]. Jenner was not the first to try cowpox inoculation to protect from smallpox; however, he entered the procedure into the scientific record [5] and disseminated it [1, 6]. Although Jenner’s vaccine faced some initial difficulties, it was soon adopted worldwide because it was much safer than variolation [3]. Jenner developed his vaccine without knowing that a virus causes smallpox, nor could he know that the vaccine exploited a situation in which one virus confers protection against another viral disease.

Microbes were identified as the causative agents of infections a century later. It was then that Louis Pasteur pioneered the rational design of vaccines and established the basic rules of vaccinology [7]. Pasteur also extended the term vaccine to any formulation capable of inducing immunity regardless of the pathogen. Although Pasteur knew about the existence of microbes and developed the first rational vaccines, he did so without any understanding of basic immunology principles. In fact, for a long period, vaccines were developed entirely through empirical research with little knowledge of the immune system.

We now know that vaccines work by exploiting the extraordinary ability of the human immune system to fight and remember encounters with pathogen antigens. We also know about the molecular and cellular mechanisms underlying antigen recognition and the development of immunological memory. Moreover, we have significant knowledge of host-pathogen interactions and have access to powerful technology for characterizing pathogens and immune responses.
All this knowledge and technology can be used to quickly develop rational vaccines against emerging infectious diseases; a case in point is the development of vaccines against coronavirus disease 2019 (COVID-19). Likewise, it could be used to tackle neglected pathogens, constantly mutating pathogens, drug-resistant pathogens, and bioterrorism agents.

In this chapter, we review some basic immunological principles that are important for understanding how vaccines work. We also review how vaccines are designed, classifying the existing vaccine platforms. Finally, we envision future developments in vaccine design.

2 Basic Immunology Principles: Innate and Adaptive Immunity

Life can be dangerous, and most organisms count on some sort of defense mechanism to protect themselves from unwanted encounters. In complex multicellular organisms, these mechanisms are implemented by a dedicated immune system. The immune system of vertebrates possesses two types of immunity: innate immunity and adaptive immunity [8]. Innate immunity acts quickly after a microbe appears in the body (within hours) and has little specificity, while adaptive immunity is highly specific and takes longer to develop (days). The adaptive immune system can remember the pathogens that it fights, producing stronger and faster attacks each time a pathogen is reencountered. Vaccination aims to develop a protective and pathogen-specific immunological memory without harming the host.

Innate immunity involves several passive and active defense mechanisms. Passive mechanisms do not need activation and include simple physical, chemical, and microbiological barriers like the skin, the acidic pH of the stomach, and commensal microbiota, which impede infection and/or the proliferation of harmful microbes. Active innate immunity is triggered by the recognition of microbes by soluble and cellular mediators. Innate immune cells recognize generic pathogen-associated molecular patterns (PAMPs) through conserved, germline-encoded sensors known as pathogen-recognition receptors (PRRs). There are several families of PRRs, including Toll-like receptors (TLRs), RIG-I-like receptors (RLRs), and NOD-like receptors (NLRs) [9]. TLRs are crucial for defending against many microbial infections, as they include members that can sense PAMPs present in RNA viruses (TLR3, 7, 8), DNA viruses (TLR9), Gram-positive bacteria (TLR1, 2, 6: lipoproteins; TLR9: DNA), Gram-negative bacteria (TLR4: lipopolysaccharide; TLR5: flagellin; TLR9: DNA), fungi (TLR2: zymosan and β-glucans), and protists (TLR9: DNA; TLR2, 5: GPI anchors; TLR11: profilin) (Fig. 1). TLRs are also present in non-immune cells and can also recognize host cell components (e.g., DNA) released as a result of damage caused by pathogens [10]. Signaling through PRRs gives rise to an inflammatory response with the production of inflammatory cytokines and type I interferons.
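
The TLR pairings listed above can be held in a small lookup table, which makes it easy to ask which receptors cover a pathogen class or which classes share a receptor. The sketch below is only a transcription of the examples in the text, not an exhaustive catalog; the names `TLR_RECOGNITION`, `tlrs_for`, and `pathogens_sensed_by` are hypothetical, and `None` marks ligands that the text does not name.

```python
# (pathogen class, TLR, ligand) triples transcribed from the text above.
# None marks a ligand that the text does not name explicitly.
TLR_RECOGNITION = [
    ("RNA virus", "TLR3", None),
    ("RNA virus", "TLR7", None),
    ("RNA virus", "TLR8", None),
    ("DNA virus", "TLR9", None),
    ("Gram-positive bacteria", "TLR1", "lipoproteins"),
    ("Gram-positive bacteria", "TLR2", "lipoproteins"),
    ("Gram-positive bacteria", "TLR6", "lipoproteins"),
    ("Gram-positive bacteria", "TLR9", "DNA"),
    ("Gram-negative bacteria", "TLR4", "lipopolysaccharide"),
    ("Gram-negative bacteria", "TLR5", "flagellin"),
    ("Gram-negative bacteria", "TLR9", "DNA"),
    ("Fungi", "TLR2", "zymosan and beta-glucans"),
    ("Protists", "TLR9", "DNA"),
    ("Protists", "TLR2", "GPI anchors"),
    ("Protists", "TLR5", "GPI anchors"),
    ("Protists", "TLR11", "profilin"),
]


def tlrs_for(pathogen_class):
    """Return the TLRs listed for a pathogen class, ordered by TLR number."""
    found = {tlr for cls, tlr, _ in TLR_RECOGNITION if cls == pathogen_class}
    return sorted(found, key=lambda name: int(name[3:]))


def pathogens_sensed_by(tlr):
    """Return the pathogen classes (alphabetical) listed for a given TLR."""
    return sorted({cls for cls, t, _ in TLR_RECOGNITION if t == tlr})


print(tlrs_for("Gram-negative bacteria"))
print(pathogens_sensed_by("TLR9"))
```

In this table TLR9 appears under four pathogen classes, which reflects its role as a DNA sensor in the examples given.
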
The activation of innate immunity can be enough to clear the pathogen, and it is required to initiate adaptive immune responses.

Fig. 1 Recognition of pathogens by TLRs. The figure depicts a rendering of a representative TLR (PDB: 3FXI), surrounded by pathogens recognized by distinct TLRs

Adaptive immunity is articulated by B and T lymphocytes (B and T cells), which are responsible for humoral and cell-mediated immunity, respectively. Moreover, B and T cells acquire pathogen-specific memory after encountering a pathogen for the first time and are hence the targets of vaccines to induce protective immunity. B and T cells recognize molecular components of pathogens known as antigens. Antigens can differ in chemical nature, but the most abundant and variable antigens are proteins. Antigen recognition by specific receptors is needed to activate B and T cells; however, secondary signals are also required for the activation of these cells. Antigen-recognition receptors are generated during lymphocyte development by a process involving stochastic and sequential rearrangement of gene segments. This process leads to the generation of millions of different lymphocytes, each expressing a single yet different antigen-recognition receptor. Those lymphocytes with receptors capable of recognizing self-antigens are eliminated to avoid auto-reactive immune responses [11]. In the end, a vast repertoire of lymphocytes with different antigen receptors survives, which guarantees that any antigen can be recognized with exquisite specificity.

B and T cells do not recognize antigens as a whole, but as small portions known as epitopes [12]. The B cell antigen receptor, known as the B cell receptor (BCR), consists of a membrane-bound immunoglobulin and recognizes solvent-exposed antigens (Fig. 2a). Once activated, B cells differentiate and secrete soluble forms of immunoglobulins, also known as antibodies, which mediate humoral adaptive immunity. A B cell epitope is the specific part of an antigen recognized by the immunoglobulin or antibody (Fig. 2a). The antigen receptor expressed by T cells is known as the T cell receptor (TCR). Unlike B cells, T cells have access to any antigen and/or part of an antigen, hidden or accessible, as they recognize small derivative peptides (T cell epitopes) displayed on the surface of antigen-presenting and/or infected cells bound to major histocompatibility complex (MHC) molecules, known in humans as human leukocyte antigen (HLA) molecules (Fig. 2b).

Fig. 2 Antigen recognition by adaptive immunity. (a) T cells mediate cellular adaptive immunity recognizing T cell peptide epitopes. (b) Humoral immunity is mediated by antibodies produced by B cells recognizing B cell epitopes

There are two main classes of T cells, CD8 and CD4 T cells, which see peptides presented by two different classes of MHC molecules. CD8 T cells recognize peptide antigens bound to MHC class I (MHC I/HLA I) molecules, while CD4 T cells recognize peptide antigens bound to MHC class II (MHC II/HLA II) molecules. MHC I and MHC II molecules are structurally related yet differ in many ways [13]. MHC I molecules can be expressed by all nucleated cells. In contrast, only a few cell types can express MHC II molecules, among them the professional antigen-presenting cells (APCs), which include B cells, macrophages, and dendritic cells (DCs).

Antigen presentation by MHC I and MHC II molecules also follows different routes. Peptides presented by MHC I molecules are short (8–11 residues) and derive from antigens (often defective or newly synthesized antigens) that are degraded by the proteasome. The resulting peptides are translocated to the endoplasmic reticulum by TAP (transporter associated with antigen processing), where they are loaded onto nascent MHC I molecules. In contrast, peptides presented by MHC II molecules derive from antigens endocytosed by APCs and degraded by various lysosomal proteases in endosomal compartments prior to loading onto MHC II molecules. Antigen-specific activated CD8 and CD4 T cells mediate adaptive cellular immunity, as explained in the next section.


From all the above, it follows that identifying the antigens and epitopes targeted by adaptive immunity is of great interest for vaccine design. Such a task used to be costly and time-consuming. However, vaccinologists can nowadays resort to computational methods that identify antigens of interest, and epitopes within them, from primary sequences that are readily available in databases or provided by next-generation sequencing platforms [12, 14].
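
As a rough illustration of what working on primary sequences means in practice, the sketch below enumerates every peptide of 8 to 11 residues in a protein sequence, the length range given above for MHC I ligands. This is only the candidate-generation step; actual epitope prediction methods go on to score each candidate, which is not attempted here. The function name `candidate_peptides` and the toy sequence are hypothetical and chosen for illustration only.

```python
def candidate_peptides(sequence, min_len=8, max_len=11):
    """Yield (start, peptide) for every window of min_len..max_len residues.

    The 8-11 default follows the MHC I peptide length range quoted in the text.
    Start positions are 0-based.
    """
    sequence = sequence.upper()
    for length in range(min_len, max_len + 1):
        for start in range(len(sequence) - length + 1):
            yield start, sequence[start:start + length]


# Hypothetical 16-residue sequence, used only to exercise the function.
toy_antigen = "MKTIIALSYIFCLVFA"
peptides = list(candidate_peptides(toy_antigen))

print(len(peptides), "candidate peptides")
print(peptides[0])
```

A sequence of length n yields n - k + 1 windows for each length k, so the number of candidates grows linearly with protein length; this is one reason the scoring step is usually automated.
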

3 What Is a Vaccine and How Do Vaccines Work?

A vaccine is a biological formulation that can safely stimulate an immune response leading to the acquisition of long-lasting immunity that confers protection against a harmful agent (generally a pathogen). To achieve this, the vaccine must contain, or lead to the production of, specific antigens from or related to the pathogen that are targeted by the adaptive immune system. In addition, vaccines usually combine antigens with compounds known as adjuvants to enhance their ability to induce adaptive immunity (immunogenicity), often by stimulating the innate immune system [15]. The inclusion of adjuvants is particularly important for vaccines that consist of small antigens like peptides, which typically exhibit little immunogenicity [16]. In addition to enhancing immunogenicity, adjuvants can work as delivery vehicles and determine the quality and type of immune response against the antigens [15]. For many years, aluminum salts and water-in-oil emulsions have been the most commonly used adjuvants. However, novel adjuvants that target specific cells and receptors are being identified and developed [15]. Fighting pathogens at the site of entry is of great relevance, and considerable efforts are invested in the development of adjuvants for mucosal immunization [17]. Moreover, mucosal immunization is also able to induce systemic immunity [18]. Finally, vaccines may contain components that work as preservatives, emulsifiers, or stabilizers, which do not affect vaccine immunogenicity [19].

Vaccines work by emulating the immune response to a pathogen without the harmful effects (Fig. 3). The process includes three steps: (1) induction of a local inflammatory response and uptake of antigens by sentinel DCs, (2) transport of antigens to secondary lymphoid tissues, and (3) induction of antigen-specific effector and memory B and T cells. We briefly review these steps.
At the local site of vaccination, resident cells such as epithelial cells and fibroblasts sense danger caused by vaccine components (most likely the adjuvant) and release various inflammatory factors (e.g., prostaglandins and cytokines) that attract innate immune cells to the site. Also at the local site, sentinel DCs, like the Langerhans cells of the skin, capture and process antigens and, in response to danger signals (e.g., vaccine components activating TLR signaling and/or cytokines released by other cells), mature and migrate to secondary lymphoid tissues, such as proximal lymph nodes. Mature DCs (mDCs) express co-stimulatory molecules (e.g., CD80 or CD86) and secrete various cytokines. The cytokines that mDCs produce, as well as their specific phenotype, are conditioned by the local inflammatory response at the vaccination site [20]. In the secondary lymphoid tissue, mDCs present the antigens bound to MHC I and MHC II molecules to CD8 and CD4 T cells, respectively, together with co-stimulatory signals, priming and expanding those cells with antigen-specific receptors.

Fig. 3 Vaccine-induced immune response. Vaccine components, antigens and adjuvant, induce a local inflammatory response at the site of vaccination, prompting sentinel DCs to capture antigens and present them to T cells in secondary lymphoid tissues. Antigen-specific effector and memory T and B cells are subsequently generated. Effector B cells, also known as plasma B cells, produce antibodies and can survive for long periods in the bone marrow. These antibodies are responsible for protective immunity

Subsequently, antigen-specific CD8 and CD4 T cells differentiate into effector and memory cells. Effector CD8 T cells become cytotoxic T lymphocytes (CTLs), which play a key role in fighting intracellular pathogens like viruses by killing infected cells that display peptide antigens bound to MHC I molecules on their cell surface. CD4 T cells become T helper (Th) cells with distinct phenotypes and cytokine production profiles, which accordingly amplify, coordinate, and typify the immune responses. For example, Th1 cells secrete IFN-γ, promoting cell-mediated immunity against intracellular pathogens; Th2 cells secrete IL-4 and IL-5, providing help to B cells and enhancing antibody-mediated immunity; and Th17 cells secrete IL-17 and IL-21, participating in defense responses against extracellular bacteria [21]. The phenotype of Th cells is determined during differentiation by concurrent signals, usually cytokines produced by mDCs. Hence, DCs bridge innate and adaptive immunity, bringing their immunological experience at the local vaccination site to T cells [22]. Effector T cells produced during an immune response, natural or elicited through vaccination, disappear shortly after the antigen is cleared. However, memory CTLs and Th cells can survive for years. These cells act quickly upon encountering the same antigen, providing vaccine-induced, long-lasting cell-mediated immunity against the relevant pathogen.

Vaccine antigens from the vaccination site can also reach secondary lymphoid tissues in soluble form through lymphatic drainage. Subsequently, the antigen can be recognized by cognate B cells, which become activated and differentiate into antibody-producing plasma cells and memory B cells. Differentiation of antigen-specific B cells into plasma cells often requires second signals. In particular, the production of antibodies against protein antigens requires the collaboration of antigen-specific Th cells.
Antigen-specific B cells recruit cognate Th cells by displaying on their cell surface peptides bound to MHC II molecules, which derive from antigens captured upon BCR recognition. Th cells are also required for antibody isotype class switching, and the cytokines they produce determine the specific switch. For example, in the presence of IFN-γ the main antibody produced is IgG2a. In general, antibodies label antigens for destruction by innate immune cells and can neutralize pathogens by blocking their entry into cells [11]. Plasma cells induced by vaccination generally remain in secondary lymphoid tissues and live only for a few weeks. However, some plasma cells can survive in the bone marrow, producing antibodies for years. These antibodies mediate long-lasting protective humoral immunity and are the goal of most vaccines. Memory B cells, in turn, can participate in secondary encounters with the same antigen, providing faster and stronger humoral responses.

4 Rational Vaccine Design and Vaccine Platforms

Vaccine development is generally a tedious and costly empirical process. However, the advances made in our knowledge of host-pathogen interactions, along with the advent of new technologies, both experimental and computational, can greatly facilitate the design and development of vaccines. The development and production of any vaccine formulation requires knowledge and technology to identify, produce, isolate, and assemble the vaccine components (K: Knowledge) and time to produce it (T: Time). Moreover, vaccine production can be more or less complex, depending on the experimental process involved (C: Complexity), and it can entail some risks or safety hazards when cultures of pathogenic organisms are required (R: Risk). Depending on the microbe (bacterium, virus, protozoan) that is being targeted, there are different approaches and technologies to develop a vaccine. Most authors distinguish at least six major types of vaccine platforms that can be selected during the process of vaccine development. We next introduce the different types of vaccines by explaining the process of vaccine design. These vaccine platforms are depicted in Fig. 4 and are rated in terms of the parameters K, T, C, and R introduced above.

Fig. 4 Rational vaccine design. The figure depicts the process of vaccine design leading to different vaccines. Knowledge and technology (K), safety and hazards/risks (R), complexity (C), and time (T) required for producing the different vaccines appear plotted in a relative arbitrary scale

The fundamental step and required knowledge in vaccine design for infectious diseases is to identify the causative agent. This is not a simple task, particularly for an emergent pathogen. However, we count on extensive existing knowledge and catalogs of potential human pathogens, which in combination with forensic and genomic technologies can quickly lead to the identification of novel and/or emergent infectious agents. For example, in the 1980s, after acquired immunodeficiency syndrome (AIDS) was identified, it took scientists 3 years to find the human immunodeficiency virus (HIV) as the causative agent. In contrast, 37 years later, it only took about a month to report the genome of a novel coronavirus, SARS-CoV-2, as the etiological agent of a new type of pneumonia whose first cases were reported in late 2019. Circumstantial clues may have also helped to identify the virus in record time, including the fact that the first cases were identified in the proximity of a street market where meat from wild animals was sold, close to the Wuhan Institute of Virology, an institution with a long history of research on bat coronaviruses.

Clearly, the more we understand about the biology and structure of a pathogen and its interactions with the host, the more effective and specific the approach to vaccine design can be. This knowledge about the pathogen and pathogen-host interactions can often be extrapolated using comparative genomics. Likewise, reverse vaccinology can also be used to identify antigens that are relevant for vaccine design, like virulence factors and viral envelope proteins [23].

Once we know the pathogen causing an infectious disease, the simplest approach to vaccine design is to use the entire pathogen, either live-attenuated or inactivated. The measles, mumps, and rubella (MMR) vaccine and the varicella (chickenpox) vaccine are examples of live-attenuated vaccines. These vaccines are highly immunogenic, do not need adjuvant, and are highly effective at inducing long-lasting memory. An example of an inactivated vaccine is the polio vaccine, which is also highly effective. Although vaccines based on entire pathogens appear simple, producing them can actually be quite complex.
For example, it is not always easy to grow the pathogen, particularly viruses, which require the availability of surrogate cells. Moreover, biohazard risks associated with handling pathogenic microorganisms can be high.

An alternative to entire pathogens in vaccine design is to use a pathogen subunit or product that can be targeted by the immune system to confer protection. As noted earlier, understanding the biology and structure of the pathogen and its interaction with the host helps to select the appropriate subunit for the vaccine. In the case of bacteria, polysaccharides isolated from the cell wall are good choices to induce protective immunity. Examples of these vaccines are the Haemophilus influenzae type B (Hib) vaccine, the pneumococcal vaccine (polysaccharide or conjugate), and the MenACWY (conjugate) vaccine. In the conjugated forms of these vaccines, the polysaccharide is linked to a carrier protein to enhance immunogenicity and the quality of the immune response. Vaccines including toxoids, like tetanus (T), diphtheria (D), and acellular pertussis (aP) vaccines (DTaP, TD, Tdap, Td vaccines), can also be considered in the subunit group. The toxins are components secreted by the relevant bacteria and isolated along with numerous other proteins. These isolates are then treated with detoxifying agents like glutaraldehyde, which inactivate the toxins into toxoids [24]. Proteomics analysis of DTaP, TD, and Td tetanus vaccines has identified hundreds of proteins in them, not just the toxoids [25–27]. In fact, it has been shown that tetanus and diphtheria toxoids in vaccines represent less than 50% of total protein [28].

Producing vaccines based on pathogen subunits can be rather complex and subject to biosafety hazards. However, subunit vaccines based on protein antigens can be designed by alternative methods. For example, vaccines can be based on protein antigens produced in heterologous systems, like yeast, thanks to recombinant DNA technology. Relevant examples of recombinant vaccines are the shingles vaccine and the hepatitis B virus (HBV) vaccine. Recombinant vaccines combine antigens with adjuvants to enhance immunogenicity and are rather simple and fast to produce. Moreover, like any subunit vaccine, recombinant vaccines cannot induce the disease that they are aimed to prevent.

A novel approach in vaccine design that has been implemented in response to the COVID-19 pandemic is the use of vector platforms that let the host produce the targeted antigens [29, 30]. These approaches include the use of harmless viruses to deliver the genetic code of the antigen to host cells. An example of a viral vector vaccine is the AstraZeneca-Oxford COVID-19 vaccine (ChAdOx1 nCoV-19), which uses a modified chimpanzee adenovirus, ChAdOx1, encoding the spike protein of SARS-CoV-2 [31]. Viral vector vaccines trigger a strong immune response and usually only one dose is needed to develop immunity.
An even more innovative approach is to deliver the genetic code of the antigen to cells as synthetic mRNA or DNA molecules. An example of an mRNA vaccine is the Pfizer-BioNTech COVID-19 vaccine, which uses modified RNA encapsulated in lipid nanoparticles [32]. These nanoparticles help to protect the RNA molecules and serve as adjuvant. DNA vaccines have the advantage of stability over mRNA vaccines. An example of a DNA vaccine is ZyCoV-D, which includes a DNA plasmid vector encoding the spike protein of SARS-CoV-2 [32]. The plasmid also contains unmethylated CpG motifs to enhance its immunostimulatory properties. Both mRNA and DNA vaccines are simple to make and can be adapted quickly and easily to target any antigen and antigen variation.
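
The platforms and example vaccines named in this section can be summarized in a small lookup structure. The sketch below is one illustrative reading of the text, not an official taxonomy: the grouping into seven keys, and the names `PLATFORM_EXAMPLES`, `HOST_PRODUCED_ANTIGEN`, `platform_of`, and `antigen_made_by_host`, are assumptions made for this example. The K, T, C, and R ratings of Fig. 4 are deliberately left out, since the text gives them only on a relative arbitrary scale.

```python
# Platforms and example vaccines as named in this section. The grouping is an
# illustrative reading of the text (assumption), not an official taxonomy.
PLATFORM_EXAMPLES = {
    "live-attenuated": ["MMR", "varicella"],
    "inactivated": ["polio"],
    "subunit": ["Hib", "pneumococcal", "MenACWY", "DTaP", "TD", "Tdap", "Td"],
    "recombinant protein": ["shingles", "hepatitis B"],
    "viral vector": ["ChAdOx1 nCoV-19"],
    "mRNA": ["Pfizer-BioNTech COVID-19 vaccine"],
    "DNA": ["ZyCoV-D"],
}

# Platforms that deliver genetic code so that host cells produce the antigen.
HOST_PRODUCED_ANTIGEN = {"viral vector", "mRNA", "DNA"}


def platform_of(vaccine):
    """Return the platform under which an example vaccine is listed, or None."""
    for platform, examples in PLATFORM_EXAMPLES.items():
        if vaccine in examples:
            return platform
    return None


def antigen_made_by_host(vaccine):
    """True if the vaccine's platform relies on the host to make the antigen."""
    return platform_of(vaccine) in HOST_PRODUCED_ANTIGEN


print(platform_of("ZyCoV-D"), antigen_made_by_host("ZyCoV-D"))
print(platform_of("MMR"), antigen_made_by_host("MMR"))
```
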

5 Concluding Remarks: Future Vaccines

Vaccines have evolved to take advantage of gains in fundamental knowledge and technology, which facilitate the characterization of pathogens and the identification, production, and delivery of antigens that are likely relevant to induce protective responses. The combination of knowledge and technology reaches its highest expression in the form of computer methods that can quickly identify relevant antigens from microbial genomic data, which in turn are obtained using high-throughput sequencing technologies. As a result, there is a shift from vaccines based on whole organisms to those based on protein antigens. These antigens can be produced as recombinant proteins for immunization or incorporated into viral vectors, mRNA, or DNA platforms, which prompt the production of the antigen by the host, as seen in the COVID-19 vaccines. Overall, these vaccines are easier to make and take less time and cost to produce.

Following this trend, the ultimate and next step will be the design of vaccines that incorporate the precise antigen epitopes recognized by the immune system (Fig. 5). These epitope vaccines will likely be safer than more complex vaccines, as potentially autoreactive epitopes can be readily detected by computer algorithms and discarded. Moreover, epitope vaccines could be multi-antigenic and, by considering only conserved epitopes, could address the sequence variability that enables pathogens to evade immunity. Targeting epitopes will allow precision vaccines with a controlled specificity. Epitope vaccines also ought to be very versatile, as epitopes can be produced as synthetic or recombinant peptides, delivered with different adjuvant platforms, or incorporated into viral vectors, mRNA, or DNA platforms.

Fig. 5 Vaccine design evolution. As knowledge and technology increase, antigen components in vaccines become increasingly simpler and more specific

Advances in knowledge and technology will also enable us to find new compounds with adjuvant capacity as well as delivery systems to induce specific types of immunity at particular sites like the nasopharyngeal mucosa. Given the complexity of the immune system, vaccine design will evolve to incorporate artificial intelligence to select the most appropriate vaccine components (antigens/epitopes and adjuvant), dose, and delivery routes that ought to be protective against a particular pathogen [33]. Overall, all these advances could lead to the development of vaccines against pathogens that currently lack them, like those leading to chronic infections. Moreover, these advances will be applied not only to develop prophylactic vaccines against microbial agents but also to tailor therapeutic vaccines against autoimmune and chronic inflammatory diseases.
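
Two of the selection criteria mentioned in this section, keeping only conserved epitopes and discarding potentially autoreactive ones, can be sketched with plain set operations on peptide windows. The example below is a deliberately simplified sketch under stated assumptions: conservation is taken to mean exact presence in every variant sequence, autoreactivity is approximated by an exact match to a peptide in a set of self proteins, and a fixed length of 9 residues is assumed. The function names and all sequences are hypothetical.

```python
def kmers(sequence, k):
    """Return the set of all length-k windows in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def conserved_non_self_peptides(variants, self_proteins, k=9):
    """Peptides of length k found in every variant and in no self protein.

    Simplifying assumptions: exact string identity stands in for both
    conservation and self-similarity; k=9 is an illustrative choice.
    """
    if not variants:
        return set()
    conserved = set.intersection(*(kmers(v, k) for v in variants))
    self_kmers = set().union(*(kmers(p, k) for p in self_proteins))
    return conserved - self_kmers


# Hypothetical variant sequences of one antigen and one hypothetical self protein.
variants = ["ACDEFGHIKLMNPQ", "ACDEFGHIKLMNPR", "TCDEFGHIKLMNPQ"]
self_proteins = ["WWDEFGHIKLMWW"]

selected = conserved_non_self_peptides(variants, self_proteins)
print(sorted(selected))
```

Here the windows that overlap the variable first and last positions drop out of the intersection, and one conserved window is removed because it also occurs in the self protein.
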

Acknowledgment We wish to thank Esther M. Lafuente and Hector F. Pelaez for critical reading and valuable comments. References 1. Fenner F, Henderson DA, Arita I, Jezek Z, Ladnyi ID (1998) Smallpox and its eradication. In: WHO (ed) History of international public health. WHO, Geneva 2. Andre FE, Booy R, Bock HL, Clemens J, Datta SK, John TJ, Lee BW, Lolekha S, Peltola H, Ruff TA, Santosham M, Schmitt HJ (2008) Vaccination greatly reduces disease, disability, death and inequity worldwide. Bull World Health Organ 86(2):140–146 3. Weiss RA, Esparza J (2015) The prevention and eradication of smallpox: a commentary on Sloane (1755) ‘An account of inoculation’. Philos Trans R Soc Lond Ser B Biol Sci 370(1666):20140378 4. Smith KA (2011) Edward Jenner and the small pox vaccine. Front Immunol 2(21):21 5. Jenner E (1798) An enquiry into the causes and effects of the variolae vaccinae, a disease discovered in some of the western counties of England, particularly Gloucestershire, and known by the name of cowpox immunity. Sampson Low, London 6. Hammarsten JF, Tattersall W, Hammarsten JE (1979) Who discovered smallpox vaccination?

Edward Jenner or Benjamin Jesty? Trans Am Clin Climatol Assoc 90:44–55 7. Pasteur L (1880) De l’attenuation du virus du cholera des poules. R Acad Sci Paris 91:673– 680 8. Flajnik M, Singh NJ, Holland SM (2022) Paul’s fundamental immunology, 8th edn. Wolters Kluwer, Philadelphia 9. Kumar H, Kawai T, Akira S (2011) Pathogen recognition by the innate immune system. Int Rev Immunol 30(1):16–34 10. Sasai M, Yamamoto M (2013) Pathogen recognition receptors: ligands and signaling pathways by toll-like receptors. Int Rev Immunol 32(2):116–133 11. Paul WE (1998) Fundamental immunology. Lippincott-Raven, Philadelphia 12. Sanchez-Trincado JL, Gomez-Perosanz M, Reche PA (2017) Fundamentals and methods for T- and B-cell epitope prediction. J Immunol Res 2017:2680160. https://doi.org/10. 1155/2017/2680160 13. Reche PA, Reinherz EL (2003) Sequence variability analysis of human class I and class II MHC molecules: functional and structural

14

Tara Fiyouzi and Pedro A. Reche

correlates of amino acid polymorphisms. J Mol Biol 331(3):623–641 14. Moxon R, Reche PA, Rappuoli R (2019) Editorial: reverse vaccinology. Front Immunol 10(2776):2776 15. Facciola A, Visalli G, Lagana A, Di Pietro A (2022) An overview of vaccine adjuvants: current evidence and future perspectives. Vaccines (Basel) 10(5):819 16. Azmi F, Ahmad Fuaad AA, Skwarczynski M, Toth I (2014) Recent progress in adjuvant discovery for peptide-based subunit vaccines. Hum Vaccin Immunother 10(3):778–796 17. Harandi AM, Medaglini D (2010) Mucosal adjuvants. Curr HIV Res 8(4):330–335 18. Fujkuyama Y, Tokuhara D, Kataoka K, Gilbert RS, McGhee JR, Yuki Y, Kiyono H, Fujihashi K (2012) Novel vaccine development strategies for inducing mucosal immunity. Expert Rev Vaccines 11(3):367–379 19. Eldred BE, Dean AJ, McGuire TM, Nash AL (2006) Vaccine components and constituents: responding to consumer concerns. Med J Aust 184(4):170–175 20. Yin X, Chen S, Eisenbarth SC (2021) Dendritic cell regulation of T helper cells. Annu Rev Immunol 39:759–790 21. Sun B, Zhang Y (2014) Overview of orchestration of CD4+ T cell subsets in immune responses. Adv Exp Med Biol 841:1–13 22. Pulendran B (2015) The varieties of immunological experience: of pathogens, stress, and dendritic cells. Annu Rev Immunol 33:563– 606 23. Sette A, Rappuoli R (2010) Reverse vaccinology: developing vaccines in the era of genomics. Immunity 33(4):530–541 24. Yuen CT, Asokanathan C, Cook S, Lin N, Xing D (2016) Effect of different detoxification procedures on the residual pertussis toxin activities in vaccines. Vaccine 34(18):2129–2134 25. Moller J, Kraner M, Sonnewald U, Sangal V, Tittlbach H, Winkler J, Winkler TH, Melnikov V, Lang R, Sing A, Mattos-Guaraldi AL, Burkovski A (2019) Proteomics of diphtheria toxoid vaccines reveals multiple proteins that are immunogenic and may contribute to protection of humans against

Corynebacterium diphtheriae. Vaccine 37(23):3061–3070. https://doi.org/10.1016/j.vaccine.2019.04.059 26. Moller J, Kraner ME, Burkovski A (2019) Proteomics of Bordetella pertussis whole-cell and acellular vaccines. BMC Res Notes 12(1):329. https://doi.org/10.1186/s13104-1301914373-13102 27. Moller J, Kraner ME, Burkovski A (2019) More than a toxin: protein inventory of Clostridium tetani toxoid vaccines. Proteomes 7(2):15. https://doi.org/10.3390/proteomes7020015 28. Shrivastaw KP, Jhamb SS, Kumar A (1995) Quantitation of the protein content of diphtheria and tetanus toxoids by the Biuret method during production of combined vaccines. Biologicals 23(1):61–63 29. Ballesteros-Sanabria L, Pelaez-Prestel HF, Ras-Carmona A, Reche PA (2022) Resilience of spike-specific immunity induced by COVID-19 vaccines against SARS-CoV-2 variants. Biomedicine 10(5):996 30. Abufares HI, Oyoun Alsoud L, Alqudah MAY, Shara M, Soares NC, Alzoubi KH, El-Huneidi W, Bustanji Y, Soliman SSM, Semreen MH (2022) COVID-19 vaccines, effectiveness, and immune responses. Int J Mol Sci 23(23):15415 31. Mahase E (2021) How the Oxford-AstraZeneca COVID-19 vaccine was made. BMJ 372(372):n86 32. Polack FP, Thomas SJ, Kitchin N, Absalon J, Gurtman A, Lockhart S, Perez JL, Pérez Marc G, Moreira ED, Zerbini C, Bailey R, Swanson KA, Roychoudhury S, Koury K, Li P, Kalina WV, Cooper D, Frenck RW Jr, Hammitt LL, Türeci Ö, Nell H, Schaefer A, Ünal S, Tresnan DB, Mather S, Dormitzer PR, Şahin U, Jansen KU, Gruber WC (2020) Safety and efficacy of the BNT162b2 mRNA COVID-19 vaccine. N Engl J Med 383(27):2603–2615 33. Russo G, Reche P, Pennisi M, Pappalardo F (2020) The combination of artificial intelligence and systems biology for intelligent vaccine design. Expert Opin Drug Discov 15(11):1267–1281

Part I Immunomics and System Immunology

Chapter 2

Epitope Binning of Monoclonal and Polyclonal Antibodies by Biolayer Interferometry

Kaito Nagashima and Jarrod J. Mousa

Abstract

Understanding the epitopes of antibodies elicited by infection and vaccination is often useful in immunogen design. In this chapter, we describe biolayer interferometry (BLI)-based methods to evaluate such epitopes and permit simultaneous analysis of antibodies from several sources, including monoclonal antibodies (mAbs) and polyclonal serum antibodies (pAbs). Using previously characterized antibodies with known epitopes as controls, the distribution of epitopes for the influenza hemagglutinin (HA) is shown for isolated human mAbs and pooled serum from HA-immunized mice. This method is versatile, high-throughput, and can be adapted to several antigens.

Key words Biolayer interferometry, Monoclonal antibodies, Serum, Epitope mapping, Epitope binning, Antigen, Vaccine

1 Introduction

Elucidation of antibody epitopes on an antigen is often a critical part of vaccine development. The epitopes bound on a given target often dictate the potency and breadth of the antibody response, and thereby can serve as a measure of vaccine efficacy. Biolayer interferometry (BLI) measures antibody binding to an antigen by detecting shifts in the interference pattern of white light reflected from an internal reference layer and from a layer of immobilized protein on biosensor tips [1]. This technique has been commonly used in vaccine development, and as a diagnostic tool for the detection of SARS-CoV-2 antibodies in serum, because of its high sensitivity, throughput, and versatility [2, 3]. BLI permits characterization of the epitopes on an antigen that are recognized by either monoclonal antibodies (mAbs) or polyclonal antibodies (pAbs) from serum. This assay measures binding competition between a mAb or pAbs and a known antibody to identify novel epitopes [2]. Competition indicates that a mAb or pAbs bind(s) to an overlapping epitope with

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_2, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


the competing antibody, whereas no competition indicates that a mAb/pAbs bind(s) a novel epitope. The extent of competition is then quantified for each competing antibody pair, which can then be used to identify the major epitopes of a given antibody panel or serum. This approach provides a high-throughput method to identify the epitopes of dozens of mAbs and is amenable to virtually any antigen of interest, including protein and carbohydrate antigens [4]. It can provide structural insight into the mechanisms of broadly neutralizing antibodies (bnAbs) and broadly reactive antibodies, which are of major interest in the influenza [5–7], pneumovirus [8–10], pneumococcal [11], and HIV vaccinology fields [12].

2 Materials

The assay requires a BLI instrument, a computer, and associated software for experimental setup, data acquisition, and analysis. The Octet RED384 system, which is commonly used, is described in this method. However, this protocol can be adapted to other BLI systems. Plates containing the appropriate buffers and samples, in addition to biosensor tips to measure antibody association and dissociation to and from antigens, are also required. Some reagents require storage at 4 °C, -20 °C, or -80 °C.

2.1 Equipment/Software

1. Octet RED384 system (Sartorius).
2. Octet biosensor tips: The choice of tips depends on the nature of the antigen tested. Antigens that contain a histidine tag (His-tag) are compatible with HIS1K biosensors (anti-penta-HIS biosensors, Sartorius 18-5120). Biotinylated antigens are compatible with streptavidin biosensors (SA biosensors, Sartorius 18-5019). Other biosensors from the wide array available, including protein A and protein G biosensors, could also be used. In these cases, the antibody is loaded onto the biosensors first, and the antigen is then associated onto the biosensors for kinetics and binning assays.
3. Octet data acquisition software (version 11.1).
4. Octet data analysis software (version 11.1).
5. Microsoft Excel.

2.2 Plates

All plates are stored at room temperature.

1. Black 96-well plates (Greiner, 655079).
2. Tilted-bottom 384-well plates (Sartorius, 18-5076).

2.3 Buffers/Antigens/Samples

1. Octet buffer (kept at 4 °C): Phosphate-buffered saline (PBS), 0.5% bovine serum albumin (BSA), 0.05% Tween-20. Alternatively, this may be purchased as a 10× solution of kinetics buffer (Sartorius, catalog number 18-1105) and diluted into PBS at the time of use.
2. ChonBlock (kept at -20 °C) (Chondrex, 9068). Diluted to 25% ChonBlock with Octet buffer at the time of use.
3. 0.1 M glycine, pH = 2.7 (kept at 4 °C).
4. PBS (kept at room temperature).
5. Purified antigen (kept at -80 °C, aliquoted).
6. mAb (kept at -80 °C, aliquoted).
7. Serum from vaccinated or immunized mice (kept at -80 °C, aliquoted).
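The dilutions listed above (10× kinetics buffer into PBS, ChonBlock to 25% in Octet buffer) follow the usual C1·V1 = C2·V2 relationship. A minimal Python sketch of that arithmetic is given below; the function name and the example volumes are illustrative assumptions and are not part of the protocol.

```python
def dilution_volumes(stock_conc, final_conc, final_volume):
    """Return (stock volume, diluent volume) for a C1*V1 = C2*V2 dilution.

    Both concentrations must use the same unit (e.g., "x" or %). The
    returned volumes are in the same unit as final_volume.
    """
    if not 0 < final_conc <= stock_conc:
        raise ValueError("final_conc must be positive and not exceed stock_conc")
    stock_volume = final_conc * final_volume / stock_conc
    return stock_volume, final_volume - stock_volume


# Hypothetical volumes, chosen only for illustration.
print(dilution_volumes(10, 1, 50.0))    # 10x kinetics buffer to 1x, 50 mL total
print(dilution_volumes(100, 25, 20.0))  # ChonBlock to 25%, 20 mL total
```

The same helper could be applied to antigen or mAb stocks, provided the stock and target concentrations are expressed in the same unit.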

3 Methods

3.1 Kinetics Measurement of Antibody Binding

3.1.1 Kinetics Experiment Assay Definition

It is often useful to perform a kinetics experiment using individual mAbs with the antigen prior to epitope binning to measure intrinsic association and dissociation characteristics. Moreover, this experiment can help to optimize the duration of each step. In this assay, described in Fig. 1a, the following steps are used: (1) baseline, (2) loading of antigen, (3) baseline, (4) association of the mAb, (5) dissociation of the mAb, and (6) biosensor regeneration (not shown in the figure).

1. Define a kinetics experiment with the Octet software. It is useful to prepare a spreadsheet of the assay plate (in a 384-well format) to define the positions of all buffers/samples. An example is shown in Fig. 1b. The Octet RED384 system holds a maximum of 16 biosensors at once, such that every other well in a 384-well plate is skipped.
2. Open the Data Acquisition version 11.1 program.
3. Open a new kinetics experiment with Basic Kinetics.
4. In the first tab of the software, Plate Definition, define the positions of the buffers and samples on the 384-well assay plate. There is an option to right-click each well to define it as buffer, load, sample, regeneration, or neutralization.
   (a) Buffer indicates wells that contain Octet buffer.
   (b) Load indicates wells that contain antigen.
   (c) Sample indicates wells that contain mAb. The concentration can also be included for future reference.

[Figure 1: panels A to C. Panel C, "H1 HA Kinetics": Response (nm) versus Time (s) for Ab6649, 5J8, and CR6261, each at 100 µg/mL.]

Fig. 1 BLI-based mAb kinetics experiment. (a) Schematic of the expected signals from a typical kinetics experiment using HIS1K biosensors. Binding is measured as a change in interference from a biosensor with immobilized antigen. A baseline reading is taken, antigen is loaded until the binding reaches a signal of about 1 nm, another baseline reading in buffer is taken, and mAb is associated and then dissociated. The buffers or sample used for each step is indicated. (b) A general sample plate format for a kinetics experiment. The steps in the assay are shown in the bottom table. (c) A trace from a typical kinetics experiment for three mAbs, Ab6649, 5J8, and CR6261, binding to the A/California/04/2009 hemagglutinin (HA) on an anti-penta-HIS sensor. The dotted line indicates the end of the association step and the beginning of the dissociation step

   (d) Regeneration indicates wells that contain 0.1 M glycine, pH = 2.7.
   (e) Neutralization indicates wells that contain PBS.
   (f) It is possible to annotate the sample and load wells (i.e., with the molecular weight and concentration) either by right-clicking and selecting Set Well Data or through the sample plate table.
   (g) The read head format can be selected on the left-hand panel. If only a few samples are being run, choose the 8-channel read head. If several are being run, then choose the 16-channel head.


5. In the second tab of the software, Assay Definition, input the steps of the assay: baseline, loading, association, dissociation, or regeneration and indicate the step duration and shaking speed (see Note 1).
6. Indicate the order of the steps. These include the following: (1) baseline, (2) antigen loading, (3) baseline, (4) mAb association, (5) mAb dissociation, and (6) biosensor regeneration.
   (a) To assign a given column of wells in the sample plate a step, click on the corresponding column and then click on the desired assay step from step 5.
   (b) It is also possible to duplicate multiple steps in the assay step list by selecting multiple steps, then selecting Replicate.
   (c) Indicate the biosensor type under the Sensor Type column. This will vary based on the chemistry used to load the antigen onto the biosensor.
   (d) The duration of the antigen-loading step will have to be empirically determined based on biosensor type and antigen to achieve a signal of approximately 1 nm.
7. In the third tab of the software, Sensor Assignment, the locations of the biosensor tips in the Sensor Tray for the assay are indicated in the 96-well plate. In this step, a 96-well plate containing Octet buffer is used to equilibrate the biosensors prior to the experiment. To assign biosensors, select the desired wells and then click Fill and right-click to indicate the biosensor type. Filled wells will be shown in blue. Ensure that the biosensor type is correct as in the second tab. If multiple sets of biosensors are to be used with regeneration between each set, indicate their positions here (see Note 2).
8. In the fourth tab of the software, Review Experiment, a full run of the experiment indicating the locations of the biosensors for each assay step is shown. It is possible to scroll through each step to ensure that all biosensors move to the expected buffers and samples.
9. Save the experiment to the desired directory on the computer for future use.
3.1.2 Kinetics Experiment Plate Setup

1. On the day of the experiment, prepare a tilted-bottom 384-well plate with the appropriate buffers and samples as defined in the spreadsheet in the kinetics experiment plate definition step, using 100 μL per well. Cover the plate to minimize volume loss to evaporation.
   (a) Use 10–100 μg/mL concentration of the antigen in Octet buffer. This concentration may have to be optimized to give an antigen-loading signal shift of 1.0 nm.
   (b) Also prepare the mAbs to be tested at 100 μg/mL in Octet buffer.


2. Also prepare a black 96-well plate with 200 μL/well of Octet buffer for all tips to be equilibrated in prior to the beginning of the kinetics experiment. Cover the plate to minimize volume loss to evaporation.

3.1.3 Transfer of Assay Plates to the BLI Instrument and Assay Start

1. Switch on the Octet RED384 machine and open the data acquisition software. Open the lid on the machine and load the appropriate plates into their corresponding positions in the machine (see Note 3).
2. Load the assay definition from the first section and navigate to the right-most Run Experiment tab, confirming that the assay plate wells correspond to those from the previous section. Select a directory to save the results. Do not start the assay yet (see Note 4).
3. Prepare the biosensor tips. The packaging of the biosensors contains a green biosensor tray that holds the biosensors. Use an extra biosensor tray to hold the biosensors, placing the required biosensors in the appropriate positions based on the anticipated positions as indicated in the Sensor Assignment tab of the software. Return the unused biosensors into the packaging with the drying agent for storage.
4. Place the loaded biosensor tray on top of the 96-well plate with Octet buffer for equilibration.
5. Set the time before assay start by checking the delayed experiment start option in the last tab so that a total of 5 min elapse after adding the biosensor container on top of the 96-well plate (see Note 5).
6. Start the assay in the software by clicking the green Go button.
7. Once the assay begins, the lid will automatically close on the machine. It is possible to select View and Instrument Status for a real-time log of the experiment.
   (a) It is useful to check that no errors are encountered, so it is recommended to check the machine a few minutes after the assay is started.
8. Real-time measurements of the biosensor signals will be displayed on the data acquisition software. A sample trace of the association and dissociation steps for three mAbs, Ab6649 [13], 5J8 [14], and CR6261 [15], to an H1 influenza hemagglutinin is shown in Fig. 1c (see Note 6).
   (a) If needed, the antigen-loading step can be extended or shortened through this log to achieve the desired signal of 1 nm.

3.1.4 Data Analysis/Processing

1. After the assay has finished, the lid on the Octet RED384 will open. Remove the assay and buffer plates from the machine, and dispose of the biosensors appropriately.
2. Open the Data Analysis version 11.1 program and choose the directory leading to the saved data for the experiment under the Data Selection tab.
3. Go to the Processing tab. This will load a window containing the trace for all steps of the experiment.
4. Align the binding curves by the start point by choosing Align by Begin Point and All Aligned to One Step to facilitate visual inspection of mAb association and dissociation.
5. On the left-hand panel, select Save Raw Data to save the binding curves as a .csv file.
6. Plot the .csv file and visually inspect the traces of all mAbs. Determine if there is an appreciable decrease in signal during the dissociation step for any mAb and, if so, consider re-optimization of the assay conditions (see Note 7).
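Step 6 can also be supported with a quick numerical check. The sketch below, in Python, assumes that one trace has already been read from the exported .csv file into two lists (time in seconds and response in nm). The layout of the exported file is not described here, so parsing is left out, and the function name and example values are hypothetical.

```python
def dissociation_loss(times, responses, dissociation_start):
    """Fraction of the signal present at the start of dissociation that is
    lost by the end of the trace (0 = no dissociation, 1 = complete)."""
    # First reading taken at or after the start of the dissociation step.
    start_index = next(i for i, t in enumerate(times) if t >= dissociation_start)
    start_signal = responses[start_index]
    if start_signal <= 0:
        raise ValueError("no binding signal at the start of dissociation")
    return (start_signal - responses[-1]) / start_signal


# Hypothetical trace: association until 600 s, then dissociation.
times = [0, 300, 600, 800, 1000]
responses = [0.0, 0.8, 1.2, 1.1, 0.9]
print(f"{dissociation_loss(times, responses, 600):.0%} of the signal lost")
```

What counts as an "appreciable" loss is a judgment for the experimenter; the function only makes the comparison between mAbs explicit.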

3.2 Epitope Binning of mAbs

This approach is used if multiple mAbs isolated against the same antigen are to be characterized. Control mAbs with known, usually non-overlapping, epitopes are included to determine whether the isolated mAbs bind to these control epitopes.

3.2.1 mAb Epitope Binning Assay Definition

1. Similar to the previous kinetics assay section, prepare a spreadsheet of the assay plate, as well as the required assay steps, prior to running the experiment. The overall steps of the epitope binning assay are shown in Fig. 2a: (1) baseline, (2) antigen loading, (3) baseline, (4) association of the first mAb, (5) association of the competing mAb, and (6) biosensor regeneration.
2. This can be performed days before the actual assay and the assay protocol may then be saved for future use. The setup of one such plate is shown in Fig. 2b.
3. Open the Data Acquisition version 11.1 program.
4. Open a new kinetics experiment with Basic Kinetics.
5. In the first tab of the Data Acquisition software, Plate Definition, define the positions of the buffers and samples on the 384-well assay plate as for the Kinetics experiment assay definition step.
6. In the second tab of the software, Assay Definition, input the steps of the assay: baseline, loading, association, or regeneration, entering the desired shaking speed and duration. Afterwards, the order and corresponding wells of the steps are indicated. These include the following: (1) baseline, (2) antigen loading, (3) baseline, (4) association of the first mAb, (5) association of the competing mAb, and (6) biosensor regeneration (see Note 8).

[Figure 2: panels A to D. Step labels: Baseline, Loading, Baseline, First mAb, Second mAb. Panel C, "H1 HA Epitope Binning": Response (nm) versus Time (s) for CA09-16 followed by each competing mAb. Panel D: heat map of competition values for each pair of first and second mAb.]

Fig. 2 BLI-based mAb epitope binning. (a) Schematic of the expected signals from a mAb epitope binning experiment. A baseline reading is taken, antigen is loaded, another baseline reading is taken, and one mAb is associated. A second mAb is associated onto the sensor with three possible outcomes: no competition with the first mAb, permitting binding to antigen shown as a high signal; intermediate competition, with moderate binding to antigen; or complete competition, with no binding. Complete competition is expected for a single mAb to itself. The buffers or sample used for each step is indicated. (b) A general sample plate format for a mAb epitope binning experiment. The steps in the assay are shown in the bottom table. The regeneration step (step 5) should be performed after each competition step (i.e., after step 1 and prior to step 2, and again after step 2 and before step 3, and so on) to prepare the biosensors for the subsequent competition cycle. (c) A trace from a typical mAb epitope binning experiment for eight mAbs, CA09-16, CA09-17, CA09-19, CA09-21, CA09-22, CA09-24, P1-02, and P1-03, against the A/California/04/2009 HA on an anti-penta-His sensor. CA09-16 was associated first, then each of the indicated mAbs was competed in the second association step. Dotted lines denote each assay step. (d) A heat map showing the results from a binning experiment with the eight mAbs from (c) and control mAbs, Ab6649, 5J8, and CR6261 in both directions. mAbs on the vertical axis were associated first, then those on the horizontal axis competed. Black (complete competition) indicates a percent competition of [...]

[...] 50,000 TCRs. First, the file is split into smaller subfiles (chunks), each consisting of ≤50,000 TCRs. Next, TCRex predictions will be performed for each chunk individually, after which the result files are finally merged back together


Sebastiaan Valkiers et al.

TCRex allows the adjustment of a few additional settings: IMGT parsing, which enables automatic parsing of all V/J genes in the standard IMGT format, and the enrichment threshold needed to perform enrichment analyses. It is advisable to keep the default values for these parameters. The enrichment threshold is explained in more detail later in this tutorial. TCRex also provides a detailed overview of all steps and an explanation of the parameters on its instructions page.

3.2.3 Download TCRex Results

While the predictions are running, the user is redirected to a new page providing information regarding the selected parameters and the status of the job. This page is linked through its unique Task ID, which is part of its URL: tcrex.biodatamining.be/task/TaskID/. The same page gives access to the results for up to 7 days after submission. Results include an overview table of epitope-specific p values derived from the enrichment analysis and a table with all identified epitope-specific TCRs. The latter is dependent on the selected baseline prediction rate (BPR) threshold (default 0.01%), which must be the same as the previously defined enrichment threshold. All identified epitope-specific TCRs can be downloaded by clicking the ‘Download results’ button at the bottom of the results page.

3.2.4 Concatenate the Results from the Original Files

If you adjusted the size of the original TCR repertoires by splitting them into smaller files, you must concatenate the results before performing any further analyses (Fig. 1). Part 3 of the tutorial notebooks provides a range of functions for merging the individual TCRex results that originate from the same repertoire back into one file. The code allows automatic merging of TCR data from all files stored within the same folder. In brief, every file is read as a separate pandas.DataFrame and stored in one list before concatenating into one final data frame. The latter is stored in the desired folder, in this case “./results/tcrex” (see part 3 of the notebook tutorials). This entire step can be omitted if your original TCR data files contained fewer than 50,000 TCRs and thus were not split into smaller files before performing TCRex predictions.
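The merging step can be sketched as follows. This is not the notebook code from part 3 but a minimal illustration with pandas. The function name, the file pattern, and the assumption that comment lines in the result files start with "#" are all hypothetical and should be checked against the actual TCRex output.

```python
from pathlib import Path

import pandas as pd


def merge_tcrex_chunks(folder, pattern="*.tsv"):
    """Read every chunk result in `folder` and return one merged data frame."""
    paths = sorted(Path(folder).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {folder}")
    # One data frame per chunk; lines starting with '#' are skipped
    # (assumption about how the comments section is written).
    frames = [pd.read_csv(path, sep="\t", comment="#") for path in paths]
    merged = pd.concat(frames, ignore_index=True)
    # Identical rows can occur if the same TCR ended up in several chunks.
    return merged.drop_duplicates().reset_index(drop=True)
```

A call such as `merge_tcrex_chunks("./results/tcrex_chunks")` (a hypothetical folder) would return the combined table, which can then be written with `to_csv(..., sep="\t", index=False)`.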

3.2.5 Calculate Identification Metrics

For every TCR epitope combination, TCRex returns two values: the class probability score calculated by the random forest model and the BPR value which represents the fraction of identified epitope-specific TCRs with a higher or similar class probability score in a background repertoire. These scores might assist you in the interpretation of individual TCR-epitope interactions. However, if conclusions are needed for entire TCR repertoires, additional metrics must be calculated. First, it is interesting to have a look at the general identification rate of epitope-specific TCRs in the full repertoire. The identification rate is defined per epitope as

Clustering and Annotation of T Cell Receptor Repertoires


the fraction of unique identified epitope-specific TCRs over the total number of unique TCRs in the full repertoire. Here unique TCRs refer to unique combinations of the CDR3β sequences and the V and J genes. The identification rate R is determined by

$R = \frac{N}{S} \times 100$,

where N is the number of TCRs identified to target the epitope of interest and S represents the size of the repertoire. Since naive TCR repertoires also contain epitope-specific TCRs, the identification rate by itself cannot be used to derive accurate information regarding a possible enrichment in epitope-specific TCRs due to illness, vaccines, or treatments. To circumvent this problem, the presence of epitope-specific TCRs can be assessed by performing an enrichment analysis. By default, TCRex performs a one-sided binomial test for every epitope where the number of identified epitope-specific TCRs in a repertoire is compared with an enrichment threshold representing the fraction of epitope-specific TCRs at baseline. A p value below the selected significance level indicates an enrichment of identified TCRs for that specific epitope. Since this enrichment analysis is carried out on every uploaded file separately, you need to recalculate it when splitting files. To calculate these metrics, the number of identified epitope-specific TCRs and the repertoire size are needed. The former can be calculated directly from the TCRex results, while the latter can be found at the online results page reported by TCRex and in the comments section of the results file. Table 1 displays the metrics of the top 10 most significant TCRex hits for the combined files of day 0. Note that the standard enrichment analysis for the online tool is done at a user-defined enrichment threshold when at least two TCRs are predicted to bind the epitope.

Table 1 Example of the top 10 predicted epitopes by p value, at day 0, produced by the TCRex algorithm

Epitope       Pathology   Identification rate   p value         Corrected p value
TPQDLNTML     HIV         0.713104              0               0
NLVPMVATV     CMV         1.112989              0               0
ELAGIGILTV    Melanoma    0.982117              0               0
GTSGSPIVNR    DENV1       0.084049              5.929876e-159   7.264098e-158
LPRRSGAAGA    Influenza   0.047695              4.027836e-57    3.947280e-56
HPVGEADYFEY   EBV         0.039843              9.260445e-40    7.562696e-39
VTEHDTLLY     CMV         0.037226              1.838391e-34    1.286874e-33
GLCTLVAML     EBV         0.036062              3.409950e-32    2.088594e-31
TPRVTGGGAM    CMV         0.033736              7.965513e-28    4.336779e-27
AMFWSVPTV     Melanoma    0.031118              3.402548e-23    1.667249e-22

In case of comparing


repertoires at different time points, it is advisable to replace the standard enrichment threshold for each epitope by the identification rate of TCRs specific for that epitope in your baseline repertoire (i.e., the repertoire before disease or treatment). Make sure you download both pre- and post-treatment repertoires using the same BPR threshold. If no pre-treatment repertoire is available, we recommend the use of the default enrichment threshold of 0.0001 (i.e., 0.01%). In this case, it is also important to select the same value for the enrichment threshold and BPR threshold when downloading the results (both are 0.01% by default). Keep in mind that the enrichment results with the default threshold will be less accurate than those obtained with a pre-calculated identification rate at baseline.

The functions required for calculating the identification rate and enrichment score are provided in part 4 of the tutorial notebooks provided with this chapter. The tutorial provides an example of how these metrics may be calculated for a single epitope as well as for a combination of epitopes. When calculating these metrics, make sure you use the appropriate repertoire size that corresponds to the output of the TCRex algorithm (without duplicates if merging the individual output files).

Note that the enrichment analysis is performed on every epitope separately. Because multiple statistical tests are performed, we need to correct the p values in order to control the false discovery rate. In the example provided in the tutorial notebook, the Benjamini-Hochberg correction was performed on the “p_value” column of the results, thereby adding a new column to the data frame with the adjusted p values for every epitope, allowing the use of the original significance level. Figure 2a, b shows the epitope-specific TCR expression at day 0 and day 15. Ultimately, we are interested in the effect of vaccination on the expression of epitope-specific TCRs.
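The calculations described in this section can be sketched in a few lines of Python. This is not the code from part 4 of the tutorial notebooks. It is an independent illustration that uses only the standard library, and the function names and example numbers are assumptions. It computes the identification rate, the one-sided binomial p value against an enrichment threshold, and Benjamini-Hochberg adjusted p values.

```python
import math


def identification_rate(n_identified, repertoire_size):
    """Identification rate R = N / S * 100, in percent of unique TCRs."""
    return n_identified / repertoire_size * 100


def binomial_enrichment_p(n_identified, repertoire_size, threshold=0.0001):
    """One-sided binomial p value: P(X >= n_identified) for
    X ~ Binomial(repertoire_size, threshold), summed in log space."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must lie strictly between 0 and 1")
    if n_identified <= 0:
        return 1.0
    if n_identified > repertoire_size:
        return 0.0
    n, p = repertoire_size, threshold
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
        for k in range(n_identified, n + 1)
    ]
    largest = max(log_terms)
    total = math.exp(largest) * sum(math.exp(t - largest) for t in log_terms)
    return min(1.0, total)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p value to enforce monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical counts for three epitopes in a repertoire of 50,000 unique TCRs.
counts = [30, 12, 8]
p_values = [binomial_enrichment_p(k, 50_000) for k in counts]
print([identification_rate(k, 50_000) for k in counts])
print(benjamini_hochberg(p_values))
```

Note that the identification rate is a percentage while the enrichment threshold is a fraction, so a baseline identification rate would presumably be divided by 100 before being passed as `threshold`.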
To get a rough idea of epitope-specific expansion or contraction between the two time points, we can calculate the fold change in identification rate between day 0 and day 15. Here, this metric is calculated as

$\log_2\left(\frac{\mathrm{IR}_{\text{day 15}}}{\mathrm{IR}_{\text{day 0}}}\right)$.

To prevent division by 0, we identify the smallest number in the identification rate column and add this value to the identification rate of all other epitopes. Figure 2c shows the normalized fold change in identification rate (IR-FC) for all epitopes and their different pathologies. The YFV epitope LLWNGPMAV in particular shows the greatest proportional expansion at day 15 post vaccination, as compared to day 0. To get an impression of how this effect manifests itself on the pathology level, we can group the epitopes according to their pathology of origin (Fig. 2d). Note that this may be biased (up- or downward) by the difference in available epitope models for each pathology.
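The fold change calculation can be illustrated with a short Python sketch. It reads the pseudocount rule as "add the smallest non-zero identification rate to every value", which is one interpretation of the text and may differ from the notebook implementation. The function name and the identification rates in the example are hypothetical.

```python
import math


def log2_fold_change(ir_day0, ir_day15):
    """log2(IR day 15 / IR day 0) per epitope, with a pseudocount.

    Both arguments map epitope -> identification rate. Epitopes missing
    from one time point are treated as having an identification rate of 0.
    """
    epitopes = sorted(set(ir_day0) | set(ir_day15))
    values = [d.get(e, 0.0) for d in (ir_day0, ir_day15) for e in epitopes]
    # Assumption: the pseudocount is the smallest non-zero rate observed.
    pseudocount = min(v for v in values if v > 0)
    return {
        e: math.log2(
            (ir_day15.get(e, 0.0) + pseudocount) / (ir_day0.get(e, 0.0) + pseudocount)
        )
        for e in epitopes
    }


# Hypothetical identification rates, not taken from the chapter's data.
day0 = {"LLWNGPMAV": 0.0, "NLVPMVATV": 0.02}
day15 = {"LLWNGPMAV": 0.03, "NLVPMVATV": 0.02}
print(log2_fold_change(day0, day15))
```

Positive values indicate expansion between day 0 and day 15, negative values contraction, and 0 no change.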

Fig. 2 Visualization of TCRex prediction results for two samples from one individual before (day 0) and after (day 15) YFV vaccination. (a, b) Identification rate of epitope-specific TCRs at day 0 and 15, respectively. (c) Fold change in identification rate for individual epitopes between day 15 and day 0. (d) Fold change in identification rate between day 15 and day 0, grouped according to pathology. IR-FC identification rate fold change

Fig. 3 Example of cluster visualization. A TCR cluster (id = 11250) is represented as a network in which TCRs with a Hamming distance ≤1 are connected by an edge. YFV-specific TCRs were assigned a red color

3.2.6 Examine the Epitope-Specific Clusters

It is interesting to take a deeper look at those clusters having at least one identified epitope-specific TCR since they can be of different sizes. This requires some processing, in order to merge all the results (from ClusTCR and TCRex) together into one data frame. To merge the data, we will use a custom function (located in the “./src” folder) that is provided in the GitHub repository associated with this chapter. This is illustrated in the code block below and part 5 of the notebook tutorials:

import pandas as pd
from src.tools import merge_results

# TCRex results
tcrex = pd.read_csv("./results/tcrex/P1_15.tsv", sep="\t")
# ClusTCR results
clust = pd.read_csv("./results/clustcr/P1_15_clusters.tsv", sep="\t")
# Original data
rawdata = pd.read_csv("./data/examples/P1_15.tsv", sep="\t")

# Merge all files
merged_15 = merge_results(
    original=rawdata,
    clusters=clust,
    predictions=tcrex
)

# Drop all records that do not belong to a cluster
merged_15 = merged_15.dropna(subset=["cluster"])

# Only keep desired columns
keep = ["duplicate_count", "frequency", "junction_aa", "v_call", "j_call",
        "cluster", "epitope", "pathology", "score", "bpr"]
merged_15 = merged_15[keep]

# Save results
merged_15.to_csv("./results/merged/P1_15_clusters_tcrex.tsv", sep="\t", index=False)

After bringing together all the result files, we can start analyzing and interpreting the data. We start by counting the number of epitope annotations for each cluster. This can be done by first identifying all the clusters that contain at least one TCR with a TCRex epitope annotation. Next, we count the number of epitopes annotations per cluster by using the pandas groupby functionality, as illustrated in the following code block. clusters_with_hits = set(merged_15.dropna(subset = ["epitope"]).cluster) subset = merged_15[merged_15.cluster.isin(clusters_with_hits)]

epi_count = subset.groupby("cluster").count().epitope.sort_index() cluster_size = merged_15.cluster.value_counts()[clusters_with_hits].sort_index()

res = pd.concat([epi_count, cluster_size], axis = 1)
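To make the counting step concrete, the following minimal sketch repeats it on a toy data frame. The CDR3 sequences, the epitope labels, and the output column names are invented for illustration and do not come from the chapter's data; only the column names `cluster` and `epitope` follow the merged table above.

```python
import pandas as pd

# Toy stand-in for merged_15: five TCRs in three clusters. A missing epitope
# means that no epitope annotation was returned for that TCR.
toy = pd.DataFrame({
    "junction_aa": ["CASSA", "CASSB", "CASSC", "CASSD", "CASSE"],  # made-up CDR3s
    "cluster": [0, 0, 0, 1, 2],
    "epitope": ["EPI_1", None, "EPI_1", None, "EPI_2"],  # hypothetical labels
})

# Clusters holding at least one annotated TCR
hit_clusters = sorted(set(toy.dropna(subset=["epitope"]).cluster))
subset = toy[toy.cluster.isin(hit_clusters)]

# count() ignores missing values: number of annotated TCRs per cluster
epi_count = subset.groupby("cluster")["epitope"].count().rename("n_epitope_specific")
# size() counts every TCR in the cluster, annotated or not
cluster_size = subset.groupby("cluster").size().rename("cluster_size")

summary = pd.concat([epi_count, cluster_size], axis=1)
summary = summary.sort_values("n_epitope_specific", ascending=False)
print(summary)
```

Cluster 1 has no annotated TCR and is therefore absent from the summary, whereas cluster 0 is reported with two annotated TCRs out of three.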

These operations return a data frame summarizing the number of epitope-specific TCRs and the size of every selected cluster. Table 2 shows the top 10 clusters containing the most epitope-specific TCRs for the pre-vaccination sample.

Table 2 Top 10 clusters in terms of total number of epitope-specific TCRs

Cluster ID    Number of epitope-specific TCRs    Cluster size
1861          508                                528
1850          504                                598
1852          403                                554
2058          326                                364
2063          286                                365
2481          195                                266
1859          169                                191
1854          155                                212
2064          154                                192
1858          133                                232

Epitope specificity was predicted using the TCRex algorithm. Example shown for sample at day 0

A deeper look into these clusters may reveal whether these clusters have a heterogeneous response against multiple epitopes or exhibit specificity toward a single epitope as a result of a convergent recombination process. Ultimately, we are interested in studying the YFV-specific response. Therefore, we can filter the epitope-specific responses to only include YFV-specific TCRs (procedure illustrated in the notebook available with this chapter). Based on this data, we can situate the YFV-specific TCRs in their corresponding clusters (Fig. 3). This may reveal a convergent selection of highly similar TCRs specific for the epitope(s) of interest.
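The filtering step is documented in the notebook rather than here, so the following is only a sketch of how it could look. It assumes that the `pathology` column of the merged table names the pathogen of each predicted epitope; the helper `count_pathology_hits` and the label "YFV" are hypothetical, and the actual label string used in the TCRex output should be checked in the result file.

```python
import pandas as pd

def count_pathology_hits(merged, label):
    """Count, per cluster, the TCRs whose predicted epitope belongs to `label`."""
    hits = merged[merged["pathology"] == label]
    return hits.groupby("cluster").size().sort_values(ascending=False)

# Toy stand-in for the merged table (labels are hypothetical)
toy = pd.DataFrame({
    "cluster": [1, 1, 1, 2, 2, 3],
    "pathology": ["YFV", "YFV", "CMV", "YFV", None, "EBV"],
})

yfv_per_cluster = count_pathology_hits(toy, "YFV")
print(yfv_per_cluster)
```

Clusters without any hit for the chosen label simply do not appear in the result, which makes it straightforward to locate the clusters of interest.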

4 Notes

1. In case your data is not saved in one of the required formats, you must transform it to the TCRex format which consists of 3 tab-delimited columns called “CDR3_beta,” “TRBV_gene” and “TRBJ_gene”. More info on the TCRex tab-delimited format and an example file can be found on the TCRex instructions page. Note that future TCRex versions might accept a wider range of file formats.
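As an illustration of Note 1, the sketch below renames three columns of a repertoire table to the TCRex column names and writes them as tab-delimited text. It assumes that the input uses the column names `junction_aa`, `v_call`, and `j_call` that appear in the merged table above; the helper `to_tcrex_format` and the example rows are hypothetical. TCRex may impose further requirements (for example on gene naming), which are described on its instructions page and are not handled here.

```python
import io
import pandas as pd

# Assumed mapping from the input column names to the TCRex column names
TO_TCREX = {"junction_aa": "CDR3_beta", "v_call": "TRBV_gene", "j_call": "TRBJ_gene"}

def to_tcrex_format(repertoire):
    """Return a three-column data frame carrying the TCRex column names."""
    out = repertoire[list(TO_TCREX)].rename(columns=TO_TCREX)
    # Drop incomplete records and exact duplicates
    return out.dropna().drop_duplicates()

toy = pd.DataFrame({
    "junction_aa": ["CASSLGQAYEQYF", "CASSLGQAYEQYF", None],
    "v_call": ["TRBV7-9", "TRBV7-9", "TRBV5-1"],
    "j_call": ["TRBJ2-7", "TRBJ2-7", "TRBJ1-1"],
    "duplicate_count": [10, 3, 1],
})
tcrex_input = to_tcrex_format(toy)

# Written to an in-memory buffer here; pass a file path to save to disk instead
buffer = io.StringIO()
tcrex_input.to_csv(buffer, sep="\t", index=False)
print(buffer.getvalue())
```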

Acknowledgments

This work was funded by the iBOF MIMICRY grant, the Flanders AI Impulse program, and the Research Foundation Flanders (FWO) [1S48819N to SG and 1S40321N to SV].

References

1. Shcherbinin DS, Belousov VA, Shugay M (2020) Comprehensive analysis of structural and sequencing data reveals almost unconstrained chain pairing in TCRαβ complex. PLoS Comput Biol 16:e1007714
2. Mora T, Walczak AM (2016) Quantifying lymphocyte receptor diversity. bioRxiv 046870
3. Jenkins MK, Chu HH, McLachlan JB, Moon JJ (2010) On the composition of the preimmune repertoire of T cells specific for peptide-major histocompatibility complex ligands. Annu Rev Immunol 28:275–294
4. Pai JA, Satpathy AT (2021) High-throughput and single-cell T cell receptor sequencing technologies. Nat Methods 18:881–892
5. Dash P, Fiore-Gartland AJ, Hertz T et al (2017) Quantifiable predictive features define epitope-specific T cell receptor repertoires. Nature 547:89–93
6. Glanville J, Huang H, Nau A et al (2017) Identifying specificity groups in the T cell receptor repertoire. Nature 547:94–98
7. Meysman P, De Neuter N, Gielis S et al (2019) On the viability of unsupervised T-cell receptor sequence clustering for epitope preference. Bioinformatics 35:1461–1468
8. Dolton G, Zervoudi E, Rius C et al (2018) Optimized peptide-MHC multimer protocols for detection and isolation of autoimmune T-cells. Front Immunol 9:1378
9. Bagaev DV, Vroomans RMA, Samir J et al (2020) VDJdb in 2019: database extension, new analysis infrastructure and a T-cell receptor motif compendium. Nucleic Acids Res 48:D1057–D1062
10. Vita R, Mahajan S, Overton JA et al (2019) The immune epitope database (IEDB): 2018 update. Nucleic Acids Res 47:D339–D343
11. Tickotsky N, Sagiv T, Prilusky J et al (2017) McPAS-TCR: a manually curated catalogue of pathology-associated T cell receptor sequences. Bioinformatics 33:2924–2929
12. Gielis S, Moris P, Bittremieux W et al (2019) Detection of enriched T cell epitope specificity in full T cell receptor sequence repertoires. Front Immunol 10:2820
13. Jokinen E, Huuhtanen J, Mustjoki S et al (2021) Predicting recognition between T cell receptors and epitopes with TCRGP. PLoS Comput Biol 17:e1008814
14. Montemurro A, Schuster V, Povlsen HR et al (2021) NetTCR-2.0 enables accurate prediction of TCR-peptide binding by using paired TCRα and β sequence data. Commun Biol 4:1060
15. Tong Y, Wang J, Zheng T et al (2020) SETE: sequence-based ensemble learning approach for TCR epitope binding prediction. Comput Biol Chem 87:107281
16. Weber A, Born J, Rodríguez Martínez M (2021) TITAN: T-cell receptor specificity prediction with bimodal attention networks. Bioinformatics 37:i237–i244
17. Moris P, De Pauw J, Gielis S et al (2020) Current challenges for unseen-epitope TCR interaction prediction and a new perspective derived from image classification. Brief Bioinform 22:bbaa318
18. Valkiers S, Van Houcke M, Laukens K, Meysman P (2021) ClusTCR: a python interface for rapid clustering of large sets of CDR3 sequences with unknown antigen specificity. Bioinformatics. https://doi.org/10.1093/bioinformatics/btab446
19. Pogorelyy MV, Minervina AA, Touzel MP et al (2018) Precise tracking of vaccine-responding T cell clones reveals convergent and personalized response in identical twins. Proc Natl Acad Sci U S A 115:12704–12709
20. Bolotin DA, Poslavsky S, Mitrophanov I et al (2015) MiXCR: software for comprehensive adaptive immunity profiling. Nat Methods 12:380–381
21. Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30:1575–1584
22. Abu-Jamous B, Kelly S (2018) Clust: automatic extraction of optimal co-expressed gene clusters from gene expression data. Genome Biol 19:172

Chapter 4

Protocol for Classification Single-Cell PBMC Types from Pathological Samples Using Supervised Machine Learning

Minjie Lyu, Lin Xin, Huan Jin, Lou T. Chitkushev, Guanglan Zhang, Derin B. Keskin, and Vladimir Brusic

Abstract

Peripheral blood mononuclear cells (PBMC) are mixed subpopulations of blood cells composed of five cell types. PBMC are widely used in the study of the immune system, infectious diseases, cancer, and vaccine development. Single-cell transcriptomics (SCT) allows the labeling of cell types by gene expression patterns from biological samples. Classifying cells into cell types and states is essential for single-cell analyses, especially in the classification of diseases and the assessment of therapeutic interventions, and for many secondary analyses. Most methods for classifying cell types from SCT data use unsupervised clustering or a combination of unsupervised and supervised methods, including manual correction. In this chapter, we describe a protocol that uses supervised machine learning (ML) methods with SCT data for the classification of PBMC cell types in samples representing pathological states. This protocol has three parts: (1) data preprocessing, (2) labeling of reference PBMC SCT datasets and training supervised ML models, and (3) labeling new PBMC datasets from disease samples. This protocol enables building classification models that are of high accuracy and efficiency. Our example focuses on 10× Genomics technology but applies to datasets from other SCT platforms.

Key words Single-cell transcriptomics, Cell type classification, Supervised machine learning, Protocol, Disease, Peripheral blood mononuclear cells

1 Introduction

Peripheral blood mononuclear cells (PBMC) are mixed subpopulations of blood cells, composed of five principal cell types: B cells (BC), dendritic cells (DC), monocytes (MC), natural killer cells (NK), and T cells (TC) [1]. The proportions of PBMC subtypes in healthy individuals fit within broadly defined ranges: BC make up 5–15%, DC 1–2%, MC 10–30%, NK 5–10%, and TC 40–70% of total PBMC [1, 2]. PBMC are key research targets in studying the immune system, infectious diseases, cancer, autoimmunity, and vaccine development. PBMC are used for medical diagnosis of diseases, including cancer,

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_4, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


pulmonary fibrosis, viral hepatitis, and many others [3–6]. PBMC cell types and subtypes have well-defined gene expression profiles that are very similar among healthy individuals, as observed by both bulk sequencing [7, 8] and single-cell transcriptomes [9]. Traditional bulk RNA sequencing measures the expression of each gene only as an average across the whole sample [8]; because the sample contains mixed cell types, it cannot label cell types and yields only a list of gene reads. In contrast, single-cell transcriptomics (SCT) technology captures gene expression of individual cells, enabling granular analysis of functions and phenotypes at the individual cell level. The expression of genes allows labeling cell types or even discovering new cell types in the studied biological samples [10–14]. Accurate classification of cells into cell types and characterization of their states is essential for single-cell analyses, especially for the study of disease, the assessment of therapeutic intervention, and many secondary analyses.

However, current cell type labeling pipelines mostly start with dimension reduction (feature extraction) combined with unsupervised clustering, followed by annotation of clusters that represent similar cells by observing gene expression of canonical markers [3–9, 15]. Unsupervised clustering requires considerable computational resources and includes manual correction of outcomes. The results are of variable accuracy and generalize poorly when applied to datasets obtained from different studies [9]. Here we propose a protocol for PBMC cell type classification using supervised machine learning (ML) models that have demonstrated high accuracy across multiple PBMC sample processing methods (PBMC extraction, bead enrichment, FACS, or MACS) [2, 16, 17].

2 Workflow

Cell type classification is essential for disease analysis. The workflow of this protocol (Fig. 1) provides steps that enable the classification of cell types within PBMC in pathological samples. The protocol for cell type classification in disease consists of three parts:

1. Data preprocessing.
2. Construction of labeled reference datasets and training of supervised ML models.
3. Labeling of new datasets.

Protocol for Classifying Pathological Single Cells


Fig. 1 The workflow for the classification of single cells from PBMC pathological samples using supervised machine learning. The workflow consists of three parts: (1) Data preprocessing. (2) Construction of reference datasets and training of supervised machine learning models. (3) Labeling of new datasets

3 The Protocol

3.1 Data Preprocessing

Data preprocessing is the most important part of cell type classification. Supervised machine learning models are data-driven models that depend on representative and high-quality data. To obtain high-quality datasets, we need to construct a pipeline to clean datasets.


The data preprocessing pipeline includes four steps:

1. Basic statistics and metadata checking.
2. Data standardization.
3. Data quality control.
4. Documentation of basic statistics and metadata.

Common SCT data sources include the following:

1. NCBI GEO: https://www.ncbi.nlm.nih.gov/geo/
2. Broad Institute: https://www.broadinstitute.org
3. 10× Genomics: https://www.10xgenomics.com
4. Human Protein Atlas: https://www.proteinatlas.org
5. Human Cell Atlas: https://www.humancellatlas.org

3.1.1 Basic Statistics

The first step of data preprocessing is downloading the datasets. The next step is the analysis of the basic properties of the obtained datasets. We employed the following basic statistical tools [9, 18, 19]:

1. Gene expression distribution visualization by violin plot (Fig. 2a).
2. Hierarchical clustering and generation of heatmaps (Fig. 2b).
3. Data distribution statistics (Fig. 2c).

Violin plots show the overall distribution of the datasets and enable direct comparison of the patterns between the subsets of data in the same study and across different studies. Heatmaps of hierarchical clusters show the overall similarities between cells within the datasets. The percentile thresholds and interquartile ranges can help find potential outliers and redefine thresholds for quality control. Standard deviation, skewness, and excess kurtosis show the patterns of the datasets as compared to a normal distribution. The basic statistics provide an overview of the datasets, especially for new disease samples with unknown data distributions. This analysis also allows comparison of dataset properties with healthy datasets or other samples representing disease.
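The distribution statistics named above can be computed in a few lines. The sketch below is one possible implementation, not the tool used by the authors: it takes the total gene count of each cell and returns the median, interquartile range, standard deviation, skewness, and excess kurtosis. The outlier fences at 1.5 times the interquartile range are a common convention that is assumed here, and the function name `distribution_summary` is hypothetical.

```python
import numpy as np

def distribution_summary(gene_counts):
    """Summary statistics of per-cell total gene counts."""
    x = np.asarray(gene_counts, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    mean, std = x.mean(), x.std()  # population standard deviation
    z = (x - mean) / std
    return {
        "median": median,
        "iqr": iqr,
        # Assumed convention: values beyond 1.5 * IQR are potential outliers
        "lower_fence": q1 - 1.5 * iqr,
        "upper_fence": q3 + 1.5 * iqr,
        "std": std,
        "skewness": np.mean(z ** 3),
        # Excess kurtosis is 0 for a normal distribution
        "excess_kurtosis": np.mean(z ** 4) - 3.0,
    }

counts = [1, 2, 3, 4, 5]  # symmetric toy values, not real gene counts
stats = distribution_summary(counts)
print(stats)
```

On real data, `gene_counts` would be the column sums of the genes-by-cells count matrix, and the fences could serve as a starting point when redefining QC thresholds.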

3.1.2 Metadata Checking

The metadata usually accompany the datasets in a separate file and are provided by the data producer. Metadata are important because they include properties of the samples (gender, age, disease phase, treatment status, etc.), sample collection and storage (fresh, frozen, low-temperature storage, room temperature), sample processing methods (tissue dissociation methods, cell sorting protocol, and additional processing such as in vitro culture, cell activation, enrichment, or cell stimulation), the settings of the SCT instrumentation, reagents, and other key information. The decision to include or discard the dataset for the next step of analysis is made judiciously after inspecting summary statistics and metadata for relevance, acceptable study design, and reasonable data distributions. The


Fig. 2 An example of gene expression data distribution statistics. These statistics include (a) violin plot visualizations, (b) heatmap of hierarchical cluster maps, and (c) data distribution summary statistics. Axis labels recovered from the figure: gene counts (thousands) and therapy days (D0, D120). The shown example is the PBMC SCT data of one CLL patient before the therapy and after 120 days of ibrutinib therapy from NCBI GEO GSE111014 [20]

datasets may be discarded because of extraordinarily high or low gene counts, unusual distributions, and uninformative heatmaps, but the decision will be made on a case-by-case basis. SCT methods show highly reproducible gene expression results when sample preparations follow standard operating procedures (SOP) and consistent sample processing methods [21]. It is important to note that a seemingly small change during the SCT experiment such as updating reagent kits (e.g., chemistry v2 vs. v3 for 10× Chromium,


support.10xgenomics.com) may produce significant changes in gene expression profiles [9]. Cell activation causes profound changes in gene expression, to the extent that activated cells are not directly comparable with other samples. Cell sorting methods such as fluorescence-activated cell sorting (FACS) or magnetic-activated cell sorting (MACS) separate cells by type using protein marker antibodies. Cell sorting also significantly changes gene expression profiles.

3.1.3 Standardization

SCT data files appear in the form of sparse matrices, and they can be encoded in several formats, including Text File (TXT), Comma-separated Values (CSV), Tab-separated Values (TSV), Hierarchical Data Format (H5), and Sparse Matrix Format (MTX) (see [22] and Note 1). After checking the basic statistics and metadata, we need to convert each dataset to the standardized format: a matrix indexed by the defined common gene list. Datasets from different studies may use different gene lists, in which the number, order, and names of genes may differ. The human reference genomes commonly used in 10× Genomics are GRCh37 (hg19), containing 32,738 genes, and GRCh38 (hg38), containing 33,694 genes. In addition, the gene lists in datasets may be subsets of hg19 or hg38 or may contain probes of additional genes. This creates problems with the comparison of datasets, their integration into a common study, statistical analyses, and the building of classification models. To prevent these problems, we created a list of 30,698 common genes. The conversion of datasets to the common gene list includes the following:

1. The Ensembl gene ID (https://www.ensembl.org) is attached to each gene name. Those IDs are unique and do not change, even if the gene names are updated [23].
2. Gene names from input data files are converted to hg38 names.
3. Only the genes from the common gene list are selected, while other genes are removed.
4. Gene lists together with the corresponding gene counts are sorted in ascending order of Ensembl IDs (unique IDs).
5. If a gene is missing in the input data, the corresponding gene count values are set to zero for all single cells in the input dataset.

After standardization, the number, order, and names of the genes are the same as in the common gene list (Fig. 3), and the datasets are thus directly comparable.
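The conversion steps listed above map naturally onto a pandas reindexing operation. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function `standardize`, the gene names, and the IDs are placeholders (they are not real Ensembl IDs), the name-to-ID table is assumed to be available as a dictionary, and each ID is assumed to occur only once in the input.

```python
import pandas as pd

def standardize(counts, name_to_id, common_ids):
    """Map a genes-by-cells count table onto the common gene list.

    counts      : DataFrame, rows = gene names, columns = cells
    name_to_id  : dict mapping gene name -> unique gene ID
    common_ids  : iterable of the unique IDs forming the common gene list
    """
    # Attach a unique ID to every gene name; names without an ID become NaN
    ids = pd.Series(counts.index, index=counts.index).map(name_to_id)
    known = ids.notna().to_numpy()
    with_ids = counts.loc[known].set_axis(ids[known].to_numpy(), axis=0)
    # Keep only common genes, sort by ID, and fill missing genes with zeros
    return with_ids.reindex(sorted(common_ids), fill_value=0)

# Placeholder names and IDs, for illustration only
counts = pd.DataFrame(
    {"cell_1": [5, 0, 2], "cell_2": [1, 3, 0]},
    index=["GENE_B", "GENE_X", "GENE_A"],
)
name_to_id = {"GENE_A": "ID0001", "GENE_B": "ID0002", "GENE_X": "ID0099"}
common_ids = ["ID0003", "ID0001", "ID0002"]

standardized = standardize(counts, name_to_id, common_ids)
print(standardized)
```

In this toy case, GENE_X is dropped because its ID is not in the common list, and ID0003 is added as a row of zeros because it is missing from the input.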

3.1.4 Quality Control

The next operation is quality control (QC). QC is used to remove low-quality cells present in the standardized datasets. Cells of poor quality will directly affect the accuracy of cell labeling and cell classification. Quality control metrics include the range of gene counts, number of expressed genes, and expression level of


Fig. 3 Data standardization workflow. A standard gene list is the common gene list across all datasets that are used. A unique ID is used to link the gene lists in input files and the common gene list. Each gene name in the new dataset is replaced by its corresponding unique ID and checked against the common gene list. If a gene is not present in the common gene list, the gene and the corresponding row of gene counts in the sparse matrix are removed. If a gene is in the common gene list but not in the input data file, the corresponding gene counts are set to zero. After the standardization, the number, order, and names of the genes are the same in standardized files as in the common gene list

mitochondrial genes, heat shock genes, and ribosomal genes. The gene counts and the number of expressed genes should lie between lower and upper thresholds to remove noise (empty data, doublets, and cells with too low or too high gene counts). The expression levels of mitochondrial genes, heat shock genes, and ribosomal genes are used to remove dead cells. Commonly used removal thresholds are 10% for mitochondrial genes, 8% for heat shock genes, and 10% for ribosomal genes [24–27]. In healthy samples, the range of gene counts is between 500 and 6000, and the number of expressed genes is between 300 and 2000. However, for disease samples, the ranges of gene counts and the number of expressed genes are much more complex, and the ranges should be defined by the features of the studied disease. For example, PBMC in chronic lymphocytic leukemia (CLL) patients show much higher gene expression than healthy PBMC, so in CLL the upper bound of the QC range for gene counts needs to be set higher than for healthy PBMC. On the other hand, in CLL patients responding to ibrutinib therapy, PBMC shut down molecular pathways, silencing gene expression, so the lower bound of the QC range for gene counts needs to be decreased relative to healthy PBMC. Checking the purity of cell types is a crucial QC step since some samples may be mixed: they may contain other cell types that were not discarded during sample processing. For example, the most


Fig. 4 Data quality control workflow. Quality control metrics include checking ranges of gene counts and the numbers of expressed genes. Furthermore, cells with high expression levels of mitochondrial genes, heat shock genes, and ribosomal genes need to be removed. Datasets that are of low purity should be discarded because they are not the target cells but will be wrongly labelled as the target cell type

common way to separate PBMC from whole blood is density gradient centrifugation. However, some red blood cells may remain in the PBMC layer because of a short centrifugation time or low centrifugation speed. By checking the expression of canonical red blood cell genes, such as hemoglobin genes, we can eliminate these contaminating cells. The steps of QC are shown in Fig. 4. To make the


QC in disease samples more efficient, using disease-related genes to help assess the quality of gene expression measurement is recommended.

3.1.5 Documenting Statistics and Metadata

After QC, the statistical analyses described in Subheading 3.1.1 should be applied to the datasets that have passed QC. The statistics and metadata need to be documented so that the differences caused by the disease or by different sample processing methods are noted and used for the interpretation of results. The key metadata that should be included are the source of the datasets, type of sample, sample information, SCT experiment settings, data generation process, basic statistics, and dates of the dataset. The sample type should include organism, tissue, and cell type (or mixed cell types). The same cell type will show different gene expression profiles when taken from different tissues and organs. The sample information includes the donor information and health status. Details such as age, gender, health condition, disease phase, treatment methods, therapy outcome, and others are crucial information that may help identify patterns that correlate with different cell states. The description of experiment settings includes the processing and storage conditions of the sample. Data generation converts the RNA sequencing results into gene reads. The QC methods are used to remove noise and low-quality cells from the analysis. The statistics include the number of cells in the original dataset, the number of cells after our QC filter, and the QC pass rate. The dates include the time when the data were first published, the last update of the dataset, and the time we collected them. This information is important because data updates are common, owing to error fixes or updates by the authors (Table 1).
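The per-cell filters of Subheading 3.1.4 and the QC pass rate documented here can be sketched as a boolean mask over cells. The default thresholds below are the values quoted in the text for healthy samples; whether the bounds are inclusive is an assumption, as are the function name `qc_mask` and the way gene families are passed in as boolean row masks. The purity check for other cell types is not included.

```python
import numpy as np

def qc_mask(counts, mito, heat_shock, ribo,
            count_range=(500, 6000), gene_range=(300, 2000),
            max_mito=0.10, max_heat_shock=0.08, max_ribo=0.10):
    """Return a boolean array marking the cells that pass QC.

    counts                 : genes x cells array of raw counts
    mito, heat_shock, ribo : boolean arrays marking the rows of each gene family
    """
    total = counts.sum(axis=0)
    n_expressed = (counts > 0).sum(axis=0)
    safe_total = np.maximum(total, 1)  # avoids division by zero for empty cells

    def fraction(rows):
        return counts[rows].sum(axis=0) / safe_total

    return (
        (total >= count_range[0]) & (total <= count_range[1])
        & (n_expressed >= gene_range[0]) & (n_expressed <= gene_range[1])
        & (fraction(mito) < max_mito)
        & (fraction(heat_shock) < max_heat_shock)
        & (fraction(ribo) < max_ribo)
    )

# Toy matrix: 4 genes x 3 cells (rows: mitochondrial, heat shock, ribosomal, other)
counts = np.array([[5, 0, 50],
                   [0, 0, 1],
                   [5, 0, 2],
                   [90, 0, 47]])
mito = np.array([True, False, False, False])
heat_shock = np.array([False, True, False, False])
ribo = np.array([False, False, True, False])

# Ranges are scaled down to suit the toy matrix
mask = qc_mask(counts, mito, heat_shock, ribo,
               count_range=(10, 1000), gene_range=(2, 10))
pass_rate = mask.sum() / mask.size
print(mask, pass_rate)
```

In the toy example, the second cell is empty and the third has a mitochondrial fraction of 50%, so only the first cell passes. For disease samples, the ranges would be adapted to the studied disease, as discussed in Subheading 3.1.4.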

3.2 Cell-Type Labeling and Supervised Machine Learning

To comprehensively label PBMC cell types in diseases, the datasets must be of good quality and representative of the studied cells, tissues, or organism status. Whenever possible, datasets should also be produced using the minimal number of sample processing steps, because each additional step distorts the gene expression measurement relative to the sample in the organism. A supervised prediction model must generalize well across different studies and different sample processing methods. Datasets representative of the selected sample processing methods and various cell states should be integrated into joint training sets that account for biological, technical, and experimental design variation. Training useful supervised ML models requires high-quality reference datasets that are representative of data variability and whose cell labels are highly accurate (>99%).


Table 1 The metadata components

Type                  Component                  Description or examples
Study ID              Source                     The source of this dataset
                      Series ID (a)              The series ID, if available
                      Sample ID (a)              The sample ID, if available
Sample type           Organism                   Homo sapiens or others
                      Tissue                     Blood or bladder or other
                      Cell type                  The cell type or mixed cell types
Sample information    Donor information          Donor ID, gender, age, etc.
                      Health status              Healthy, in disease, under treatment, etc.
Experiment settings   Sample condition           Frozen, fresh, etc.
                      Sample processing          Dissociation methods, stimulation, culture, cell sorting, etc.
                      Chemistry version (b)      The version of the 10× Genomics chemistry
                      Sequencing machine         The type of RNA sequencing machine (e.g., Illumina NextSeq500)
Data generation       Cell Ranger (b)            The 10× Genomics mapping software from sequencing to genes
                      Genome                     The version of the genome (e.g., GRCh37, GRCh38)
                      QC methods                 The quality control methods used by data provider
Statistic             Number of cells            Number of cells from the given dataset
                      Number of cells after QC   Number of cells after our quality control
                      QC pass rate               Number of cells after QC divided by number of cells
Others                Comment                    The additional information
Date                  First publish date         First publish date
                      Last update date           Last update date
                      Collect date               Collect date

QC Quality control
(a) Series ID and Sample ID are commonly used in the data published in NCBI GEO database
(b) Chemistry version and Cell Ranger are the chemistry and software used in 10× Genomics SCT machine, other SCT machines may use different terms. Other SCT platforms may have other components

3.2.1 Protein Marker– Based Prediction and Hierarchical Clustering

Current methods of choice for the detection of cell types include experimental separation (bead enrichment or cell sorting) followed by SCT or, alternatively, separation of PBMC by centrifugation combined with unsupervised clustering and labeling of clusters by detecting the presence of canonical gene markers. Unsupervised clustering with canonical markers typically achieves lower accuracy than supervised methods. In our experiment (data not shown), the accuracy of unsupervised clustering was ~85%. In disease datasets, the conditions are more complex because the overall gene expression profiles of cells show significant changes. Protein markers are molecules that are specific for particular cell types; protein markers can be


found on the cell surface or are secreted by a specific type of cell. The mRNA of protein markers may degrade after translation, be absent from a cell where the protein is expressed, or be present in low copy numbers. Nevertheless, for most PBMC cell type-specific protein markers, we observed that the corresponding mRNA was expressed, indicating that protein markers can be used to classify cell types from SCT data. We can calculate weighted means of the total gene counts of selected protein markers for each PBMC subtype. Each cell is assigned the subtype that shows the largest weighted mean expression of genes representing cell type-specific protein markers. When the largest weighted mean is shared by two or more subtypes, so that the best cell subtype cannot be determined, we set the cell type to “Ambiguous.” We demonstrated that protein-marker-based prediction could give an accuracy of 91.8% in healthy PBMC datasets across multiple sources [17]. In addition to the conventional protein markers found in healthy PBMC, disease-specific markers can increase the quality of cell type labeling in disease. Our preliminary analysis of the expression of protein marker-coding genes indicates a relatively high accuracy of cell type classification in disease PBMC samples. Hierarchical clustering of cells provides additional evidence of correct cell type classification. Cells in the same cluster show high similarity within their shared group (see Note 2). Hierarchical clustering, in combination with protein-marker-based prediction, offers high-confidence PBMC cell-type labeling results.

3.2.2 Other Supervised Machine Learning Methods

Labeling of PBMC cell types in disease can also be assisted by other supervised ML methods. Profile-based prediction built using PBMC profiles of healthy cells [9] may be useful as it predicts cell types with an accuracy of ~90% [28]. Artificial neural networks (ANN) could predict healthy PBMC cell types with ~95% accuracy [16, 29]. We combined the methods of profile-based prediction, ANN-based prediction, protein-marker-based prediction, and RNA-marker-based prediction. An overall classification accuracy of 91.8% was achieved across all the datasets representing healthy PBMC states. More importantly, the combination of the four prediction methods provided more confidence in the prediction results. The analysis of possible mislabeling allowed cell type re-labeling and achieved a classification accuracy of ~97% [17]. Similar methods can be used to label cell types in disease samples, but they need to be applied iteratively and are likely to be study specific.
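Of the four methods combined here, the protein-marker-based scoring of Subheading 3.2.1 is described in enough detail to sketch. The code below assigns each cell the type with the largest weighted mean of marker gene counts and returns "Ambiguous" on a tie. The marker panel and the equal weights are illustrative assumptions only: CD19 and MS4A1, CD3D and CD3E, and CD14 and LYZ are widely used markers for B cells, T cells, and monocytes, but they are not the panel or the weights used by the authors.

```python
# Illustrative marker panel with equal weights (an assumption, see lead-in)
MARKERS = {
    "BC": {"CD19": 1.0, "MS4A1": 1.0},
    "TC": {"CD3D": 1.0, "CD3E": 1.0},
    "MC": {"CD14": 1.0, "LYZ": 1.0},
}

def label_cell(expression, markers=MARKERS):
    """Assign a cell type by the largest weighted mean of marker gene counts.

    expression : dict mapping gene name -> count for one cell
    markers    : dict mapping cell type -> dict of marker gene -> weight
    """
    scores = {}
    for cell_type, weights in markers.items():
        weighted_sum = sum(w * expression.get(g, 0) for g, w in weights.items())
        scores[cell_type] = weighted_sum / sum(weights.values())
    best = max(scores.values())
    winners = [t for t, s in scores.items() if s == best]
    # A shared maximum means the best subtype cannot be determined
    return winners[0] if len(winners) == 1 else "Ambiguous"

print(label_cell({"CD3D": 4, "CD3E": 2}))   # clear T cell signal
print(label_cell({"CD19": 3, "CD14": 3}))   # tie between BC and MC
print(label_cell({}))                       # no marker expressed
```

A cell that expresses none of the markers scores zero for every type and is therefore also labeled "Ambiguous", which keeps uninformative cells out of the reference labels.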

3.2.3 Supervised ML Models

Once the PBMC disease cell types are labeled, the datasets combined with their metadata can be used as reference datasets. By using reference datasets and multiple supervised ML methods, we can train and validate these ML models (see Note 3). The reference


datasets are divided into three groups: a training set, a validation set, and a testing set. The training set is used to train the supervised ML models that learn the patterns embedded in gene expression data. The validation set is used to control the training process and prevent overfitting of the model. The testing set is used to assess the accuracy of trained ML models, particularly their ability to classify cell types accurately on previously unseen datasets. Because SCT is sensitive to experimental conditions and sample processing, multiple datasets from disease samples across multiple experimental and sample processing conditions are included in the training datasets. This promotes good generalization properties and high classification accuracy of supervised ML models. The information available in metadata could explain why low accuracy occurs in some testing sets.

3.3 Prediction of Cell Types by Supervised Machine Learning

After the supervised ML models are validated, we can label the cell types in the new target datasets. The target datasets need to be preprocessed to standardize the format, check whether data quality is satisfactory, and establish that the target dataset is suitable for analysis and classification. Prediction results for disease datasets should be checked against reference datasets of the same disease to establish whether the proportions of cell types are similar to the expected values. If there is a large difference in the proportions of cell types, manual correction of cell type labels is needed. Checking metadata can help explain the differences between expected and predicted cell type proportions.
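The proportion check described above can be automated with a simple comparison. In the sketch below, the expected ranges are the healthy PBMC ranges quoted in the Introduction, used only as a stand-in; for a disease dataset, they would be replaced by ranges derived from reference datasets of the same disease. The function name `flag_unexpected` and the toy predictions are hypothetical.

```python
from collections import Counter

# Healthy PBMC ranges from the Introduction, used here only as a stand-in
# for the expected proportions of a disease reference dataset
EXPECTED = {"BC": (0.05, 0.15), "DC": (0.01, 0.02), "MC": (0.10, 0.30),
            "NK": (0.05, 0.10), "TC": (0.40, 0.70)}

def flag_unexpected(labels, expected=EXPECTED):
    """Return {cell type: observed proportion} for types outside their range."""
    counts = Counter(labels)
    n = len(labels)
    flagged = {}
    for cell_type, (low, high) in expected.items():
        proportion = counts.get(cell_type, 0) / n
        if not low <= proportion <= high:
            flagged[cell_type] = proportion
    return flagged

# Toy predictions for 100 cells
predicted = ["TC"] * 47 + ["MC"] * 20 + ["BC"] * 25 + ["NK"] * 8
flagged = flag_unexpected(predicted)
print(flagged)
```

Here B cells are over-represented and no dendritic cells were predicted, so both types are flagged for manual inspection together with the metadata of the sample.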

3.4 Discussion

Labeling cell types is an early yet crucial step in the SCT research of samples representing pathologies. However, the single-cell gene measurement results can be affected by the nature of samples, the conditions of sample collection and processing, disease phases, and the experimental processing and sequencing methods. To develop ML models for accurate and robust classification of PBMC cell types in diseases, we need a significant number of high-quality representative datasets from multiple SCT studies. Currently used unsupervised clustering methods do not generalize well and are, therefore, unsuitable for automation. On the other hand, supervised ML methods can handle a large volume of datasets and can be retrained with new datasets [2, 16, 29]. By following the proposed protocol steps for data preprocessing, labeling reference datasets, and training supervised ML models (Fig. 1), we demonstrated an efficient way to classify PBMC cell types in diseases with high accuracy. Because of the diversity of disease phenotypes, personalized ML models may be needed, as opposed to healthy PBMC profiles that are highly reproducible across individuals.

4 Notes

1. One serious problem may occur when opening CSV, TXT, and TSV data files with Microsoft Excel: the gene names may be auto-formatted into month-year format. In most data files, the columns represent cells and the rows represent genes; in some data files, however, the data are transposed. We have chosen the column-cell format as standard.
2. When performing hierarchical clustering on disease data, a fixed number of clusters may not produce the best results. We recommend adjusting the target number of clusters after analyzing the similarities between clusters.
3. Unbalanced training sets, with low representation of a particular cell type, will cause low precision on testing data. The best way of resolving this issue is to add new data to the datasets.
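The two checks in Note 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the date-like patterns (for example "1-Mar" or "Sep-2") are common forms of spreadsheet auto-formatting rather than an exhaustive list, and the helper names and the `known_genes` set are hypothetical.

```python
import re

# Assumed forms of spreadsheet-mangled gene symbols, e.g. "1-Mar" or "Sep-2".
_MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
_DATE_LIKE = re.compile(
    rf"^(\d{{1,2}}-({_MONTHS})|({_MONTHS})-\d{{1,2}})$", re.IGNORECASE
)


def find_date_like_names(gene_names):
    """Return the gene names that look like auto-formatted dates."""
    return [name for name in gene_names if _DATE_LIKE.match(name.strip())]


def to_gene_rows(matrix, row_labels, col_labels, known_genes):
    """Return (matrix, genes, cells) with genes as rows and cells as columns.

    The matrix is transposed when more column labels than row labels are
    recognized gene symbols (a simple heuristic, not a guarantee).
    """
    row_hits = sum(label in known_genes for label in row_labels)
    col_hits = sum(label in known_genes for label in col_labels)
    if col_hits > row_hits:
        transposed = [list(column) for column in zip(*matrix)]
        return transposed, col_labels, row_labels
    return matrix, row_labels, col_labels


suspicious = find_date_like_names(["CD3E", "1-Mar", "Sep-2", "MS4A1"])

# A toy file stored with cells as rows: it is flipped to the standard format.
fixed, genes, cells = to_gene_rows(
    [[5, 0, 2], [1, 3, 0]],
    ["cell1", "cell2"],
    ["CD3E", "CD19", "CD14"],
    known_genes={"CD3E", "CD19", "CD14"},
)
```

Names flagged this way need to be restored from the original file, because the auto-formatted value alone does not always identify the original gene symbol.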

Acknowledgments

DK received funding from the Division of Cancer Epidemiology and Genetics, National Cancer Institute (R21 CA216772-01A1) and from National Cancer Institute (SPORE-2P50CA10194211A1). VB received Ningbo Municipal Bureau of Science and Technology Grant (2019F1028).

References

1. Verhoeckx K, Cotter P, López-Expósito I, Kleiveland C, Lea T, Mackie A, Requena T, Swiatecka D, Wichers H (2015) The impact of food bioactives on health: in vitro and ex vivo models. Springer, Cham
2. Shaikh RA, Zhong J, Lyu M, Lin S, Keskin D, Zhang G, Chitkushev L, Brusic V (2019) Classification of five cell types from PBMC samples using single cell transcriptomics and artificial neural networks. In: 2019 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, pp 2207–2213
3. Baine MJ, Chakraborty S, Smith LM, Mallya K, Sasson AR, Brand RE, Batra SK (2011) Transcriptional profiling of peripheral blood mononuclear cells in pancreatic cancer patients identifies novel genes with potential diagnostic utility. PLoS One 6(2):e17014
4. Wang W-S, Liu L-X, Li G-P, Chen Y, Li C-Y, Jin D-Y, Wang X-L (2013) Combined serum CA19-9 and miR-27a-3p in peripheral blood mononuclear cells to diagnose pancreatic cancer. Cancer Prev Res 6(4):331–338
5. Scott MK, Quinn K, Li Q, Carroll R, Warsinske H, Vallania F, Chen S, Carns MA, Aren K, Sun J (2019) Increased monocyte count as a cellular biomarker for poor outcomes in fibrotic diseases: a retrospective, multicentre cohort study. Lancet Respir Med 7(6):497–508
6. El-Awady MK, Ismail SM, El-Sagheer M, Sabour YA, Amr KS, Zaki EA (1999) Assay for hepatitis C virus in peripheral blood mononuclear cells enhances sensitivity of diagnosis and monitoring of HCV-associated hepatitis. Clin Chim Acta 283(1–2):1–14
7. Monaco G, Lee B, Xu W, Mustafah S, Hwang YY, Carre C, Burdin N, Visan L, Ceccarelli M, Poidinger M (2019) RNA-Seq signatures normalized by mRNA abundance allow absolute deconvolution of human immune cell types. Cell Rep 26(6):1627–40.e1627
8. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7(3):562–578
9. Yang L, Zhang Y, Mitic N, Keskin DB, Zhang GL, Chitkushev L, Brusic V (2020) Single-cell mRNA profiles in PBMC. In: 2020 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, pp 1318–1323
10. Bakken T, Cowell L, Aevermann BD, Novotny M, Hodge R, Miller JA, Lee A, Chang I, McCorrison J, Pulendran B (2017) Cell type discovery and representation in the era of high-content single cell phenotyping. BMC Bioinf 18(17):7–16
11. Arendt D, Musser JM, Baker CV, Bergman A, Cepko C, Erwin DH, Pavlicev M, Schlosser G, Widder S, Laubichler MD (2016) The origin and evolution of cell types. Nat Rev Genet 17(12):744–757
12. Morris SA (2019) The evolving concept of cell identity in the single cell era. Development 146(12):dev169748
13. Kim HJ, Tam PP, Yang P (2021) Defining cell identity beyond the premise of differential gene expression. Cell Regen 10(1):1–3
14. Savulescu AF, Jacobs C, Negishi Y, Davignon L, Mhlanga MM (2020) Pinpointing cell identity in time and space. Front Mol Biosci 7:209
15. Lähnemann D, Köster J, Szczurek E, McCarthy DJ, Hicks SC, Robinson MD, Vallejos CA, Campbell KR, Beerenwinkel N, Mahfouz A (2020) Eleven grand challenges in single-cell data science. Genome Biol 21(1):1–35
16. Lin X, Zhong J, Lyu M, Lin S, Keskin DB, Zhang G, Brusic V, Chitkushev LT (2020) Artificial neural network system for cell classification using single cell RNA expression. In: 2020 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, pp 1253–1257
17. Lyu M, Zhang Y, Yang L, Lin X, Li Y, Jin H, Bellotti AG, Mitic N, Brusic V (2021) PBMC cell classification from single cell mRNA expression by artificial neural networks, profiles, gene markers, and protein markers. In: 2021 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, pp 3285–3290
18. Deng Y, Bao F, Dai Q, Wu LF, Altschuler SJ (2019) Scalable analysis of cell-type composition from single-cell transcriptomics using deep recurrent learning. Nat Methods 16(4):311–314
19. Karaiskos N, Rahmatollahi M, Boltengagen A, Liu H, Hoehne M, Rinschen M, Schermer B, Benzing T, Rajewsky N, Kocks C (2018) A single-cell transcriptome atlas of the mouse glomerulus. J Am Soc Nephrol 29(8):2060–2068
20. Rendeiro AF, Krausgruber T, Fortelny N, Zhao F, Penz T, Farlik M, Schuster LC, Nemc A, Tasnády S, Réti M (2020) Chromatin mapping and single-cell immune profiling define the temporal dynamics of ibrutinib response in CLL. Nat Commun 11(1):1–14
21. Ding J, Adiconis X, Simmons SK, Kowalczyk MS, Hession CC, Marjanovic ND, Hughes TK, Wadsworth MH, Burks T, Nguyen LT (2020) Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nat Biotechnol 38(6):737–746
22. Zhang Y, Luning Y, Brusic V (2020) Automation of gene expression profile analysis in single cell data. In: 2020 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, pp 1329–1334
23. Birney E, Andrews TD, Bevan P, Caccamo M, Chen Y, Clarke L, Coates G, Cuff J, Curwen V, Cutts T (2004) An overview of Ensembl. Genome Res 14(5):925–928
24. Lun AT, McCarthy DJ, Marioni JC (2016) A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor. F1000Res 5:2122
25. Agarwal D, Sandor C, Volpato V, Caffrey TM, Monzón-Sandoval J, Bowden R, Alegre-Abarrategui J, Wade-Martins R, Webber C (2020) A single-cell atlas of the human substantia nigra reveals cell-specific pathways associated with neurological disorders. Nat Commun 11(1):1–11
26. Liu W, Venugopal S, Majid S, Ahn IS, Diamante G, Hong J, Yang X, Chandler SH (2020) Single-cell RNA-seq analysis of the brainstem of mutant SOD1 mice reveals perturbed cell types and pathways of amyotrophic lateral sclerosis. Neurobiol Dis 141:104877
27. Vicidomini R, Nguyen TH, Choudhury SD, Brody T, Serpe M (2021) Assembly and exploration of a single cell atlas of the Drosophila larval ventral cord. Identification of rare cell types. Curr Protoc 1(2):e37
28. Luning Y, Zhang Y, Mitic N, Keskin DB, Zhang GL, Chitkushev L, Brusic V (2020) Prediction of PBMC cell types using scRNAseq reference profiles. In: 2020 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, pp 1324–1328
29. Zhong J, Shaikh RA, Haoguo W, Xin L, Zhiwei C, Chitkushev LT, Zhang G, Keskin DB, Brusic V (2020) Classification of PBMC cell types using scRNAseq, ANN, and incremental learning. In: 2020 IEEE international conference on bioinformatics and biomedicine (BIBM). IEEE, pp 1351–1355

Chapter 5

Unbiased, High-Throughput Identification of T Cell Epitopes by ELISPOT

Paul V. Lehmann, Diana R. Roen, and Alexander A. Lehmann

Abstract

Recent systematic immune monitoring efforts suggest that, in humans, epitope recognition by T cells is far more complex than has been assumed based on minimalistic murine models. The increased complexity is due to the higher number of HLA loci in humans, the typical heterozygosity for these loci in the outbred population, and the high number of peptides that each HLA restriction element can bind with an affinity that suffices for antigen presentation. The sizable array of potential epitopes on any given antigen is due to each individual’s unique HLA allele makeup. Of this individualized potential epitope space, chance events occurring in the course of the T cell response determine which epitopes induce dominant T cell expansions. Establishing the actually-engaged T cell repertoire in each human subject, including the individualized peptides targeted, therefore requires the systematic testing of all peptides that constitute the potential epitope space in that person. The goal of comprehensive, high-throughput epitope mapping can be readily established by the methods described in this chapter.

Key words Elispot, Fluorospot, CD4 T cell, CD8 T cell, Epitope prediction, T cell determinant, Immune monitoring

1 Introduction

The B and T lymphocyte systems have evolved to cooperate in highly specific antigen recognition. To do so, B cells and T cells rely on two fundamentally different criteria for telling antigens apart among the millions of structures foreign to the body to which they must respond on demand, and a similar number of self-antigens that they must neglect [1]. B cells assess the form and the 3-dimensional shape of native antigens: B cell receptors (BCR) bind to complementary surfaces on the antigen, small areas called B cell determinants; however, they do so without discriminating the antigen’s chemical nature as a protein, sugar, nucleic acid, etc. T cells, in contrast, recognize proteins only, discriminating between them solely based on their unique amino acid sequence. For this to occur, short peptide fragments of the protein need to

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_5, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


Paul V. Lehmann et al.

bind to specialized antigen-presenting molecules, so-called major histocompatibility complex (MHC) molecules. The antigen-derived peptide is accommodated in the “peptide-binding groove” of these MHC molecules (which are called HLA molecules in humans), whereby the unique peptide-binding motif of each MHC/HLA allele defines which peptide is bound, and which is not. The T cell antigen receptor (TCR) binds this binary structure consisting of the nominal peptide fragment of the antigen aligned on an MHC/HLA molecule for antigen recognition. B cell responses that lead to production of IgG, IgA, or IgE antibodies strictly depend on T cell help, that is, the verification of the antigen’s identity via its amino acid sequence by T cells. The precise identification of an antigen via its amino acid sequence is at the heart of self/non-self discrimination by T cells, and there has been strong evolutionary pressure to optimize this strategy [2]. Consequently, the antigen-presenting MHC/HLA molecule system evolved to be polygenic (there are three major Class I and three Class II loci for humans) in addition to being one of the most polymorphic gene systems known. The human genome contains hundreds of alleles for these loci, whereby each of the alleles encodes MHC/HLA molecules that differ in their peptide-binding groove, affording each of them a unique peptide-binding motif and thus peptide presentation property. While certain HLA alleles are more frequent in some human populations than in others, for example the A2 allele in Caucasians, even within the A2-positive subpopulation there are hardly two individuals who would also match in the other five Class I alleles they express, and consequently, there are hardly two humans on this planet who would present the identical set of peptides for their T cells to recognize.
The MHC/HLA system has evolved to make antigen recognition unique in each individual, thereby minimizing dangerous consequences of failures in self/non-self recognition. By mutation, pathogens could easily evade T cells if the potential epitope space was limited in individuals. This danger is minimized for the individual by the presence of (in the typical case of heterozygosity for all three loci) six HLA Class I and six Class II molecules, each with different peptide-binding motifs. And even if the unique HLA molecule set expressed by a certain individual does not convey T cell high responder status to an antigen, this can endanger only that subject, but not other individuals who express different sets of HLA molecules. Due to HLA polygenism and polymorphism, the array of antigenic peptides presented to T cells is highly diverse across populations and each individual has a unique potential epitope space. The peptide-binding motifs for the individual allelic HLA molecules are increasingly well defined, and it is now possible to predict in silico the peptides of an antigen that will be presented by a given HLA molecule [3]. Individuals who share an HLA allele can be

T Cell Epitope Mapping


expected to share a fraction of their potential epitope space (but will differ in the potential epitope space defined by all the other, non-overlapping HLA alleles expressed in the individual). Consequently, one might expect that individuals who share an HLA allele will predictably respond to an overlapping set of peptides. This notion, originating in reductionist studies of simple model antigens in inbred mice (expressing minimal MHC diversity), does not even hold up for F1 mice: while in such mice the same set of peptides is selected for antigen presentation due to their shared MHC makeup, their T cells typically show individual, unpredictable “aleatory” response patterns to these peptides [4]. Aleatory recognition likely results from the low frequency of antigen-reactive T cells being linked to fate decisions dictated by chance events [5–7]. Systematic epitope mapping showed that aleatory epitope recognition by T cells also prevails in humans, even in HLA-A2 allele-matched subjects, resulting in highly divergent and individualized T cell response patterns [8, 9]; see also Fig. 1. Therefore, while the potential epitope space of an individual can be predicted in silico, based on the peptide-binding motifs of the allelic HLA molecules present in an individual, the minor subset of such peptides that actually triggers a dominant T cell response cannot be predicted. The latter constitutes the individual’s expressed epitope space, being a minor fraction of the individual’s potential epitope space. The expressed epitope space consists of the individual peptides of an antigen that the T cell system actually targets in an individual (highlighted in color in Fig. 1). The number of such peptides depends on the size of the protein. The number (frequency) of T cells present in the blood that are specific for a given peptide gives insights into the extent of clonal expansions that the cells with this peptide specificity have undergone during the immune response.
Within an individual, different peptides of the antigen can elicit strong, medium, or weak T cell expansions, to which we refer as dominant, subdominant, or cryptic epitopes (see the red, orange, and yellow highlights in Fig. 1). The total number of T cells recognizing all expressed epitopes in an individual defines the magnitude of the T cell immunity against the antigen. The cytokine signature of these antigen-specific T cells provides insights about the quality of T cell-mediated immunity that becomes shaped by an instructed differentiation process in the course of the immune response resulting in the generation of T cells with different effector functions [10]. T cell immune monitoring aims at assessing the magnitude and quality of the antigen-specific T cell compartment in an individual. Its successful implementation depends on covering the entire expressed epitope space of each individual. Confining T cell immune monitoring to select peptides only, as it has been frequently done, for example, to one or a few in silico predicted, or previously established epitopes, runs a great risk of

Fig. 1 Example of aleatory epitope recognition by CD8+ T cells in HLA allele-matched individuals. Ten HCMV-positive individuals, all expressing the HLA-A02:01 allele, were tested for T cell reactivity against 553 peptides that systematically cover the sequence of the HCMV pp65 protein (as illustrated in Fig. 2). The numbers of T cells responding by IFN-γ production, normalized to 1,000,000 PBMC tested by ImmunoSpot®, are shown. High-frequency (“dominant”) recall responses are highlighted in red, intermediate “subdominant” responses in orange, and statistically significant but weak “cryptic” responses in yellow. Only the peptides that elicited a positive response in at least one of the test subjects are listed. Peptides with predicted binding for the shared HLA-A02:01 allele are highlighted in green, with the binding score and rank specified according to the predicted binding. Of note: peptide 495–503 is the only one that is co-dominant in several (5 of 10) of these subjects, being just one of several immune dominant epitopes in most individuals. This 495–503 peptide might be unique, however: while it ranks highest for predicted HLA-A02:01 binding, it also ranks highest for predicted binding for many other alleles expressed by these individuals. (The data are adapted from and are fully described in [9])


underrepresenting—or even missing—the antigen-specific T cell pool (Fig. 1). The peptide(s) chosen might be just one of several dominant epitopes recognized in that individual, or, worse, might be recognized in a subdominant, or cryptic fashion, or not at all, while other peptides dominate. One theoretical solution would be to narrow in on testing all peptides that encompass the potential epitope space for an individual [11]. This would require making in silico predictions for all HLA alleles expressed in each individual, and custom preparation of matching peptide libraries for each subject. Recent studies suggest, however, that it is not necessarily the peptides with highest predicted binding scores that constitute the actually recognized epitopes, but rather that immune dominant epitopes frequently rank downstream in the predicted binding hierarchy [5, 9, 12–14]—see also in Fig. 1 the predicted peptides (highlighted in green) vs. those that actually elicited CD8+ T cell responses. Thus, in silico prediction-based T cell monitoring would need to accommodate an estimated 30 peptides with the highest binding scores per HLA allele expressed in each test subject, that is, up to 180 customized peptides per individual for CD8+ T cell detection alone, a scope that is hardly realizable for immune monitoring purposes. Therefore, T cell immune monitoring is increasingly moving toward “agnostic testing,” an approach in which all possible peptide segments of the antigen are systematically covered (see Note 1). Agnostic testing can be done, depending upon the size of the antigen, by using several peptide pools, each of which contains up to 200 individual consecutive peptides [11]. This so-called “mega peptide pool” approach has the advantage of practicality as it is economic on white blood cell numbers needed and labor involved. 
Its disadvantage is that the many peptides compete with each other for HLA binding, whereby irrelevant (non-recognized, but HLA-binding) peptides can outcompete the binding of the relevant (recognized) peptides [15]. Thus, mega peptide pools are prone to under-represent the actual size of the engaged antigen-specific T cell population. An additional disadvantage of the mega peptide pool approach is its low resolution; that is, it does not provide information on the identity, diversity, and hierarchy of the individual epitopes recognized. Comprehensive epitope mapping [16], the subject of this chapter, overcomes all these limitations, providing the highest possible resolution on the actual antigen-specific T cell compartment operational in a given test subject. In the comprehensive single peptide epitope mapping approach, peptides from libraries that systematically cover the sequence of the antigen are tested individually (Fig. 2). Thereby every peptide of the potential epitope space is covered, one by one. Such testing involves, depending upon the size of the antigen of interest, hundreds, or even thousands, of individual peptides. In a modification, this testing can also be done in an easy-to-deconvolute matrix format (Fig. 3). Depending on the configuration of the matrix, this approach cuts down substantially on the number of test conditions (see Note 2), and thus on the number of cells needed, without losing the ability to identify the individual peptides recognized. Also, the number of peptides within a matrix is not high enough for peptide competition to be of major concern [17]. Importantly, as every possible epitope is covered, in both the single peptide and the matrix-based epitope mapping approach, testing is done “agnostically,” that is, without the need to tailor the test peptides to the HLA type of the test subject.

Fig. 2 Comprehensive epitope mapping illustrated. To test for CD8+ T cell epitope recognition, a library of 9-mer peptides is synthesized and tested. This peptide length is chosen because the peptide-binding groove of MHC/HLA Class I molecules is closed on both ends, permitting it to bind peptides 8–10 amino acids long [22]. The individual peptides walk the protein’s amino acid sequence amino acid by amino acid, thus providing complete coverage for any possible epitope

Fig. 3 Testing peptides pooled into matrix format permits identification of the individual peptides that have elicited T cell responses. A 10 × 10 subsection of a 36 × 36 matrix is shown as an example. The peptide pools specified on the horizontal axis contain the individual peptides listed in the corresponding columns, while the peptide pools specified on the vertical axis contain the individual peptides listed in the corresponding rows. As a row pool and a column pool share only a single peptide, that peptide is identified as positive if the T cells respond to both of the corresponding pools. For a detailed description of the matrix testing approach see [32]
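The peptide walk of Fig. 2 and the matrix deconvolution of Fig. 3 can be sketched together. This is a simplified illustration, not the SpotMap™ implementation: the function names, the row-by-row placement of peptides on the grid, and the toy 20-residue sequence are assumptions made for the example.

```python
def peptide_library(sequence, length=9):
    """All overlapping peptides of the given length, shifted by one residue."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]


def build_matrix_pools(peptides, side):
    """Place peptides row by row on a side x side grid.

    Each grid row and each grid column becomes one peptide pool.
    """
    if len(peptides) > side * side:
        raise ValueError("matrix too small for the library")
    row_pools = {r: [] for r in range(side)}
    col_pools = {c: [] for c in range(side)}
    for index, peptide in enumerate(peptides):
        row, col = divmod(index, side)
        row_pools[row].append(peptide)
        col_pools[col].append(peptide)
    return row_pools, col_pools


def deconvolute(positive_rows, positive_cols, peptides, side):
    """Candidate peptides at the intersections of positive row and column pools."""
    hits = []
    for row in positive_rows:
        for col in positive_cols:
            index = row * side + col
            if index < len(peptides):  # skip empty grid positions
                hits.append(peptides[index])
    return hits


toy_antigen = "ACDEFGHIKLMNPQRSTVWY"  # hypothetical 20-residue sequence
library = peptide_library(toy_antigen)  # 12 nonamers
row_pools, col_pools = build_matrix_pools(library, side=4)
candidates = deconvolute([1], [1], library, side=4)
print(candidates)
```

A protein of n residues yields n - 9 + 1 nonamers, so `len(peptide_library("A" * 561))` is 553; this matches the peptide count given for pp65 in Fig. 1 if that protein is taken to be 561 residues long, which is an assumption not stated in the text. When several row and column pools are positive at once, the intersections returned here are only candidates, and would presumably need confirmation as individual peptides.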


Until recently, comprehensive T cell epitope mapping has been limited to short antigens that involve fewer than 200 individual peptides or matrix pools. Limiting factors have been the number of peripheral blood mononuclear cells (PBMC) available for testing, and the lack of a technology that enables such testing in a streamlined, high-throughput manner, while doing so in a cost- and labor-efficient way. ELISPOT/FluoroSpot is such a technology (we refer to both collectively as ImmunoSpot®, as they differ only in the enzymatic vs. fluorescent detection of the plate-bound analyte). The assay principle is shown in Fig. 4. In ImmunoSpot®

Fig. 4 Illustration of ImmunoSpot® testing for revealing antigens/peptides that have induced T cell responses in vivo. (a) A special PVDF membrane-bottomed 96-well (or 384-well) plate is coated with a capture antibody (or, in the case of multiplexed assays, several capture antibodies) specific for the cytokine(s) to be detected. (b) Onto this sensitized membrane, the PBMC are plated in numbers to form a coherent monolayer [18], and the antigen/peptide is added. During a subsequent 24 h cell culture period, the antigen/peptide-specific T cells become activated and engage in cytokine secretion. This cytokine is captured on the membrane around each secreting T cell, being retained as a secretory footprint on the membrane after the cells are removed (c), which then is detected by addition of cytokine-specific detection antibody (d). (e) The plate-bound detection antibody is then visualized either by an enzymatic reaction creating a precipitating substrate (single- or double-color enzymatic assays that can be analyzed under white light, [23]), or by multicolor fluorescence [24]


assays, PBMC are plated at 100,000–1,000,000 PBMC per well into 96-well plates (see Note 3), but the assay can be miniaturized by using 384-well plates [18], thereby reducing by one-third the number of PBMC and peptides needed (see Note 4). Moreover, testing peptides in matrix format can cut down on the cell numbers needed. For example, 1263 peptides can be arranged into a 36 × 36 matrix generating a total of 72 peptide pools that can be tested and then deconvoluted to identify the individual peptides of interest, reducing the number of test conditions by 94% (see Notes 2 and 5). Cell numbers are therefore no longer limiting for the comprehensive testing of T cell epitope recognition of complex antigens, or even genomes (see Note 6). Neither are the cost and labor involved, as outlined below. ImmunoSpot® assays measure the number of T cells in PBMC that engage in cytokine production after recognition of their cognate antigen/peptide [19]. As naïve T cells do not produce cytokine upon their first antigen encounter, and occur in very low frequencies within PBMC, only in vivo antigen-primed, clonally expanded memory or effector T cells are detected in the typical recall assay format, in which PBMC are exposed to antigen in vitro for 24 h [20]. As these T cells secrete the cytokine, it is captured around the cell by a cytokine-specific capture antibody, leaving a secretory footprint on the membrane that is then visualized by the addition of a cytokine-specific detection antibody (Fig. 4). Typically, IFN-γ secretion by T cells is measured, but the assay can be multiplexed to detect up to four cytokines simultaneously (see Note 7), making it possible to define other effector cell lineages and the “fitness” thereof [10]. The number of spots per well is then counted by a dedicated reader in a fully automated fashion, reflecting the number of peptide-specific T cells secreting the interrogated cytokine within the number of PBMC plated into that well [21].
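The arithmetic behind the matrix example above can be checked with a short calculation. The function name is hypothetical, and the sketch assumes a square matrix in which every row and every column forms one pool, as in the 36 × 36 example.

```python
import math


def matrix_design(n_peptides):
    """Smallest square matrix that holds the library, and the resulting savings.

    Returns (side, number of pools, fractional reduction in test conditions)
    relative to testing every peptide individually.
    """
    if n_peptides < 1:
        raise ValueError("at least one peptide is required")
    side = math.isqrt(n_peptides - 1) + 1  # smallest side with side * side >= n
    n_pools = 2 * side                     # one pool per row plus one per column
    reduction = 1.0 - n_pools / n_peptides
    return side, n_pools, reduction


side, n_pools, reduction = matrix_design(1263)
print(side, n_pools, f"{reduction:.1%}")
```

For 1263 peptides this gives a 36 × 36 matrix, 72 pools, and a reduction of about 94.3%, in line with the figures quoted in the text. For very small libraries the reduction becomes negative, that is, a matrix needs more test conditions than single-peptide testing.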
In this way, the magnitude of the T cell population responding to each peptide is established, permitting the identification of dominant, subdominant, cryptic, and non-immunogenic peptides (see the color code in Fig. 1). The sum of T cells recognizing all these epitopes reflects the total size of the antigen-specific T cell pool, that is, the magnitude of T cell-mediated immunity [9]. Typical results of such an experiment are shown in Fig. 1. If the assay is done in a multiplexed cytokine format, the quality of the antigen-specific T cell response is also revealed [10]. In the closing part of the Introduction, we outline the logistics of how such “monstrous” tests can be performed with relatively minor effort; the specifics are provided below. The process starts (see step 1) with creating the plate layout, assigning in successive order a consecutive peptide to each well (see Fig. 1), whereby wells A1 to A4 on each plate are typically reserved for positive and negative controls (see Note 8). Dedicated software is available for creating such plate layouts (SpotMap™ by CTL) that


Fig. 5 Illustration of the Reagent Tracker™ technique for assurance of peptide identity, purity, and concentration during high-throughput peptide testing. A 384-well 100 × Master Peptide Plate is shown. Biologically neutral Reagent Tracker™ dyes [26] are admixed to the individual peptides on the master plates to obtain a distinct pattern. Wells A1–A4 are reserved for negative and positive controls. When liquid is transferred successfully with multichannel pipettors to the corresponding 10 × Master Peptide Plates, and then to the actual test plates, the pattern of the distinct colors is maintained, verifying the transfer and the absence of spill-overs. Moreover, the dilution of the dye makes it possible to measure whether the planned volume has indeed been transferred. Using CTL’s Reagent Tracker™ Suite, the measured and expected results can be compared in a fast, automated fashion

also includes barcode printing options for the plates (see Note 9), and provides the template for the subsequent automated analysis of the plate so that “spot forming unit” (SFU) counts for each well on each plate can be automatically assigned to the individual peptide’s (or array’s) identity for each test subject. In step 2, this plate layout is provided to a peptide manufacturer to synthesize the peptide library according to this design into barcoded 96-well (or 384-well) plates (see Notes 10–12). Step 3 begins with dissolving the peptides and diluting them into ready-to-use barcoded and color-coded 10 × Master Plates to assure the peptides’ identity, concentration, and purity (see Fig. 5, and Notes 13–17). Step 4 is the actual high-throughput ImmunoSpot® epitope testing, in which 96-channel (or 384-channel) pipettors are used to transfer the peptides from the 10 × peptide master plates to the test plates containing the PBMC; suggestions are made on how to implement such “monstrous” experiments with relatively little labor and, importantly, with minimal possibility of experimental error (see Notes 18–22). The final step, step 5, is the reading and databasing of the experimental results, a fully automated process [11] (see Note 23).
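The plate layout of step 1 can be sketched as follows. This is not the SpotMap™ software: the function names, the row-major well order (A1, A2, ..., H12), the generic "control" label, and the peptide identifiers are assumptions made for illustration.

```python
from string import ascii_uppercase


def well_names(rows=8, cols=12):
    """Well names in row-major order: A1, A2, ..., H12 for a 96-well plate."""
    return [f"{ascii_uppercase[r]}{c + 1}" for r in range(rows) for c in range(cols)]


def plate_layouts(peptide_ids, rows=8, cols=12,
                  control_wells=("A1", "A2", "A3", "A4")):
    """Assign consecutive peptides to wells, plate by plate.

    Control wells are reserved on every plate and never receive a peptide.
    Returns one {well: content} dictionary per plate.
    """
    free_wells = [w for w in well_names(rows, cols) if w not in control_wells]
    plates = []
    for start in range(0, len(peptide_ids), len(free_wells)):
        chunk = peptide_ids[start:start + len(free_wells)]
        layout = {well: "control" for well in control_wells}
        layout.update(zip(free_wells, chunk))
        plates.append(layout)
    return plates


# Hypothetical identifiers for a 553-peptide library.
peptide_ids = [f"pep{i:03d}" for i in range(1, 554)]
plates = plate_layouts(peptide_ids)
print(len(plates), plates[0]["A5"], plates[0]["H12"])
```

With four control wells, a 96-well plate holds 92 peptides, so a 553-peptide library would occupy seven plates under these assumptions. Passing `rows=16, cols=24` gives the corresponding 384-well layout.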

2 Materials

Thawing of Cryopreserved PBMC

1. Cryopreserved PBMC sample (see Note 24).
2. DNase-containing washing medium (see Note 25).
3. CTL-Test™ Medium (see Note 22).
4. 50 mL conical tubes.
5. Parafilm.
6. Tabletop centrifuge.
7. Hemocytometer(s).
8. CTL ImmunoSpot® Reader, any model.

ImmunoSpot® Assay

1. Human IFN-γ Pre-coated ImmunoSpot® Kit, including the following:
   (a) Pre-coated ELISPOT plate, PVDF membrane.
   (b) Diluents B, C, and Blue.
   (c) Detection Ab.
   (d) Streptavidin-AP.
   (e) Substrate solutions S1, S2, and S3.
2. CTL-Test™ Medium.
3. Test antigens/peptides (see Notes 11–17).
4. PBS.
5. PBS-T (PBS + 0.05% Tween-20).
6. Distilled water.

3 Methods

Warm all media to 37 °C before use. All thawing and culture steps should be performed with warm (37 °C) media to retain functionality of the cells. Perform all culture steps in a biological safety cabinet following all appropriate safety protocols.

3.1 Thaw Cryopreserved PBMC (Sterile Conditions)

1. Place cryovials into the 37 °C water (or, preferably bead) bath for 8 min to thaw. 2. Remove cryovials from 37 °C bath and wipe with 70% ethanol before unscrewing caps. 3. Using a serological pipet, transfer contents of cryovial into a 50 mL conical tube. Up to 5 vials of the same donor can be pooled into one 50 mL tube.

T Cell Epitope Mapping

79

4. For each cryovial used, rinse the cryovial with 1 mL AntiAggregate solution (see Note 25). Transfer the rinse solution to the 50 mL tube slowly, dropwise, while swirling the tube to ensure adequate mixing of the cells and thawing medium. 5. Add an additional 2 mL Anti-Aggregate solution to the tube dropwise while swirling. The cells are now in a total of 4 mL. 6. Add the final 6 mL of Anti-Aggregate solution to the tube, swirling gently to mix. The cryovial is now resuspended in a total of 10 mL of Anti-Aggregate solution. (If additional cryovials are pooled, calculate using 1 mL cell suspension + 9 mL Anti-Aggregate solution for each cryovial to find total resuspension volume). 7. Cap the 50 mL conical tube tightly and invert twice to mix. 8. Centrifuge PBMC at 300 × g for 10 min at RT. 9. Discard supernatant and flick the bottom of the conical tube gently to resuspend the cell pellet. Add 10 mL Anti-Aggregate solution for each cryovial thawed. Cap the tube and invert gently twice to mix. 10. Pipet 20 μL Live/Dead Cell Counting dye (e.g., acridine orange/propidium iodide) onto a small piece of parafilm. 11. Remove 20 μL of cell suspension and add to the Live/Dead Cell Counting dye. Pipet up and down 3–5 times to mix avoiding formation of bubbles. 12. Transfer 10 μL of the cell and dye suspension into each chamber of a hemocytometer. 13. Count live cells under UV microscope, or using CTL’s Live/ Dead Cell Counting Suite. 14. Centrifuge the 50 mL tube containing PBMC again at 300 × g for 10 min at RT. 15. Discard supernatant and gently flick the bottom of the tube to resuspend the cell pellet. 16. Resuspend the cell pellet in pre-warmed (37 °C) CTL-Test™ Medium to the desired concentration (see Notes 3, 4, and 26). 3.2 Plate PBMC (Sterile Conditions)

1. Gently swirl PBMC suspension to ensure the even distribution of the cells (see Note 28).
2. Using wide-orifice tips and a 96-channel pipettor, add 180 μL PBMC suspension to each well of the pre-coated ImmunoSpot plate (90 μL for the 384-well plate) (see Note 28).
3. Store plates with cells in the incubator until the plating of the peptides.


3.3 Plate Antigens/Peptides (Sterile Conditions)

1. Prepare all antigens/peptides at 10 × final concentration in Peptide Master Plates (see Notes 11–17).
2. For a 96-well plate, using a 96-channel pipettor, add 20 μL of the 10 × master peptide solution per well into the pre-coated human IFN-γ ImmunoSpot® plates containing the PBMC (10 μL of peptide per well for 384-well plates).
3. Replace the plate cover and gently tap the plate on all sides to ensure even sedimentation of the cells across the membrane (see Note 27).
4. Incubate the plate at 37 °C with 8–9% CO2 for 24 h. Do not disturb or relocate the plate during the incubation, as this will result in poor spot morphology. Open and close the incubator door gently to avoid disturbing the ELISPOT plate.

3.4 Develop the ELISPOT Plate (Nonsterile Conditions)

1. Upon completion of the cell culture incubation, remove the plate from the incubator. Decant the plate (or harvest the cells for additional downstream analysis, if desired; see Note 21) and wash 5 × with 200 μL PBS/well for 96-well plates (100 μL for 384-well plates) using an automated plate washer with adjusted pin height (see Note 29).
2. Decant plate and wash 5 × with 200 μL PBS-T/well for 96-well plates (100 μL/well for 384-well plates).
3. Prepare Detection Solution by adding 40 μL Detection antibody to 10 mL Diluent B.
4. Decant wash from plate and add 80 μL/well Detection Solution.
5. Incubate for 2 h at RT in the dark.
6. Decant Detection Solution and wash plate 5 × with 200 μL/well PBS-T for 96-well plates (100 μL for 384-well plates).
7. Prepare Tertiary Solution by adding 10 μL Streptavidin-AP to 10 mL Diluent C.
8. Decant wash from plate and add 80 μL/well Tertiary Solution.
9. Incubate for 30 min at RT in the dark.
10. Decant Tertiary Solution and wash plate 5 × with 200 μL/well PBS-T for 96-well plates (100 μL for 384-well plates).
11. Decant and wash 5 × with 200 μL/well distilled water for 96-well plates (100 μL for 384-well plates).
12. Prepare Substrate Solution by adding 160 μL of S1 to 10 mL Diluent Blue. Mix well. Add 160 μL of S2 to the solution and mix well. Finally, add 92 μL of S3 to the Substrate Solution and mix well (protect from light and prepare the Substrate Solution immediately prior to use).


13. Decant wash and add 80 μL/well Substrate Solution.
14. Incubate for 15 min at RT.
15. Remove underdrain and rinse both sides of plates 3 × with tap water. Allow plates to dry completely prior to imaging and analysis (see Note 23).

4

Notes
1. The traditional approach is to use 15-mer peptide libraries that walk the sequence in steps of several (typically four) amino acids, and presently most commercially available peptide libraries follow this design [11]. While this approach cuts down on the number of peptides needed for testing, it is presently unclear how well 15-mer peptides are suited for CD8+ T cell immune monitoring: 9-mers are ideal for binding to MHC/HLA Class I molecules [22]. Also, Class I molecules, being closed on both ends, are intolerant to frame shifts in the amino acid sequence of peptides [22]; major gaps in epitope coverage can therefore be expected with 15-mers walked in 4-amino-acid steps.
2. Depending on the configuration of the matrix, this approach cuts down substantially on the number of test conditions. For example, when testing a large virus like SARS-CoV-2 we identified 1263 peptides of interest. In a traditional 96-well format, 383,700,000 cells would be needed to test these 1263 individual peptides at 300,000 cells per well. However, by generating a 36 × 36 matrix and running the assay in a 384-well plate at 33,333 cells per well, only 2,533,308 cells would be needed. Running the assay in the 384-well matrix format requires only 0.66% of the cell material compared to the traditional 96-well individual format. Thus, using the 384-well approach, we can have high-resolution screening of 1263 individual peptides in 72 peptide pools generated from a 36 × 36 matrix, using 100 mL of blood readily available by venipuncture.
3. In 96-well ImmunoSpot® assays, a linear relationship is observed between the number of PBMC plated per well and the number of antigen-induced IFN-γ spots detected when the PBMC are plated between 100,000 and 1,000,000 PBMC per well [18].
4. The assay can be miniaturized by using 384-well plates, which reduces the number of PBMC to one-third that of the original 96-well assay [18].
5. Using a 36 × 36 matrix, for example, permits high-resolution testing of 1263 peptides with 100 mL of blood; see also Note 2.


6. ImmunoSpot® assays record the number of memory T cells that engage in cytokine production after recognition of their cognate antigen/peptide [18].
7. Typically, IFN-γ secretion by T cells is measured, but the assay can be multiplexed to detect several cytokines simultaneously (dual-color enzymatic [23], and 4-color fluorescent [24]).
8. Media control wells (A1, A2) on all plates can be summed up as the negative control for each test subject. In addition, as 3 mg) to allow for experimental repetitions for all steps of antigen characterization, reconstituted antigen processing, and T cell reactivity testing.
3. MS facilities will ideally be equipped with at least an LC-ESI-MS instrument allowing for sensitive measurements. Most measurements can be done following conventional methods, but we recommend discussing sample requirements and specifications with the MS personnel in advance.


4. Ideally, PBMC with known HLA haplotype are used in these experiments. It is therefore essential to make sure that your project fulfills all ethics and logistics requirements for working with blood samples or buffy coats from typed donors. Alternatively, cryopreserved HLA-typed PBMC are commercially available.
5. FF-Sepharose coupled to L243 is usually produced in house by coupling the mAb to FF-Sepharose (NHS-activated) according to the recommendations of the vendor.
6. Stage-Tips are used to clean up samples that will undergo MS measurements. In brief, Nerbe tips, C18 Material from 3MM, and LC-ESI compatible buffers are required. Please check the details provided in [17].
7. Peptides of interest can be synthesized by any commercial company. We recommend using peptides of 15–20 amino acids and checking whether they are water-soluble before ordering. Consider extensions or shortenings, if possible, to keep them soluble at physiological pH (never below 15 amino acids). Although the peptides may be dissolved initially in 100% DMSO at 10 mM or above, they are then diluted in PBS prior to use. The DMSO concentration should never exceed 1% in cell culture.
8. Tetramers are produced in house according to the methods described in Ebner et al. [9]. Companies and/or facilities may provide these reagents upon request, and their preparation has also been described in detail [15].
9. Template grids prepared in house can be used to guide band cutting from the gel (e.g., 6.5 × 8 cm with 20 rows of equal height, such as 3.25 mm each) and can be printed or drawn on paper or transparent plastic.
10. Alkylation of cysteines facilitates the identification of peptides that contain cysteines. During the reduction step all disulfide bonds are reduced, and the free thiols are then irreversibly modified by iodoacetamide (carbamidomethylation). Carbamidomethylation (CAM) has to be considered as a fixed modification of cysteines during the database search.
11. It is extremely important to prepare the Trypsin stock solution fresh, ideally on the day of its use.
12. Alternating high and low salt concentration washing steps is recommended for immunopeptidomics. We recommend including a final wash in LC-ESI-grade H2O before the elution.
13. MaxQuant has a user-friendly interface and extensive documentation. Follow the instructions provided in the available manuals and, unless specifically advised otherwise by your MS facility personnel, keep the default settings. Note that you have to set “unspecific” in the Enzyme specificity tab, that we recommend checking “match between runs,” and that you should include your own database. In this set of experiments the database used should contain not only the antigenic proteins identified in the protein sources (see Subheading 3.1, named Antigen.fasta), but also the sequences of the cathepsins and all recombinant molecules incorporated in the assay (e.g., HLA-DR, HLA-DM, Cathepsins).
14. The sizes of the peptides (as retrieved from the MaxQuant output) and of the consensus peptides (from PLAtEAU) are provided directly by this webserver and should be in the range of MHCII binders (11–25 amino acids). Determine the extent to which CLIP has been exchanged for antigenic peptides belonging to the antigen considered. To do so, sum up the relative intensities of the consensus peptides that belong to the invariant chain, and of those that belong to protein sources from the antigen. A ratio higher than 10:1 (cumulative total ion current from antigenic proteins:CLIP) is usually considered optimal.
15. If the antigen and/or pathogen/parasite under consideration has no hits, one could still try to broaden the search to similar organisms or perform searches on individual antigens. Note as well that one could do sequence-based searches indicating the extent of matching between the sequence used as input and those available as epitopes. When using the corresponding hits from Candidate_epitopes.txt, the size of the peptide could be a limitation for searches of “exact matches.”
16. We suggest treating each variable as follows: Rel_abundance (ESM to ESF) as a continuous numerical variable; Score_2Cat and Score_3Cat as categorical values based on numbers; Score_Comp as a continuous numerical variable; IEDB, Lit, and NetMHC as categorical variables based on numbers in all instances. As the function to define the final ranking we suggest Score_Comp + Score_Comp*LF(IEDB/Lit/NetMHC), where LF(IEDB/Lit/NetMHC) = 1 if all three terms are 1, 0.5 if only two are 1, and 0.3 if only one equals 1.
17. We recommend using freshly isolated PBMC instead of cryopreserved PBMC, if available. We recommend performing T cell expansion for complex antigen mixtures with 50–80 million PBMC per donor.
18. Heat inactivation does two things: it destroys the conformation of proteins, making antigen fragments more accessible, and it inactivates factors that potentially interfere with antigen presentation (e.g., MHCII loading and processing). Parasite ES products, for example, often contain cysteine proteases.
19. Optimal stimulation times can vary between 6 and 8 h.


20. We recommend using freshly isolated PBMC instead of cryopreserved PBMC, if available. CD3 depletion using the OKT3 clone stimulates T cell activation and should be reconsidered if CD3+ T cells are needed for additional assays.
21. To avoid competition between peptides for binding to MHC II, and to make it easier to pin down the fine restriction of the expanded cells, we recommend pooling fewer than 10 peptides.
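The ranking function suggested in Note 16 can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name final_score is hypothetical, the three categorical terms are taken to be 0/1 flags, and LF is set to 0 when none of the three terms equals 1, a case the note does not define.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating Note 16:
#   final = Score_Comp + Score_Comp * LF(IEDB/Lit/NetMHC)
# Arguments: Score_Comp, then the IEDB, Lit, and NetMHC flags (assumed 0 or 1).
final_score() {
    LC_ALL=C awk -v s="$1" -v a="$2" -v b="$3" -v c="$4" 'BEGIN {
        hits = a + b + c
        if (hits == 3)      lf = 1
        else if (hits == 2) lf = 0.5
        else if (hits == 1) lf = 0.3
        else                lf = 0   # assumption: not defined in the note
        printf "%.2f\n", s + s * lf
    }'
}

# Toy values, for illustration only.
final_score 2 1 1 1   # all three terms equal 1
final_score 2 1 0 1   # two terms equal 1
final_score 2 0 0 1   # one term equals 1
```

Candidates can then be sorted by this value; how ties or undefined cases should be handled is left to the user.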

Acknowledgments
The authors are thankful to Christian Freund and Susanne Hartmann for resources, materials, and infrastructure provided for the development of these protocols. The authors are also thankful to the FU for internal funding and the Core Facility BioSupraMol for the Mass Spectrometry equipment and service.

References
1. Grifoni A, Moore E, Voic H, Sidney J, Phillips E, Jadi R, Mallal S, de Silva AD, de Silva AM, Peters B, Weiskopf D, Sette A (2019) Characterization of magnitude and antigen specificity of HLA-DP, DQ, and DRB3/4/5 restricted DENV-specific CD4+ T cell responses. Front Immunol 10:1568. https://doi.org/10.3389/fimmu.2019.01568
2. Yewdell JW (2006) Confronting complexity: real-world immunodominance in antiviral CD8+ T cell responses. Immunity 25:533–543
3. Blum JS, Wearsch PA, Cresswell P (2013) Pathways of antigen processing. Annu Rev Immunol 31:443–473
4. Chen B, Khodadoust MS, Olsson N, Wagar LE, Fast E, Liu CL, Muftuoglu Y, Sworder BJ, Diehn M, Levy R, Davis MM, Elias JE, Altman RB, Alizadeh AA (2019) Predicting HLA class II antigen presentation through integrated deep learning. Nat Biotechnol 37:1332–1343
5. Abelin JG, Harjanto D, Malloy M, Suri P, Colson T, Goulding SP, Creech AL, Serrano LR, Nasir G, Nasrullah Y, McGann CD, Velez D, Ting YS, Poran A, Rothenberg DA, Chhangawala S, Rubinsteyn A, Hammerbacher J, Gaynor RB, Fritsch EF, Greshock J, Oslund RC, Barthelme D, Addona TA, Arieta CM, Rooney MS (2019) Defining HLA-II ligand processing and binding rules with mass spectrometry enhances cancer epitope prediction. Immunity 51:766–779
6. Reynisson B, Barra C, Kaabinejadian S, Hildebrand WH, Peters B, Nielsen M (2020) Improved prediction of MHC II antigen presentation through integration and motif deconvolution of mass spectrometry MHC eluted ligand data. J Proteome Res 19:2304–2315
7. Midha A, Ebner F, Schlosser-Brandenburg J, Rausch S, Hartmann S (2021) Trilateral relationship: Ascaris, microbiota, and host cells. Trends Parasitol 37:251–262
8. Hartman IZ, Kim A, Cotter RJ, Walter K, Dalai SK, Boronina T, Griffith W, Lanar DE, Schwenk R, Krzych U, Sadegh-Nasseri S (2010) A reductionist cell-free major histocompatibility complex class II antigen processing system identifies immunodominant epitopes. Nat Med 16:1333–1340
9. Ebner F, Morrison E, Bertazzon M, Midha A, Hartmann S, Freund C, Álvaro-Benito M (2020) CD4+ Th immunogenicity of the Ascaris spp. secreted products. NPJ Vaccin 5:25. https://doi.org/10.1038/s41541-020-0171-z
10. Álvaro-Benito M, Morrison E, Abualrous ET, Kuropka B, Freund C (2018) Quantification of HLA-DM-dependent major histocompatibility complex of class II immunopeptidomes by the peptide landscape antigenic epitope alignment utility. Front Immunol 9:872. https://doi.org/10.3389/fimmu.2018.00872
11. Vita R, Mahajan S, Overton JA, Dhanda SK, Martini S, Cantrell JR, Wheeler DK, Sette A, Peters B (2019) The immune epitope database (IEDB): 2018 update. Nucleic Acids Res 47:D339–D343
12. Bacher P, Schink C, Teutschbein J, Kniemeyer O, Assenmacher M, Brakhage AA, Scheffold A (2013) Antigen-reactive T cell enrichment for direct, high-resolution analysis of the human naive and memory Th cell repertoire. J Immunol 190:3967–3976
13. Bacher P, Scheffold A (2015) New technologies for monitoring human antigen-specific T cells and regulatory T cells by flow-cytometry. Curr Opin Pharmacol 23:17–24
14. Saggau C, Scheffold A, Bacher P (2021) Flow cytometric characterization of human antigen-reactive T-helper cells. Methods Mol Biol 2285:141–152
15. Chow IT, Kwok WW (2021) Identification of human antigen-specific CD4+ cells with peptide-MHC multimer technologies. Methods Mol Biol 2285:153–163
16. Álvaro-Benito M, Wieczorek M, Sticht J, Kipar C, Freund C (2015) HLA-DMA polymorphisms differentially affect MHC class II peptide loading. J Immunol 194:803–816
17. Rappsilber J, Ishihama Y, Mann M (2003) Stop and go extraction tips for matrix-assisted laser desorption/ionization, nanoelectrospray, and LC/MS sample pretreatment in proteomics. Anal Chem 75:663–670

Chapter 7
Computational Grafting of Epitopes
Manish Manish, Smriti Mishra, Monika Pahuja, Ayush Anand, Naidu Subbarao, and Ram Samudrala

Abstract
Epitopes are the cornerstones of rational vaccine design strategies. Conventionally, epitopes are used by chemical conjugation to a carrier protein. This chapter describes our computational epitope grafting methodology for identifying the preferential grafting site in a carrier protein/scaffold. We use the mota epitope as an example, as it has already been experimentally validated by an independent group. The chapter provides sufficient detail to enable wet-lab experimentalists to employ this computational methodology in their own research. The scripts/programs are described extensively and are freely accessible through the provided links.

Key words Epitope, Rational vaccine design, Epitope grafting, Computational protein design

1 Introduction

Epitopes are immunologically active regions of an antigen. The Immune Epitope Database currently lists approximately 1,545,392 epitopes reported in approximately 23,343 research articles [1]. Epitopes have been extensively utilized as vaccine candidates by chemical conjugation to a carrier protein. With the advancements in computational protein modeling, custom gene synthesis, and heterologous protein expression, it is possible to graft the epitope computationally into a carrier protein and then validate the chimeric construct by custom gene synthesis and heterologous protein expression. Such a chimeric antigen can also be evaluated on the mRNA vaccine platform, which has recently been extensively utilized against COVID-19 [2]. Previously, we developed a computational model to identify the preferential epitope grafting site in a carrier protein [3]. Our experimental epitopes were 39 and 53 residues long. Using seven-residue control epitopes, we showed that our model has no artifacts that would produce preferential grafting regions. This also mimics the conventional experiments of inserting small peptide fragments into virus-like particles. To the best of our knowledge, no computational or experimental grafting studies have been reported for epitopes 39–53 residues long. Previous studies used epitopes in the range of 6–24 residues.

In this work, we have validated our method using the mota epitope, which has the sequence NSELLSLINDMPITNDQKKLMSNN and a length of 24 residues. This epitope was taken from an independent study by Correia et al. entitled “Proof of principle for epitope-focused vaccine design,” published in Nature in March 2014 [4]. In that study, the mota epitope was grafted into scaffold 3LHP, Chain S. The position ranked 5th by our computational grafting methodology is the one that Correia et al. used for experimental validation.

We will employ a torsional angle-based protein structure prediction simulation protocol to insert the mota epitopic region as a “loop” into the 3LHP, Chain S scaffold at various positions, perform folding simulations, and examine the most stable configurations, which we take to indicate how and where the epitope can be grafted onto the scaffold. This process is akin to what occurs naturally during biological evolution, where insertions and deletions of peptide sequences, usually at the boundaries of protein secondary structure, are made to evolve from one protein to another. This objective will be achieved by protein modeling simulations. The “loop” can be built using torsional angle-based folding simulations. The chimeric protein constructs consisting of the epitope and the scaffold are then scored for stability using knowledge-based discriminatory functions, which have been shown to differentiate between non-native and native protein structures [5].

Manish Manish and Smriti Mishra contributed equally with all other contributors.
Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_7, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

2

Materials

2.1 Software Packages

1. RAMP
We will require the mcgen_semfold_loop program from the RAMP suite of programs. The suite, along with detailed installation instructions and a manual, can be found at http://compbio.buffalo.edu/software/ramp/ramp.html. It can also be downloaded from the custom link https://drive.google.com/file/d/1S21C-thHk4lx8asFFXjFaGNaw1hyh3A4/view?usp=share_link (see Note 1).


2. TCSH
The RAMP suite of programs requires the tcsh shell (see Note 2). To check whether tcsh is installed, type tcsh in your default terminal: you will either enter the tcsh shell or be told how to install it. Typically, you can install tcsh by typing sudo apt install tcsh.

2.2 Install RAMP

RAMP is downloaded as a gzipped tarball. Keep the downloaded tarball in your home directory or in any directory where you want to install it. In this chapter we use /home/scis to represent the user’s home directory and extract the RAMP tarball there.
1. Type the following command in your terminal. This extracts RAMP and creates the directory /home/scis/ramp (see Note 3).

$ tar -zxvf ramp.tar.gz

2. Use the cd command to change directory. For example, cd /home/scis/ramp takes you to the extracted ramp directory. Type the following command to reach the “scripts” directory.
$ cd /home/scis/ramp/bin/scripts/

3. Open the “onramp” file using any text editor, such as vi, emacs, gedit, or nano.
4. Change the path of RAMP_ROOT to /home/scis/ramp, as highlighted in Fig. 1.
5. You may need to install the following compilers: gcc, f77, f2c, fort77, gfortran, and g77 (see Notes 4–6).
6. To enter the tcsh shell, type
$ tcsh

7. Source the file “setup_environment” using
$ source /home/scis/ramp/bin/scripts/setup_environment

8. Source the file “onramp” using
$ source /home/scis/ramp/bin/scripts/onramp


Fig. 1 RAMP_ROOT in file “onramp”

9. Change the current directory to the “scripts” directory using
$ cd /home/scis/ramp/bin/scripts/

10. Run the installer script, “install_ramp”. A successful installation produces output as in Fig. 2 (see Note 7).
$ ./install_ramp

3 Methods

3.1 Prepare Epitope Sequence and Sequence db File

1. Create a file named “seq” using any text editor such as vi, gedit, or emacs. In this example we use the mota epitope, so open the file with vim seq and paste the sequence NSELLSLINDMPITNDQKKLMSNN (see Note 8).
2. Derive the 3-tuplets and save them as seqdb using the command below. Change the path “/home/scis” in the command as described in Note 3.

$ prune_ntuplet_db seq /home/scis/ramp/lib/astral_159_e4.xray.3tuplet_db 3 seqdb
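Because a stray space or lowercase letter in the “seq” file is easy to miss, a quick sanity check before running prune_ntuplet_db may help. The sketch below is not part of RAMP: validate_seq is a hypothetical helper, and it assumes that the sequence is expected as uppercase one-letter codes for the 20 standard amino acids.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RAMP): check an epitope sequence before
# writing it to the "seq" file. Assumes uppercase one-letter codes for the
# 20 standard amino acids; prints the sequence length, or "invalid".
validate_seq() {
    local seq="$1"
    case "$seq" in
        ""|*[!ACDEFGHIKLMNPQRSTVWY]*)
            echo "invalid"
            return 1
            ;;
    esac
    echo "${#seq}"
}

# Demo with the mota epitope used in this chapter.
validate_seq "NSELLSLINDMPITNDQKKLMSNN"
# A copy containing an embedded space is rejected.
validate_seq "NSELL SLINDMPITNDQKKLMSNN" || true
```

The printed length is also the value needed later when the loop boundaries are derived (see Note 11).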

3.2 Prepare the PDB File of Carrier Protein/Scaffold

1. PDB file 3LHP has author-designated chains H, I, L, M, S, and T. We need to graft in the S chain. Hence extract the chain S using the command below. A typical output is shown in Fig. 3.

$ extract_pdb_chain 3lhp.pdb S 3lhp_s.pdb

2. Clean the PDB file 3lhp_s.pdb using the command below (see Note 9). A typical output is shown in Fig. 4.
$ clean_pdb 3lhp_s.pdb c3lhp_s.pdb


Fig. 2 Successful installation of mcgen and potential modules from RAMP suite

Fig. 3 Typical output of extract_pdb_chain script

Fig. 4 The typical output of clean_pdb script

3.3 Prepare Loop Files

1. Copy and paste the text below into a text editor and save it as file name “alb”. ab cd NSELLSLINDMPITNDQKKLMSNN c3lhp_s.pdb ef gh constraint jk mn 3.0 1.0

2. Copy the script below into a text editor and save it as copy.pl. The range 2 .. 92 denotes the positions that will be searched for preferential grafting sites.
#!/usr/bin/perl -w
use File::Copy;
# Template loop file created in step 1
$original = "alb";
# Create one placeholder copy of the template per candidate position
foreach $new_copy (2 .. 92) {
    copy($original, $new_copy) or die "Copy failed: $!";
}

3. Make a directory named “loop” using
$ mkdir loop

4. Keep the script copy.pl and the file alb in the loop directory.
5. Run the following command. This generates the placeholder loop files.
$ perl copy.pl

6. Make a list of the placeholder loop files with the command below and name it “list”.
$ ls |grep '^[1-9]' |sort -n >list

7. Copy and paste the following script into a text editor and save it as “catsed.sh”. This generates the specific loop file for each position (see Notes 10 and 11).
#!/bin/bash
for i in $(cat list); do
    echo $i
    a=$(($i+23))
    b=$(($i+24))
    echo $a
    echo $b
    sed "s/ab cd/$i $a/g; s/ef gh/$i 1/g; s/jk mn/$a $b/g" "${i}" > "${i}.l"
done
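The offsets 23 and 24 in catsed.sh follow from the 24-residue length of the mota epitope (see Note 11). As a hedged illustration of how these numbers generalize to other epitopes, the sketch below computes the same three values from an arbitrary epitope length; loop_bounds is a hypothetical helper name and is not part of RAMP.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for an epitope of length seq_len grafted at position
# pos, print "pos a b" with a = pos + seq_len - 1 and b = pos + seq_len,
# which reproduces the +23/+24 offsets used in catsed.sh for a 24-mer.
loop_bounds() {
    local seq_len="$1"
    local pos="$2"
    local a=$(( pos + seq_len - 1 ))
    local b=$(( pos + seq_len ))
    echo "$pos $a $b"
}

# Demo: first, validated, and last candidate positions for the mota epitope.
for pos in 2 74 92; do
    loop_bounds 24 "$pos"
done
```

For an epitope of a different length, the two literals in catsed.sh would be replaced by the length minus one and the length, respectively.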

3.4 Graft the Epitope Using mcgen_semfold_loop

1. Make a directory named “graft”.
$ mkdir graft

2. Move the loop files to the graft directory using
$ mv [1-9]*.l graft

3. Change to the graft directory.
$ cd graft


4. Make a list of all the positions using
$ ls |grep .l |sort -n >list

5. Copy the seqdb file and the template PDB file c3lhp_s.pdb into the graft directory.
6. Copy the following script and save it as runmcgen. Change “/home/scis” in the script as described in Note 3.
#! /usr/bin/tcsh -f
source /home/scis/ramp/bin/scripts/setup_environment
source /home/scis/ramp/bin/scripts/onramp
foreach x ( `cat $argv[1]` )
    echo $x
    set seed = 0
    while ($seed < 50)
        @ seed = $seed + 1
        echo $seed
        mcgen_semfold_loop $x /home/scis/ramp/lib/scores seqdb $x$seed 1 1000000 $seed
    end
end

7. Run the runmcgen script (see Notes 10 and 12).
$ ./runmcgen list >& /dev/null &
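Since runmcgen launches 50 seeded simulations per position, it can be useful to preview the workload before starting. The following is a dry-run sketch only: plan_runs is a hypothetical helper written in portable shell rather than tcsh, it merely prints the command lines that runmcgen would issue (copied from the script above), and it does not call RAMP.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: print, without executing, one
# mcgen_semfold_loop command line per (loop file, seed) pair.
# Arguments: list file, number of seeds, path to the scores file.
plan_runs() {
    local list_file="$1"
    local n_seeds="$2"
    local scores="$3"
    local x seed
    while IFS= read -r x; do
        [ -n "$x" ] || continue          # skip empty lines
        seed=1
        while [ "$seed" -le "$n_seeds" ]; do
            echo "mcgen_semfold_loop $x $scores seqdb $x$seed 1 1000000 $seed"
            seed=$(( seed + 1 ))
        done
    done < "$list_file"
}

# Demo with a two-entry list and three seeds (runmcgen uses 50).
demo_list="$(mktemp)"
printf '2.l\n3.l\n' > "$demo_list"
plan_runs "$demo_list" 3 /home/scis/ramp/lib/scores
rm -f "$demo_list"
```

With the 91 positions (2 to 92) and 50 seeds used in this chapter, the plan would contain 4550 command lines.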

3.5 Score the Chimeric Grafted Models Using RAPDF Scoring Function

1. Make a list of all the output best files by

$ ls | grep best| sort -n >filelist

2. Copy and save the following script as protinfo_rapdf_simple.
#! /usr/bin/tcsh -f
set arg0 = `echo $0 | awk -F'/' '{print $NF}'`
if ($#argv < 1) then
    echo "Usage: $arg0 conformation-file|conformation-filelist [overwrite]"
    exit
endif
if ($2 == "overwrite") then
    rm $1.rapdf_scores $1.rapdf37_scores $1.rapdfi37_scores
endif
potential_rapdf $1 /home/scis/ramp/lib/scoring_functions/astral_169_e4_allatoms_xray_lt2.0_scores >> $1.rapdf_scores

3. Score the output best files using protinfo_rapdf_simple. As this script uses the potential_rapdf module from the RAMP suite, you need to source the RAMP environment files as described in Note 7. Change “/home/scis” in the protinfo_rapdf_simple script as described in Note 3. This generates an output file named filelist.rapdf_scores. This text file contains the RAPDF score, the model name, the residue number, and the RAPDF score/residue number.
$ ./protinfo_rapdf_simple filelist

3.6 Extract the Scores from the filelist.rapdf_scores Text File

1. This command below will give the model’s name in the first column and the score in the second column

$ cat filelist.rapdf_scores | awk '{print $5, $7}' | sort -k 2

2. To get a position-wise score, use the command below.
$ cat filelist.rapdf_scores | awk '{print $5, $7}' | sed 's/.l[0-9]*.best[0-9]*.pdb//g' | sort -n

3. To get separate position-wise score files, use the command below.
$ cat filelist.rapdf_scores | awk '{print $5, $7}' | sed 's/.l[0-9]*.best[0-9]*.pdb//g' | sort -n | awk '{ outFile=$1; print $0 > outFile}'

4. Calculate the position-wise mean score, standard deviation, and standard error using the script below. The first column of the output is the position; the second, third, and fourth columns are the mean, standard deviation, and standard error, respectively.
#! /usr/bin/tcsh
foreach x (`ls | grep -x -E '^[0-9]+' | sort -n`)
    awk '{ delta = $2 - avg; avg += delta / NR; mean2 += delta * ($2 - avg); } END { print "'$x'", avg, sqrt(mean2/(NR-1)), (sqrt(mean2/(NR-1)))/sqrt(NR); }' $x
end
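The awk program in step 4 uses a running (Welford-style) update of the mean and the sum of squared deviations. As an illustrative alternative, the sketch below applies the same update per position in a single pass over the “position score” lines produced in step 2 and sorts the result by mean score. The helper name position_stats, the fixed four-decimal output, and the choice to report a standard deviation of 0 for positions with a single model are assumptions of this sketch, not part of the chapter's scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "position score" lines on stdin and print
# "position mean sd se" per position, sorted by ascending mean score.
# Uses the same running-mean update as the awk script in step 4.
position_stats() {
    LC_ALL=C awk '{
        p = $1
        n[p]++
        delta = $2 - mean[p]
        mean[p] += delta / n[p]
        m2[p] += delta * ($2 - mean[p])
    }
    END {
        for (p in n) {
            # Assumption: SD is reported as 0 when only one model exists.
            sd = (n[p] > 1) ? sqrt(m2[p] / (n[p] - 1)) : 0
            printf "%s %.4f %.4f %.4f\n", p, mean[p], sd, sd / sqrt(n[p])
        }
    }' | LC_ALL=C sort -k2,2n
}

# Toy scores for two made-up positions (not real RAPDF values).
printf '5 1\n5 2\n5 3\n5 4\n6 -2\n6 -4\n' | position_stats
```

Because the output is sorted by mean score, the first lines correspond to the lowest mean scores among the positions tested.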


Fig. 5 Identification of preferential grafting site in a carrier protein. The figure depicts the computational grafting of mota epitope in scaffold PDB ID 3LHP Chain S. The error bars represent standard error

3.7 Expected Outcome

The expected outcome of this methodology is represented in Fig. 5. Lower RAPDF scores correspond to preferential grafting sites in a carrier protein/scaffold. Position 74 is ranked 5th by our method and has been independently validated by Correia et al. [4].

4 Notes

1. The RAMP suite of programs can be installed on various flavors of the Linux operating system. To avoid redundancy, we describe the methodology for Ubuntu 16.04. However, with little customization, the described method can be implemented on other Linux distributions. For example, the command “apt-get” used in Debian can be replaced by “yum” in CentOS/RHEL.
2. RAMP can be configured for other shells, e.g., the BASH shell, as described in the RAMP manual. In a typical Ubuntu installation, the default shell is the Bourne/Bash shell. As the name implies, the shell is the outermost “shell” used to communicate with the inner computer operating system and hardware through easy-to-use text keywords.


3. We use /home/scis as the path of the home directory. Please replace /home/scis with the path of your own home directory.
4. GCC is typically already installed in your Linux distribution, and compilers such as f77, f2c, fort77, and gfortran can be easily installed using sudo apt install or yum. For the installation of g77, please see Note 5. After installation, you can check whether each compiler is available using the commands shown below.
$ gcc
gcc: fatal error: no input files
compilation terminated.

$ f77
f77: fatal error: no input files
compilation terminated.

$ f2c

$ gfortran
gfortran: fatal error: no input files
compilation terminated.

$ fort77
/usr/bin/fort77: No input files specified

$ g77
g77: no input files

5. Download g77_x64_debian_and_ubuntu.tar.gz from the link https://drive.google.com/file/d/1oo_kRnKR2JtRegAaiPHku74_OAkkHwfF/view?usp=sharing. Then use the following commands to extract and install g77 on your computer.
$ tar -xzvf g77_x64_debian_and_ubuntu.tar.gz
$ cd g77_x64_debian_and_ubuntu
$ chmod +x ./install.sh
$ ./install.sh


6. In Ubuntu 16.04, you have to follow these additional steps.
$ cd /usr/lib

and then
$ ln -s x86_64-linux-gnu/crt1.o
$ ln -s x86_64-linux-gnu/crti.o
$ ln -s x86_64-linux-gnu/crtn.o

7. For each invocation of RAMP, you need to source the files /home/scis/ramp/bin/scripts/setup_environment and /home/scis/ramp/bin/scripts/onramp. Alternatively, you can add these source commands to your .tcshrc file so that the files are sourced automatically whenever you enter tcsh.
8. We describe our methodology using the mota epitope, which has the sequence NSELLSLINDMPITNDQKKLMSNN. Please use the epitope sequence of your research interest. Similarly, change the scaffold/carrier protein according to your interest.
9. Various approaches, such as minimization, can be used to refine the structural coordinates downloaded from the Protein Data Bank, depending on the target scaffold. A well-prepared protein structural model will enhance the discrimination of preferential grafting sites by the RAPDF scoring function.
10. After saving the shell scripts, please make them executable, for example, using chmod u+x catsed.sh
11. The sequence length of the mota epitope is 24 residues. Hence, we used the numbers 23 and 24. Please change these numbers according to the sequence length of the epitope of your research interest.
12. The “>& /dev/null &” part of the command is only required when running on a remote computer. Alternatively, you can use a tool such as screen.
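The individual checks described in Notes 2 and 4 can be gathered into one pass. The sketch below is a convenience wrapper only, with check_tools as a hypothetical helper name; it relies on the standard shell builtin command -v and simply reports whether each named program is found on the PATH, without testing that it works.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether each named program is on the PATH.
# It does not verify versions or that the compilers actually work.
check_tools() {
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}

# Shell, compilers, and text tools used in this chapter.
check_tools tcsh gcc gfortran f77 f2c fort77 g77 perl awk sed
```

Any program reported as missing can then be installed as described in Notes 4 and 5 before running install_ramp.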

Acknowledgments
This work was supported in part by the National Institutes of Health (NIH) Director’s Pioneer Award (DP1OD006779), NIH Clinical and Translational Sciences (NCATS) Award (UL1TR001412), NIH National Library of Medicine (NLM) T15 Award (T15LM012495), NIH NCATS ASPIRE Design Challenge Award, NIH NCATS ASPIRE Reduction-to-Practice Award, and startup funds from the Department of Biomedical Informatics at the University at Buffalo. MM is supported by a senior research associate fellowship under the CSIR-Scientist’s Pool Scheme. SM is supported by a research fellowship under the ICMR research grant.

References
1. IEDB. IEDB.org: free epitope database and prediction resource. https://www.iedb.org/. Accessed 17 Nov 2022
2. Szabó GT, Mahiny AJ, Vlatkovic I (2022) COVID-19 mRNA vaccines: platforms and current developments. Mol Ther 30:1850–1868. https://doi.org/10.1016/J.YMTHE.2022.02.016
3. Mishra S, Manish M (2018) Studies on computational grafting of malarial epitopes in serum albumin. Comput Biol Med 102:126–131. https://doi.org/10.1016/J.COMPBIOMED.2018.09.018
4. Correia BE, Bates JT, Loomis RJ et al (2014) Proof of principle for epitope-focused vaccine design. Nature 507:201–206. https://doi.org/10.1038/NATURE12966
5. Samudrala R, Moult J (1998) An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J Mol Biol 275:895–916. https://doi.org/10.1006/jmbi.1997.1479

Chapter 8

Manufacture of Mesoporous Silicon Microparticles (MSMPs) as Adjuvants for Vaccine Delivery

Ana López-Gomez, Irene Real-Arévalo, Raúl Martín-Palma, Eduardo Martínez-Naves, and Manuel Gómez del Moral

Abstract

The advent of computational approaches has accelerated the identification of vaccine candidates such as epitope peptides. However, epitope peptides are usually very poorly immunogenic, and adequate platforms with adjuvant capacity are required to verify the immunogenicity and antigenicity of vaccine subunits in vivo. Silicon microparticles are being developed as potential new adjuvants for vaccine delivery due to their physicochemical properties. This chapter explains the methodology to fabricate and functionalize mesoporous silicon microparticles (MSMPs), which can be loaded with antigens of different nature, such as viral peptides, proteins, or carbohydrates. This strategy is particularly suitable for the delivery of epitopes identified computationally.

Key words Mesoporous silicon microparticles, Vaccine, Adjuvant, Functionalization, Peptides, Carbohydrates

1 Introduction

Peptide or subunit-based vaccines often have low immunogenicity, are easily degraded, or require the administration of large amounts of antigen. Adjuvants are compounds that increase the duration, efficacy, and magnitude of the vaccine response when administered simultaneously with the target antigen [1, 2]. In both clinical and research settings, one of the most widely used adjuvants is aluminum hydroxide (Al(OH)3), which is found, for example, in the diphtheria, tetanus, and pertussis (DTaP) vaccines [3]. However, to meet the new challenges in the field of vaccines, new, more potent, and safer adjuvants capable of activating T and B lymphocyte responses are needed. Micro- and nanoparticles, including organic and metallic nanoparticles [4], are gaining importance as adjuvants [2]. Mesoporous silicon microparticles have been shown to increase specific antigen

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_8, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


Ana Lo´pez-Gomez et al.

presentation in humans [5]. These are based on ordered Si nanocrystals covered with SiO2, with a large adsorption surface due to the presence of small pores. Mesoporous silicon microparticles have key properties, such as chemical stability, high surface area (up to 1500 cm2/g), adjustable pore size (typically between 5 and 250 nm), and high pore volume (≈1 cm3/g), that make them ideal carriers of biomolecules and adjuvants [5, 6]. In addition, these particles are biodegradable, with silicic acid, an inorganic component of bones, as the end product [4]. It has been shown that these microparticles degrade readily in vitro under physiological conditions within 24 h, while in vivo they are eliminated from the body after 4 weeks [4]. These particles can also be functionalized by adding amino groups, which changes their charge to positive and allows binding to molecules through covalent bonds [7, 8]. Finally, production from silicon wafers makes it possible to obtain large quantities of mesoporous silicon microparticles [5, 9].

2 Materials

The production and functionalization of mesoporous silicon microparticles (MSMPs) have to be carried out under sterile conditions.

2.1 Mesoporous Silicon Microparticles Production

Materials are kept at room temperature. Once the microparticles are produced, they are stored at 2–8 °C. The following components are necessary for production:
1. Silicon wafers.
2. 100% ethanol.
3. Hydrofluoric acid.
4. Distilled water.
5. Sieves of 0.01, 0.005, and 0.001 mm.
6. Centrifuge.
7. Scale.

2.2 Mesoporous Silicon Microparticles Functionalization

1. Nitrogen.
2. Nitrogen cabin.
3. 3-aminopropyltriethoxysilane (APTS): it is stored at 4 °C, free from oxygen.
4. Stirrer.
5. 100% ethanol.
6. UV light.

Generation of Mesoporous Silicon Based Vaccines

2.3 Load of Peptides and Carbohydrates into Mesoporous Silicon Microparticles

All products should be stored and used under sterile conditions. For this part, the following are needed:
1. Peptides that you want to load onto the particles.
2. Fluorescein isothiocyanate (FITC).
3. Rotating wheel.
4. Fridge.
5. RPMI 1640: it is stored at 4 °C.
6. Fetal Bovine Serum (FBS): it is stored at -20 °C.
7. L-glutamine: it is stored at -20 °C.
8. Antibiotic/antimycotic: it is stored at -20 °C.
9. Distilled water.
10. PBS 1X.

3 Methods

3.1 Mesoporous Silicon Production

Start with silicon wafers with a thickness of 1000 ± 25 μm and a typical resistivity of 0.01–0.02 Ω·cm. An electrochemical fabrication method is used to create a porous structure. The process has to be done under sterile conditions.
1. Place the wafers into a beaker.
2. Use a metal support attached to two electrodes (black and red) (Fig. 1).
3. Add a 1:1 solution of hydrofluoric acid (48%) and ethanol (96%) to the beaker.
4. Connect the electrodes to the machine and select the fabrication program to be used. In our case, we created a multi-step program where each step has a different intensity and duration (Table 1).
5. Apply this program three times to obtain a uniform and homogeneous porous structure.
6. When the program is finished, rinse the wafer five times in ethanol to eliminate any remaining hydrofluoric acid.
7. Transfer the particles to a Petri dish and allow them to dry overnight (see Note 1).
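As a rough planning aid, the multi-step program can be written down as data and its total running time estimated. The sketch below is only an illustration: the step values are copied from Table 1, the setpoint column is kept as an opaque number because its unit is not interpreted here, and the names (Step, PROGRAM, PROGRAM_RUNS) are hypothetical and not part of any instrument software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    setpoint: int       # value from the "I (mA)/V (V)" column of Table 1, unit not interpreted here
    duration_s: float   # T (s)
    repetitions: int    # number of repetitions


# Values copied from Table 1.
PROGRAM = (
    Step("Step 1", 3435, 50, 70),
    Step("Step 2", 4580, 1, 70),
)

# Step 5 of the protocol: the whole program is applied three times.
PROGRAM_RUNS = 3


def run_time_s(program):
    """Time of one pass through the program, ignoring any instrument overhead."""
    return sum(step.duration_s * step.repetitions for step in program)


def total_time_s(program, runs):
    """Time of all passes through the program."""
    return runs * run_time_s(program)


if __name__ == "__main__":
    print(f"one pass: {run_time_s(PROGRAM)} s")
    print(f"all passes: {total_time_s(PROGRAM, PROGRAM_RUNS) / 60:.1f} min")
```

The total is the same whether the two steps alternate or run as consecutive blocks, so the estimate does not depend on that detail of the instrument.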

3.2 Obtention of Microparticles of the Desired Size

1. Transfer the particles to a Falcon tube and add a small amount of ethanol.
2. For the particle grinding, an agate quartz mortar is used. Grinding is done manually (see Note 2).


Fig. 1 Experimental setup for the electrochemical fabrication of porous silicon. (Hernández-Montelongo et al. [8])

Table 1 Program to generate pores in silicon wafers

Step      I (mA)/V (V)    T (s)    Repetitions
Step 1    3435            50       70
Step 2    4580            1        70

3. After grinding, transfer the particles to a Falcon tube and immerse them in ethanol. It is recommended to wash the mortar with ethanol in order to collect the remaining particles.
4. Pass the solution through three different sieves (0.01, 0.005, and 0.001 mm) using a brush. Collect the solution obtained after the last sieve, because this fraction contains the microparticles of the desired size (see Note 3).
5. Wash the final product three times with ethanol, centrifuging at 10,000 rpm for 10 min between washes.
6. Resuspend the microparticles in ethanol (see Note 4).

3.3 Microparticles Functionalization

For this process, a nitrogen cabin is needed.
1. Place everything you need in the cabin adjacent to the nitrogen cabin (Fig. 2). In this case, 3-aminopropyltriethoxysilane (APTS), syringes, three needles, one beaker with ethanol, three empty beakers, Eppendorf tubes, yellow pipette tips, a 20–200 μL pipette, a shaker, and a UV light are needed.
2. Turn on the vacuum machine and open the vacuum valve until the pressure is -15 bar.
3. Close the vacuum valve and open the nitrogen valve until the pressure is 0 again.
4. Repeat these two steps three times.
5. Close the vacuum and nitrogen valves before opening the adjacent chamber, so that its contents can be moved into the nitrogen cabin (see Note 5).
6. Once everything is located inside the cabin, take 20 μL of APTS (20 μL for each 10 mL of ethanol) with the syringe and put it into an Eppendorf tube.
7. Pour 5 mL of particles into the empty beaker, add 5 mL of ethanol to the Falcon tube to rinse it, and pour the rinse into the beaker.
8. Add the APTS to the beaker, mix, and place the beaker under the UV light.
9. Turn on the UV light and leave the beaker on agitation for 20 min.

Fig. 2 Nitrogen cabin. The adjacent cabin is marked in red. (Hernández [10])


10. Turn off the UV light; the particles are now functionalized. Transfer the particles back into the Falcon tube.
11. Before turning off the nitrogen cabin, remove the material through the adjacent cabin.
12. Wash the functionalized particles with ethanol five times at 10,000 rpm for 10 min and store in ethanol at 4 °C until use.

3.4 Weighing Particles

1. Weigh three empty Eppendorf tubes with the lid closed.
2. Take three aliquots of the same volume of particle suspension, for example 200 μL each, and put them into the three Eppendorf tubes.
3. Wash the particles as described before.
4. Discard the supernatant and leave the Eppendorf tubes open to allow the particles to dry out.
5. Once the particles are dry (they turn lighter), weigh each Eppendorf tube and calculate the difference between its empty and full weight.
6. Resuspend the particles in the same volume (200 μL) and calculate the concentration using the volume and the average weight. You can adjust the particles to the concentration that is most convenient for your work by adding more or less ethanol (see Note 6).
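The calculation in steps 5 and 6 can be sketched as follows. This is a minimal illustration assuming three tubes weighed in milligrams and a 200 μL aliquot; the function name and the example weights are hypothetical.

```python
from statistics import mean


def particle_concentration_mg_per_ml(empty_mg, dried_mg, aliquot_ul):
    """Average dry particle mass per tube divided by the aliquot volume.

    empty_mg:   weights of the empty tubes (mg), lids closed
    dried_mg:   weights of the same tubes with dried particles (mg)
    aliquot_ul: volume of suspension originally placed in each tube (uL)
    """
    if len(empty_mg) != len(dried_mg) or not empty_mg:
        raise ValueError("need one empty and one dried weight per tube")
    masses = [full - empty for empty, full in zip(empty_mg, dried_mg)]
    if any(m <= 0 for m in masses):
        raise ValueError("dried tube must weigh more than the empty tube")
    return mean(masses) / (aliquot_ul / 1000.0)


if __name__ == "__main__":
    # Hypothetical weights for three tubes (mg).
    empty = [1000.0, 1001.0, 999.0]
    dried = [1000.4, 1001.5, 999.3]
    conc = particle_concentration_mg_per_ml(empty, dried, aliquot_ul=200)
    print(f"{conc:.2f} mg/mL")
```

Averaging over three tubes smooths out weighing error on sub-milligram masses, which is the reason the protocol asks for triplicates.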

3.5 Washing Particles

1. Take the volume of particle suspension that is going to be used.
2. Centrifuge at 10,000 rpm for 10 min.
3. Discard the supernatant.
4. Resuspend the particles in distilled water.
5. Repeat this process two more times.

3.6 Characterization of MSMPs

The optical laser backscattering method was employed. Scanning electron microscopy (SEM) was used to determine the morphology and size distribution of the microparticles, and Fourier-transform infrared (FTIR) spectroscopy was used to determine the effectiveness of the functionalization of the microparticles with amino groups.

3.7 Fluorescent Labeling of MSMP with FITC

1. Take 1 mg of functionalized particles.
2. Mix with an ethanolic solution of fluorescein isothiocyanate (FITC) (0.19 mmol).
3. Incubate overnight (O/N) at 4 °C in rotation and darkness.
4. Rinse the particles in distilled water, as described before.
5. Resuspend the particles in PBS 1X or RPMI.

3.8 Load Viral Peptides to MSMPs-NH3

1. Wash the functionalized MSMPs (0.09 mg/mL).
2. Resuspend the particles in the smallest possible volume of PBS 1X or distilled water.
3. Add the peptides at a concentration that ensures a saturation loading that provides 1 μg of peptide per milligram of MSMPs.
4. Incubate O/N at 4 °C in rotation.
5. Centrifuge at 10,000 rpm for 10 min.
6. Discard the supernatant and resuspend the particles in PBS 1X or RPMI. For additional uses see Notes 7 and 8.
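The amounts involved in steps 1 to 3 can be estimated with a short calculation. The sketch below assumes the 1 μg/mg figure is the target loading; the excess of peptide required to reach saturation is not specified in the text, so the result is a lower bound on what must be offered. The function names and the 2 μg/μL stock concentration are hypothetical.

```python
PEPTIDE_UG_PER_MG = 1.0  # target loading from step 3 (ug peptide per mg MSMPs)


def particle_mass_mg(concentration_mg_per_ml, volume_ml):
    """Mass of MSMPs contained in a given volume of suspension."""
    return concentration_mg_per_ml * volume_ml


def peptide_loaded_ug(mass_mg, ug_per_mg=PEPTIDE_UG_PER_MG):
    """Peptide carried by the particles at the target loading (lower bound to offer)."""
    return mass_mg * ug_per_mg


def stock_volume_ul(amount_ug, stock_ug_per_ul):
    """Volume of a peptide stock that contains the requested amount."""
    if stock_ug_per_ul <= 0:
        raise ValueError("stock concentration must be positive")
    return amount_ug / stock_ug_per_ul


if __name__ == "__main__":
    mass = particle_mass_mg(0.09, 10)        # 10 mL at 0.09 mg/mL
    peptide = peptide_loaded_ug(mass)
    volume = stock_volume_ul(peptide, 2.0)   # hypothetical 2 ug/uL stock
    print(f"{mass:.2f} mg MSMPs, {peptide:.2f} ug peptide, {volume:.2f} uL stock")
```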

4 Notes

1. When dry particles are required, it is recommended to cover the particles lightly to prevent them from blowing away.
2. It is recommended to wear a mask during the grinding.
3. During the filtration of the particles, it is recommended to pass the solution little by little to avoid sieve saturation.
4. After sieving, centrifugation at low rpm is recommended (3000 rpm for 10 min) because there might be very small particles that do not sediment after centrifugation and that may affect functionalization. In these cases, MSMP particles oxidize earlier when diluted with water. When two particle phases (dark and light) are observed in the pellet, this indicates the presence of nanoparticles.
5. When using the nitrogen cabin, introduce your hands into the gloves carefully, one hand at a time, to prevent damage to the cabin.
6. To weigh the particles, it is better to keep them in ethanol, because they dry faster.
7. Load carbohydrates to MSMPs-NH3. First, wash the functionalized particles and resuspend them in a carbohydrate solution at a concentration that ensures a saturation loading that provides 1.56 μg of carbohydrate per milligram of MSMPs. After this, incubate O/N at 4 °C under rotation. Centrifuge at 10,000 rpm for 10 min. Discard the supernatant and resuspend the particles in PBS 1X or RPMI.
8. The number of particles to be used will depend on the type of experiment. If you want to stimulate cells in vitro, a concentration of 0.09 mg of particle per mL of PBS will be required. In the case of mice immunization experiments, 0.64 μg of particle will be needed.
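For the in vitro concentration in Note 8, the dilution can be worked out with the usual C1·V1 = C2·V2 relation. The sketch below is only an illustration and assumes the washed particles are already suspended in PBS at a known stock concentration; the function name and the example stock of 0.9 mg/mL are hypothetical.

```python
IN_VITRO_MG_PER_ML = 0.09  # working concentration from Note 8


def dilution_volumes_ml(stock_mg_per_ml, final_volume_ml,
                        target_mg_per_ml=IN_VITRO_MG_PER_ML):
    """Return (stock volume, diluent volume) in mL for the requested final volume."""
    if stock_mg_per_ml < target_mg_per_ml:
        raise ValueError("stock is more dilute than the target concentration")
    stock_ml = target_mg_per_ml * final_volume_ml / stock_mg_per_ml
    return stock_ml, final_volume_ml - stock_ml


if __name__ == "__main__":
    stock_ml, pbs_ml = dilution_volumes_ml(stock_mg_per_ml=0.9, final_volume_ml=10)
    print(f"{stock_ml:.2f} mL stock + {pbs_ml:.2f} mL PBS")
```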


Acknowledgments

We wish to thank Arturo Jiménez Periáñez for intellectual support, and María Arroyo Hernández for providing the Fig. 2 image. This work was supported by grant COV20/01101 from CAM to MGM and EMN.

References
1. Joshi D, Chbib C, Uddin MN, D’Souza MJ (2021) Evaluation of microparticulate (S)-4,5-dihydroxy-2,3-pentanedione (DPD) as a potential vaccine adjuvant. AAPS J 23:84. https://doi.org/10.1208/s12248-021-00617-6
2. Zhang L, Yang W, Hu C et al (2018) Properties and applications of nanoparticle/microparticle conveyors with adjuvant characteristics suitable for oral vaccination. Int J Nanomedicine 13:2973
3. O’Hagan DT, Lodaya RN, Lofano G (2020) The continued advance of vaccine adjuvants – “we can work it out”. Semin Immunol 50:101426
4. Navarro-Tovar G, Rocha-García D, Wong-Arce A et al (2018) Mesoporous silicon particles favor the induction of long-lived humoral responses in mice to a peptide-based vaccine. Materials (Basel) 11:1083. https://doi.org/10.3390/ma11071083
5. Jiménez-Periáñez A, Abos Gracia B, López-Relaño J et al (2013) Mesoporous silicon microparticles enhance MHC class I cross-antigen presentation by human dendritic cells. Clin Dev Immunol 2013:362163. https://doi.org/10.1155/2013/362163
6. Kupferschmidt N, Qazi KR, Kemi C et al (2014) Mesoporous silica particles potentiate antigen-specific T-cell responses. Nanomedicine 9:1835. https://doi.org/10.2217/NNM.13.170
7. Goscianska J, Olejnik A, Nowak I (2017) APTES-functionalized mesoporous silica as a vehicle for antipyrine – adsorption and release studies. Colloids Surf A Physicochem Eng Asp 533:187. https://doi.org/10.1016/j.colsurfa.2017.07.043
8. Hernández-Montelongo J, Muñoz-Noval A, García-Ruíz JP et al (2015) Nanostructured porous silicon: the winding road from photonics to cell scaffolds – a review. Front Bioeng Biotechnol 3:60
9. Muñoz-Noval Á, Sánchez-Vaquero V, Torres-Costa V et al (2011) Hybrid luminescent/magnetic nanostructured porous silicon particles for biomedical applications. J Biomed Opt 16:025002. https://doi.org/10.1117/1.3533321
10. Hernández MA (2017) Biofunctionalization of monocrystalline, nanostructured and macroporous silicon surfaces. PhD thesis

Part II Databases

Chapter 9

IEDB and CEDAR: Two Sibling Databases to Serve the Global Scientific Community

Nina Blazeska, Zeynep Kosaloglu-Yalcin, Randi Vita, Bjoern Peters, and Alessandro Sette

Abstract

Various methodologies have been utilized to analyze epitope-specific responses in the context of non-self antigens, such as those associated with infectious diseases and allergies, and in the context of self-antigens, such as those associated with transplantation, autoimmunity, and cancer. Further to this, epitope-specific data, and its associated immunological context, are crucial to training and developing predictive algorithms and pipelines for the development of specific vaccines and diagnostics. In this chapter, we describe the methodology utilized to derive two sibling resources, the Immune Epitope Database (IEDB) and Cancer Epitope Database and Analysis Resource (CEDAR), to specifically host this data and make it freely available to the scientific community.

Key words Epitope, Database, Infectious disease, Allergy, Autoimmunity, Transplant, Cancer

1 Introduction

The study of epitopes, the parts of an antigen that are recognized by the immune system to mount a response, has been part of immunological research for decades. Epitope data is relevant to understanding host-pathogen interactions and immunopathology, and to the development and evaluation of vaccines, diagnostics, and therapeutics. However, prior to 2003, there was no central repository to broadly access epitope-related information. To fill this gap, in 2003, the Immune Epitope Database (IEDB; iedb.org) [1] was established, providing free and open access to published epitope data and data generated in several large-scale initiatives sponsored by the National Institute of Allergy and Infectious Diseases (NIAID). The primary aim of the IEDB is to catalog experimental data on antibody and T cell epitopes. In addition to this data-warehouse function, the IEDB also developed an Analysis Resource (tools.

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_9, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


Nina Blazeska et al.

iedb.org), which provides users with a collection of T and B cell epitope prediction and analysis tools, and serves as a companion to the IEDB database. The scope of the IEDB is focused on epitopes recognized in humans, non-human primates, and other animal species in the context of infectious disease, allergy, autoimmunity, and transplantation.

Cancer epitope data is relevant to our understanding of the role that the immune system plays in cancer progression, prevention, and treatment. Epitopes from non-mutated antigens, recognized across different individuals, provide potential targets for more broadly applicable immunotherapies and have been extensively studied [2–31]. More recently, researchers have studied neoepitopes in the context of checkpoint blockade treatments and epitope-based vaccines, as well as the transfer of epitope-specific T cells and T cell receptors (TCRs) in personalized therapies [32–43].

Given the importance of cancer epitope research, the Cancer Epitope Database and Analysis Resource (CEDAR; cedar.iedb.org) [44] was initiated in 2021, through grant funding from the National Cancer Institute (NCI). CEDAR provides a central, freely accessible catalog of cancer epitope and receptor data linked to the biological, immunological, and clinical contexts in which they were described. It builds on almost two decades of technical and scientific knowledge obtained from the IEDB, and utilizes similar infrastructure and processes that have been adapted to the cancer research setting. CEDAR will also provide an Analysis Resource with cancer-specific prediction and analysis tools, which is currently under development. These tools will allow cancer researchers to identify potential candidates for further study, for example, by running predictions where the wild-type sequence and mutated position can be specified.

Overall, the IEDB and CEDAR are two sibling resources that, together, provide a comprehensive catalog of experimentally derived immune epitopes and the immunological context in which they were tested. The resources cover a multitude of research areas, namely, infectious disease, allergy, autoimmunity, transplant, and now cancer, and host tools specifically designed to facilitate the analysis of epitope data in different biological contexts. For computational vaccine design, these resources give bioinformaticians access to high-quality, curated data across the aforementioned areas of immunology, allowing researchers to train specific algorithms to better predict targets for possible therapies. Next, this chapter will describe the processes involved in collecting and querying IEDB data, and how these processes have been adapted for CEDAR.

IEDB & CEDAR: Two Sibling Databases

2 IEDB Data Curation

As of December 1, 2022, the IEDB contains over 2.1 million epitopes from 75,000 antigens, studied in the context of almost 6.5 million T cell, B cell, and MHC ligand assays from over 23,000 references. We also capture sequence information for over 183,000 T cell and B cell receptors. This data has been collected in two ways: (1) manual curation of peer-reviewed literature by a team of scientific curators, and (2) direct submissions to the IEDB by Epitope Discovery Contracts from NIAID. The largest proportion of data is obtained through curation of scientific literature; therefore, the next section explores this in more detail.

2.1 Criteria for Epitope Inclusion

First, in order for an epitope to be included in the IEDB, it must meet clearly defined criteria set by the IEDB team. We curate both linear and discontinuous peptides, and linear peptides must be less than 50 amino acids in length and tested as an immunogen or an antigen. We also curate non-peptidic epitopes, provided that they are mapped to a structure of less than 5000 Daltons in size. Finally, we require that the experiments related to each curated epitope contain minimal information, such as the epitope sequence or structure, the outcome of the experiment, its host, and a range of other immunological data. It is important to note that the IEDB curates only experimental data and does not include programmatically predicted epitopes. The primary challenge that the IEDB is presented with is that the data in scientific reports is dispersed throughout the paper in tables, figures, methods, and supplementary data. As a result, we require a well-defined data structure for consistent data entry, which is described in our curation process.
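The inclusion criteria above can be read as a simple rule set. The following sketch is an illustration only and not the IEDB's own validation code: the record fields, the three kind labels, and the function name are hypothetical, and only the thresholds (fewer than 50 residues, less than 5000 Da, experimental data only) come from the text.

```python
from dataclasses import dataclass
from typing import Optional

MAX_LINEAR_LENGTH = 50          # linear peptides must be shorter than this
MAX_NONPEPTIDIC_MASS_DA = 5000  # non-peptidic structures must be lighter than this


@dataclass
class Candidate:
    kind: str                         # "linear", "discontinuous", or "non_peptidic"
    experimental: bool                # False for purely predicted epitopes
    sequence: Optional[str] = None    # amino acid sequence, for linear peptides
    mass_da: Optional[float] = None   # molecular mass, for non-peptidic epitopes
    outcome: Optional[str] = None     # e.g. "positive" or "negative"
    host: Optional[str] = None        # e.g. "mouse"


def is_curatable(c: Candidate) -> bool:
    """Apply the inclusion rules described in the text to one candidate record."""
    if not c.experimental:
        return False                  # predicted epitopes are out of scope
    if c.outcome is None or c.host is None:
        return False                  # minimal experimental context is required
    if c.kind == "linear":
        return c.sequence is not None and 0 < len(c.sequence) < MAX_LINEAR_LENGTH
    if c.kind == "non_peptidic":
        return c.mass_da is not None and c.mass_da < MAX_NONPEPTIDIC_MASS_DA
    return c.kind == "discontinuous"


if __name__ == "__main__":
    example = Candidate("linear", True, sequence="SIINFEKL",
                        outcome="positive", host="mouse")
    print(is_curatable(example))
```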

2.2 Consistent Data Entry and Literature Curation

The general process of curation involves four key steps (Fig. 1) [45]:
(1) PubMed: a complex query is run biweekly to retrieve all possibly curatable references that contain epitope data.
(2) Classifier: these PubMed references are run through our classifier, which categorizes each paper based on its content (e.g., allergy) and classifies whether the paper is curatable by IEDB standards.
(3) Abstract Review: a senior immunologist manually reviews the abstract of each identified article and confirms its classification and categorization.
(4) Manual Curation: the identified references are assigned to the curation team to be completed, peer-reviewed by multiple members of the curation team, and finalized before publishing to the IEDB database.

To date, the IEDB has identified over 240,000 epitope-related publications in PubMed, which were reduced to 151,000 potentially curatable papers by the classification step. This is further refined to approximately 44,000 likely curatable references following the abstract review, enabling us to promote 23,400 references to the public IEDB database.

Fig. 1 High-level flow of the four primary steps in data curation and the associated number of references that have been identified in each stage of the curation workflow

The curation pipeline described above also results in a categorical breakdown of the curatable references, as shown in Fig. 2. Overall, infectious disease references account for approximately half of these references, and autoimmunity-related references account for about a quarter. Of the remaining references, 8% and 4% are related to allergy and transplantation, respectively, and finally, 10% are references generally within scope but not specifically related to any of the previously mentioned categories.

Fig. 2 Breakdown of curated references in the IEDB by category; infectious disease (53%), autoimmunity (25%), allergy (8%), transplant (4%), and other (10%), which includes cancer references

The data curation process [45] has been refined over the last two decades and involves a team of PhD-level curators who utilize their immunological experience to extract relevant epitope data from the various parts of the manuscript and input the information into a web form, the IEDB Curation Application [45]. The application creates a uniform structure for curation to ensure that all data is curated in a systematic manner by the team, and has built-in validation rules, which are triggered and resolved prior to curation submission. We have also defined a set of rules and guidelines in our curation manual [46], which is freely accessible to the broader scientific community. This further underpins our ability to capture data in a consistent way, enabling the downstream querying and understanding of the information.

For data curation, we rely heavily on collaborations with various ontologies, or semantic standards, that are available in the scientific community, such as the National Center for Biotechnology Information (NCBI) [47] and Ontology for Biomedical Investigations (OBI) [48]. Through utilizing existing ontologies (Fig. 3), we ensure that the IEDB uses a common language understandable by all members of the broader scientific community. This provides standardized nomenclature, definitions, synonyms, and hierarchical relationships, which we use in the development of our finders. It further ensures that the IEDB data remains FAIR: Findable, Accessible, Interoperable, and Reusable [49]. The graphical representation of the hierarchy streamlines curation, enhances the user experience, ensures accuracy as errors are more easily identifiable, and facilitates interoperability with other resources.

Fig. 3 Schematic of data that can be found in the IEDB and the underlying ontology that is used to aid in cross-research semantic understanding
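To make the scale of the filtering concrete, the reference counts quoted for each stage of the pipeline can be turned into retention fractions. The sketch below uses the rounded figures from the text (several of which are given as "over" or "approximately"), so the fractions are indicative only; the variable and function names are hypothetical.

```python
# Rounded reference counts quoted in the text for each stage of the pipeline.
FUNNEL = [
    ("PubMed query", 240_000),
    ("Classifier", 151_000),
    ("Abstract review", 44_000),
    ("Published in IEDB", 23_400),
]

# Category shares of the curated references (Fig. 2), in percent.
CATEGORY_PERCENT = {
    "infectious disease": 53,
    "autoimmunity": 25,
    "allergy": 8,
    "transplant": 4,
    "other": 10,
}


def stage_retention(funnel):
    """Fraction of references kept by each stage relative to the previous one."""
    return [
        (funnel[i][0], funnel[i][1] / funnel[i - 1][1])
        for i in range(1, len(funnel))
    ]


def overall_yield(funnel):
    """Fraction of the initial PubMed hits that end up published."""
    return funnel[-1][1] / funnel[0][1]


if __name__ == "__main__":
    for stage, fraction in stage_retention(FUNNEL):
        print(f"{stage}: {fraction:.1%} of previous stage")
    print(f"overall: {overall_yield(FUNNEL):.2%}")
```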

3 Querying the IEDB Data

The IEDB database can be accessed at iedb.org, which presents a web interface where users can search for epitope data of their choice. The home page (Fig. 4) has been designed such that 98% of possible user queries can be run using discrete parameters across six areas: (1) Epitope, (2) Assay, (3) Epitope Source, (4) MHC Restriction, (5) Host, and (6) Disease. The home page also contains a summary of IEDB metrics and upcoming events and news on the left-most panel, and access points to the various epitope prediction and analysis tools on the right-most panel. In 2021, the IEDB also made available an application programming interface (API), called the IEDB Query API (IQ-API) [50], enabling bioinformaticians to programmatically query the database using a multitude of endpoints (see Note 1). However, given the high usage of the IEDB web page, we will next demonstrate how a user can design a research query via the IEDB website and refine it on the results page.

In the following example, we are interested in determining how many epitopes have been reported that are derived from the influenza virus (Epitope Source: Organism = Influenza virus) and recognized in a mouse host (Host: Mouse) (Fig. 4). We further restrict our query to epitopes tested in T cell assays (Assay: T Cell) where a positive response was detected (Assay: Outcome = positive). We also decide to include any MHC restriction in our query parameters (MHC Restriction: Any). Once the query has been set up using the various checkboxes and radio buttons, the user can push the “Search” button.
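The structure of such a query can be sketched as a small record with one field per home-page area, where an empty field means "Any". The code below is a conceptual illustration applied to toy in-memory records; it does not call the IEDB website or the IQ-API, and the class, field names, and record layout are all hypothetical.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class HomePageQuery:
    # One field per search area; None means "Any".
    epitope: Optional[str] = None
    assay: Optional[str] = None
    outcome: Optional[str] = None
    source_organism: Optional[str] = None
    mhc_restriction: Optional[str] = None
    host: Optional[str] = None
    disease: Optional[str] = None


def matches(record: dict, query: HomePageQuery) -> bool:
    """True if the record agrees with every field the query actually sets."""
    for name, wanted in asdict(query).items():
        if wanted is not None and record.get(name) != wanted:
            return False
    return True


# The example query from the text: influenza virus epitopes, mouse host,
# positive T cell assays, any MHC restriction.
QUERY = HomePageQuery(assay="T cell", outcome="positive",
                      source_organism="Influenza virus", host="Mouse")

# Toy records, invented for illustration only.
RECORDS = [
    {"source_organism": "Influenza virus", "host": "Mouse", "assay": "T cell", "outcome": "positive"},
    {"source_organism": "Influenza virus", "host": "Mouse", "assay": "B cell", "outcome": "positive"},
    {"source_organism": "Influenza virus", "host": "Human", "assay": "T cell", "outcome": "positive"},
    {"source_organism": "Influenza virus", "host": "Mouse", "assay": "T cell", "outcome": "negative"},
]

if __name__ == "__main__":
    hits = [r for r in RECORDS if matches(r, QUERY)]
    print(len(hits))
```

Leaving a field unset rather than listing every allowed value mirrors the "Any" radio buttons on the home page.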

4 Navigating the Results of IEDB Queries

Upon executing a query, the results page will display all the available data relevant to the query parameters used. Figure 5A displays the original query parameter inputs, which can be removed from the pane (using the red “x”) and the query re-run. Figure 5B displays the available results across five key tabs: (1) Epitopes, (2) Antigens, (3) Assays, (4) Receptors, and (5) References. The assays are further subdivided into the T cell, B cell, and MHC ligand assays, and the receptors are subdivided into T cell receptors and B cell receptors. Figure 5C provides users the ability to further refine their query and obtain new results with greater granularity. Given that T cell assays were selected in our initial query, the Filter Options have automatically selected the T cell view (Fig. 5C), which provides T cell-specific search options, such as cytokine production and MHC multimer assays in a quick-search format. Similarly, if B cell assays had been selected in the initial search, a B cell-specific Filter Option would automatically be selected.


Fig. 4 Selected query parameters on the IEDB home page, searching for epitopes that have been tested in the context of the influenza virus in mouse hosts

From the results page, users are able to investigate the available data and export the results in a spreadsheet format using the “Export Results” option in the top right corner. Overall, this query yields 958 epitopes from 15 antigens, curated from 518 references. To illustrate the IEDB’s advanced filtering capabilities, we can further narrow the query to identify epitopes that have been shown to be protective in a mouse model system. To do this utilizing our existing query, we access the assay finder in the “T Cell Assay” search panel on the results page and select assays where epitopes tested have in vivo activity, either through decreased disease, adoptive transfer assays, or challenge assays. This results in just over 40 different epitopes (Fig. 6) experimentally tested in mice associated with an in vivo effect. This highlights the power of the IEDB and its ability to query and refine one’s search results.
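The same refinement can be reproduced offline on an exported spreadsheet. The sketch below is a minimal illustration that assumes the export has been saved as CSV text with one row per assay; the column names (epitope, assay_type) and the toy rows are hypothetical and do not reflect the actual export layout.

```python
import csv
import io

# Assay categories named in the text as evidence of in vivo activity.
IN_VIVO_ASSAYS = {"decreased disease", "adoptive transfer", "challenge"}


def protective_epitopes(export_text: str) -> set:
    """Distinct epitopes with at least one in vivo assay in the exported rows."""
    reader = csv.DictReader(io.StringIO(export_text))
    return {
        row["epitope"]
        for row in reader
        if row["assay_type"].strip().lower() in IN_VIVO_ASSAYS
    }


# Toy export, invented for illustration only.
TOY_EXPORT = """epitope,assay_type
PEPTIDEA,cytokine release
PEPTIDEA,challenge
PEPTIDEB,Decreased disease
PEPTIDEC,MHC multimer
PEPTIDEB,adoptive transfer
"""

if __name__ == "__main__":
    print(sorted(protective_epitopes(TOY_EXPORT)))
```

Counting distinct epitopes rather than rows matters here, because a single epitope is often tested in several assays.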

Fig. 5 IEDB results displayed following the home page query for influenza virus epitopes found in mouse hosts. Panel A displays the original query parameters, which can be removed from the pane and re-run. Panel B displays the available results across five key tabs: (1) Epitopes, (2) Antigens, (3) Assays, (4) Receptors, and (5) References. Panel C provides users the ability to further refine their query and obtain new results with greater granularity

Fig. 6 Epitopes that have been tested in the context of the influenza virus in mouse hosts, where experiments have utilized decreased disease, adoptive transfer, and challenge assays

5 IEDB Community Outreach

Community outreach has been pivotal to the overall success of the IEDB in two main respects. First, the true value of the IEDB is best measured by the extent that it helps the user community of academic and applied scientists in their work. In this respect, the outreach and promotion efforts, such as exhibitor booths and conference talks, are crucial as they build awareness of the IEDB and keep the community apprised as the resource evolves. Second, the IEDB grows best when there is constant input from the scientific community, in particular, critical feedback on each of its key components, ranging from curation, query, and reporting to the interface of the web portal and the nature and utility of the tools provided in the Analysis Resource.

The IEDB has witnessed continued growth in its usage, as shown by two key metrics: (1) annual citations and (2) monthly site visits. From 2004 to 2021, we have performed citation analyses based on IEDB-published literature, where we identify articles that have cited the IEDB (but, importantly, exclude self-citations from IEDB publications). In 2021, the IEDB received a total of 4432 individual citations, which can be broken into 3880 formal citations and 552 in-line citations (Fig. 7). This number has continued to grow year-on-year, highlighting the continued relevance and usefulness of the IEDB to ongoing scientific research. Additionally, we monitor IEDB website traffic on a monthly basis, excluding traffic from the IEDB team, which shows over 32,000 monthly visitors to the IEDB database and Analysis Resource, based on current 2022 data (Fig. 8). Overall, this exemplifies the continued utilization of the IEDB resource and growth in the user base since its inception.

Fig. 7 Literature citations to the IEDB; formal citations have cited an IEDB paper, while in-line citations refer to instances where the IEDB has been mentioned in a paper but not formally cited

Fig. 8 Median annual visits per month to the IEDB database and Analysis Resource based on metrics obtained from Google Analytics. Data for 2022 is based on usage from January to October only

6 Adapting IEDB Processes to CEDAR

At the inception of CEDAR, the curation pipeline described in the context of the IEDB identified approximately 3500 potentially curatable papers in our tracking system that were related to cancer studies. In 2021, the curation of cancer epitope publications commenced, starting with two fundamentally different epitope categories, namely the neoepitope category (Fig. 9), encompassing mutated epitopes from a variety of different cellular protein antigens, and prostate epitopes (Fig. 10), which are non-mutated and derived from a well-defined subset of cancer-associated antigens. Figures 9 and 10 highlight the progress of curation for both categories, where the yellow line represents finalized papers that have been published in the CEDAR database. As of December 2022, the neoepitope category is approximately 75% complete, while the prostate cancer category is in maintenance mode, with over 95% of references in that category curated and published to CEDAR.

The development of CEDAR capitalizes on processes for curation and database design established in the context of the IEDB over the last two decades. Importantly, however, we also implemented a series of additional procedures to ensure that CEDAR was built-for-purpose with the cancer research community in mind (Fig. 11). Firstly, cancer experts were interviewed to better understand the data and search parameters that would be most essential to their work. This led us to establish new fields, such as cancer-associated antigens (e.g., neoantigen and viral antigen) on the home page, and to update the curation rules to capture, for example, vaccination type (e.g., prophylactic or therapeutic). These steps were essential in designing a cancer-specific search interface on the database home page. A late alpha version of the CEDAR database can be accessed at cedar.iedb.org and will be further refined as needed in response to user feedback (see Note 2).


Fig. 9 Metrics tracking the number of total finalized neoepitope references for curation (yellow line), the outstanding references (red line), the addition of new references (green line), and output of references on a biweekly basis (blue line)

Fig. 10 Metrics tracking the number of total finalized prostate cancer references for curation (yellow line), the outstanding references (red line), the addition of new references (green line), and output of references on a biweekly basis (blue line)

7 Searching the CEDAR Database

To illustrate CEDAR’s functionality, we run an example query from the home page, similar to the IEDB. In the example shown in Fig. 11, we search for all epitopes from the prostate-specific antigen (PSA), where experiments have utilized T cell assays, resulting in both positive and negative assay outcomes in human hosts. We also want to ensure that the T cell assays are restricted to MHC class I, and that the disease identified is naturally occurring (rather than an animal model or induced by vaccination). Once the parameters are selected using the various radio buttons and checkboxes, the user can run the query with the “Search” button. Similar to the IEDB, users are then presented with the results page (Fig. 12), displaying all the relevant epitopes, antigens, assays, receptors, and references related to that query. As seen previously, the query can be further refined using the Filter Options, and the data extracted through the “Export Results” feature (see Note 3). Overall, CEDAR maintains the functionality and usability of the IEDB, but it employs a cancer-specific search interface with data that has been carefully curated from peer-reviewed cancer publications.
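The refinement described above can also be reproduced offline on an exported results file. The following is a minimal sketch, assuming a CSV export; the column names and the sample rows are hypothetical placeholders rather than the actual CEDAR export schema, which should be checked against a real export before use.

```python
import csv
import io

# Hypothetical column names; a real "Export Results" CSV may label these differently.
ANTIGEN_COL = "Antigen Name"
HOST_COL = "Host"
MHC_CLASS_COL = "MHC Class"
OUTCOME_COL = "Qualitative Measure"


def filter_export(csv_text, antigen, host, mhc_class):
    """Return rows of an exported results table matching the given criteria."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if antigen.lower() in row[ANTIGEN_COL].lower()
        and row[HOST_COL].lower() == host.lower()
        and row[MHC_CLASS_COL] == mhc_class
    ]


def outcome_counts(rows):
    """Count rows per assay outcome (e.g., positive vs. negative)."""
    counts = {}
    for row in rows:
        counts[row[OUTCOME_COL]] = counts.get(row[OUTCOME_COL], 0) + 1
    return counts


# Toy stand-in for an exported file; the values are made up for illustration.
SAMPLE = """Antigen Name,Host,MHC Class,Qualitative Measure
Prostate-specific antigen,Homo sapiens,I,Positive
Prostate-specific antigen,Homo sapiens,I,Negative
Prostate-specific antigen,Homo sapiens,II,Positive
Prostate-specific antigen,Mus musculus,I,Positive
"""

rows = filter_export(SAMPLE, "prostate-specific antigen", "Homo sapiens", "I")
print(len(rows), outcome_counts(rows))
```

Keeping both positive and negative outcomes in the filtered rows mirrors the example query, which deliberately retains negative assay results.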

Fig. 11 Example query performed in the CEDAR database, searching for all epitopes from the prostate-specific antigen (PSA), where experiments have utilized T cell assays, resulting in both positive and negative assay outcomes in human hosts


Fig. 12 Results of the example query performed in the CEDAR database, which can be further refined using the Filter Options

8 Two Sibling Resources

Overall, the IEDB is a cross-functional resource that aids researchers and bioinformaticians alike in expanding their understanding of infectious disease, allergy, autoimmunity, and transplant studies. With almost two decades of experience in epitope data curation, in the development and maintenance of bioinformatics resources, and in focusing on the needs of the research community, we have been able to use the IEDB as a starting point for the development of its sibling resource, CEDAR. We have built upon the solid IEDB foundation and adapted it to the unique needs of the cancer research community, through the curation of cancer epitope references and the introduction of cancer-specific search parameters. Both the IEDB and CEDAR databases are built on principles of usability, with 98% of queries executable from the home page web interface and open to further refinement once the results are displayed. Together, the IEDB and CEDAR form a powerful set of sibling resources for the scientific community in advancing our understanding of epitopes and identifying potential candidates for vaccines and therapies.

9 Notes

1. In 2021, the IEDB made it possible for users to programmatically query the database using an Application Programming Interface (API). With a multitude of endpoints available, users can complete most queries available from the IEDB home page and work with the data directly in their preferred environment. The IEDB Query API (IQ-API) is still in its beta phase and continues to undergo iterative updates with feedback from users. However, its current format is already very powerful for bioinformatic scientists. The IQ-API is built upon a PostgREST platform that allows for transparent access to the Postgres tables on the backend. Each table can be queried through individual endpoints that are described in our interactive Swagger documentation [50]. There are plans to extend the IQ-API to the CEDAR database, but this likely will not be fully functional until 2024.

2. The CEDAR database is in its late alpha version, which means that the database is still actively under development, though in a usable state. The team is open to feedback and suggestions from the cancer research community, which can be shared via email at [email protected]. We expect the database to undergo further minor iterative updates through to the end of 2023, by which time the database features should stabilize. The CEDAR tools are actively under development and will utilize the IEDB Analysis Resource as their foundation. We expect the first iteration of the Analysis Resource for CEDAR to be live in early 2023, after which time we will make sequential updates to include cancer-specific tool features. For example, the MHC class I tool suite will be available in early 2023, which will be followed by an update to allow users to specify the wild-type sequence and mutant sequence. With each successive Analysis Resource release, we expect the utility to increase substantially for scientific researchers. Finally, given that data curation remains a manual process, we expect this to continue to the completion of our CEDAR grant in 2025 as we work through each cancer category by publication year.

3. In order to accelerate computational vaccine design, the IEDB and CEDAR provide an easy-to-export option for all data available per the user’s query parameters. On the results page of each resource, users can select “Export Results,” which will export all data in a CSV format for further analysis. It is important to note that the exports contain many more data fields than are shown in the IEDB or CEDAR results pages. This is by design, as it further contextualizes the available information and allows users to conduct their own specific analysis. That said, the IEDB team is currently working on improving the export function, which is expected to be available in early- to mid-2023. The new exports will allow users to select their fields of choice and the export format (e.g., TSV, Excel), enabling a much more customized experience for researchers. Finally, we have also provided complete database exports for the IEDB, which can be accessed via the “More IEDB” dropdown and “Database Export.” This provides users a simple way to extract all available IEDB data and complete their own programmatic analysis, without running a specific query. This feature will also be made available in CEDAR in 2023.
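The programmatic access described in Note 1 can be sketched as follows. Because the IQ-API is built on PostgREST, a query is an HTTP GET request whose parameters follow the PostgREST convention of `column=operator.value`, with the reserved `select` and `limit` parameters. The base URL below is assumed from the Swagger documentation address cited as [50], and the endpoint and column names are hypothetical placeholders; the Swagger documentation remains the authority on which tables and columns actually exist. The sketch only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

# Assumed base URL, derived from the Swagger documentation address cited as [50].
BASE_URL = "https://query-api.iedb.org"


def build_query_url(endpoint, filters, select=None, limit=None):
    """Build a PostgREST-style GET URL.

    PostgREST expresses a filter as `column=operator.value` (for example
    `eq.` for equality) and uses the reserved `select` and `limit`
    parameters to choose columns and cap the number of rows.
    """
    params = [(col, f"{op}.{val}") for col, (op, val) in filters.items()]
    if select:
        params.append(("select", ",".join(select)))
    if limit is not None:
        params.append(("limit", str(limit)))
    # Keep commas readable in the select list instead of percent-encoding them.
    return f"{BASE_URL}/{endpoint}?{urlencode(params, safe=',')}"


# Hypothetical endpoint and column names, for illustration only.
url = build_query_url(
    "epitope_search",
    {"host_organism_name": ("eq", "mouse")},
    select=["structure_id", "linear_sequence"],
    limit=10,
)
print(url)
```

The resulting URL can be passed to any HTTP client; restricting columns with `select` and rows with `limit` keeps exploratory requests small while a query is being developed.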

Acknowledgment and Funding

We wish to thank and acknowledge the IEDB and CEDAR teams. Research reported in this book chapter was supported by the National Institutes of Health contract 75N93019C00001 and grant U24CA248138.

References

1. Vita R, Mahajan S, Overton JA, Dhanda SK, Martini S, Cantrell JR, Wheeler DK, Sette A, Peters B (2019) The Immune Epitope Database (IEDB): 2018 update. Nucleic Acids Res 47(D1):D339–D343. https://doi.org/10.1093/nar/gky1006
2. Lloyd KO, Old LJ (1989) Human monoclonal antibodies to glycolipids and other carbohydrate antigens: dissection of the humoral immune response in cancer patients. Cancer Res 49(13):3445–3451
3. Boon T, De Plaen E, Lurquin C, Van den Eynde B, van der Bruggen P, Traversari C, Amar-Costesec A, Van Pel A (1992) Identification of tumour rejection antigens recognized by T lymphocytes. Cancer Surv 13:23–37
4. Finn OJ, Jerome KR, Henderson RA, Pecher G, Domenech N, Magarian-Blander J, Barratt-Boyes SM (1995) MUC-1 epithelial tumor mucin-based immunity and cancer vaccines. Immunol Rev 145:61–89. https://doi.org/10.1111/j.1600-065x.1995.tb00077.x
5. Slingluff CL Jr, Hunt DF, Engelhard VH (1994) Direct analysis of tumor-associated peptide antigens. Curr Opin Immunol 6(5):733–740. https://doi.org/10.1016/0952-7915(94)90077-9
6. Abrams SI, Hand PH, Tsang KY, Schlom J (1996) Mutant ras epitopes as targets for cancer vaccines. Semin Oncol 23(1):118–134
7. Livingston PO (1995) Augmenting the immunogenicity of carbohydrate tumor antigens. Semin Cancer Biol 6(6):357–366. https://doi.org/10.1016/1044-579x(95)90005-5
8. Melief CJ, Offringa R, Toes RE, Kast WM (1996) Peptide-based cancer vaccines. Curr Opin Immunol 8(5):651–657. https://doi.org/10.1016/s0952-7915(96)80081-1
9. Celis E, Sette A, Grey HM (1995) Epitope selection and development of peptide based vaccines to treat cancer. Semin Cancer Biol 6(6):329–336. https://doi.org/10.1016/1044-579x(95)90002-0
10. Robbins PF, Kawakami Y (1996) Human tumor antigens recognized by T cells. Curr Opin Immunol 8(5):628–636. https://doi.org/10.1016/s0952-7915(96)80078-1
11. Wang RF, Rosenberg SA (1999) Human tumor antigens for cancer vaccine development. Immunol Rev 170:85–100. https://doi.org/10.1111/j.1600-065x.1999.tb01331.x
12. Thomas AK, June CH (2001) The promise of T-lymphocyte immunotherapy for the treatment of malignant disease. Cancer J 7(Suppl 2):S67–S75
13. Holmberg LA, Sandmaier BM (2001) Theratope vaccine (STn-KLH). Expert Opin Biol Ther 1(5):881–891. https://doi.org/10.1517/14712598.1.5.881
14. Rammensee HG, Weinschenk T, Gouttefangeas C, Stevanovic S (2002) Towards patient-specific tumor antigen selection for vaccination. Immunol Rev 188:164–176. https://doi.org/10.1034/j.1600-065x.2002.18815.x
15. Van Der Bruggen P, Zhang Y, Chaux P, Stroobant V, Panichelli C, Schultz ES, Chapiro J, Van Den Eynde BJ, Brasseur F, Boon T (2002) Tumor-specific shared antigenic peptides recognized by human T cells. Immunol Rev 188:51–64. https://doi.org/10.1034/j.1600-065x.2002.18806.x
16. Khong HT, Restifo NP (2002) Natural selection of tumor variants in the generation of “tumor escape” phenotypes. Nat Immunol 3(11):999–1005. https://doi.org/10.1038/ni1102-999
17. Stevanovic S (2002) Identification of tumour-associated T-cell epitopes for vaccine development. Nat Rev Cancer 2(7):514–520. https://doi.org/10.1038/nrc841
18. Schumacher TN (2002) T-cell-receptor gene therapy. Nat Rev Immunol 2(7):512–519. https://doi.org/10.1038/nri841
19. Vlad AM, Kettel JC, Alajez NM, Carlos CA, Finn OJ (2004) MUC1 immunobiology: from discovery to clinical applications. Adv Immunol 82:249–293. https://doi.org/10.1016/S0065-2776(04)82006-6
20. Novellino L, Castelli C, Parmiani G (2005) A listing of human tumor antigens recognized by T cells: March 2004 update. Cancer Immunol Immunother 54(3):187–207. https://doi.org/10.1007/s00262-004-0560-6
21. Knutson KL, Disis ML (2005) Tumor antigen-specific T helper cells in cancer immunity and immunotherapy. Cancer Immunol Immunother 54(8):721–728. https://doi.org/10.1007/s00262-004-0653-2
22. DeLeo AB, Whiteside TL (2008) Development of multi-epitope vaccines targeting wild-type sequence p53 peptides. Expert Rev Vaccines 7(7):1031–1040. https://doi.org/10.1586/14760584.7.7.1031
23. Schietinger A, Philip M, Schreiber H (2008) Specificity in cancer immunotherapy. Semin Immunol 20(5):276–285. https://doi.org/10.1016/j.smim.2008.07.001
24. Mittendorf EA, Holmes JP, Ponniah S, Peoples GE (2008) The E75 HER2/neu peptide vaccine. Cancer Immunol Immunother 57(10):1511–1521. https://doi.org/10.1007/s00262-008-0540-3
25. Overwijk WW, Wang E, Marincola FM, Rammensee HG, Restifo NP (2013) Mining the mutanome: developing highly personalized immunotherapies based on mutational analysis of tumors. J Immunother Cancer 1:11. https://doi.org/10.1186/2051-1426-1-11
26. Heemskerk B, Kvistborg P, Schumacher TN (2013) The cancer antigenome. EMBO J 32(2):194–203. https://doi.org/10.1038/emboj.2012.333
27. Hinrichs CS, Restifo NP (2013) Reassessing target antigens for adoptive T-cell therapy. Nat Biotechnol 31(11):999–1008. https://doi.org/10.1038/nbt.2725
28. Trajanoski Z, Maccalli C, Mennonna D, Casorati G, Parmiani G, Dellabona P (2015) Somatically mutated tumor antigens in the quest for a more efficacious patient-oriented immunotherapy of cancer. Cancer Immunol Immunother 64(1):99–104. https://doi.org/10.1007/s00262-014-1599-7
29. Gubin MM, Artyomov MN, Mardis ER, Schreiber RD (2015) Tumor neoantigens: building a framework for personalized cancer immunotherapy. J Clin Invest 125(9):3413–3421. https://doi.org/10.1172/JCI80008
30. Blankenstein T, Leisegang M, Uckert W, Schreiber H (2015) Targeting cancer-specific mutations by T cell receptor gene therapy. Curr Opin Immunol 33:112–119. https://doi.org/10.1016/j.coi.2015.02.005
31. van der Burg SH, Arens R, Ossendorp F, van Hall T, Melief CJ (2016) Vaccines for established cancer: overcoming the challenges posed by immune evasion. Nat Rev Cancer 16(4):219–233. https://doi.org/10.1038/nrc.2016.16
32. Schumacher TN, Scheper W, Kvistborg P (2019) Cancer neoantigens. Annu Rev Immunol 37:173–200. https://doi.org/10.1146/annurev-immunol-042617-053402
33. Vormehr M, Tureci O, Sahin U (2019) Harnessing tumor mutations for truly individualized cancer vaccines. Annu Rev Med 70:395–407. https://doi.org/10.1146/annurev-med-042617-101816
34. Guedan S, Ruella M, June CH (2019) Emerging cellular therapies for cancer. Annu Rev Immunol 37:145–171. https://doi.org/10.1146/annurev-immunol-042718-041407
35. Curran MA, Glisson BS (2019) New hope for therapeutic cancer vaccines in the era of immune checkpoint modulation. Annu Rev Med 70:409–424. https://doi.org/10.1146/annurev-med-050217-121900
36. Yee C, Lizee GA (2017) Personalized therapy: tumor antigen discovery for adoptive cellular therapy. Cancer J 23(2):144–148. https://doi.org/10.1097/PPO.0000000000000255
37. Villani AC, Sarkizova S, Hacohen N (2018) Systems immunology: learning the rules of the immune system. Annu Rev Immunol 36:813–842. https://doi.org/10.1146/annurev-immunol-042617-053035
38. Brightman SE, Naradikian MS, Miller AM, Schoenberger SP (2020) Harnessing neoantigen specific CD4 T cells for cancer immunotherapy. J Leukoc Biol 107(4):625–633. https://doi.org/10.1002/JLB.5RI0220-603RR
39. Roudko V, Greenbaum B, Bhardwaj N (2020) Computational prediction and validation of tumor-associated neoantigens. Front Immunol 11:27. https://doi.org/10.3389/fimmu.2020.00027
40. Yamamoto TN, Kishton RJ, Restifo NP (2019) Developing neoantigen-targeted T cell-based treatments for solid tumors. Nat Med 25(10):1488–1499. https://doi.org/10.1038/s41591-019-0596-y
41. Kaumaya PT (2020) B-cell epitope peptide cancer vaccines: a new paradigm for combination immunotherapies with novel checkpoint peptide vaccine. Future Oncol 16(23):1767–1791. https://doi.org/10.2217/fon-2020-0224
42. Poorebrahim M, Mohammadkhani N, Mahmoudi R, Gholizadeh M, Fakhr E, Cid-Arregui A (2021) TCR-like CARs and TCR-CARs targeting neoepitopes: an emerging potential. Cancer Gene Ther 28(6):581–589. https://doi.org/10.1038/s41417-021-00307-7
43. Pearlman AH, Hwang MS, Konig MF, Hsiue EH, Douglass J, DiNapoli SR, Mog BJ, Bettegowda C, Pardoll DM, Gabelli SB, Papadopoulos N, Kinzler KW, Vogelstein B, Zhou S (2021) Targeting public neoantigens for cancer immunotherapy. Nat Cancer 2(5):487–497. https://doi.org/10.1038/s43018-021-00210-y
44. Kosaloglu-Yalcin Z, Blazeska N, Vita R, Carter H, Nielsen M, Schoenberger S, Sette A, Peters B (2022) The cancer epitope database and analysis resource (CEDAR). Nucleic Acids Res 51:D845. https://doi.org/10.1093/nar/gkac902
45. Salimi N, Edwards L, Foos G, Greenbaum JA, Martini S, Reardon B, Shackelford D, Vita R, Zalman L, Peters B, Sette A (2020) A behind-the-scenes tour of the IEDB curation process: an optimized process empirically integrating automation and human curation efforts. Immunology 161(2):139–147. https://doi.org/10.1111/imm.13234
46. IEDB (2007) Curation Manual 2.0. http://curationwiki.iedb.org
47. Federhen S (2012) The NCBI Taxonomy database. Nucleic Acids Res 40(Database issue):D136–D143. https://doi.org/10.1093/nar/gkr1178
48. Bandrowski A, Brinkman R, Brochhausen M, Brush MH, Bug B, Chibucos MC, Clancy K, Courtot M, Derom D, Dumontier M, Fan L, Fostel J, Fragoso G, Gibson F, Gonzalez-Beltran A, Haendel MA, He Y, Heiskanen M, Hernandez-Boussard T, Jensen M, Lin Y, Lister AL, Lord P, Malone J, Manduchi E, McGee M, Morrison N, Overton JA, Parkinson H, Peters B, Rocca-Serra P, Ruttenberg A, Sansone SA, Scheuermann RH, Schober D, Smith B, Soldatova LN, Stoeckert CJ Jr, Taylor CF, Torniai C, Turner JA, Vita R, Whetzel PL, Zheng J (2016) The ontology for biomedical investigations. PLoS One 11(4):e0154556. https://doi.org/10.1371/journal.pone.0154556
49. Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten JW, da Silva Santos LB, Bourne PE, Bouwman J, Brookes AJ, Clark T, Crosas M, Dillo I, Dumon O, Edmunds S, Evelo CT, Finkers R, Gonzalez-Beltran A, Gray AJ, Groth P, Goble C, Grethe JS, Heringa J, t Hoen PA, Hooft R, Kuhn T, Kok R, Kok J, Lusher SJ, Martone ME, Mons A, Packer AL, Persson B, Rocca-Serra P, Roos M, van Schaik R, Sansone SA, Schultes E, Sengstag T, Slater T, Strawn G, Swertz MA, Thompson M, van der Lei J, van Mulligen E, Velterop J, Waagmeester A, Wittenburg P, Wolstencroft K, Zhao J, Mons B (2016) The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3:160018. https://doi.org/10.1038/sdata.2016.18
50. IEDB Query API (IQ-API). https://query-api.iedb.org/docs/swagger/#/

Chapter 10

Updates on Databases of Allergens and Allergen-Epitopes

Rajat Kanti Sarkar, Nandini Ghosh, Gaurab Sircar, and Sudipto Saha

Abstract

The increasing prevalence of allergic diseases is of great public health concern. Environmental and food allergens are the major triggers of allergic diseases via respiratory or gastrointestinal routes, respectively. A major setback in the clinical management of allergies is the unavailability of purified allergens required for diagnostic purposes. Furthermore, manipulation of allergen sequences and structures by employing protein-engineering approaches is needed to design immunotherapeutic vaccines. All these approaches rely upon the sequence, structure, and epitope location of allergens. A number of databases have therefore been developed that serve as repositories of molecular information on allergens. In this chapter, we discuss five widely used allergen databases that may be helpful for the research community working on molecular allergology.

Key words Database, Allergen, IgE-epitope, IgE-antibody, Cross-reactivity, Bioinformatics

1 Introduction

Allergens are usually proteins or glycoproteins that are capable of triggering an abnormal immune response called hypersensitivity [1]. Allergens can sensitize atopic patients through the mucosal (respiratory and gastrointestinal) route or the dermal route. To date, a few hundred allergens have been discovered from various sources such as pollen grains, fungi, various foods, insects, and dust mites [2]. By employing various immunobiochemical methods, the antigenic determinants of a number of these allergens have also been identified [3]. In the last few decades, a great deal of molecular information on allergens has become available, such as allergen-encoding cDNA, the primary sequence of the allergens, three-dimensional structures of allergens, conformational and linear IgE-epitopes, antigenically active carbohydrate moieties, T cell epitopes, and cross-reactivity at both the IgE and the T cell level [4]. This information needs to be systematically available to the research community. The allergen database of WHO/IUIS is the official repository of allergens [5]. In addition to this, several other databases are also available that store comprehensive information on allergens and allergen-epitopes. Here, we have discussed the latest updates on six databases available for allergens and allergen-epitopes. The databases that store allergen information can be classified into two major types. Some databases are dedicated exclusively to allergens, whereas others store exhaustive immunological information covering allergens as well as pathogenic antigens. In this chapter, we will cover the description and usage of five extensively used allergen databases, namely, the WHO/IUIS allergen database, AllFam database, AllergenOnline database, SDAP, and AllerBase.

Rajat Kanti Sarkar and Nandini Ghosh contributed equally with all other contributors. Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_10, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

2 Materials and Methods

2.1 WHO/IUIS Allergen Sub-Committee Database

2.1.1 Description

The WHO/IUIS allergen sub-committee provides a distinctive and organized nomenclature system of well-characterized allergens and deposits them in a database, which is available at http://www.allergen.org/. This database includes all the allergenic molecules officially approved by WHO/IUIS [6]. The sub-committee is composed of leading experts in molecular allergology and bioinformatics. The main menus are interconnected and are located at the top of the home page. A brief description of each menu is as follows.

Home: This menu connects to the home page of the WHO/IUIS Allergen Nomenclature Sub-Committee database. The home page includes a brief account of the database and permits users to search either by allergen name or by the source of the allergen.
Search: This menu connects to the query page, which allows users to search by (1) allergen name, (2) allergen source (common or scientific name), (3) major taxonomic group, selected from a dropdown box (e.g., Fungi Basidiomycota) and further filterable by order from a second dropdown menu, and (4) limiting the search to all, food, airborne, contact, injection, and unknown allergens by their biochemical name.
TreeView: This menu connects to the “Tree view” page, which has an upgradable list of allergens with official names categorized based on the Linnaean system, viz. kingdoms Fungi, Plantae, and Animalia. Each kingdom is subdivided into hyperlinked orders which open up a list of allergen-containing organisms.
Publications: It comprises the publication list of allergen nomenclature.

Carbohydrate Epitopes: This menu provides information about the glycan epitopes that are currently recognized as targets for IgE-antibodies.
Executive Committee: This menu comprises the list of names of the WHO/IUIS executive committee members along with their affiliations.
Submission Form: This page permits researchers to submit information on newly discovered allergens to the database.
Log-In: This page permits members to sign/log in to the WHO/IUIS database (see Note 1).

2.1.2 Usage

1. Users can search the allergen information by typing the name of either the allergen or the allergen source directly in the search bar located on the home page (see Note 2).
2. Alternatively, users can go to the search option and select the major taxonomic group (e.g., Plantae Magnoliopsida) from the dropdown box to obtain a list of allergens from all the reported organisms belonging to that group. This list can be filtered by taxonomic order (e.g., Solanales) provided in a second dropdown. Figure 1a shows an example of such an allergen list from the order Saccharomycetales (group Fungi Ascomycota).
3. Users can also search food, airborne, contact, injection, and unknown allergens by their biochemical name.
4. The complete information on any allergen can be obtained in a new page by clicking on the hyperlinked IUIS name.
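Since searches and result pages are keyed on official allergen names, it can help to split such a name into its parts before building a query or joining records across databases. The sketch below assumes the commonly described WHO/IUIS naming convention, which this chapter does not spell out: an abbreviated genus (three or four letters), an abbreviated species (one or two letters), an allergen number, and an optional four-digit suffix whose first two digits denote the isoallergen and whose last two digits denote the variant. The function name is illustrative only.

```python
import re

# Genus abbreviation (3-4 letters), species abbreviation (1-2 letters),
# allergen number, and an optional four-digit isoallergen/variant suffix.
ALLERGEN_NAME = re.compile(
    r"^(?P<genus>[A-Z][a-z]{2,3}) (?P<species>[a-z]{1,2}) (?P<number>\d+)"
    r"(?:\.(?P<isoallergen>\d{2})(?P<variant>\d{2}))?$"
)


def parse_allergen_name(name):
    """Split a name such as 'Bet v 1.0101' into its parts, or return None."""
    match = ALLERGEN_NAME.match(name.strip())
    if match is None:
        return None
    parts = match.groupdict()
    parts["number"] = int(parts["number"])
    return parts


print(parse_allergen_name("Bet v 1.0101"))
print(parse_allergen_name("Der p 2"))
```

Names that do not fit the assumed pattern return `None`, so unusual entries are flagged for manual inspection rather than silently misparsed.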

2.1.3 Query Result

Figure 1b illustrates the details of the output of the queried allergen. The result displays supplementary and scientifically reviewed data (see Note 3) such as biochemical nature, size of the monomeric allergen (in kDa), allergenic property (IgE-reactivity and basophil activation test), route of allergen exposure, and submitter information. Each entry may include a list of isoallergen variants appropriately numbered by WHO/IUIS, which are hyperlinked to NCBI Nucleotide, GenBank Protein, and UniProt data through accession numbers and to PDB IDs (if available). The entries may also be externally linked to corresponding references via PubMed ID.

2.2 AllFam: The Database of Allergen Families

2.2.1 Description

AllFam is accessible online at https://www.meduniwien.ac.at/allfam/. Each allergen in AllFam is classified based on its corresponding protein family (Pfam). AllFam contains a total of 1042 allergens, of which 959 are assigned to 151 allergen families [7]. On the home page, interlinked menus can be found on the left. A brief description of the menus is as follows:


Fig. 1 Screenshots of the WHO/IUIS Allergen Nomenclature Sub-Committee database. (a) Results of a broad search with only the major taxonomic group of the source organism selected from the dropdown menu; (b) The page containing detailed information about a particular allergenic molecule


AllFam Home: It connects to the home page and provides a generalized description of allergen families along with AllFam statistics, AllFam news updates, and a user guide. The team of researchers maintaining and updating this database can also be found here.
Browse AllFam: This page connects to the query page, which allows users to search the families based on source organism and route of allergen contact.
Help/FAQ: This page provides background information about the curation method of the AllFam database, along with guidance on the user interface, errors, and problems.
About AllFam: It links to the reference for citing AllFam and to the developer team.
References: This page contains related references.

2.2.2 Usage

1. The allergen information can be accessed via the “Browse AllFam” menu. Two search options, Source and Routes of exposure, can be selected from the relevant dropdown menus.
2. “Allergen Source” has five options (animals, bacteria, fungi, plants, and all) that can be selected from the dropdown. The output of the “All” sources option is a list of protein family names along with the total number of member allergens belonging to each family.
3. Users can acquire detailed information on each allergen entry by clicking on the allergen name found in the list.
4. Users can search the AllFam database by allergen names, allergen sources, family names, Pfam domain accession numbers, or AllFam family accession numbers.
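The family-level listing described in step 2 is, in effect, a count of member allergens per protein family, optionally restricted to one source group. A minimal sketch of that grouping is shown below; the records and family labels are toy values chosen for illustration and are not an extract of AllFam.

```python
from collections import Counter

# Toy records (allergen, family, source group); not an extract of AllFam.
RECORDS = [
    ("Bet v 1", "Bet v 1-like", "plants"),
    ("Mal d 1", "Bet v 1-like", "plants"),
    ("Ara h 2", "Prolamin superfamily", "plants"),
    ("Der p 1", "Papain-like cysteine protease", "animals"),
]


def family_counts(records, source="all"):
    """Count member allergens per family, optionally for one source group."""
    return Counter(
        family for _, family, group in records
        if source == "all" or group == source
    )


for family, n in family_counts(RECORDS).most_common():
    print(family, n)
```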

2.2.3 Query Result

The AllFam output page showing the list of allergen families is displayed in Fig. 2a. Every AllFam allergen family has two links. The first is the “ID” (AllFam family accession number), which connects to a page describing the individual protein family, including its biochemical properties, the member allergens of that family, published references, and links to Pfam as well as Wikipedia. The second is “Number of allergens,” which connects to a page showing the list of allergens belonging to that particular allergen family, as shown in Fig. 2b. The listed allergens are organized and linked to the IUIS allergen database, the AllergenOnline database, the source organism, the taxa of the source, and the routes of exposure.

Fig. 2 Screenshots of AllFam database. (a) Description page of an AllFam allergen protein family; (b) Page containing list of all allergens from a particular protein family

2.3 Structural Database of Allergenic Proteins (SDAP)

2.3.1 Description

SDAP is an online server that integrates a repository of allergen structures with a number of bioinformatics tools to perform structural analysis of allergenic proteins and epitope characterization. This database is available at https://fermi.utmb.edu/ and hosts data on approximately 1526 allergens and 1312 allergen sequences, including isoallergens. Of these, 92 allergens have their PDB structures available in SDAP [8]. SDAP is freely accessible for academic users; however, commercial users may need a license. It also provides prediction tools, including tools for IgE-epitopes and the in silico allergenicity test recommended by FAO/WHO. The main SDAP menus are located on the left side of the home page, and a brief description of them is given below.

SDAP Home Page: It allows alphabetical browsing of allergens and presents links for references and recent updates. An option located at the top of the home page allows searching “SDAP All allergens” and “SDAP Food allergens.”
SDAP Overview: This page provides comprehensive information regarding the content of the database, computational tools, and SDAP lists. The SDAP database includes the allergen lists, their protein sequences, PDB structures, 3D models, IgE-epitope sets, and list of Pfam classes. This page also allows alphabetical browsing of allergens.
Use SDAP (SDAP All and SDAP Food): These two menus help users to search all allergens or only food allergens by selecting from a dropdown search field.
SDAP Tools: These menus provide users with some important web tools such as full FASTA search against SDAP, the allergenicity test of FAO/WHO, peptide similarity, peptide match, peptide-protein PD index, allergen markup language (Aller_ML), and SDAP list.
About SDAP: This page contains the manual, general information, FAQ, publications, team, and members of the SDAP advisory board.
Allergy Links: It contains information on other important databases related to allergy.
Our Software Tools: There are a number of additional software available such as MPACK for homology modeling, FANTOM (Fast Newton-Raphson Torsion Angle Minimizer), GETAREA to calculate the solvent accessible area, PCPMer for patch and cluster analysis, and EpiSearch for conformational epitopes mapping. Protein Databases: This option is externally linked to certain important protein databases such as PDB, SWISSPROT, MMDB Entrez, NCBIEntrez, and PIR. Protein Classification Link: This option is externally linked to certain important protein classifications databases such as CATH, CE, FSSP, iProClass, and so on.


Rajat Kanti Sarkar et al.

Bioinformatics Server: This option links to external bioinformatics web servers such as PIR-Peptide match, BCM-ClustalW, TOME, ClustalW, NCBI-BLAST, PIR-BLAST, PIR-FASTA, and PIR-ClustalW.

Bioinformatics Tools: Users can download the Cn3D macromolecular structure viewer (NCBI) through this link.

Bioinformatics Links: It connects to bioinformatics.ca, which stores information about Canadian bioinformatics resources, workshops, and contacts.

2.3.2 Usage

1. The database can be searched by using the “Use SDAP” menu in the left panel or by clicking “SDAP All allergens” or “SDAP Food allergens” at the top.
2. Users can search by selecting any of the search fields, such as scientific name of the allergen, scientific or common name of the source, and allergen description. The query can be a single term or a phrase.
3. Users can also browse the allergen data listed alphabetically.
4. Detailed information such as the amino acid sequence and 3D structure of any allergen can be obtained by clicking on the IUIS name that appears in the search results.

2.3.3 SDAP Database Query Result

The query result opens in a new page that contains a list of allergens and their corresponding homologs. The list starts with the letter “A” and contains preliminary data on each allergen, such as IUIS name, source or species name, biochemical property of the protein under “Keywords,” and the corresponding IUIS link (Fig. 3a). Users can obtain more information on any allergen by clicking on its IUIS name, as shown in Fig. 3b.

2.4 AllergenOnline Database

AllergenOnline stores data on peer-reviewed allergens in a searchable database that can be used to identify unknown proteins that may pose a risk of allergenicity. The database is freely accessible and provides a simple yet useful tool for the evaluation of food safety. It is available at http://www.allergenonline.org/ and is updated annually. Version 21 (released on February 14, 2021) contains 2233 peer-reviewed allergen sequences along with 912 protein groups assigned to specific taxa [9]. The interlinked menus are located at the top of the home page. A brief description of the menus of the AllergenOnline database is as follows:

2.4.1 Description

About: This page contains the brief outline of the database including processed data entries, recent publications, removal of “False” entries, and the peer review process for the inclusion of entries into the database.

Databases of Allergens and their Epitopes


Fig. 3 Screenshots of SDAP database; (a) Output of search page showing alphabetical listing of allergens; (b) Sample output page with detailed information on the allergen molecule

Contact: This page shows the contact details of the developer team.

Browse the Database: This menu allows users to retrieve all the data entries on a single page. The data are presented in ten columns, and every column has a filter for quick search.

Sequence Search: This menu allows a query search using multiple FASTA sequences.

Version History: This menu shows statistical data for the previous and current versions of the database.

Celiac Disease: This menu contains a tool for assessing the celiac disease risk of any novel protein by exact peptide match and full FASTA analysis.

2.4.2 Usage

The database can be searched in two ways: by browsing the entries or by sequence search.

1. To browse all the entries, click on the “Browse” link.
2. For a “sequence search,” enter single or multiple protein sequences in the search box and perform either a full FASTA or an eight-amino-acid sliding window analysis.
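The eight-amino-acid sliding window analysis mentioned in step 2 can be illustrated with a minimal sketch. It is not the AllergenOnline implementation; it only assumes that the analysis reports every window of eight contiguous residues in the query that occurs identically in an allergen sequence. The function name and the toy sequences are hypothetical.

```python
def shared_windows(query, allergen, k=8):
    """Return (position, peptide) for each k-residue window of `query`
    that occurs identically somewhere in `allergen` (0-based positions)."""
    query, allergen = query.upper(), allergen.upper()
    # Index every k-mer of the allergen once for fast lookup.
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    hits = []
    for i in range(len(query) - k + 1):
        window = query[i:i + k]
        if window in allergen_kmers:
            hits.append((i, window))
    return hits


# Toy sequences (hypothetical, not real allergens).
allergen_seq = "MKTLLVAAGGSTRWQEPLNDKY"
query_seq = "AAAGGSTRWQEPLHHH"
hits = shared_windows(query_seq, allergen_seq)
```

A query shorter than the window length yields no hits, so very short peptides would need a smaller `k` or a different comparison.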

2.4.3 AllergenOnline Database Query Result

The browsing result is shown in Fig. 4. The whole allergen database is presented in a table of ten columns: species and common name of the source, IUIS name, type of allergen, allergen group, test performed for allergenicity, sequence length, database accession number, NCBI accession, and version number. A filter option at the bottom of each column allows a quick search with a particular keyword. Clicking on any entry in the “Group” column opens a new page containing the published references used to categorize the protein as a member allergen of that group. Clicking on any entry in the “GI” column opens the NCBI page containing the complete information on the protein. For a sequence search, the result page displays the best-hit protein according to similarity values, highest Z score, and percentage of identity.

Fig. 4 Screenshot of the allergen browsing page of the AllergenOnline database

2.5 AllerBase

2.5.1 Description of AllerBase

The AllerBase database is available at http://bioinfo.unipune.ac.in/AllerBase/Home.html. It is a comprehensive database (see Note 4) that stores molecular information on experimentally validated allergens (2281 as of July 2022), cross-reactivity, and IgE-epitopes [10]. Registration is not required to access this database. There are six major navigation bars at the top of the main page.

Home: This menu links to the homepage, which provides a short description of the data types available in the database along with a diagrammatic representation. The home page also contains five submenus on the right-hand side: a statistical account of the various data types, a “help” page where the usage of the database is illustrated with screenshots, the team of researchers who curated the database, a feedback form for users, and the contact details of the team.

About: This page contains a short description of the data types, such as molecular information of allergens, IgE-epitopes, IgE-antibodies, allergenicity assays, and cross-reactivity, that is useful even for non-specialist users.

Browse: It allows users to access information classified as allergen, IgE-epitope, or IgE-antibody. Each of these classes is further subdivided into two or more options. For example, a user can search allergens based on the corresponding source. IgE-epitopes can be searched by structure (linear or conformational). IgE-antibodies can be searched by specific allergen, by disease (e.g., asthma, rhinitis, and dermatitis), or by route of allergen exposure (e.g., inhalant, ingested, and contact). A list of non-specific IgE-antibodies that are not yet assigned to any particular allergen is also available in the “Browse” menu.

Search: It links to a query page where the user can perform either a basic search or an advanced search. The “basic search” is a more general search engine with five different search items, each with corresponding search options to be selected from the dropdown menu. Here, the search is done by simply typing the name of the allergen or isoallergen to retrieve the molecular information, IgE-epitopes, and cross-reactivity. Users can also search for IgE-antibodies by allergen name, allergen category (route of exposure), disease phenotype, source organism of the allergen, and PubMed ID of the publication in which the IgE-antibody was reported. The “advanced search” can be performed either for allergens or for IgE-epitopes. An allergen list can be retrieved either from a specific taxonomic level (e.g., Fungi) or from a specific source organism (e.g., Aspergillus fumigatus) based on the presence of four attributes: 3-D structure, IgE-epitope, IgE cross-reactivity, and IgE-antibodies. Each advanced search can use two attributes at a time. The experimentally verified IgE-epitopes can be retrieved as a list of either sequential (linear) or discontinuous (conformational) epitopes from a particular allergen or source organism. In addition to epitope structure, AllerBase also records the immunoaffinity (binding strength) and the frequency of reactivity (major or minor) of each allergen epitope. Thus, a user can search for a specific type of epitope (major or strongly binding) from any particular allergen or source organism.

2.5.2 Usage of AllerBase

1. Users can access the information stored in AllerBase by using the “Browse” and “Search” menus.
2. In the Browse options, all the information is classified into three data types: Allergen, IgE-epitope, and IgE-antibody. Each of these three data types is further subdivided into submenus. The entire dataset of allergens is classified into five classes: plant, animal, bacteria, virus, and fungi. Clicking any of these retrieves a list of allergens reported from that particular taxonomic group. Similarly, the entire dataset of IgE-epitopes is classified into two classes: Linear and Conformational. The dataset of IgE-antibodies is classified into four groups: allergen-specific, allergic disease, allergen category, and non-specific. Allergic disease and allergen category are further subdivided into three major types of disease (asthma, rhinitis, and dermatitis) and three routes of allergen exposure (respiratory, food, and contact), respectively.
3. Users can “Search” the database by using the “Basic” search tool and the “Advanced” search tool. In the basic search, details of any allergen can be retrieved by entering the IUIS name of that allergen and selecting “Allergen Name” from the dropdown menu. Similar details can be retrieved by using the name of the Source Organism (e.g., Aspergillus fumigatus), the Taxonomic Level (e.g., Plant), or the Food Source (e.g., Shrimp). IgE-antibodies can be searched on the basis of five parameters, each to be selected from the dropdown menu. For example, a user can write “Asthma” in the search bar and select “Allergic disease” from the dropdown menu.

4. The advanced search for allergens allows users to search with two parameters at a time. For example, allergens from a specific taxonomic level (e.g., Plant) can be retrieved based on the availability of any two of the four attributes: 3-D Structure, IgE-binding epitope, IgE cross-reactive allergens, and IgE antibodies. In the same way, the advanced search for IgE-epitopes can be performed by selecting either the structure of the epitope (e.g., Linear) or the nature of the epitope (e.g., Major) from the first dropdown menu, then the search criterion from the second dropdown menu (e.g., Allergen Name), and then writing the allergen name (e.g., Asp f 1).

2.5.3 AllerBase Query Result

Each browsing or search result appears as an alphabetically arranged tabular list of allergens. Clicking on an allergen entry opens a new page for that entry, which contains comprehensive information on the allergen, such as sequence information with hyperlinked NCBI and UniProt accession numbers, as shown in Fig. 5a. The PDB ID is also given if an experimentally determined structure is available. The sequences of the experimentally determined IgE-epitopes (if available) are listed in tabular format along with a brief description of each epitope, such as the nature of the epitope, the antibody used for detection, and the assay method employed, as shown in Fig. 5b. The IgE-antibody information associated with an allergic disease (e.g., asthma) or specific for an allergen (e.g., Bet v 1) appears as a list in which the amino acid sequence is hyperlinked to the IMGT and UniProt databases, as shown in Fig. 5c.

3 Notes

1. Certain datasets can be downloaded from some of these databases. However, user registration may be required.
2. The query used for searching all these databases is not case-sensitive.
3. The WHO/IUIS database is the prototype and official allergen database where information on scientifically reviewed allergens is deposited.
4. To the best of our knowledge, AllerBase is the only available database that stores information on allergens together with cross-reactivity, IgE-epitopes, and IgE-antibodies.
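The two-attribute advanced search described for AllerBase in Subheading 2.5.2 amounts to filtering allergen records by taxonomic level and by the availability of two attributes. The following is a minimal sketch of that logic, not AllerBase code: the record layout, field names, and example entries are hypothetical and only serve to make the selection rule concrete.

```python
# Hypothetical attribute flags, one per attribute named in the text.
ATTRIBUTES = {"structure_3d", "ige_epitope", "cross_reactivity", "ige_antibody"}


def advanced_search(allergens, taxonomic_level, attr_a, attr_b):
    """Return names of allergens from one taxonomic level for which both
    requested attributes are available."""
    for attr in (attr_a, attr_b):
        if attr not in ATTRIBUTES:
            raise ValueError(f"unknown attribute: {attr}")
    return [a["name"] for a in allergens
            if a["taxon"] == taxonomic_level and a[attr_a] and a[attr_b]]


# Made-up records for illustration; they do not reflect database content.
records = [
    {"name": "Allergen A", "taxon": "Plant", "structure_3d": True,
     "ige_epitope": True, "cross_reactivity": False, "ige_antibody": False},
    {"name": "Allergen B", "taxon": "Plant", "structure_3d": True,
     "ige_epitope": False, "cross_reactivity": True, "ige_antibody": False},
    {"name": "Allergen C", "taxon": "Fungi", "structure_3d": True,
     "ige_epitope": True, "cross_reactivity": True, "ige_antibody": True},
]
plant_hits = advanced_search(records, "Plant", "structure_3d", "ige_epitope")
```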


Fig. 5 Screenshots of AllerBase database. (a) Output page for individual allergen search; (b) Output page for IgE-epitope search for a particular allergen; (c) Output page of IgE-antibody search associated with a particular disease


Acknowledgements

GS acknowledges DBT/Wellcome Trust India Alliance for providing an Early Career Fellowship Grant (IA/E/17/1/503696). SS acknowledges the BIC COE project funded by the Department of Biotechnology, Government of India (sanction no. BT/PR40174/BTIS/137/45/2022).

References

1. Galli JS (2000) Allergy. Curr Biol 10(3):R93–R95
2. Kay AB (2000) Overview of allergy and allergic diseases: with a view to the future. Br Med Bull 56(4):843–864
3. Mari A, Scala E, Palazzo P et al (2007) Bioinformatics applied to allergy: allergen databases, from collecting sequence information to data integration. The allergome platform as a model. Cell Immunol 244:97–100
4. Bhattacharya K, Sircar G, Dasgupta A, Gupta Bhattacharya S (2018) Spectrum of allergens and allergen biology in India. Int Arch Allergy Immunol 177(3):219–237. https://doi.org/10.1159/000490805. Epub 2018 Jul 27. PMID: 30056449
5. Mari A, Rasi C, Palazzo P et al (2009) Allergen databases: current status and perspectives. Curr Allergy Asthma Rep 9:376–383
6. Larsen JN, Lowenstein H (1996) Allergen nomenclature. J Allergy Clin Immunol 97:577–578
7. Radauer C, Bublin M, Wagner S et al (2008) Allergens are distributed into few protein families and possess a restricted number of biochemical functions. J Allergy Clin Immunol 121:847–852
8. Ivanciuc O, Schein CH, Braun W (2003) SDAP: database and computational tools for allergenic proteins. Nucleic Acids Res 31(1):359–362
9. Goodman RE, Ebisawa M, Ferreira F, Sampson HA, van Ree R, Vieths S, Baumert JL, Bohle B, Lalithambika S, Wise J, Taylor SL (2016) AllergenOnline: a peer-reviewed, curated allergen database to assess novel food proteins for potential cross-reactivity. Mol Nutr Food Res 60(5):1183–1198. https://doi.org/10.1002/mnfr.201500769. Epub 2016 Mar 3. PMID: 26887584
10. Kadam K, Karbhal R, Jayaraman VK, Sawant S, Kulkarni-Kale U (2017) AllerBase: a comprehensive allergen knowledgebase. Database 2017:bax066

Chapter 11

TSNAD and TSNAdb: The Useful Toolkit for Clinical Application of Tumor-Specific Neoantigens

Jingcheng Wu and Zhan Zhou

Abstract

Tumor-specific neoantigens play important roles in tumor immunotherapy. How to predict neoantigens accurately and efficiently has attracted much attention. TSNAD is the first one-stop neoantigen prediction tool from next-generation sequencing data, and TSNAdb provides both predicted and validated neoantigens based on pan-cancer immunogenomics analyses. In this chapter, we describe the usage of TSNAD and TSNAdb for the clinical application of neoantigens. The latest version of TSNAD is available at https://pgx.zju.edu.cn/tsnad, and the latest version of TSNAdb is available at https://pgx.zju.edu.cn/tsnadb.

Key words Tumor immunotherapy, Neoantigen, Prediction, Bioinformatics, Database

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_11, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

Tumor-specific neoantigens have attracted much attention because they can be used as potential targets for tumor immunotherapy [1–3]. Compared with traditional tumor-associated antigens (TAAs), neoantigens are exclusively expressed by tumor cells and are tumor-specific antigens (TSAs) that can avoid “on-target off-tumor” side effects. Because experimental identification of neoantigens is costly, many neoantigen prediction tools, such as TSNAD [4, 5], pTuneos [6], and pVACtools [7], have been developed. One function of these tools is to identify tumor-specific somatic mutations from next-generation sequencing (NGS) data, and another is to assess the binding ability between mutant peptides and the patient’s human leukocyte antigen (HLA) alleles. In addition, the construction of tumor neoantigen databases plays an important and positive role in the efficient screening of tumor targets and in promoting the development of tumor immunotherapy. Several databases with predicted or collected neoantigens have been developed, such as TCLP [8], TCIA [9], dbPepNeo [10], NeoPeptide [11], and TSNAdb [12, 13]. Given the importance of neoantigen prediction tools and databases in neoantigen-based immunotherapy, in this chapter we describe the usage of TSNAD and TSNAdb, which we developed previously, for the potential clinical application of neoantigens.

TSNAD is the first one-stop neoantigen prediction tool. It identifies cancer somatic mutations from the genome/exome sequencing data of tumor-normal pairs following the best practices of the Genome Analysis Toolkit (GATK) [4]. In the updated version, TSNAD v2.0, RNA-Seq data analysis was included for gene expression and gene fusion analyses, two versions of the reference genome (GRCh37 and GRCh38) can be selected when calling mutations, three sources of neoantigens can be predicted (see Note 1), and two more types of tool usage are provided (see Note 2) [5].

TSNAdb was constructed from the prediction results of TSNAD on 7748 tumor samples from TCGA. It contains 3,707,562/1,146,961 potential neoantigens generated by single nucleotide variants [12]. In the updated version, TSNAdb v2.0, several changes have been made (see Note 3). In total, 372,273 SNV-derived neoantigens, 137,130 INDEL-derived neoantigens, and 11,093 fusion-derived neoantigens are displayed in the updated database [13].

2 Materials

This chapter demonstrates the usage of the neoantigen prediction tool TSNAD and the neoantigen database TSNAdb. To run TSNAD locally, the user will need a computer with 64 GB of memory and 512 GB of hard disk space. In addition, users need to install Docker (https://docs.docker.com/). To run the web-server version of TSNAD (https://pgx.zju.edu.cn/tsnad) or use TSNAdb (https://pgx.zju.edu.cn/tsnadb), users need a web browser such as Edge or Google Chrome.

3 Methods

3.1 Docker of TSNAD

The Docker version of TSNAD comprises all tools and reference files that are required to run neoantigen prediction. It is easy to run TSNAD with next-generation sequencing data (whole-exome sequencing, whole-genome sequencing, and RNA sequencing, see Note 4).

High-throughput Neoantigen Prediction


1. Download TSNAD through Docker.

    docker pull biopharm/tsnad:latest

2. Enter the TSNAD running environment.

    With RNA sequencing data:

    docker run -it -v [dir of WES/WGS]/:/home/tsnad/samples -v [dir of RNA-seq]:/home/tsnad/RNA-seq -v [output dir]:/home/tsnad/results biopharm/tsnad:latest /bin/bash

    Without RNA sequencing data:

    docker run -it -v [dir of WES/WGS]/:/home/tsnad/samples -v [output dir]:/home/tsnad/results biopharm/tsnad:latest /bin/bash

3. Perform neoantigen prediction.

    cd /home/tsnad
    bash uncompress.sh

    With RNA sequencing data:

    python TSNAD.py -I samples/ -R RNA-seq/ -V [grch37/grch38] -O results/

    Without RNA sequencing data:

    python TSNAD.py -I samples/ -V [grch37/grch38] -O results/
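When many samples have to be processed, the `docker run` call of step 2 can be assembled programmatically. The sketch below only rebuilds the command given in the text from host directories; the function name and the example paths are hypothetical, and the command is printed rather than executed.

```python
import shlex


def tsnad_docker_command(wes_dir, out_dir, rna_dir=None,
                         image="biopharm/tsnad:latest"):
    """Build the `docker run` argument list from step 2.
    Arguments are host directories; container paths are those in the text."""
    cmd = ["docker", "run", "-it",
           "-v", f"{wes_dir}:/home/tsnad/samples"]
    if rna_dir is not None:
        # RNA-seq data are optional and get their own mount point.
        cmd += ["-v", f"{rna_dir}:/home/tsnad/RNA-seq"]
    cmd += ["-v", f"{out_dir}:/home/tsnad/results", image, "/bin/bash"]
    return cmd


# Hypothetical host paths, for illustration only.
cmd = tsnad_docker_command("/data/patient1/wes", "/data/patient1/out",
                           rna_dir="/data/patient1/rna")
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps directory names with spaces intact if the command is later passed to a process launcher.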

3.2 Web-Server of TSNAD

The web service of TSNAD provides part of the functionality of the Docker version (Fig. 1). To run it correctly, users should provide VCF files containing mutation data and at least one HLA allele. The default reference genome is GRCh38; GRCh37 is also available if needed (the selected version should be consistent with the reference genome used for calling the mutations). An email address is optional and is used to notify the user that the prediction is completed.

Fig. 1 The function that the web service of TSNAD provides. [Flow diagram with the elements SNV/INDEL (VCF files), VEP, mutant peptides, HLA alleles, and predicted neoantigens]


3.3 Architecture of TSNAdb

TSNAdb contains two sets of neoantigens: predicted neoantigens and validated neoantigens. The predicted neoantigens are based on the mutations (972,187 SNVs, 112,404 INDELs, and 12,639 Fusions) from TCGA samples. These mutations are translated into mutant peptides, which are then evaluated by three tools (DeepHLApan [14], MHCflurry [15], and NetMHCpan v4.0 [16]) to identify potential neoantigens. The validated neoantigens are collected from the literature and from related databases such as dbPepNeo [10], NeoPeptide [11], NEPdb [17], and CAPD [18]. All the neoantigens are displayed in the web interface of TSNAdb (Fig. 2).
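The step in which mutations are translated into mutant peptides can be illustrated for the simplest case, a single amino acid substitution. The sketch below is not the TSNAD code: it assumes that every peptide window containing the substituted residue is a candidate, and the peptide lengths of 8 to 11 residues are an assumption commonly made for HLA class I, not a setting taken from the text. The function name and the toy protein are hypothetical.

```python
def mutant_peptides(protein, position, mutant_residue, lengths=(8, 9, 10, 11)):
    """Return all peptides of the given lengths that contain the substituted
    residue. `position` is 1-based, as in protein-level mutation names."""
    idx = position - 1
    if not 0 <= idx < len(protein):
        raise ValueError("position outside the protein sequence")
    mutated = protein[:idx] + mutant_residue + protein[idx + 1:]
    peptides = []
    for k in lengths:
        # Window starts that keep the mutated residue inside the peptide.
        first = max(0, idx - k + 1)
        last = min(idx, len(mutated) - k)
        for start in range(first, last + 1):
            peptides.append(mutated[start:start + k])
    return peptides


# Toy 20-residue protein with a hypothetical substitution at position 10.
toy = "MKTAYIAKQRQISFVKSHFS"
peps = mutant_peptides(toy, 10, "W", lengths=(9,))
```

Each candidate peptide would then be paired with the patient's HLA alleles and passed to a binding predictor; INDELs and fusions need a different translation step that this sketch does not cover.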

3.4 Web Interface of TSNAdb

The web interface of TSNAdb contains six main pages: the “Home” page, the “Browse” page, the “Search” page, the “Collected” page, the “Download” page, and the “Help” page (Fig. 3).

TSNAdb provides the statistics chart and table of the database on the “Home” page. By default, the statistics chart is displayed. Users can switch between the table and the chart by clicking table and boxplot. Users can select how many tumor types to display in the statistical chart and click on each column of the chart to jump to the corresponding “Tumor type” subpage in the “Browse” page. On the right is a brief description of the TSNAdb database and the developer’s contact information.

Fig. 2 The architecture of TSNAdb. [Diagram: mutations from TCGA samples (972,187 SNVs, 112,404 INDELs, and 12,639 Fusions) are processed by the prediction tools (DeepHLApan, MHCflurry, and NetMHCpan v4.0); validated neoantigens come from other databases (dbPepNeo, CAPD, NeoPeptide, and NEPdb); both are displayed in the TSNAdb web interface]

The “Browse” page contains three types of subpages: “Tumor type,” “Mutation type,” and “Shared_neoantigen.” “Tumor type” contains the neoantigen information of 16 tumor types, and each “Tumor type” page contains three parts: “Statistics,” “Neoantigen with clinical information,” and “Detailed neoantigen.” The “Statistics” section shows the distribution of tumor neoantigens derived from different mutation types and the correlation between mutations and mutation-derived neoantigens. The “Neoantigen with clinical information” section shows the correlation between neoantigens and various clinical information (such as age, gender, and smoking). The “Detailed neoantigen” section shows the detailed neoantigen information of the tumor type.

The “Mutation type” subpage contains the tumor neoantigen information related to three mutation types (SNV, INDEL, and Fusion), and each “Mutation type” page contains four parts: “Statistics,” “Neoantigen with mutation,” “Neoantigen with clinical information,” and “Detailed neoantigen.” The “Statistics” section shows the distribution of mutations and neoantigens. “Neoantigen with mutation” shows the correlation between neoantigens and mutations. The “Neoantigen with clinical information” section shows the correlation between neoantigens and various clinical information (such as age, gender, and smoking). The “Detailed neoantigen” section shows the detailed neoantigen information of the mutation type.

The subpage “Shared neoantigen” shows the distribution of 16,913 cases of shared neoantigens. The statistical chart and the detailed information table are shown on the page. Users can adjust the threshold value to find the shared neoantigens that meet the threshold. The genes and mutations in the table are linked to CandrisDB to evaluate whether the gene or mutation is a driver gene or driver mutation.

Fig. 3 The web interface of TSNAdb. [Diagram summarizing the pages: statistics results of the database; three types of pages for browsing (tumor type, mutation type, shared neoantigens); three ways to search (mutation, tissue, and gene; gene; HLA); collected experimentally validated neoantigens; different data for download (tumor type, mutation type); usage instructions of the database]

The “Search” page consists of three parts. The first part is the general interface, where users can select the mutation type, tumor type, and gene to search for the corresponding neoantigens. The search results are displayed as a list. The second part is the gene search interface, which only provides a gene search, and the search results are displayed as a pie chart and a table. Each slice of the pie chart is linked to the table, and the table changes when a slice is clicked. The third part is the HLA search interface, which only provides a search by HLA typing, and its search results are displayed in the same way as in the gene search interface.

The “Collected” page contains 1856 experimentally validated neoantigens from the open literature and databases, classified according to the level of experimental validation (see Note 5). The table on the “Collected” page lists the grade, tumor type, mutation type, gene, mutation, mutated peptide, HLA typing, references, and database for each tumor neoantigen. Genes and mutations are linked to CandrisDB to identify driver genes and driver mutations, as in the table on the “Shared_neoantigens” page.
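The threshold on the shared neoantigen subpage described above can be thought of as a minimum number of samples in which a neoantigen recurs. The following sketch makes that idea concrete under stated assumptions: a neoantigen is identified here by its peptide and HLA allele, and it is counted once per sample. How TSNAdb itself defines and counts shared neoantigens is not specified in the text, and the records are made up.

```python
from collections import defaultdict


def shared_neoantigens(records, min_samples=2):
    """Count in how many distinct samples each (peptide, HLA) pair occurs and
    keep the pairs seen in at least `min_samples` samples."""
    samples_by_pair = defaultdict(set)
    for sample_id, peptide, hla in records:
        samples_by_pair[(peptide, hla)].add(sample_id)
    counts = {pair: len(s) for pair, s in samples_by_pair.items()}
    return {pair: n for pair, n in counts.items() if n >= min_samples}


# Hypothetical records: (sample ID, mutant peptide, HLA allele).
records = [
    ("S1", "PEPTIDEAK", "HLA-A*02:01"),
    ("S2", "PEPTIDEAK", "HLA-A*02:01"),
    ("S2", "PEPTIDEAK", "HLA-A*02:01"),  # duplicate within one sample
    ("S3", "PEPTIDEAK", "HLA-A*02:01"),
    ("S1", "SAMPLEPEP", "HLA-A*02:01"),
]
shared = shared_neoantigens(records, min_samples=2)
```

Raising `min_samples` narrows the result to the most recurrent neoantigens, which mirrors adjusting the threshold on the web page.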

4 Notes

1. There are two versions of TSNAD. The updated version provides more functions: it adds RNA-Seq data analysis, supports both the GRCh38 and GRCh37 versions of the reference genome when calling mutations, adds the prediction of neoantigens derived from INDELs and from gene fusions, and replaces NetMHCpan with DeepHLApan.
2. TSNAD is available at https://github.com/jiujiezz/tsnad (source code), https://hub.docker.com/r/biopharm/tsnad (Docker version), and https://pgx.zju.edu.cn/tsnad (web service). The GitHub version provides the same functions as the Docker version, but it requires installing many embedded tools and downloading many files before use. The web service only provides mutant peptide generation and peptide-MHC binding prediction because of the large size of next-generation sequencing data.
3. There are two versions of TSNAdb. Version 1.0 is available at https://pgx.zju.edu.cn/tsnadb1, and version 2.0 is available at https://pgx.zju.edu.cn/tsnadb. In this chapter, we introduce the main content of TSNAdb v2.0. Compared with the first version, TSNAdb v2.0 implements new features and improvements: (1) it provides predicted neoantigens derived not only from SNVs but also from INDELs and Fusions; (2) it uses stricter criteria for neoantigen identification to handle the high false-positive rate of neoantigen prediction in practice; and (3) it collects as many experimentally validated neoantigens as possible from public databases and the literature.
4. The input data of the whole TSNAD pipeline are next-generation sequencing data in fastq format. The minimum depth should be 15× for whole-genome sequencing (WGS) and 50× for whole-exome sequencing (WES), and the recommended depth is 30× for WGS and 100× for WES. For samples with WES tumor/normal data and RNA-seq data, it takes approximately 50 h to finish neoantigen prediction on an Ubuntu system with 64 GB of memory and 512 GB of hard disk space.
5. The experimentally validated neoantigens are collected from the public literature and databases and are divided into three tiers. Neoantigens that have been validated both as presented on the cell surface and as immunogenic are labeled tier 1. Neoantigens that have only been validated as immunogenic are labeled tier 2. Neoantigens that have only been validated as presented on the cell surface are labeled tier 3.
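The depth thresholds in Note 4 and the tier definitions in Note 5 are simple decision rules, and a short sketch can make them explicit. The function names and the status labels are hypothetical; the numeric thresholds and the tier assignments are those stated in the notes.

```python
def validation_tier(presented, immunogenic):
    """Map the two kinds of experimental evidence in Note 5 to a tier.
    Returns None when neither kind of evidence is available."""
    if presented and immunogenic:
        return 1
    if immunogenic:
        return 2
    if presented:
        return 3
    return None


# Minimum and recommended depths from Note 4, keyed by assay.
DEPTH = {"WGS": (15, 30), "WES": (50, 100)}


def depth_status(assay, mean_depth):
    """Classify a mean sequencing depth against the thresholds in Note 4."""
    minimum, recommended = DEPTH[assay]
    if mean_depth < minimum:
        return "below minimum"
    if mean_depth < recommended:
        return "acceptable"
    return "recommended"
```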

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 31971371), the Key R&D Program of Zhejiang Province (Grant No. 2020C03010), and the Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare. We thank the Information Technology Center, State Key Lab of CAD&CG, and Innovation Institute for Artificial Intelligence in Medicine, Zhejiang University for the support of computing resources. We also gratefully acknowledge the TCGA Research Network for referencing the TCGA datasets, and the TCIA for referencing HLA-type data of TCGA samples.

References

1. Blass E, Ott PA (2021) Advances in the development of personalized neoantigen-based therapeutic cancer vaccines. Nat Rev Clin Oncol 18:215–229
2. Yamamoto TN, Kishton RJ, Restifo NP (2019) Developing neoantigen-targeted T cell–based treatments for solid tumors. Nat Med 25:1488–1499
3. Cui C, Wang J, Fagerberg E et al (2021) Neoantigen-driven B cell and CD4 T follicular helper cell collaboration promotes anti-tumor CD8 T cell responses. Cell 184:6101–6118.e13
4. Zhou Z, Lyu X, Wu J et al (2017) TSNAD: an integrated software for cancer somatic mutation and tumour-specific neoantigen detection. R Soc Open Sci 4:170050
5. Zhou Z, Wu J, Ren J et al (2021) TSNAD v2.0: a one-stop software solution for tumor-specific neoantigen detection. Comput Struct Biotechnol J 19:4510–4516
6. Zhou C, Wei Z, Zhang Z et al (2019) pTuneos: prioritizing tumor neoantigens from next-generation sequencing data. Genome Med 11:67
7. Hundal J, Kiwala S, McMichael J et al (2020) pVACtools: a computational toolkit to identify and visualize cancer neoantigens. Cancer Immunol Res 8:409–420
8. Scholtalbers J, Boegel S, Bukur T et al (2015) TCLP: an online cancer cell line catalogue integrating HLA type, predicted neo-epitopes, virus and gene expression. Genome Med 7:118
9. Charoentong P, Finotello F, Angelova M et al (2017) Pan-cancer immunogenomic analyses reveal genotype-immunophenotype relationships and predictors of response to checkpoint blockade. Cell Rep 18:248–262
10. Tan X, Li D, Huang P et al (2020) dbPepNeo: a manually curated database for human tumor neoantigen peptides. Database 2020:baaa004
11. Zhou WJ, Qu Z, Song CY et al (2019) NeoPeptide: an immunoinformatic database of T-cell-defined neoantigens. Database 2019:baz128
12. Wu J, Zhao W, Zhou B et al (2018) TSNAdb: a database for tumor-specific neoantigens from immunogenomics data analysis. Genomics Proteomics Bioinformatics 16:276–282
13. Wu J, Chen W, Zhou Y et al (2022) TSNAdb v2.0: the updated version of tumor-specific neoantigen database. Genomics Proteomics Bioinformatics. https://doi.org/10.1016/j.gpb.2022.09.012
14. Wu J, Wang W, Zhang J et al (2019) DeepHLApan: a deep learning approach for neoantigen prediction considering both HLA-peptide binding and immunogenicity. Front Immunol 10:2559
15. O’Donnell TJ, Rubinsteyn A, Laserson U (2020) MHCflurry 2.0: improved pan-allele prediction of MHC class I-presented peptides by incorporating antigen processing. Cell Syst 11:42–48.e7
16. Jurtz V, Paul S, Andreatta M et al (2017) NetMHCpan-4.0: improved peptide–MHC class I interaction predictions integrating eluted ligand and peptide binding affinity data. J Immunol 199:3360–3368
17. Xia J, Bai P, Fan W et al (2021) NEPdb: a database of T-cell experimentally-validated neoantigens and pan-cancer predicted neoepitopes for cancer immunotherapy. Front Immunol 12:644637
18. Vigneron N, Stroobant V, Van Den Eynde BJ, Van Der Bruggen P (2013) Database of T cell-defined human tumor antigens: the 2013 update. Cancer Immun 13:15

Chapter 12

EPIPOX: A Resource Facilitating Epitope-Vaccine Design Against Human Pathogenic Orthopoxviruses

Laura Ballesteros-Sanabria, Hector F. Pelaez-Prestel, Pedro A. Reche, and Esther M. Lafuente

Abstract

EPIPOX is a specialized online resource intended to facilitate the design of epitope-based vaccines against orthopoxviruses. EPIPOX is built upon a collection of T cell epitopes that are shared by eight pathogenic orthopoxviruses, including variola minor and major strains, monkeypox, cowpox, and vaccinia viruses. In EPIPOX, users can select T cell epitopes according to their predicted binding to distinct major histocompatibility complex (MHC) molecules and according to various features that may have an impact on epitope immunogenicity. Among others, EPIPOX allows users to distinguish epitopes by their structural location in the virion and by the temporal expression of the counterpart antigens. Overall, the annotations in EPIPOX are optimized to facilitate the rational design of T cell epitope-based vaccines. In this chapter, we describe the main features of EPIPOX and exemplify its use by retrieving orthopoxvirus-specific T cell epitopes with features set to enhance their immunogenicity. EPIPOX is available for free public use at http://bio.med.ucm.es/epipox/.

Key words T cell epitopes, Vaccine design, Orthopoxviruses, HLA molecules, EPIPOX

1 Introduction
Smallpox is an infectious disease caused by two types of variola virus (VARV), major and minor, belonging to the Orthopoxvirus genus of the Poxviridae family [1]. VARV major, the most prevalent and deadly form [2], was responsible for the death of millions of people, devastating civilizations across the world since its first appearance, estimated around 10,000 BC [3]. With the introduction of systematic vaccination in the early 19th century, variola infections started to decline until the disease was declared eradicated in May 1980 by the WHO [1]. Since then, smallpox vaccination has completely ceased. Consequently, all people under 40 are unvaccinated and naive to smallpox, and in older people, the immunity elicited by the vaccine may have already waned. Hence, the current population is largely vulnerable to smallpox. The lack of immunity to smallpox in the population is worrisome as smallpox virus stockpiles persist and the virus has been identified as a potential bioweapon. Moreover, the population is also vulnerable to zoonosis by other orthopoxviruses like monkeypox virus (MPXV) [4] that were prevented through cross-reactive immunity elicited by smallpox vaccines. Orthopoxviruses are very similar and, as a result, the chance for cross-reactive immunity between them is very high [5].

Smallpox vaccines stem from Edward Jenner’s method to prevent smallpox through scarification with material from cowpox lesions [6], a benign disease caused by cowpox virus (CPXV) [7]. Current smallpox vaccines consist of vaccinia virus (VACV), like those used for smallpox eradication. VACV was originally isolated from early smallpox vaccines and hence thought to be a version of CPXV. However, DNA analysis of early smallpox vaccines now supports that horsepox virus is a more likely ancestor [8]. VACV prevents smallpox by resembling the natural infection, making the vaccine highly immunogenic and effective. Moreover, the VACV vaccine is safe, although in immunocompromised or immune-suppressed individuals, it can have serious adverse effects [9]. In particular, people with T cell disorders can suffer severe disease upon vaccination with VACV [10].

The lack of immunity to smallpox in the population, along with increasing cases of monkeypox spreading to non-endemic countries [11] and bioterrorism concerns, has renewed the interest in exploring new and safer smallpox vaccines. Moreover, the shadow of SARS-CoV-2 underscores the importance of being able to develop and deploy efficient vaccines in a short time. In this sense, T cell epitopes provide a relevant foundation for developing safer smallpox vaccines quickly [5]. Moreover, T cell epitope-based vaccines ought to be more resilient to virus variation than vaccines aimed at eliciting humoral immunity [12].

In this chapter, we describe an online resource, EPIPOX, which facilitates T cell epitope vaccine design against orthopoxviruses. Interestingly, EPIPOX includes a unique selection of T cell epitopes that are shared between several pathogenic orthopoxviruses, including VARV, MPXV, CPXV, and VACV. T cells only recognize peptides bound to major histocompatibility complex (MHC) molecules, known as human leukocyte antigens (HLA) in humans. Therefore, in EPIPOX, CD8 and CD4 T cell epitopes can be identified/selected by their predicted binding to HLA class I (HLA I) and class II (HLA II) molecules, respectively. EPIPOX also allows the selection of T cell epitopes with regard to antigen features, such as location in viral particles and time of expression, which will be exemplified in this chapter.

Laura Ballesteros-Sanabria and Hector F. Pelaez-Prestel contributed equally to the chapter.
Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_12, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

EPIPOX: Orthopoxvirus-Specific T Cell Epitope Database

2 Materials

2.1 The EPIPOX Resource

2.1.1 Description of EPIPOX Data: Predicted T Cell Epitopes

EPIPOX provides access to information about T cell epitopes from poxviruses and their predicted binding to MHC molecules. The EPIPOX database is built from 124 proteins shared by eight orthopoxviruses: VARV major (strain Bangladesh-1975), VARV major (strain India-1967), VARV minor (strain Garcia), Monkeypox virus (strain Zaire-96-I-16), Cowpox virus (strain Brighton Red), Vaccinia virus (strain Copenhagen), Vaccinia virus (strain Tian Tan), and Vaccinia virus (strain Ankara). The system uses VARV major strain Bangladesh-1975 as reference. Over these 124 proteins, identical, and hence cross-reactive, T cell epitopes are predicted by peptide-MHC binding profiles using 32 HLA-I and 33 HLA-II alleles experimentally verified for peptide binding (see Note 1). Further details on EPIPOX building, predictions, and implementation are described in the original article [5].
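The idea of identical, and hence shared, peptides can be illustrated with a minimal Python sketch. It is not the EPIPOX implementation: it simply enumerates all 9-residue windows of each orthologous sequence and keeps those present in every one. The function names and the toy sequences are hypothetical and were chosen only for illustration.

```python
def kmers(sequence, k=9):
    """Return the set of all overlapping k-residue peptides in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def shared_kmers(orthologs, k=9):
    """Return the k-mers present in every ortholog, sorted alphabetically."""
    sets = [kmers(seq, k) for seq in orthologs]
    return sorted(set.intersection(*sets))


# Hypothetical toy orthologs (NOT real orthopoxvirus sequences).
toy_orthologs = [
    "MKTAYIDALAKQWERT",  # stands in for the reference strain
    "MKTAYIDALAKQWDRT",  # one substitution near the C-terminus
    "MRTAYIDALAKQWERT",  # one substitution near the N-terminus
]

conserved = shared_kmers(toy_orthologs)
print(conserved)
```

Only windows that avoid both substitutions survive the intersection, which is why identity across all strains is a stringent criterion.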

2.1.2 Experimental T Cell Epitopes

EPIPOX database also contains some experimentally defined poxvirus-specific HLA I and HLA II-restricted T cell epitopes deposited at the Immune Epitope Database (IEDB) [13]. IEDB (https://www.iedb.org/) contains experimental information on B cell and T cell epitopes whose binding to antibodies, BCR, TCR, or MHC has been experimentally assessed. Vaccinia virus epitopes in IEDB are included in EPIPOX database only if their sequences are fully shared among the eight orthopoxviruses. In this chapter, we will use IEDB to identify experimentally determined T cell epitopes within a particular output of EPIPOX.

2.2 Description of Web Interface

EPIPOX is available at http://bio.med.ucm.es/epipox/. The user encounters at this site an Input page composed of a brief description of the resource, followed by the Search section, the Limit Search section, a section with Related Resources, and a section to Submit New Sequences, which allows the user to check if the sequence information is already in the database. After completing the search, the results are shown in a new output page. Here we describe the main parts of EPIPOX under the Search section, the Limit Search section, and the Result page.

2.2.1 Description of Search Options

The Search section is the first part of the EPIPOX input page and consists of two search parameters, HLA restriction and VARV proteins, as shown in Fig. 1. Here, the user can select the HLA I and/or HLA II molecules restricting/binding the epitopes, as well as the viral proteins from which to retrieve T cell epitopes. HLA selection can be single or combined (AND/OR), with no limit on the number of molecules selected. There is also the option to query the database for promiscuous T cell epitopes binding to three HLA I supertypes (A2, A3, and B7), which are sets of HLA molecules that have similar peptide-binding repertoires. For the viral protein selection, there is no limit on the number of proteins selected, and the user can either select individual proteins or select all of them by using the option All.

Fig. 1 EPIPOX input page SEARCH section. The HLA field permits the single or multiple selection of class I and/or class II HLA molecules within the list. In the protein field, users can select one protein or all the 124 VARV major proteins shared with other pathogenic orthopoxviruses. The Supertypes field enables the selection of the HLA alleles belonging to supertypes A2, A3, or B7. We selected all proteins and the A3 supertype, as an example

2.2.2 Description of Search Filters

The Limit Search Results section is the second section of the EPIPOX input page. Here the user can search for a specific sequence by entering it in the SEQ box or, alternatively, search all the sequences available in the resource by leaving the SEQ box empty. Next, the user can restrict the search results by applying a series of filters, which include the protein temporal expression during the viral life cycle (EXPRESS), the protein location in the virion structure (LOCATION), and other features, all shown in Fig. 2. This enables the retrieval of T cell epitopes by a set of annotations that are relevant for vaccine design, for example the temporal expression of gene products (early, intermediate, late) and the location in relevant structures of the virus, like the core or the membrane of infective forms, the intracellular mature virion (IMV) or the extracellular enveloped virion (EEV). Early expressed proteins and nonstructural gene products are generally more immunogenic with regard to CD8 T cells, while late proteins and highly abundant gene products, generally located in membrane structures, are more immunogenic for CD4 T cells [14–16]. Other filters available in this section are the presence of a leader signal (LEADER) or a transmembrane region (TRANS). These characteristics are interesting for vaccine design, as proteins with these features often interact with host cells [17, 18]. Finally, users can retain only those peptides whose score, relative to the maximum of the HLA-specific profile used to score T cell epitopes, exceeds a selectable value (PCT OPT). Also, for HLA I-restricted epitopes, the search can be restricted to epitopes predicted to be generated by proteasome cleavage (CLEAVED).

Fig. 2 EPIPOX input page LIMIT SEARCH RESULT section. The interface shows the parameters used for limiting the search of T cell epitopes binding to the previously selected restriction elements. We selected E to retrieve early expressed proteins. For the rest of the parameters (LEADER, TRANS, LOCATION, PCT OPT, and CLEAVED), we left the default option

2.2.3 Description of Output Page

The EPIPOX output page generated after completing a search consists of a tabulated list including the sequence of the epitopes, information about the protein containing each epitope, and the HLA molecules that bind each peptide, as shown in Fig. 3. The SOURCE NAME and SOURCE GI columns show the name of the protein and the GenBank protein GI (GenInfo Identifier) used as epitope source, respectively. The specific HLA molecule binding the peptide and its class (I or II) are shown in the MTX and CLASS columns, respectively. The rest of the columns, CLEAVED, PCT OPT, LOCATION, EXPRESS, LEADER, and TRANS, give the corresponding output, depending on the previously selected search filters.

3 Methods

3.1 EPIPOX Search Example Case

To show the practical use of EPIPOX, in the next sections, we describe how to search the database to obtain a reduced group of predicted T cell epitopes with characteristics that make them suitable for designing a peptide-based orthopoxvirus vaccine. As an example, we will search for peptides that bind to the HLA-A3 supertype without defining a specific protein (all proteins) but limiting the search to those with early expression, as these proteins may be more immunogenic [19]. Selecting the HLA-A3 supertype as the HLA I restriction ensures a wide population coverage (see Note 2).
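The logic of this example query can be sketched in Python as a filter over annotated records. This is only an illustration under stated assumptions: the record layout, the `search` function, and the toy peptides are hypothetical and do not reflect how the EPIPOX server stores or queries its data. The A3 supertype alleles are those listed in Note 2, and the AND/OR modes mirror the two combination options offered in the HLA field.

```python
from dataclasses import dataclass

# A3 supertype alleles as listed in Note 2 of this chapter.
A3_SUPERTYPE = {"A*0301", "A*1101", "A*3101", "A*3301", "A*6801", "A*6601"}


@dataclass(frozen=True)
class EpitopeRecord:
    seq: str            # peptide sequence (SEQ column)
    alleles: frozenset  # HLA molecules predicted to bind the peptide
    express: str        # "E", "I", or "L" (EXPRESS column)


def search(records, alleles, express="All", mode="OR"):
    """Filter records by HLA restriction (AND/OR) and temporal expression."""
    hits = []
    for rec in records:
        if express != "All" and rec.express != express:
            continue
        if mode == "AND":
            matches = alleles <= rec.alleles       # binds every selected allele
        else:
            matches = bool(alleles & rec.alleles)  # binds at least one allele
        if matches:
            hits.append(rec.seq)
    return hits


# Hypothetical toy records; sequences and annotations are invented.
toy_records = [
    EpitopeRecord("PEPTIDEAK", frozenset({"A*0301", "A*1101"}), "E"),
    EpitopeRecord("PEPTIDEBK", frozenset(A3_SUPERTYPE), "E"),
    EpitopeRecord("PEPTIDECK", frozenset({"A*0301"}), "L"),
    EpitopeRecord("PEPTIDEDK", frozenset({"A*0201"}), "E"),
]

any_allele = search(toy_records, A3_SUPERTYPE, express="E", mode="OR")
all_alleles = search(toy_records, A3_SUPERTYPE, express="E", mode="AND")
print(any_allele, all_alleles)
```

The OR mode maximizes the number of candidate peptides, whereas the AND mode keeps only the most promiscuous binders.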

Fig. 3 EPIPOX result page. A slice of the output table resulting from promiscuous EEV protein peptides binding to the A3 supertype. Each column shows the corresponding protein features used to limit the epitope search

3.2 Selection of Restriction Elements and Proteins

In this section, we list the steps needed to select the HLA molecules and VARV proteins from which to retrieve T cell epitopes.

1. Access the EPIPOX web server at http://bio.med.ucm.es/epipox/.
2. Select a restriction element. There are two options: using the HLA field or the Supertypes field.
   • Use the HLA field to choose all HLA I and/or HLA II alleles, or individual HLA alleles from the list. To make a multiple selection, click one HLA, then press and hold the Control key while clicking the other HLAs that you want to include in the search. Then click AND or OR to search for peptides binding all or at least one of the selected HLA molecules, respectively.
   • Use the Supertypes field to query the database for promiscuous T cell epitopes binding to HLA I supertypes A2, A3, and B7.
3. Select one protein (or all of them) from VARV as the T cell epitope source.

3.3 Selection of Search Filters

In this section, we describe the different options in each of the following search fields and how to use them to limit the epitopes resulting from the search.

1. Use the LEADER field for selecting proteins with a leader peptide sequence. Choose one of the following options:
   • All: All proteins are selected (default setting).
   • T: Protein has a leader peptide sequence.
   • F: Protein does not have a leader peptide sequence.
   In our example, we leave the default setting.
2. Use the TRANS field for selecting proteins with transmembrane regions. Choose one of the following options:
   • All: All proteins are selected (default setting).
   • T: Protein has a predicted transmembrane region.
   • F: Protein does not have a transmembrane region.
   In our example, we leave the default setting.
3. Use the EXPRESS field for selecting gene products according to temporal expression. Choose one of the following options:
   • All: All proteins are selected (default setting).
   • E: Early expressed gene.
   • I: Intermediate expressed gene.
   • L: Late expressed gene.
   In our example, we select proteins with early expression by clicking E.
4. Use the LOCATION field for selecting proteins located in the virion core (CORE) or in the membrane of infective forms, IMV or EEV (see Note 3). Choose one of the following options:
   • All: All proteins are selected (default setting).
   • CORE: Protein is part of the virion core.
   • IMV: Protein is located in the membrane of the intracellular mature virion form of the virus.
   • EEV: Protein is located in the membrane of the extracellular enveloped virion form of the virus.
   In our example, we leave the default setting.
5. Use the PCT OPT field for selecting epitopes with a given relative score (see Note 4).
   • All: for selecting epitopes with any score (default setting).
   • Otherwise, select one of the given percentages: >25%, >50%, >60%, >70%, or >80%.
   In our example, we leave the default setting.
6. Use the CLEAVED field for HLA I-restricted epitopes that are predicted to be generated by the proteasome. Choose one of the following options:
   • All: All peptides are selected.
   • T: Only peptides predicted to be cleaved by the proteasome are selected.
   In our example, we leave the default setting.
7. Once the desired restriction elements, epitope source, and protein annotations are selected, click search to obtain the predicted epitopes. Some combinations of search parameters may give no results; if so, go back and click reset to clear the selected options and try a new search.

3.4 Getting Search Results

In this section, we explain how to manage the results of an EPIPOX search. Once you click search, a new tab will open, showing a peptide list like the one depicted in Fig. 3. The information in this table can be copied and pasted into an Excel document. Also, the user can access additional information about the protein from the Virus Pathogen Database (http://www.viprbrc.org/) (see Note 5). To do this, click the protein name that appears in SOURCE NAME. To see the epitope within the whole protein sequence, click the epitope in the SEQ column. A new tab will open showing the epitope in bold inside the protein. As an example, in Fig. 4 we show this information for peptide “ATYIDALAK” in the F1R protein. In our example, the search for peptides from early expressed proteins binding to the HLA-A3 supertype returned 190 results, representing 38 different peptides and the corresponding HLA molecules they can bind (see Note 6). Of these epitopes, eight have been identified and deposited at the Immune Epitope Database as vaccinia virus epitopes, while the rest are epitopes predicted by EPIPOX using peptide-MHC I-binding profiles (Table 1).
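Because each peptide-HLA pair occupies its own row (see Note 6), a copied result table can be condensed into one entry per peptide. The following Python sketch shows one way to do so; the function name is hypothetical, and the allele assignments in the toy rows are invented for illustration and are not EPIPOX output.

```python
from collections import defaultdict


def group_by_peptide(rows):
    """Map each peptide to the sorted list of HLA molecules that bind it."""
    grouped = defaultdict(set)
    for seq, hla in rows:
        grouped[seq].add(hla)
    return {seq: sorted(hlas) for seq, hlas in grouped.items()}


# Hypothetical (SEQ, MTX) pairs; the allele assignments are illustrative only.
rows = [
    ("ATYIDALAK", "A*0301"),
    ("ATYIDALAK", "A*1101"),
    ("DIFVSLVKK", "A*6801"),
    ("ATYIDALAK", "A*0301"),  # a duplicated row is collapsed
]

summary = group_by_peptide(rows)
for peptide, hlas in summary.items():
    print(peptide, len(hlas), ",".join(hlas))
```

Counting the keys of such a summary is how a row count (190 in the example above) reduces to a count of distinct peptides (38).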

Fig. 4 EPIPOX additional output results. (a) By clicking the proteins from field SOURCE NAME, a new tab will open giving access to the Virus Pathogen Database (http://www.viprbrc.org/). The figure shows part of the information from the protein available on this website. (b) By clicking the epitope sequence, in field SEQ, users will get the amino acid sequence of the protein, showing the peptide in bold

Table 1 VARV peptides binding HLA-A3 supertype alleles

Epitope     IEDB ID   Epitope     IEDB ID
DIFVSLVKK   8645      EISGKMAKK   -
ATYIDALAK   -         KSYESGLPK   -
VMFDKITSR   -         KVNYGEIKK   -
TVLITVYEK   -         TLARKIIKK   -
KLMEEYLRR   32018     KTSSFKISK   33868
TIFDFSEAR   -         SLLIDTYVK   -
KIFYKHIHK   919582    QLLASNQVK   -
DLLNSMMNR   9182      KIIMTKLKK   -
NLGNAVSNK   -         TVEEVDISK   -
ALLNAALHK   -         HIADPSYSK   -
NMKDITYEK   -         TLFNAGTSR   -
ALVSATKQK   -         RIAQLIYQR   -
LMFEYPLTK   -         IVNENLAER   -
KVIVRNLNK   34058     TIFEKFYEK   -
STIQESFIR   -         LLNAALHKK   -
SIYNVEIRK   -         AVKDVTITK   5397
KINNKIVER   -         DMFNLLLMK   -
IINHSIVTR   -         AVIRANNNR   5384
SIYTGENMR   -         ELYNEHSKK   -

4 Notes

1. EPIPOX is limited in its HLA-peptide-binding search, as it only predicts the binding of 9-residue peptides, whereas HLA II molecules can bind peptides of up to 22 amino acids.
2. The HLA molecules that make up the supertypes A2 (A*0201, A*0202, A*0203, A*0205, A*0206, A*0207, A*6802), A3 (A*0301, A*1101, A*3101, A*3301, A*6801, A*6601), and B7 (B*0702, B*3501, B*5101, B*5102, B*5301, B*5401) are present in more than 88% of the population. Thus, this selection facilitates maximizing the population coverage of vaccines with a minimum number of peptides.
3. If the location is not specified, the search will return a list of proteins from CORE, IMV, and EEV, as well as others whose location has not been determined, marked NA (not available).
4. Each HLA-specific profile has a maximum score, and with PCT OPT users can select peptides by their score relative to that maximum, expressed as a percentage. The higher this binding threshold, the fewer epitopes will be retrieved.
5. By visiting the Virus Pathogen Database (http://www.viprbrc.org/), you can access detailed information about the virus strain. It also provides information and annotations about the specific protein, including the isoelectric point, molecular weight, domains and motifs, predicted epitopes, ontologies, orthologues, and protein similarities.
6. If you select multiple HLA molecules in the SEARCH section, note that different HLA molecules binding the same peptide appear as separate rows in the output page.
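The relative score described in Note 4 can be illustrated with a short Python sketch. It assumes, as a simplification not confirmed in this chapter, that a profile assigns an additive score to each residue at each position, so that the maximum score is the sum of the best value per position. The toy profile has three positions and invented values; real profiles for the 9-residue peptides handled by EPIPOX would have nine.

```python
def max_score(profile):
    """Highest attainable score: the best residue at every position."""
    return sum(max(column.values()) for column in profile)


def pct_opt(peptide, profile, default=0.0):
    """Score of a peptide as a percentage of the profile maximum."""
    if len(peptide) != len(profile):
        raise ValueError("peptide and profile must have the same length")
    score = sum(col.get(aa, default) for aa, col in zip(peptide, profile))
    return 100.0 * score / max_score(profile)


# Hypothetical 3-position profile with invented values.
toy_profile = [
    {"A": 2.0, "K": 1.0},
    {"L": 4.0, "V": 3.0},
    {"K": 4.0, "R": 2.0},
]

for peptide in ("ALK", "AVR", "KVR"):
    print(peptide, pct_opt(peptide, toy_profile))
```

With a threshold of >60%, the toy peptides ALK and AVR would be retained and KVR discarded, which shows why raising the threshold retrieves fewer epitopes.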

Acknowledgments
We are thankful to the ANTICIPA-CM project of Complutense University of Madrid for supporting L.B-S. H.P-P is supported by an FPU 2019 grant.

References
1. Theves C, Biagini P, Crubezy E (2014) The rediscovery of smallpox. Clin Microbiol Infect 20(3):210–218
2. Buller RM, Palumbo GJ (1991) Poxvirus pathogenesis. Microbiol Rev 55(1):80–122
3. Riedel S (2005) Edward Jenner and the history of smallpox and vaccination. Proc (Bayl Univ Med Cent) 18(1):21–25
4. Diaz JH (2021) The disease ecology, epidemiology, clinical manifestations, management, prevention, and control of increasing human infections with animal orthopoxviruses. Wilderness Environ Med 32(4):528–536
5. Molero-Abraham M, Glutting JP, Flower DR, Lafuente EM, Reche PA (2015) EPIPOX: immunoinformatic characterization of the shared T-cell epitome between variola virus and related pathogenic orthopoxviruses. J Immunol Res 2015:738020
6. Jenner E (1801) On the origin of the vaccine inoculation. Med Phys J 5(28):505–508
7. Alzhanova D, Fruh K (2010) Modulation of the host immune response by cowpox virus. Microbes Infect 12(12–13):900–909
8. Schrick L, Tausch SH, Dabrowski PW, Damaso CR, Esparza J, Nitsche A (2017) An early American smallpox vaccine based on horsepox. N Engl J Med 377(15):1491–1492
9. Maurer DM, Harrington B, Lane JM (2003) Smallpox vaccine: contraindications, administration, and adverse reactions. Am Fam Physician 68(5):889–896
10. Bray M, Wright ME (2003) Progressive vaccinia. Clin Infect Dis 36(6):766–774
11. Bunge EM, Hoet B, Chen L, Lienert F, Weidenthaler H, Baer LR et al (2022) The changing epidemiology of human monkeypox-a potential threat? A systematic review. PLoS Negl Trop Dis 16(2):e0010141
12. Ballesteros-Sanabria L, Pelaez-Prestel HF, Ras-Carmona A, Reche PA (2022) Resilience of spike-specific immunity induced by COVID-19 vaccines against SARS-CoV-2 variants. Biomedicines 10(5):996
13. Vita R, Overton JA, Greenbaum JA, Ponomarenko J, Clark JD, Cantrell JR et al (2015) The immune epitope database (IEDB) 3.0. Nucleic Acids Res 43(Database issue):D405–D412
14. Jing L, Davies DH, Chong TM, Chun S, McClurkan CL, Huang J et al (2008) An extremely diverse CD4 response to vaccinia virus in humans is revealed by proteome-wide T-cell profiling. J Virol 82(14):7120–7134
15. Oseroff C, Kos F, Bui HH, Peters B, Pasquetto V, Glenn J et al (2005) HLA class I-restricted responses to vaccinia recognize a broad array of proteins mainly involved in virulence and viral gene regulation. Proc Natl Acad Sci U S A 102(39):13980–13985
16. Smith CL, Mirza F, Pasquetto V, Tscharke DC, Palmowski MJ, Dunbar PR et al (2005) Immunodominance of poxviral-specific CTL in a human trial of recombinant-modified vaccinia Ankara. J Immunol 175(12):8431–8437
17. Kim M, Yang H, Kim SK, Reche PA, Tirabassi RS, Hussey RE et al (2004) Biochemical and functional analysis of smallpox growth factor (SPGF) and anti-SPGF monoclonal antibodies. J Biol Chem 279(24):25838–25848
18. Yang H, Kim SK, Kim M, Reche PA, Morehead TJ, Damon IK et al (2005) Antiviral chemotherapy facilitates control of poxvirus infections through inhibition of cellular signal transduction. J Clin Invest 115(2):379–387
19. Kastenmuller W, Gasteiger G, Gronau JH, Baier R, Ljapoci R, Busch DH et al (2007) Cross-competition of CD8+ T cells shapes the immunodominance hierarchy during boost vaccination. J Exp Med 204(9):2187–2198

Part III Prediction of Antigenicity and Immunogenicity: Tools and Protocols

Chapter 13
Prediction of Linear B Cell Epitopes in Proteins
Juan R. de los Toyos

Abstract
The accurate prediction of B cell epitopes is crucial for the design and development of vaccines, especially preventive vaccines for emerging pathogenic diseases. Preventive vaccines are mainly based on the induction of highly specific neutralizing antibodies. This chapter deals with some prediction methods, currently available as user-friendly online servers, to predict B cell epitopes in proteins. A final assessment to validate these predictions is made by resorting to the Immune Epitope Database (IEDB).

Key words Prediction, Linear B cell epitopes, Proteins

1 Introduction
Most approved vaccines are preventive, aimed at avoiding the potential development of a disease. In the case of diseases triggered by a pathogen, prevention mainly relies on the induction of specific antibodies against antigenic components present on the surface of the pathogen, or of antibodies with anti-toxin capacity. Against developing tumors and autoimmune diseases, therapeutic vaccines are the applicable choice. Besides those based on the induction of specific cellular responses, other vaccines pursue the development of specific antibodies to tumor-associated antigens prominently expressed on the cell surface. So far, in the case of autoimmune diseases, therapeutic vaccines are mainly aimed at eliciting a cellular immune response. Conversely, the onset of an autoimmune condition following a vaccination is a rare event. For neurological disorders, such as Alzheimer’s disease, either preventive or therapeutic vaccines are far from being a reality. The antigenic components with B cell epitopes, able to induce and to be recognized by specific antibodies, are of varied nature, as they can be haptens, carbohydrates, nucleic acids or proteins, and less frequently lipids.

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_13, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

For computational vaccine design, it is of paramount importance to precisely predict the existence and performance of B cell epitopes with the capacity to efficiently induce prophylactic and/or therapeutic antibodies. For methodological reasons, the prediction of B cell epitopes has so far been restricted exclusively to proteins [1]. In proteins, B cell epitopes can be continuous (linear), discontinuous (conformational), or neoepitopes, which are not recognized as such in the native protein but only after a change such as, for instance, a proteolytic process. B cell epitopes in native proteins are predominantly conformational. Linear B cell epitopes are only a minority, but they are easier to predict and relevant for vaccination, as they can be formulated as peptides conjugated to a carrier. Since the pioneering “Antigenic index (Ai)” of Jameson and Wolf in 1988 [2], the in silico prediction of linear B cell epitopes in proteins has relied on the growing knowledge of their properties and characterization (the Immune Epitope Database (IEDB), www.iedb.org [3], maintains an updated and curated collection of experimental data) and on the formulation and implementation of continuously improved software [4–6].

2 Materials
Some of the software developed to predict linear B cell epitopes in proteins is proprietary, but other software is freely available on the Internet as user-friendly online servers. The usage of EpitopeVec [7], while freely available, may require the installation of new packages. The prediction can be made from the amino acid sequence of the protein or from a pdb file of its 3D structure, if known. As a demonstration, the following user-friendly online servers, chosen among the most recent and those with the best reported performance, will be tried:

1. ElliPro: http://tools.iedb.org/ellipro/.
2. BepiPred-2.0: http://tools.iedb.org/bcell/ or https://services.healthtech.dtu.dk/service.php?BepiPred-2.0.
3. DLBEpitope: http://ccb1.bmi.ac.cn:81/dlbepitope/.
4. BCEPS: http://imbio.med.ucm.es/bceps/.

For our purpose, pneumolysin, a 471 amino acid bacterial protein, with Swiss-Prot ID P0C2J9 (https://www.uniprot.org/uniprot/P0C2J9) and PDB 5CR6 (https://www.rcsb.org/structure/5cr6), will be analyzed.
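Since several of these servers take a protein sequence in FASTA format, it can help to check the input before submitting it. The Python sketch below parses FASTA text and reports characters outside the 20 standard amino acids. The function names and the toy record are hypothetical, and whether a given server accepts nonstandard residue codes is not stated in this chapter.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def nonstandard_residues(sequence):
    """Return the sorted characters that are not standard amino acids."""
    return sorted(set(sequence) - STANDARD_AA)


# Hypothetical toy record (NOT the pneumolysin sequence).
example = ">toy_protein demo\nACDEFGHIKL\nmnpqrstvwx\n"

parsed = read_fasta(example)
for header, sequence in parsed:
    print(header, len(sequence), nonstandard_residues(sequence))
```

A non-empty list from `nonstandard_residues` flags residues worth reviewing before the sequence is pasted into a prediction server.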

3 Methods
As the prediction of protein B cell epitopes is still of low reliability [4, 8], in practice, for a given protein, it is advisable to apply several different prediction methods and to look for concordant, consensus results (see Note 1).

ElliPro, derived from Ellipsoid and Protrusion [9], predicts both linear and conformational epitopes from a pdb file, based on structural features such as solvent accessibility and flexibility. Its results are displayed in a manner similar to those of BepiPred-2.0 [10], and it is accompanied by the Jmol molecular viewer, which renders the location of each predicted epitope in the context of the whole 3D structure of the protein.

BepiPred-2.0, Sequential B-Cell Epitope Predictor [10], predicts linear epitopes from a Swiss-Prot ID or a protein sequence. This tool renders a graphical display of the antigenic amino acid stretches along the whole sequence of the protein, as well as a table of the predicted peptides, of different lengths, in which each residue has an antigenic score above the set threshold (default, 0.500).

DLBEpitope [11] is a deep learning-based model for linear B cell epitope prediction from the sequence of a protein in FASTA format. It allows the user to set the Epitope length (11–50, default 38) (see Note 2), the Score threshold (0–11, default 11), and the Filter of overlapping epitopes.

BCEPS, B Cell Epitope Prediction Software [12], starts with the sequence of a protein in FASTA format and makes predictions according to various models, of which the best-performing one is based on a support vector machine (SVM). The user may select the length of peptides (see Note 3) as well as the threshold (default, 0.5). B cell epitopes may be predicted as “Extended” or “Collapsed.” Results are presented in a table reporting the sequences of the peptides, which are also shown in a graphical display.

Figure 1 reproduces the graphical displays from the ElliPro, BepiPred-2.0, and BCEPS analyses of pneumolysin. The similarity between the predictions of BepiPred-2.0 and BCEPS is fairly apparent. Tables 1, 2, 3, and 4 show the linear peptide characteristics resulting from the ElliPro, BepiPred-2.0, DLBEpitope, and BCEPS analyses of pneumolysin. Coincident amino acid stretches between analyses are in bold. From these predictions, taking into consideration only results concordant in the four analyses, it may be concluded that, at least, amino acid stretches (18) KKKLLTHQGE (27), (135) WHQDYGQVNNV (145), (221) EDLKQRGISAERPL (234), and (379) RNGELSYDHQGKEVLTPKAWDRNGQDLTA (406) could be or encompass B cell epitopes of pneumolysin.
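One strict way to look for concordant results is to intersect the residue positions covered by each method and report the runs of consecutive positions that remain. The Python sketch below does this for inclusive (start, end) intervals. It is only an illustration: the function names, the method labels, and the toy intervals are hypothetical, and a strict positional intersection may be more conservative than the visual comparison used in this chapter.

```python
def covered_positions(intervals):
    """Expand inclusive (start, end) intervals into a set of positions."""
    return {pos for start, end in intervals for pos in range(start, end + 1)}


def consensus_stretches(predictions, min_length=1):
    """Return inclusive (start, end) stretches predicted by every method."""
    common = set.intersection(
        *(covered_positions(ivals) for ivals in predictions.values())
    )
    stretches, run = [], []
    for pos in sorted(common):
        if run and pos != run[-1] + 1:  # gap: close the current run
            stretches.append((run[0], run[-1]))
            run = []
        run.append(pos)
    if run:
        stretches.append((run[0], run[-1]))
    return [(s, e) for s, e in stretches if e - s + 1 >= min_length]


# Hypothetical intervals for three unnamed methods.
toy_predictions = {
    "method_a": [(5, 20), (40, 60)],
    "method_b": [(10, 25), (45, 50), (55, 70)],
    "method_c": [(1, 18), (48, 58)],
}

all_consensus = consensus_stretches(toy_predictions)
long_consensus = consensus_stretches(toy_predictions, min_length=4)
print(all_consensus, long_consensus)
```

The `min_length` parameter is a design choice that discards very short overlaps, which are unlikely to represent a complete linear epitope.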

Fig. 1 Graphical displays of the ElliPro, BepiPred-2.0, and BCEPS analyses of pneumolysin. Panels: ElliPro 2D score chart for 5CR6 chain D (1-471), score versus position; BepiPred-2.0 2D score chart for UniProtKB P0C2J9 (TACY_STRPN), score versus position, threshold 0.500; BCEPS graphical display for UniProtKB P0C2J9 (TACY_STRPN)

Table 1 Peptide characteristics from the ElliPro prediction analysis of pneumolysin

No.  Chain  Start  End  Peptide                                             Number of residues  Score
1    D      125    146  GAVNDLLAKWHQDYGQVNNVPA                              22                  0.806
2    D      420    445  NLSVKIREATGLAWEWWRTVYEKTDL                          26                  0.796
3    D      364    413  LLDHSGAYVAQYYITWDELSYNHQGKEVLTPKAWDRNGQDLTAHFTTSIP  50                  0.739
4    D      452    471  TISIWGTTLYPQVEDKVEND                                20                  0.74
5    D      1      33   MANKAVNDFILAMNYDKKKLLTHQGESIENRFI                   33                  0.734
6    D      204    233  VDAVKNPGDVFQDTVTVEDLKQRGISAERP                      30                  0.734
7    D      315    326  QEGSRFTADHPG                                        12                  0.659
8    D      89     118  LLAVDRAPMTYSIDLPGLASSDSFLQVEDP                      30                  0.627
9    D      336    341  LRDNVV                                              6                   0.604
10   D      269    275  GVKVAPQ                                             7                   0.564
11   D      293    308  GSGARVVTGKVD                                        12                  0.522

As an assessment exercise for these predictions, the IEDB (https://www.iedb.org/) may be consulted, specifying linear peptide of any length, Streptococcus pneumoniae (ID: 1313), Pneumolysin [Q7ZAK5] (Streptococcus pneumoniae), and only positive B cell assays. From this assessment, it is found that the amino acid stretch (221) EDLKQRGISAERPL (234) is part of the DTVTVEDLKQRGISAERPLVYISS linear peptidic epitope (epitope ID: 10519), and that the end of (379) RNGELSYDHQGKEVLTPKAWDRNGQDLTA (406) overlaps with the GQDLTAH linear peptidic epitope (epitope ID: 21856). Conversely, the amino acid stretches (18) KKKLLTHQGE (27) and (135) WHQDYGQVNNV (145) have not been identified so far as part of pneumolysin epitopes. Likewise, some experimentally recognized epitopes have not been identified in these prediction analyses.
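The comparison between a predicted stretch and a known epitope can also be made programmatically by finding their longest shared contiguous segment. The Python sketch below uses the standard-library `difflib.SequenceMatcher` for this; the function name is hypothetical, and the sequences are those quoted in the paragraph above.

```python
from difflib import SequenceMatcher


def longest_shared_stretch(predicted, epitope):
    """Return the longest contiguous stretch common to both peptides."""
    matcher = SequenceMatcher(None, predicted, epitope, autojunk=False)
    match = matcher.find_longest_match(0, len(predicted), 0, len(epitope))
    return predicted[match.a:match.a + match.size]


# (predicted stretch, IEDB epitope) pairs quoted in the text above.
pairs = [
    ("EDLKQRGISAERPL", "DTVTVEDLKQRGISAERPLVYISS"),
    ("RNGELSYDHQGKEVLTPKAWDRNGQDLTA", "GQDLTAH"),
]

shared = [longest_shared_stretch(pred, epi) for pred, epi in pairs]
for (pred, epi), stretch in zip(pairs, shared):
    print(epi, stretch, len(stretch))
```

A shared stretch equal to the whole predicted peptide indicates that the prediction lies entirely within the known epitope, whereas a shorter one indicates partial overlap.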

4 Notes

1. To obtain truly comparable results, it would be very desirable for the different methods applied to allow setting the same or similar analytical parameters, such as epitope length/window size and/or score threshold values. Nevertheless, default settings seem to be the best for each particular method.

Table 2 Peptide characteristics from the BepiPred-2.0 prediction analysis of pneumolysin

No.  Start  End  Peptide                               Length
1    7      42   NDFILAMNYDKKKLLTHQGESIENRFIKEGNQLPDE  36
2    48     65   RKKRSLSTNTSDISVTAT                    18
3    67     67   D                                     1
4    80     85   ETLLEN                                6
5    104    123  PGLASSDSFLQVEDPSNSSV                  20
6    135    144  HQDYGQVNNV                            10
7    155    162  AHSMEQLK                              8
8    166    192  GSDFEKTGNSLDIDFNSVHSGEKQIQI           27
9    207    230  VKNPGDVFQDTVTVEDLKQRGISA              24
10   256    258  SDE                                   3
11   268    283  KGVKVAPQTEWKQILD                      16
12   297    300  SSGA                                  4
13   318    325  SRFTADHP                              8
14   349    360  DYVETKVTAYRN                          12
15   383    409  SYDHQGKEVLTPKAWDRNGQDLTAHFT           27
16   430    433  GLAW                                  4
17   440    450  YEKTDLPLVRK                           11
18   462    468  PQVEDKV                               7

2. For our DLBEpitope analysis, the epitope length will be set at 14, as, according to Galanis et al. [8], the mean peptide length of the data sets used in their study is about 14; the Score threshold will be the default 11; and overlapping epitopes will be filtered on. 3. As above, in our BCEPS analysis, the epitope length will be set up at 14. And to obtain the most simple and comparable results with the other analyses, only collapsed B cell epitopes will be returned.


Table 3 Peptide characteristics from the DLBE prediction analysis of pneumolysin

You want epitopes 14 AAs long with a threshold ≥ 11. Filter on. Filter of overlapping epitopes: ON

Seq. name | No. | Start | End | Sequence | Score
AJS15225 | 1 | 14 | 27 | NYDKKKLLTHQGES | 11
AJS15225 | 2 | 92 | 105 | VDRAPMTYSIDLPG | 11
AJS15225 | 3 | 134 | 147 | WHQDYGQVNNVPAR | 11
AJS15225 | 4 | 169 | 182 | FEKTGNSLDIDFNS | 11
AJS15225 | 5 | 189 | 202 | QIQIVNFKQIYYTV | 11
AJS15225 | 6 | 221 | 234 | EDLKQRGISAERPL | 11
AJS15225 | 7 | 271 | 284 | KVAPQTEWKQILDN | 11
AJS15225 | 8 | 325 | 338 | PGLPISYTTSFLRD | 11
AJS15225 | 9 | 372 | 385 | VAQYYITWNELSYD | 11
AJS15225 | 10 | 393 | 406 | TPKAWDRNGQDLTA | 11
AJS15225 | 11 | 424 | 437 | KIRECTGLAWEWWR | 11
AJS15225 | 12 | 450 | 463 | KRTISIWGTTLYPQ | 11
END

Table 4 Peptide characteristics from the BCEPS prediction analysis of pneumolysin

Start | Peptide | BEPI
18 | KKKLLTHQGESIENRFIKE | 1
44 | PDEIERKKRSLSTNTSD | 1
114 | LQVEDPSNSSVRG | 1
133 | AKWHQDYGQVNNV | 1
178 | DIDFNSVHSGE | 1
210 | NPGDVFQDTVTVEDLKQ | 1
275 | PQTEWKQI | 1
312 | DGKIQEGSRFTADHPGL | 1
379 | RNGELSYDHQGKEVLTPKAWDRNGQDLTA | 1
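One hedged way to compare the outputs of the different methods is to intersect the residue ranges that each reports for the same region. The sketch below does this for the stretch around residues 133–147, taking the BepiPred-2.0 and DLBE ranges from Tables 2 and 3. The BCEPS end position is derived as start + length − 1, which is an assumption because Table 4 lists only start positions. The name `common_range` is illustrative, and the comparison presumes that all tools number residues identically, which is worth checking against the peptide strings themselves.

```python
from typing import Optional, Sequence, Tuple

Range = Tuple[int, int]  # inclusive (start, end), 1-based residue numbers


def common_range(ranges: Sequence[Range]) -> Optional[Range]:
    """Return the residue range shared by all predictions, or None."""
    start = max(s for s, _ in ranges)
    end = min(e for _, e in ranges)
    return (start, end) if start <= end else None


# BCEPS reports only the start position, so the end is derived here.
bceps_start, bceps_peptide = 133, "AKWHQDYGQVNNV"

predictions = {
    "BepiPred-2.0": (135, 144),  # Table 2, peptide no. 6
    "DLBE": (134, 147),          # Table 3, peptide no. 3
    "BCEPS": (bceps_start, bceps_start + len(bceps_peptide) - 1),
}

shared = common_range(list(predictions.values()))
print(shared)
```

Ranges that do not intersect return `None`, which makes it easy to list the regions supported by only some of the methods.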



References

1. Zierep PF, Vita R, Blazeska N, Moumbock AFA, Greenbaum JA, Peters B et al (2022) Towards the prediction of non-peptidic epitopes. PLoS Comput Biol 18(2):e1009151. https://doi.org/10.1371/journal.pcbi.1009151
2. Jameson BA, Wolf H (1988) The antigenic index: a novel algorithm for predicting antigenic determinants. Comput Appl Biosci 4:181–186. https://doi.org/10.1093/bioinformatics/4.1.181
3. Vita R, Mahajan S, Overton JA, Dhanda SK, Martini S, Cantrell JR, Wheeler DK, Sette A, Peters B (2019) The Immune Epitope Database (IEDB): 2018 update. Nucleic Acids Res 47:D339–D343. https://doi.org/10.1093/nar/gky1006
4. Sun P, Guo S, Sun J, Tan L, Lu C, Ma Z (2019) Advances in in-silico B-cell epitope prediction. Curr Top Med Chem 19:105–115. https://doi.org/10.2174/1568026619666181130111827
5. Ashford J, Reis-Cunha J, Lobo I, Lobo F, Campelo F (2021) Organism-specific training improves performance of linear B-cell epitope prediction. Bioinformatics 37:4826–4834. https://doi.org/10.1093/bioinformatics/btab536
6. Ras-Carmona A, Lehmann AA, Lehmann PV, Reche PA (2022) Prediction of B cell epitopes in proteins using a novel sequence similarity-based method. Sci Rep 12:13739. https://doi.org/10.1038/s41598-022-18021-1
7. Bahai A, Asgari E, Mofrad MRK, Kloetgen A, McHardy AC (2021) EpitopeVec: linear epitope prediction using deep protein sequence embeddings. Bioinformatics 37:4517–4525. https://doi.org/10.1093/bioinformatics/btab467
8. Galanis KA, Nastou KC, Papandreou NC, Petichakis GN, Pigis DG, Iconomidou VA (2021) Linear B-cell epitope prediction for in silico vaccine design: a performance review of methods available via command-line interface. Int J Mol Sci 22(6):3210. https://doi.org/10.3390/ijms22063210
9. Ponomarenko J, Bui HH, Li W, Fusseder N, Bourne PE, Sette A, Peters B (2008) ElliPro: a new structure-based tool for the prediction of antibody epitopes. BMC Bioinformatics 9:514. https://doi.org/10.1186/1471-2105-9-514
10. Jespersen MC, Peters B, Nielsen M, Marcatili P (2017) BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes. Nucleic Acids Res 45:W24–W29. https://doi.org/10.1093/nar/gkx346
11. Liu T, Shi K, Li W (2020) Deep learning methods improve linear B-cell epitope prediction. BioData Min 13:1. https://doi.org/10.1186/s13040-020-00211-0
12. Ras-Carmona A, Pelaez-Prestel HF, Lafuente EM, Reche PA (2021) BCEPS: a web server to predict linear B cell epitopes with enhanced immunogenicity and cross-reactivity. Cells 10:2744. https://doi.org/10.3390/cells10102744

Chapter 14

Design of Linear B Cell Epitopes and Evaluation of Their Antigenicity, Allergenicity, and Toxicity: An Immunoinformatics Approach

Vijaya Sai Ayyagari

Abstract

Immunoinformatics is a modern branch of science formed at the intersection of immunology and computer science. B cell epitopes are of two types, linear and discontinuous. The residues of a linear epitope lie next to each other in the primary structure of a protein, whereas the amino acids that constitute a discontinuous epitope lie close to each other in the three-dimensional structure of the protein. Recognition of B cell epitopes on an antigen by antibodies is an important event in the immune response to an antigenic challenge and also forms the basis for several immunological applications. Prediction of B cell epitopes in an antigen is therefore one of the important steps in the design of multi-epitope vaccines. This chapter explains the prediction of linear B cell epitopes in an antigen, as well as of their allergenicity, antigenicity, and toxicity, by using online tools.

Key words Immunoinformatics, Epitopes, B cell epitope prediction, Antigenicity, Allergenicity, Toxicity

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_14, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

Immunoinformatics encompasses the prediction of epitopes, and of their safety and efficacy, by computational means [1–3]. It enables the design of multi-epitope vaccines by targeting precise antigenic determinants of a pathogen. Unlike conventional whole-cell vaccines, multi-epitope vaccines reduce the antigenic load on the system, thereby reducing the chances of side effects [4]. The first step in the design of multi-epitope vaccines is the prediction of the residues of an antigen, known as “epitopes,” that are recognized by the receptors of B cells and T cells. B cell receptors bind to epitopes on the surface of an antigen, and the B cells then differentiate into antibody-secreting plasma cells and memory cells. The antibodies then neutralize the antigens. B cell epitopes are of two types: sequential and nonsequential (discontinuous). Sequential epitopes consist of residues that are contiguous in the linear amino acid sequence, whereas nonsequential epitopes consist of residues that lie close to each other in the tertiary structure of the protein but generally lie far apart in the primary structure. The residues constituting nonsequential epitopes are brought close to each other by protein folding [5, 6].

Epitopes that induce strong B cell responses are essential in the design of multi-epitope vaccines and in the development of therapeutic antibodies, diagnostics, etc. Predicting epitopes by computational methods saves resources and time compared with determining them experimentally [3, 7, 8]. Reviews of the different B cell epitope prediction methods are available from El-Manzalawy and Honavar [7], Saha and Raghava [9], and Tomar and De [10]. Several B cell epitope prediction methods were based on the physicochemical properties of amino acids [9, 11]. Later, it was found that a combination of physicochemical properties of the amino acids yields a better prediction of linear B cell epitopes. Further, the availability of experimentally determined B cell epitopes (e.g., obtained from the crystal structures of antigen-antibody complexes [8]), in combination with different machine learning tools, resulted in several linear B cell epitope prediction tools with improved accuracy [11]. Lists of the different types of B cell epitope prediction tools are available from Evans [2], Raoufi et al. [3], El-Manzalawy and Honavar [7], Tomar and De [10], El-Manzalawy et al. [11], and El-Manzalawy et al. [12].

This chapter explains the prediction of linear B cell epitopes from a given antigenic sequence, as well as of their antigenicity, allergenicity, and toxicity, by different online tools.

2 Materials

1. In general, the target protein(s) for which epitopes are to be predicted is identified, and the sequences are retrieved from the NCBI Nucleotide database.
2. Multiple sequence alignment needs to be performed to obtain the consensus/conserved residues among the aligned protein sequences. Epitopes are predicted for these conserved residues (see Note 1).
3. As mentioned earlier, B cell epitopes are of two types: (i) linear and (ii) discontinuous (see Note 2). In this chapter, two linear B cell epitope prediction tools, BcePred [9] and BepiPred-2.0 [8], were used for the prediction of linear B cell epitopes. The former is based on the physicochemical properties of amino acids, and the latter is a machine learning-based approach to linear B cell epitope prediction.
4. BcePred predicts linear B cell epitopes by combining four residue properties, namely, hydrophilicity, flexibility, accessibility, and turns. This resulted in an improved accuracy of 58.70% in the prediction of linear B cell epitopes, which is slightly better than the accuracy obtained (52.92–57.53%) when only one residue property was used [9].
5. BepiPred-2.0, a linear B cell epitope prediction tool, is based on a Random Forest Regression algorithm trained on a dataset consisting of antigens derived from antigen-antibody crystal structures and of epitopes obtained from the IEDB database. It is considered to outperform existing linear B cell epitope prediction tools [8].
6. Identifying epitopes that are non-allergenic and non-toxic, as well as immunogenic, is of utmost importance for the safety and efficacy of a multi-epitope vaccine (see Note 3). Antigenicity, allergenicity, and toxicity of the linear B cell epitopes were predicted using the VaxiJen v2.0 [13], AllerTOP v2.0 [14], and ToxIBTL [15] web servers, respectively.
7. Important steps in the design of multi-epitope vaccines include the prediction of antigenic peptides from a given protein sequence, the evaluation of a given epitope as an antigen or a non-antigen, and the evaluation of the antigenicity of a whole multi-epitope vaccine construct. Reverse vaccinology bypasses the need to carry out wet lab experimentation involving the culturing of microbes, isolation of antigens, and testing them on animal models for their efficacy [16, 17]. VaxiJen v2.0 predicts the antigenicity of a given amino acid sequence based on auto cross-covariance (ACC) transformation [13]. Another online tool, VirVACPRED, predicts viral antigens from a given viral protein sequence based on a gradient-boosting classifier [17].
8. Allergens induce the secretion of IgE antibodies, which leads to the activation of basophils, eosinophils, and mast cells, resulting in the secretion of mediators of inflammation that cause damage to the tissues. Therefore, screening for epitopes that may induce allergic reactions is an important step in the construction of a multi-epitope vaccine. Different in silico tools are available to predict the allergenicity of an epitope. For example, AllerTOP is an alignment-free, auto- and cross-covariance (ACC)-based online tool that predicts the allergenicity of proteins [14].
9. As certain peptides are toxic in nature, it is important to ensure that the peptides designed for multi-epitope vaccines are non-toxic. As with the wet lab experimentation mentioned above, evaluating the toxicity of individual peptides in vitro and in vivo is cumbersome. This is circumvented by predicting the toxicity of the peptides in silico. For example, ToxinPred, a web server available at http://crdd.osdd.net/raghava/toxinpred/, was developed using a Support Vector Machine (SVM) to predict whether a peptide is toxic or non-toxic [18]. The recently developed web server ToxIBTL predicts the toxicity of peptides based on the information bottleneck principle and transfer learning. It is considered to achieve higher performance than existing tools [15].
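As a rough illustration of the scale-based, physicochemical approach mentioned in item 4, the sketch below averages a single residue property over a sliding window and flags the windows that reach a cutoff. It is not BcePred's implementation. The Hopp–Woods hydrophilicity values are swapped in as a familiar example scale, and the window size of 7, the cutoff of 1.0, and the function names are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Hopp-Woods hydrophilicity values, used here only as an example scale.
HOPP_WOODS = {
    "R": 3.0, "D": 3.0, "E": 3.0, "K": 3.0, "S": 0.3, "N": 0.2, "Q": 0.2,
    "G": 0.0, "P": 0.0, "T": -0.4, "A": -0.5, "H": -0.5, "C": -1.0,
    "M": -1.3, "V": -1.5, "I": -1.8, "L": -1.8, "Y": -2.3, "F": -2.5,
    "W": -3.4,
}


def window_scores(seq: str, window: int = 7) -> List[float]:
    """Average the property value over every window of the sequence."""
    values = [HOPP_WOODS[aa] for aa in seq]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


def flag_windows(
    seq: str, window: int = 7, cutoff: float = 1.0
) -> List[Tuple[int, str, float]]:
    """Return (1-based start, peptide, score) for windows at or above cutoff."""
    return [
        (i + 1, seq[i:i + window], round(score, 2))
        for i, score in enumerate(window_scores(seq, window))
        if score >= cutoff
    ]


# Hypothetical usage on a short stretch of Consensus_1 from this chapter.
for start, peptide, score in flag_windows("MADSNGTITVEELKKLLEQW"):
    print(start, peptide, score)
```

Combining several properties, as BcePred does, would amount to computing one such profile per scale and merging them; how the server normalizes and combines the profiles is not reproduced here.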

3 Methods

3.1 Identification of Consensus Regions

1. For the present study, a total of 16 sequences (LC650048, LC650052, LC650055, LC650057, MT066156, MT292577, MW156805, MW425837, MW531680, MZ344999, MZ345001, MZ362440, MZ362445, MZ362449, MZ362451, OK091660) of the Membrane glycoprotein of SARS-CoV2 [1000] were retrieved from the NCBI Nucleotide database (https://www.ncbi.nlm.nih.gov/nucleotide/).
2. Multiple sequence alignment of all 16 sequences was carried out with Clustal Omega, available at https://www.ebi.ac.uk/Tools/msa/clustalo/, to identify the consensus regions (see Fig. 1, Table 1).
3. In the present study, only consensus regions of 15 or more amino acids were considered for the prediction of B cell epitopes.
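The selection of consensus regions in steps 2 and 3 can be sketched as follows, assuming that a consensus region means a run of alignment columns in which every sequence carries the same residue and no gap. The function name `conserved_regions` and the three-sequence toy alignment (with an invented gap) are hypothetical; a real run would read the aligned sequences exported from Clustal Omega.

```python
from typing import List, Tuple


def conserved_regions(
    aligned: List[str], min_len: int = 15
) -> List[Tuple[int, str]]:
    """Return (1-based start column, sequence) of fully conserved runs."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    regions: List[Tuple[int, str]] = []
    run_start, run = None, []
    for col, column in enumerate(zip(*aligned), start=1):
        residue = column[0]
        if residue != "-" and all(c == residue for c in column):
            # Column is identical in every sequence: extend the current run.
            if run_start is None:
                run_start = col
            run.append(residue)
        else:
            # Column differs or contains a gap: close the current run.
            if len(run) >= min_len:
                regions.append((run_start, "".join(run)))
            run_start, run = None, []
    if len(run) >= min_len:
        regions.append((run_start, "".join(run)))
    return regions


# Toy alignment for illustration only; the gap in the third row is invented.
toy_alignment = [
    "MADSNGTITVEELKKLLEQWNLVI",
    "MADSNGTITVEELKKLLEQWNLVI",
    "MADSNGTITVEELKKLLEQ-NLVI",
]
print(conserved_regions(toy_alignment))
```

The default `min_len=15` mirrors the length cutoff in step 3; lowering it shows the shorter conserved stretches that the cutoff discards.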

Fig. 1 Multiple sequence alignment, by Clustal Omega, of the Membrane glycoprotein of different accessions of SARS-CoV2 retrieved from NCBI GenBank

Table 1 Consensus sequences obtained after multiple sequence alignment

Sr. no. | Consensus region id | Consensus sequence
1. | Consensus_1 | MADSNGTITVEELKKLLEQWNLVIGFLFLTWICLLQFAYANRNRFLYIIKLIFLWLLWPVTLACFVLAAVYRINWITGGIA
2. | Consensus_2 | AMACLVGLMWLSYFIASFRLFARTRSMWSFNPETNILLNVPLHGTILTRPLLESELVIGAVILRGHLRIAGHHLGRCDIKDLPKEITVATSRTLSYYKLGASQRVAGDSGFAAYSRYRIGNYKLNTDHSSSSDNIALLVQ

3.2 Prediction of B Cell Epitopes

1. In this chapter, two linear B cell epitope prediction tools were used, namely, BepiPred-2.0 (https://services.healthtech.dtu.dk/service.php?BepiPred-2.0) and BcePred (http://crdd.osdd.net/raghava/bcepred/). The epitopes that were predicted in common by the two web servers were considered for inclusion in the construction of a multi-epitope vaccine (see Note 4).
2. For the prediction of epitopes using BepiPred-2.0, the amino acid sequences of the consensus regions were pasted in FASTA format, and the “Submit” button was clicked (see Fig. 2). The amino acid(s) that constitutes an epitope(s) is marked with the letter “E” on top of it as well as by a color gradient (see Fig. 3).
3. For the prediction of epitopes using BcePred, one consensus sequence at a time was pasted in the designated space. The threshold values for the various parameters, namely, hydrophilicity, accessibility, exposed surface, antigenic propensity, flexibility, turns, polarity, and combined, were left at their defaults. Default thresholds yield results with the best sensitivity and specificity. Raising the threshold results in better specificity (percentage of correctly predicted non-epitopes) at the expense of sensitivity (percentage of epitopes that are correctly predicted as epitopes) [9]. All four physicochemical properties, hydrophilicity, flexibility, accessibility, and turns, were selected, and the “Submit sequence” button was then clicked (see Fig. 4). The results were displayed in graphical and tabular formats (not shown) and also as an “overlay display” (see Fig. 5). If two epitopes overlapped, they were merged into a single epitope. In the output, the amino acids whose normalized score was above the threshold value (2–2.5) act as B cell epitopes and were displayed in blue in the graphical format, the tabular format, and the overlay display.
4. Only epitopes that were predicted in common by both tools and were at least 12 amino acids long were considered for further analyses (see Note 5). For example, the epitopes LGASQRVAGDSG and DIKDLPKEITVATSRTLSYYKLGASQR were predicted by BepiPred-2.0 and BcePred, respectively. The consensus epitope was then DIKDLPKEITVATSRTLSYYKLGASQRVAGDSG. The consensus epitope was deduced by looking for the overlapping residues between the two epitopes generated by the two online tools. In this case, the overlapping residues were LGASQR. As another example, the epitopes RYRIGNYKLNTDHSSSSDNIA and AAYSRYRIGNYKLNTDHSSSSDNIA were predicted by BepiPred-2.0 and BcePred, respectively. The consensus epitope was AAYSRYRIGNYKLNTDHSSSSDNIA.

Fig. 2 Submission page of BepiPred-2.0

Fig. 3 Output of BepiPred-2.0

Fig. 4 Submission page of BcePred

Fig. 5 A part of the Output page of BcePred displaying the results for consensus sequence #2

3.3 Prediction of Antigenicity, Allergenicity, and Toxicity

1. In VaxiJen v2.0 (http://www.ddg-pharmfac.net/vaxijen/VaxiJen/VaxiJen.html), one epitope at a time was pasted in the space provided. The target organism was selected as “Tumor.” The threshold was left at the default value of 0.5. Of the three options, ACC output, Sequence output, and Summary mode, the option “Sequence output” was selected by default by the server (see Fig. 6). On the results page, the nature of the epitope (i.e., antigen or non-antigen) was displayed along with the prediction score (see Fig. 7).
2. In AllerTOP v2.0 (https://www.ddg-pharmfac.net/AllerTOP/), one epitope at a time was pasted as plain text in the space provided, and the “Get the result” button was clicked (see Fig. 8). On the next page, the result was displayed as either “probable allergen” or “probable non-allergen” (see Fig. 9).
3. In ToxIBTL (https://server.wei-group.net/ToxIBTL/Server.html), multiple peptide sequences were pasted in FASTA format in the space provided, and the “submit” button was clicked (see Fig. 10). On the results page (see Fig. 11), the toxic or non-toxic nature of the peptides was displayed.

Fig. 6 Home page of VaxiJen v2.0

Fig. 7 Results page of VaxiJen v2.0

Fig. 8 Home page of AllerTOP v2.0

Fig. 9 Results page of AllerTOP v2.0

3.4 Deduction of Final B Cell Epitope(s)

Consensus sequences of the epitopes, along with their antigenicity, allergenicity, and toxicity, are shown in Table 2. The epitope DIKDLPKEITVATSRTLSYYKLGASQRVAGDSG (Sr. No. 2) was found to be antigenic, non-allergenic, and non-toxic. Therefore, it may be used as a B cell epitope in the construction of a multi-epitope vaccine.
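The deduction of the final epitope can be retraced with a short sketch that first merges the overlapping predictions of the two tools (Subheading 3.2, step 4) and then keeps only consensus epitopes that are at least 12 residues long, antigenic, non-allergenic, and non-toxic. The function name `merge_overlapping`, the `min_overlap` parameter, and the dictionary layout are assumptions made for illustration. The labels are copied by hand from Table 2 rather than fetched from the web servers, and the merge works on the peptide strings alone, so very short overlaps should be checked against the residue positions.

```python
from typing import Optional


def merge_overlapping(a: str, b: str, min_overlap: int = 1) -> Optional[str]:
    """Merge two peptides that overlap; return None when they do not."""
    if a in b:
        return b
    if b in a:
        return a
    best = None  # (overlap length, merged sequence)
    for left, right in ((a, b), (b, a)):
        # Look for the longest suffix of `left` that is a prefix of `right`.
        for k in range(min(len(left), len(right)) - 1, min_overlap - 1, -1):
            if left.endswith(right[:k]):
                if best is None or k > best[0]:
                    best = (k, left + right[k:])
                break
    return best[1] if best else None


# Predictions quoted in Subheading 3.2, step 4 (BepiPred-2.0, BcePred).
merged_1 = merge_overlapping("LGASQRVAGDSG", "DIKDLPKEITVATSRTLSYYKLGASQR")
merged_2 = merge_overlapping(
    "RYRIGNYKLNTDHSSSSDNIA", "AAYSRYRIGNYKLNTDHSSSSDNIA"
)

# Labels transcribed from Table 2.
candidates = [
    {"epitope": "MADSNGTITVEELKKLLEQW",
     "antigen": True, "allergen": True, "toxic": False},
    {"epitope": merged_1, "antigen": True, "allergen": False, "toxic": False},
    {"epitope": merged_2, "antigen": False, "allergen": False, "toxic": False},
]

final = [
    c["epitope"]
    for c in candidates
    if len(c["epitope"]) >= 12
    and c["antigen"] and not c["allergen"] and not c["toxic"]
]
print(final)
```

Raising `min_overlap` makes the merge stricter, which may be preferable when many short peptides are compared.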

Fig. 10 Home page of ToxIBTL

Fig. 11 Results page of ToxIBTL

Table 2 Consensus sequences of epitopes predicted by BcePred and BepiPred-2.0 along with their allergenicity, toxicity, and antigenicity

Sr. no. | Consensus region ID | Consensus sequences (minimum 12 amino acids long) predicted by BcePred and BepiPred-2.0 | Antigenicity | Allergenicity | Toxicity
1. | 1 | MADSNGTITVEELKKLLEQW | Antigen | Allergen | Nontoxic
2. | 2 | DIKDLPKEITVATSRTLSYYKLGASQRVAGDSG | Antigen | Nonallergen | Nontoxic
3. | 2 | AAYSRYRIGNYKLNTDHSSSSDNIA | Nonantigen | Nonallergen | Nontoxic

4 Notes

1. Aligning different isolates of similar/different strain(s) or genotype(s) by multiple sequence alignment identifies the consensus regions. Designing vaccines against the consensus regions may result in enhanced efficiency and broad coverage of the vaccine.
2. 10% of all the B cell epitopes recognized by antibodies are found to be continuous [19]. Therefore, it is important to look for discontinuous B cell epitopes in the vaccine construct by using B cell discontinuous epitope prediction tools.
3. Predicted B cell epitopes may be toxic, allergenic, or non-immunogenic. It is important to verify the toxicity, allergenicity, and antigenicity of each epitope before using it in the construction of multi-epitope-based vaccines. For the safety of multi-epitope vaccines, only epitopes that are non-toxic, non-allergenic, and antigenic should be included in the construction of epitope-based vaccines.
4. It is essential to use two or more B cell epitope prediction tools to obtain a reliable prediction of epitopes instead of relying on a single tool [12, 20]. The epitopes that are predicted in common by all the prediction tools should be included in the design of multi-epitope vaccines.
5. The lengths of the B cell epitopes predicted using BepiPred, for instance, range from 1 to >40. Therefore, choosing the optimal length for the B cell epitopes to be included in the design of multi-epitope vaccines is essential. Analyses carried out by Rubinstein et al. [22] revealed that the length of the majority of epitopes ranges from 15 to 25 amino acids, which was found to be in line with the experimental evidence. El-Manzalawy et al. [12] suggested the optimal epitope length to be between 12 and 16 amino acids. Therefore, epitopes whose length falls in the above ranges and beyond may be considered for inclusion in the design of multi-epitope vaccines.

References

1. Brusic V, Petrovsky N (2005) Immunoinformatics and its relevance to understanding human immune disease. Expert Rev Clin Immunol 1(1):145–157. https://doi.org/10.1586/1744666X.1.1.145
2. Evans MC (2008) Recent advances in immunoinformatics: application of in silico tools to drug development. Curr Opin Drug Discov Devel 11(2):233–241
3. Raoufi E, Hemmati M, Eftekhari S, Khaksaran K, Mahmodi Z, Farajollahi MM, Mohsenzadegan M (2020) Epitope prediction by novel immunoinformatics approach: a state-of-the-art review. Int J Pept Res Ther 26(2):1155–1163. https://doi.org/10.1007/s10989-019-09918-z
4. Oli AN, Obialor WO, Ifeanyichukwu MO, Odimegwu DC, Okoyeh JN, Emechebe GO, Adejumo SA, Ibeanu GC (2020) Immunoinformatics and vaccine development: an overview. Immunotargets Ther 9:13–30. https://doi.org/10.2147/ITT.S241064
5. Goldsby RA, Kindt TJ, Osborne BA (2000) Kuby immunology, 4th edn. W. H. Freeman and Company
6. Doytchinova IA, Guan P, Flower DR (2006) EpiJen: a server for multistep T cell epitope prediction. BMC Bioinformatics 7:131. https://doi.org/10.1186/1471-2105-7-131
7. El-Manzalawy Y, Honavar V (2010) Recent advances in B-cell epitope prediction methods. Immunome Res 6(Suppl 2):S2. https://doi.org/10.1186/1745-7580-6-S2-S2
8. Jespersen MC, Peters B, Nielsen M, Marcatili P (2017) BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes. Nucleic Acids Res 45(W1):W24–W29. https://doi.org/10.1093/nar/gkx346
9. Saha S, Raghava GPS (2004) BcePred: prediction of continuous B-cell epitopes in antigenic sequences using physico-chemical properties. In: Nicosia G, Cutello V, Bentley PJ, Timmis J (eds) Artificial immune systems. ICARIS 2004. Lecture notes in computer science, vol 3239. Springer, Heidelberg, pp 197–204
10. Tomar N, De RK (2014) Immunoinformatics: a brief review. Methods Mol Biol 1184:23–55. https://doi.org/10.1007/978-1-4939-1115-8_3
11. El-Manzalawy Y, Dobbs D, Honavar V (2008) Predicting flexible length linear B-cell epitopes. Comput Syst Bioinformatics Conf 7:121–132
12. El-Manzalawy Y, Dobbs D, Honavar VG (2017) In silico prediction of linear B-cell epitopes on proteins. Methods Mol Biol 1484:255–264. https://doi.org/10.1007/978-1-4939-6406-2_17
13. Doytchinova IA, Flower DR (2007) VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines. BMC Bioinformatics 8:4. https://doi.org/10.1186/1471-2105-8-4
14. Dimitrov I, Flower DR, Doytchinova I (2013) AllerTOP – a server for in silico prediction of allergens. BMC Bioinformatics 14(Suppl 6):S4. https://doi.org/10.1186/1471-2105-14-S6-S4
15. Wei L, Ye X, Sakurai T, Mu Z, Wei L (2022) ToxIBTL: prediction of peptide toxicity based on information bottleneck and transfer learning. Bioinformatics 38(6):1514–1524. https://doi.org/10.1093/bioinformatics/btac006
16. Doytchinova IA, Flower DR (2008) Bioinformatic approach for identifying parasite and fungal candidate subunit vaccines. Open Vaccine J 1:22–26
17. Herrera-Bravo J, Farías JG, Contreras FP, Herrera-Belén L, Norambuena JA, Beltrán JF (2022) VirVACPRED: a web server for prediction of protective viral antigens. Int J Pept Res Ther 28(1):35. https://doi.org/10.1007/s10989-021-10345-2
18. Gupta S, Kapoor P, Chaudhary K, Gautam A, Kumar R, Open Source Drug Discovery Consortium, Raghava GPS (2013) In silico approach for predicting toxicity of peptides and proteins. PLoS One 8(9):e73957. https://doi.org/10.1371/journal.pone.0073957
19. Ayyagari VS (2022) Design of siRNA molecules for silencing of membrane glycoprotein, nucleocapsid phosphoprotein, and surface glycoprotein genes of SARS-CoV2. J Genet Eng Biotechnol 20:65. https://doi.org/10.1186/s43141-022-00346-z
20. Luštrek M, Lorenz P, Kreutzer M, Qian Z, Steinbeck F, Wu D, Born N, Ziems B, Hecker M, Blank M, Shoenfeld Y, Cao Z, Glocker MO, Li Y, Fuellen G, Thiesen HJ (2013) Epitope predictions indicate the presence of two distinct types of epitope-antibody-reactivities determined by epitope profiling of intravenous immunoglobulins. PLoS One 8(11):e78605. https://doi.org/10.1371/journal.pone.0078605
21. Potocnakova L, Bhide M, Pulzova LB (2016) An introduction to B-cell epitope mapping and in silico epitope prediction. J Immunol Res 2016:6760830. https://doi.org/10.1155/2016/6760830
22. Rubinstein ND, Mayrose I, Halperin D, Yekutieli D, Gershoni JM, Pupko T (2008) Computational characterization of B-cell epitopes. Mol Immunol 45(12):3477–3489. https://doi.org/10.1016/j.molimm.2007.10.016

Chapter 15

NetCleave: An Open-Source Algorithm for Predicting C-Terminal Antigen Processing for MHC-I and MHC-II

Roc Farriol-Duran, Marina Vallejo-Vallés, Pep Amengual-Rigo, Martin Floor, and Víctor Guallar

Abstract

T cell epitopes presented on the surface of mammalian cells are subjected to a complex network of antigen processing and presentation. Among them, C-terminal antigen processing constitutes one of the main bottlenecks for the generation of epitopes, as it defines the C-terminal end of the final epitope and delimits the peptidome that will be presented downstream. Previously (Amengual-Rigo and Guallar, Sci Rep 111(11):1–8, 2021), we demonstrated that NetCleave stands out as one of the best algorithms for the prediction of C-terminal processing, which in turn can be crucial for designing peptide-based vaccination strategies. In this chapter, we provide a pipeline to exploit the full capabilities of NetCleave, an open-source and retrainable algorithm for predicting C-terminal antigen processing for the MHC-I and MHC-II pathways.

Key words Bioinformatics, Immunoinformatics, Immune system, T cells, MHC-I, MHC-II, HLA, Antigen processing, Neural networks, Epitope predictor

1 Introduction Predicting T cell epitopes is key to understanding the targets of the adaptive immune response and, therefore, to design T cell-based immunotherapies [1] such as peptide-based vaccines [2–5]. However, T cell epitopes are generated by a complex network of antigen processing and presentation pathways that make their prediction challenging [6–8]. The field of T cell epitope prediction has drawn attention to antigen presentation [9], mainly through the determination of MHC binding affinity, with current tools such as NetMHCpan4.1 [10], MHCflurry 2.0 [11], and MixMHCpred [12]. Nevertheless, there have been conflicting reports regarding the correlation Roc Farriol-Duran and Marina Vallejo-Valle´s are co-first authors. Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_15, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

211

212

Roc Farriol-Duran et al.

between MHC binding affinity and epitope immunogenicity [13]. For this reason, we and others [11, 14–17] maintain that describing additional antigen processing pathway features can also complement and enrich the prediction of immunogenic epitopes. Cleavage of proteins into short peptides is the initial bottleneck of antigen processing. This process is performed by the proteasome or the immunoproteasome [6, 18] in the case of MHC-class-I epitopes and by cathepsins in the MIIC compartment for MHCclass-II [19] (Fig. 1). Concretely, these proteolytic enzymes determine the C-terminal endings of the peptides that will remain intact until their presentation to T cells as epitopes. Hence, this prediction has been attempted previously by different tools such as Netchop [20], MAPPP [21], PAProc [22], PCPS [14], MHCflurry 2.0 [11] and as presented here by NetCleave [23]. NetCleave offers a series

Fig. 1 C-terminal antigen processing during protein degradation. Class I epitopes derive mainly from intracellular proteins degraded by the proteasome or the immunoproteasome in the cytoplasm. In contrast, class II epitopes originate from membrane and extracellular proteins and are degraded in the lysosome-like MIIC compartments. Proteasomal enzymes and cathepsins, respectively, cleave proteins into shorter peptides. These peptidases act through the recognition of a cleavage motif, which—in NetCleave’s core—is formed by four amino acids previous to the cleavage site (at the end of the epitope) and the three continuing amino acids on the parental protein sequence. Between these two stretches, the cleavage takes place. Importantly, the C-terminal end of the epitope will remain intact during the rest of the antigen-processing pathway. Hence, predicting this proteolytic activity can help define the final T cell epitope

NetCleave, a C-Terminal Antigen Processing Predictor for MHC-I and MHC-II

213

of advantages compared to the other cleavage site predictors. To start with, NetCleave can be easily retrained using the most updated experimental data (compared to most of the other algorithms that were developed almost two decades ago and with no apparent mechanism for retraining). Second, NetCleave can predict the cleavage site for MHC-class-I and MHC-class-II epitopes (compared to other algorithms that can only predict MHC class I). Third, NetCleave shows an increased performance for C-terminal cleavage in MHC-I than other state-of-the-art methods (Fig.2). In addition, for MHC-II epitopes, the performance is lower, as expected [23], due to the more promiscuous activity of cathepsins and less restricted MHC-II binding groove (Fig. 2). Hereby, we have presented an update to the functions available in NetCleave [23] (Fig. 3), enabling a more user-friendly, flexible C-terminal processing prediction and retraining modes. The presented pipeline allows the user to exploit the full potential of NetCleave applicable to antigen-processing selection of T cell epitopes. C-terminal processing is a useful feature in epitope

0.8

HLA-A HLA-B

0.6

HLA-C Score

HLA class I HLA-DP HLA-DQ

0.4

HLA-DR HLA class II

0.2

0.0

accuracy

Precision

recall

MCC

AUC

Fig. 2 Statistical scores of the NetCleave models on six human allele isotypes (HLA-A, -B, -C, -DP, -DQ, and -DR) and two human allele classes (HLA class I and II), including accuracy, precision, recall, MCC, and AUC. A threshold of 0.5 was defined to compute the classification statistic scores. (Figure extracted from AmengualRigo and Guallar [23])


Roc Farriol-Duran et al.

Fig. 3 NetCleave’s main functions and usage scheme. (a) C-terminal processing prediction. Users can choose between the NetCleave Model and a Custom Model. Three types of input files can be used (a1, a2, and a3); in all cases, cleavage motifs are generated and scored. The user can then rank or filter the candidate epitopes based on the NetCleave score. (b) Model retraining. Users can develop Custom Models by retraining NetCleave using an updated version of the IEDB (b1), a combination of the IEDB and another dataset (b2), or exclusively an external dataset (b3). In all cases, a training dataset is assembled and used to retrain NetCleave’s algorithm to generate a custom model

prediction that complements peptide-based vaccine design efforts. To this end, we have ensured that the NetCleave algorithm will be periodically updated to maintain state-of-the-art performance.

1.1 C-Terminal Antigen Processing as a Predictor of T Cell Epitope Immunogenicity

We evaluated how well the NetCleave C-terminal processing score predicts T cell epitope immunogenicity. We used IEDB T cell assay data (IFN-γ ELISpot) restricted to the human HLA alleles HLA-A*02:01 and HLA-B*07:02. It should be noted that most epitopes selected for T cell assays were previously chosen for their strong HLA binding affinities. Hence, regardless of whether they are immunogenic, the majority of the epitopes in this type of data are successfully cleaved by the proteasome, since this cleavage is the initial antigen-processing step that leads to epitope presentation to T cells. For this reason, the ability of a C-terminal processing predictor to distinguish between immunogenic and non-immunogenic epitopes


should be limited. Accordingly, the NetCleave processing score displays a low AUC for epitope immunogenicity determination (0.53 for HLA-A*02:01 and 0.55 for HLA-B*07:02). This performance is similar to that of other C-terminal processing methods such as MHCflurry 2.0 and slightly better than that of NetChop and PCPS. We envision that C-terminal processing predictors such as NetCleave will play a crucial role in distinguishing cleaved from noncleaved epitopes, a distinction that can be used as a stringent filter in T cell epitope selection pipelines. For instance, when developing an epitope-based vaccine, assessing all the potential epitopes that can be generated from a target protein sequence is infeasible owing to the large number of possible candidates. Selecting the epitopes most likely to be cleaved by the proteasome can therefore greatly reduce the number of candidates to consider. In this regard, we have recently observed that NetCleave was among the top three contributing features in a T cell epitope immunogenicity predictor developed using a machine learning-based multi-predictor approach (unpublished data).

1.2 NetCleave Advantages

NetCleave provides state-of-the-art performance thanks to the following features:

1. Cleavage motif-based C-terminal processing prediction.
2. Suitable for MHC-I and, for the first time, for MHC-II epitope predictions.
3. Trained on extensive data from MHC-eluted peptides sequenced by mass spectrometry.
4. Flexible inputs for C-terminal processing prediction: a protein sequence in a FASTA file, a list of epitopes and their linked UniProt IDs, or a list of epitopes and their corresponding parental protein sequences.
5. Flexible model retraining mode using customized data sources.

1.3 Program

NetCleave is an immunoinformatics predictor of the protease-driven selection step in the antigenic processing of T cell epitopes. It uses a deep-learning algorithm trained on 48 physicochemical descriptors of a 4 + 3 peptide span (the last four residues of the peptide and the three residues that follow it in the protein), which represents the characteristics of the cleavage motif (Figs. 1 and 4). Owing to this encoding, the method has a more generalizable representation than those based on simple sequence-only encodings (e.g., one-hot encoding). The neural network architecture consists of three layers: the input layer, with 336 neurons, one for each physicochemical descriptor of each motif residue (48 descriptors × 7 residues); the middle layer, with 112 hidden neurons; and the output layer, consisting of a single neuron that outputs the cleavage probability in the range between 0 and 1.
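The layer sizes described above can be made concrete with a small forward pass in plain Python. This is a sketch under stated assumptions, not NetCleave's implementation: the descriptor table is filled with random placeholder numbers, the weights are random and untrained, and the ReLU hidden activation and sigmoid output are assumptions chosen only so that the output falls between 0 and 1.

```python
import math
import random

N_DESCRIPTORS = 48                        # descriptors per residue (from the text)
MOTIF_LENGTH = 7                          # 4 epitope residues + 3 flanking residues
N_INPUT = N_DESCRIPTORS * MOTIF_LENGTH    # 336 input neurons
N_HIDDEN = 112                            # hidden neurons (from the text)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

rng = random.Random(0)

# Placeholder descriptor table: random numbers standing in for real
# physicochemical descriptor values, which are not reproduced here.
DESCRIPTORS = {
    aa: [rng.uniform(-1.0, 1.0) for _ in range(N_DESCRIPTORS)] for aa in AMINO_ACIDS
}


def encode_motif(motif):
    """Concatenate the descriptor vectors of the seven motif residues."""
    if len(motif) != MOTIF_LENGTH:
        raise ValueError("a cleavage motif must have exactly 7 residues")
    vector = []
    for residue in motif:
        vector.extend(DESCRIPTORS[residue])
    return vector


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Untrained random weights (assumption); a real model would load trained ones.
W1 = [[rng.gauss(0.0, 0.05) for _ in range(N_INPUT)] for _ in range(N_HIDDEN)]
B1 = [0.0] * N_HIDDEN
W2 = [[rng.gauss(0.0, 0.05) for _ in range(N_HIDDEN)]]
B2 = [0.0]


def cleavage_probability(motif):
    hidden = dense(encode_motif(motif), W1, B1, relu)
    return dense(hidden, W2, B2, sigmoid)[0]


probability = cleavage_probability("ACDEFGH")  # toy motif, not a real cleavage site
print(len(encode_motif("ACDEFGH")), probability)
```

Because the weights are random, the printed probability carries no biological meaning; the point is only the 336 to 112 to 1 data flow.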


[Figure 4: workflow diagram. Stages: user-defined experimental conditions; obtaining flanking regions (IEDB peptide, UniProt flanking residues); generating cleavage and decoy data; QSAR descriptors (steric, electrostatic, hydrophobic); neural network]

Fig. 4 Overview of the NetCleave data preparation framework. Peptides reported under the defined experimental conditions are automatically selected from the IEDB, and their C-terminal flanking residues are retrieved from UniParc/UniProt. Cleavage and decoy motifs, each seven residues long, are generated and encoded by 46 QSAR descriptors of steric, electrostatic, and hydrophobic properties. (Figure extracted from Amengual-Rigo and Guallar [23])

NetCleave’s original training data comes exclusively from mass spectrometry measurements of processed peptides found in the Immune Epitope Database (IEDB) [24]. The detailed information in this database allows for flexible training, permitting more specific and personalized predictions for different MHC types, alleles, pathogens, and experimental techniques. Along these lines, NetCleave’s C-terminal processing prediction mode accepts both full protein sequences and target epitopes of interest. This flexibility is greatly expanded by the retraining mode, which allows users to retrain the neural network weights using arbitrary ad hoc experimental data. This option circumvents problems arising from outdated databases or under-represented data for specific prediction scenarios. NetCleave code is open source and is publicly available under an MIT license at https://github.com/BSC-CNS-EAPM/NetCleave.

1.3.1 Installation

To install NetCleave, clone the GitHub repository. This option requires the user to install all of NetCleave’s dependencies.

git clone https://github.com/BSC-CNS-EAPM/NetCleave.git

Python dependencies: argparse, pandas, numpy, matplotlib, pathlib, sklearn, keras, tensorflow, and biopython. R dependencies: dplyr and argparser.
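Before running NetCleave, it can be useful to confirm that the Python dependencies listed above are importable in the active environment. The sketch below uses only the standard library; the helper name is hypothetical. Note that two packages are imported under names that differ from their package names: scikit-learn is imported as sklearn and biopython as Bio. The R packages are not covered by this check.

```python
import importlib.util

# Import names of the Python dependencies listed above
# (biopython is imported as "Bio").
PYTHON_MODULES = [
    "argparse", "pandas", "numpy", "matplotlib", "pathlib",
    "sklearn", "keras", "tensorflow", "Bio",
]


def missing_modules(names):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(PYTHON_MODULES)
if missing:
    print("Missing Python modules:", ", ".join(missing))
else:
    print("All listed Python modules were found.")
```

argparse and pathlib ship with Python 3, so only the remaining entries should ever need installing.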

2 Using the NetCleave Algorithm

This section provides detailed descriptions of how to use the two main functions of the NetCleave algorithm: C-terminal processing prediction and model retraining.

2.1 C-Terminal Processing Prediction

As an example, we describe how to predict the C-terminal processing of the entire Monkeypox virus envelope protein (Q8V4U9). This protein is involved in the fusion of the Monkeypox virus with the host cell; hence, it could be a potential candidate for vaccination [25]. The host organism used for prediction will be Homo sapiens; therefore, the mhc_allele chosen will be HLA. The predictions will be performed for MHC-I- and MHC-II-targeted epitopes without specifying any particular allele. This pipeline is organized according to the three types of input files that the C-terminal processing prediction mode accepts:

1. A FASTA file with the sequence of the protein of interest (--pred_input 1). See Notes 1–4.
(a) Depending on the MHC class chosen, this option generates all the potential peptides to consider. By default, 8- to 11-mers are generated for MHC-I and 13- to 17-mers for MHC-II, as these are the most frequent lengths of these types of epitopes. To generate epitopes of one specific length, use the flag --epitope_length followed by that length as a single number.
(b) The C-terminal processing of the generated epitopes will be calculated.
2. A CSV file with the epitope sequences to score (column name = epitope) and the corresponding UniProt ID of their parental protein (column name = uniprot_id) (--pred_input 2). See Note 5.
(a) The epitopes and UniProt IDs need to be matched (i.e., the protein sequence needs to contain the peptide).
(b) If you use this option for the first time, uncompress the files at data/databases/uniprot and data/databases/uniparc.
(c) The peptide will be searched for within the protein sequence corresponding to the UniProt ID, and the cleavage motif will be generated (the four last residues of the peptide and the three residues that follow it in the protein sequence).
(d) The C-terminal processing of the target epitopes will be calculated.


3. A CSV file with the epitope sequences to score (column name = epitope), the corresponding sequence of their parental protein (column name = protein_seq), and the target protein name (column name = protein_name) (--pred_input 3).
(a) The matching epitopes, protein sequences, and names need to be placed in the same row.
(b) The epitopes will be searched for within the matched protein sequence, and the cleavage motif will be extracted.
(c) The C-terminal processing of the target epitopes will be calculated.

2.1.1 C-Terminal Prediction from a Protein Sequence in FASTA File (Type 1)
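Conceptually, the FASTA mode enumerates candidate peptides and builds one cleavage motif per peptide. The sketch below illustrates that enumeration together with the input checks described in Notes 1–3 (single record, upper-case standard residues, no motif for extreme C-terminal peptides). It is not NetCleave's own code: the function names are hypothetical and the sequence is a toy string, not Q8V4U9.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")
# Default epitope lengths stated in the text: 8-11 for MHC-I, 13-17 for MHC-II.
DEFAULT_LENGTHS = {"I": range(8, 12), "II": range(13, 18)}


def read_single_fasta(text):
    """Return the sequence of a FASTA string holding exactly one record."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    headers = [line for line in lines if line.startswith(">")]
    if len(headers) != 1:
        raise ValueError("expected exactly one FASTA record")
    sequence = "".join(line for line in lines if not line.startswith(">"))
    unexpected = set(sequence) - VALID_RESIDUES
    if unexpected:
        raise ValueError(f"unexpected characters: {sorted(unexpected)}")
    return sequence


def candidate_motifs(sequence, mhc_class="I", epitope_length=None):
    """List (peptide, 4 + 3 cleavage motif) pairs for every candidate peptide."""
    lengths = [epitope_length] if epitope_length else DEFAULT_LENGTHS[mhc_class]
    candidates = []
    for length in lengths:
        # Stop early enough that three residues remain after the peptide;
        # peptides at the extreme C-terminus therefore yield no motif.
        for start in range(len(sequence) - length - 3 + 1):
            end = start + length
            peptide = sequence[start:end]
            motif = peptide[-4:] + sequence[end:end + 3]
            candidates.append((peptide, motif))
    return candidates


toy_fasta = ">toy_protein (illustrative only)\nACDEFGHIKL\nMNPQRSTVWY\n"
toy_sequence = read_single_fasta(toy_fasta)
toy_candidates = candidate_motifs(toy_sequence, mhc_class="I")
print(len(toy_candidates), toy_candidates[0])
```

For a protein of length L and peptides of length n, this yields L - n - 2 candidates per length, which shows how quickly the candidate list grows for real proteins.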

1. Use the FASTA file containing the amino acid sequence of your protein of interest, in this case the envelope protein of the Monkeypox virus (Q8V4U9) (see Notes 1–4).
2. Predict the C-terminal processing of all the potential HLA class I epitopes by using:

python3 NetCleave.py --predict input/Q8V4U9.fasta --pred_input 1 --mhc_class I --mhc_allele HLA

The results will be saved at output/Q8V4U9_NetCleave.csv. Rename the file to output/I_Q8V4U9_NetCleave.csv to avoid overwriting it.

3. Predict the C-terminal processing of all the potential HLA class II epitopes by using:

python3 NetCleave.py --predict input/Q8V4U9.fasta --pred_input 1 --mhc_class II --mhc_allele HLA

The results are available at output/Q8V4U9_NetCleave.csv; rename the file to output/II_Q8V4U9_NetCleave.csv.

2.1.2 C-Terminal Prediction from Peptides and Corresponding UniProt IDs in CSV File (Type 2)
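A type 2 input file only needs the two columns epitope and uniprot_id, so it can be written with Python's csv module. The sketch below is an assumption-laden illustration: the helper name is hypothetical, and the peptide strings are placeholders that are NOT real fragments of Q8V4U9 and must be replaced with your own epitopes.

```python
import csv
import io


def write_type2_csv(pairs, handle):
    """Write (epitope, uniprot_id) pairs using the column names given in the text."""
    writer = csv.DictWriter(handle, fieldnames=["epitope", "uniprot_id"])
    writer.writeheader()
    for epitope, uniprot_id in pairs:
        writer.writerow({"epitope": epitope, "uniprot_id": uniprot_id})


# Placeholder peptides (not real Q8V4U9 fragments); replace before use.
placeholder_pairs = [
    ("AAAAAAAAA", "Q8V4U9"),
    ("CCCCCCCCC", "Q8V4U9"),
]

buffer = io.StringIO()
write_type2_csv(placeholder_pairs, buffer)
print(buffer.getvalue())
```

To create a file on disk instead of an in-memory buffer, pass a handle opened with open(path, "w", newline="").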

When using this option for the first time, uncompress the files at data/databases/uniprot and data/databases/uniparc.

1. Prepare your input CSV file with the columns epitope and uniprot_id.
(a) Place your epitopes and their matched UniProt IDs in the same row (see Note 5).
2. Predict the C-terminal processing of MHC-I epitopes using:

python3 NetCleave.py --predict input/Q8V4U9_pept_uniprot.csv --pred_input 2 --mhc_class I --mhc_allele HLA

The results will be available at output/Q8V4U9_pept_uniprot_NetCleave.csv; rename the file to output/I_Q8V4U9_pept_uniprot_NetCleave.csv.


3. If you are working with MHC-II epitopes, predict the C-terminal processing by cathepsins using:

python3 NetCleave.py --predict input/Q8V4U9_pept_uniprot.csv --pred_input 2 --mhc_class II --mhc_allele HLA

The results are available at output/Q8V4U9_pept_uniprot_NetCleave.csv; rename the file to output/II_Q8V4U9_pept_uniprot_NetCleave.csv.

2.1.3 C-Terminal Prediction from Peptides and Corresponding Parental Protein Sequences in a CSV File (Type 3)
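For type 3 input, each epitope is located inside its parental protein sequence and the 4 + 3 cleavage motif is read off around its C-terminus. The sketch below illustrates that lookup. It is not NetCleave's code: the function name is hypothetical, the protein is a toy string, using the first occurrence of the epitope is an assumption, and returning None when fewer than three flanking residues exist is a design choice made for this example.

```python
def cleavage_motif(epitope, protein_seq):
    """Return the last four epitope residues plus the next three protein residues.

    Assumptions: the first occurrence of the epitope is used, and None is
    returned when the protein ends before three flanking residues are available.
    """
    start = protein_seq.find(epitope)
    if start == -1:
        raise ValueError("epitope not found in the protein sequence")
    end = start + len(epitope)
    flank = protein_seq[end:end + 3]
    if len(flank) < 3:
        return None
    return epitope[-4:] + flank


# Toy row using the column names given in the text (illustrative values only).
toy_row = {
    "epitope": "DEFGHIKLM",
    "protein_seq": "ACDEFGHIKLMNPQRSTVWY",
    "protein_name": "toy_protein",
}
toy_motif = cleavage_motif(toy_row["epitope"], toy_row["protein_seq"])
print(toy_motif)
```

Running the same check over every row before calling NetCleave is a quick way to catch epitopes that do not match their protein sequence.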

1. Prepare your input CSV file with the columns epitope, protein_seq, and protein_name.
(a) Place your peptides, matched parental protein sequences, and protein names in the same row.
2. Predict the C-terminal processing of MHC-I epitopes using:

python3 NetCleave.py --predict input/Q8V4U9_pept_prot.csv --pred_input 3 --mhc_class I --mhc_allele HLA

The results are available at output/Q8V4U9_pept_prot_NetCleave.csv; rename the file to output/I_Q8V4U9_pept_prot_NetCleave.csv.

3. Predict the C-terminal processing of MHC-II epitopes using:

python3 NetCleave.py --predict input/Q8V4U9_pept_prot.csv --pred_input 3 --mhc_class II --mhc_allele HLA

The results are available at output/Q8V4U9_pept_prot_NetCleave.csv; rename the file to output/II_Q8V4U9_pept_prot_NetCleave.csv.

2.1.4 Epitope Prioritization and Filtering According to Their C-Terminal Processing Likelihood
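The prioritization rules of this subsection (rank by score, relaxed cutoff > 0.5, stringent cutoff ≥ 0.8 for MHC-I and ≥ 0.6 for MHC-II) can be applied to a NetCleave output file with the standard csv module. In the sketch below, the prediction column name comes from the text, whereas the epitope column name, the function name, and the toy scores are assumptions made for illustration.

```python
import csv
import io

RELAXED_CUTOFF = 0.5                       # score > 0.5, both MHC classes
STRINGENT_CUTOFF = {"I": 0.8, "II": 0.6}   # score >= cutoff, per MHC class


def rank_and_filter(rows, mhc_class="I", stringent=False):
    """Sort rows by descending score and keep those passing the chosen cutoff."""
    ranked = sorted(rows, key=lambda row: float(row["prediction"]), reverse=True)
    if stringent:
        cutoff = STRINGENT_CUTOFF[mhc_class]
        return [row for row in ranked if float(row["prediction"]) >= cutoff]
    return [row for row in ranked if float(row["prediction"]) > RELAXED_CUTOFF]


# Toy output table; peptides and scores are invented for illustration.
toy_output = (
    "epitope,prediction\n"
    "PEPTIDEA,0.91\n"
    "PEPTIDEC,0.55\n"
    "PEPTIDED,0.30\n"
    "PEPTIDEE,0.65\n"
)
toy_rows = list(csv.DictReader(io.StringIO(toy_output)))

relaxed = [row["epitope"] for row in rank_and_filter(toy_rows)]
stringent_i = [row["epitope"] for row in rank_and_filter(toy_rows, "I", stringent=True)]
stringent_ii = [row["epitope"] for row in rank_and_filter(toy_rows, "II", stringent=True)]
print(relaxed, stringent_i, stringent_ii)
```

Keeping the sort and the filter in one function means that every filtered list is already ranked, so the top candidates can be taken directly from its head.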

If you want to prioritize your candidate epitopes based on cleavage processing, we recommend:

1. Sorting the epitopes by their likelihood of being processed, that is, in descending order of NetCleave score.

If you want to classify the candidate epitopes as cleaved or noncleaved, we recommend:

2. Using a relaxed threshold to select the epitopes that are likely to be generated.
(a) NetCleave score (column = prediction) > 0.5 for both MHC-I and MHC-II predictions.
3. Using a stringent threshold to select the epitopes that are highly likely to be generated.
(a) NetCleave score (column = prediction) ≥ 0.8 for MHC-I and ≥ 0.6 for MHC-II predictions.

2.2 Model Retraining

2.2.1 Retraining NetCleave Using an Updated IEDB Dataset (Type 1)

NetCleave provides a retraining mode to keep its training data up to date or to address a specific user need, for instance, working with a species that is not currently available in the program. NetCleave addresses these needs by allowing three different types of training: (1) using a newer version of the IEDB, (2) combining the IEDB with additional data from other sources, and (3) using other data sources without including the IEDB. To keep the model up to date, it can be retrained using new versions of the IEDB. To do so, follow the instructions below.

1. Download the mass spectrometry dataset from the IEDB corresponding to MHC-I ligands (http://www.iedb.org/downloader.php?file_name=doc/mhc_ligand_full_single_file.zip) and uncompress it (see Notes 6 and 7).
2. Process the IEDB raw file.
(a) Execute the code below to select epitopes with the following characteristics: Host = Human (Homo sapiens), MHC class = I, and Method/Technique = mass spectrometry, and to remove duplicates by Description (epitope sequence).

cd predictor/database_functions/
Rscript iedb_processing.R
cd ../../ # to return to the NetCleave root folder

(b) Alternatively, decompress the processed file at data/databases/iedb/mhc_ligand_full_processed.zip.
3. Generate a training dataset for the model by running NetCleave as follows (see Notes 8 and 9):

python3 NetCleave.py --generate --train_input 1 --peptide_data ./data/databases/iedb/mhc_ligand_full_processed.csv --data_path ./data/training_data/new_iedb

As a result, a training data file is generated at data/training_data/new_iedb. This file contains peptide sequences and their corresponding label (1 for observed cleavage sites, 0 for decoy samples).


4. Train the model using the previously generated data. To do so, execute the following code (see Note 10):

python3 NetCleave.py --train --data_path ./data/training_data/new_iedb --model_path ./data/models/new_iedb

A set of training data is provided in the package; you can find it at data/training_data. To use one of these files instead of your data, run the previous line of code followed by the arguments defining the desired conditions. NetCleave is now retrained with your updated IEDB version and ready to predict C-terminal antigen processing. To use it for prediction, run the following:

python3 NetCleave.py --predict input/Q8V4U9.fasta --pred_input 1 --model_path ./data/models/new_iedb

2.2.2 Retraining NetCleave Using IEDB and a Target Dataset (Type 2)

The following instructions address the second type of training, describing how to train the model by combining the IEDB with an additional dataset of interest. In this case, we use data from the HLA Ligand Atlas [26] as an example of human self-peptide processing.

1. Download the mass spectrometry dataset from the IEDB corresponding to MHC-I ligands (http://www.iedb.org/downloader.php?file_name=doc/mhc_ligand_full_single_file.zip) and uncompress it (see Notes 6 and 7).
2. Process the IEDB raw file.
(a) Execute the code below to select epitopes with the following characteristics: Host = Human (Homo sapiens), MHC class = I, and Method/Technique = mass spectrometry, and to remove duplicates by Description (epitope sequence).

cd predictor/database_functions/
Rscript iedb_processing.R
cd ../../ # to return to the NetCleave root folder

(b) Alternatively, decompress the processed file at data/databases/iedb/mhc_ligand_full_processed.zip.
3. Download your target database, in this case the HLA Ligand Atlas (https://hla-ligand-atlas.org/rel/hla_2020.12.zip). Once you have the data, uncompress it (see Notes 7 and 11).
4. Preprocess the HLA Ligand Atlas dataset to have the peptide sequences and UniProt IDs in a single CSV file (see Note 12). To do so, execute the following Python code in the command line:


import pandas as pd

# Read file with peptide sequences
df_peptides = pd.read_csv('data/databases/other/HLA_peptides.tsv', sep='\t')
# Read file with UniProt IDs
df_uniprot = pd.read_csv('data/databases/other/HLA_protein_map.tsv', sep='\t')
# Merge data frames using peptide ID
df = pd.merge(df_peptides, df_uniprot, on='peptide_sequence_id')
# Select the columns of interest
df = df[['peptide_sequence', 'uniprot_id']]
# Export to csv file
df.to_csv('data/databases/other/HLA_ligand_atlas.csv')

5. Generate a retraining dataset for the model using both IEDB data and the previously processed peptide data from the HLA Ligand Atlas. Run NetCleave as follows (see Notes 9 and 13), indicating the desired path to store the training data with the argument --data_path:

python3 NetCleave.py --generate --train_input 2 --peptide_data ./data/databases/iedb/mhc_ligand_full_processed.csv --peptide_data_additional ./data/databases/other/HLA_ligand_atlas.csv --data_path ./data/training_data/combined_strategy

6. Train the model using the previously generated data. To do so, execute the following code:

python3 NetCleave.py --train --data_path ./data/training_data/combined_strategy --model_path ./data/models/combined_strategy

Now that you have retrained NetCleave with the IEDB and your target dataset, you can use it to predict C-terminal antigen processing. To use it for prediction, run the following:

python3 NetCleave.py --predict input/Q8V4U9.fasta --pred_input 1 --model_path ./data/models/combined_strategy

2.2.3 Retraining NetCleave Using a Target Dataset (Type 3)

The last type of NetCleave training uses data exclusively from sources other than the IEDB (see Note 14). As an example, here we describe how to train the model using only data from the HLA Ligand Atlas.


1. Download data from your target database, in this case the HLA Ligand Atlas (https://hla-ligand-atlas.org/rel/hla_2020.12.zip), then uncompress it (see Notes 6 and 7).
2. Preprocess data from the HLA Ligand Atlas to have the peptide sequences and UniProt IDs in a single CSV file (see Note 12). To do so, execute the following Python code in the command line:

import pandas as pd

# Read file with peptide sequences
df_peptides = pd.read_csv('data/databases/other/HLA_peptides.tsv', sep='\t')
# Read file with UniProt IDs
df_uniprot = pd.read_csv('data/databases/other/HLA_protein_map.tsv', sep='\t')
# Merge data frames using peptide ID
df = pd.merge(df_peptides, df_uniprot, on='peptide_sequence_id')
# Select the columns of interest
df = df[['peptide_sequence', 'uniprot_id']]
# Export to csv file
df.to_csv('data/databases/other/HLA_ligand_atlas.csv')

3. Generate a training dataset for the model using the previously processed peptide data from the HLA Ligand Atlas. Run NetCleave as follows (see Note 9), indicating the desired path to store the training data with the argument --data_path:

python3 NetCleave.py --generate --train_input 3 --peptide_data ./data/databases/other/HLA_ligand_atlas.csv --data_path ./data/training_data/hla_ligand_atlas

4. Train the model using the previously generated data. To do so, execute the following code:

python3 NetCleave.py --train --data_path ./data/training_data/hla_ligand_atlas --model_path ./data/models/hla_ligand_atlas

From now on, you can use NetCleave trained with the data of your choice. To use it for prediction, run the following:

python3 NetCleave.py --predict input/Q8V4U9.fasta --pred_input 1 --model_path ./data/models/hla_ligand_atlas
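The three retraining types share the same generate, train, and predict sequence and differ only in the --train_input value and the data files supplied. The sketch below assembles those command lines in Python, using only the flags shown in this section; the helper name and its parameters are hypothetical, and nothing is executed unless you pass the resulting lists to a process runner yourself.

```python
def retraining_commands(train_input, peptide_data, data_path, model_path,
                        peptide_data_additional=None,
                        fasta="input/Q8V4U9.fasta"):
    """Build the generate, train, and predict commands for one retraining type."""
    base = ["python3", "NetCleave.py"]
    generate = base + ["--generate", "--train_input", str(train_input),
                       "--peptide_data", peptide_data]
    if train_input == 2:
        # Only type 2 combines two sources (see Note 13).
        if peptide_data_additional is None:
            raise ValueError("type 2 retraining needs peptide_data_additional")
        generate += ["--peptide_data_additional", peptide_data_additional]
    generate += ["--data_path", data_path]
    train = base + ["--train", "--data_path", data_path, "--model_path", model_path]
    predict = base + ["--predict", fasta, "--pred_input", "1",
                      "--model_path", model_path]
    return [generate, train, predict]


type3_commands = retraining_commands(
    train_input=3,
    peptide_data="./data/databases/other/HLA_ligand_atlas.csv",
    data_path="./data/training_data/hla_ligand_atlas",
    model_path="./data/models/hla_ligand_atlas",
)
for command in type3_commands:
    print(" ".join(command))
    # To actually run a step from the NetCleave root folder, one could call
    # subprocess.run(command, check=True) after importing subprocess.
```

Building the commands as argument lists rather than single strings avoids shell quoting problems if any of the paths contains spaces.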

3 Notes

1. When working with a full protein sequence in a FASTA file (--pred_input 1), bear in mind that the epitopes at the extreme C-terminus will not be generated. Their cleavage motif cannot be built because the three amino acids that would follow them in the protein do not exist.
2. Make sure your input files contain only the upper-case single-letter codes of the 20 standard amino acids.
3. When using the FASTA mode (--pred_input 1), provide files with a single protein, as multi-protein files are not currently supported.
4. If you want to know all the MHC options NetCleave supports, use:

python3 NetCleave.py --mhc_options

5. UniProt ID mode (--pred_input 2) requires an Internet connection to scrape and parse the protein sequence from the corresponding UniProt ID. If the target identifier is not found in the UniProt database version provided, switch to the --pred_input 3 option and provide the protein sequence that contains your epitopes.
6. Move data to your target path. This path will later be given as an argument at the data generation step.
7. Uncompress data from UniProt (data/databases/uniprot) and UniParc (data/databases/uniparc).
8. The argument values may differ depending on the conditions applied to select IEDB data. By default, HLA class I mass spectrometry data are extracted; this can be modified by defining the arguments mhc_allele, mhc_class, and technique.
9. Training types are: (1) using a newer version of the IEDB, (2) combining the IEDB with additional data from other sources, and (3) using data sources other than the IEDB.
10. If the default arguments described in Note 8 are not used, please define them again. For example, if you generated HLA class II mass spectrometry training data, run the following code:

python3 NetCleave.py --train --mhc_allele HLA --mhc_class II


11. Move data to your target paths. In this case, IEDB data is in data/databases/iedb and HLA Ligand Atlas data is in data/databases/other. These paths will later be given as arguments at the data generation step.
12. Peptide sequences and UniProt IDs are available in the files HLA_peptides.tsv and HLA_protein_map.tsv, respectively.
13. Arguments may vary depending on the type of retraining and the paths where data is located. The last argument, peptide_data_additional, is only needed for type 2 retraining, as it requires data from two different sources. For types 1 and 3, where you only use one source, omit this argument and use peptide_data to indicate the path.
14. Be aware that large databases (more than 2 GB) can crash the retraining mode owing to TensorFlow memory limitations. If that is the case, please split your dataset, process it separately, and merge the resulting model files (.h5 format).

References

1. Komanduri KV (2018) Divining T-cell targets for cancer immunotherapy. Blood 132:1861–1863. https://doi.org/10.1182/blood-2018-09-873588
2. Ali Awadelkareem E, Osman Mohammed N, Bakor Mohammed Gaafar B, AwadElkariem Ali S (2020) Epitope-based peptide vaccine design against spike protein (S) of novel coronavirus (2019-nCoV): an immunoinformatics approach. https://doi.org/10.21203/rs.3.rs-30076/v1
3. Bhattacharya M, Sharma AR, Patra P, Ghosh P, Sharma G, Patra BC, Lee S-S, Chakraborty C (2020) Development of epitope-based peptide vaccine against novel coronavirus 2019 (SARS-COV-2): immunoinformatics approach. J Med Virol 92:618. https://doi.org/10.1002/jmv.25736
4. Moise L, Buller RM, Schriewer J, Lee J, Frey SE, Weiner DB, Martin W, De Groot AS (2011) VennVax, a DNA-prime, peptide-boost multi-T-cell epitope poxvirus vaccine, induces protective immunity against vaccinia infection by T cell response alone. Vaccine 29:501–511. https://doi.org/10.1016/j.vaccine.2010.10.064
5. Walter S (2012) Multipeptide immune response to cancer vaccine IMA901 after single-dose cyclophosphamide associates with longer patient survival. Nat Med 18:1254. https://doi.org/10.1038/nm.2883
6. Lázaro S, Gamarra D, Del Val M (2015) Proteolytic enzymes involved in MHC class I antigen processing: a guerrilla army that partners with the proteasome. Mol Immunol 68:72–76. https://doi.org/10.1016/j.molimm.2015.04.014
7. Brutkiewicz RR (2016) Cell signaling pathways that regulate antigen presentation. J Immunol 197:2971–2979. https://doi.org/10.4049/jimmunol.1600460
8. Gfeller D, Bassani-Sternberg M (2018) Predicting antigen presentation – what could we learn from a million peptides? Front Immunol 9:1716. https://doi.org/10.3389/fimmu.2018.01716
9. Mei S, Li F, Leier A, Marquez-Lago TT, Giam K, Croft NP, Akutsu T, Smith AI, Li J, Rossjohn J, Purcell AW, Song J (2020) A comprehensive review and performance evaluation of bioinformatics tools for HLA class I peptide-binding prediction. Brief Bioinform 21:1119–1135. https://doi.org/10.1093/bib/bbz051
10. Reynisson B, Alvarez B, Paul S, Peters B, Nielsen M (2020) NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res 48:W449–W454. https://doi.org/10.1093/nar/gkaa379
11. O’Donnell TJ, Rubinsteyn A, Laserson U (2020) MHCflurry 2.0: improved pan-allele prediction of MHC class I-presented peptides by incorporating antigen processing. Cell Syst 11:42–48.e7. https://doi.org/10.1016/j.cels.2020.06.010
12. Bassani-Sternberg M, Chong C, Guillaume P, Solleder M, Pak HS, Gannon PO, Kandalaft LE, Coukos G, Gfeller D (2017) Deciphering HLA-I motifs across HLA peptidomes improves neo-antigen predictions and identifies allostery regulating HLA specificity. PLoS Comput Biol 13:e1005725. https://doi.org/10.1371/journal.pcbi.1005725
13. Harndahl M, Rasmussen M, Roder G, Dalgaard Pedersen I, Sørensen M, Nielsen M, Buus S (2012) Peptide-MHC class I stability is a better predictor than peptide affinity of CTL immunogenicity. Eur J Immunol 42:1405–1416. https://doi.org/10.1002/eji.201141774
14. Gomez-Perosanz M, Ras-Carmona A, Reche PA (2020) PCPS: a web server to predict proteasomal cleavage sites. Methods Mol Biol 2131:399–406. https://doi.org/10.1007/978-1-0716-0389-5_23
15. Jørgensen KW, Rasmussen M, Buus S, Nielsen M (2014) NetMHCstab – predicting stability of peptide-MHC-I complexes; impacts for cytotoxic T lymphocyte epitope discovery. Immunology 141:18–26. https://doi.org/10.1111/imm.12160
16. Rasmussen M, Fenoy E, Harndahl M, Kristensen AB, Nielsen IK, Nielsen M, Buus S (2016) Pan-specific prediction of peptide–MHC class I complex stability, a correlate of T cell immunogenicity. J Immunol 197:1517–1524. https://doi.org/10.4049/jimmunol.1600582
17. Besser H, Louzoun Y (2018) Cross-modality deep learning-based prediction of TAP binding and naturally processed peptide. Immunogenetics 70:419–428. https://doi.org/10.1007/s00251-018-1054-6
18. Murata S, Takahama Y, Kasahara M, Tanaka K (2018) The immunoproteasome and thymoproteasome: functions, evolution and human disease. Nat Immunol 19:923–931. https://doi.org/10.1038/s41590-018-0186-z
19. Neefjes J, Jongsma MLM, Paul P, Bakke O (2011) Towards a systems understanding of MHC class I and MHC class II antigen presentation. Nat Rev Immunol 11:823–836. https://doi.org/10.1038/nri3084
20. Nielsen M, Lundegaard C, Lund O, Keşmir C (2005) The role of the proteasome in generating cytotoxic T-cell epitopes: insights obtained from improved predictions of proteasomal cleavage. Immunogenetics 57:33–41. https://doi.org/10.1007/s00251-005-0781-7
21. Hakenberg J, Nussbaum AK, Schild H, Rammensee H-G, Kuttler C, Holzhütter H-G, Kloetzel P-M, Kaufmann SHE, Mollenkopf H-J (2003) MAPPP: MHC class I antigenic peptide processing prediction. Appl Bioinformatics 2:155–158
22. Nussbaum AK, Kuttler C, Hadeler KP, Rammensee HG, Schild H (2001) PAProC: a prediction algorithm for proteasomal cleavages available on the WWW. Immunogenetics 53:87–94. https://doi.org/10.1007/s002510100300
23. Amengual-Rigo P, Guallar V (2021) NetCleave: an open-source algorithm for predicting C-terminal antigen processing for MHC-I and MHC-II. Sci Rep 11:1–8. https://doi.org/10.1038/s41598-021-92632-y
24. Vita R, Mahajan S, Overton JA, Dhanda SK, Martini S, Cantrell JR, Wheeler DK, Sette A, Peters B (2019) The immune epitope database (IEDB): 2018 update. Nucleic Acids Res 47:D339–D343. https://doi.org/10.1093/nar/gky1006
25. Rizk JG, Lippi G, Henry BM, Forthal DN, Rizk Y (2022) Prevention and treatment of monkeypox. Drugs 82:957–963. https://doi.org/10.1007/s40265-022-01742-y
26. Marcu A, Bichmann L, Kuchenbecker L, Kowalewski DJ, Freudenmann LK, Backert L, Mühlenbruch L, Szolek A, Lübke M, Wagner P, Engler T, Matovina S, Wang J, Hauri-Hohl M, Martin R, Kapolou K, Walz JS, Velz J, Moch H, Regli L, Silginer M, Weller M, Löffler MW, Erhard F, Schlosser A, Kohlbacher O, Stevanović S, Rammensee H-G, Neidert MC (2021) HLA Ligand Atlas: a benign reference of HLA-presented peptides to improve T-cell-based cancer immunotherapy. J Immunother Cancer 9:e002071. https://doi.org/10.1136/jitc-2020-002071

Chapter 16

Prediction of TAP Transport of Peptides with Variable Length Using TAPREG

Hector F. Pelaez-Prestel, Sara Alonso Fernandez, Laura Ballesteros-Sanabria, and Pedro A. Reche

Abstract

CD8 T cells recognize short peptides, most frequently nine residues long, presented by class I major histocompatibility complex (MHC I) molecules on the cell surface of antigen-presenting cells. These epitope peptides are loaded onto MHC I molecules in the endoplasmic reticulum, to which they are shuttled from the cytosol by the transporter associated with antigen processing (TAP), either as such or as N-terminally extended precursors of up to 16 residues. In this chapter, we describe the use of TAPREG, a tool for predicting TAP binding affinity that has been enhanced to identify potential CD8 T cell epitope precursors transported by TAP. TAPREG is available for free public use at http://imed.med.ucm.es/Tools/tapreg/.

Key words TAP, Peptide transport, Epitope prediction, TAPREG

1 Introduction

1.1 Classical Class I Antigen Presentation Pathway

Cytotoxic T lymphocytes (CTLs), a subset of effector CD8 T cells, play a key role in tumor immunosurveillance and in the elimination of intracellular pathogens by directly killing tumor and infected cells [1]. CTLs discriminate between normal and damaged cells using their T cell receptor (TCR) to recognize the peptides (epitopes) presented by major histocompatibility complex class I (MHC I) molecules on the cell surface. MHC I molecules preferably bind nine-residue-long peptides that generally originate from endogenous proteins degraded in the cytosol by the proteolytic activity of the proteasome. Peptide fragments are transported to the lumen of the endoplasmic reticulum (ER) by the transporter associated with

Hector F. Pelaez-Prestel and Sara Alonso Fernandez contributed equally to this chapter. Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_16, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


Hector F. Pelaez-Prestel et al.

antigen processing (TAP), where they can bind to newly assembling MHC I molecules [2]. Before MHC I binding, peptides can also undergo N-terminal trimming by ER-associated aminopeptidases (ERAAP) [3]. Finally, peptide-MHC I complexes are exported to the cell surface for presentation to CD8 T cells. Antigen processing steps limit and shape the repertoire of peptides presented by MHC I molecules in vivo, thus explaining the numerous observations of high-affinity MHC I binding peptides that are unable to elicit CTL responses [4, 5]. While cleavage of proteins by the proteasome is quite unspecific, TAP transport of peptides is very selective [6].

1.2 TAP Transport of Peptides

TAP belongs to the ATP-binding cassette (ABC) transporter superfamily and is expressed as a heterodimer consisting of the TAP1 and TAP2 protein subunits. Both TAP1 and TAP2 contain one hydrophobic transmembrane domain and one ATP-binding domain [7]. Transport of peptides by TAP proceeds in two sequential steps: first the peptide binds to TAP, and then it is translocated with consumption of ATP. The initial binding step governs the rate of peptide transport by TAP [8]. In other words, TAP preselection of peptides available for MHCI presentation is controlled by their affinity for TAP. The selectivity of TAP has been studied using data from assays that determine peptide binding to TAP or peptide accumulation in the ER. TAP preferentially transports peptides with a length of 8–16 residues, whereas longer peptides (>40 residues) may be transported but with much lower efficiency [9, 10]. Besides peptide length, the first three to four N-terminal residues and the C-terminal end of the peptides have also been shown to be important for binding to TAP. Moreover, TAP prefers charged and hydrophobic residues at position 2 (P2), hydrophobic residues at position 3 (P3), and hydrophobic aromatic residues at the C-terminus. In contrast, aromatic or acidic residues at P1 and prolines at P1 and P2 have strong deleterious effects [11].
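The qualitative preferences listed above can be turned into a quick screening aid. The following is only a minimal sketch: the residue groupings (aromatic, acidic, charged, hydrophobic) are a common textbook classification assumed here for illustration, the function name is hypothetical, and the output is a list of flags, not a binding affinity.

```python
# Residue classes: an assumed, commonly used grouping (not taken from the chapter).
AROMATIC = set("FWY")
ACIDIC = set("DE")
CHARGED = set("DEKR")
HYDROPHOBIC = set("AVLIMFWY")


def tap_motif_flags(peptide):
    """Flag favorable and deleterious features described in the text for TAP binding."""
    pep = peptide.strip().upper()
    if not 8 <= len(pep) <= 16:
        raise ValueError("this sketch only covers the preferred 8-16 residue range")
    p1, p2, p3, c_term = pep[0], pep[1], pep[2], pep[-1]

    favorable, deleterious = [], []
    if p2 in CHARGED or p2 in HYDROPHOBIC:
        favorable.append("P2 charged/hydrophobic")
    if p3 in HYDROPHOBIC:
        favorable.append("P3 hydrophobic")
    if c_term in AROMATIC:
        favorable.append("C-terminus aromatic")
    if p1 in AROMATIC or p1 in ACIDIC:
        deleterious.append("P1 aromatic/acidic")
    if p1 == "P":
        deleterious.append("P1 proline")
    if p2 == "P":
        deleterious.append("P2 proline")
    return {"favorable": favorable, "deleterious": deleterious}


# Toy peptides (not real epitopes), used only to exercise the function.
print(tap_motif_flags("ARLAAAAAF"))
print(tap_motif_flags("DPAAAAAAA"))
```

Such flags can help to sanity-check a prediction, but they do not replace the quantitative models described in the next subheading.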

1.3 Predicting Peptide Binding to TAP

Several methods have been applied for predicting and analyzing the binding affinity of peptides to TAP, such as artificial neural networks, support vector machines (SVMs), and matrices generated using the Stabilized Matrix Method and the additive method [12–14]. Most of these methods were trained on the same set of 435 nonamer (9-mer) peptides of known affinity to TAP made available by Dr. van Endert. We also developed a method to predict the binding affinity of peptides to TAP, TAPREG, using SVM regression (SVMr). However, the TAPREG SVMr models were trained on a much larger dataset consisting of 613 unique 9-mer peptides of known binding affinity (DS613 dataset), and TAPREG outperformed competing methods [15]. TAPREG is available for free


public use at http://imed.med.ucm.es/Tools/tapreg/ and has been widely used since its inception, currently counting more than 220,000 hits. The success of TAPREG is owed to its unique ability to predict the binding affinity to TAP of peptides with variable lengths. Moreover, we have now updated TAPREG to identify potential CD8 T cell epitopes by predicting the transport of variable N-terminal precursors from nine to 18 residues. Here, we describe the new feature of TAPREG using as input either a protein sequence or multiple peptide sequences. The protein selected as an example for the prediction is the spike glycoprotein from SARS-CoV-2. SARS-CoV-2 variants have been shown to evade humoral immunity induced by COVID-19 vaccines [16]. Therefore, the identification of new SARS-CoV-2 CD8 T cell epitopes can be key for the development of next-generation COVID-19 vaccines that prevent viral immune escape by inducing cellular immunity rather than humoral immunity.

2 Materials

2.1 Sequence Collection

The protein sequence of the spike glycoprotein from the SARS-CoV-2 canonical strain is retrieved from UniProt (P0DTC2). The FASTA sequence is used as input. Peptides are retrieved from the Immune Epitope Database (IEDB) (https://www.iedb.org/, accessed on 21 July 2022). IEDB is a database where experimentally verified epitopes are deposited. IEDB search parameters were restricted to: SARS-CoV-2 (ID: 2697049, SARS2) as organism; spike glycoprotein as antigen [P0DTC2]; human host; positive T cell response; and restricted by MHC class I. The selected parameters are shown in Fig. 1. Peptide sequences retrieved from IEDB are processed to select only those that have nine residues and match exactly within the reference spike glycoprotein.

2.2 TAPREG Tool

TAPREG has been trained on a data set of 613 unique nonamer (9-mer) peptides of known binding affinity (DS613 dataset) [15]. To predict the binding affinity of nine-residue peptides, TAPREG uses an SVMr model derived considering all the residues, while the binding affinity of longer peptides is predicted using a model trained on the first four N-terminal residues and the last two C-terminal residues. The input data for TAPREG can consist of either a protein sequence or multiple peptide sequences. For the protein sequence, TAPREG returns all 9-mer peptides encompassed by the protein, ranked by their affinity to TAP (IC50). The number of peptides listed in the output can also be limited using a user-defined threshold of binding affinity. In addition, if the option 9mers + extensions is selected, TAPREG will also examine, for each 9-mer, the binding affinity of the corresponding N-terminal extended peptides from


Fig. 1 Parameters used in the Immune Epitope Database (IEDB) to retrieve experimentally verified CD8 T cell epitopes from the spike glycoprotein of SARS-CoV-2. IEDB search parameters are restricted to: SARS-CoV-2 (ID: 2697049, SARS2) as organism; spike glycoprotein as antigen [P0DTC2]; human host; positive T cell response; and restricted by MHC class I. CD8 T cell epitope sequences retrieved from IEDB were parsed to select those with nine residues and matching entirely with the spike glycoprotein of SARS-CoV-2

10 to 18 residues, returning the extended peptides with the highest binding affinity. For the peptide input, the server returns the affinity of each individual peptide. As TAP can bind and transport peptides of arbitrary length, TAPREG will predict the affinity of any peptide ranging from eight to 18 residues. There are two models available at the TAPREG site, both trained on the DS613 dataset using the entire peptide sequences; one was generated from a sparse representation of peptide sequences and the other from a BLOSUM representation.
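The input handling described above can be sketched as follows. This is an illustration only, not the TAPREG implementation: the function names are hypothetical, the "sparse" representation is assumed to be a plain one-hot (20 values per residue) encoding, and the sketch covers peptides of 9 to 18 residues because the text does not state how 8-mers are represented.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def nine_mers(protein):
    """Yield (1-based start position, 9-mer) for every 9-residue window of the protein."""
    for i in range(len(protein) - 9 + 1):
        yield i + 1, protein[i:i + 9]


def model_input(peptide):
    """Residues seen by the model: the whole 9-mer, or for longer peptides
    the first four N-terminal plus the last two C-terminal residues."""
    if len(peptide) == 9:
        return peptide
    if 10 <= len(peptide) <= 18:
        return peptide[:4] + peptide[-2:]
    raise ValueError("this sketch only covers 9- to 18-residue peptides")


def sparse_encode(residues):
    """One-hot encode residues (assumed meaning of 'sparse'); unknown letters give all zeros."""
    vector = []
    for aa in residues:
        vector.extend(1 if aa == ref else 0 for ref in AMINO_ACIDS)
    return vector


# Toy sequence (not a real protein), used only to exercise the functions.
toy_protein = "ACDEFGHIKLMNPQ"
print(list(nine_mers(toy_protein))[:2])
print(model_input("ACDEFGHIKLM"), len(sparse_encode(model_input("ACDEFGHIKLM"))))
```

A protein of n residues yields n − 9 + 1 windows, which is why the number of 9-mers reported for a protein is always slightly smaller than its length.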

3 Methods

In this section, we show how to use TAPREG to predict the binding affinity using as input a protein sequence or multiple peptide sequences. The procedure can be easily adapted to any desired sequence.

3.1 Predicting TAP Binding Affinity Using Peptides as Input

In this section, we show the procedure to obtain the binding affinity to TAP of a set of peptide sequences. In particular, we use a selection of 184 experimentally validated CD8 T cell epitopes from the spike glycoprotein of SARS-CoV-2. The relevant epitope sequences are retrieved from the IEDB as indicated in Subheading 2.1. Unique peptide sequences with nine residues are formatted as FASTA and are available at http://imed.med.ucm.es/Tools/tapreg/sars_cov2_cd8_tepi.txt for copying and/or downloading (see Note 1). To obtain the TAP affinity of these CD8 T cell epitopes, proceed as follows:

1. Go to http://imed.med.ucm.es/Tools/tapreg/ in your Internet browser.
2. Select the model. In this example, DS_613 Sparse was used (default option, see Note 2).
3. Click on the Peptides option (see Note 3).
4. Copy the peptide sequences available at the link above and paste them in the INPUT section of TAPREG. Alternatively, upload the peptide sequence file by clicking first on Examinar..., then selecting the sequence file, and finally clicking on Upload. As noted, input peptide sequences must be in FASTA format (see Note 1) and range between eight and 18 residues (see Note 3). All the selected parameters are shown in Fig. 2a.
5. Click on Run Analysis.
6. A new tab will open showing the results of the analysis. As shown in Fig. 2b, the output is a table with all the peptides ranked according to the IC50. These results can be copied and pasted into text or spreadsheet documents. Note that only 51% of the peptides analyzed here have a predicted IC50 lower than 5000 nM (see Note 4). Hence, it would appear that about half of the SARS-CoV-2 spike-specific CD8 T cell epitopes are unlikely to be transported by TAP. However, they could be transported as N-terminal extended epitope precursors.
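Preparing the input and summarizing the output of the steps above can be scripted. The sketch below is an assumption-laden helper, not part of TAPREG: the function names are hypothetical, the header pattern follows the naming mentioned in Note 1, and the IC50 values must be copied by the user from the results table.

```python
def peptides_to_fasta(peptides, prefix="SARS-CoV-2_Spike_epitope_"):
    """Format unique peptides (8-18 residues) as FASTA with one numbered header each."""
    unique = []
    for pep in peptides:
        pep = pep.strip().upper()
        if not 8 <= len(pep) <= 18:
            raise ValueError(f"{pep!r} is outside the accepted 8-18 residue range")
        if pep not in unique:
            unique.append(pep)
    return "\n".join(f">{prefix}{i}\n{pep}" for i, pep in enumerate(unique, start=1))


def fraction_below(ic50_values, threshold_nm=5000.0):
    """Fraction of peptides whose predicted IC50 (nM) is lower than the threshold."""
    values = list(ic50_values)
    if not values:
        raise ValueError("no IC50 values given")
    return sum(v < threshold_nm for v in values) / len(values)


# Toy peptides and IC50 values (invented for illustration, not real results).
print(peptides_to_fasta(["AAAAAAAAA", "aaaaaaaaa", "CCCCCCCCC"]))
print(fraction_below([100.0, 6000.0, 4999.0, 5000.0]))
```

The threshold comparison is strict, matching the wording "lower than 5000 nM"; a different convention would only require changing the comparison operator.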


Fig. 2 TAPREG results using as input SARS-CoV-2 CD8 T cell epitope sequences. (a) TAPREG interface showing the parameters used: the DS_Sparse Model, Peptide and the peptide sequences in FASTA format corresponding to the selected spike-specific CD8 T cell epitopes. (b) TAPREG returns the peptides ranked according to the IC50 value. Epitope sequences used in the example are available at: http://imed.med.ucm.es/Tools/tapreg/sars_cov2_cd8_tepi.txt

3.2 Predicting TAP Binding Affinity Using as Input a Protein Sequence

Here, we show how to predict the binding affinity of all the 9-mers that compose a protein using the N-terminal extension option, which enables the prediction of potential CD8 T cell epitopes. As input, we chose the SARS-CoV-2 spike glycoprotein, which includes every CD8 T cell epitope used in the previous task. The sequence is available for downloading and/or copying at http://imed.med.ucm.es/Tools/tapreg/sars2spike.fa. Follow these instructions to get all potential CD8 T cell epitopes transported by TAP:

1. Go to http://imed.med.ucm.es/Tools/tapreg/ in your Internet browser.
2. In the THRESHOLD window, do not select any IC50 threshold (default option, see Note 5).
3. Select the model. In this example, DS_613 Sparse was used (default option, see Note 2).
4. Click on the option Protein: 9mers + extension, as we will use a protein sequence as input and evaluate N-terminal extensions of 9-mer peptides (see Note 6).
5. In the INPUT window, paste the sequence available at the link provided above or upload it by clicking first on Examinar..., then selecting the sequence file, and finally clicking on Upload (see Note 7). All the selected parameters are shown in Fig. 3a.


Fig. 3 TAPREG results using as input the sequence of SARS-CoV-2 spike protein. (a) TAPREG interface showing the parameters used: No Threshold (----), the DS_Sparse Model, Protein: 9mers + extensions and SARS-CoV-2 spike protein sequence. This sequence is available at: http://imed.med.ucm.es/Tools/tapreg/sars2spike.fa. (b) TAPREG results list all 9-mer peptides (optimal size for binding to MHCI molecules), paired with any potential N-terminal peptide precursor (10–18 residues) with better binding to TAP than the 9-mer peptide, the position of the first residue in the protein and the IC50 value. The list of peptides continues through all the 1266 peptides of this protein

6. Click on Run Analysis.
7. A new tab will open showing the results of the analysis (see Note 8). As shown in Fig. 3b, the output is a table with all the peptides ranked according to the IC50. TAPREG returns all peptide sequences of nine residues, optimal for binding to MHCI, each paired with the peptide most likely to be transported by TAP when N-terminal extended precursors are considered (9–18 residues), the position of the first residue within the protein, and the IC50 value of the peptide transported by TAP. This list can be copied and pasted into a text or spreadsheet document. Since the 184 spike-specific CD8 T cell epitopes analyzed previously are included in the results obtained for the spike protein, we can readily see that 91% of these epitopes are paired with N-terminal extended precursors with an IC50 value lower than 5000 nM, compared with 51% when only the 9-mers are considered (see Subheading 3.1). Hence, we can conclude that (A) between 40% and 50% of spike-specific CD8 T cell epitopes derive from N-terminal extended precursors transported by TAP, which must undergo some trimming at the N-terminus, and (B) TAPREG is well suited to identify potential CD8 T cell epitopes in a pan-MHCI manner when considering N-terminal peptide precursors.
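The pairing of each 9-mer with its best-transported form, as reported in the output above, can be sketched as follows. The predictor is passed in as a function because TAPREG itself is a web server; `toy_ic50` below is a meaningless placeholder, and all names are hypothetical.

```python
def best_transport_candidate(protein, start, predict_ic50, max_len=18):
    """Pair the 9-mer starting at 1-based `start` with the candidate
    (the 9-mer itself or an N-terminal extension up to `max_len` residues)
    that has the lowest predicted IC50. All candidates share the C-terminus."""
    end = start - 1 + 9  # exclusive 0-based end of the 9-mer
    if start < 1 or end > len(protein):
        raise ValueError("the 9-mer does not fit inside the protein")
    candidates = [
        protein[end - n:end]
        for n in range(9, max_len + 1)
        if end - n >= 0  # extensions cannot run past the protein N-terminus
    ]
    best = min(candidates, key=predict_ic50)
    return protein[start - 1:end], best, predict_ic50(best)


def toy_ic50(peptide):
    """Placeholder predictor with no biological meaning (longer = lower IC50)."""
    return 10000.0 / len(peptide)


# Toy sequence (not a real protein), used only to exercise the function.
print(best_transport_candidate("ACDEFGHIKLMNPQ", 3, toy_ic50))
```

Because the lowest IC50 over the 9-mer and its extensions can never be higher than that of the 9-mer alone, the share of epitopes passing a given threshold can only increase when extensions are considered.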


4 Notes

1. The input introduced in TAPREG must always be in FASTA format, either as a protein sequence or as multiple peptide sequences. Therefore, if a set of peptide sequences is introduced, each peptide must be headed by “>” followed by a unique name for each peptide. In this example, they were named “>SARS-CoV-2_Spike_epitope_X”, where X is a number.
2. There are two models available at the TAPREG site, both trained on the DS613 dataset: DS_613 Sparse and DS_613 Blosum. The model trained on BLOSUM-encoded sequences displayed a somewhat lower predictive performance than its sparse counterpart; nonetheless, it is included in the TAPREG server because the BLOSUM representation of sequences can often increase the generalization power of predictive models.
3. Although all the peptide sequences in this example have nine residues, the length of the peptides can range from eight to 18 residues. The inclusion of longer or shorter peptides will preclude the execution of TAPREG and an error message will be reported.
4. IC50 binding affinity values correspond to the concentration of the peptide required to displace 50% of a high-affinity reference peptide bound to TAP. Hence, the lower the IC50, the higher the binding affinity. An IC50 of 5000 nM is a generous threshold to consider binding to TAP.
5. The user can define a threshold when introducing a protein sequence as input. Note that the lower the threshold, the more restrictive the analysis will be. If no threshold is selected, TAPREG will return all peptides ranked by IC50 values.
6. The option Protein: 9mers + extension will report, ranked by IC50 values, all 9-mer peptides in the sequence paired with the N-terminal extended peptide precursor from 10 to 18 residues with the lowest IC50 value. If users select the option Protein: 9mers, TAPREG only returns all the 9-mer peptides ranked by TAP binding affinity, without considering any extensions. Therefore, the calculated IC50 for each peptide will be different.
7. Protein sequences for TAPREG must also be in FASTA format. Moreover, the server accepts only one sequence. Inclusion of more than one sequence in the protein options will preclude the execution of TAPREG and an error message will be reported.
8. TAPREG analysis of protein sequences with the extension option can last up to 10 min, and we recommend bookmarking the result page.


Acknowledgments

This work was funded by a REACT-European Union grant from the Comunidad de Madrid to the ANTICIPA project of Complutense University of Madrid, through the European Union Regional Development Fund (FEDER) in response to the COVID-19 pandemic. H.P.-P. is supported by an FPU 2019 grant.

References

1. Zhang N, Bevan MJ (2011) CD8(+) T cells: foot soldiers of the immune system. Immunity 35(2):161–168
2. Wieczorek M, Abualrous ET, Sticht J, Alvaro-Benito M, Stolzenberg S, Noe F et al (2017) Major histocompatibility complex (MHC) class I and MHC class II proteins: conformational plasticity in antigen presentation. Front Immunol 8:292
3. Guan J, Peske JD, Taylor JA, Shastri N (2021) The nonclassical immune surveillance for ERAAP function. Curr Opin Immunol 70:105–111
4. Zhong W, Reche PA, Lai CC, Reinhold B, Reinherz EL (2003) Genome-wide characterization of a viral cytotoxic T lymphocyte epitope repertoire. J Biol Chem 278(46):45135–45144
5. Wang M, Lamberth K, Harndahl M, Roder G, Stryhn A, Larsen MV et al (2007) CTL epitopes for influenza A including the H5N1 bird flu; genome-, pathogen-, and HLA-wide screening. Vaccine 25(15):2823–2831
6. Uebel S, Kraas W, Kienle S, Wiesmuller KH, Jung G, Tampe R (1997) Recognition principle of the TAP transporter disclosed by combinatorial peptide libraries. Proc Natl Acad Sci U S A 94(17):8976–8981
7. Abele R, Tampe R (2004) The ABCs of immunology: structure and function of TAP, the transporter associated with antigen processing. Physiology (Bethesda) 19:216–224
8. Lehnert E, Tampe R (2017) Structure and dynamics of antigenic peptides in complex with TAP. Front Immunol 8:10
9. Ritz U, Seliger B (2001) The transporter associated with antigen processing (TAP): structural integrity, expression, function, and its clinical relevance. Mol Med 7(3):149–158
10. Momburg F, Roelse J, Howard JC, Butcher GW, Hammerling GJ, Neefjes JJ (1994) Selectivity of MHC-encoded peptide transporters from human, mouse and rat. Nature 367(6464):648–651
11. van Endert PM, Riganelli D, Greco G, Fleischhauer K, Sidney J, Sette A et al (1995) The peptide-binding motif for the human transporter associated with antigen processing. J Exp Med 182(6):1883–1895
12. Doytchinova IA, Blythe MJ, Flower DR (2002) Additive method for the prediction of protein-peptide binding affinity. Application to the MHC class I molecule HLA-A*0201. J Proteome Res 1(3):263–272
13. Brusic V, van Endert P, Zeleznikow J, Daniel S, Hammer J, Petrovsky N (1999) A neural network model approach to the study of human TAP transporter. In Silico Biol 1(2):109–121
14. Bhasin M, Raghava GP (2004) Analysis and prediction of affinity of TAP binding peptides using cascade SVM. Protein Sci 13(3):596–607
15. Diez-Rivero CM, Chenlo B, Zuluaga P, Reche PA (2010) Quantitative modeling of peptide binding to TAP using support vector machine. Proteins 78(1):63–72
16. Ballesteros-Sanabria L, Pelaez-Prestel HF, Ras-Carmona A, Reche PA (2022) Resilience of spike-specific immunity induced by COVID-19 vaccines against SARS-CoV-2 variants. Biomedicines 10(5):996

Chapter 17

Docking-Based Prediction of Peptide Binding to MHC Proteins

Mariyana Atanasova and Irini Doytchinova

Abstract

Major histocompatibility complex (MHC) proteins are the most polymorphic and polygenic proteins in humans. They bind peptides, derived from cleavage of different pathogenic antigens, and are responsible for presenting them to T cells. The peptides recognized by the T cell receptors are denoted as epitopes and they trigger an immune response. In this chapter, we describe a docking protocol for predicting peptide binding to a given MHC protein using the software tool GOLD. The protocol starts with the construction of a combinatorial peptide library used in the docking and ends with the derivation of a quantitative matrix (QM) accounting for the contribution of each amino acid at each peptide position.

Key words Molecular docking, Peptide-MHC binding prediction, Flexible and rigid peptide docking, GOLD, Combinatorial library, Quantitative matrix

1 Introduction

Major histocompatibility complex (MHC) proteins play an essential role in triggering an immune response. Their function is to bind peptides derived from the cleavage of different endogenous and exogenous pathogen antigens and to present them on the cell surface, where the peptides can be recognized by T cell receptors (TCRs). Only a small number of the peptides bound to MHC are recognized by TCRs. They initiate an immune response and are referred to as epitopes. MHC polymorphism developed under the evolutionary pressure of survival of the fittest. Apart from being highly polymorphic, MHC proteins are polygenic, i.e., they are encoded by more than one gene. There are six human MHC (Human Leucocyte Antigen, HLA) class I loci (HLA-A, HLA-B, HLA-C, HLA-E, HLA-F, and HLA-G) and five class II loci: HLA-DR, HLA-DQ, HLA-DP,

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_17, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


HLA-DM, and HLA-DO. In total, more than 34,400 sequences of both HLA classes were recorded in the IMGT/HLA database as of July 2022 (https://www.ebi.ac.uk/ipd/imgt/hla/about/statistics/). MHC class I proteins bind peptide fragments of endogenous origin, such as viral or self-proteins, while class II molecules bind oligopeptides of exogenous origin, such as bacterial or parasite antigens. Many crystallographic structures of both MHC classes in apo and holo forms have been released in the RCSB Protein Data Bank (https://www.rcsb.org). The holo structures of MHC proteins in complexes with different peptides reveal the intimate mode of binding between the partners and are of great importance for molecular docking. Molecular docking is a structure-based drug design (SBDD) method that provides deep atomistic knowledge about the binding interactions between a small ligand molecule and its target macromolecule (usually a protein, but also DNA or RNA) by predicting the probable binding pose (the geometry of the complex, i.e., the orientation and conformation of both partners) and its binding affinity. Molecular docking is applied for hit identification (via virtual screening) [1–4], lead optimization (via rational drug design) [3, 5–7], structure-activity studies [8, 9], mutagenesis studies (via providing a binding hypothesis) [10–12], studies on mechanism of action [13, 14], combinatorial library design [15–18], drug repurposing [19–21], and crystallographic studies (via fitting ligands to electron density) (ICM X-Ray AutoFit: Automated Model Building into Density by Molsoft, LLC https://www.molsoft.com/gui/autofit.html; CheckMyBlob https://checkmyblob.bioreproducibility.org) [22]. Nowadays, the application of molecular docking is not limited to small ligand molecules but extends to protein-protein and protein-nucleotide docking [23–26].
There are numerous comprehensive reviews and discussions on molecular docking theory, methods, and tools for predicting the best poses and estimating the binding energies via scoring functions, as well as on their limitations [27–40].

In this chapter, we provide an extensive docking protocol for predicting the binding of a peptide to a given MHC protein. Because the ligand, a peptide, is unusual, several types of docking are defined according to the flexibility/rigidity of the peptide side chain(s) and/or the side chains of the binding pocket(s). The protocol for single flexible pocket docking is described in detail. A combinatorial peptide library is constructed to assess the contributions of each of the twenty naturally occurring amino acids at each peptide position. The scores are normalized and summarized into a quantitative matrix (QM).
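How a QM can be assembled from docking scores and then used additively is sketched below. The normalization is not specified at this point in the text, so per-position mean-centering is used purely as an assumption; the function names and the toy scores are hypothetical.

```python
def build_qm(scores):
    """Build a quantitative matrix from docking scores.

    `scores` maps a 1-based peptide position to {amino acid: docking score}.
    Normalization here is mean-centering per position (an assumption of this sketch).
    """
    qm = {}
    for position, by_aa in scores.items():
        mean = sum(by_aa.values()) / len(by_aa)
        qm[position] = {aa: score - mean for aa, score in by_aa.items()}
    return qm


def score_peptide(qm, core):
    """Additive score of a binding core: sum of the QM entries of its residues.
    Positions or residues absent from the QM contribute zero."""
    return sum(
        qm[position].get(aa, 0.0)
        for position, aa in enumerate(core, start=1)
        if position in qm
    )


# Toy scores for two positions and two residues each (invented numbers).
toy_scores = {1: {"A": 10.0, "Y": 30.0}, 4: {"A": 5.0, "Q": 15.0}}
toy_qm = build_qm(toy_scores)
print(toy_qm)
print(score_peptide(toy_qm, "YAAQAAAAA"))
```

With a full library, each studied position would carry 20 entries, one per naturally occurring amino acid.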

2 Materials

2.1 Structures

An appropriate pdb file of the MHC class and allele of interest in complex with a peptide should be found in the RCSB PDB (https://www.rcsb.org/). Pay attention to the resolution of the structure: a lower resolution value (in Å) corresponds to a better-defined structural model. If there is no record for the protein of interest, homology modeling could be applied. Here, we use the X-ray structure of the complex between the staphylococcal enterotoxin I (the peptide) and HLA-DRB1 (pdb code: 2G9H) [41].

2.2 Software

Input file preparation and peptide mutations are done with YASARA (http://www.yasara.org/index.html). Molecular docking calculations are performed with GOLD v. 4.1.2 [42].

3 Methods

3.1 Docking Options

According to the flexibility/rigidity of the binding molecules, there are several options for docking. Here, the docking protocol is adapted to the unusual ligand, the peptide molecule, which binds to the protein via several amino acids (aas) at different positions, and to the unconventional binding site (BS), which consists of several pockets. The options for docking are as follows:

1. Rigid docking, indicating that both protein and peptide are rigid.
2. Rigid protein and single flexible peptide position. The peptide is rigid except for one flexible position of interest. Flexible position means flexible side chain but rigid backbone.
3. Rigid protein and flexible peptide (flexible side chains but rigid backbone).
4. Single flexible pocket docking. Here, the side chains of the residues forming one pocket of the protein BS and the corresponding peptide residue binding in the pocket are flexible. The rest of the protein and peptide are rigid.
5. Flexible side chains docking. In this case, the side chains of the residues from the protein BS and the peptide are flexible.
6. Fully flexible docking. The backbone and side chains of the protein BS and the peptide are flexible. This type of docking is the closest to real conditions, but our experience shows that it fails to predict the X-ray pose of the peptide-protein complex. Moreover, it is the most time-consuming option and has an unfavorable time/efficiency trade-off.


The selection of the type of docking depends on several factors, such as the aims of the study, the docking software used, the available computational resources, etc. Here, we describe the docking protocol for single flexible pocket docking (option 4 in the above classification). This protocol was used previously for the prediction of peptide binding to MHC proteins, and the derived QMs were implemented in our server EpiDock (http://www.ddg-pharmfac.net/epidock/) [17, 18].

3.2 Pre-docking Data Preparation

The structure is prepared for docking as follows:

1. Download the pdb file from RCSB PDB and save it within the working folder.
2. Remove the water and other irrelevant molecules. Open the downloaded 2G9H.pdb file in YASARA. Click on 2g9h in the right panel to see all molecules within the file. Mol A, Mol B, and Mol C correspond to chain α and chain β of HLA-DR1 and to the peptide, respectively. Delete the other molecules by right-clicking on each molecule and selecting Delete.
3. Add hydrogen atoms, as the X-ray structure contains only heavy atoms. This is done from the Edit menu > Add > hydrogens to: All.
4. Save the file from File > Save as > PDB file > select the sequence and name, then press OK and write the new filename 2g9h_ABC.pdb > OK.
5. Delete Mol A and Mol B and save Mol C as orig_pept.pdb and orig_pept.mol2. The peptide will serve as the parent peptide for the construction of the combinatorial library.
6. Construct the combinatorial library. The combinatorial peptide library is constructed by single amino acid substitution of a given peptide residue by the remaining 19 aas. Here, we consider only the peptide anchor positions p1, p4, p6, p7, and p9. The library consists of 96 peptides (1 wild type + 5 positions × 19 mutated aas).

In YASARA, the steps are as follows:

1. Open orig_pept.pdb. Then go to Edit > Swap > Residue and select Tyr308, the amino acid at p1 (Pro306 and Lys307 are flanking residues). Press OK.
2. From the list of amino acids, select Ala, then press OK.
3. File > Save as > PDB file > select the sequence orig_pept.pdb and the name orig_pept in the window that appears. Then press OK, rename the file to P1_Ala.pdb, and press OK.

Mutate the peptide in the same way with the remaining 18 amino acids for p1 and then repeat the procedure for the other anchor positions. At the end, you will have a combinatorial peptide library containing 96 peptides (1 wild type + 95 mutated peptides).
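The composition of the library can be enumerated in a few lines, which is useful for checking file names and counts before mutating structures by hand. This is a sketch under stated assumptions: the nine-residue core below is a placeholder (the text only states that Tyr308 occupies p1) and must be replaced by the sequence read from orig_pept.pdb; the function name is hypothetical, and the labels mimic the P1_Ala file naming used above.

```python
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
ANCHOR_POSITIONS = (1, 4, 6, 7, 9)


def build_library(core, anchors=ANCHOR_POSITIONS):
    """Return {label: sequence} for the wild type plus every single substitution
    of each anchor position by the 19 other amino acids."""
    library = {"wild": core}
    for position in anchors:
        wild_aa = core[position - 1]
        for one, three in THREE_LETTER.items():
            if one == wild_aa:
                continue  # the wild-type residue is already covered
            mutant = core[:position - 1] + one + core[position:]
            library[f"P{position}_{three}"] = mutant
    return library


# Placeholder core sequence (assumption): replace with the core from orig_pept.pdb.
assumed_core = "YVKQNTLKL"
library = build_library(assumed_core)
print(len(library))
print(library["P1_Ala"])
```

The count follows directly from the design: 1 wild type + 5 positions × 19 substitutions = 96 entries.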


Fig. 1 Visualization of the uploaded protein molecule in complex with a peptide from the PDB 2G9H using the wizard menu in GOLD

3.3 Molecular Docking

To create a GOLD configuration file and run the docking calculations, follow these steps:

1. Open the wizard window: GOLD > Wizard > Load protein > 2g9h_ABC.pdb, as shown in Fig. 1.
2. Go to the table ID and press Add Hydrogens > OK in the pop-up window indicating the number of added hydrogen atoms.
3. From the left menu of the Wizard window, go to Delete Ligands/Cofactors and check C:PRO306. Then press Extract and save the peptide ligand molecule in mol2 format, e.g., orig_pept.mol2 > Next. If hydrogens were added and ligands were deleted, a reminder is shown. Press Next.
4. Define the binding site: select One or more ligands or cofactors, and the peptide ligand molecule is highlighted. Keep all remaining options unchanged, then press Next twice, skipping the available templates.
5. Select the ligand(s) for docking: Add > select orig_pept.mol2. To validate the docking protocol, only the original ligand/peptide should be re-docked and the RMSD (root-mean-square deviation) value in Å should be monitored. The RMSD of the predicted pose of the re-docked molecule should be below 1.5 Å. Here, we upload, together with the original peptide file, all 19 peptides mutated at p1. The uploaded ligand files for docking should be in mol2 format.
6. Set the number of GA runs. We keep the default number of GA Runs, 10. This number can be changed (see Note 1).
7. Use a reference molecule. Add a Reference ligand for comparison. Select orig_pept.mol2 and press Next. The RMSD value is meaningful only for the original peptide.
8. Choose a scoring function. We choose the scoring function ChemScore for the calculations. Other scoring functions implemented in GOLD can also be used (see Note 2).
9. Set additional options. Press the More button and uncheck Allow early termination; with this option unchecked, the search continues after the best solutions have been found. To allow searching of different poses, mark Generate diverse solutions from the Diverse Solution Options button (see Note 3). Then press Next. Keep the GA search options at Slow (most accurate) and press Next.
10. Adjust the Advanced settings. Before pressing the Run GOLD button, go to the Advanced button.

Fixing ligand rotatable bonds. From the left menu of the GOLD setup window, go to Ligand Flexibility. Then check Fix Ligand Rotatable Bonds > fix specific. There are two ways to select the fixed bonds: either select the bonds on the screen by right-clicking on a bond and choosing “Fix Bond” from the menu that appears, or manually enter the atom file indexes of the bonds in the table in the window (see Note 4). When a single flexible peptide position is used, all bonds of the peptide should be fixed except those of the side chain of the amino acid at the studied position. This is Tyr308 at p1 in the original ligand. The atom file indexes indicating the fixed bonds are listed in Table 1. If the user applies docking with flexible peptide side chains and a fixed backbone (types 3 and 5 in the above classification), the atom file indexes forming the backbone bonds of the original peptide are listed in Table 2.

Set the flexible amino acids in the protein-binding site. Press table ID > flexible > Sidechains > double click on a given amino acid to be set flexible > then press Free > Crystal > Library > Accept. This is done for all flexible residues (up to 10 in GOLD).
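The re-docking validation in step 5 relies on the RMSD between the predicted and the crystallographic pose. GOLD reports this value itself when a reference ligand is supplied; the sketch below only illustrates the calculation, assuming that both poses are in the same receptor frame (no superposition is applied) and that atoms are listed in the same order. The function names and coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation (same units as the input, e.g., Å) between
    two equally ordered lists of (x, y, z) coordinates, without superposition."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def redocking_ok(rmsd_value, cutoff=1.5):
    """True when the re-docked pose is below the 1.5 Å criterion used in the text."""
    return rmsd_value < cutoff


# Toy coordinates (invented): the second pose is shifted by 1 Å along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
docked = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
value = rmsd(reference, docked)
print(value, redocking_ok(value))
```

As stated in step 7, this comparison is meaningful only for the original peptide, since the mutated peptides have no crystallographic reference pose.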


Table 1 List of the atom file indexes of the fixed bonds of the original peptide when a single peptide position (p1) is flexible

Number  Atom1  Atom2    Number  Atom1  Atom2    Number  Atom1  Atom2
1       2      3        27      82     83       53      145    162
2       3      17       28      83     84       54      162    163
3       17     18       29      78     98       55      163    164
4       18     19       30      98     99       56      163    166
5       18     21       31      99     100      57      166    167
6       21     22       32      99     102      58      167    168
7       22     23       33      102    103      59      168    169
8       23     24       34      103    104      60      169    170
9       24     25       35      100    115      61      164    184
10      19     39       36      115    116      62      184    185
11      39     40       37      116    117      63      185    186
12      40     41       38      116    119      64      185    188
13      40     43       39      119    120      65      188    189
14      43     44       40      117    129      66      189    190
15      41     60       41      129    130      67      189    191
16      60     61       42      130    131      68      186    203
17      61     62       43      130    133      69      203    204
18      61     64       44      133    134      70      204    205
19      64     65       45      133    135      71      204    207
20      64     66       46      131    143      72      205    213
21      62     76       47      143    144      73      213    214
22      76     77       48      144    145      74      214    215
23      77     78       49      144    147      75      214    217
24      77     80       50      147    148      76      217    218
25      80     81       51      148    149      77      217    219
26      81     82       52      148    150

To detect the closest amino acids to be set as flexible for the Single flexible pocket docking type, a special file containing only the peptide residue at the studied position should be created. To do this, open the 2g9h_ABC.pdb file with YASARA and delete the amino acids at all peptide positions except

244

Mariyana Atanasova and Irini Doytchinova

Table 2 The atom file indexes for the peptide backbone bonds of the original peptide molecule are fixed during docking calculations Number

Atom1

Atom2

Number

Atom1

Atom2

Number

Atom1

Atom2

1

2

3

14

78

98

27

162

163

2

3

17

15

98

99

28

163

164

3

17

18

16

99

100

29

164

184

4

18

19

17

100

115

30

184

185

5

19

39

18

115

116

31

185

186

6

39

40

19

116

117

32

186

203

7

40

41

20

117

129

33

203

204

8

41

60

21

129

130

34

204

205

9

60

61

22

130

131

35

205

213

10

61

62

23

131

143

36

213

214

11

62

76

24

143

144

37

214

215

12

76

77

25

144

145

13

77

78

26

145

162

those at p1, i.e., only Tyr308 should remain. Save the file as 2g9h_ABC_p1.pdb. Then start another Hermes window and open the wizard: 1. GOLD > Wizard > load protein > 2g9h_ABC_p1.pdb. 2. Go to the table ID and press Add Hydrogens > OK of the pop-up window indicating the number of added hydrogen atoms. 3. From the left menu of the Wizard window go to Delete Ligands/Cofactors and check C:PRO306. Then press Extract and save the amino acid ligand molecule in mol2 format, like p1_Tyr.mol2 > Next. A reminding text about the added hydrogens and deleted ligands appears. Press Next. 4. Define the binding site. Select One or more ligands or cofactors. 5. Change Select all atoms within to 5.0 Å. Then check Generate a cavity atoms file from the selection and press Refine selection. From the pop-up window, the residues can be highlighted and the closest (up to 10 in GOLD) are set flexible. These are: Ile7A, Phe32A, Asn82B, Val85B, Phe89B, Thr90B, Ile31A, Phe24A and Trp43A. The selected residues are set flexible. Then close all files.


Fig. 2 Visualized docking solutions of the original ligand. The predicted pose of the peptide is colored in green while the pose from the crystallographic structure is in magenta. The reference RMSD value is 0.2963 Å

10. Set soft potentials. Go to table ID > Soft potentials > From the screen select the same residues as in the previous step (Ile7A, Phe32A, Asn82B, Val85B, Phe89B, Thr90B, Ile31A, Phe24A, and Trp43A). They will appear in the alternative potential 1 box, indicating that the soft potential will be applied to them.
11. Press Run Gold and save the configuration file: name.conf
12. When the docking calculations finish, press View solutions and close all the pop-up windows. The docking solutions with the corresponding fitness scores, RMSD values, and all terms forming the final Fitness scores are presented on the left of the Hermes window (Fig. 2).

3.4 Construction of the Docking-Based Quantitative Matrix (QM)

To account for the contribution of each of the twenty naturally occurring amino acids at each peptide position, a quantitative matrix (QM) is constructed. The QM for a given MHC allele is built by normalizing the best fitness score for any peptide in the combinatorial library:

$$\mathrm{NV} = \frac{\mathrm{DS} - \mathrm{AV}}{\max - \min}$$


where NV is the normalized value, DS is the docking score, AV is the average value, and max and min are the maximum and minimum values, respectively. Thus, a positive NV indicates a favorable amino acid, while a negative NV indicates an unfavorable one. In our case, the QM consists of 5 columns (5 peptide anchor positions: p1, p4, p6, p7, and p9) and 20 rows (20 amino acids). Docking-based QMs can be used for structure-affinity studies, quantitative predictions of peptide binding to MHC proteins, design of novel peptides, and virtual screening for MHC binders and T cell epitopes. The derived DB-QM is shown in Table 3.
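As an illustration of the normalization above, the following minimal Python sketch converts the best docking scores of the amino acids at one peptide position into NV values. It assumes that AV, max, and min are all computed over the scores at that single position, which the text does not state explicitly. The function name and the example scores are hypothetical.

```python
def normalize_scores(scores):
    """Return NV = (DS - AV) / (max - min) for every amino acid.

    `scores` maps one-letter amino acid codes to the best docking
    score (DS) obtained at a single peptide position. AV, max and min
    are assumed to be taken over that position only.
    """
    values = list(scores.values())
    spread = max(values) - min(values)
    if spread == 0:
        raise ValueError("all scores are identical; NV is undefined")
    average = sum(values) / len(values)
    return {aa: (ds - average) / spread for aa, ds in scores.items()}


# Hypothetical fitness scores for three residues at one position.
example_scores = {"A": 60.0, "F": 80.0, "D": 40.0}
nv = normalize_scores(example_scores)
```

With this definition the NV values of one position always sum to zero, so residues scoring above the position average come out positive and those below it negative.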

Table 3 Docking-based quantitative matrix (DB-QM) for the five anchor positions (p1, p4, p6, p7, and p9)

DRB1*0101  p1      p4      p6      p7       p9
A          -0.146  0.026   0.167   0.0609   0.122
C          -0.123  -0.026  0.207   0.0788   0.168
D          -0.299  -0.194  -0.315  -0.5012  -0.101
E          -0.208  -0.008  0.298   -0.1430  -0.181
F          0.458   0.464   -0.299  0.2917   0.361
G          -0.207  -0.121  0.16    0.0051   0.021
H          0.071   -0.005  -0.175  -0.1258  -0.034
I          0.109   0.102   0.228   0.2755   0.102
K          0.121   0.269   0.043   -0.0633  0.048
L          0.088   0.258   -0.169  0.1434   0.423
M          0.087   0.23    0.254   0.1056   0.272
N          -0.123  -0.235  -0.091  -0.1056  -0.105
P          -0.419  -0.536  -0.194  0.4988   -0.577
Q          -0.097  0.047   0.086   -0.2627  -0.203
R          -0.145  -0.343  -0.053  -0.3072  -0.489
S          -0.119  -0.144  0.083   -0.1561  0.104
T          -0.018  -0.166  0.221   -0.1777  0.19
V          0.04    0.108   0.222   0.1366   0.056
W          0.581   0.196   0.034   0.2014   -0.166
Y          0.348   0.075   -0.702  0.0355   -0.014
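One simple way to apply such a matrix to a candidate peptide is sketched below. It assumes an additive model in which the score of a 9-residue binding core is the sum of the NV values at the five anchor positions; the chapter does not prescribe this scoring rule, so it should be read as an illustration only. The matrix is a four-row excerpt of Table 3, and the function name and test peptides are hypothetical.

```python
# Anchor positions covered by the DB-QM (1-based, within a 9-mer core).
ANCHORS = (1, 4, 6, 7, 9)

# Excerpt of Table 3: NV values at p1, p4, p6, p7 and p9 for DRB1*0101.
QM_EXCERPT = {
    "A": (-0.146, 0.026, 0.167, 0.0609, 0.122),
    "D": (-0.299, -0.194, -0.315, -0.5012, -0.101),
    "F": (0.458, 0.464, -0.299, 0.2917, 0.361),
    "L": (0.088, 0.258, -0.169, 0.1434, 0.423),
}


def score_core(core, qm=QM_EXCERPT):
    """Sum the NV values of the residues found at the anchor positions."""
    if len(core) != 9:
        raise ValueError("the binding core must contain exactly 9 residues")
    total = 0.0
    for column, position in enumerate(ANCHORS):
        residue = core[position - 1]  # convert 1-based position to index
        total += qm[residue][column]
    return total


favorable = score_core("FAALAAAAL")    # F at p1, L at p4, A at p6/p7, L at p9
unfavorable = score_core("DAADADDAD")  # D at every anchor position
```

Non-anchor positions do not contribute in this sketch, so only the residues at p1, p4, p6, p7, and p9 need to be present in the matrix.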

4 Notes

1. This number can be increased, for example to 100, resulting in a more detailed but more time-consuming search.
2. Optional: the protocol can also be validated using the other scoring functions, especially the relatively new ChemPLP function.
3. One can keep the default settings, i.e., cluster size = 1 and RMSD = 1.5, or change the cluster size to a greater value together with the RMSD value.
4. To identify the atom file indexes, open the peptide molecule in a new Hermes window and, from the left tab, select Display > right click on C:ID > Labels > Label by Atom File Index.

Acknowledgment

This work was supported by the Science and Education for Smart Growth Operational Program and co-financed by the European Union through the European Structural and Investment funds (Grant No BG05M2OP001-1.001-0003).

References

1. Khan AH, Prakash A, Kumar D et al (2010) Virtual screening and pharmacophore studies for ftase inhibitors using Indian plant anticancer compounds database. Bioinformation 5:62–66
2. Arévalo JMC, Amorim JC (2022) Virtual screening, optimization and molecular dynamics analyses highlighting a pyrrolo[1,2-a]quinazoline derivative as a potential inhibitor of DNA gyrase B of Mycobacterium tuberculosis. Sci Rep 12:4742
3. Yagci S, Gozelle M, Kaya SG et al (2021) Hit-to-lead optimization on aryloxybenzamide derivative virtual screening hit against SIRT. Bioorg Med Chem 30:115961
4. Atanasova M, Dimitrov I, Ivanov S et al (2022) Virtual screening and hit selection of natural compounds as acetylcholinesterase inhibitors. Molecules 27:3139
5. Choong IC, Lew W, Lee D et al (2002) Identification of potent and selective small-molecule inhibitors of caspase-3 through the use of extended tethering and structure-based drug design. J Med Chem 45:5005–5022
6. Combs AP (2007) Structure-based drug design of new leads for phosphatase research. IDrugs 10:112–115
7. Coumar MS, Leou J-S, Shukla P et al (2009) Structure-based drug design of novel aurora kinase A inhibitors: structural basis for potency and specificity. J Med Chem 52:1050–1062
8. Jia B, Ma Y, Liu B et al (2019) Synthesis, antimicrobial activity, structure-activity relationship, and molecular docking studies of indole diketopiperazine alkaloids. Front Chem 7:837
9. Bacalhau P, San Juan AA, Marques CS et al (2016) New cholinesterase inhibitors for Alzheimer’s disease: Structure Activity Studies (SARs) and molecular docking of isoquinolone and azepanone derivatives. Bioorg Chem 67:1–8
10. Singh N, Villoutreix BO, Ecker GF (2019) Rigorous sampling of docking poses unveils binding hypothesis for the halogenated ligands of L-type Amino acid Transporter 1 (LAT1). Sci Rep 9:15061
11. Luger D, Poli G, Wieder M et al (2015) Identification of the putative binding pocket of valerenic acid on GABA A receptors using docking studies and site-directed mutagenesis. Br J Pharmacol 172:5403–5413
12. Inoue Y, Nakamura N, Inagami T (1997) A review of mutagenesis studies of angiotensin II type 1 receptor, the three-dimensional receptor model in search of the agonist and antagonist binding site and the hypothesis of a receptor activation mechanism. J Hypertens 15:703–714
13. Venhorst J, ter Laak AM, Commandeur JNM et al (2003) Homology modeling of rat and human cytochrome P450 2D (CYP2D) isoforms and computational rationalization of experimental ligand-binding specificities. J Med Chem 46:74–86
14. Xie L, Evangelidis T, Xie L et al (2011) Drug discovery using chemical systems biology: weak inhibition of multiple kinases may contribute to the anti-cancer effect of nelfinavir. PLoS Comput Biol 7:e1002037
15. Atanasova M, Patronov A, Dimitrov I et al (2013) EpiDOCK: a molecular docking-based tool for MHC class II binding prediction. Protein Eng Des Sel 26:631–634
16. Patronov A, Dimitrov I, Flower DR et al (2012) Peptide binding to HLA-DP proteins at pH 5.0 and pH 7.0: a quantitative molecular docking study. BMC Struct Biol 12:20
17. Atanasova M, Dimitrov I, Flower DR et al (2011) MHC class II binding prediction by molecular docking. Mol Inf 30:368–375
18. Patronov A, Dimitrov I, Flower DR et al (2011) Peptide binding prediction for the human class II MHC allele HLA-DP2: a molecular docking approach. BMC Struct Biol 11:32
19. Matondo A, Dendera W, Isamura BK et al (2022) In silico drug repurposing of anticancer drug 5-FU and analogues against SARS-CoV-2 main protease: molecular docking, molecular dynamics simulation, pharmacokinetics and chemical reactivity studies. Adv Appl Bioinforma Chem 15:59–77
20. Jukič M, Kores K, Janežič D et al (2021) Repurposing of drugs for SARS-CoV-2 using inverse docking fingerprints. Front Chem 9:757826
21. Kumar S, Chowdhury S, Kumar S (2017) In silico repurposing of antipsychotic drugs for Alzheimer’s disease. BMC Neurosci 18:76
22. Brzezinski D, Porebski PJ, Kowiel M et al (2021) Recognizing and validating ligands with CheckMyBlob. Nucleic Acids Res 49:W86–W92
23. Setny P, Bahadur RP, Zacharias M (2012) Protein-DNA docking with a coarse-grained force field. BMC Bioinf 13:228
24. Viji SN, Balaji N, Gautham N (2012) Molecular docking studies of protein-nucleotide complexes using MOLSDOCK (mutually orthogonal Latin squares DOCK). J Mol Model 18:3705–3722
25. Zacharias M (2010) Accounting for conformational changes during protein–protein docking. Curr Opin Struct Biol 20:180–186
26. Almeida R, Dell’Acqua S, Krippahl L et al (2016) Predicting protein-protein interactions using BiGGER: case studies. Molecules 21:1037
27. Tiwari A, Singh S (2022) Computational approaches in drug designing. In: Bioinformatics. Elsevier, pp 207–217
28. Prieto-Martínez FD, Arciniega M, Medina-Franco JL (2018) Acoplamiento Molecular: Avances Recientes y Retos. TIP Rev Espec en Ciencias Químico-Biológicas 21
29. Bissantz C, Folkers G, Rognan D (2000) Protein-based virtual screening of chemical databases. 1. Evaluation of different docking/scoring combinations. J Med Chem 43:4759–4767
30. Taylor RD, Jewsbury PJ, Essex JW (2002) A review of protein-small molecule docking methods. J Comput Aided Mol Des 16:151–166
31. Sousa SF, Fernandes PA, Ramos MJ (2006) Protein-ligand docking: current status and future challenges. Proteins Struct Funct Bioinf 65:15–26
32. Morris GM, Lim-Wilby M (2008) Molecular Docking. In: Molecular Modeling of Proteins. Methods Molecular Biology. Humana Press, 443:365–382. https://doi.org/10.1007/978-1-59745-177-2_19
33. Kumar S, Kumar S (2019) Molecular docking: a structure-based approach for drug repurposing. In: In silico drug design. Elsevier, pp 161–189
34. Stanzione F, Giangreco I, Cole JC (2021) Use of molecular docking computational tools in drug discovery. In: Progress in Medicinal Chemistry. Elsevier, 60:273–343. https://doi.org/10.1016/bs.pmch.2021.01.004
35. Silakari O, Singh PK (2021) Molecular docking analysis: basic technique to predict drug-receptor interactions. In: Concepts and experimental protocols of modelling and informatics in drug design. Elsevier, pp 131–155
36. Perola E, Walters WP, Charifson PS (2004) A detailed comparison of current docking and scoring methods on systems of pharmaceutical relevance. Proteins Struct Funct Bioinf 56:235–249
37. Kontoyianni M, McClellan LM, Sokol GS (2004) Evaluation of docking performance: comparative data on docking algorithms. J Med Chem 47:558–565
38. Kellenberger E, Rodrigo J, Muller P et al (2004) Comparative evaluation of eight docking tools for docking and virtual screening accuracy. Proteins Struct Funct Bioinf 57:225–242
39. Halgren TA, Murphy RB, Friesner RA et al (2004) Glide: a new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening. J Med Chem 47:1750–1759
40. Friesner RA, Banks JL, Murphy RB et al (2004) Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J Med Chem 47:1739–1749
41. Fernández MM, Guan R, Swaminathan CP et al (2006) Crystal structure of staphylococcal enterotoxin I (SEI) in complex with a human major histocompatibility complex class II molecule. J Biol Chem 281:25356–25364
42. Jones G, Willett P, Glen RC et al (1997) Development and validation of a genetic algorithm for flexible docking. J Mol Biol 267:727–748

Chapter 18

The PANDORA Software for Anchor-Restrained Peptide:MHC Modeling

Dario F. Marzella, Giulia Crocioni, Farzaneh M. Parizi, and Li C. Xue

Abstract

Major histocompatibility complexes (MHC) play a key role in the immune surveillance system in all jawed vertebrates. MHC class I molecules randomly sample cytosolic peptides from inside the cell, while MHC class II molecules sample exogenous peptides. Both types of peptide:MHC complex are then presented on the cell surface for recognition by αβ T cells (CD8+ and CD4+, respectively). The three-dimensional structure of such complexes can give crucial insights into the presentation and recognition mechanisms. For this reason, software such as PANDORA has been developed to rapidly and accurately generate peptide:MHC (pMHC) 3D structures. In this chapter, we describe the protocol of PANDORA. PANDORA exploits the structural knowledge on anchor pockets that MHC molecules use to dock peptides, and it provides anchor positions as restraints to guide the modeling process. This allows PANDORA to generate twenty 3D models in about 5 min. PANDORA is highly customizable, easy to install, supports parallel processing, and is suitable for providing large datasets for deep learning algorithms.

Key words Integrative modeling, Peptide:MHC complexes, Adaptive immunity

1 Introduction

Major histocompatibility complex (MHC) proteins are central molecules in the T cell-based immune surveillance system. Cells constantly break down proteins into peptides. MHC proteins present some of these peptides on the cell surface, forming the peptide:MHC complex (Fig. 1). T cells use their T cell receptor (TCR) to scan the peptides presented on the cell surface and thereby check the health status of the target cell. When T cells recognize a peptide as foreign, they become activated and elicit immune attacks. Foreign peptides can be derived from viruses or other microbial organisms, tumor antigens, tissue transplants, and other sources. Thus, this peptide:MHC mechanism plays a key role in a wide range of immune defense situations and immunotherapies [1–3].

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_18, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


Fig. 1 The conformation of peptide bound to MHC. (a) SARS-CoV-2 peptide (yellow) bound to MHC class I (cyan) binding groove (PDB ID: 7LG3), (b) peptide (yellow) bound to MHC class II (cyan and blue) binding groove (PDB ID: 4Z7U). The peptide surface is outlined in transparent gray

1.1 3D Modeling of Peptide:MHC

3D structures of peptide:MHC (pMHC) complexes provide crucial information on the peptide:MHC binding mechanism and assist further analysis of MHC dynamics and peptide editing [4]. Because peptides and MHC molecules are so diverse, computational modeling is an indispensable complement to experimental structure determination methods such as X-ray crystallography, NMR, and cryo-EM. In the past decades, several modeling approaches have been developed [5–7]. All of them have two major stages: (1) sampling and (2) scoring. The sampling step generates many plausible conformations. The scoring step then evaluates each of these conformations (often in terms of energies) and selects the top-scoring models. Some approaches model pMHC structures with molecular dynamics (MD) [8], others with molecular docking [7], and others with a 3D grid-based modeling approach [5]. However, only a few approaches are available for wide use as software or web servers. Here we present PANDORA, a Python software for modeling peptide:MHC complexes.

1.2 PANDORA

PANDORA [9] is an integrative modeling Python (≥ v3.7) package for 3D structures of peptide:MHC complexes. It exploits the knowledge that MHC molecules use deep anchor pockets to bind peptides. PANDORA adds distance restraints on these anchor positions while allowing the rest of the peptide backbone to be fully flexible. PANDORA is built on top of MODELLER [10], a homology modeling software. PANDORA optimizes the peptide conformation by minimizing the molpdf score, which is the objective function of MODELLER and contains several energy terms from the CHARMM force field [11] and penalty functions for the restraints (i.e., generating high values when restraints are violated). To the best of our knowledge, PANDORA is the only software that can model both peptide:MHC class I and class II complexes.

3D-Modelling of Peptide:MHC Complexes using PANDORA


Briefly, PANDORA takes as input (1) MHC allele names or MHC sequences and (2) peptide sequences, and predicts the top N (default: 20, configurable) low-energy 3D peptide:MHC models. Our benchmark on all experimental peptide:MHC-I structures from the Protein Data Bank (PDB) [12] shows that PANDORA models are very reliable [9]. It typically takes about 5 min on one CPU core to generate the 20 lowest-energy models. PANDORA is highly modularized, configurable, and supports parallel processing. We describe here a comprehensive use case that covers most PANDORA options. This chapter was written using PANDORA v2.0.0. PANDORA is available through conda (https://anaconda.org/CSB-Nijmegen/csb-pandora), PyPI (https://pypi.org/project/csb-pandora/), and GitHub (https://github.com/X-lab-3D/PANDORA).
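The idea of keeping the top N low-energy models can be pictured as a simple ranking by score. The sketch below is not PANDORA's own code. It assumes that every generated model comes with a molpdf value and that lower values are better, in line with the minimization described in Subheading 1.2. The function name, file names, and scores are made up for illustration.

```python
def top_models(molpdf_by_model, n=20):
    """Return the n models with the lowest molpdf score, best first."""
    ranked = sorted(molpdf_by_model.items(), key=lambda item: item[1])
    return ranked[:n]


# Hypothetical molpdf scores of three generated models.
scores = {"model_1.pdb": -120.5, "model_2.pdb": -98.2, "model_3.pdb": -131.0}
best = top_models(scores, n=2)
```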

2 Useful Information

2.1 Code-Block Colors

In this chapter, there are multiple code blocks representing bash and Python scripts. To better distinguish them, they have been given two different background colors. Moreover, by convention, bash commands always start with the dollar sign ($).

• Dark background: bash scripts
Example:
$ export KEY_MODELLER='XXXX'

• White background: Python scripts and outputs
Example:
import PANDORA
print(PANDORA.PANDORA_data)

In some parts of the chapter (e.g., Protocol 2), the focus is on only specific parts of the code blocks. In these cases, the gray text has been previously explained, while the black/colored part of the code is the one being explained.
Example:
# Previously explained code
import PANDORA
print(PANDORA.PANDORA_data)
# New code
from PANDORA import Database
db = Database.load()


2.2 Chain ID and Numbering Conventions

Chain IDs of all resulting models are named as follows:
– M: MHC alpha chain
– B: β-2 microglobulin, only for MHC-I
– N: MHC-II beta chain
– P: peptide chain

In the PANDORA models and reference structures, the residue numbering of every chain starts from 1.
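To see how these conventions show up in a model file, the sketch below counts the residues of each chain in PDB-format ATOM records and labels the chains with the roles listed above. It relies only on the fixed-column PDB layout (chain ID in column 22, residue number in columns 23 to 26, insertion code in column 27) and not on any PANDORA function. The helper names and the four demonstration records are hypothetical.

```python
from collections import defaultdict

# Chain roles as listed in this subheading.
CHAIN_ROLES = {
    "M": "MHC alpha chain",
    "B": "beta-2 microglobulin (MHC-I only)",
    "N": "MHC-II beta chain",
    "P": "peptide chain",
}


def residues_per_chain(pdb_lines):
    """Count distinct residues per chain ID in PDB ATOM records."""
    residues = defaultdict(set)
    for line in pdb_lines:
        if line.startswith("ATOM"):
            chain_id = line[21]                   # column 22
            residue_number = line[22:27].strip()  # columns 23-27
            residues[chain_id].add(residue_number)
    return {chain: len(found) for chain, found in residues.items()}


def atom_line(serial, name, resname, chain, resseq):
    """Build the leading columns of an ATOM record (demo helper only)."""
    return f"ATOM  {serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}"


demo = [
    atom_line(1, "N", "GLY", "M", 1),
    atom_line(2, "CA", "GLY", "M", 1),
    atom_line(3, "N", "LEU", "P", 1),
    atom_line(4, "N", "LEU", "P", 2),
]
counts = residues_per_chain(demo)
labelled = {CHAIN_ROLES[chain]: n for chain, n in counts.items()}
```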

2.3 Database Location

PANDORA is a template-based modeling software. The user might want to know where the template database is installed and check its content. Once PANDORA is installed (see Subheading 3.1 below), the exact location of the database is stored within the package as a variable called “PANDORA.PANDORA_data”. Thus, to obtain the database location, the user can run in python:

import PANDORA
print(PANDORA.PANDORA_data)

2.4 Database Structure

A PANDORA database has the following organization (see Table 1 for explanations):

├── mhcseqs
│   ├── HLA_raw.fasta
│   ├── HLA_cleaned.fasta
│   ├── MHC_raw.fasta
│   └── MHC_cleaned.fasta
├── PDBs
│   ├── Bad
│   │   ├── pMHCI
│   │   ├── pMHCII
│   │   ├── log_MHCI.csv
│   │   └── log_MHCII.csv
│   ├── IMGT_retrieved
│   │   └── IMGT3DFlatFiles
│   ├── pMHCI
│   └── pMHCII
├── BLAST_databases
│   ├── refseq_blast_db
│   └── templates_blast_db
└── PANDORA_database.pkl

Table 1 Database files and folders

File name                   Description
mhcseqs                     MHC reference sequence databases
mhcseqs/HLA_raw.fasta       Raw human reference database obtained from https://github.com/ANHIG/IMGTHLA
mhcseqs/HLA_cleaned.fasta   Human reference sequences parsed and cleaned for PANDORA to use
mhcseqs/MHC_raw.fasta       Raw non-human reference database obtained from https://github.com/ANHIG/IPDMHC
mhcseqs/MHC_cleaned.fasta   Non-human reference sequences parsed and cleaned for PANDORA to use
PDBs                        PDB files for template structures
PDBs/Bad                    PDB files filtered out and logs for the database generation
PDBs/IMGT_retrieved         IMGT-downloaded files
refseq_blast_db             BLAST database of MHC reference sequences (built from the combined sequences under the mhcseqs folder)
templates_blast_db          BLAST database of template sequences used for template selection
PANDORA_database.pkl        Pickle file containing the PANDORA.Database.Database object
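A quick sanity check of a database folder against the layout described in this subheading could look like the sketch below. It uses only the Python standard library and does not call PANDORA. The list of expected entries is a subset chosen for illustration, the function name is hypothetical, and the temporary folder merely stands in for a real database location.

```python
import tempfile
from pathlib import Path

# A few entries expected below the database root (illustrative subset).
EXPECTED_ENTRIES = (
    "mhcseqs",
    "PDBs/pMHCI",
    "PDBs/pMHCII",
    "BLAST_databases",
    "PANDORA_database.pkl",
)


def missing_entries(root):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_ENTRIES if not (root / rel).exists()]


# Demonstration on a throwaway folder containing only two of the entries.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "mhcseqs").mkdir()
    (Path(tmp) / "PDBs" / "pMHCI").mkdir(parents=True)
    demo_missing = missing_entries(tmp)
```

On a real installation, the folder printed by PANDORA.PANDORA_data (see Subheading 2.3) would be the natural starting point for locating the database root.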

2.5 Exploring the Database

The PANDORA database can be opened and explored directly in python. The two main components of the database are the dictionaries of template objects, called respectively “MHCI_data” and “MHCII_data.” These are indexed by template PDB ID.

## Load the required module
from PANDORA import Database

## Load the database
db = Database.load()

## Select one template
template = db.MHCI_data['1A1M']

## Print the template information
template.info()


Output:
This is a Template structure.
ID: 1A1M
Type: MHC class I
Alleles: ['HLA-B*53:01', 'HLA-B*53:01']
Alpha chain length: 278
Peptide length: 9
Alpha chain:
GSHSMRYFYTAMSRPGRGEPRFIAVGYVDDTQFVRFDSDAASPRTEPRPPWIEQEGPEYWDRNTQIFKTNTQTYR
ENLRIALRYYNQSEAGSHIIQRMYGCDLGPDGRLLRGHDQSAYDGKDYIALNEDLSSWTAADTAAQITQRKWEAA
RVAEQLRAYLEGLCVEWLRRYLENGKETLQRADPPKTHVTHHPVSDHEATLRCWALGFYPAEITLTWQRDGEDQT
QDTELVETRPAGDRTFQKWAAVVVPSGEEQRYTCHVQHEGLPKPLTLRWEPHH
Peptide: TPYDINQML
Anchors: [2, 9]
Path to PDB file: /home/User/PANDORA_databases/default/PDBs/pMHCI/1A1M.pdb
Biopython PDB structure: no PDB loaded

3 PANDORA Protocols

3.1 Installation

For system requirements, see Note 1. The suggested way to install PANDORA is through conda, which therefore needs to be installed first. As PANDORA builds on MODELLER, the user must request a MODELLER license key at https://salilab.org/MODELLER/registration.html. Replace XXXX with your MODELLER license key and run the following bash command in the shell, or put it in your .bashrc file and source it:

$ export KEY_MODELLER='XXXX'


Then, you can install PANDORA in the same bash terminal by running:

$ conda install -c csb-nijmegen csb-pandora -c salilab -c bioconda
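Because the license key is passed through an environment variable, it can be worth confirming from Python that the variable is visible before troubleshooting an installation. The sketch below uses only the standard library. The function name is hypothetical, and treating the literal placeholder XXXX as "not set" is an assumption made for this example.

```python
import os


def modeller_key_is_set(environ=None):
    """Return True if KEY_MODELLER holds something other than the placeholder."""
    env = os.environ if environ is None else environ
    key = env.get("KEY_MODELLER", "")
    return bool(key) and key != "XXXX"


# Checks against explicit dictionaries; call modeller_key_is_set()
# without arguments to inspect the real environment instead.
with_key = modeller_key_is_set({"KEY_MODELLER": "my-license-key"})
with_placeholder = modeller_key_is_set({"KEY_MODELLER": "XXXX"})
without_key = modeller_key_is_set({})
```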

For issues with installing and running MODELLER, see Note 2. For other installation issues, please refer to the PANDORA issue page https://github.com/X-lab-3D/PANDORA/issues.

3.2 Download the Template Database or Build the Database Locally

To work properly, PANDORA needs a template database composed of MHC reference sequences and template PDB files. PANDORA provides two alternative options to obtain the database.

Option 1: Directly download the latest released database. This can be done by running the following bash command after PANDORA has been properly installed:

$ pandora-fetch

Alternatively, it can be done in Python with the following lines:

## Import required modules
from PANDORA import Database

## Create local Database
Database.install_database()

Option 2: Locally generate the database. This option makes sure the database is up to date and contains as many templates as possible. Like the option above, this can also be performed in bash:

$ pandora-create

Or in Python:

## Import required modules
from PANDORA import Database

## Create local Database
db = Database.Database()
db.construct_database()

For multi-processing database generation, see Note 3.


3.3 Protocol 1: Model a Peptide:MHC-I Complex, a Simple Scenario

Input:
– A case ID
– One peptide sequence
– One MHC class I allele name
– The MHC class of the complex

Steps:
A. Load the template database.
B. Create a Target object based on the given target information.
C. Generate 20 pMHC 3D models.

By default, the output directory will be generated in the current working directory. The user can also specify an output directory (see Subheading 3.4, Optional Arguments, for details). This protocol can be performed in bash:

$ pandora-run -i <case ID> -a <allele name> -p <peptide sequence> -m <MHC class>

Or in Python:

## Import required modules
from PANDORA import Target
from PANDORA import Pandora
from PANDORA import Database

## A. Load template Database
db = Database.load()

## B. Create Target object
target = Target(id = '<case ID>',
    allele_type = '<allele name>',
    peptide = '<peptide sequence>')

## C. Perform modelling
case = Pandora.Pandora(target, db)
case.model()


Example:

Bash:
$ pandora-run -i MyTestCase -a HLA-A*02:01 -p LLFGYPVYV -m I

Python:
# Import required modules
from PANDORA import Target
from PANDORA import Pandora
from PANDORA import Database

# Load the template database
db = Database.load()

# Create Target object
target = Target(id = 'MyTestCase',
    allele_type = 'HLA-A*02:01',
    peptide = 'LLFGYPVYV')

# Perform modelling
case = Pandora.Pandora(target, db)
case.model()
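Allele names contain an asterisk, which a shell may try to expand as a wildcard if a matching file name exists. When commands are assembled programmatically, quoting each argument avoids this. The sketch below builds the command line with the standard library only, using the -i, -a, -p, and -m flags shown in this protocol. The helper name is hypothetical and the command is only constructed, not executed.

```python
import shlex


def pandora_run_command(case_id, allele, peptide, mhc_class):
    """Return a shell-safe pandora-run command line as a single string."""
    args = [
        "pandora-run",
        "-i", case_id,
        "-a", allele,
        "-p", peptide,
        "-m", mhc_class,
    ]
    # shlex.quote wraps any argument containing shell metacharacters.
    return " ".join(shlex.quote(arg) for arg in args)


command = pandora_run_command("MyTestCase", "HLA-A*02:01", "LLFGYPVYV", "I")
```

Only the allele name is wrapped in single quotes here, because it is the only argument that contains a character with special meaning to the shell.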

For MHC allele syntax, see Note 4. For details about the anchor positions used in this protocol, see Note 5.

3.4 Protocol 2: Model a Peptide:MHC-I Complex, a Comprehensive Python Scenario

1. Load the required modules

# Import required modules
from PANDORA import Target
from PANDORA import Pandora
from PANDORA import Database

2. Load the database

# Load template Database
db = Database.load()

3. Define the basic target information

The basic information needed to generate a pMHC-I 3D model is an ID, the peptide sequence, and the MHC information. The MHC information can be either the MHC sequence (preferred), the MHC allele name, or both. If both are provided, the sequence will be used and the allele name will be ignored. If the user inputs the MHC allele name only, PANDORA will automatically search the database for its FASTA sequence and use it to search for structural templates. All of this information must be passed to the PANDORA.PMHC.Target() class as the following arguments:
– id
– peptide
– allele_type
– M_chain_seq

Additionally, MHC-II models also need the following arguments:
– MHC_class
– N_chain_seq


Example:
# Create Target object
target = Target(id = 'MyTestCase',
    MHC_class = 'I',
    allele_type = 'HLA-A*02:01',
    # or M_chain_seq = 'GSHSMR...',
    peptide = 'LLFGYPVYV')

4. (Optional but preferred) Define peptide anchor positions

PANDORA uses the anchor information to guide its modeling process. However, PANDORA cannot predict by itself which peptide residues act as anchors when binding to the MHC. For MHC-I, by default, PANDORA uses the canonical anchor positions P2 and PΩ (for MHC-II, see Note 6). PANDORA supports two options for users to specify peptide anchors: (1) directly provide the anchor positions as a list of indexes (integers) with the argument "anchors", or (2) let PANDORA predict the binding core by using NetMHCpan or NetMHCIIpan with the argument "use_netmhcpan".

Example for Option 1 (directly specify anchors):
target = Target(id = 'MyTestCase',
    MHC_class = 'I',
    allele_type = 'HLA-A*02:01',
    peptide = 'LLFGYPVYV',
    anchors = [2,9])


Example for Option 2 (NetMHCpan-predicted anchors):
target = Target(id = 'MyTestCase',
    MHC_class = 'I',
    allele_type = 'HLA-A*02:01',
    peptide = 'LLFGYPVYV',
    use_netmhcpan = True)

See Note 7.

5. Create the case object

case = Pandora.Pandora(target, db)

6. Model the case

case.model()
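The canonical MHC-I anchor rule mentioned in step 4 (P2 and PΩ, the last residue) is easy to express as a small helper, for instance to pre-fill the "anchors" argument for many peptides at once. The function below is an illustrative sketch, not part of the PANDORA API, and its name is hypothetical.

```python
def canonical_mhci_anchors(peptide):
    """Return the 1-based canonical MHC-I anchors: P2 and the last residue."""
    if len(peptide) < 2:
        raise ValueError("peptide must contain at least two residues")
    return [2, len(peptide)]


nonamer_anchors = canonical_mhci_anchors("LLFGYPVYV")
octamer_anchors = canonical_mhci_anchors("VPLRPMTY")
```

The nonamer yields the same list, [2, 9], that is passed to the "anchors" argument in the Option 1 example above.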

Optional Arguments:

• Specify a different output directory
By default, the output directory will be generated in the current working directory. To have the outputs produced in a different folder, specify it with the argument "output_dir" when defining the Target object.
Example:
target = Target(id = 'MyTestCase',
    MHC_class = 'I',
    allele_type = 'HLA-A*02:01',
    peptide = 'LLFGYPVYV',
    output_dir = '/home/User/Documents/')

• Change the number of output models
The default number of models generated by PANDORA is 20. To obtain a different number of models, use the n_loop_models option.
Example:
case.model(n_loop_models=100)

• Add secondary structure restraints
Some MHC-I-bound peptides can fold into short elements of secondary structure within the binding groove. This mostly happens for rare, long (>13 residues) peptides. Although it is very hard for PANDORA to predict secondary structures, a user can predict such occurrences with specialized tools like AGADIR [13] and PEPFOLD3 [14], and use the predicted secondary structure information to guide PANDORA's modeling process. The secondary structure predictions can be passed to PANDORA with the "helix" and "sheet" arguments:

– For alpha-helices, the user should provide the indexes of the starting and ending positions of the helix on the peptide as a list of integers.
Example:
target = Target(id = 'MyTestCase',
    MHC_class = 'I',
    allele_type = 'HLA-A*02:01',
    peptide = 'LLFGYPVLAYVRL',
    helix = [4,10])

– For beta-sheets, the user must enter a list with the start position of B-sheet 1, the start position of B-sheet 2, and the length of the B-sheet in H-bonds. For example, ["N:6:P", "O:13:P", -3] means that the sheet starts at the nitrogen atom of the 6th residue of chain P (the peptide), ends at the oxygen of the 13th residue of chain P, and has a length of 3 H-bonds. The negative sign denotes an anti-parallel beta-sheet.
Example:
target = Target(id = 'MyTestCase',
    MHC_class = 'I',
    allele_type = 'HLA-A*02:04',
    peptide = 'FLNKDLEVDGHFVTM',
    sheet = ["N:6:P", "O:13:P", -3])
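The compact sheet notation can be unpacked into its parts to double-check a restraint before it is passed on. The sketch below follows the reading given in the text (atom name, residue number, and chain ID separated by colons, with a negative length marking an anti-parallel sheet). It is an illustrative parser with a hypothetical name, not the parsing code used by PANDORA or MODELLER.

```python
def parse_sheet_restraint(sheet):
    """Split a [start, end, length] sheet restraint into named fields."""
    start_spec, end_spec, length = sheet

    def parse_atom(spec):
        # "N:6:P" -> atom name, 1-based residue number, chain ID
        atom, residue, chain = spec.split(":")
        return {"atom": atom, "residue": int(residue), "chain": chain}

    return {
        "start": parse_atom(start_spec),
        "end": parse_atom(end_spec),
        "h_bonds": abs(length),
        "antiparallel": length < 0,
    }


restraint = parse_sheet_restraint(["N:6:P", "O:13:P", -3])
```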

• Add user-defined templates
In some cases, a user might want to model a structure using a specific template. To do so, the user may directly pick the template from the ones available in the loaded database, as in the example below:
template = db.MHCI_data['1B0G']

Once selected, the template can be passed to the Pandora object:
case = Pandora.Pandora(target, db, template=[template])

• Additional optional arguments
The functions showcased above have many additional arguments that can be used to personalize the modeling. The user can find them in the online API reference at https://csb-pandora.readthedocs.io/en/latest/Documentation.html

3.5 Protocol 3: Model a Peptide:MHC-II Complex

Similar to Protocols 1 and 2, PANDORA can be used to model pMHC-II cases with a few minor adjustments. As no canonical anchors can be defined for MHC-II binding peptides, the user must either provide PANDORA with the anchor positions or predict them by using NetMHCIIpan. Moreover, the user needs to provide the sequences or allele types for both MHC chains. The only case in which one chain can be omitted is when modeling DR genes, for which PANDORA will assume the alpha chain to be HLA-DRA*01:01 if it is not provided. A pMHC-II example case is shown below.

3D-Modelling of Peptide:MHC Complexes using PANDORA

Bash: $ pandora-run -i MyTestCase -a HLA-DPA1*01:03,HLA-DPB1*01:01 -p GSDWRFLRGYHQYA -m II -k 4,7,9,12

Python:
## import requested modules
from PANDORA.PMHC import PMHC
from PANDORA.Pandora import Pandora
from PANDORA.Database import Database

## A. Load the template database
db = Database.load()

## B. Create Target object
target = PMHC.Target(id = 'MyTestCase',
    MHC_class = 'II',
    allele_type = ['HLA-DPA1*01:03', 'HLA-DPB1*01:01'],
    peptide = 'GSDWRFLRGYHQYA',
    anchors = [4,7,9,12])

## C. Perform modelling
case = Pandora.Pandora(target, db)
case.model()

3.6 Protocol 4—Run PANDORA Wrapper on Multiple Cases

PANDORA supports modeling multiple cases in parallel on multiple CPU cores.

Input Data: A .tsv (tab-separated) or .csv (comma-separated) file, where every row is a case to be modeled, including at least the peptide sequence and either the MHC allele name or the MHC sequence. Various other information for each case can be added, including anchors, templates, IDs, and MHC chain sequences. The complete list of arguments can be found in the documentation at https://csb-pandora.readthedocs.io/en/latest/Documentation.html or through the Python help() function, as help(Wrapper.Wrapper). An example file is showcased in Table 2.

Table 2 An example file: MyDatafile.tsv

Case1    LLFGYPVYV    HLA-A*02:01    2;9
Case2    VPLRPMTY     HLA-B*35:01    2;8
Case3    KPIVQYDNF    HLA-B*53:01    2;9
Case4    LPPLDITPY    HLA-B*35:01    2;9
Case5    GGRKKYKL     HLA-C*15:02    2;8
Case6    GGKKKYQL     HLA-B*08:01    2;8

The Wrapper class takes care of generating PANDORA target objects and parallelizes the modeling on the given number of cores. It can be run from both bash and Python. The following examples run the wrapper with four parallel processes (thus using four cores of a CPU), taking as input a file shaped like the MyDatafile.tsv file showcased above.

Bash:
$ pandora-wrapper -f MyDatafile.tsv --num-cores 4 --targets-id-column 0 --peptides-column 1 --allele_name_column 2 --anchors-column 3 -header 0 --mhc-class I

Python:
from PANDORA import Database
from PANDORA import Wrapper

## A. Load pregenerated database of all pMHC PDBs as templates
db = Database.load()

## B. Create the wrapper object.
wrap = Wrapper.Wrapper('MyDatafile.tsv', db,
    num_cores = 4,
    IDs_col = 0,
    peptides_col = 1,
    allele_name_col = 2,
    anchors_col = 3,
    header = False,
    MHC_class = 'I')
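An input file laid out like Table 2 can be generated programmatically. The following is a minimal sketch using only Python's standard csv module; the column order (ID, peptide, allele, anchors) mirrors the column indices passed to the wrapper above, and the helper name cases_to_tsv is hypothetical and not part of PANDORA.

```python
import csv
import io

# Rows follow the layout of Table 2: case ID, peptide, MHC allele, anchors.
CASES = [
    ("Case1", "LLFGYPVYV", "HLA-A*02:01", "2;9"),
    ("Case2", "VPLRPMTY", "HLA-B*35:01", "2;8"),
]


def cases_to_tsv(cases):
    """Return the cases as tab-separated text, one row per case, no header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(cases)
    return buffer.getvalue()


tsv_text = cases_to_tsv(CASES)
print(tsv_text, end="")
```

Writing tsv_text to a file named MyDatafile.tsv would give an input suitable for the header-less setup used in the examples above.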

Fig. 2 Schematic representation of the steps necessary to calculate the L-RMSD (panel labels: Reference (experimental), Model, Superimpose, L-RMSD)

3.7 Model Quality Evaluations

When performing a benchmark experiment, a widely used metric to evaluate the model’s quality is the Ligand-Root Mean Squared Deviation (L-RMSD, Fig. 2). The L-RMSD provides a measure of the structural deviation between the ligands in the model and in the reference structure (i.e., the experimentally determined structure). It is obtained by following these steps:

1. Superpose the receptors in the model and reference structures (the MHC is the receptor in the case of peptide:MHC complexes).
2. Calculate the RMSD over all the requested ligand atoms, i.e., the root mean square of the Euclidean distances between each ligand atom in the model and the corresponding atom in the reference structure (the peptide is the ligand for peptide:MHC complexes).

PANDORA provides a function to calculate the L-RMSD. The user can specify which atoms to include in the calculation by atom name (default: backbone atoms N, CA, C, O). The user can also restrict the calculation to a region of the peptide by specifying the residue numbers (default: whole peptide).

## import requested modules
from PANDORA import calc_LRMSD

# Define the model file with path
model_file = './/.BL00010001.pdb'

# Define reference file with path
ref_file = 'PANDORA.PANDORA_data/PDBs/pMHC/.pdb'

# Calculate L-RMSD
lrmsd = calc_LRMSD(model_file, ref_file)

The L-RMSD calculation can be adapted to the user’s needs by changing either the atom types or the peptide residues to include, using the two arguments “atoms” and “ligand_zone”.

Example:
# Calculate backbone L-RMSD of the binding core
core_lrmsd = calc_LRMSD(model_file, ref_file,
    atoms = ['C', 'CA', 'N', 'O'],
    ligand_zone = [2,9])
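To make step 2 of the L-RMSD procedure concrete, here is a minimal pure-Python sketch. It is not PANDORA's calc_LRMSD: it assumes the two structures have already been superposed on the MHC (step 1), that the peptide is chain P as in the restraint notation used earlier, and that the input lines follow the fixed-column PDB ATOM format. The function names and the toy coordinates are hypothetical.

```python
import math

BACKBONE = ("N", "CA", "C", "O")


def ligand_atoms(pdb_lines, chain="P", atoms=BACKBONE, zone=None):
    """Map (residue number, atom name) to (x, y, z) for the ligand chain."""
    coords = {}
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        name = line[12:16].strip()
        resnum = int(line[22:26])
        if line[21] != chain or name not in atoms:
            continue
        if zone is not None and not zone[0] <= resnum <= zone[1]:
            continue
        coords[(resnum, name)] = (
            float(line[30:38]), float(line[38:46]), float(line[46:54]))
    return coords


def ligand_rmsd(model_lines, ref_lines, **selection):
    """RMSD over ligand atoms present in both (already superposed) structures."""
    model = ligand_atoms(model_lines, **selection)
    ref = ligand_atoms(ref_lines, **selection)
    shared = sorted(set(model) & set(ref))
    if not shared:
        raise ValueError("no matching ligand atoms")
    total = sum(math.dist(model[key], ref[key]) ** 2 for key in shared)
    return math.sqrt(total / len(shared))


def _atom(serial, name, resnum, xyz, chain="P"):
    """Build a toy fixed-column ATOM record (illustration only)."""
    return "ATOM  %5d %-4s %3s %s%4d    %8.3f%8.3f%8.3f" % (
        serial, name, "ALA", chain, resnum, *xyz)


# Toy example: the model is the reference shifted by 1 angstrom along x.
reference = [_atom(1, "N", 1, (0.0, 0.0, 0.0)), _atom(2, "CA", 1, (1.5, 0.0, 0.0))]
model = [_atom(1, "N", 1, (1.0, 0.0, 0.0)), _atom(2, "CA", 1, (2.5, 0.0, 0.0))]
toy_lrmsd = ligand_rmsd(model, reference)
print(toy_lrmsd)
```

Passing zone=(2, 9) restricts the calculation to a residue range, in the spirit of the “ligand_zone” argument described above.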

3.8 Anticipated Results

An output folder for one single case (run with default arguments) takes about 6.7 MB. Table 3 lists the content of a sample output folder. Descriptions of the other files (.DL.., .lrsr, .rsr, .sch, .V9999.., .B9999..) can be found in the official MODELLER documentation at https://salilab.org/MODELLER/manual/node105.html.

4 Limitations of PANDORA

• The current version of PANDORA is only fully tested on Linux. It is not guaranteed to work on macOS, and it will not work on Windows.

Table 3 Output folder content

File name              Description
.BL0*.pdb              Final model files
molpdf_DOPE.tsv        Tsv file containing the molpdf ranking and DOPE score for each model. These values are retrieved from the final part of the MODELLER.log file
.pdb                   Template pdb structure
alignment.afa          Intermediate alignment file produced by MUSCLE [15]
.ali                   Final alignment file used by MODELLER
cmd_MODELLER_ini.py    MODELLER script used to generate the .ini model only
cmd_MODELLER.py        Final MODELLER script used to perform the modeling
MyLoop.py              MODELLER script containing the special restraints
contacts_.list         List of contacts used to restrain the anchors
_.fasta                Fasta containing the target MHC sequence
.log                   PANDORA log file

• The current version of PANDORA does not work with post-translational modifications (PTMs); support for them is a work in progress.
• Models generated using PANDORA are energy-minimized without solvent. Depending on the purpose, the user may want to further refine PANDORA models using molecular dynamics (e.g., GROMACS [16]), which can adequately simulate pMHC complexes in a solvent environment.
• PANDORA model quality relies on the accuracy of anchor positions.
• PANDORA works very well on short peptides (

“unsupervised” > “attribute” > “NumericToNominal”; a pop-up window opens. Click “NumericToNominal -R first-last” and type “last” in the “attributeindices” field; click the “Apply” button.

Machine Learning Models for Prediction of Bacterial Immunogenicity

3. Build the classification model: Select “Classify” from the main menu. To select a classification algorithm, click the “Choose” button in the “Classifier” field and select the desired method. A pop-up window opens. Click the field with the name of the classifier, edit the desired parameters of the algorithm, and click the “OK” button. The applied parameters for each algorithm are described in the corresponding subheading given below.
4. Select the method to test the extracted model: Select “Cross Validation” in the “Test Options” box and write the number of folds in the “Folds” box. Apply tenfold cross-validation for each of the ML algorithms used (see Note 3). Model performance evaluation metrics as well as model evaluation options can be selected via the “More Options...” button in the “Test Options” box.
5. Start modeling: Click the “Start” button. WEKA displays the performance of the derived classification model after tenfold cross-validation in the “Classifier Output” field.

3.6 Assessment of the Model Performance

The default WEKA output for classification models includes many different parameters. The most common are True Positive (TP) Rate, also called sensitivity or recall; False Positive (FP) Rate; Correctly Classified Instances, or accuracy; Positive Predictive Value (PPV), or precision; F1 score (F1); ROC Area; PR Area; and Matthews correlation coefficient (MCC). Sensitivity is a measure of how well the model can identify true positives. In the case of immunogenicity prediction, the true positives are the correctly predicted immunogens. Specificity (1 - FP Rate) is a measure of how well the model can identify true negatives. In the case of immunogenicity prediction, the true negatives are the correctly predicted non-immunogens. Accuracy is the fraction of all instances that the model classifies correctly. ROC Area is the area under the receiver operating characteristic curve (sensitivity vs. 1-specificity) and represents the diagnostic ability of a classification model as its discrimination threshold varies. PR Area is the area under the Precision-Recall curve (precision vs. recall) and gives the mean precision of the method. The Matthews correlation coefficient is a measure of the quality of a binary (two-class) classification. The F1 score combines the precision and recall of a classifier into a single metric by taking their harmonic mean.

3.7 Validation of ML Models with the Test Set

The dimensions and attributes of the test and training set should be identical, i.e., the class attribute of the test set needs to be transformed from numeric to nominal. The preparation procedure for validation of a classification model with WEKA consists of the following steps:

Ivan Dimitrov and Irini Doytchinova

1. Upload the file containing the ACC-transformed test set as an input matrix: Open the file containing the test set: Choose “Open file” from the main menu of WEKA Explorer; choose “CSV data files (*.csv)” in the field “Files of Type”; browse and select the file containing the test set. Turn the class attribute from numeric to nominal: Choose “Preprocess” in the main menu and select the button “Choose” from the field “Filter”; in the drop-down menu, select “filters” > “unsupervised” > “attribute” > “NumericToNominal”; a pop-up window opens. Click “NumericToNominal -R first-last” and write “last” in the field “attributeindices”; click the button “Apply”.
2. Save data to a file in ARFF format (see Note 4): Click the button “Save”; select “Arff data files (*.arff)” in the field “Files of Type”; browse to the folder where the file will be saved and enter the name of the file containing the test set; click the button “Save”. The next step is to apply a derived model to the test set: Select “Classify” on the main menu; select “Supplied test set” and click the button “Set” in the field “Test options”. Click the “Open File” button; in the new pop-up window, choose “Arff data files (*.arff)” in the field “Files of Type”; browse and select the file in ARFF format containing the test set; click the button “Open”. Close the pop-up window “Test instances” with the button “Close”.
3. Validate a derived model: Choose the ML model from the list of the derived models in the field “Result list (right-click for options)”; right-click on the name of the model in the list and select the option “Re-evaluate model on the current test set” from the pop-up window. WEKA shows the performance of the classification model on the test set in the field “Classifier output”.

3.8 Machine Learning Methods

Six different machine learning methods are used for predicting bacterial immunogenicity.

3.8.1 Partial Least Squares-Based Discriminant Analysis (PLS-DA)

PLS-DA is a method for classification via regression. The PLS algorithm forms linear combinations of the initial attributes to create new attributes, named principal components (PC). Principal components are used as predictors of the dependent variable [21]. The classification procedure is based on a particular threshold: samples are classified depending on whether their predicted value is larger or smaller than the given threshold. To select PLS-DA as the algorithm for classification, take the following steps:

1. Apply the classification via regression algorithm: In step 3 of Subheading 3.5 (Build the classification model), choose “classifiers” > “meta” > “ClassificationViaRegression”.
2. Select PLS as the algorithm for regression: In the pop-up window, click on the field with the name of the classifier and select the button “Choose”, then select “classifiers” > “functions” > “PLSClassifier”.
3. Select the PLS filter as the filter applied in the PLS classification algorithm: Click on the field with the name of the classifier to open the parameters of the algorithm; select the button “Choose” in the field “filter” and select “filters” > “supervised” > “attribute” > “PLSFilter”.
4. Select the number of principal components for the PLS algorithm: Click on the field with the name of the filter and write the number of principal components in the field “numComponents”.
5. Close all the opened pop-up windows by selecting the button “OK” for each window.
6. Start building the model: Click the button “Start”. After the model is built, the classification metrics for tenfold cross-validation appear in the field “Classifier output”.
7. Repeat the procedure from step 4 with different numbers of principal components and use the classification metrics to select the optimal number.
8. Choose the best PLS-DA model and validate it with the test set. The best performance for the prediction of bacterial immunogens with the training set is achieved with three principal components.

3.8.2 k Nearest Neighbor (kNN)

kNN measures the distances between a query sequence and each instance of the training data and classifies the query sequence based on how its k closest neighbors are classified [22]. In the case of immunogenicity prediction, the kNN algorithm measures the distance between the query protein sequence and the sequences of the k closest neighbor proteins from the training set. To select kNN as the algorithm for classification, take the following steps in Subheading 3.5 (Build the classification model):

1. Select kNN as the algorithm for classification: In step 3 of Subheading 3.5 (Build the classification model), choose “classifiers” > “lazy” > “IBk”.
2. Use the default WEKA parameters for the kNN algorithm with distance weighting equal to 1/distance: Select weight by 1/distance in the field “distanceWeighting”. Select the number of nearest neighbors for the kNN algorithm: Enter the number of nearest neighbors in the field “KNN”; close all the opened pop-up windows by selecting the button “OK” for each window.
3. Start building the model: Click the button “Start”. After the model is built, the classification metrics for tenfold cross-validation appear in the field “Classifier output”.
4. Choose the best kNN model and validate it with the test set: Repeat the procedure using different numbers of nearest neighbors in step 2 and determine the optimal number after an analysis of the classification metrics for each iteration. The best performance for the prediction of bacterial immunogens with the training set is achieved with k = 1.

3.8.3 Support Vector Machine (SVM)

SVM uses support vectors (cases) to define a hyperplane between two data classes by maximizing the margin between them. The hyperplane is described using a specific kernel function. The kernel function maps the training data into a higher-dimensional space so that a non-linear decision surface becomes linear in that space. WEKA allows the use of libraries from external sources. LibSVM [23] is a very widely used library, and WEKA supports it through a specific SVM wrapper [24]. The hyperparameters of the LibSVM algorithm need to be optimized by the WEKA grid-search algorithm. The Radial Basis Function (RBF) kernel can be used for the LibSVM algorithm; its hyperparameters are cost and gamma. To apply the WEKA grid-search algorithm with the LibSVM library for classification with the RBF kernel, with cost and gamma each within the interval (0.1, 1000), searched over the exponents -1.0 to 3.0 (i.e., 10^-1 to 10^3) with step 1, take the following steps in Subheading 3.5 (Build the classification model):

1. In step 3, choose classification algorithm, choose “classifiers” > “meta” > “GridSearch”.
2. Use the default WEKA parameters for the grid-search algorithm with the LibSVM library-specific hyperparameters and the interval of their change: Choose the hyperparameter “cost” and the interval of change: Enter “cost” in the field “XProperty”, “-1.0” in the field “Xmin”, and “3.0” in the field “Xmax”. Choose the hyperparameter “gamma” and the interval of change: Enter “gamma” in the field “YProperty”, “-1.0” in the field “Ymin”, and “3.0” in the field “Ymax”.
3. Set the criterion for evaluating the classifier performance: Select “Accuracy” in the field “evaluation”.
4. Select the kernel of the SVM classifier: Click the button “Choose” in the field “classifier” and choose “classifiers” > “functions” > “LibSVM”; click on the name of the chosen classifier and choose “radial basis function: exp(-gamma*|u-v|^2)” in the field “kernelType” in the pop-up window. Leave all other fields with the default parameters and close all the opened pop-up windows by selecting the button “OK” for each window.
5. Start building the model: Click the button “Start”. After the model is built, the classification metrics for tenfold cross-validation and the values of the selected hyperparameters appear in the field “Classifier output”.
6. Validate the derived SVM classification model with the test set.

3.8.4 Random Forest (RF)

Random Forest is an ensemble of individual decision trees [25]. The RF class prediction is based on the votes of the individual trees. Each tree learns from a bootstrap sample of instances and a random subset of attributes. Apply the Random Forest algorithm with 1000 trees and the default WEKA parameters. To select Random Forest as the algorithm for classification, take the following steps in Subheading 3.5 (Build the classification model):

1. Select Random Forest as the algorithm for classification: In step 3 of Subheading 3.5 (Build the classification model), choose “classifiers” > “trees” > “RandomForest”.
2. Apply the RF algorithm with 1000 trees in the random forest: Enter “1000” in the field “numIterations” in the pop-up window, leave all other fields with the default parameters, and close all the opened pop-up windows by selecting the button “OK” for each window.
3. Start building the model: Click the button “Start”. After the model is built, the classification metrics for tenfold cross-validation appear in the field “Classifier output”.

3.8.5 Random Subspace Method (RSM) with kNN Estimator

Random Subspace Method (RSM), also known as feature bagging, reduces the correlation between estimators in an ensemble by training them on random samples of attributes instead of the full set of attributes [26]. The Random Forest algorithm is an RSM using a decision tree as an estimator. RSM-kNN is known to be suitable for datasets in which the number of attributes is much larger than the number of training points, such as gene expression data [27]. To apply the RSM algorithm with the kNN (k = 1) algorithm as an estimator and the default WEKA parameters, follow the steps in Subheading 3.5 (Build the classification model):

1. Select RSM as the algorithm for classification: In step 3, choose classification algorithm, choose “classifiers” > “meta” > “RandomSubSpace”.
2. Select the kNN algorithm as the classifier: Click on the name of the RSM classifier and click the button “Choose” in the field “classifier” in the pop-up window. Choose “classifiers” > “lazy” > “IBk”.
3. Use the default WEKA parameters for the kNN algorithm with distance weighting equal to 1/distance: Click on the name of the classifier and select weight by 1/distance in the field “distanceWeighting”.
4. Select the number of nearest neighbors for the kNN algorithm: Write the number of nearest neighbors (1) in the field “KNN”. Close all the opened pop-up windows by selecting the button “OK” for each window.
5. Start building the model: Click the button “Start”. After the model is built, the classification metrics for tenfold cross-validation appear in the field “Classifier output”.
6. Validate the derived RSM classification model with the test set.

3.8.6 Extreme Gradient Boosting (Xgboost)

Gradient boosting is a decision-tree-based ensemble ML algorithm proposed by Breiman [28] and later developed by other researchers [29]. The weighted data from the existing decision trees are used to grow new decision trees. The weights are calculated by optimizing the loss function, i.e., the function indicating how well the model coefficients fit the underlying data. The prediction of the final ensemble model is the weighted sum of the predictions made by the previous decision tree models. The xgboost algorithm is an advanced implementation of the gradient boosting algorithm [30] that allows better control of overfitting and performs better than plain gradient boosting. WEKA allows the use of xgboost through the mlr library in R [31] via a specific R plugin. The main parameters of the algorithm need tuning to improve the prediction performance on a particular dataset. Three of the parameters are changed from their default values after tuning on the training set: nrounds (default = 100), which controls the maximum number of iterations, i.e., the number of trees to grow; eta (default = 0.3), which controls the learning rate, i.e., the rate at which the model learns patterns in data; and max_depth (default = 6), which controls the depth of the tree. To apply the xgboost algorithm through the mlr library and the R plugin with the parameters max_depth = 4, eta = 1, nrounds = 150, and the default xgboost parameters of the mlr package otherwise, follow the steps in Subheading 3.5 (Build the classification model):

1. Select xgboost as the algorithm for classification: In step 3, choose classification algorithm, and choose “classifiers” > “mlr” > “MLRClassifier”; click on the name of the classifier and select “classif.xgboost” in the field “Rlearner” in the pop-up window.

2. Enter “max_depth=4, eta=1, nrounds=150” in the field “learnerParams”; close all the opened pop-up windows by selecting the button “OK” for each window.
3. Start building the model: Click the button “Start”. After the model is built, the classification metrics for tenfold cross-validation appear in the field “Classifier output”.
4. Validate the derived xgboost classification model with the test set.

3.9 ML Model Assessment

The classification metrics are used to compare the performance of the models on the training and test sets (Table 2). The three best-performing algorithms in terms of AROC (RSM-1NN, xgboost, and RF) are utilized for the prediction of bacterial immunogens. The prediction is based on majority voting: if at least two of the three models classify a given protein as an immunogen, then it is recognized as an immunogen.
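The majority-voting rule described in this section (a protein is called an immunogen when at least two of the three selected models say so) can be sketched in a few lines. The function name and the example votes are hypothetical; the individual model predictions would come from WEKA.

```python
def majority_vote(predictions, min_votes=2):
    """Return True when at least min_votes models predict 'immunogen'.

    predictions: iterable of booleans, one per model (True = immunogen).
    """
    return sum(bool(p) for p in predictions) >= min_votes


# Hypothetical predictions of the three selected models for one protein.
votes = {"RSM-1NN": True, "xgboost": False, "RF": True}
is_immunogen = majority_vote(votes.values())
print(is_immunogen)
```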

Table 2 Summary of the performance of the ML models

Model        Set           Sensitivity (recall)  Specificity  Accuracy  Precision  AROC   APR    MCC    F1
PLS-DA       Training set  0.64                  0.672        0.656     0.661      0.701  0.665  0.312  0.65
PLS-DA       Test set      0.612                 0.791        0.702     0.745      0.739  0.758  0.41   0.672
kNN, k = 1   Training set  0.764                 0.724        0.744     0.735      0.805  0.809  0.488  0.749
kNN, k = 1   Test set      0.746                 0.836        0.791     0.82       0.828  0.84   0.584  0.781
SVM          Training set  0.696                 0.796        0.746     0.773      0.746  0.69   0.494  0.733
SVM          Test set      0.731                 0.836        0.784     0.817      0.784  0.732  0.57   0.772
RF           Training set  0.708                 0.764        0.736     0.75       0.824  0.82   0.473  0.728
RF           Test set      0.701                 0.791        0.746     0.77       0.83   0.839  0.495  0.734
RSM-1NN      Training set  0.76                  0.792        0.776     0.785      0.853  0.867  0.552  0.772
RSM-1NN      Test set      0.716                 0.925        0.821     0.906      0.881  0.89   0.656  0.8
xgboost      Training set  0.712                 0.716        0.714     0.715      0.789  0.798  0.428  0.713
xgboost      Test set      0.836                 0.746        0.791     0.767      0.856  0.879  0.584  0.8

TP true positives, TN true negatives, FP false positives, FN false negatives, AROC area under the ROC curve (sensitivity vs. 1-specificity), APR area under the PR curve (precision vs. recall), MCC Matthews correlation coefficient, F1 F1-score

4 Notes

1. The number of columns in the matrix is calculated as the product of the lag value and the square of the number of descriptors per amino acid, i.e., 8 × 5² = 200.
2. WEKA classification algorithms cannot deal with a numerical class attribute. The numerical class attribute needs to be turned into a nominal one.
3. Cross-validation is a procedure that includes the following steps: splitting the original sample randomly into N equal-sized subsets; retaining a single subset for testing the model and using the data from the remaining N - 1 subsets to train the model; repeating the process N times, with each of the N subsets used exactly once as a test set; averaging the classification metrics over the iterations to estimate the performance of the derived model.
4. The Attribute-Relation File Format (ARFF) is an ASCII text file format that describes a list of instances sharing a set of attributes, developed for use with WEKA software.
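The cross-validation procedure in Note 3 can be sketched as follows. This is a plain illustration using only the Python standard library, not WEKA's implementation, which may differ in details such as how instances are assigned to folds. The function names and the evaluate callback are hypothetical.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds subsets of near-equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, evaluate, n_folds=10, seed=0):
    """Average the score returned by evaluate(train, test) over all folds."""
    scores = [evaluate(train, test)
              for train, test in k_fold_indices(n_samples, n_folds, seed)]
    return sum(scores) / len(scores)


# Toy check: every instance appears in exactly one test fold.
splits = list(k_fold_indices(25, n_folds=5))
tested = sorted(i for _, test in splits for i in test)
print(tested == list(range(25)))
```

In practice, evaluate would train a classifier on the training indices and return one of the classification metrics computed on the test indices.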

Acknowledgment

This work was supported by the Science and Education for Smart Growth Operational Program and co-financed by the European Union through the European Structural and Investment funds (Grant No BG05M2OP001-1.001-0003).

References

1. Arnon R (2011) Overview of vaccine strategies. In: Rappuoli R (ed) Vaccine design. Innovative approaches and novel strategies. Caister Academic Press, Norfolk
2. Pizza M, Scarlato V, Masignani V, Giuliani M et al (2000) Identification of vaccine candidates against serogroup B meningococcus by whole-genome sequencing. Science 287(5459):1816–1820
3. Bagnoli F, Norais N, Ferlenghi I, Scarselli M et al (2011) Designing vaccines in the era of genomics. In: Rappuoli R (ed) Vaccine design. Innovative approaches and novel strategies. Caister Academic Press, Norfolk, pp 21–54
4. Vivona S, Bernante F, Filippini F (2006) NERVE: new enhanced reverse vaccinology environment. BMC Biotechnol 6:35. https://doi.org/10.1186/1472-6750-6-35
5. Doytchinova IA, Flower DR (2007) VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines. BMC Bioinf 8:4. https://doi.org/10.1186/1471-2105-8-4
6. He Y, Xiang Z, Mobley HLT (2010) Vaxign: the first web-based vaccine design program for reverse vaccinology and applications for vaccine development. J Biomed Biotechnol 2010:297505. https://doi.org/10.1155/2010/297505
7. Jaiswal V, Chanumolu SK, Gupta A, Chauhan RS, Rout C (2013) Jenner-predict server: prediction of protein vaccine candidates (PVCs) in bacteria based on host-pathogen interactions. BMC Bioinf 14:211. https://doi.org/10.1186/1471-2105-14-211
8. Rizwan M, Naz A, Ahmad J, Naz K, Obaid A, Parveen T et al (2017) VacSol: a high throughput in silico pipeline to predict potential therapeutic targets in prokaryotic pathogens using subtractive reverse vaccinology. BMC Bioinf 18:106. https://doi.org/10.1186/s12859-017-1540-0
9. Goodswen SJ, Kennedy PJ, Ellis JT (2014) Vacceed: a high-throughput in silico vaccine candidate discovery pipeline for eukaryotic pathogens based on reverse vaccinology. Bioinformatics 30:2381–2383. https://doi.org/10.1093/bioinformatics/btu300
10. Dalsass M, Brozzi A, Medini D, Rappuoli R (2019) Comparison of open-source reverse vaccinology programs for bacterial vaccine antigen discovery. Front Immunol 10:113. https://doi.org/10.3389/fimmu.2019.00113
11. Bowman BN, McAdam PR, Vivona S, Zhang JX, Luong T, Belew RK et al (2011) Improving reverse vaccinology with a machine learning approach. Vaccine 29:8156–8164. https://doi.org/10.1016/j.vaccine.2011.07.1422
12. Heinson AI, Gunawardana Y, Moesker B, Denman Hume CC, Vataga E, Hall Y et al (2017) Enhancing the biological relevance of machine learning classifiers for reverse vaccinology. Int J Mol Sci 18:E312. https://doi.org/10.3390/ijms18020312
13. Hellberg S, Sjöström M, Skagerberg B, Wold S (1987) Peptide quantitative structure-activity relationships, a multivariate approach. J Med Chem 30:1126–1135
14. Wold S, Jonsson J, Sjöström M, Sandberg M, Rännar S (1993) DNA and peptide sequences and chemical processes multivariately modelled by principal component analysis and partial least squares projections to latent structures. Anal Chim Acta 277:239–253
15. Dimitrov I, Zaharieva N, Doytchinova I (2020) Bacterial immunogenicity prediction by machine learning methods. Vaccines (Basel) 8(4):709. https://doi.org/10.3390/vaccines8040709
16. Zaharieva N, Dimitrov I, Flower DR, Doytchinova I (2019) VaxiJen dataset of bacterial immunogens: an update. Curr Comput Aided Drug Des 15(5):398–400. https://doi.org/10.2174/1573409915666190318121838
17. NCBI Resource Coordinators (2016) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 44:D7–D19
18. The UniProt Consortium (2019) UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 47:D506–D515
19. Frank E, Hall MA, Witten IH (2016) The WEKA workbench. In: Online appendix for “data mining: practical machine learning tools and techniques”, 4th edn. Morgan Kaufmann, Burlington
20. Venkatarajan MS, Braun W (2001) New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical-chemical properties. J Mol Model 7:445–453
21. Umetrics AB (2006) PLS. In: Multi- and megavariate data analysis part I. Umetrics Academy, Umea, p 63
22. Song Y, Liang J, Lu J, Zhao X (2017) An efficient instance selection algorithm for k nearest neighbor regression. Neurocomputing 251:26–34
23. Chang CC, Lin CJ (2011) LIBSVM: a library for support vector machines. ACM Trans Intell Syst Technol 2:1–27
24. El-Manzalawy Y, Honavar V (2005) WLSVM: integrating LibSVM into Weka environment. Software available at http://www.cs.iastate.edu/yasser/wlsvm
25. Breiman L (2001) Random forests. Mach Learn 45:5
26. Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20:832–844
27. Li S, Harner EJ, Adjeroh DA (2014) Random KNN. In: Proceedings of the IEEE international conference on data mining workshop, Shenzhen, China, 14 December 2014
28. Breiman L (1997) Arcing the edge. Technical report 486. Statistics Department, University of California, Berkeley
29. Friedman J (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29:1189–1232
30. Chen T, Guestrin C (2016) XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, San Francisco, CA, USA, 13–17 August 2016
31. Bischl B, Lang M, Kotthoff L, Schiffner J, Richter J, Studerus E, Casalicchio G, Jones Z (2016) mlr: machine learning in R. J Mach Learn Res 17(170):1–5

Chapter 21

Vaxi-DL: An Artificial Intelligence-Enabled Platform for Vaccine Development

P. Preeti, Swarsat Kaushik Nath, Nevidita Arambam, Trapti Sharma, Priyanka Ray Choudhury, Alakto Choudhury, Vrinda Khanna, Ulrich Strych, Peter J. Hotez, Maria Elena Bottazzi, and Kamal Rawal

Abstract

Vaccine development is a complex and long process. It involves several steps, including computational studies, experimental analyses, animal model system studies, and clinical trials. This process can be accelerated by using in silico antigen screening to identify potential vaccine candidates. In this chapter, we describe a deep learning-based technique which utilizes 18 biological and 9154 physicochemical properties of proteins for finding potential vaccine candidates. Using this technique, a new web-based system, named Vaxi-DL, was developed which helped in finding new vaccine candidates from bacteria, protozoa, viruses, and fungi. Vaxi-DL is available at: https://vac.kamalrawal.in/vaxidl/.

Key words Antigen prediction, COVID-19, Deep learning, In silico vaccine development, Artificial intelligence, Machine learning, Vaccine design, Vaxi-DL, mRNA vaccines, Vaccine, Deep learning pathogen models

1 Introduction

The classical principles of vaccination have greatly aided in the control and eradication of several infectious diseases such as smallpox, polio, yellow fever, chickenpox, measles, and tetanus [1]. Despite these achievements, new infections like COVID-19 or Ebola are affecting a large number of individuals across the world [2]. As per conservative estimates, the COVID-19 pandemic alone has resulted in nearly 6.7 million deaths globally [3]. Vaccine development is a very complex and long process. It generally takes more than 10 years to develop and manufacture a successful vaccine, with an estimated cost of over $500 million [4]. Vaccine development involves several key steps, including antigen discovery research, preclinical studies, clinical development, regulatory review and approval, and manufacturing and vaccine delivery [5]. Typically,

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_21, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


P. Preeti et al.

for an unknown, emerging disease, antigen discovery involves screening hundreds of molecules and may require 2–5 years. Further, preclinical studies aid in the downselection of potential vaccine candidates (PVCs) and may also take several years. Subsequently, the clinical development process involves first testing the PVCs for safety in naïve, healthy control subjects, followed by additional trials to demonstrate immunogenicity and efficacy [6]. Given the complexity of the process and the fact that most vaccine candidates fail somewhere along the line, it is highly attractive to include computational approaches to bring down costs and accelerate the vaccine development process [7–9]. To find effective vaccine candidates using computational techniques, researchers have introduced reverse vaccinology (RV), machine learning, and systems vaccinology-based approaches. A few of them have also leveraged either biological or physicochemical features of the target protein [10]. For instance, Doytchinova and Flower (2007) introduced the first alignment-free classification program to identify vaccine candidates against bacteria, viruses, and tumor antigens [11]. Prior to that, the RV approach developed by Pizza et al. (2000) led to a subunit vaccine against Neisseria meningitidis serogroup B [12]. The studies by Doytchinova and Flower, Heinson et al., and Bowman et al. employed a machine learning (ML) approach in the field of RV [11, 13, 14]. Several other approaches have also been developed, including ANTIGENpro [15], an ML-based approach by Goodswen et al. [16], Jenner-predict [17], and Vaxign-ML [18]. The existing approaches have several limitations, including a lack of experimental datasets, limited accessibility to non-expert users, and limited use cases in eukaryotic pathogens. In this chapter, we describe Vaxi-DL, an AI-based tool. It is a web-based predictive system that evaluates individual protein sequences as potential vaccine candidates using deep learning-based strategies and immunoinformatics techniques [19]. The tool has been designed to predict vaccine candidates in bacteria, protozoa, fungi, and viruses that cause infections in humans. The Vaxi-DL system utilizes 18 biological and 9154 physicochemical properties of known antigen (protein) sequences. During the development phase, it was tested on a wide variety of benchmarking datasets.

2 Materials

2.1 Sequences

We downloaded datasets of antigenic (positive) and non-antigenic (negative) proteins for bacterial, protozoan, fungal, and viral models from the literature as well as the Protegen database (https://violinet.org/protegen/). For obtaining negative samples,

AI-Based Vaccine Development


randomly downloaded sequences from the UniProt database (https://www.uniprot.org/) for each of those pathogens were used.

2.2 Software

Python scripts were used for data curation, annotation of protein sequences, selection of features, data processing, hyperparameter tuning, and construction of DL models. The R package protR was used for computing the physicochemical properties of protein sequences.

3 Methods

3.1 Construction of a Deep Learning Model

3.1.1 Data Curation

We extracted four datasets of antigenic (positive) and non-antigenic (negative) proteins for bacterial, protozoan, fungal, and viral models. Each set contained positive samples collected from the literature as well as the Protegen database [20]. For obtaining negative samples, about 100 protein sequences were randomly downloaded from the UniProt database for each of the pathogens that had provided positive samples [21]. Next, we checked the identity of these proteins using BLAST and removed redundant proteins with an identity of more than 90% to other proteins in the same set (see Note 1). Further, we checked the sequence similarity of negative protein sequences with positive protein sequences. Sequences with a similarity of less than 30% were moved to the negative dataset [22]. We used these datasets for training, validation, and testing of deep learning models for bacterial, viral, fungal, and protozoan datasets (see Note 2).

3.1.2 Annotating Protein Sequences

The physicochemical properties of the protein sequences were estimated using protR, an R package that provides a comprehensive toolkit for computing several numerical representation schemes of protein sequences. The descriptors available in protR [23] include amino acid composition, autocorrelation [24–26], Composition-Transition-Distribution (CTD) [27, 28], conjoint triad [29], quasi-sequence order [30], pseudo-amino acid composition [31], and profile-based descriptors derived from a position-specific scoring matrix (PSSM). These descriptors are highly useful in bioinformatics and chemogenomics research [32]. Proteochemometric modeling (PCM) descriptors include scale-based descriptors derived by principal component analysis (PCA), factor analysis, multidimensional scaling, amino acid properties (AAindex), 20+ classes of 2D and 3D molecular descriptors (Topological, WHIM, VHSE, etc.), and BLOSUM/PAM matrix-derived descriptors.
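To make the idea of a numerical sequence descriptor concrete, the following Python sketch computes the two simplest composition descriptors, amino acid composition (20 values) and dipeptide composition (400 values). It is an illustration only: the chapter computed its descriptors with protR in R, and the function names used here are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Fraction of each standard residue in seq (20 values)."""
    seq = seq.upper()
    counts = {a: seq.count(a) for a in AMINO_ACIDS}
    total = sum(counts.values())  # non-standard symbols are ignored
    if total == 0:
        raise ValueError("sequence contains no standard residues")
    return {a: c / total for a, c in counts.items()}


def dipeptide_composition(seq):
    """Fraction of each ordered residue pair in seq (400 values)."""
    seq = seq.upper()
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=2)}
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard symbols
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no standard dipeptides")
    return {p: c / total for p, c in counts.items()}


# Toy sequence, not taken from the chapter's datasets.
aac = amino_acid_composition("AAAC")
dpc = dipeptide_composition("AAAC")
```

Concatenating such fixed-length vectors across descriptor families is what produces feature tables with thousands of columns per protein.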
There exist several other publicly accessible tools that can be used to compute and extract essential biological and physicochemical characteristics, including Feature Extraction from Protein Sequences (FEPS) [33], MathFeature [34], PyFeat [35],


iFeatureOmega [36], SeqFeature [37], Feature Extraction based on graphical and statistical features (FEGS) [38], and D-chaos game representation (DCGR) [39]. Initially, 18 biological features were selected (Table 1), obtained with the following tools: TMPred for the number of transmembrane helices; BLAST against the human proteome to ensure non-homology with human proteins; ProtParam for instability index and molecular weight; FungalRV for adhesion prediction; DEG (Database of Essential Genes) for essential gene prediction; VFDB for checking virulence factors; SignalP (D-value) for secretory/non-secretory protein detection; GutfloraDB for non-pathogenic bacterial detection; TargetP for subcellular localization; NetMHC for checking MHC Class-I binding (number of high and weak binders) and number of peptides; NetChop for number of amino acids and cleavage sites; NetCTL for cytotoxic T-lymphocyte (CTL) epitope prediction; ChloroP for ChloroP score and CS-score; and PSORTb for subcellular localization. The 9154 physicochemical properties that were selected included quasi-sequence-order descriptors, pseudo-amino acid composition, autocorrelation descriptors such as Moreau-Broto and Geary, and other features such as hydrophobicity, polarity, and solvent accessibility (see Note 3).

3.1.3 Selection of Features

For each pathogen model, the distribution profile of scores corresponding to each attribute was determined for positive and negative protein sequences. The attributes that showed a significant difference between the two classes using Welch’s t-test (see Note 4), with a p-value of less than 0.05, were included in the model. In this way, we filtered the most distinguishing biological and physicochemical features for each pathogen. The numbers of features retained for the bacterial, protozoan, fungal, and viral models are shown in Table 2.
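The per-feature test described above can be sketched in plain Python as follows. The sketch computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom for one feature; the function name and the toy values are hypothetical, and this is not the chapter's own script. The p-value would then be read from Student's t distribution with the returned degrees of freedom, for example with `scipy.stats.ttest_ind(pos, neg, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(pos, neg):
    """Return (t, df) of Welch's unequal-variance t-test for two samples.

    Assumes each sample has at least two values and that the two samples
    are not both constant (otherwise the denominator is zero).
    """
    n1, n2 = len(pos), len(neg)
    se1 = variance(pos) / n1  # squared standard error of the first mean
    se2 = variance(neg) / n2  # squared standard error of the second mean
    t = (mean(pos) - mean(neg)) / sqrt(se1 + se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df


# Toy scores of one feature in positive and negative proteins (made up).
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

Repeating this for every column of the feature table and keeping the columns with p < 0.05 gives the kind of filter summarized in Table 2.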

3.1.4 Data Preprocessing

A dataset was created for each of the pathogen models that contained protein sequence IDs and their respective properties. The dataset was divided into a training and a testing set (Table 3). We used the StandardScaler class of the scikit-learn Python library to preprocess the data by normalization and scaling [40]. The scikit-learn library provides a range of classification, regression, and other machine learning algorithms.
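As a minimal sketch of what standard scaling does, the plain-Python functions below learn the per-feature mean and standard deviation on the training rows and apply them to any other rows. The helper names are hypothetical; in practice scikit-learn's `StandardScaler` (`fit` on the training set, `transform` on both sets) performs the same computation.

```python
from statistics import mean, pstdev


def fit_scaler(train_rows):
    """Per-column mean and population standard deviation of the training rows."""
    columns = list(zip(*train_rows))
    means = [mean(c) for c in columns]
    # A constant column has zero deviation; use 1.0 to avoid dividing by zero.
    stds = [pstdev(c) or 1.0 for c in columns]
    return means, stds


def transform(rows, means, stds):
    """Scale rows with statistics learned on the training set only."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy feature table: two proteins, two features (values are made up).
train = [[1.0, 10.0], [3.0, 10.0]]
means, stds = fit_scaler(train)
train_scaled = transform(train, means, stds)
test_scaled = transform([[5.0, 12.0]], means, stds)
```

Fitting on the training rows only keeps information from the testing set out of the preprocessing step.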

3.1.5 Hyperparameter Tuning

Next, we performed hyperparameter tuning for each of the DL models. Hyperparameter tuning involves choosing optimal values for the settings of a learning algorithm that must be fixed before the learning process begins. We estimated the optimum values of hyperparameters such as the number of hidden layers and weight regularization (see Note 5) [41].
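One common way to carry out such tuning is an exhaustive grid search, sketched below. The chapter does not state which search strategy or which candidate values were used, so the grid, the parameter names, and the toy scoring function are all assumptions made for illustration; in a real run `evaluate` would train a model and return its mean validation accuracy.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Try every combination in grid; return (best_params, best_score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # higher is better
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space; not the values used for Vaxi-DL.
grid = {"hidden_layers": [1, 2, 3, 4], "l2_weight": [1e-4, 1e-3, 1e-2]}


def toy_evaluate(params):
    """Stand-in for 'train the model and return validation accuracy'."""
    return -abs(params["hidden_layers"] - 3) - abs(params["l2_weight"] - 1e-3)


best_params, best_score = grid_search(grid, toy_evaluate)
```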


Table 1 The complete list of biological features used in the study

S. No.  Features                                              Tool used (with version)                               Reference
1       Subcellular localization                              PSORTb 1.0.2                                           [47]
2       Adhesion prediction score                             FungalRV                                               [48]
3       D-value                                               SignalP 4.1                                            [49]
4       Number of cleavage sites                              NetChop 3.1                                            [50]
5       Number of transmembrane helices                       TMPred                                                 [51]
6       Instability index                                     ProtParam                                              [52]
7       Number of amino acids                                 NetChop 3.1                                            [50]
8       Cytotoxic T-lymphocytes (CTL epitope prediction)      NetCTL 1.2                                             [53]
9       MHC Class-I binding: number of high binders           NetMHC 4.0                                             [54]
10      MHC Class-I binding: number of weak binders           NetMHC 4.0                                             [54]
11      Subcellular localization                              TargetP 1.1                                            [55]
12      Number of peptides                                    NetMHC 4.0                                             [54]
13      Molecular weight                                      ProtParam                                              [52]
14      Virulence factor prediction                           BLASTp (v2.10.0+) with VFDB                            [56]
15      Essential/non-essential gene prediction               BLASTp (v2.10.0+) with DEG                             [56]
16      Non-bacterial pathogen/BLAST with gut flora database  BLASTp (v2.10.0+) with gut flora database              [56]
17      Non-homology with human                               BLASTp (v2.10.0+) with NCBI human proteome database    [56]
18      ChloroP score and ChloroP CS-score                    ChloroP 1.1                                            [57]

3.1.6 Construction of DL Models

Table 2 Biological and physicochemical features obtained for the bacterial, protozoan, fungal, and viral models after feature selection

Sr. No.  Model      No. of biological features   No. of physicochemical features   Total number of
                    (out of 18 total features)   (out of 9154 total features)      significant features
1        Bacterial  11                           1436                              1447
2        Protozoan  15                           2059                              2074
3        Fungal     14                           2787                              2801
4        Viral      13                           1741                              1754

The features were selected/filtered using Welch’s t-test with a p-value less than 0.05

Table 3 The number of sequences in the training and testing subsets for the bacterial, protozoan, fungal, and viral datasets

Dataset    Subset    Number of positive sequences  Number of negative sequences  Total  Net total
Bacterial  Training  276                           276                           552    652
           Testing   50                            50                            100
Protozoan  Training  130                           218                           348    468
           Testing   45                            75                            120
Fungal     Training  120                           816                           936    1086
           Testing   19                            131                           150
Viral      Training  338                           339                           677    837
           Testing   80                            80                            160

Datasets were obtained from the literature

All four DL models contain a single input layer consisting of multiple nodes. Each node is based on one of the model’s attributes. Each of these models is built using a different number of hidden layers. Each hidden layer contains a Fully Connected Layer (FCL), a leaky ReLU activation layer, and a batch normalization layer (see Note 6). The leaky ReLU has a slight slope for negative values instead of a flat slope. The batch normalization layer speeds up training and makes it more stable by rescaling and re-centering the layer’s inputs. The output layer contains an FCL with two nodes that represent the two output classes: vaccine candidate and non-vaccine candidate. The softmax activation and a constant bias initializer were used, with the bias value computed as log(number of positive/number of negative) to obtain minimal loss at the early stages of training. The softmax activation function predicts a multinomial probability distribution; for this study, a binomial probability distribution was applied. Further, the Adam optimizer [42], which iteratively updates the network weights based on the training data, was employed with exponential learning rate decay and a categorical cross-entropy loss function. The schematic representation of model construction is shown in Fig. 1.
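The role of the leaky ReLU, the softmax output, and the log(positive/negative) bias can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions rather than the Vaxi-DL implementation: the negative slope of 0.01 is an arbitrary illustrative value, and the bias is assumed to sit on the positive-class node with a zero bias on the other node. Under that assumption, an untrained output layer already predicts the positive-class frequency of the training set, which is why the initial loss is low.

```python
from math import exp, log


def leaky_relu(x, slope=0.01):
    """Identity for positive inputs, a small linear slope for negative ones."""
    return x if x > 0 else slope * x


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def initial_bias(n_pos, n_neg):
    """Constant output bias log(number of positive / number of negative)."""
    return log(n_pos / n_neg)


# Fungal training set in Table 3: 120 positive and 816 negative sequences.
bias = initial_bias(120, 816)
p_neg, p_pos = softmax([0.0, bias])  # logits before any training
```

Here `p_pos` equals 120/936, the fraction of positive sequences in the fungal training set.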

Fig. 1 The schematic representation of deep learning pipeline construction

3.1.7 Training, Validation, and Testing of DL Model

The training subset was further split into training and validation sets using stratified K-fold cross-validation [40] to evaluate five individual iterations of each DL model. Stratified K-fold is a variant of regular K-fold cross-validation intended for classification problems. Instead of splitting the dataset randomly, it preserves in each fold the ratio between the target classes found in the full dataset. The class weights parameter of the model fit function was used to balance the imbalanced training datasets during the training and validation of each DL model. The antigens and non-antigens that were accurately predicted were denoted as true positives (TP) and true negatives (TN), respectively, while the antigens and non-antigens that were inaccurately predicted were denoted as false negatives (FN) and false positives (FP), respectively. The predictions made by the iterations were used to compute the average validation accuracy ([TP + TN]/[TP + TN + FP + FN]), sensitivity (TP/[TP + FN]), specificity (TN/[TN + FP]), and recall (TP/[TP + FN], identical to sensitivity) of the individual models. To increase prediction performance, the testing accuracy was assessed by integrating the raw probability distributions produced by the softmax activation function in the five iterations, using two ensemble learning procedures: consensus and average.

3.1.8 Evaluating Performance

The performance evaluation was done using various metrics, including binary accuracy, sensitivity, specificity, precision, recall, and Area under the curve (AUC). Five different strategies were used to avoid overfitting, including weight bias regularization in the FCL [43], batch normalization layer [44], Adam optimizer with exponential learning rate decay, categorical cross-entropy loss function [45], and early stopping [46].
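The threshold-based metrics, and the two ways of combining the five model iterations, can be sketched as follows. The metric formulas are the standard definitions, with specificity taken as TN/(TN + FP). The chapter does not spell out how "consensus" is computed, so the majority-vote rule below, the 0.5 threshold, and the function names are assumptions for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics from the four confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }


def average_ensemble(probs, threshold=0.5):
    """Call a protein positive if the mean positive-class probability passes the threshold."""
    return sum(probs) / len(probs) >= threshold


def consensus_ensemble(probs, threshold=0.5):
    """Call a protein positive if most models individually call it positive (assumed rule)."""
    votes = sum(p >= threshold for p in probs)
    return votes > len(probs) / 2


# Made-up counts and made-up probabilities from five model iterations.
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
probs = [0.9, 0.45, 0.45, 0.45, 0.4]
by_average = average_ensemble(probs)
by_consensus = consensus_ensemble(probs)
```

In this toy case the two rules disagree: one confident model lifts the mean above 0.5, while only one of five models votes positive.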

3.2 Testing the Models

3.2.1 Benchmarking with Independent Datasets

The benchmarking of the Vaxi-DL model was done using an independent dataset. This dataset contained 57 known bacterial antigens, 14 protozoan antigens, and 7 viral antigens that were curated manually. Each of the sequences was evaluated on Vaxi-DL. The performance of Vaxi-DL was compared with that of other vaccine candidate prediction tools such as VaxiJen, Vaxign-ML, and ANTIGENpro.

3.2.2 Comparison with Known Vaccines

We further extracted three known protective antigens from MenB4C (meningococcal serogroup B vaccine) and ran them on Vaxi-DL. The results were compared to VaxiJen and Vaxign-ML. Additionally, a list of 17 known vaccines was retrieved from the literature for Bordetella pertussis, Mycobacterium tuberculosis (MPT), and Corynebacterium diphtheriae (DPT). These proteins, either in licensed vaccines or in clinical trials, were run on Vaxi-DL and were compared with the output from other tools.

Known Vaccines from Protegen

192 known vaccines were extracted from the Protegen database. The dataset consisted of 60 bacterial and 132 viral sequences. The sequences were run on Vaxi-DL and their results were compared with VaxiJen and Vaxign-ML, respectively.

Known Vaccines from the Vaxgen Database

1859 known vaccine candidates were downloaded from the Vaxgen database. The sequences were from multiple organisms, including viruses, bacteria, rodents, algae, humans, birds, monkeys, and mice.

We selected 146 protozoan, 586 bacterial, 410 viral, and 11 fungal sequences. These sequences were run on Vaxi-DL and the results were compared with VaxiJen and Vaxign-ML.

Licensed Vaccines from the Violin Database

A list of 24 licensed vaccines was retrieved from the Violin database and analyzed in Vaxi-DL. Fourteen were viral vaccines and 10 were bacterial vaccines.

Potential Vaccine Candidates from Known Pathogens

Similarly, 219 protein sequences were obtained from 37 different pathogens, including 15 bacterial, 14 viral, and 8 protozoan species with literature evidence of antigenicity. All sequences were analyzed on Vaxi-DL for their antigenicity.

4 Notes

1. An identity of 90% or more may indicate a high degree of similarity between the query and the aligned sequences, as well as highly conserved regions in related species. Therefore, to remove this redundancy, we eliminated sequences having an identity of more than 90%. Additionally, the e-value may be used to help determine the homology and similarity between the aligned sequences.
2. The data are divided into training and testing sets. The training data cover 70–80% of the total data and the testing data cover the remaining 20–30%.
3. Although alternative empirical and computational methods exist for obtaining the functional and physicochemical properties of protein sequences, protR is preferred due to its wide applicability and its integration of multiple functions and representation schemes within a single package.
4. Welch’s t-test is preferable because it is suitable for cases of unequal population variances and possibly unequal sample sizes, which we are more likely to encounter.
5. Hidden layers are required when the data need to be separated non-linearly, whereas weight regularization results in a simpler linear network and slight underfitting of the training data. This helps in enhancing the performance of the algorithm, especially in the case of a large variance in the hyperparameters.
6. An FCL is a layer in which the input data is connected to every activation unit of the next layer. The ReLU, or rectified linear unit, is an activation function used in deep learning that returns 0 when a negative input is provided and returns the input itself when a positive input is provided.
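The redundancy removal of Note 1 and the negative-set filter of Subheading 3.1.1 can be sketched as below. The sketch assumes that pairwise percent identities have already been obtained (for instance, parsed from BLAST output) and are available through a lookup function; the greedy keep-first strategy, the use of a single identity measure for both thresholds, and all names and values are assumptions for illustration, not the authors' script.

```python
def remove_redundant(ids, identity, max_identity=90.0):
    """Greedy filter: keep a sequence only if it is not more than
    max_identity percent identical to any sequence already kept."""
    kept = []
    for seq_id in ids:
        if all(identity(seq_id, other) <= max_identity for other in kept):
            kept.append(seq_id)
    return kept


def filter_negatives(neg_ids, pos_ids, identity, max_similarity=30.0):
    """Keep negatives that are less than max_similarity percent
    similar to every positive sequence."""
    return [n for n in neg_ids
            if all(identity(n, p) < max_similarity for p in pos_ids)]


# Hypothetical pairwise identities (percent); missing pairs count as no hit.
pairs = {
    frozenset(("A", "B")): 95.0,
    frozenset(("A", "C")): 40.0,
    frozenset(("B", "C")): 42.0,
    frozenset(("N1", "A")): 55.0,
    frozenset(("N2", "C")): 12.0,
}


def identity(x, y):
    return pairs.get(frozenset((x, y)), 0.0)


positives = remove_redundant(["A", "B", "C"], identity)
negatives = filter_negatives(["N1", "N2"], positives, identity)
```

The order in which sequences are visited decides which member of a redundant pair survives, so a real pipeline would fix that order deliberately.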


Acknowledgments

The computational facility used in this work was supported by Robert J. Kleberg Jr. and Helen C. Kleberg Foundation. We are also thankful to Amity University and ICMR [BMI/12(66)/2021 2021-6442] for the support provided during the conduct of this study. Preeti P has received financial support from SERB [File Number: CVD/2020/000842]. The computational facility used for hosting the server was provided by DBT, Government of India [BT/PR17252/BID/7/708/2016].

References

1. Apostolopoulos V (2010) New generation vaccines. Expert Rev Vaccines 9(6):551–553
2. Hotez P (2021) Preventing the next pandemic and tackling antiscience: an interview with Peter Hotez. Future Microbiol 16(8):539–541
3. WHO Coronavirus (COVID-19) Dashboard. Available online: https://covid19.who.int. Accessed on 2 Jan 2023
4. Pronker ES, Weenen TC, Commandeur H, Claassen EH, Osterhaus AD (2013) Risk in vaccine research and development quantified. PLoS One 8(3):e57755
5. IFPMA (2019) The complex journey of a vaccine. Retrieved 15th Sept, 2022, from https://www.ifpma.org/wp-content/uploads/2019/07/IFPMA-ComplexJourney-2019_FINAL.pdf
6. Bernstein A, Pulendran B, Rappuoli R (2011) Systems vaccinomics: the road ahead for vaccinology. OMICS 15(9):529–531
7. Rawal K, Sinha R, Abbasi BA, Chaudhary A, Nath SK, Kumari P, Preeti P, Saraf D, Singh S, Mishra K, Gupta P, Mishra A, Sharma T, Gupta S, Singh P, Sood S, Subramani P, Dubey AK, Strych U, Hotez PJ, Bottazzi ME (2021) Identification of vaccine targets in pathogens and design of a vaccine using computational approaches. Sci Rep 11(1):17626
8. Abbasi BA, Saraf D, Sharma T, Sinha R, Singh S, Sood S, Gupta P, Gupta A, Mishra K, Kumari P, Rawal K (2022) Identification of vaccine targets & design of vaccine against SARS-CoV-2 coronavirus using computational and deep learning-based approaches. PeerJ 10:e13380
9. Rappuoli R, Hanon E (2018) Sustainable vaccine development: a vaccine manufacturer’s perspective. Curr Opin Immunol 53:111–118
10. Dalsass M, Brozzi A, Medini D, Rappuoli R (2019) Comparison of open-source reverse vaccinology programs for bacterial vaccine antigen discovery. Front Immunol 10:113
11. Doytchinova IA, Flower DR (2007) VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines. BMC Bioinf 8:4
12. Pizza M, Scarlato V, Masignani V, Giuliani MM, Arico B, Comanducci M, Jennings GT, Baldi L, Bartolini E, Capecchi B, Galeotti CL, Luzzi E, Manetti R, Marchetti E, Mora M, Nuti S, Ratti G, Santini L, Savino S, Scarselli M, Storni E, Zuo P, Broeker M, Hundt E, Knapp B, Blair E, Mason T, Tettelin H, Hood DW, Jeffries AC, Saunders NJ, Granoff DM, Venter JC, Moxon ER, Grandi G, Rappuoli R (2000) Identification of vaccine candidates against serogroup B meningococcus by whole-genome sequencing. Science 287(5459):1816–1820
13. Heinson AI, Gunawardana Y, Moesker B, Hume CC, Vataga E, Hall Y, Stylianou E, McShane H, Williams A, Niranjan M, Woelk CH (2017) Enhancing the biological relevance of machine learning classifiers for reverse vaccinology. Int J Mol Sci 18(2):312
14. Bowman BN, McAdam PR, Vivona S, Zhang JX, Luong T, Belew RK, Sahota H, Guiney D, Valafar F, Fierer J, Woelk CH (2011) Improving reverse vaccinology with a machine learning approach. Vaccine 29(45):8156–8164
15. Magnan CN, Zeller M, Kayala MA, Vigil A, Randall A, Felgner PL, Baldi P (2010) High-throughput prediction of protein antigenicity using protein microarray data. Bioinformatics 26(23):2936–2943
16. Goodswen SJ, Kennedy PJ, Ellis JT (2013) A novel strategy for classifying the output from an in silico vaccine discovery pipeline for eukaryotic pathogens using machine learning algorithms. BMC Bioinf 14:315
17. Jaiswal V, Chanumolu SK, Gupta A, Chauhan RS, Rout C (2013) Jenner-predict server: prediction of protein vaccine candidates (PVCs) in bacteria based on host-pathogen interactions. BMC Bioinf 14:211
18. Ong E, Wang H, Wong MU, Seetharaman M, Valdez N, He Y (2020) Vaxign-ML: supervised machine learning reverse vaccinology model for improved prediction of bacterial protective antigens. Bioinformatics 36(10):3185–3191
19. Rawal K, Sinha R, Nath SK, Preeti P, Kumari P, Gupta S, Sharma T, Strych U, Hotez P, Bottazzi ME (2022) Vaxi-DL: a web-based deep learning server to identify potential vaccine candidates. Comput Biol Med 145:105401
20. Yang B, Sayers S, Xiang Z, He Y (2011) Protegen: a web-based protective antigen database and analysis system. Nucleic Acids Res 39(Database issue):D1073–D1078
21. UniProt Consortium (2021) UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res 49(D1):D480–D489
22. Chen Q, Zobel J, Zhang X, Verspoor K (2016) Supervised learning for detection of duplicates in genomic sequence databases. PLoS One 11(8):e0159644
23. Xiao N, Cao DS, Zhu MF, Xu QS (2015) protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. Bioinformatics 31(11):1857–1859
24. Kawashima S, Ogata H, Kanehisa M (1999) AAindex: amino acid index database. Nucleic Acids Res 27(1):368–369
25. Kawashima S, Kanehisa M (2000) AAindex: amino acid index database. Nucleic Acids Res 28(1):374
26. Kawashima S, Pokarowski P, Pokarowska M, Kolinski A, Katayama T, Kanehisa M (2008) AAindex: amino acid index database, progress report 2008. Nucleic Acids Res 36(Database issue):D202–D205
27. Dubchak I, Muchnik I, Mayor C, Dralyuk I, Kim SH (1999) Recognition of a protein fold in the context of the Structural Classification of Proteins (SCOP) classification. Proteins 35(4):401–407
28. Dubchak I, Muchnik I, Holbrook SR, Kim SH (1995) Prediction of protein folding class using global description of amino acid sequence. Proc Natl Acad Sci U S A 92(19):8700–8704
29. Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, Li Y, Jiang H (2007) Predicting protein-protein interactions based only on sequences information. Proc Natl Acad Sci U S A 104(11):4337–4341
30. Chou KC (2000) Prediction of protein subcellular locations by incorporating quasi-sequence-order effect. Biochem Biophys Res Commun 278(2):477–483
31. Chou KC (2001) Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins 43(3):246–255
32. Rifaioglu AS, Atas H, Martin MJ, Cetin-Atalay R, Atalay V, Dogan T (2019) Recent applications of deep learning and machine intelligence on in silico drug discovery: methods, tools and databases. Brief Bioinform 20(5):1878–1912
33. Ismail H, White C, Al-Barakati H, Newman RH, Kc DB (2022) FEPS: a tool for feature extraction from protein sequence. Methods Mol Biol 2499:65–104
34. Bonidia RP, Domingues DS, Sanches DS, de Carvalho A (2022) MathFeature: feature extraction package for DNA, RNA and protein sequences based on mathematical descriptors. Brief Bioinform 23(1):bbab434
35. Muhammod R, Ahmed S, Md Farid D, Shatabda S, Sharma A, Dehzangi A (2019) PyFeat: a Python-based effective feature generation tool for DNA, RNA and protein sequences. Bioinformatics 35(19):3831–3833
36. Chen Z, Liu X, Zhao P, Li C, Wang Y, Li F, Akutsu T, Bain C, Gasser RB, Li J, Yang Z, Gao X, Kurgan L, Song J (2022) iFeatureOmega: an integrative platform for engineering, visualization and analysis of features from molecular sequences, structural and ligand data sets. Nucleic Acids Res 50(W1):W434–W447
37. Wu S, Liang MP, Altman RB (2008) The SeqFEATURE library of 3D functional site models: comparison to existing methods and applications to protein function annotation. Genome Biol 9(1):R8
38. Mu Z, Yu T, Liu X, Zheng H, Wei L, Liu J (2021) FEGS: a novel feature extraction model for protein sequences and its applications. BMC Bioinf 22(1):297
39. Mu Z, Yu T, Qi E, Liu J, Li G (2019) DCGR: feature extractions from protein sequences based on CGR via remodeling multiple information. BMC Bioinf 20(1):351
40. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
41. Van Rijn JN, Hutter F (2018) Hyperparameter importance across datasets. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp 2367–2376
42. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
43. Xu Q, Zhang M, Gu Z, Pan G (2019) Overfitting remedy by sparsifying regularization on fully-connected layers of CNNs. Neurocomputing 328:69–74
44. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: Bach F, Blei D (eds) Proceedings of the 32nd international conference on machine learning. Proc Mach Learn Res (PMLR) 37:448–456
45. Zhang Z, Sabuncu M (2018) Generalized cross entropy loss for training deep neural networks with noisy labels. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R (eds) Advances in neural information processing systems, vol 31. Curran Associates, Inc
46. Prechelt L (1998) Early stopping – but when? In: Orr GB, Müller K-R (eds) Neural networks: tricks of the trade. Springer Berlin Heidelberg, Berlin, Heidelberg, pp 55–69
47. Gardy JL, Spencer C, Wang K, Ester M, Tusnady GE, Simon I, Hua S, deFays K, Lambert C, Nakai K, Brinkman FS (2003) PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Res 31(13):3613–3617
48. Chaudhuri R, Ansari FA, Raghunandanan MV, Ramachandran S (2011) FungalRV: adhesin prediction and immunoinformatics portal for human fungal pathogens. BMC Genomics 12:192
49. Petersen TN, Brunak S, von Heijne G, Nielsen H (2011) SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Methods 8(10):785–786
50. Nielsen M, Lundegaard C, Lund O, Kesmir C (2005) The role of the proteasome in generating cytotoxic T-cell epitopes: insights obtained from improved predictions of proteasomal cleavage. Immunogenetics 57(1–2):33–41
51. Hofmann K, Stoffel W (1993) TMbase-A database of membrane spanning proteins segments. Biol Chem Hoppe Seyler 374:166
52. Wilkins MR, Gasteiger E, Bairoch A, Sanchez JC, Williams KL, Appel RD, Hochstrasser DF (1999) Protein identification and analysis tools in the ExPASy server. Methods Mol Biol 112:531–552
53. Larsen MV, Lundegaard C, Lamberth K, Buus S, Lund O, Nielsen M (2007) Large-scale validation of methods for cytotoxic T-lymphocyte epitope prediction. BMC Bioinf 8:424
54. Andreatta M, Nielsen M (2016) Gapped sequence alignment using artificial neural networks: application to the MHC class I system. Bioinformatics 32(4):511–517
55. Emanuelsson O, Nielsen H, Brunak S, von Heijne G (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 300(4):1005–1016
56. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410
57. Emanuelsson O, Nielsen H, von Heijne G (1999) ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci 8(5):978–984

Chapter 22

A Web-Based Method for the Identification of IL6-Based Immunotoxicity in Vaccine Candidates

Anjali Dhall, Sumeet Patiyal, Neelam Sharma, Salman Sadullah Usmani, and Gajendra P. S. Raghava

Abstract

Interleukin 6 (IL6) is a major pro-inflammatory cytokine that plays a pivotal role in both innate and adaptive immune responses. In the past, a number of studies reported that high levels of IL6 promote the proliferation of cancer, autoimmune disorders, and cytokine storm in COVID-19 patients. Thus, it is extremely important to identify and remove the antigenic regions from a therapeutic protein or vaccine candidate that may induce IL6-associated immunotoxicity. In order to overcome this challenge, our group has developed a computational tool, IL6pred, for discovering IL6-inducing peptides in a vaccine candidate. The aim of this chapter is to describe the potential applications and methodology of IL6pred. It sheds light on the prediction, designing, and scanning modules of the IL6pred webserver and standalone package (https://webs.iiitd.edu.in/raghava/il6pred/).

Key words IL6-inducing peptides, Vaccine candidate, Therapeutic peptides, Cytokines, Immunotoxicity

1 Introduction

In the past, a number of vaccines have been developed to safely elicit immune response against infections caused by different pathogens [1]. In the current immunization strategies, subunit vaccines are being considered as an alternative to the conventional attenuation techniques, and even in the COVID era, many vaccines were developed by altogether different strategies, such as Novavax, which is a protein subunit vaccine and approved by the US FDA [2, 3]. Subunit vaccines consist of fragments of protein or peptides from the pathogen that are capable of inducing protective immune response against infectious diseases [4–6]. These novel peptide subunit vaccines act as promising candidates for developing

Anjali Dhall and Sumeet Patiyal contributed equally with all other contributors.

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_22, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

317

318

Anjali Dhall et al.

immunization against a number of diseases, including tuberculosis, malaria, hepatitis B, COVID-19, and cancer [7–12]. One of the major challenges in designing the vaccine is the identification of antigenic regions that may induce desired immune response. The ideal scenario is to experimentally validate the immune response to every potential fragment or peptide of the pathogen proteome, which will be very expensive and time-consuming. To overcome this, several computational and in silico approaches have been developed to assist experimental scientists [13, 14]. Identification of antigenic regions that bind to MHC and activate T-helper cells, which further release cytokines; is crucial for designing subunit vaccine and immunotherapies. In the past, a number of prediction methods have been developed for the identification of cytokine-inducing peptides such as CytoPred [15], IL4pred [16], IL2pred [17], IFNepitope [18], IL10pred [19], IL17eScan [20], AntiInflam [21], IL13pred [22]. Interleukin 6 (IL6) is one of the most important and multifunctional cytokines which plays pivotal roles in acute phase responses [23], inflammation [24], hematopoiesis [25], T-cell proliferation, organ development [26], and innate and adaptive immune responses [27]. Moreover, several studies reveal that the elevated levels of IL6 have been observed in a number of diseases, including, cancer, insulin resistance, rheumatoid arthritis, coronary heart disease, cytokine storm or cytokine release syndrome in severe COVID-19 patients [25, 28, 29]. Therefore, in various disease conditions, the presence of IL6-inducing peptides are checked or anti-IL6 therapy is provided. Protein or peptide-based therapeutics have been gaining tremendous scientific attention over the last few years [30]. Subunit vaccine, made up of fragments of pathogen or antigenic regions, is safer to use than a vaccine containing whole pathogens. 
However, it is possible that these peptide/protein-based therapeutics can cause adverse immune effects or immunotoxicity due to the overproduction of some pro-inflammatory cytokines, for example, IL6. Therefore, it is of utmost importance to identify the IL6-inducing regions while designing protein-based vaccines. In this chapter, we will discuss the IL6-inducing peptide prediction method, IL6Pred [31]. This computational tool is used for the prediction, scanning, and designing of peptides that have the potential to induce the production of IL6 cytokine. IL6-inducing regions in a vaccine candidate can be checked by “Scan” module of our tool and modulated by “Design” module to identify the minimum number of mutations required to change the IL6-inducing region to non-inducing. In addition, “Predict” module of IL6Pred can be used for the prediction of IL6-inducing peptides. Experimental biologists and researchers can utilize this module to check the IL6-inducing potential in the therapeutic peptide before going to clinical trials and investigation.

2 Materials

The experimentally validated IL6-inducing and non-inducing peptide datasets for the prediction models can be obtained from the immune epitope database (IEDB) [32]. The best models were incorporated in the webserver and standalone package. In the webserver, we have incorporated five major modules: (1) identification of IL6-inducing peptides; (2) designing of non-IL6-inducing peptides; (3) identification of IL6-inducing regions in an antigen; (4) scanning of IL6-specific motifs; and (5) similarity search against experimentally validated IL6 peptides. In addition, we have also provided Python- and Perl-based standalone packages, which can be used for the prediction of IL6-inducing peptides in large datasets.

3 Methods

3.1 Brief Description of IL6pred

IL6pred [31] is a method developed using several machine learning classifiers to predict, scan, and design IL6-inducing peptides. It employs the top-10 features, which are calculated using the composition-based module of Pfeature [33]. The machine learning–based classifiers implemented in this study include random forest (RF), decision tree (DT), extreme gradient boosting (XGB), logistic regression (LR), k-nearest neighbor (kNN), and Gaussian Naïve Bayes (GNB). IL6pred has five major modules (predict, design, protein scan, motif scan, and BLAST scan) to predict the IL6-inducing potential of the amino acid sequences provided as input. The predict and design modules accept only sequences of 8–25 amino acid residues, whereas the remaining modules have no length restriction. Each module is explained in detail in the following sections. IL6pred is available as a webserver at https://webs.iiitd.edu.in/raghava/il6pred and as Python- and Perl-based standalone packages at https://webs.iiitd.edu.in/raghava/il6pred/stand.html. It is also accessible as a Docker-based resource through GPSRdocker [34] and on GitHub at https://github.com/raghavagps/il6pred.
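To make the idea of composition-based features and the 8–25 residue restriction concrete, the following minimal Python sketch checks a peptide's length and computes its percent amino acid composition. It is an illustration only: the function names are hypothetical, the example peptide is made up, and the text does not state which ten features IL6pred actually selects, so simple amino acid composition is used here as an assumed stand-in for the kind of descriptor Pfeature computes.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, single-letter code


def is_valid_peptide(seq, min_len=8, max_len=25):
    """Return True if seq contains only standard residues and has an allowed length."""
    seq = seq.upper()
    return min_len <= len(seq) <= max_len and all(r in AMINO_ACIDS for r in seq)


def amino_acid_composition(seq):
    """Return the percentage of each standard residue in seq (20 values)."""
    seq = seq.upper()
    counts = Counter(seq)
    return {aa: 100.0 * counts[aa] / len(seq) for aa in AMINO_ACIDS}


peptide = "ACDKLMAAGH"  # made-up example peptide, not taken from IEDB
if is_valid_peptide(peptide):
    composition = amino_acid_composition(peptide)
    # Print only the residues that actually occur in the peptide
    print({aa: round(pct, 1) for aa, pct in composition.items() if pct > 0})
```

A feature vector of this kind (or a selected subset of it) is what a classifier such as a random forest would receive as input; the selection of the top-10 features itself is not reproduced here.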

3.2 Identification of IL6-Inducing Peptides

The “Predict” module estimates the IL6-inducing potential of epitopes/peptides. It allows users to submit multiple query sequences and accepts peptides of 8–25 amino acid residues as input (see Note 1). Six machine learning classifier-based models are implemented in the backend, each using the top-10 features to make predictions. By default, the server predicts IL6-inducing peptides using an RF-based model, which showed the highest performance on the training and validation datasets (see Note 2). However, the user may select any of the provided machine learning–based models (RF, DT, XGB, LR, kNN, GNB). The default threshold varies with the selected model but can be changed by users as required. This module also provides the option to choose from ten physicochemical properties to be calculated and displayed for each input peptide. The module gives a binary result, i.e., whether the given peptide is an IL6 inducer or an IL6 non-inducer, based on the prediction score. Along with the prediction, the result page displays the calculated values of the selected physicochemical properties for each peptide, as shown in Fig. 1. The result is downloadable in comma-separated value (.csv) format.

Fig. 1 The depiction of “Predict” module of IL6Pred with output page

3.3 Designing of Non-IL6-Inducing Peptides

The second module of IL6pred is “Design,” which helps users find the minimum number of mutations required to change the nature of an input peptide from IL6 inducer to non-inducer or vice versa. This is achieved by generating all possible mutants of the peptide, substituting each amino acid at each position, and then predicting the IL6-inducer probability score of each mutated peptide (see Note 3). The user can paste or upload a sequence file in single-letter amino acid code, with a length restriction of 8–25 amino acids. This module also allows the user to select the desired machine learning model and the respective threshold. The mutated peptides are predicted as IL6 inducers or non-inducers based on the prediction score and the selected threshold, and the predictions are displayed on the result page in tabular form, as shown in Fig. 2 (see Note 4). The peptides can be sorted in ascending or descending order of their prediction score. The user can select the peptide with the highest score for further studies. The result is downloadable in the .csv format.

Fig. 2 The illustration of “Design” module of IL6Pred with output page

3.4 Identification of IL6-Inducing Peptides in Antigen

The scanning module of the server allows one to identify IL6-inducing regions in a protein, antigen, or vaccine candidate. The server generates all possible overlapping segments of a user-defined length (between 8 and 25) for an input protein submitted by the user (see Note 5). These overlapping segments are classified as IL6 inducers or non-inducers by IL6pred. The module takes amino acid sequences in single-letter code as input, which can be pasted or uploaded in a file. The result can be displayed in two formats: tabular and graphical. In the tabular format, the prediction score on which the final prediction is based is shown for each generated peptide (see Fig. 3), whereas in the graphical mode, only IL6-inducing regions are highlighted. The results in the tabular format are downloadable and can be sorted by prediction score. The user can exploit these IL6-inducing regions to make the vaccine candidate more effective.
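The generation of overlapping segments described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the server's code: the function name and the example protein fragment are hypothetical, and a step of one residue between consecutive segments is assumed.

```python
def overlapping_windows(protein, window=9):
    """Yield (start, segment) for every overlapping segment; start is 1-based."""
    if not 8 <= window <= 25:
        raise ValueError("window length must be between 8 and 25")
    for i in range(len(protein) - window + 1):
        yield i + 1, protein[i:i + window]


# Made-up 16-residue fragment used only to show the output shape
segments = list(overlapping_windows("MKTAYIAKQRQISFVK", window=9))
for start, segment in segments:
    print(start, segment)
```

A protein of length L scanned with window length w yields L - w + 1 segments, each of which would then be scored by the selected model.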


Fig. 3 The “Protein Scan” module of IL6Pred showing output in tabular and graphical formats

3.5 Scanning of IL6-Specific Motifs

In the IL6pred study, a number of motifs were discovered that occur exclusively in IL6-inducing peptides, which suggests that they have the potential to induce the cytokine IL6. The motif scan module allows users to identify these IL6-specific motifs in a protein/antigen or vaccine candidate. Because the presence of specific motifs may result in the induction of IL6, it is worth checking query sequences for IL6-specific motifs. In the motif scan module, the submitted peptide or protein sequences are predicted as IL6 inducers or non-inducers based on the presence of motifs that are specific to IL6-inducing peptides. The user is requested to paste the sequences or upload them in a file in the FASTA format. This module searches for motifs in the submitted sequences using two different algorithms, i.e., MEME/MAST [35] and MERCI (see Note 6). The output page presents the results in two formats: tabular and graphical. In the tabular form, an input sequence is assigned as an IL6 inducer if one or more motifs are found in the sequence; otherwise, it is predicted as a non-inducer. Below the table, each motif is highlighted in the sequences in red and in a larger font. The result in tabular form is downloadable in comma-separated value format. The graphical depiction of the motif scan module with an example is shown in Fig. 4.
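The decision rule of the motif scan (inducer if at least one motif is found, non-inducer otherwise) can be illustrated with the simplified Python sketch below. It uses exact substring matching only, which is a deliberate simplification: MEME/MAST works with position-specific scoring matrices and MERCI motifs may contain gaps or residue classes. The motifs and the query sequence are placeholders invented for this example and are not motifs reported for IL6pred.

```python
def find_motifs(sequence, motifs):
    """Return {motif: [1-based start positions]} for motifs present in sequence."""
    hits = {}
    for motif in motifs:
        positions = []
        start = sequence.find(motif)
        while start != -1:
            positions.append(start + 1)
            start = sequence.find(motif, start + 1)  # allow overlapping matches
        if positions:
            hits[motif] = positions
    return hits


PLACEHOLDER_MOTIFS = ["LLK", "AAL"]  # hypothetical, NOT motifs from the IL6pred study
hits = find_motifs("MALLKAALLKG", PLACEHOLDER_MOTIFS)
label = "IL6 inducer" if hits else "IL6 non-inducer"
print(label, hits)
```

The recorded start positions are what a graphical view would need in order to highlight each motif within the sequence.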


Fig. 4 The “Motif Scan” module of IL6Pred showing search for IL6-specific motifs

3.6 BLAST-Based Similarity Search

The BLAST scan module of IL6pred allows users to identify the regions in a vaccine candidate that exhibit high similarity with experimentally validated IL6-inducing peptides [36]. It takes amino acid sequences in the FASTA format, which can be pasted or uploaded in a file. The default e-value is set at 0.001 but can be changed by the user as required. A customized database of the sequences used in this study is available in the backend of this module (see Note 7). The submitted sequences are searched against this database at the user-specified e-value, and, based on the hits found, each sequence is predicted as an IL6 inducer or non-inducer on the result page, as shown in Fig. 5. The result is presented in tabular format and is downloadable in the .csv format.
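For readers who wish to reproduce a comparable similarity search locally, the sketch below assembles a blastp command with the default e-value of 0.001 and parses BLAST+ tabular output (-outfmt 6). It is a hedged illustration, not the server's backend: it assumes a local NCBI BLAST+ installation and a protein database already built with makeblastdb, the file and database names are placeholders, and the sample output line is invented. The command is only constructed here, not executed.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6)
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def blastp_command(query_fasta, db_name, evalue=0.001):
    """Build the argument list for a blastp search (names are placeholders)."""
    return ["blastp", "-query", query_fasta, "-db", db_name,
            "-evalue", str(evalue), "-outfmt", "6"]


def parse_tabular(text, max_evalue=0.001):
    """Parse -outfmt 6 text and keep hits with e-value <= max_evalue."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue
        hit = dict(zip(FIELDS, row))
        hit["evalue"] = float(hit["evalue"])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return hits


# Invented output line, for illustration only
sample = "query1\tpep_17\t100.0\t12\t0\t0\t5\t16\t1\t12\t2e-05\t27.3\n"
print(" ".join(blastp_command("candidate.fasta", "il6_inducers")))
print(parse_tabular(sample))
```

The qstart and qend columns of each retained hit indicate which region of the vaccine candidate resembles a validated peptide.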

3.7 Standalone Package

In addition to the web service, IL6pred is available as a standalone software package written in Python and Perl. This standalone software can be installed on any computer that supports a compatible version of Python and Perl. The packages can be downloaded from the IL6pred webserver at https://webs.iiitd.edu.in/raghava/il6pred/stand.html. Figure 6 shows the complete usage of the Python-based standalone version of IL6pred along with the arguments it requires for a successful run. The main script is named “il6pred.py” and takes six different arguments, represented by six different tags (-i, -o, -j, -t, -w, and -d). The input sequences, in FASTA format or single-letter code, are the only required argument, given with the “-i” tag; the rest of the arguments are optional. The “-o” tag takes the name of the output file that will store the prediction results; by default, it is “outfile.csv”. The “-j” tag represents the job, which can take three values: 1 for the predict module, 2 for the design module, and 3 for the scan module; it is set to 1 if no value is given. The “-t” tag represents the threshold, a float between 0 and 1. The “-w” tag gives the window/fragment length required for the scan module, i.e., the length of the overlapping fragments (see Note 5). The “-d” tag is the display option, which takes the value 1 to report only the IL6-inducing peptides among the query peptides or 2 to report the predictions for all query peptides.

Fig. 5 The “BLAST Search” module of IL6Pred showing hit against experimentally validated IL6-inducing epitopes/peptides

Fig. 6 The usage and description of the arguments taken by Python-based IL6Pred standalone package
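The six tags can be summarized as an argument parser. The Python sketch below mirrors the description in this subheading using the standard argparse module; it is not the source of il6pred.py, and anything the text does not state (for example, the defaults of -t, -w, and -d) is left unset rather than guessed. The input file name is a placeholder.

```python
import argparse


def build_parser():
    """Parser mirroring the six tags described in the text (illustrative only)."""
    parser = argparse.ArgumentParser(prog="il6pred.py")
    parser.add_argument("-i", required=True,
                        help="input sequences (FASTA or single-letter code)")
    parser.add_argument("-o", default="outfile.csv", help="output file name")
    parser.add_argument("-j", type=int, choices=[1, 2, 3], default=1,
                        help="job: 1 predict, 2 design, 3 scan")
    parser.add_argument("-t", type=float, help="threshold between 0 and 1")
    parser.add_argument("-w", type=int, help="window length for the scan job")
    parser.add_argument("-d", type=int, choices=[1, 2],
                        help="1: only IL6 inducers, 2: all query peptides")
    return parser


# Equivalent to: python il6pred.py -i protein.fa -j 3 -w 12 -d 2
args = build_parser().parse_args(["-i", "protein.fa", "-j", "3", "-w", "12", "-d", "2"])
print(args)
```

Under these assumptions, a scan of protein.fa with 12-residue fragments would write all predictions to the default output file, outfile.csv.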

4 Notes

1. Users can submit their protein sequences in FASTA format.
2. By default, the random forest-based model is used; users can select the desired model.
3. IL6pred computes a prediction score that indicates the IL6-inducing potential of a peptide.
4. The design module generates all possible mutant peptides and predicts the IL6-inducing potential of each mutant. This allows users to design a peptide with the desired IL6-inducing potential.
5. IL6-inducing regions in a vaccine candidate can be identified using the scan module.
6. The motif scan module searches for motifs specific to IL6-inducing peptides using the MEME/MAST and MERCI algorithms.
7. BLAST has been integrated to perform similarity searches against IL6-inducing peptides.
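The exhaustive mutant generation mentioned in Note 4 can be sketched as follows. This minimal Python example assumes that "all possible mutants" means every single-residue substitution with one of the other 19 standard amino acids; the function name and the example peptide are hypothetical, and scoring of the mutants by a trained model is not reproduced.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues


def single_substitution_mutants(peptide):
    """Yield (position, original, replacement, mutant) for every one-residue change."""
    for pos, original in enumerate(peptide):
        for replacement in AMINO_ACIDS:
            if replacement != original:
                mutant = peptide[:pos] + replacement + peptide[pos + 1:]
                yield pos + 1, original, replacement, mutant


mutants = list(single_substitution_mutants("ACDKLMAAGH"))  # made-up peptide
print(len(mutants), mutants[0])
```

A peptide of length L gives 19 × L single-substitution mutants, so even a 25-residue input produces only 475 candidates to score and sort.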

Acknowledgment

The authors are thankful to the Department of Bio-Technology (DBT) and Department of Science and Technology (DST-INSPIRE) for fellowships and the financial support and Department of Computational Biology, IIITD, New Delhi, for infrastructure and facilities.

References

1. Pulendran B, Ahmed R (2011) Immunological mechanisms of vaccination. Nat Immunol 12(6):509–517. https://doi.org/10.1038/ni.2039
2. Cid R, Bolivar J (2021) Platforms for production of protein-based vaccines: from classical to next-generation strategies. Biomol Ther 11(8). https://doi.org/10.3390/biom11081072
3. Usmani SS, Raghava GPS (2020) Potential challenges for coronavirus (SARS-CoV-2) vaccines under trial. Front Immunol 11:561851. https://doi.org/10.3389/fimmu.2020.561851
4. Elhay MJ, Andersen P (1997) Immunological requirements for a subunit vaccine against tuberculosis. Immunol Cell Biol 75(6):595–603. https://doi.org/10.1038/icb.1997.94
5. Andersen P, Doherty TM (2005) TB subunit vaccines--putting the pieces together. Microbes Infect 7(5–6):911–921. https://doi.org/10.1016/j.micinf.2005.03.013
6. Black M, Trent A, Tirrell M, Olive C (2010) Advances in the design and delivery of peptide subunit vaccines with a focus on toll-like receptor agonists. Expert Rev Vaccines 9(2):157–173. https://doi.org/10.1586/erv.09.160
7. Kaufmann SH (2012) Tuberculosis vaccine development: strength lies in tenacity. Trends Immunol 33(7):373–379. https://doi.org/10.1016/j.it.2012.03.004
8. Kanoi BN, Egwang TG (2007) New concepts in vaccine development in malaria. Curr Opin Infect Dis 20(3):311–316. https://doi.org/10.1097/QCO.0b013e32816b5cc2
9. Malonis RJ, Lai JR, Vergnolle O (2020) Peptide-based vaccines: current progress and future challenges. Chem Rev 120(6):3210–3229. https://doi.org/10.1021/acs.chemrev.9b00472
10. Agarwal N, Padmanabh S, Vogelzang NJ (2012) Development of novel immune interventions for prostate cancer. Clin Genitourin Cancer 10(2):84–92. https://doi.org/10.1016/j.clgc.2012.01.012
11. Degos F (1995) Protein subunit vaccines: example of vaccination against hepatitis B virus. Rev Prat 45(12):1488–1491
12. Heidary M, Kaviar VH, Shirani M, Ghanavati R, Motahar M, Sholeh M, Ghahramanpour H, Khoshnood S (2022) A comprehensive review of the protein subunit vaccines against COVID-19. Front Microbiol 13:927306. https://doi.org/10.3389/fmicb.2022.927306
13. Usmani SS, Kumar R, Bhalla S, Kumar V, Raghava GPS (2018) In silico tools and databases for designing peptide-based vaccine and drugs. Adv Protein Chem Struct Biol 112:221–263. https://doi.org/10.1016/bs.apcsb.2018.01.006
14. Nagpal G, Usmani SS, Raghava GPS (2018) A web resource for designing subunit vaccine against major pathogenic species of bacteria. Front Immunol 9:2280. https://doi.org/10.3389/fimmu.2018.02280
15. Lata S, Raghava GP (2008) CytoPred: a server for prediction and classification of cytokines. Protein Eng Des Sel 21(4):279–282. https://doi.org/10.1093/protein/gzn006
16. Dhanda SK, Gupta S, Vir P, Raghava GP (2013) Prediction of IL4 inducing peptides. Clin Dev Immunol 2013:263952. https://doi.org/10.1155/2013/263952
17. Anjali Lathwal RK, Kaur D, Raghava GPS (2021) In silico model for predicting IL-2 inducing peptides in human. bioRxiv. https://doi.org/10.1101/2021.06.20.449146
18. Dhanda SK, Vir P, Raghava GP (2013) Designing of interferon-gamma inducing MHC class II binders. Biol Direct 8:30. https://doi.org/10.1186/1745-6150-8-30
19. Nagpal G, Usmani SS, Dhanda SK, Kaur H, Singh S, Sharma M, Raghava GP (2017) Computer-aided designing of immunosuppressive peptides based on IL-10 inducing potential. Sci Rep 7:42851. https://doi.org/10.1038/srep42851
20. Gupta S, Mittal P, Madhu MK, Sharma VK (2017) IL17eScan: a tool for the identification of peptides inducing IL-17 response. Front Immunol 8:1430. https://doi.org/10.3389/fimmu.2017.01430
21. Gupta S, Sharma AK, Shastri V, Madhu MK, Sharma VK (2017) Prediction of anti-inflammatory proteins/peptides: an insilico approach. J Transl Med 15(1):7. https://doi.org/10.1186/s12967-016-1103-6
22. Jain S, Dhall A, Patiyal S, Raghava GPS (2022) IL13Pred: A method for predicting immunoregulatory cytokine IL-13 inducing peptides. Comput Biol Med 143:105297. https://doi.org/10.1016/j.compbiomed.2022.105297
23. Hirano T (1998) Interleukin 6 and its receptor: ten years later. Int Rev Immunol 16(3–4):249–284. https://doi.org/10.3109/08830189809042997
24. Covarrubias AJ, Horng T (2014) IL6 strikes a balance in metabolic inflammation. Cell Metab 19(6):898–899. https://doi.org/10.1016/j.cmet.2014.05.009
25. Hong DS, Angelo LS, Kurzrock R (2007) Interleukin-6 and its receptor in cancer: implications for translational therapeutics. Cancer 110(9):1911–1928. https://doi.org/10.1002/cncr.22999
26. Su H, Lei CT, Zhang C (2017) Interleukin-6 signaling pathway and its role in kidney disease: an update. Front Immunol 8:405. https://doi.org/10.3389/fimmu.2017.00405
27. Rose-John S, Winthrop K, Calabrese L (2017) The role of IL6 in host defence against infections: immunobiology and clinical implications. Nat Rev Rheumatol 13(7):399–409. https://doi.org/10.1038/nrrheum.2017.83
28. Gubernatorova EO, Gorshkova EA, Polinova AI, Drutskaya MS (2020) IL6: relevance for immunopathology of SARS-CoV-2. Cytokine Growth Factor Rev 53:13–24. https://doi.org/10.1016/j.cytogfr.2020.05.009
29. Hirano T (2010) Interleukin 6 in autoimmune and inflammatory diseases: a personal memoir. Proc Jpn Acad Ser B Phys Biol Sci 86(7):717–730. https://doi.org/10.2183/pjab.86.717
30. Usmani SS, Bedi G, Samuel JS, Singh S, Kalra S, Kumar P, Ahuja AA, Sharma M, Gautam A, Raghava GPS (2017) THPdb: database of FDA-approved peptide and protein therapeutics. PLoS One 12(7):e0181748. https://doi.org/10.1371/journal.pone.0181748
31. Dhall A, Patiyal S, Sharma N, Usmani SS, Raghava GPS (2021) Computer-aided prediction and design of IL6 inducing peptides: IL6 plays a crucial role in COVID-19. Brief Bioinform 22(2):936–945. https://doi.org/10.1093/bib/bbaa259
32. Vita R, Mahajan S, Overton JA, Dhanda SK, Martini S, Cantrell JR, Wheeler DK, Sette A, Peters B (2019) The immune epitope database (IEDB): 2018 update. Nucleic Acids Res 47(D1):D339–D343. https://doi.org/10.1093/NAR/GKY1006
33. Pande A, Patiyal S, Lathwal A, Arora C, Kaur D, Dhall A, Mishra G, Kaur H, Sharma N, Jain S, Usmani SS, Agrawal P, Kumar R, Kumar V, Raghava GPS (2019) Computing wide range of protein/peptide features from their sequence and structure. bioRxiv:599126. https://doi.org/10.1101/599126
34. Agrawal P, Kumar R, Usmani SS, Dhall A, Patiyal S, Sharma N, Kaur H, Kumar V, Kaur D, Jain S (2019) GPSRdocker: a Docker-based resource for genomics, proteomics and systems biology. bioRxiv:827766
35. Bailey TL, Boden M, Buske FA, Frith M, Grant CE, Clementi L, Ren J, Li WW, Noble WS (2009) MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 37(Web Server issue):W202–W208. https://doi.org/10.1093/nar/gkp335
36. McGinnis S, Madden TL (2004) BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res 32(Web Server issue):W20–W25. https://doi.org/10.1093/nar/gkh435

Chapter 23

In Silico Tool for Identification, Designing, and Searching of IL13-Inducing Peptides in Antigens

Shipra Jain, Anjali Dhall, Sumeet Patiyal, and Gajendra P. S. Raghava

Abstract

Interleukins are a distinctive class of molecules with various immune signaling functions. The immunoregulatory cytokine interleukin 13 (IL13) is primarily synthesized by activated T-helper 2 cells, mast cells, and basophils. IL13 is known to stimulate many allergic and autoimmune diseases, such as asthma, rheumatoid arthritis, systemic sclerosis, ulcerative colitis, airway hyperresponsiveness, glycoprotein hypersecretion, and goblet cell hyperplasia. In addition to such disorders, IL13 also contributes to carcinogenesis by inhibiting tumor immunosurveillance. Because of its role in various diseases, predicting IL13-inducing peptides or regions in a protein is vital to designing safe protein vaccines and therapeutics. IL13pred is an in silico tool that aids in identifying, predicting, and designing IL13-inducing peptides. The IL13pred web server and standalone package are accessible at https://webs.iiitd.edu.in/raghava/il13pred/.

Key words Interleukin 13, Inducing peptides prediction, Vaccine development, Immunotherapy

1 Introduction

In recent years, selective inhibition of an interleukin using monoclonal antibodies has proved successful against various disorders. Cytokine-targeted therapeutic peptide vaccines have shown their efficacy in evoking antibody responses in various inflammatory and allergic conditions [1]. A peptide/epitope with the ability to elicit an immune response toward a particular cytokine is relevant as a vaccine subunit. These potential vaccine candidates are promising for modulating various disease states [2]. However, altering the immune response in patients is complex and depends on various other factors as well [3]. Studies in the literature have reported that selective blocking of IL13 receptors by vaccines/immunotherapies would include the use of anti-cytokine blocking antibodies and cytokine mutants [4].

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_23, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


Interleukin 13 plays an important role in the allergic and inflammatory response of the immune system to an antigen [5]. Unregulated interactions of this interleukin lead to various diseases such as asthma, mucus hypersecretion, IgE Ab production, fibrosis, leukemia, and Hodgkin’s disease [6–9]. In addition, studies have reported a role of IL13/IL-4 receptors in the prognosis of several tumors, such as pancreatic, gastric, and colon cancer [10–13]. Therefore, IL13-induced immunosuppression plays an important role in monitoring various disease states, making it a vital target for immunotherapy and vaccine design [14].

In recent years, epitope/peptide-based immunotherapeutics and vaccines have proved vital in altering the immune response toward various diseases. In the literature, IL13-based therapies are reported as a promising alternative to conventional therapeutics. Many experiments have been conducted to design antibodies/peptides that can effectively bind to IL13 receptors and hinder the immune response in a cell. However, these peptides might evoke the release of allergic or inflammatory cytokines such as IL13, which in turn can activate the immune response in the cell. Screening peptides/epitopes with an in silico tool would aid in the effective design of vaccine subunits. Various computational methods are available for predicting cytokine-inducing peptides [15–20], but no method is available exclusively for identifying IL13-inducing potential.

In this chapter, we describe the IL13pred tool for the prediction of IL13-inducing peptides or epitopes [21]. IL13pred provides a user-friendly computational platform for screening potential IL13-inducing vaccine subunits. Researchers can use this tool to check the immune response to therapeutic peptides computationally before proceeding to clinical trials. This tool would aid the scientific community in developing immunotherapies and vaccine subunits efficiently, with known IL13-inducing and non-inducing capabilities. This chapter discusses the IL13pred tool and its modules in detail for designing and screening potential vaccine subunits.

2 Materials

The machine learning–based prediction model is developed on experimentally validated IL13-inducing and non-inducing peptides collected from IEDB (https://www.iedb.org/) [22]. In this chapter, we have focused on four modules of IL13pred, i.e., “Predict Module” for identification of IL13-inducing peptides, “Design Module” for designing of IL13 non-inducing peptides, “Protein Scan Module” for determination of IL13-inducing regions in antigenic sequences, and “BLAST scan module” for similarity search against the experimentally validated IL13-inducing epitopes. The overall architecture of the IL13pred tool is depicted in Fig. 1.


Fig. 1 Schematic depiction of IL13pred tool modules

3 Methods

3.1 Description of IL13pred Tool

IL13pred is an in silico tool developed to predict the IL13-inducing potential of a given peptide using machine learning models. The Pfeature tool is used to generate features from the peptide sequences. IL13pred selects the top-10 features from the pool of generated features and provides them as input to the machine learning model. An eXtreme Gradient Boosting (XGB)-based machine learning algorithm is implemented in the backend of the server. The tool incorporates four major modules, and a detailed description of each module is provided below. The IL13pred server and standalone package are accessible at https://webs.iiitd.edu.in/raghava/il13pred/ and https://webs.iiitd.edu.in/raghava/il13pred/stand.html, respectively.

3.2 Prediction of IL13-Inducing Peptides

In the “Predict module,” the user can paste peptide sequences in the FASTA format (see Note 1) or upload an input file with a maximum of 500 sequences. The user can select the probability threshold, which is set at 0.06 by default and can be changed as required. A higher threshold gives lower sensitivity and lower coverage, but the probability that a predicted inducer is correct is higher, i.e., the specificity of the result is high. This module also enables the user to select relevant physicochemical properties from a pool of options displayed on the main page: amphipathicity, charge, hydropathicity, hydrophilicity, hydrophobicity, net hydrogen, pI, side bulk, and steric hindrance. The module provides the classification result, i.e., whether the given peptide has IL13-inducing potential or not, along with its prediction probability. The user can sort the prediction results by probability score; a high score denotes a higher IL13-inducing potential of the sequence. In addition to the prediction, the result page shows the physicochemical property analysis for the input peptides, which can be sorted in ascending or descending order (see Fig. 2). The user can therefore choose the peptides with the desired physicochemical properties. The result file can be downloaded in comma-separated value format.

Fig. 2 Workflow of the Predict module of IL13pred along with the results
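The effect of the probability threshold on sensitivity and specificity can be made concrete with a small calculation. The Python sketch below uses toy scores and labels invented purely for illustration (they are not IL13pred outputs), and it assumes that a peptide is called an inducer when its score is greater than or equal to the threshold; the text does not specify whether the comparison is strict.

```python
def sensitivity_specificity(scores, labels, threshold):
    """labels: 1 = inducer, 0 = non-inducer; predicted inducer if score >= threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s >= threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s < threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s < threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data, invented for illustration only
scores = [0.02, 0.05, 0.07, 0.10, 0.30, 0.04, 0.08, 0.60]
labels = [0, 0, 0, 0, 0, 1, 1, 1]

for threshold in (0.03, 0.06, 0.20):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold={threshold:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

On this toy set, raising the threshold lowers sensitivity and raises specificity, which is the trade-off described above.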

3.3 Designing of IL13-Inducing Peptides

IL13pred offers a second module, the “Design Module,” which enables a user to generate all possible mutations of a given peptide. In this module, each mutant peptide is generated by altering a single residue. Along with the possible mutations, the module provides a prediction score for each mutant peptide. The user pastes a target sequence of 8–35 amino acid residues and then selects the probability threshold, which is set to 0.06 by default. A high threshold means that the chances of accurately predicting inducing potential are high, whereas a low threshold might lead the module to report false inducers. The results list the mutant peptide sequences with a score for their ability to induce IL13 (see Fig. 3). Users can sort the generated mutant peptides by this score and find the peptides with the highest predicted potential to induce IL13 secretion. Results from this page can be downloaded in comma-separated file format.

Fig. 3 Illustration of Design module of IL13pred along with the results


3.4 Scanning of IL13-Inducing Regions

The third module of IL13pred is “Protein Scan,” which can be used to identify IL13-inducing regions in a full-length protein. This module takes an amino acid sequence in single-letter format as input and scans the entire length of the protein. To do so, overlapping patterns of a user-specified window length (from 8 to 25) are generated, and the top-10 features are calculated for each fixed-length pattern. The calculated features are then provided to the eXtreme Gradient Boosting (XGB)-based machine learning model to predict the potential of each pattern to induce the production of IL13 (see Note 2). The default threshold is 0.06 but can be altered as required. This module presents the results in two formats: tabular and graphical. In the tabular format, all the generated overlapping patterns are shown along with their prediction score and final prediction, i.e., whether the region/peptide is an IL13 inducer or not (see Fig. 4). In the graphical format, only IL13-inducing epitopes are highlighted. Researchers working in the field of vaccine development can use the information on IL13-inducing regions in an antigen to create an efficient and effective vaccine. The results in the tabular format are downloadable in comma-separated value format for further investigation.
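When many overlapping patterns are predicted as inducers, it can be convenient to collapse them into contiguous regions, similar in spirit to a graphical view that highlights only the inducing stretches. The Python sketch below shows one way to do this; it is an assumption about post-processing, not a description of how the server draws its output, and the start positions in the example are invented.

```python
def merge_positive_windows(starts, window):
    """Merge 1-based start positions of positive windows into (start, end) regions."""
    regions = []
    for start in sorted(starts):
        end = start + window - 1
        if regions and start <= regions[-1][1] + 1:
            # Overlapping or directly adjacent: extend the current region
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


# Invented example: windows of length 9 starting at residues 3, 4, 5, and 20
regions = merge_positive_windows([3, 4, 5, 20], window=9)
print(regions)
```

Each (start, end) pair marks a stretch of the antigen covered by at least one predicted inducer pattern.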

3.5 Similarity Search Based on BLAST

To predict IL13 inducers by similarity search, the BLAST scan module is incorporated in IL13pred. This module takes multiple sequences in the FASTA format as input, which can be pasted or uploaded in a file. A customized database was first created from the experimentally validated IL13 inducer and non-inducer sequences downloaded from the IEDB. The query sequences are then searched for similarity against this database at a user-defined e-value, which is set at 0.001 by default. A query sequence is predicted as an IL13 inducer if a hit is found against an IL13 inducer sequence in the database; otherwise, it is predicted as a non-inducer (see Fig. 5). The results are presented in tabular format and are downloadable in .csv format.
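The assignment rule of the BLAST scan can be written down as a short function. The Python sketch below is one possible reading of that rule: it assumes that hits have already been filtered by the e-value cutoff and that the label is taken from the hit with the lowest e-value. The text does not state how multiple hits are resolved, so this tie-breaking choice, like the sequence identifiers used here, is hypothetical.

```python
def assign_by_best_hit(hits, inducer_ids):
    """hits: list of (subject_id, evalue) pairs that passed the e-value cutoff."""
    if not hits:
        return "non-inducer"  # no similarity found in the database
    best_subject, _ = min(hits, key=lambda hit: hit[1])
    return "IL13 inducer" if best_subject in inducer_ids else "non-inducer"


INDUCER_IDS = {"pos_001", "pos_002"}  # hypothetical IDs of inducer entries

print(assign_by_best_hit([("pos_001", 1e-5), ("neg_014", 1e-3)], INDUCER_IDS))
print(assign_by_best_hit([], INDUCER_IDS))
```

Keeping the inducer identifiers in a set makes the lookup independent of how the database entries are named.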

3.6 Standalone Version of IL13pred

This section focuses on the IL13pred standalone package, which is based on Python and Perl. The standalone version can be applied to larger datasets to predict IL13 inducer and non-inducer peptides through a command-line interface (see Note 3). It can be downloaded freely from the IL13pred web server at https://webs.iiitd.edu.in/raghava/il13pred/stand.html. In the Python-based version, the main program is “il13pred.py”, which accepts six arguments (-i, -o, -j, -t, -w, and -d). The program takes the input sequences in FASTA or single-letter format through the “-i” argument. This argument is compulsory to run the program, whereas the rest of the arguments are optional. The arguments and usage of the standalone version are explained in Fig. 6. This tool can also be accessed via GPSRDocker and GitHub (https://github.com/raghavagps/il13pred).

Fig. 4 Workflow of Protein Scan module of IL13pred along with the output page

4 Notes

1. IL13pred accepts input sequences in FASTA format.
2. The web server implements an eXtreme Gradient Boosting (XGB)-based model for prediction.
3. The IL13pred standalone software is available at the web server and can be used for larger datasets.


Shipra Jain et al.

Fig. 5 Illustration of BLAST scan module of the IL13pred along with the output

Fig. 6 IL13pred standalone version usage and its arguments description

Predicting IL13 Inducer Peptides


Acknowledgments

The authors are thankful to the Department of Bio-Technology (DBT) and Department of Science and Technology (DST-INSPIRE) for fellowships and the financial support and Department of Computational Biology, IIITD, New Delhi, for infrastructure and facilities.

Part IV Computational Vaccinology Applications and Protocols

Chapter 24

A Lean Reverse Vaccinology Pipeline with Publicly Available Bioinformatic Tools

Bart Cuypers, Rino Rappuoli, and Alessandro Brozzi

Abstract

Reverse vaccinology (RV) marked an outstanding improvement in vaccinology by employing bioinformatics tools to extract effective features from protein sequences to drive the selection of potential vaccine candidates (Rappuoli, Curr Opin Microbiol 3(5):445–450, 2000). Pioneered by Rino Rappuoli and first used against serogroup B meningococcus, it has since been applied to several other bacterial vaccines, with the adopted bioinformatics tools varying over time. Based on our experience in the field of RV and following an extensive literature review, we consolidate a lean RV pipeline of publicly available bioinformatic tools whose usage is described in this contribution. The protein features whose extraction is reported in this contribution can also serve, in a matrix format, as the input for machine learning-based approaches.

Key words Reverse vaccinology, Bacteria, Core proteome, Subcellular location, Antigen abundance, T cell epitopes, B cell epitopes

1 Introduction

Reverse vaccinology (RV) is a widely used approach to identify potential vaccine candidates by screening the proteome of a pathogen through bioinformatics tools [1]. Since its first conception and application for the Group B meningococcus (MenB) vaccine in the early 1990s, many authors have fine-tuned their personal recipes of features to represent protein sequences and of criteria applied on each feature to select proteins with a higher likelihood of becoming vaccine candidates (see [2] for a comprehensive review). Based on our experience in the field and following an extensive literature review, we consolidate a lean RV pipeline of publicly available bioinformatic tools. The pipeline is articulated around four primary points:

1. Obtaining the core proteome of the target bacterium
2. Prediction of protein subcellular localization
3. Estimation of protein abundance
4. Prediction of B and T cell linear epitopes

For each of these steps, we describe the usage of a publicly available bioinformatic tool. Table 1 summarizes the four steps of our lean RV pipeline, together with the associated publicly available bioinformatic tools and the inclusion criteria for a protein to belong to the set of potential vaccine candidates.

Table 1 Summary of the four main steps in our reverse vaccinology pipeline

Step | Bioinformatic tools | Inclusion criteria
1 Obtain core proteome | MMseqs2 | Proteins present in more than 80% of strains selected
2 Subcellular localization prediction | PSORTb and Gene Ontology | Predicted extracellular exposed or with an experimental Gene Ontology evidence
3 Antigen abundance | MaxQuant | Abundance value greater than zero
4 Prediction of T and B cell linear epitopes | MixMHC2pred and EpitopeVec | % rank of maximum 2 and predicted score >= 0.5, respectively

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_24, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

2 Materials

The following hardware and software are required to run the pipeline: a Linux computer, or a Windows computer with a virtual subsystem for Linux; a web browser (Chrome, Edge, or Firefox); a Conda installation (https://docs.conda.io/en/latest/miniconda.html); and a Docker installation (https://docs.docker.com/engine/install/ubuntu/). For the optional step of estimating protein expression/abundance (Subheading 3.3), MaxQuant v2.4.1.0 (https://www.maxquant.org/maxquant/) is required.

3 Methods

3.1 Obtain Core Proteome of the Target Bacterium

This section is a practical guide on how to retrieve the most suitable genome assemblies for a list of bacterial strains of interest, download their proteomes, and then identify the shared, conserved fraction of their proteomes, or “core proteome” (see Note 1). The size of the core proteome is inversely related to the phylogenetic divergence and the number of strains included. As such, the researcher might consider running this step multiple times with different sets of strains for exploratory purposes. Genome sequences of strains are available in public databases. We use the most common source of genome sequences, the NCBI genome database (see Note 2). This guide is composed of three steps: (1) Find reference assemblies and download their proteomes. (2) Generate a clustered panproteome. (3) Generate the core proteome.

3.1.1 Find Reference Assemblies and Download Their Proteomes

1. Open the NCBI genome data hub at https://www.ncbi.nlm.nih.gov/data-hub/genome
2. Search for the bacterial species and strain of interest in the search box and apply the filters “Annotated genomes” and “Annotated by GenBank submitter”.
3. A table with search results will appear (see Fig. 1). There is no “one fits all” approach to selecting the best assembly for any research question. However, the general principles described below should allow the researcher to make a well-founded choice.
4. Every row in the search result table represents a genome assembly in the database, while the columns indicate different properties of these assemblies. These properties should be carefully considered when selecting the optimal reference genome for the relevant research question.
   Level: There are four levels, which in order of “completeness” are contig, scaffold, chromosome, and complete. In general, more complete is better. It is also possible to filter the table on this property, which is useful when there are many results for the species and strain of interest.
   Size: Generally, a larger size means the assembly is more complete.
   Number of genes (does not appear by default: select “select columns” and then mark the checkbox “genes”): Generally, more genes indicate a more complete assembly.
   Year: From a technical point of view, a more recent assembly may have been generated with newer sequencing technologies or newer genome assembly software, although this is not guaranteed. From a biological point of view, a later deposition date could indicate strains from a more recent outbreak. However, both points should be explored more thoroughly in the detailed assembly information.
   Modifier: This column annotates the species with a strain or, even more detailed, isolate name. This is typically highly relevant information in the context of the research question.
   Additionally, it is recommended to click on the filtered assemblies of interest, which opens the assembly page, and to check the detailed assembly information. The assembly page contains key information such as the submitter of the sequence; the assembly submission date; assembly statistics; information about the sequencing technology and software used to generate the assembly (when selecting multiple assemblies, it is preferable to use assemblies generated with the same technology and assembly pipeline to avoid biases); and sample details, i.e., the origin and nature of the biological sample that was used to generate the assembly (for example, from which host or environment was this sample derived?).
5. Download the protein FASTA file of the assembly(ies) of interest by clicking on the three vertical dots in the “Action” column and selecting “Download”. “Select file source” should be “Genbank”, and the file type “Protein (FASTA)”. The result will be a folder starting with “GCA_”; inside is the protein FASTA file (protein.faa).

Fig. 1 The NCBI Genome Hub with an example search result for E. coli assemblies

3.1.2 Generate Clustered Panproteome

The second step in the generation of the core proteome is the generation of a clustered panproteome. The panproteome is the combined set of all proteins of all the different strains from the previous step [3]. Clustering these proteins at a defined sequence similarity (70% identity and 50% coverage) yields a meaningful panproteome representation, as it allows one to check (1) how ortholog proteins are distributed over the different strains and (2) in how many strains they occur. The next steps explain how to generate a clustered panproteome with MMseqs2. Steps are the following:

1. Move all downloaded folders (starting with “GCA_”) to a new common folder “Proteomes”.
2. Cluster the protein sequences using MMseqs2 by running the following code in the command line:

    # Navigate to the Proteomes directory
    cd Proteomes
    # Generate a combined .fasta of all protein FASTA files in the GCA_ subfolders
    cat ./*/protein.faa > allproteins.fasta
    # Create new conda environment and install mmseqs2
    conda create --name mmseqs2
    conda activate mmseqs2
    conda install -c bioconda mmseqs2
    # Cluster allproteins.fasta
    ## Create mmseqs database
    mmseqs createdb allproteins.fasta DB
    ## Cluster with minimum 70% sequence identity and 50% sequence overlap
    mmseqs cluster DB DB_clu tmp --cluster-reassign 1 --min-seq-id 0.7 --cov-mode 0 -c 0.5
    # Generate a tsv file with the clustering result
    mmseqs createtsv DB DB DB_clu DB_clu.tsv
    # Exit mmseqs2 conda environment
    conda deactivate

Outcomes are: the “allproteins.fasta” file is a concatenated file containing all the proteins of all strains included in the analysis; the “DB_clu.tsv” file contains the clustering result or “panproteome”.


Each row represents two proteins that are members of the same cluster. The first column always contains the cluster representative sequence, and the second column contains a cluster member.

3.1.3 Generate Core Proteome

The third and final step is to generate a core proteome of proteins that occur in all, or at least a large proportion, of the strains of interest. The core proteome can be easily generated with the Python tool developed for this chapter and the “allproteins.fasta” and “DB_clu.tsv” files generated in the previous step. The script has only one variable parameter, “-t” (threshold), which indicates the percentage of strains that should contain at least one member of the cluster of interest. We recommend a value of 80%, which means that for a cluster representative to pass to the “coregenome.fasta” output file, at least one cluster member should be present in 80% of the strains.

    # Make new conda environment with required python packages
    conda create --name coregenome
    conda activate coregenome
    conda install -c conda-forge biopython
    # Download this chapter's Python tool (raw file URL) and generate coregenome.fasta
    wget https://raw.githubusercontent.com/CuypersBart/ReverseVaccinology_Tools/main/core_genome.py
    python core_genome.py -t 80 allproteins.fasta DB_clu.tsv
    # Exit coregenome conda environment
    conda deactivate
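The threshold logic behind the core proteome can be sketched in Python as follows. This is not the chapter's core_genome.py script, whose internals are not shown here; it is a minimal illustration that assumes the two-column layout of “DB_clu.tsv” described above and a hypothetical `protein_to_strain` mapping from each protein ID to its strain of origin. Treating a cluster found in exactly the threshold percentage of strains as core is also an assumption.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def read_clusters(tsv_lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Group cluster members by representative (column 1 = representative)."""
    clusters: Dict[str, Set[str]] = defaultdict(set)
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if len(row) < 2:
            continue  # skip empty or malformed lines
        clusters[row[0]].add(row[1])
    return clusters


def core_representatives(clusters: Dict[str, Set[str]],
                         protein_to_strain: Dict[str, str],
                         threshold: float = 80.0) -> List[str]:
    """Keep representatives whose cluster covers >= threshold % of strains."""
    n_strains = len(set(protein_to_strain.values()))
    core = []
    for representative, members in clusters.items():
        strains = {protein_to_strain[m] for m in members}
        if 100.0 * len(strains) / n_strains >= threshold:
            core.append(representative)
    return sorted(core)


if __name__ == "__main__":
    lines = ["a1\ta1", "a1\tb1", "a2\ta2"]
    strains = {"a1": "s1", "b1": "s2", "a2": "s1"}
    print(core_representatives(read_clusters(lines), strains, threshold=80.0))
```

Counting strains rather than cluster members means that paralogs from a single strain do not inflate the coverage of a cluster.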

3.2 Prediction of Protein Subcellular Localization

Once we have our set of core proteins (“coregenome.fasta”) shared by the majority of our set of strains, we proceed with the prediction of subcellular localization (SCL). The prediction of extracellular localization is probably the most important cornerstone of any RV approach. Numerous computational SCL prediction methods using sequence data have been developed to complement laboratory approaches (see Note 3). These enable rapid localization predictions for proteins deduced from newly sequenced genomes. Among the existing protein SCL predictors for bacteria and archaea, PSORTb [4] is one of the most widely used and has remained the most precise bacterial SCL predictor since it was first made available in 2003.


PSORTb runs under a Linux environment. A very useful option is the Docker version released by the Brinkman Lab in 2018 and available at https://hub.docker.com/r/brinkmanlab/psortb_commandline/. The Docker image also makes PSORTb runnable on Mac and Windows 10. The prediction differs between Gram-positive and Gram-negative bacteria. Using the commands:

    sudo service docker start
    sudo docker pull brinkmanlab/psortb_commandline:1.0.2
    wget https://raw.githubusercontent.com/brinkmanlab/psortb_commandline_docker/master/psortb
    chmod +x psortb
    ./psortb -i coregenome.fasta -r ./ --negative -o long

the program returns, for Gram-negative bacteria, a table in a text file with the fields: Cytoplasmic_Score; CytoplasmicMembrane_Score; Periplasmic_Score; OuterMembrane_Score; Extracellular_Score; and Final_Localization. Surface-exposed proteins can be selected by filtering for “Final_Localization” equal to “Periplasmic”, “OuterMembrane”, or “Extracellular”. For Gram-positive bacteria, the output table follows the format: Cytoplasmic_Score; CytoplasmicMembrane_Score; Cellwall_Score; Extracellular_Score; and Final_Localization. In this case as well, surface-exposed proteins can be selected by filtering for “Final_Localization” equal to “Cellwall” or “Extracellular”. We also suggest the creation of a new score (ranging from 0 to 1) given, for Gram-negative bacteria, by the sum of the Periplasmic, OuterMembrane, and Extracellular scores divided by the total sum of scores and, for Gram-positive bacteria, by the sum of the Cellwall and Extracellular scores divided by the total sum of scores. A cutoff of 0.5 can be applied to classify proteins as surface exposed (>= 0.5) or not surface exposed (< 0.5).

[...] > proteinGroups_filtered.txt
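The surface-exposure score suggested for the PSORTb output in Subheading 3.2 can be sketched in Python as follows. The sketch assumes that one row of the PSORTb table has already been parsed into a dictionary keyed by the score field names listed in that subheading; the function names are hypothetical.

```python
from typing import Dict

# Score fields counted as surface exposed, as listed in Subheading 3.2.
GRAM_NEGATIVE_SURFACE = ("Periplasmic_Score", "OuterMembrane_Score",
                         "Extracellular_Score")
GRAM_POSITIVE_SURFACE = ("Cellwall_Score", "Extracellular_Score")


def surface_score(scores: Dict[str, float], gram: str = "negative") -> float:
    """Sum of surface-related scores divided by the sum of all scores (0 to 1)."""
    fields = GRAM_NEGATIVE_SURFACE if gram == "negative" else GRAM_POSITIVE_SURFACE
    total = sum(scores.values())
    if total == 0:
        return 0.0  # no localization signal at all
    return sum(scores[field] for field in fields) / total


def is_surface_exposed(scores: Dict[str, float], gram: str = "negative",
                       cutoff: float = 0.5) -> bool:
    """Apply the 0.5 cutoff: surface exposed if the score is >= cutoff."""
    return surface_score(scores, gram) >= cutoff


if __name__ == "__main__":
    outer_membrane_protein = {
        "Cytoplasmic_Score": 0.0, "CytoplasmicMembrane_Score": 1.0,
        "Periplasmic_Score": 2.0, "OuterMembrane_Score": 6.0,
        "Extracellular_Score": 1.0,
    }
    print(surface_score(outer_membrane_protein),
          is_surface_exposed(outer_membrane_protein))
```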

The protein quantification can be found in the columns “LFQ Intensity”, followed by the sample name.

3.4 Prediction of T and B Cell Linear Epitopes

Once a list of antigens of interest has been selected based on the former steps, the next stage is to look deeper into which parts of these antigens contain potential immunogenic epitopes. Two key steps in the mounting of a B cell response by the human host are the presentation of the epitope by MHC-II and its recognition by a B cell receptor. Subheading 3.4.2 outlines how to predict whether an epitope can be presented by MHC-II, making use of the MixMHC2pred tool [8]. In Subheading 3.4.3, epitopes that are likely recognized by B cells are predicted with the EpitopeVec tool [9]. This part is divided into three steps: (1) selecting the protein of interest from a FASTA file; (2) T cell epitope prediction with MixMHC2pred; and (3) B cell epitope prediction with EpitopeVec.

3.4.1 Selecting Protein of Interest from fasta File

Selecting the correct sequence requires the unique “protein ID” of the protein(s) of interest. This ID can be found in the fasta header before the first space. For example, in Fig. 3, the protein ID is “AFS54726.1”. To select one or multiple proteins of interest, run the commands below.

Fig. 3 Example of protein fasta sequence in an NCBI protein FASTA file


    # Install seqtk
    conda create --name seqtk
    conda activate seqtk
    conda install -c bioconda seqtk
    # Make a list of protein ID(s) in a .txt file
    # Each line should contain one protein ID
    echo 'protein_ID_1' > list.txt
    echo 'protein_ID_2' >> list.txt
    # Use seqtk to extract the protein sequence(s) of interest (POIs)
    seqtk subseq surface_exposed.fasta list.txt > POIs.fasta
    # Exit seqtk conda environment
    conda deactivate

3.4.2 T Cell Epitope Prediction with MixMHC2Pred

Disclaimer: The usage of this software is only free for non-profit users. For-profit users should obtain a license. Information is available at https://github.com/GfellerLab/MixMHC2pred.

MixMHC2pred makes predictions on the level of epitopes and not entire proteins. Therefore, protein sequences should be split into epitopes of 12 amino acids. The first step is to run the epitope_generator.py script to convert the “POIs.fasta” file with the protein sequences of interest into a list of epitopes.

    # Install BioPython
    conda create --name biopython
    conda activate biopython
    conda install -c conda-forge biopython
    # Download (raw file URL) and run the epitope_generator.py script with window size 12
    wget -O epitope_generator.py https://raw.githubusercontent.com/CuypersBart/ReverseVaccinology_Tools/main/epitope_generator
    python3 epitope_generator.py -w 12 POIs.fasta > epitopes.txt
    # Exit biopython conda environment
    conda deactivate
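For readers who want to see what this splitting step amounts to, the following Python sketch cuts each protein of a FASTA file into overlapping 12-residue peptides. It is not the chapter's epitope_generator.py script; the function names are hypothetical, and the sketch assumes that consecutive peptides are shifted by one residue and that the protein ID is the part of the FASTA header before the first space (see Subheading 3.4.1).

```python
from typing import Dict, List


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {protein ID: sequence}."""
    records: Dict[str, str] = {}
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            parts = line[1:].split()
            header, chunks = (parts[0] if parts else ""), []
        else:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


def epitopes_per_protein(fasta_text: str, window: int = 12) -> Dict[str, List[str]]:
    """Return all overlapping peptides of length `window` for each protein."""
    return {
        protein_id: [seq[i:i + window] for i in range(len(seq) - window + 1)]
        for protein_id, seq in read_fasta(fasta_text).items()
    }


if __name__ == "__main__":
    demo = ">AFS54726.1 hypothetical protein\nMKTAYIAKQRQI\nSFVKSHFSRQ\n"
    for peptide in epitopes_per_protein(demo)["AFS54726.1"]:
        print(peptide)
```

Proteins shorter than the window simply yield an empty list, so they drop out of the peptide list without raising an error.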


Run MixMHC2pred:

1. Navigate to http://mixmhc2pred.gfellerlab.org.
2. Paste the epitope sequences of one protein in the top box, “Write your peptide sequences in the box below”.
3. Write the output name of the results (e.g., “results.txt”) in the second box, “Name for the results file”.
4. In the box “Enter the list of alleles for which you want to make predictions”, enter the alleles below. These are derived from [10] and have an estimated combined coverage of 98% of the world population.

    DRB1_01_01 DRB1_03_01 DRB1_04_01 DRB1_04_05 DRB1_07_01 DRB1_08_02 DRB1_09_01 DRB1_11_01 DRB1_12_01 DRB1_13_02 DRB1_15_01 DRB3_01_01 DRB3_02_02 DRB4_01_01 DRB5_01_01 DQA1_05_01__DQB1_02_01 DQA1_05_01__DQB1_03_01 DQA1_03_01__DQB1_03_02 DQA1_04_01__DQB1_04_02 DQA1_01_01__DQB1_05_01 DQA1_01_02__DQB1_06_02 DPA1_02_01__DPB1_01_01 DPA1_01_03__DPB1_02_01 DPA1_01_03__DPB1_04_01 DPA1_03_01__DPB1_04_02 DPA1_02_01__DPB1_05_01 DPA1_02_01__DPB1_14_01

5. Click “run”; this will prompt the download of the result file.

This result file contains a list of epitopes. Every epitope has received a % rank score; the best score is about 0, and the worst score is 100. The % rank score indicates, among random peptides, the percentage of peptides expected to be better presented by this allele than the given peptide. The reader can narrow down the candidates by, for example, setting a % rank threshold of maximum 2. A second selection criterion can be the recognition of the peptide across different MHC-II molecules.

3.4.3 B Cell Epitope Prediction with EpitopeVec

Install EpitopeVec by running the following code:

    # Create new conda environment with required packages
    conda create --name epitopevec
    conda activate epitopevec
    conda install -c conda-forge numpy=1.17.1 scipy=1.4.1 matplotlib=3.1.3 biopython=1.71.0 tqdm=4.15.0 gensim=3.8.3
    conda install -c anaconda scikit-learn=0.22.1
    # Create new directory and install EpitopeVec
    mkdir epitopevec
    cd epitopevec
    git clone https://github.com/hzi-bifo/epitope-prediction
    pip3 install epitope-prediction/requirement/pydpi-1.0.tar.gz
    wget http://deepbio.info/embedding_repo/sp_sequences_4mers_vec.txt
    wget http://deepbio.info/embedding_repo/sp_sequences_4mers_vec.bin
    mv sp_sequences_4mers_vec.bin ./epitope-prediction/protvec/
    mv sp_sequences_4mers_vec.txt ./epitope-prediction/protvec/

Run EpitopeVec on the “POIs.fasta” file. FASTA files with multiple proteins can be used as input.

    # Run EpitopeVec
    cd epitope-prediction
    python3 main.py -m bacterial -i ../../POIs.fasta -o ../../epitopevec_result
    # Exit EpitopeVec conda environment
    conda deactivate

This will generate the file “protein_of_interest.epitopes” in the folder “epitopevec_result”. In this file, each row represents an epitope that has passed the score cutoff (see below), and each column a property of these epitopes. The first column is the sequence ID, the second contains the amino acid sequence of the peptide, the third is the start position of the peptide in the protein sequence, and the fourth is the end position. The last column displays the predicted probability score representing the likelihood of that peptide being a B cell epitope. Epitopes with a predicted score of >= 0.5 are considered B cell epitopes. The -c (cutoff) argument can be used to specify a different score cutoff.
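The two epitope-level inclusion criteria of Subheading 3.4 (a % rank of at most 2 for MixMHC2pred and a predicted score of at least 0.5 for EpitopeVec) can be applied to already-parsed results as in the Python sketch below. The input structures and function names are hypothetical: the sketch assumes the MixMHC2pred ranks have been loaded into a nested dictionary and the EpitopeVec rows into tuples in the column order described above, and it does not parse the tools' actual output files.

```python
from typing import Dict, Iterable, List, Tuple


def presented_peptides(ranks: Dict[str, Dict[str, float]],
                       max_rank: float = 2.0) -> Dict[str, List[str]]:
    """Map each peptide to the alleles for which its % rank is <= max_rank."""
    selected = {}
    for peptide, per_allele in ranks.items():
        alleles = sorted(a for a, rank in per_allele.items() if rank <= max_rank)
        if alleles:
            selected[peptide] = alleles
    return selected


def rank_by_promiscuity(selected: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Order peptides by the number of MHC-II alleles predicted to present them."""
    counts = [(peptide, len(alleles)) for peptide, alleles in selected.items()]
    return sorted(counts, key=lambda item: (-item[1], item[0]))


def b_cell_epitopes(rows: Iterable[tuple], min_score: float = 0.5) -> List[tuple]:
    """Keep rows (ID, peptide, start, end, score) with score >= min_score."""
    return [row for row in rows if float(row[4]) >= min_score]


if __name__ == "__main__":
    demo_ranks = {
        "PEPTIDEAAAAA": {"DRB1_01_01": 0.5, "DRB1_03_01": 1.9},
        "PEPTIDEBBBBB": {"DRB1_01_01": 12.0},
    }
    print(rank_by_promiscuity(presented_peptides(demo_ranks)))
```

Sorting by the number of presenting alleles is one simple way to express the second selection criterion mentioned for MixMHC2pred, namely recognition of a peptide across different MHC-II molecules.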

4 Notes

1. Reverse vaccinology approaches must start from the compilation of a pathogen protein repertoire (aka core proteome) and then use selection or prioritization strategies to narrow down to the most promising candidates for vaccine development. The selection of the protein repertoire is at the basis of the entire RV pipeline, and it is crucial to its success. The main consideration in this step is which strains of the bacterium the researcher wants to target. The usual targets are circulating strains in a geographic region or continent (i.e., outbreaks), strains from an epidemiologically defined period of time, or strains associated with a specific pathotype (such as uropathogenic Escherichia coli strains, or strains of Klebsiella pneumoniae associated only with bloodstream infections).
2. It is also worth mentioning other specific databases dedicated to collections of bacterial strains, such as PubMLST and Enterobase.
3. Computational protein subcellular localization prediction methods are generally based on amino acid composition, known target sequences/motifs or sequence similarity, or a combination of the aforementioned.
4. For a more extensive integration of different tools, we suggest the usage of a recent tool called BUSCA [11], which integrates different tools to predict localization-related protein features (DeepSig, TPpred3 [12], PredGPI, BetAware, and ENSEMBLE 3.0) as well as tools for discriminating the subcellular localization of both globular and membrane proteins (BaCelLo, MemLoci, and SChloro).
5. While proteomics data can provide valuable insights, it is essential to note that the absence of evidence that a protein is expressed (LFQ expression value 0) is not evidence of the absence of this protein in the sample. Indeed, a single LC-MS setup typically detects only a subset of the expressed proteome because of the vast diversity in the physicochemical properties of proteins. In contrast, the detection of a protein by LC-MS does confirm that the protein is indeed expressed, and this information can be useful in narrowing down an extensive list of candidate antigens.
6. This section shows how to use public data. However, if no satisfactory dataset is available, it is also possible to generate new proteomics data for the strain of interest and skip the first step, “download proteomic dataset”. If no proteomic data is available in PRIDE, the reader could consider using transcriptomic data instead. The disadvantage of transcriptomics data is that transcript levels are not necessarily representative of protein levels. However, the advantage of mRNA sequencing over proteomics is that mRNA typically gives a much more complete picture of the transcriptome than proteomics does for the proteome. Proteins have very diverse biochemical properties, so a single LC-MS setup will only detect a limited subset of the entire proteome. mRNA-seq suffers much less from this problem. Good repositories for finding mRNA sequencing data are the NCBI Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra) and the Gene Expression Omnibus (https://www.ncbi.nlm.nih.gov/geo/). Alignment to the reference genome and transcript quantification can be achieved with the STAR aligner (https://github.com/alexdobin/STAR) [13]. Reference genomes can be obtained from the same NCBI data hub that is described for proteomes above (https://www.ncbi.nlm.nih.gov/data-hub/genome).

Acknowledgments

The authors wish to thank Christophe Lambert, Kris Laukens, and Pieter Meysman for their support with this work and insightful discussions. B.C. is supported by the University of Antwerp with a research grant from GSK.

References

1. Rappuoli R (2000) Reverse vaccinology. Curr Opin Microbiol 3(5):445–450. https://doi.org/10.1016/S1369-5274(00)00119-3
2. Dalsass M, Brozzi A, Medini D, Rappuoli R (2019) Comparison of open-source reverse vaccinology programs for bacterial vaccine antigen discovery. Front Immunol 10. https://doi.org/10.3389/fimmu.2019.00113
3. Vernikos GS (2020) A review of pangenome tools and recent studies. In: Tettelin H, Medini D (eds) The pangenome: diversity, dynamics and evolution of genomes. Springer, Cham, pp 89–112. https://doi.org/10.1007/978-3-030-38281-0_4
4. Yu NY, Wagner JR, Laird MR, Melli G, Rey S, Lo R, Dao P, Sahinalp SC, Ester M, Foster LJ, Brinkman FSL (2010) PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes. Bioinformatics (Oxford, England) 26(13):1608–1615. https://doi.org/10.1093/bioinformatics/btq249
5. Teufel F, Almagro Armenteros JJ, Johansen AR, Gíslason MH, Pihl SI, Tsirigos KD, Winther O, Brunak S, von Heijne G, Nielsen H (2022) SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat Biotechnol 40(7):1023–1025. https://doi.org/10.1038/s41587-021-01156-3
6. Racle J, Guillaume P, Schmidt J, Michaux J, Larabi A, Lau K, Perez MAS, Croce G, Genolet R, Coukos G, Zoete V, Pojer F, Bassani-Sternberg M, Harari A, Gfeller D (2022) Machine learning predictions of MHC-II specificities reveal alternative binding mode of class II epitopes. bioRxiv:2022.06.26.497561. https://doi.org/10.1101/2022.06.26.497561
7. Tyanova S, Temu T, Cox J (2016) The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat Protoc 11(12):2301–2319. https://doi.org/10.1038/nprot.2016.136
8. Koşaloğlu-Yalçın Z, Lee J, Greenbaum J, Schoenberger SP, Miller A, Kim YJ, Sette A, Nielsen M, Peters B (2022) Combined assessment of MHC binding and antigen abundance improves T cell epitope predictions. iScience 25(2):103850. https://doi.org/10.1016/j.isci.2022.103850
9. Bahai A, Asgari E, Mofrad MRK, Kloetgen A, McHardy AC (2021) EpitopeVec: linear epitope prediction using deep protein sequence embeddings. Bioinformatics (Oxford, England) 37(23):4517–4525. https://doi.org/10.1093/bioinformatics/btab467
10. Greenbaum J, Sidney J, Chung J, Brander C, Peters B, Sette A (2011) Functional classification of class II human leukocyte antigen (HLA) molecules reveals seven different supertypes and a surprising degree of repertoire sharing across supertypes. Immunogenetics 63(6):325–335. https://doi.org/10.1007/s00251-011-0513-0
11. Savojardo C, Martelli PL, Fariselli P, Profiti G, Casadio R (2018) BUSCA: an integrative web server to predict subcellular localization of proteins. Nucleic Acids Res 46(W1):W459–W466. https://doi.org/10.1093/nar/gky320
12. Savojardo C, Martelli PL, Fariselli P, Casadio R (2015) TPpred3 detects and discriminates mitochondrial and chloroplastic targeting peptides in eukaryotic proteins. Bioinformatics (Oxford, England) 31(20):3269–3275. https://doi.org/10.1093/bioinformatics/btv367
13. Dobin A, Davis CA, Schlesinger F, Drenkow J, Zaleski C, Jha S, Batut P, Chaisson M, Gingeras TR (2012) STAR: ultrafast universal RNA-seq aligner. Bioinformatics (Oxford, England) 29(1):15–21. https://doi.org/10.1093/bioinformatics/bts635

Chapter 25

Immunoinformatics Protocol to Design Multi-Epitope Subunit Vaccines

Parismita Kalita, Aditya K. Padhi, and Timir Tripathi

Abstract
With the development of scientific technologies and the accessibility of genomic data, computational tools, software, databases, and machine learning, the field of immunoinformatics has emerged as an effective approach for immunologists to design potential vaccines in a short time. A large number of tools and databases are available to screen the genome sequences of parasites/pathogens and identify the highly immunogenic peptides or epitopes that can be used to design effective vaccines. In this chapter, we provide an easy-to-use protocol for the design of multi-epitope-based subunit vaccines. Although computational immunoinformatics-based approaches have demonstrated their competency in designing potentially effective vaccine candidates quickly, the immunogenicity and safety of these candidates must be evaluated in laboratory settings before they are tested in clinical trials.

Key words Immunoinformatics, In silico vaccine design, Subunit vaccine, Epitopes, Adjuvants, Docking, Molecular dynamics simulation

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_25, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

Immunoinformatics is a rapidly growing component of current immunology research that utilizes the large volume of immunological data available from various genomic, proteomic, clinical, and epidemiological studies to help us advance our understanding of immune system processes and diseases [1, 2]. Currently, a total of 31 immunological databases are described in the Nucleic Acids Research Molecular Biology Database Collection (https://www.oxfordjournals.org/nar/database/cat/14). The high-performance computing potential of immunoinformatics, coupled with mathematical and statistical methods, offers an accelerated strategy to discover, process, analyze, and interpret the immunologically relevant data from such immunological databases. One of the critical applications of this field is to predict immunogenic epitopes for subunit or peptide-based vaccine design, which has potential translational implications [1, 3]. A multi-epitope subunit vaccine is typically composed of rationally combined B-cell (BCL), helper T-cell (HTL), and cytotoxic T-cell (CTL) epitopes derived from antigenic proteins of pathogenic organisms and is conjugated to an adjuvant to enhance its immunogenicity, allowing stimulation of the desired immune responses. BCLs produce antibodies that are responsible for antigen recognition. CTLs induce cytotoxic immune responses by releasing granzymes and perforin. The HTLs are involved in the activation of BCLs and CTLs. Immunoinformatic approaches for designing a multi-epitope subunit vaccine consist of a series of web servers, stand-alone software, and algorithms (see Notes 1 to 3). The recent advances in experimental and computational methods for peptide-based vaccine design are reviewed elsewhere [4]. This chapter provides a detailed step-by-step methodology to design a multi-epitope subunit vaccine (see Fig. 1).

2 Materials

2.1 Computational Workstation

Preferred operating systems for MD simulation are macOS and UNIX. On Windows systems, use Linux via a virtual machine. Minimum specifications include a quad-core CPU, ~8 GB RAM, a CUDA-enabled GPU card, and ~100 GB of disk space.

2.2 Software for Structure Visualization and Data Analysis

The most commonly used freely accessible visualization software includes UCSF Chimera, PyMOL, Visual Molecular Dynamics (VMD), RasMol, and Jmol. Data analysis software includes Origin, Microsoft Excel, R, xmgrace, and gnuplot.

3 Methods

3.1 Sequence Retrieval

1. To design a multi-epitope vaccine, select targetable proteins from the pathogen of interest and retrieve their amino acid sequences from the NCBI (https://www.ncbi.nlm.nih.gov) or UniProt (https://www.uniprot.org) databases.
2. Carry out an in-depth literature survey to select antigenic proteins that have a significant role in the infection and/or life cycle of the pathogen to constitute a subunit vaccine.
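The retrieval step can also be scripted. The following is a minimal sketch, not part of the published protocol: it assumes the UniProt REST endpoint `https://rest.uniprot.org/uniprotkb/<accession>.fasta`, and the function names `parse_fasta` and `fetch_uniprot_fasta` are illustrative. The record used in the demonstration is a made-up toy sequence.

```python
from urllib.request import urlopen


def parse_fasta(text):
    """Return a dict mapping each FASTA header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def fetch_uniprot_fasta(accession, timeout=30):
    """Download one UniProtKB entry as FASTA text (needs network access).

    The URL pattern is an assumption about the UniProt REST service.
    """
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Offline demonstration with a hypothetical record.
    demo = ">toy_antigen hypothetical example\nMKTAYIAKQR\nQISFVKSHFS\n"
    for name, sequence in parse_fasta(demo).items():
        print(name, len(sequence), sequence)
```

For real use, the text returned by `fetch_uniprot_fasta` for the accession of interest would be passed to `parse_fasta`, and the resulting sequences submitted to the prediction servers described in the following subheadings.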

3.2 Antigenicity Prediction

1. To determine the antigenic propensity of the proteins, submit each protein sequence individually to the antigenic peptide prediction tool web server (http://imed.med.ucm.es/Tools/antigenic.pl) [5].
2. Consider proteins with an antigenic probability score greater than 0.8 for further analysis.
3. Also, the ANTIGENpro module of the SCRATCH protein predictor (http://scratch.proteomics.ics.uci.edu/) can be used to predict the antigenicity of a protein sequence based on protein microarray analysis datasets of five different pathogens [6].

Fig. 1 Overview of the steps of multi-epitope subunit vaccine design methodology

3.3 HTL Epitope Prediction

1. To predict HTL epitopes, submit the protein sequences (in FASTA format) to the MHC-II epitope prediction module of the Immune Epitope Database (IEDB, http://tools.iedb.org/mhcii/).
2. Use a suitable prediction method, keep the default epitope length of “15”, and select the full HLA reference set, which covers the alleles that are most frequent in the worldwide population [7].
3. Select the epitopes with the lowest percentile rank and/or IC50 values, which indicate the highest affinity.
4. Provide a valid email address in the IEDB HTL epitope prediction module if you wish to have the results emailed.

3.4 IFN-γ-Inducing Epitope Prediction

1. Additionally, when designing a vaccine candidate against intracellular pathogens, check the capability of the selected epitopes to induce IFN-γ production using the IFNepitope server (http://crdd.osdd.net/raghava/ifnepitope/) (see Note 4).
2. Select the “Motif and SVM hybrid” prediction approach and the “IFN-gamma versus Non-IFN-gamma” model option of the prediction module.
3. Several peptide sequences can be submitted for prediction concurrently in this module [8].

3.5 CTL Epitope Prediction

1. To predict CTL epitopes for the screened proteins, enter the protein sequence in the sequence window of the NetCTL 1.2 server (https://services.healthtech.dtu.dk/service.php?NetCTL-1.2) and select HLA supertypes from the list of 10 HLA supertypes available.
2. Leave the remaining parameters at their defaults and sort the output by score. The HLA A2, A3, and B7 supertypes are usually selected for epitope prediction; together they cover ~88.3% of the worldwide population.
3. Select epitopes with a combined score greater than 0.75 [9].

3.6 BCL Epitope Prediction

1. To predict B-cell epitopes for the proteins, use the ABCPred server (http://crdd.osdd.net/raghava/abcpred/) [10], which predicts epitopes with 65.93% accuracy using recurrent artificial neural networks.
2. The generated epitopes are ranked according to their scores (between 0 and 1).
3. Select the epitopes with a high score for further analysis.
4. BCPREDS (http://ailab.ist.psu.edu/bcpred/predict.html) is another tool regularly used for BCL epitope prediction [11].
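Once the HTL, CTL, and BCL predictions are in hand, the shortlisting can be scripted. The sketch below is illustrative only: it applies a score threshold (for example, 0.75 for the NetCTL combined score) and then greedily keeps the highest-scoring epitopes that do not overlap within the same parent protein, in the spirit of Note 2. The `Epitope` record, the function name, and the toy peptides are assumptions. For outputs where lower is better (such as the IEDB percentile rank), the score would need to be inverted first.

```python
from typing import NamedTuple


class Epitope(NamedTuple):
    sequence: str
    start: int    # 1-based start position in the parent protein
    score: float  # predictor score; higher is assumed to be better


def select_epitopes(candidates, threshold, max_count=None):
    """Keep epitopes scoring above threshold, best first, skipping overlaps."""
    ranked = sorted(
        (e for e in candidates if e.score > threshold),
        key=lambda e: e.score,
        reverse=True,
    )
    chosen = []
    for ep in ranked:
        ep_end = ep.start + len(ep.sequence) - 1
        # Two intervals overlap when each starts before the other ends.
        overlaps = any(
            ep.start <= c.start + len(c.sequence) - 1 and c.start <= ep_end
            for c in chosen
        )
        if not overlaps:
            chosen.append(ep)
        if max_count is not None and len(chosen) == max_count:
            break
    return chosen


if __name__ == "__main__":
    # Hypothetical predictor output for a single protein.
    toy = [
        Epitope("MKTAYIAKQ", 10, 1.20),
        Epitope("KTAYIAKQR", 11, 0.95),  # overlaps the first peptide
        Epitope("QISFVKSHF", 40, 0.80),
        Epitope("SRQLEERLG", 60, 0.60),  # below the threshold
    ]
    for ep in select_epitopes(toy, threshold=0.75):
        print(ep.sequence, ep.start, ep.score)
```

The greedy choice favors the single best epitope in each overlapping cluster; other strategies (for example, maximizing the number of retained epitopes) are equally defensible.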

3.7 Toxicity Prediction

1. Prior to the multi-epitope vaccine design, examine the selected HTL, CTL, and BCL epitopes for their toxic/non-toxic nature using the ToxinPred module (https://webs.iiitd.edu.in/raghava/toxinpred/multi_submit.php).
2. Use the SVM (Swiss-Prot)-based method and select the epitopes that are predicted to be non-toxic, i.e., those with scores below zero [12].

3.8 Multi-Epitope Subunit Vaccine Design

1. Design a multi-epitope subunit vaccine by joining the selected HTL, CTL, and BCL epitopes using GPGPG, AAY, and KK linkers, respectively. The linkers ensure adequate separation of the epitopes in vivo for better recognition by the receptor (see Fig. 1) and give the amino acid residues the flexibility needed for proper folding.
2. Consider rearranging the selected epitope blocks to generate a vaccine construct with the highest possible immunogenicity score. For example, try orders such as BCL–HTL–CTL, BCL–CTL–HTL, CTL–HTL–BCL, CTL–BCL–HTL, HTL–BCL–CTL, or HTL–CTL–BCL.
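The assembly step lends itself to a short script. The sketch below joins epitopes of each class with the linkers named in step 1 and enumerates the six block orders from step 2. One detail is an assumption, since the protocol does not specify it: at a junction between two blocks, the linker of the incoming block is used. The function names and the toy peptides are hypothetical.

```python
from itertools import permutations

# Linkers from step 1: one per epitope class.
LINKERS = {"HTL": "GPGPG", "CTL": "AAY", "BCL": "KK"}


def build_construct(epitopes_by_type, order):
    """Concatenate epitope blocks in the given order, inserting linkers.

    Assumption: each epitope is preceded by the linker of its own class,
    so block junctions use the linker of the incoming block.
    """
    parts = []
    for kind in order:
        for epitope in epitopes_by_type[kind]:
            if parts:
                parts.append(LINKERS[kind])
            parts.append(epitope)
    return "".join(parts)


def all_block_orders(epitopes_by_type):
    """Return {order label: construct sequence} for every block ordering."""
    return {
        "-".join(order): build_construct(epitopes_by_type, order)
        for order in permutations(epitopes_by_type)
    }


if __name__ == "__main__":
    # Toy peptides, chosen only to make the linkers easy to spot.
    toy = {"HTL": ["WWWW", "MMMM"], "CTL": ["FFFF"], "BCL": ["QQQQ", "SSSS"]}
    for label, sequence in all_block_orders(toy).items():
        print(label, sequence)
```

Each candidate sequence produced this way can then be scored as described under Subheading 3.10 to pick the ordering with the highest predicted immunogenicity.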

3.9 Adjuvant Selection

1. The addition of an adjuvant enhances the immunogenicity of the vaccine candidate.
2. Select a suitable adjuvant based on the literature available and retrieve its amino acid sequence. For example, the agonists of different toll-like receptors (TLRs) are commonly used as adjuvants in subunit vaccines to facilitate receptor recognition.
3. Link the adjuvant sequence to the N-terminus of the multi-epitope vaccine sequence using the helix-forming EAAAK linker.
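Step 3 amounts to a simple concatenation, sketched below with a basic sanity check that the final construct contains only the 20 standard amino acid letters. The function name and the toy sequences are illustrative assumptions; a real adjuvant sequence would come from the literature, as described in step 2.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def attach_adjuvant(adjuvant, multi_epitope, linker="EAAAK"):
    """Place the adjuvant at the N-terminus, separated by the EAAAK linker."""
    construct = adjuvant.upper() + linker + multi_epitope.upper()
    unexpected = sorted(set(construct) - STANDARD_RESIDUES)
    if unexpected:
        raise ValueError(f"non-standard residue letters: {unexpected}")
    return construct


if __name__ == "__main__":
    # Hypothetical three-residue "adjuvant" and a toy epitope block.
    print(attach_adjuvant("MKV", "WWWWGPGPGMMMM"))
```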

3.10 Immunogenicity Prediction

1. It is important to confirm that the designed vaccine candidate can induce a humoral and/or cell-mediated immune response.
2. Determine the immunogenicity of the vaccine using the VaxiJen server (http://www.ddg-pharmfac.net/vaxijen/VaxiJen/VaxiJen.html).
3. Select the target organism and an output format.
4. Select the sequence with a prediction score higher than the threshold value for that particular target organism model [13].

3.11 Allergenicity Prediction

1. It is crucial to ensure that the designed vaccine is non-allergenic in nature. Test the allergenicity of the vaccine candidate using AllerTOP v2.0 (http://www.ddg-pharmfac.net/AllerTOP/) and/or the AlgPred server (http://crdd.osdd.net/raghava/algpred/).
2. The AllerTOP predictor is trained on a dataset containing 2427 known allergens and 2427 non-allergens from different species.
3. The AlgPred server has five different prediction methods that can be selected individually or in combination to predict the allergenicity of a protein sequence [14, 15].

3.12 Determination of Physicochemical Properties

1. Determine the physicochemical characteristics of the vaccine candidate using the ProtParam tool of the ExPASy server (http://web.expasy.org/protparam/). This tool reports several properties, including overall molecular weight, isoelectric point (pI), extinction coefficient, instability index, aliphatic index, and the half-life of the vaccine construct.
2. To check the solubility of the vaccine candidate upon overexpression, use the Protein-Sol server (https://protein-sol.manchester.ac.uk/) [16].
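Two of the properties listed in step 1 are easy to cross-check locally. The sketch below is a simplified stand-in rather than the ProtParam implementation: it computes the average molecular weight from commonly tabulated average residue masses and the aliphatic index from the standard formula AI = X(Ala) + 2.9·X(Val) + 3.9·[X(Ile) + X(Leu)], where X is the mole percent of each residue. The function names are illustrative, and the values should be verified against the server output.

```python
from collections import Counter

# Commonly tabulated average residue masses in daltons (amino acid minus water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added back for the free termini


def molecular_weight(sequence):
    """Average molecular weight (Da) of an unmodified linear peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER_MASS


def aliphatic_index(sequence):
    """Aliphatic index from the mole percentages of Ala, Val, Ile, and Leu."""
    sequence = sequence.upper()
    counts = Counter(sequence)

    def mole_percent(aa):
        return 100.0 * counts[aa] / len(sequence)

    return (
        mole_percent("A")
        + 2.9 * mole_percent("V")
        + 3.9 * (mole_percent("I") + mole_percent("L"))
    )


if __name__ == "__main__":
    toy = "MKVEAAAKWWWWGPGPGMMMM"  # hypothetical construct
    print(f"MW = {molecular_weight(toy):.1f} Da")
    print(f"Aliphatic index = {aliphatic_index(toy):.1f}")
```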

3.13 Structure Prediction and Validation

1. Predict the tertiary structure of the vaccine construct using a suitable structure prediction tool/server. RaptorX (http://raptorx.uchicago.edu/about/) is a freely accessible server often used for tertiary structure prediction [17].
2. Prior to docking studies, validate the modeled structure of the vaccine construct using the PROCHECK v.3.5 (https://servicesn.mbi.ucla.edu/PROCHECK/) and ProSA (https://prosa.services.came.sbg.ac.at/prosa.php) web servers, which generate a Ramachandran plot and a Z-score plot, respectively.
3. Given the recent advances in protein structure prediction, it is strongly advisable to predict the structures using advanced approaches such as AlphaFold [18] and RoseTTAFold [19].

3.14 Disulfide Engineering

1. Optionally, the stability of the vaccine structure can be further enhanced by introducing disulfide bonds in silico using the “Disulfide by Design” server v2.0 (http://cptweb.cpt.wayne.edu/DbD2/), in which selected amino acid residues are mutated to cysteine to allow disulfide bond formation.
2. Follow the user guide (http://cptweb.cpt.wayne.edu/DbD2/help.php) to identify residues for mutagenesis. Once identified, replace those residues in the original construct and prepare a PDB file of the mutated construct (see Subheading 3.13) for use in docking studies [20].

3.15 Molecular Docking with Receptor

1. Molecular docking is performed to understand the binding interactions between the receptor and the vaccine. For vaccine-receptor docking, retrieve the PDB file of the receptor.
2. Using the structures of the receptor and the vaccine construct (both in .pdb format), perform docking with a suitable docking tool or server, such as PatchDock (http://bioinfo3d.cs.tau.ac.il/FireDock/php.php) or ClusPro (https://cluspro.bu.edu/login.php).
3. Select the best complex structure from among the outputs based on the binding free energy of the complex formation [21, 22].

3.16 Molecular Dynamics Simulations

1. Once the vaccine-receptor complex is available, molecular dynamics (MD) simulation is carried out to understand the binding interactions, dynamics, and stability of the complex. Packages like GROMACS [23], Amber [24], CHARMM [25], NAMD [26], and Desmond [27, 28] can be used for this purpose.
2. Hydrogen atoms are first added to the complex structure, which is then immersed in a water box of the desired shape and size (see Note 5). Choose a suitable water model from the widely used ones, such as TIP3P, TIP4P, SPC, and SPC/E, and place the complex in the center of the box at an adequate distance from the box edge (e.g., 1 nm).
3. The solvated complex is next electrostatically neutralized by adding counterions (typically sodium or chloride ions). An appropriate biomolecular force field, such as GROMOS [29], AMBER [30], CHARMM [31], OPLS [32], or COMPASS [33, 34], is applied during this stage (see Note 6). Each force field is optimized for, and hence most compatible with, specific water models (see Table 1).
4. Following this crucial step, energy minimization is performed for a certain number of steps until the potential energy is negative and on the order of 10⁵–10⁶ in magnitude (depending on the system size and the number of water molecules).
5. After energy minimization, an equilibration step is performed, consisting of a canonical (NVT) ensemble and an isothermal–isobaric (NPT) ensemble and encompassing heating from 0 to 300 K (see Note 7).
6. In the final stage, the production run is carried out under physiological conditions with appropriate temperature and pressure (supply suitable temperature and pressure couplings). The complex should be simulated under periodic boundary conditions to treat and calculate the long-range electrostatic forces.
7. The trajectory generated by the production run is then used for several structural and dynamic analyses.
8. A detailed step-by-step MD simulation method for the vaccine-receptor complex is shown in Fig. 2, where the GROMACS modules and other important tools/servers are outlined for each step.

Table 1 A list of widely used biomolecular force fields with the most compatible water models

Force field    Water model
GROMOS         SPC, SPC/E
AMBER          TIP3P, TIP4P
CHARMM         TIP3P
OPLS           TIP3P, TIP4P

Fig. 2 A detailed step-by-step protocol of the MD simulation of a vaccine-receptor complex, with the GROMACS modules and other important tools/servers used at each step
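As a rough illustration of how the preparation stages (topology generation, box definition, solvation, neutralization, and energy minimization) map onto GROMACS modules, the sketch below assembles the corresponding command lines without running them. It is not a transcription of Fig. 2. The file names, the default box type, and the helper name `build_prep_commands` are assumptions, as are the parameter files `ions.mdp` and `em.mdp`, which must be written separately. The force-field/water pairing is checked against Table 1.

```python
# Water models listed in Table 1 for each force-field family,
# written with the names accepted by `gmx pdb2gmx -water`.
COMPATIBLE_WATER = {
    "GROMOS": {"spc", "spce"},
    "AMBER": {"tip3p", "tip4p"},
    "CHARMM": {"tip3p"},
    "OPLS": {"tip3p", "tip4p"},
}


def build_prep_commands(pdb_file, family, ff_name, water,
                        box_type="dodecahedron", edge_nm=1.0):
    """Return GROMACS command lines for system preparation up to minimization."""
    if water not in COMPATIBLE_WATER[family]:
        raise ValueError(f"{water} is not listed for {family} in Table 1")
    # Four-site water needs its own pre-equilibrated solvent box.
    solvent_box = "tip4p.gro" if water == "tip4p" else "spc216.gro"
    top = "topol.top"
    return [
        # Add hydrogens and build the topology with the chosen force field.
        ["gmx", "pdb2gmx", "-f", pdb_file, "-o", "processed.gro",
         "-p", top, "-ff", ff_name, "-water", water],
        # Center the complex, leaving edge_nm between solute and box edge.
        ["gmx", "editconf", "-f", "processed.gro", "-o", "boxed.gro",
         "-c", "-d", str(edge_nm), "-bt", box_type],
        ["gmx", "solvate", "-cp", "boxed.gro", "-cs", solvent_box,
         "-o", "solvated.gro", "-p", top],
        ["gmx", "grompp", "-f", "ions.mdp", "-c", "solvated.gro",
         "-p", top, "-o", "ions.tpr"],
        # genion prompts interactively for the group to replace (usually SOL).
        ["gmx", "genion", "-s", "ions.tpr", "-o", "ionized.gro",
         "-p", top, "-pname", "NA", "-nname", "CL", "-neutral"],
        ["gmx", "grompp", "-f", "em.mdp", "-c", "ionized.gro",
         "-p", top, "-o", "em.tpr"],
        ["gmx", "mdrun", "-v", "-deffnm", "em"],
    ]


if __name__ == "__main__":
    for command in build_prep_commands(
            "complex.pdb", "AMBER", "amber99sb-ildn", "tip3p"):
        print(" ".join(command))
```

Returning argument lists rather than running the commands keeps the sketch inspectable; each list could be passed to `subprocess.run` once the input files exist. Equilibration and production follow the same `grompp`/`mdrun` pattern with their own parameter files.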

3.17 MD Simulation-Based Analyses

1. The MD-simulated trajectory of the receptor-vaccine complex is then analyzed for various physicochemical and structural-dynamic parameters [35–37]. For instance, root mean square deviation (RMSD), root mean square fluctuation (RMSF), and the radius of gyration (Rg), which represent the stability, per-residue flexibility, and compactness of the complex, respectively, are usually analyzed first [38, 39].
2. The intermolecular interactions formed between the receptor and the vaccine provide a good indication of the residue pairs involved in binding and of the stability, flexibility, and compactness of the system in general. Such intermolecular interactions can be computed using several tools, web servers, and standalone modeling programs. Some widely used ones that can compute such interactions from static snapshots extracted from MD simulations are the Molecular Operating Environment (MOE) software [40], the PRODIGY protein–protein module [41, 42], and the Arpeggio server [43]. For the complete trajectory, the built-in tools of the MD engines are most widely used. However, some external programs and tools, such as Visual Molecular Dynamics (VMD) [44], UCSF Chimera [45], and GetContacts (https://getcontacts.github.io/), can prove very handy.
3. Other interactions, including intermolecular and intramolecular hydrogen bonds and salt bridge interactions, are analyzed from the trajectory to obtain a comprehensive picture of all kinds of interactions.
4. Further, it is useful to perform essential dynamics analysis [46] to identify the correlated atomic motions of the complex from the MD simulations. For this purpose, principal component analysis (PCA) is performed on the covariance matrix of atomic fluctuations. Its eigenvectors describe the collective motions of atoms, which are governed by the secondary structural elements, and its eigenvalues give the amplitude of the motion along each eigenvector. The trace of the covariance matrix reflects the overall flexibility of the complex, so a lower trace typically indicates that the complex remained more compact during the simulations [38].
5. Finally, depending on the study objective, various analyses can be performed from the simulations to demonstrate the overall dynamics of the complex.
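The three quantities named in step 1 have simple definitions, sketched here in plain Python for a trajectory stored as a list of frames, each frame being a list of (x, y, z) coordinates. This is a teaching sketch under stated assumptions: frames are taken to be already superimposed on the reference (no fitting is performed), all atoms are weighted equally, and the function names are illustrative. In practice, the built-in tools of the MD engine would be used. The last function uses the identity that the trace of the positional covariance matrix equals the sum of squared per-atom RMSF values.

```python
import math


def rmsd(frame, reference):
    """RMSD between two conformations; assumes they are already fitted."""
    squared = sum(
        (a - b) ** 2
        for atom, ref_atom in zip(frame, reference)
        for a, b in zip(atom, ref_atom)
    )
    return math.sqrt(squared / len(reference))


def radius_of_gyration(frame):
    """Unweighted radius of gyration (every atom treated as equal mass)."""
    n_atoms = len(frame)
    center = [sum(atom[i] for atom in frame) / n_atoms for i in range(3)]
    squared = sum((atom[i] - center[i]) ** 2 for atom in frame for i in range(3))
    return math.sqrt(squared / n_atoms)


def rmsf(trajectory):
    """Per-atom RMSF about each atom's time-averaged position."""
    n_frames = len(trajectory)
    values = []
    for j in range(len(trajectory[0])):
        mean = [sum(f[j][i] for f in trajectory) / n_frames for i in range(3)]
        squared = sum(
            (f[j][i] - mean[i]) ** 2 for f in trajectory for i in range(3)
        )
        values.append(math.sqrt(squared / n_frames))
    return values


def covariance_trace(trajectory):
    """Trace of the positional covariance matrix (sum of squared RMSF)."""
    return sum(value ** 2 for value in rmsf(trajectory))


if __name__ == "__main__":
    # Two-atom, two-frame toy trajectory (arbitrary length units).
    reference = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    shifted = [(0.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
    trajectory = [reference, shifted]
    print("RMSD:", rmsd(shifted, reference))
    print("Rg:", radius_of_gyration(reference))
    print("RMSF:", rmsf(trajectory))
    print("Covariance trace:", covariance_trace(trajectory))
```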

3.18 Codon Adaptation and In Silico Cloning

1. Codon adaptation matches the codon usage of the construct to that of the heterologous expression host, which influences gene expression. It is therefore essential to plan a cloning and expression strategy in this approach to ensure the scalability of vaccine production in times of an outbreak.
2. Generate the nucleotide sequence of the vaccine construct using the Sequence Manipulation Suite: Reverse Translate (http://www.bioinformatics.org/sms2/rev_trans.html).
3. To test high-level expression of the vaccine construct in an expression system, perform codon optimization of the nucleotide sequence using the Java Codon Adaptation Tool (JCat) (http://www.jcat.de/) [47]. The desired codon adaptation index (CAI) value for a sequence should range from 0.8 to 1.0, and the GC content should be between 30% and 70%.
4. To select restriction enzyme cleavage sites for cloning the vaccine sequence in expression vectors, use the NEBcutter (http://nc2.neb.com/NEBcutter2/) web tool.
5. Use an in silico cloning tool such as SnapGene (https://www.snapgene.com/guides/simulate-restriction-cloning) or DNASTAR Lasergene (https://www.dnastar.com/workflows/clone-sequence-verification/) to check the fidelity of the designed vaccine sequence for cloning and to create the final expression vector sequence with the vaccine sequence as the insert of interest.
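The acceptance criteria in step 3 and the site check in step 4 can be verified locally on the optimized sequence. The sketch below is a simplified stand-in for JCat and NEBcutter, not their implementation. It computes the GC content, the CAI as the geometric mean of per-codon relative adaptiveness values, and the positions of chosen recognition sites within the insert. The function names and the weight table in the demonstration are hypothetical; real weights must be derived from the codon usage of the expression host.

```python
import math


def gc_content(dna):
    """Percentage of G and C bases in the sequence."""
    dna = dna.upper()
    return 100.0 * (dna.count("G") + dna.count("C")) / len(dna)


def codon_adaptation_index(dna, weights):
    """Geometric mean of relative adaptiveness over the codons found in weights.

    Codons missing from the table (for example, Met and Trp, which have a
    single codon) are skipped.
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    codons = [dna[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no codon of the sequence is present in the weight table")
    return math.exp(sum(logs) / len(logs))


def find_sites(dna, sites):
    """Return 0-based start positions of each recognition sequence."""
    dna = dna.upper()
    return {
        name: [i for i in range(len(dna)) if dna.startswith(motif, i)]
        for name, motif in sites.items()
    }


if __name__ == "__main__":
    toy_weights = {"GCT": 1.0, "GCC": 0.25}  # hypothetical values
    insert = "GCTGCCGCTGCT"
    print("GC %:", gc_content(insert))
    print("CAI:", codon_adaptation_index(insert, toy_weights))
    # Sites chosen for cloning should be absent from the insert itself.
    print(find_sites(insert, {"EcoRI": "GAATTC", "BamHI": "GGATCC"}))
```

An insert that passes the CAI and GC checks but contains one of the selected restriction sites internally would be cut during cloning, so the site search is worth running before the in silico cloning in step 5.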

4 Notes

1. Other than the prediction methods mentioned here, several alternative prediction methods are also available. The user guidelines and the detailed algorithms used in the design and development of these methods are available on the respective web pages of the web servers mentioned here.
2. The epitope predictors often return epitopes with overlapping sequences. Consider selecting epitopes that do not overlap to maintain heterogeneity and create a more diverse vaccine construct.
3. The immune response profile of the designed vaccine can be evaluated by performing computational immune simulations using the C-ImmSim server (https://kraken.iac.rm.cnr.it/C-IMMSIM/index.php?page=1). This server gives options to choose the injection intervals, duration, etc. (This is an optional step since this tool does not guarantee the accuracy and reliability of the generated results.)
4. Prediction of IFN-γ-inducing epitopes is not required while designing a vaccine candidate for pathogens (such as extracellular bacteria or protozoans) that do not require the induction of a Th1 response.
5. Several water boxes of different shapes and sizes are available. The most common ones are the standard (1) cubical, (2) rectangular box, (3) truncated octahedron, (4) hexagonal, (5) prism, and (6) rhombic dodecahedron types. The shape of the box is decided based on the shape and size of the complex (i.e., the solute). For example, a cubical box or a truncated octahedron, which approximates a sphere, can be used for globular proteins; the shape and size of the box can be inspected using any visualization program, such as VMD or UCSF Chimera.
6. Universal force fields such as GROMOS, AMBER, and CHARMM are often used for the simulations of biomolecules, while OPLS and COMPASS were originally developed for the simulations of condensed matter. However, all these force fields are continuously being developed for more accurate estimation and representation of biomolecules, physicochemical features, and simulation environments. For the simulation of the vaccine-receptor complex, a recent version of a force field such as GROMOS, AMBER, or CHARMM can be employed.
7. Statistical ensembles and the choice of parameters play an essential role in the simulations of the vaccine-receptor complex. For instance, a suitable thermostat (e.g., a modified Berendsen thermostat) should be used in the canonical ensemble to maintain the system temperature. Similarly, in the isothermal–isobaric ensemble, an appropriate thermostat, as well as a barostat, should be used. A widely used barostat is the “Parrinello-Rahman” barostat.

References

1. Tong JC, Ren EC (2009) Immunoinformatics: current trends and future directions. Drug Discov Today 14(13–14):684–689. https://doi.org/10.1016/j.drudis.2009.04.001
2. Oli AN, Obialor WO, Ifeanyichukwu MO, Odimegwu DC, Okoyeh JN, Emechebe GO, Adejumo SA, Ibeanu GC (2020) Immunoinformatics and vaccine development: an overview. Immunotargets Ther 9:13–30. https://doi.org/10.2147/itt.S241064
3. Tomar N, De RK (2010) Immunoinformatics: an integrated scenario. Immunology 131(2):153–168. https://doi.org/10.1111/j.1365-2567.2010.03330.x
4. Kalita P, Tripathi T (2022) Methodological advances in the design of peptide-based vaccines. Drug Discov Today 27(5):1367–1380. https://doi.org/10.1016/j.drudis.2022.03.004
5. Kolaskar AS, Tongaonkar PC (1990) A semi-empirical method for prediction of antigenic determinants on protein antigens. FEBS Lett 276(1–2):172–174. https://doi.org/10.1016/0014-5793(90)80535-q
6. Magnan CN, Zeller M, Kayala MA, Vigil A, Randall A, Felgner PL, Baldi P (2010) High-throughput prediction of protein antigenicity using protein microarray data. Bioinformatics 26(23):2936–2943. https://doi.org/10.1093/bioinformatics/btq551
7. Wang P, Sidney J, Dow C, Mothé B, Sette A, Peters B (2008) A systematic assessment of MHC class II peptide binding predictions and evaluation of a consensus approach. PLoS Comput Biol 4(4):e1000048. https://doi.org/10.1371/journal.pcbi.1000048
8. Dhanda SK, Vir P, Raghava GP (2013) Designing of interferon-gamma inducing MHC class-II binders. Biol Direct 8:30. https://doi.org/10.1186/1745-6150-8-30
9. Larsen MV, Lundegaard C, Lamberth K, Buus S, Lund O, Nielsen M (2007) Large-scale validation of methods for cytotoxic T-lymphocyte epitope prediction. BMC Bioinformatics 8:424. https://doi.org/10.1186/1471-2105-8-424
10. Saha S, Raghava GP (2006) Prediction of continuous B-cell epitopes in an antigen using recurrent neural network. Proteins 65(1):40–48. https://doi.org/10.1002/prot.21078
11. Jespersen MC, Peters B, Nielsen M, Marcatili P (2017) BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes. Nucleic Acids Res 45(W1):W24–W29. https://doi.org/10.1093/nar/gkx346

12. Gupta S, Kapoor P, Chaudhary K, Gautam A, Kumar R, Raghava GP (2013) In silico approach for predicting toxicity of peptides and proteins. PLoS One 8(9):e73957. https://doi.org/10.1371/journal.pone.0073957
13. Doytchinova IA, Flower DR (2007) VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines. BMC Bioinformatics 8:4. https://doi.org/10.1186/1471-2105-8-4
14. Sharma N, Patiyal S, Dhall A, Pande A, Arora C, Raghava GPS (2021) AlgPred 2.0: an improved method for predicting allergenic proteins and mapping of IgE epitopes. Brief Bioinform 22(4). https://doi.org/10.1093/bib/bbaa294
15. Dimitrov I, Flower DR, Doytchinova I (2013) AllerTOP–a server for in silico prediction of allergens. BMC Bioinformatics 14(Suppl 6):S4. https://doi.org/10.1186/1471-2105-14-s6-s4
16. Hebditch M, Carballo-Amador MA, Charonis S, Curtis R, Warwicker J (2017) Protein-Sol: a web tool for predicting protein solubility from sequence. Bioinformatics 33(19):3098–3100. https://doi.org/10.1093/bioinformatics/btx345
17. Peng J, Xu J (2011) RaptorX: exploiting structure information for protein alignment by statistical inference. Proteins 79(Suppl 10):161–171. https://doi.org/10.1002/prot.23175
18. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596(7873):583–589. https://doi.org/10.1038/s41586-021-03819-2
19. Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, Wang J, Cong Q, Kinch LN, Schaeffer RD, Millán C, Park H, Adams C, Glassman CR, DeGiovanni A, Pereira JH, Rodrigues AV, van Dijk AA, Ebrecht AC, Opperman DJ, Sagmeister T, Buhlheller C, Pavkov-Keller T, Rathinaswamy MK, Dalwadi U, Yip CK, Burke JE, Garcia KC, Grishin NV, Adams PD, Read RJ, Baker D (2021) Accurate prediction of protein structures and interactions using a three-track neural network. Science 373(6557):871–876. https://doi.org/10.1126/science.abj8754
20. Craig DB, Dombkowski AA (2013) Disulfide by Design 2.0: a web-based tool for disulfide engineering in proteins. BMC Bioinformatics 14:346. https://doi.org/10.1186/1471-2105-14-346
21. Comeau SR, Gatchell DW, Vajda S, Camacho CJ (2004) ClusPro: a fully automated algorithm for protein-protein docking. Nucleic Acids Res 32(Web Server issue):W96–W99. https://doi.org/10.1093/nar/gkh354
22. Schneidman-Duhovny D, Inbar Y, Nussinov R, Wolfson HJ (2005) PatchDock and SymmDock: servers for rigid and symmetric docking. Nucleic Acids Res 33(Web Server issue):W363–W367. https://doi.org/10.1093/nar/gki481
23. Abraham MJ, Murtola T, Schulz R, Páll S, Smith JC, Hess B, Lindahl E (2015) GROMACS: high performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1–2:19–25. https://doi.org/10.1016/j.softx.2015.06.001
24. Salomon-Ferrer R, Case DA, Walker RC (2013) An overview of the Amber biomolecular simulation package. WIREs Comput Mol Sci 3(2):198–210. https://doi.org/10.1002/wcms.1121
25. Brooks BR, Brooks CL 3rd, Mackerell AD Jr, Nilsson L, Petrella RJ, Roux B, Won Y, Archontis G, Bartels C, Boresch S, Caflisch A, Caves L, Cui Q, Dinner AR, Feig M, Fischer S, Gao J, Hodoscek M, Im W, Kuczera K, Lazaridis T, Ma J, Ovchinnikov V, Paci E, Pastor RW, Post CB, Pu JZ, Schaefer M, Tidor B, Venable RM, Woodcock HL, Wu X, Yang W, York DM, Karplus M (2009) CHARMM: the biomolecular simulation program. J Comput Chem 30(10):1545–1614. https://doi.org/10.1002/jcc.21287
26. Phillips JC, Braun R, Wang W, Gumbart J, Tajkhorshid E, Villa E, Chipot C, Skeel RD, Kalé L, Schulten K (2005) Scalable molecular dynamics with NAMD. J Comput Chem 26(16):1781–1802. https://doi.org/10.1002/jcc.20289
27. Bowers KJ, Dror RO, Shaw DE (2007) Zonal methods for the parallel execution of range-limited N-body simulations. J Comput Phys 221(1):303–329. https://doi.org/10.1016/j.jcp.2006.06.014
28. Lippert RA, Bowers KJ, Dror RO, Eastwood MP, Gregersen BA, Klepeis JL, Kolossvary I, Shaw DE (2007) A common, avoidable source of error in molecular dynamics integrators. J Chem Phys 126(4):046101. https://doi.org/10.1063/1.2431176
29. Christen M, Hünenberger PH, Bakowies D, Baron R, Bürgi R, Geerke DP, Heinz TN, Kastenholz MA, Kräutler V, Oostenbrink C, Peter C, Trzesniak D, van Gunsteren WF (2005) The GROMOS software for biomolecular simulation: GROMOS05. J Comput Chem 26(16):1719–1751. https://doi.org/10.1002/jcc.20303
30. Ponder JW, Case DA (2003) Force fields for protein simulations. Adv Protein Chem 66:27–85. https://doi.org/10.1016/s0065-3233(03)66002-x
31. Vanommeslaeghe K, Hatcher E, Acharya C, Kundu S, Zhong S, Shim J, Darian E, Guvench O, Lopes P, Vorobyov I, Mackerell AD Jr (2010) CHARMM general force field: a force field for drug-like molecules compatible with the CHARMM all-atom additive biological force fields. J Comput Chem 31(4):671–690. https://doi.org/10.1002/jcc.21367
32. Jorgensen WL, Maxwell DS, Tirado-Rives J (1996) Development and testing of the OPLS all-atom force field on conformational energetics and properties of organic liquids. J Am Chem Soc 118(45):11225–11236. https://doi.org/10.1021/ja9621760
33. Sun H, Ren P, Fried JR (1998) The COMPASS force field: parameterization and validation for phosphazenes. Comput Theor Polym Sci 8(1):229–246. https://doi.org/10.1016/S1089-3156(98)00042-7
34. Sun H (1998) COMPASS: an ab initio force-field optimized for condensed-phase applications overview with details on alkane and benzene compounds. J Phys Chem B 102(38):7338–7364. https://doi.org/10.1021/jp980939v
35. Shukla R, Tripathi T (2020) Molecular dynamics simulation of protein and protein-ligand complexes. In: Singh DB (ed) Computer-aided drug design. Springer, Singapore, pp 133–161. https://doi.org/10.1007/978-981-15-6815-2_7
36. Shukla R, Tripathi T (2021) Molecular dynamics simulation in drug discovery: opportunities and challenges. In: Singh SK (ed) Innovations and implementations of drug discovery strategies in rational drug design. Springer, Singapore, pp 295–316. https://doi.org/10.1007/978-981-15-8936-2_12
37. Padhi AK, Janežič M, Zhang KYJ (2022) Molecular dynamics simulations: principles, methods, and applications in protein conformational dynamics. In: Tripathi T, Dubey VK (eds) Advances in protein molecular and structural biology methods, 1st edn. Academic Press, pp 439–454. https://doi.org/10.1016/B978-0-323-90264-9.00026-X
38. Kalita P, Padhi AK, Zhang KYJ, Tripathi T (2020) Design of a peptide-based subunit vaccine against novel coronavirus SARS-CoV-2. Microb Pathog 145:104236. https://doi.org/10.1016/j.micpath.2020.104236
39. Kalita P, Lyngdoh DL, Padhi AK, Shukla H, Tripathi T (2019) Development of multi-epitope driven subunit vaccine against Fasciola gigantica using immunoinformatics approach. Int J Biol Macromol 138:224–233. https://doi.org/10.1016/j.ijbiomac.2019.07.024
40. Chemical Computing Group Inc (2016) Molecular operating environment (MOE). Chemical Computing Group Inc, 1010 Sherbrooke St. West, Suite #910, Montreal, Quebec, Canada
41. Vangone A, Bonvin AM (2015) Contacts-based prediction of binding affinity in protein-protein complexes. eLife 4:e07454. https://doi.org/10.7554/eLife.07454
42. Xue LC, Rodrigues JP, Kastritis PL, Bonvin AM, Vangone A (2016) PRODIGY: a web server for predicting the binding affinity of protein-protein complexes. Bioinformatics 32(23):3676–3678. https://doi.org/10.1093/bioinformatics/btw514
43. Jubb HC, Higueruelo AP, Ochoa-Montaño B, Pitt WR, Ascher DB, Blundell TL (2017) Arpeggio: a web server for calculating and visualising interatomic interactions in protein structures. J Mol Biol 429(3):365–371. https://doi.org/10.1016/j.jmb.2016.12.004
44. Humphrey W, Dalke A, Schulten K (1996) VMD: visual molecular dynamics. J Mol Graph 14(1):33–38, 27–28. https://doi.org/10.1016/0263-7855(96)00018-5
45. Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE (2004) UCSF Chimera–a visualization system for exploratory research and analysis. J Comput Chem 25(13):1605–1612. https://doi.org/10.1002/jcc.20084
46. Amadei A, Linssen ABM, Berendsen HJC (1993) Essential dynamics of proteins. Proteins 17(4):412–425. https://doi.org/10.1002/prot.340170408
47. Grote A, Hiller K, Scheer M, Münch R, Nörtemann B, Hempel DC, Jahn D (2005) JCat: a novel tool to adapt codon usage of a target gene to its potential expression host. Nucleic Acids Res 33(Web Server issue):W526–W531. https://doi.org/10.1093/nar/gki376

Chapter 26

In Silico Structure-Based Vaccine Design

Sakshi Piplani, David Winkler, Yoshikazu Honda-Okubo, Varun Khanna, and Nikolai Petrovsky

Abstract

Structure-based vaccine design (SBVD) is an important technique in computational vaccine design that uses structural information on a targeted protein to design novel vaccine candidates. This increasing ability to rapidly model structural information on proteins and antibodies has provided the scientific community with many new vaccine targets and novel opportunities for future vaccine discovery. This chapter provides a comprehensive overview of the status of in silico SBVD and discusses the current challenges and limitations. Key strategies in the field of SBVD are exemplified by a case study on design of COVID-19 vaccines targeting SARS-CoV-2 spike protein.

Key words Structure-based vaccine design, Molecular docking, Target selection, Computer-aided vaccine design, Protein modeling, High-throughput virtual screening, Focused library design, De novo design, COVID-19, SARS-CoV-2

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_26, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

The discovery, optimization, and evaluation of antibodies that target key proteins or other relevant epitopes on the surface of a pathogen are central to the vaccine discovery process. The conventional brute-force empirical approach to vaccine discovery uses high-throughput screening of pathogen components to identify potential leads [1]. While this approach has led to effective vaccines, it is expensive and time-consuming. There is therefore a need for faster and more cost-efficient vaccine screening methods. Advances in molecular biology, structural biology, and computational methods over the last 30 years have provided researchers with access to high-quality structural information on a wide variety of biological targets. Structural information is the ultimate vaccine design tool that can streamline all aspects of vaccine discovery, from target selection to lead optimization, and can significantly reduce development cost and time. Indeed, structure-guided vaccine design has led to the discovery of high-profile vaccines against, for example, respiratory syncytial virus that have been successful in late-stage clinical trials [2]. Most recently, the COVID-19 pandemic has provided an ideal opportunity to apply successful principles of rational structure-based methods to vaccine development [3].

Structure-based vaccine design (SBVD) is the process by which novel vaccines are designed using structural knowledge of the relevant macromolecular target. In SBVD, libraries of candidate pathogen antigens are assessed for their ability to generate antibodies that will bind and interact with the target of interest. If the structure of the desired target is not available, as in the case of a newly emerged virus such as SARS-CoV-2, it may be possible to create a homology model using related structures available in the PDB [4]. More recently, this can be done using AlphaFold and related machine learning methods [5]. Existing or new antibodies against related viral proteins are subsequently docked onto the pathogen target structure to determine likely efficacy. Automatic and manual analysis of the docking results is then carried out to predict which antibodies will have the highest binding affinity to the target, and which antigen target structures are stable and most likely to be recognized by relevant neutralizing antibodies. This work can be doubly productive: it can assist with identifying and designing optimal vaccine antigens, while also having the potential to identify or generate therapeutic antibodies or small molecule drugs targeting the same pathogen structures being used for vaccine design.

This chapter summarizes the process of structure-based vaccine design and discusses the choice of target, lead identification, and the various docking-based and molecular dynamics (MD) simulation methods. Key concepts in SBVD are illustrated through a case study that explores how these could be applied to the design of effective SARS-CoV-2 vaccines. The aim of this chapter is to guide novice computational users through the general steps involved in in silico vaccine design, followed by in vitro and in vivo validation studies.

Unless the pathogen has already been well studied or has high homology to a known pathogen, the process of vaccine discovery and development requires extensive study of target proteins and potential antigenic sites that are key to pathogen function before a lead can be identified. The first stage is identification, purification, and structure determination of the structural target, usually a surface protein or sugar involved in viral pathogenesis processes such as receptor binding or cell fusion. For example, for SARS-CoV-2, the cause of COVID-19, the spike protein was rapidly identified as the key target for vaccine design, given its critical role in viral attachment and fusion, just as SARS and MERS coronaviruses were similarly dependent on their respective spike proteins [6]. The selection of the target is the most


critical step as it sets the course for all future aspects of research in the vaccine discovery pipeline. Once the biological target has been selected, its structure is determined by one of three methods: X-ray diffraction (generally known as X-ray crystallography), nuclear magnetic resonance spectroscopy (generally known as NMR), or comparative modeling (generally known as in silico homology modeling).

In the second stage, computational techniques are used to predict the molecular aspects of the interaction between ligand and target protein epitope and to determine how this might be disrupted using vaccine-induced antibodies, therapeutic monoclonal antibodies, or small molecule drugs. Promising candidates are then tested in biological assays to identify those with the best activity, either as a vaccine antigen or as an antiviral drug (lead discovery). Once a candidate series has been identified, the lead optimization stage begins. Here, key lead properties like structural integrity, ease of manufacture, stability, absorption, distribution, metabolism, and excretion are optimized, and toxicity is minimized. Further steps include synthesis of the optimized leads, testing, determination of the target lead activity, additional synthetic optimization, and potential industrial scale-up. After several rounds of the process, optimized vaccines and/or drugs should show improved properties and specificity for the target. The lead optimization stage concludes with the successful demonstration of in vivo efficacy in an appropriate animal model.

1.1 Rational Structure-Based Vaccine Design

1.1.1 Choice of the Vaccine Target and Structure Determination

The choice of the target protein is primarily based on therapeutic and biological relevance. The antigen target is a protein, peptide, or sugar whose activity can be modulated, thereby disrupting viral function and infectivity. Proteins are often used as vaccine targets due to their significant role in viral attachment and infectivity and can either be activated or inhibited by antibodies induced by vaccines, or directly by small molecules. Once the viral target has been identified, it is essential to determine its three-dimensional (3D) structure, ideally in association with its cellular receptor, to better understand the key structures that need to be targeted to disrupt viral infectivity. The structure of the target can be determined by one of the following methods.

1.1.2 X-Ray Crystallography and NMR

Both X-ray crystallography and NMR produce data on the relative positions of the atoms of a molecule. X-ray crystallography relies on scattering of X-rays from the electron clouds of atoms, whereas NMR measures the magnetic interactions between atomic nuclei. The data from crystallographic structure determination is an electron density map, essentially a contour plot indicating positions in the crystal structure where electrons are most likely to be found. This data must be interpreted in terms of a 3D model using semi-automatic computational methodologies. The data from an NMR experiment is usually a set of distances between atomic nuclei that define both


bonded and non-bonded close contacts in a molecule. These must be interpreted to produce a 3D molecular structure using computational tools. Structure determination in each case requires assumptions and approximations; hence, the resulting molecular structures may contain errors. The choice of technique depends on many factors, including the molecular weight, solubility, and ease of crystallization of the macromolecule under study. X-ray crystallography remains the main workhorse of structure determination for SBVD. Currently, the RCSB Protein Data Bank contains around 200,000 structures, of which 90% were solved by X-ray crystallography [7].

1.1.3 Homology Modeling

In the absence of an experimentally determined 3D structure of a protein, in silico homology modeling can provide structural models that are comparable to the best results achieved experimentally. In general, at least 30% target-template sequence identity is required to generate a useful structural model [8]. This allows researchers to use the generated in silico models for functional analysis and to predict interactions with other molecules. Homology modeling predicts the structure of the target protein primarily by aligning the target sequence (the query) with the sequence of one or more known structures (the template[s]) and is based on two major assumptions:

• The structure of a protein is encoded in its sequence; thus, knowing the sequence should, at least in theory, suffice to obtain the structure. This was elegantly demonstrated by Anfinsen when he showed that bovine pancreatic ribonuclease, following exposure to a denaturant, could spontaneously regain its native folded structure [9].

• The structure of a protein is more stable and conserved than its sequence. Therefore, closely related sequences will adopt essentially similar structures, and more distantly related sequences will have at least a similar fold, a relationship first identified by Chothia and Lesk [10].

In practice, homology modeling is a multistep process that can be summarized in the following seven steps:

(a) Template identification
(b) Alignment correction
(c) Model building
(d) Loop modeling
(e) Side-chain modeling
(f) Model optimization
(g) Model validation
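The 30% identity guideline quoted above can be checked directly once a target-template alignment is available. The following is a minimal sketch, assuming the two sequences have already been aligned by an external tool and are supplied as equal-length strings with "-" for gaps. The function names and the choice of denominator (columns in which both sequences have a residue) are illustrative assumptions; modeling packages may define identity differently.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over columns where both sequences have a residue."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for a, b in zip(aligned_target.upper(), aligned_template.upper()):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        aligned += 1
        if a == b:
            identical += 1
    if aligned == 0:
        return 0.0
    return 100.0 * identical / aligned


def usable_template(aligned_target: str, aligned_template: str,
                    threshold: float = 30.0) -> bool:
    """Apply the rule-of-thumb identity threshold (30% by default)."""
    return percent_identity(aligned_target, aligned_template) >= threshold
```

For example, `usable_template("ACDEFGHIKL", "ACXXXXXXXX")` evaluates to `False` because only 2 of 10 aligned positions match. The threshold is a screening aid only; alignment quality and coverage of the region of interest still need to be inspected.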


A comprehensive review of homology modeling is beyond the scope of this chapter. However, interested readers are referred to a series of publications and reviews on homology modeling [11] and to software and servers such as Protein Model Portal [12], ModBase [13], and SWISS-MODEL [14] for generation of three-dimensional protein structures.

1.1.4 AI-Based Ab Initio Modeling

Increasingly, sophisticated machine learning models are being used to predict protein structures even in the absence of good homology templates. This approach is exemplified by AlphaFold, which can efficiently predict most protein 3D structures with a high level of accuracy [5]. The speed and accuracy of these tools have transformed structural modeling, and they are more accessible and require less training than manual homology modeling approaches.

1.1.5 Identification of Binding Site

Once the 3D structure of the target viral protein is generated, its potential cellular receptor can be determined to facilitate docking and virtual screening (VS). Typically, cellular receptors used by other closely related viruses will be tested first, using docking and MD to estimate potential interactions and binding affinities. The ligand binding site can be the active site where the substrate binds, an assembly site where another macromolecule binds, or a communication site necessary to relay the information. Although small ligand binding sites tend to coincide with the largest and deepest pockets on the target surface [15], viral interactions are often protein–protein contacts that involve large, relatively flat interaction regions, not distinct pockets. However, given that binding sites are important for molecular recognition and interaction and that the same protein–protein interaction site can be targeted by antibodies or antiviral drugs, computational methods that scan the target for potential binding sites are very useful. These methods take a target structure as input and generate an ordered list of putative binding sites. Generally, not all reported sites correspond to true binding sites. The methods can be broadly divided into sequence-, structure-, and energy-based methods.

1.1.6 Sequence-Based Methods

Sequence-based methods are based on evolutionary conservation and exploit the existence of conserved residues in the binding site. In the LIGSITEcsc algorithm [16], a sequence conservation measure of neighboring residues is used to re-rank the top three sites predicted by LIGSITE (explained under structure-based methods), which leads to an improved success rate in binding site prediction. Unlike LIGSITEcsc, in ConCavity [17], the conservation information is not only used to re-rank the predictions but is also incorporated into the binding site detection procedure.
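The idea of using conservation to re-rank predicted sites can be illustrated with a short sketch. This is not the LIGSITEcsc or ConCavity algorithm; it assumes a precomputed multiple sequence alignment given as equal-length strings, scores each column with a normalized Shannon entropy, and orders hypothetical candidate sites by the mean conservation of their residues. All function names and the scoring scheme are assumptions made for illustration.

```python
import math
from collections import Counter


def column_conservation(msa):
    """Return one score per alignment column in [0, 1]; 1 means fully conserved."""
    if not msa:
        return []
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    max_entropy = math.log2(20)  # entropy of 20 equally likely amino acids
    scores = []
    for col in range(length):
        residues = [seq[col].upper() for seq in msa if seq[col] != "-"]
        if not residues:
            scores.append(0.0)  # all-gap column carries no information
            continue
        total = len(residues)
        entropy = -sum((n / total) * math.log2(n / total)
                       for n in Counter(residues).values())
        scores.append(1.0 - entropy / max_entropy)
    return scores


def rerank_sites(sites, conservation):
    """Order site names by mean conservation of their (non-empty) column lists."""
    def mean_score(columns):
        return sum(conservation[i] for i in columns) / len(columns)
    return sorted(sites, key=lambda name: mean_score(sites[name]), reverse=True)
```

Here `sites` maps a site label to the alignment columns of the residues lining that site, so a geometrically plausible pocket lined by variable residues drops below one lined by invariant residues.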


1.1.7 Structure-Based Methods

In contrast to sequence-based methods, structure-based methods predict ligand binding sites by analyzing geometrical features like clefts or cavities. Some methods are solely based on geometrical features (LIGSITE [18], PocketPicker [15]), while others take into account additional features like physicochemical information, polarity, or charge (Fpocket [19], SiteFinder by MOE).

1.1.8 Energy-Based Methods

Energy-based methods calculate the interaction energy between various probes placed on grid points around the target surface. Subsequent clustering of the probes with the most negative interaction energies identifies the most energetically favorable binding pockets. Notable examples of energy-based methods are Q-SiteFinder [20] and SiteHound [21].
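The probe-on-a-grid idea can be sketched in a few lines. The example below is not the Q-SiteFinder or SiteHound implementation; it assumes a single generic probe, a plain Lennard-Jones 12-6 potential with illustrative (not force-field) parameters, and atom coordinates in Å supplied as (x, y, z) tuples. It stops at ranking grid points by energy; the clustering step described above is omitted.

```python
import math


def probe_energy(point, atoms, epsilon=0.2, sigma=3.5, cutoff=8.0):
    """Sum of Lennard-Jones 12-6 energies between a probe and atoms within cutoff."""
    energy = 0.0
    for atom in atoms:
        r = math.dist(point, atom)
        if r > cutoff:
            continue
        r = max(r, 0.5)  # clamp very short distances to keep the energy finite
        sr6 = (sigma / r) ** 6
        energy += 4.0 * epsilon * (sr6 * sr6 - sr6)
    return energy


def favourable_grid_points(atoms, spacing=1.0, margin=4.0, top_n=5):
    """Return the top_n (energy, point) pairs with the most negative energy."""
    xs, ys, zs = zip(*atoms)

    def axis(lo, hi):
        # Grid coordinates covering the atoms plus a margin on each side
        n = int((hi - lo + 2 * margin) / spacing) + 1
        return [lo - margin + i * spacing for i in range(n)]

    scored = []
    for x in axis(min(xs), max(xs)):
        for y in axis(min(ys), max(ys)):
            for z in axis(min(zs), max(zs)):
                scored.append((probe_energy((x, y, z), atoms), (x, y, z)))
    scored.sort(key=lambda item: item[0])
    return scored[:top_n]
```

In a realistic setting the low-energy points would then be clustered spatially, and each cluster ranked by its summed energy, to yield candidate pockets.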

1.1.9 Consensus Method

A consensus method is essentially a meta-approach that combines results from several of the algorithms mentioned above. For example, Metapocket 2.0 [16] collects results of eight different methods by taking the top three sites from each method, with the authors demonstrating that Metapocket 2.0 performs better than any individual method alone. For a detailed review of binding site prediction methods, readers can refer to [22–24].
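A consensus of this kind can be sketched as follows. This is an illustration of the general idea rather than the Metapocket 2.0 procedure: it assumes each method returns a ranked list of site centres as (x, y, z) tuples in Å, keeps the top three from each, merges centres that fall within an assumed 8 Å radius, and ranks the merged sites by how many distinct methods support them. The function name, data layout, and radius are hypothetical.

```python
import math


def consensus_sites(predictions, top_n=3, radius=8.0):
    """Merge the top_n site centres per method and rank by method support."""
    clusters = []  # each cluster: {"members": [(method, centre)], "centre": (x, y, z)}
    for method, sites in predictions.items():
        for centre in sites[:top_n]:
            for cluster in clusters:
                if math.dist(centre, cluster["centre"]) <= radius:
                    cluster["members"].append((method, centre))
                    points = [c for _, c in cluster["members"]]
                    # Update the cluster centre to the mean of its members
                    cluster["centre"] = tuple(
                        sum(values) / len(points) for values in zip(*points)
                    )
                    break
            else:
                clusters.append({"members": [(method, centre)], "centre": centre})

    def support(cluster):
        # Number of distinct methods contributing to this cluster
        return len({method for method, _ in cluster["members"]})

    return sorted(clusters, key=support, reverse=True)
```

A site reported by several independent methods rises to the top, whereas a site reported by a single method is retained but ranked lower.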

1.1.10 Molecular Docking-Based Virtual Screening

Once the viral structure and the target cellular receptor are determined, molecular docking-based virtual screening can be used to identify potential lead antigens and antiviral drugs. VS is regarded as a computational counterpart of wet-lab high-throughput screening and is one of the most widely used strategies. The major advantages of docking approaches are their speed and the guidance they provide to follow-on wet-lab experiments [25]. Molecular docking is also useful in the study of ligand–target interactions, as docking programs can analyze the molecular interactions between a ligand and a target. As shown below, this can be used in diverse ways to characterize the viral protein and its potential evolution, including the potential for vaccine and drug escape via mutation, as well as the nature of the cellular receptor that determines the species specificity of the virus and whether this may change over time. It is thereby an extremely powerful tool for designing vaccine antigens and predicting how these may need to change to adapt to virus evolution. The optimal antigen structures obtained by in silico modeling can then be synthesized and tested. In general, molecular docking-based VS consists of several steps: (1) viral protein preparation; (2) cellular receptor preparation; (3) molecular docking; (4) post-docking analysis; and (5) molecular dynamics and binding free energy calculation. Careful and thorough literature surveys of the target, binding site, and known ligands (if available) may assist in selecting the docking algorithms best suited to a given target.


1.1.11 Target Preparation

The preparation of the target viral and cellular receptor protein structures requires great care because experimental structures frequently have problems such as incomplete side chains, missing loops, and randomly oriented side chains in binding sites, owing to poorly defined or incorrectly interpreted experimental data. Structural characterization of the targeted protein includes a choice of tautomeric forms for histidine residues, correct assignment of protonation states of amino acids, and conformation of residues, especially in the binding site. It is also important to add missing hydrogen atoms, build missing residues or loops, identify overlapping atoms to reduce clashes, and optimize hydrogen bonding networks. Water molecules and cofactors in protein active sites may need to be removed or retained. Metals, ions, and cofactors that form an integral part of the binding interaction with the ligand are considered part of the docking site and hence are retained. If the software allows for flexible docking to account for conformational changes during ligand binding, the number and identity of flexible residues and their degrees of flexibility need to be defined.
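Two of the checks described above, missing residues and incomplete residues, can be screened for with a short script before a structure is passed to a docking program. The sketch below assumes standard fixed-column PDB ATOM records and is only a first-pass filter: numbering gaps are a hint rather than proof of missing residues, insertion codes and alternate locations are ignored, and only backbone completeness is tested. The function names are illustrative and do not belong to any particular package.

```python
BACKBONE = {"N", "CA", "C", "O"}


def atoms_by_residue(pdb_lines, chain_id):
    """Map residue number -> set of atom names for ATOM records of one chain."""
    residues = {}
    for line in pdb_lines:
        # PDB fixed columns: atom name 13-16, chain ID 22, residue number 23-26
        if line.startswith("ATOM") and len(line) >= 26 and line[21] == chain_id:
            number = int(line[22:26])
            residues.setdefault(number, set()).add(line[12:16].strip())
    return residues


def numbering_gaps(pdb_lines, chain_id):
    """Return (before, after) residue-number pairs that bracket a numbering gap."""
    numbers = sorted(atoms_by_residue(pdb_lines, chain_id))
    return [(a, b) for a, b in zip(numbers, numbers[1:]) if b - a > 1]


def incomplete_backbone(pdb_lines, chain_id):
    """Return residue number -> sorted list of missing backbone atom names."""
    missing = {}
    for number, names in sorted(atoms_by_residue(pdb_lines, chain_id).items()):
        absent = BACKBONE - names
        if absent:
            missing[number] = sorted(absent)
    return missing
```

Residues flagged by either function would then be rebuilt with a modeling tool, or the affected region excluded from the docking site, as described above.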

1.1.12 Molecular Docking

Molecular docking is a method that predicts the preferred orientation of one molecule (key) when bound in an active site of another molecule (lock) to form a stable complex in which the free energy of the overall system is minimized. It exploits the concept of molecular shape and physicochemical complementarity. The structures interact like a hand in a glove, where both shape and physicochemical properties contribute to the fit. Molecular docking processes have two major steps: (a) searching and (b) scoring.

1.1.13 Search Algorithm

The search algorithm implemented in any docking tool should explore the many ways in which the protein and ligand can bind. The size of the search space grows exponentially with the size and flexibility of the molecules. For example, a small molecule with 10 rotatable bonds sampled in 30° increments has 12 possible states per bond and therefore 12^10 (roughly 6 × 10^10) possible conformations. If the target protein is also flexible, this quickly becomes an intractable problem. To address this, docking applications frequently employ one or more of the following search algorithms: simulated annealing, fast shape matching, incremental construction, particle swarm optimization, and evolutionary algorithms [26].
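The growth of the conformational search space follows from simple counting: each rotatable bond sampled at a fixed angular increment contributes 360/increment states, and the states multiply across bonds. The sketch below makes that arithmetic explicit; the function name is illustrative, and the count ignores symmetry and sterically forbidden conformations.

```python
def conformer_count(rotatable_bonds: int, increment_degrees: float) -> int:
    """Number of torsion combinations for a systematic search."""
    states_per_bond = round(360 / increment_degrees)
    return states_per_bond ** rotatable_bonds


# Ten rotatable bonds at 30-degree increments: 12 states per bond
print(conformer_count(10, 30))  # 61917364224

# Each additional bond multiplies the search space by 12
for bonds in (5, 10, 15):
    print(bonds, conformer_count(bonds, 30))
```

This exponential growth is why exhaustive enumeration is rarely practical and stochastic or incremental search strategies are used instead.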

1.1.14 Scoring Functions

The scoring function estimates and ranks the binding strength of a ligand-receptor complex. It is important to have an efficient, accurate scoring function that correctly ranks the relative binding strength of each ligand in a database. Ideally, the score should correlate with the binding affinity of the ligand for the protein, so that the top scoring compounds are also the best binders. There are three main types of scoring functions used by docking programs: (1) force field-based: GoldScore, DOCK, AutoDock (derived from AMBER), and CHARMm; (2) empirical: PLP, ChemScore, Glide SP/XP, PLANTSchemplp, and PLANTSplp; and (3) knowledge-based: PMF, DrugScore and its derivative DrugScoreCSD, and the Astex statistical potential. For a detailed review of molecular docking, the reader can refer to the following publications [27–29].

A substantial number of docking tools are now available for discovery of novel bioactive molecules. Selecting a particular docking tool is always a challenge. One study reported that AutoDock offered a better combination of accuracy and speed compared to eight other docking programs in recapitulating the X-ray poses of 100 small organic protein kinase inhibitors [30, 31]. The same study also reported that GLIDE, GOLD, and SURFLEX had the best docking accuracy. Therefore, the choice of docking program largely depends on the protein family under consideration. The procedures associated with different docking tools often differ, for example, in the formats of the ligand and target files, in the scoring functions and algorithms for ligand placement, or in whether the ligand is docked in its entirety or in fragments. However, the general principles of docking remain similar: compounds are first placed inside the binding pocket using the algorithm implemented by the docking software and then evaluated for non-covalent interactions using a scoring algorithm available in the same package. If the position of the ligand is known a priori, this information can be exploited during and after the docking process. Programs such as DOCK allow an anchor fragment to be specified that can guide ligand placement during the docking process.

1.1.15 Post-Docking Evaluation and Analysis

The output of docking and scoring is a ranked list of predicted bound ligands. Using computer graphics software, these can be evaluated visually for goodness of fit, the formation of key interactions and hydrogen bonds, electrostatic interactions, surface complementarities, and stability of the bound conformation compared to free conformation. Selected hit structures are then subjected to further in silico optimization using focused libraries of potential ligands. The 3D structure of the complex, comprising the viral protein and the cellular receptor or neutralizing antibody, is often subsequently determined experimentally to validate the binding mode predicted by the docking software. The best lead ligands are then evaluated for binding and biological activity in a wet lab.

1.1.16 Molecular Dynamics and Binding Free Energy Calculation

The field of MD simulations is rapidly progressing with improvements in simulation methodology and increasing accuracy of biomolecular force fields [32]. Experiments that are difficult or even impossible to perform in the wet lab can be simulated using MD. From a vaccine discovery perspective, MD simulation can be used to study the stability of docked complexes and to ensure that molecular docking has yielded accurate results. The stability of the


docked complex is often estimated using the root mean square deviation (RMSD) and root mean square fluctuation (RMSF) over the course of the simulation [33]. RMSF values are calculated to study the thermal stability and structural flexibility [33], while the RMSD value measures the changes in the structure during simulation. Changes on the order of 1–3 Å are considered acceptable. Substantial improvements in protein–ligand docking results can be achieved using high-throughput MD simulations [34].

1.1.17 Limitations of Molecular Docking

Docking protocols are the combination of search algorithms and scoring functions. The scoring function is a mathematical construct that is used to calculate the strength of non-covalent interactions or binding affinity [27]. Some docking experiments fail because docking methods cannot account for the conformational changes that occur in the protein and ligand during binding while they search for the binding pose of the ligand in the target cavity. Predicting target receptor structural rearrangements during ligand binding is a complex problem, and docking tools have only a limited ability to model the flexibility available to the protein during the binding process [35]. This problem can be addressed by MD simulations, although it needs to be remembered that MD simulations are computationally expensive. When ligands with high degrees of flexibility bind to the protein target, ligand entropy penalties can dramatically affect the free energy of binding of the complex, and most docking and MD algorithms treat ligand entropy only approximately [36]. Studies comparing different docking tools on large test sets reported success rates of 30–80% [37, 38]. Modifying basic parameters in the docking software can drastically affect the docking and virtual screening results, indicating that expert knowledge is critical for optimizing the accuracy of docking predictions. Because a particular docking tool tends to work best for specific targets, multiple docking tools and scoring functions can be used together to enhance accuracy [39]. A universal docking tool (algorithm and scoring function) is not currently available.
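One simple way to combine several docking tools or scoring functions is rank-based consensus scoring, sketched below. The scheme shown (averaging each candidate's rank across tools) is one common choice assumed here for illustration, not a method prescribed in this chapter. Scores from different tools are not on a common scale, and some are "lower is better" while others are "higher is better", so each table is converted to ranks first. All names are hypothetical, and ties within a tool are broken arbitrarily by input order.

```python
def ranks(scores, lower_is_better=True):
    """Convert a {candidate: score} table to {candidate: rank}, rank 1 = best."""
    ordered = sorted(scores, key=scores.get, reverse=not lower_is_better)
    return {name: position for position, name in enumerate(ordered, start=1)}


def consensus_ranking(score_tables):
    """Order candidates by mean rank across tools.

    score_tables is a list of (scores, lower_is_better) pairs, one per tool.
    Only candidates scored by every tool are ranked.
    """
    per_tool = [ranks(scores, lower) for scores, lower in score_tables]
    common = set.intersection(*(set(r) for r in per_tool))
    mean_rank = {
        name: sum(r[name] for r in per_tool) / len(per_tool) for name in common
    }
    # Sort by mean rank, then by name so the output order is deterministic
    return sorted(mean_rank, key=lambda name: (mean_rank[name], name))
```

For instance, a candidate ranked first by an energy-like score and second by a fitness-like score obtains a mean rank of 1.5 and is placed ahead of a candidate ranked third and first.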

1.1.18 COVID-19 Vaccine Design

The COVID-19 pandemic caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has had a devastating impact on populations around the globe, causing tens of millions of fatalities and disrupting all economies. As the first pandemic in the age of computer-based vaccine design, it also provided a unique opportunity to apply these methods for the first time to a real-life event. The reports in January 2020 of the existence of a completely new severe acute respiratory syndrome coronavirus raised many important questions about its nature, mechanism of infectivity, and potential for development of vaccines and/or antiviral drugs, as well as its origins and the mechanism of its transfer to humans. A


key issue is to rapidly determine the species specificity of any new virus, as such data is important for understanding its potential origins and its potential to infect commercial and domestic animals, and for rapidly identifying suitable animal models for testing candidate vaccines and antiviral drugs. Understanding how coronaviruses move between species could also help to prevent similar events in the future. Elucidation of the molecular basis for species susceptibility differences could also provide important insights into why human populations exhibit different susceptibilities [40].

In this context, in silico structural homology modeling, protein–protein docking, and molecular dynamics simulation tools offered a powerful means to quickly find answers to these many interrelated questions, offering the potential to accelerate pandemic vaccine design and assist antiviral drug screening as well as help identify the potential origins of the pandemic virus. The following section describes the range of computational and biological approaches used to characterize the SARS-CoV-2 spike protein, starting from the genomic sequence and going all the way to in vivo vaccine testing [4] (see Fig. 1). Modeled 3D structures were used for docking studies to characterize the interaction of spike with angiotensin converting enzyme 2 (ACE2), the relevant human receptor. Computer models were employed to design a vaccine from the extracellular domain (ECD) of the SARS-CoV-2 spike protein, with the aim of inducing antibodies able to block the binding of the SARS-CoV-2 virus to ACE2, thus preventing infection. ACE2 was confirmed as the receptor for the spike protein, and viral entry into host cells was further shown to be enhanced by priming of the spike protein by transmembrane protease serine 2 (TMPRSS2) [41].

Subsequent results confirmed that our computationally designed spike protein antigen was highly stable and able to induce antibodies against spike protein that neutralized the wild-type lineage (Wuhan-Hu-1-like) SARS-CoV-2 viruses and cross-neutralized variant viruses. In addition to inducing neutralizing antibody, the vaccine also induced memory CD4 and CD8 T cell responses with a Th1 phenotype, which translated into the killing of spike-labeled target cells in vivo [42]. The antigen originally designed in silico was next tested for protection against SARS-CoV-2 infection in immunized mice, hamsters, cats, ferrets, and monkeys [42–44]. Notably, despite being an intramuscular vaccine, it prevented nasal virus shedding and virus transmission to naïve animals, a critical property for a successful pandemic vaccine. The vaccine successfully completed Phase 2 and 3 human trials involving over 16,000 participants, where it was found to be safe and effective in preventing SARS-CoV-2 symptomatic infection and severe disease [45, 46]. It received an initial marketing authorization on October 6, 2021, making it the first recombinant spike protein-based vaccine of its type to be licensed. Following a third dose booster study, where it showed good boosting of spike antibody levels regardless of primary COVID-19 vaccine [46], it received approval to be used in adults as a booster dose. The following methods outline all steps needed to model, characterize, simulate, test, and validate a novel vaccine candidate against a newly arisen pandemic virus.

Fig. 1 Schematic showing key processes involved with in silico COVID-19 vaccine design, starting with initial identification from the genome sequence of the key virus attachment protein, in this case, the SARS-CoV-2 spike protein, building a 3D structural model of the protein, using this to scan for the putative human cellular receptors, and modeling the effects of potential protein stabilizing mutations and other adaptations to make it into a suitable vaccine antigen. These structural models can also be used to characterize how neutralizing antibodies might be interacting with spike protein to prevent viral infection. Finally, the in silico predictions need to be tested in vivo in animal immunogenicity studies to see whether the vaccine is able to induce appropriate antibody and T cell responses able to mediate protection. Workflow boxes in the figure: structural modelling of SARS-CoV-2 spike protein and ACE2 receptor (Modeller 9.23); structure assessment of modelled protein (SWISS-MODEL); docking of SARS-CoV-2 spike protein with ACE2 receptor (HDOCK); molecular dynamics simulation of the docked SARS-CoV-2 and ACE2 protein complex (Gromacs 2020); calculation of binding free energies of spike-ACE2 complexes (g_mmpbsa); spike protein vaccine design and generation (JCat); in vivo vaccine immunogenicity; spike protein binding immunoglobulin ELISA assay; SARS-CoV-2 neutralizing antibody and T cell assessment

2 Materials

2.1 Software and Servers

1. Modeller 9.23: for modeling of spike protein and ACE2 receptor (https://salilab.org/modeller/).
2. HDOCK: for docking of spike and ACE2 receptor (http://hdock.phys.hust.edu.cn/).
3. UCSF Chimera: for visualization and analysis of docked receptor (https://www.cgl.ucsf.edu/chimera/).
4. Gromacs 2020: for molecular dynamics simulation of the docked complex, to assess structural stability and entropic effects (https://www.gromacs.org/).
5. g_mmpbsa: to calculate the binding affinity of the docked complexes.
6. SWISS-MODEL (https://swissmodel.expasy.org/assess).
7. Protein Data Bank (PDB) Database (https://www.wwpdb.org/).
8. JCat (http://www.jcat.de/).
9. Genscript codon optimization tool (https://www.genscript.com/gensmart-free-gene-codon-optimization.html).

2.2 Vaccine Design and Generation

1. Restriction enzymes from New England Biolabs (NEB).
2. T4 DNA ligase (NEB).
3. TOP10 competent cells (Life Technologies).
4. 2× YT medium (VWR).
5. Innova 40R shaker (Eppendorf).
6. Heraeus Megafuge 16R Refrigerated Benchtop Centrifuge (Thermo Scientific).
7. Qiagen Endotoxin-Free Plasmid Purification kit (Giga prep size).
8. Nanodrop 2000 (ThermoFisher).

2.3 Mouse Vaccination

1. Female BALB/c or C57BL/6 mice, 6–10 weeks old, used under a protocol approved by the Institutional Animal Care & Use Committee (IACUC).
2. 0.5 mL insulin syringe with 29-gauge needle (BD).
3. 1 mL tuberculin syringe and 23-gauge needle (BD).

In Silico Vaccine Design


4. Goldenrod animal lancet (4 mm) (Medipoint Inc.).
5. Heraeus Fresco17 Refrigerated Microcentrifuge (Thermo Scientific).
6. CpG55.2™ oligonucleotide adjuvant (Vaxine Pty Ltd).
7. Advax™ adjuvant (Vaxine Pty Ltd).

2.4 ELISA

1. Inactivated viruses or recombinant viral proteins of interest.
2. 96-well ELISA plates (Greiner Bio-One, Catalog #: 655001).
3. 96-well non-binding plates (Greiner Bio-One, Catalog #: 655901).
4. Sterile serological pipettes (Greiner Bio-One).
5. Multichannel pipette (20–200 μL).
6. Variable adjustable-volume pipettes.
7. Carbonate bicarbonate coating buffer: 3.03 g Na2CO3 and 6.0 g NaHCO3 in 1 L water, pH 9.6.
8. Biotinylated anti-mouse immunoglobulin of interest (Abcam).
9. HRP-conjugated streptavidin (BD Bioscience).
10. Blocking buffer: 1% bovine serum albumin (BSA) in phosphate-buffered saline (PBS).
11. Washing buffer: PBS + 0.05% Tween 20 (Sigma-Aldrich).
12. TMB substrate kit (KPL, SeraCare).
13. Stop solution: 1 M phosphoric acid (Sigma-Aldrich).
14. Plate washer (NUNC).
15. Plate reader (OD450nm).

2.5 pVNT Assay

1. Class II biosafety cabinet.
2. CO2 incubator (Thermo Scientific).
3. 96-well cell culture plates, white (Greiner Bio-One, Catalog #: 655083).
4. HEK 293T cells expressing human ACE2 (293T-hACE2).
5. Assay virus (SARS-CoV-2 spike-pseudotyped lentivirus containing the luciferase gene).
6. Sterile serological pipettes (Greiner Bio-One).
7. Dulbecco’s Modified Eagle Medium (1× DMEM) (Gibco, ThermoFisher, Catalog #: 10564011).
8. Dulbecco’s Modified Eagle Medium, no phenol red (1× DMEM-FR) (Gibco, ThermoFisher, Catalog #: 31053028).
9. Fetal calf serum (FCS) (Gibco, ThermoFisher).
10. PBS (Sigma-Aldrich, Catalog #: P2272).


11. 0.25% Trypsin-EDTA (Gibco, ThermoFisher, Catalog #: 25200072).
12. 100× Penicillin-Streptomycin (10,000 U/mL) (Gibco, ThermoFisher, Catalog #: 15140122).
13. Polybrene (Selleck Chem, Catalog #: E1299).
14. 0.22 and 0.45 μm disk filter units and 50 mL syringe.
15. Variable adjustable-volume pipettes.
16. Hemocytometer.
17. Inverted microscope (Olympus).

3 Methods for COVID-19 Vaccine Design and Simulation

3.1 Structural Modeling of SARS-CoV-2 Spike Protein

1. No three-dimensional structure was available at the start of the project on the novel virus, SARS-CoV-2. As soon as the genomic sequence was released, which for SARS-CoV-2 was in mid-January 2020, it was therefore critical to perform genome analysis to identify the putative virus attachment protein, which for coronaviruses is the spike protein. This was identified from an analysis of the SARS-CoV-2 genome sequence in NCBI (accession number: NC_045512) [47].
2. A PSI-BLAST search of the new virus genome sequence against the Protein Data Bank (PDB) database was first performed to identify a template for 3D modeling. Given the homology of the putative SARS-CoV-2 spike protein to that of SARS (76.4% sequence identity), the X-ray structure of the SARS coronavirus S template (PDB ID 5XLR) could be used for modeling the SARS-CoV-2 spike protein.
3. Using the SARS-CoV-1 structure (PDB ID 6ACC) [48], a structural homology modeling approach employing Modeller 9.23 (https://salilab.org/modeller/) was used to obtain a 3D structure of the key viral attachment protein, here the SARS-CoV-2 spike protein.
4. The quality of the spike protein model was evaluated using the GA341 and DOPE scores and further assessed using the SWISS-MODEL structure assessment server (https://swissmodel.expasy.org/assess).
5. To help identify the putative cellular receptor, known coronavirus receptor proteins, such as DPP4 and ACE2, can be docked in silico to see whether they bind the viral attachment protein. For example, to assess ACE2 as the possible receptor, the crystal structure of human ACE2 (PDB ID 3SCI) [49] was retrieved from the PDB and the putative spike protein was docked against it using the HDOCK server (http://hdock.phys.hust.edu.cn/) [50].


6. The docking poses were ranked using an energy-based scoring function and the docked structure was analyzed using UCSF Chimera. A high binding score predicted human ACE2 as the entry receptor for SARS-CoV-2 spike protein, thereby confirming spike protein suitability for vaccine design [51].
7. The docked model was optimized using the AMBER99SB-ILDN force field in GROMACS 2020 (https://www.gromacs.org/). Molecular dynamics simulations (MDS) were carried out for at least 100 ns using a GPU-accelerated version of the program.
8. The structural stability of the complex was monitored by the root mean square deviation (RMSD) value of the backbone atoms of the entire protein.
9. The free energy of binding was then calculated for the simulated SARS-CoV-2 spike and human ACE2 structure using g_mmpbsa.
10. Finally, MDS was performed on the spike protein vaccine construct to assess its ability to form a stable trimer despite the lack of the transmembrane and cytoplasmic domains.

3.2 Structural Modeling of ACE2 Receptor

1. Once the key SARS-CoV-2 receptor has been confirmed, in this case ACE2, the interaction between various spike protein variants and ACE2 species variants can be modeled to better understand the nature of their interaction and how to disrupt it to reduce viral infectivity, using either a vaccine or an antiviral drug approach.
2. Protein preparation, removal of non-essential and non-bridging water molecules for docking studies, and analysis of docked proteins were performed using the UCSF Chimera package (https://www.cgl.ucsf.edu/chimera/).
3. The 3D structures of the RBD of SARS-CoV-2 spike protein and of non-human ACE2 proteins were built using Modeller 9.23 (https://salilab.org/modeller/).
4. The ACE2 receptors of selected species were homology modeled using the following template structures: 1R42 (human ACE2), 3CSI (human glutathione transferase), and 3D0G (ACE2 structure from spike protein receptor-binding domain of the 2002–2003 SARS coronavirus human strain complexed with human-civet chimeric receptor ACE2).
5. Template similarity is important for model building. The sequence of Macaca fascicularis (monkey, accession number A0A2K5X283) is 97% similar to that of human ACE2, while Ophiophagus hannah (king cobra) has a much lower template similarity of 61%, which generally results in a lower quality model.


6. The quality of the generated models was evaluated using the GA341 score and the DOPE (Discrete Optimized Protein Energy) score, and the model quality was further assessed using the SWISS-MODEL structure assessment server (https://swissmodel.expasy.org/assess).
7. Structures with the lowest DOPE score were refined by MD simulations (vide infra) and used for further analysis.
8. Ten homology ACE2 models per protein were built and then refined and optimized in GROMACS.
9. The modeled ACE2 structures were also assessed for quality control using the Ramachandran plot and MolProbity scores in SWISS-MODEL.
10. The Ramachandran plot checks the stereochemical quality of a protein by analyzing residue-by-residue geometry and overall structure geometry and by visualizing the energetically allowed regions for the backbone dihedral angles of amino acid residues in the protein structure. The Ramachandran score of SARS-CoV-2 spike protein was 90% in the binding region and the MolProbity score was 3.17. The percentage of amino acid residues in the various species' ACE2 models that fall into the energetically favored region of the Ramachandran plot ranged from 96% to 99%.
11. The MolProbity score evaluates model quality at both the global and local level. It combines protein quality scores into a single number that reflects the crystallographic resolution of a model. It is a log-weighted combination of the number of serious atom clashes per 1000 atoms, the percentage of Ramachandran not favored, and the percentage of bad side-chain rotamers. A good MolProbity score is one that is equal to or lower than the crystallographic resolution. For reference, the MolProbity scores for the X-ray structures of the templates (PDB IDs) were 1R42 = 3.01; 3SCI = 3.14; 3D0G = 2.74; and 6M17 = 1.99. The Ramachandran and MolProbity scores show whether the built structures are of good quality and are suitable for use in further studies.
12. The structure of the open form of the SARS-CoV-2 S protein was subsequently published (e.g., PDB ID 6VYB), allowing a comparison with the modeled structures. A high structural similarity of the homology-modeled spike protein structure with the EM structures (PDB ID 6M0J (RBD) and 6VYB (open state)), with an RMSD of 0.36 Å, shows that the model is of good quality.
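
The log-weighted combination described in step 11 can be illustrated with a short calculation. The sketch below is not taken from this protocol: the weights and offsets are those commonly quoted for the published MolProbity score and are stated here as an assumption, and the function name and example inputs are hypothetical. SWISS-MODEL reports the score directly, so the code is only meant to show how the three terms combine.

```python
import math


def molprobity_score(clashscore, rama_favored_pct, rotamer_outlier_pct):
    """Combine three validation metrics into a single MolProbity-style score.

    clashscore          -- serious atom clashes per 1000 atoms
    rama_favored_pct    -- % of residues in favored Ramachandran regions
    rotamer_outlier_pct -- % of side chains with bad rotamers
    The coefficients are an assumption (commonly quoted values), not taken
    from this chapter.
    """
    rama_not_favored_pct = 100.0 - rama_favored_pct
    return (
        0.426 * math.log(1.0 + clashscore)
        + 0.33 * math.log(1.0 + max(0.0, rotamer_outlier_pct - 1.0))
        + 0.25 * math.log(1.0 + max(0.0, rama_not_favored_pct - 2.0))
        + 0.5
    )


# Hypothetical model: 10 clashes per 1000 atoms, 95% favored, 3% bad rotamers.
example = molprobity_score(10.0, 95.0, 3.0)
print(f"MolProbity-style score: {example:.2f}")
```

Because every term enters through a logarithm, a model with many clashes is penalized less than linearly, and lower values indicate better geometry, in line with the interpretation given in step 11.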


3.3 Docking of SARS-CoV-2 S Protein with ACE2 Proteins


1. The homology-modeled ACE2 structures were docked against the SARS-CoV-2 S protein structure using HDOCK (http://hdock.phys.hust.edu.cn/) [50, 52]. HDOCK performs rigid-body docking by mapping the receptor and ligand molecules onto grids. It docks two molecules using an FFTW-based hierarchical approach. First, possible binding modes are globally sampled through a fast Fourier transform (FFT)-based global search strategy with an improved shape complementarity scoring method. Specifically, one molecule is fixed, and the second molecule is rotated and translated in space. For each movement of the ligand, both the receptor and ligand molecules are mapped onto grids that extend past the proteins to account for long-range interactions of atoms.
2. Molecular docking was performed on the homology-modeled SARS-CoV-2 S protein and various ACE2 proteins using the hybrid docking method, because template-free docking with the structures generated by Modeller may give unsatisfactory results. The hybrid method uses the structural template for the complex (PDB ID 6M17) to generate the results.
3. All docking poses were then ranked using an energy-based scoring function.
4. The hybrid docking procedure may potentially introduce bias into the structures of the spike protein bound to ACE2 for non-human species because it uses a human ACE2 X-ray structure as a template. To check for possible bias, the ACE2 structures generated by HDOCK can be compared with those generated independently by Modeller.
5. The backbones of the ACE2 structures all align with RMSD values of 0.5–0.8 Å and exhibit strong structural similarities. Complexes were subjected to molecular dynamics simulation to wash out any template-induced bias.
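
The grid-based translational search described in step 1 can be sketched in a few lines. This toy example is not HDOCK's implementation: it uses a hypothetical 6 × 6 two-dimensional grid and an assumed scoring scheme (surface cells +1, buried core cells −15), and it evaluates the correlation by direct summation over cyclic shifts. An FFT computes the same cyclic correlation far more efficiently on realistic three-dimensional grids, and a real search would repeat the scan for every ligand rotation.

```python
from itertools import product

SURFACE = 1.0   # assumed reward for overlapping the receptor surface layer
CORE = -15.0    # assumed penalty for penetrating the receptor core


def score_translations(receptor, ligand):
    """Score every cyclic shift of the ligand grid against the receptor grid.

    Returns {(d_row, d_col): score}, where score is the sum over grid cells of
    receptor[row + d_row][col + d_col] * ligand[row][col].
    """
    n_rows, n_cols = len(receptor), len(receptor[0])
    cells = list(product(range(n_rows), range(n_cols)))
    scores = {}
    for d_row, d_col in cells:
        scores[(d_row, d_col)] = sum(
            receptor[(row + d_row) % n_rows][(col + d_col) % n_cols] * ligand[row][col]
            for row, col in cells
        )
    return scores


# Hypothetical receptor: a 3 x 3 block whose centre cell is buried core.
size = 6
receptor = [[0.0] * size for _ in range(size)]
for row in range(1, 4):
    for col in range(1, 4):
        receptor[row][col] = SURFACE
receptor[2][2] = CORE

# Hypothetical ligand: two adjacent occupied cells.
ligand = [[0.0] * size for _ in range(size)]
ligand[0][0] = ligand[0][1] = 1.0

scores = score_translations(receptor, ligand)
best_shift = max(scores, key=scores.get)
print("best shift:", best_shift, "score:", scores[best_shift])
print("clashing shift (2, 1) score:", scores[(2, 1)])
```

Shifts that place the ligand on the receptor surface score positively, while shifts that push it into the core are strongly penalized, which is the basic idea behind shape-complementarity scoring on grids.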

3.4 MD Simulation of Docked SARS-CoV-2 Spike/ACE2 Protein Complexes

1. The docked SARS-CoV-2 spike/ACE2 protein complexes were optimized using the AMBER99SB-ILDN force field in GROMACS 2020 (http://www.gromacs.org/) [53].
2. Simulations were carried out using the GPU-accelerated version of the program and by implementing periodic boundary conditions on the ORACLE server.
3. The final docked structures were selected by cluster analysis of the docked conformations and from RMSD analysis of the docked conformations of our structures against 3D0G (SARS-RBD and ACE2).
4. Docked complexes were immersed in a truncated octahedral box of TIP3P water molecules. The solvated box was further neutralized with Na+ or Cl- counter ions using the tleap program.


5. Particle Mesh Ewald (PME) was employed to calculate the long-range electrostatic interactions. The cut-off distance for the long-range van der Waals (VDW) energy term was 12.0 Å.
6. The system was minimized without restraints.
7. 2500 cycles of steepest descent minimization were applied, followed by 5000 cycles of conjugate gradient minimization.
8. After system optimization, the MD simulations were initiated by gradually heating each system in the NVT ensemble from 0 to 300 K over 50 ps using a Langevin thermostat with a coupling coefficient of 1.0/ps and a force constant of 2.0 kcal/mol·Å² on the complex.
9. A production run of 100 ns of MD simulation was performed for each system at a constant temperature of 300 K in the NPT ensemble with periodic boundary conditions.
10. During the MDS procedure, the SHAKE algorithm was applied to all covalent bonds involving hydrogen atoms.
11. The simulation time step was 2 fs.
12. The structural stability of the complex was monitored by the RMSD and RMSF values of the backbone atoms of the entire protein.
13. The free energies of binding were calculated for all simulated docked structures.
14. Calculations were also performed for up to 500 ns to ensure that 100 ns is sufficiently long for convergence and that the docked conformation and protein–protein interaction are stable.
15. The simulation of the docked spike RBD-human ACE2 for 500 ns confirmed convergence by RMSD and RMSF.
16. All complexes should stabilize during simulations, with RMSD fluctuations converging to a range of 0.5–0.8 nm. For SARS-CoV-2 spike protein and human ACE2, the complex stabilized after 50 ns, suggesting that 100 ns was an adequate simulation time.
17. The RMSD values for superimposition of the Cα backbones of each ACE2 structure before and after 100 ns simulation were 1.2 ± 0.1 Å, showing movement away from the initial HDOCK structures.
18. The RMSF graph should be analyzed to look for fluctuations in the amino acids.
19. Three production runs were performed with different random starting seeds to estimate binding energies and binding energy uncertainties for each of the strongest binding ACE2 structures: human, bat, and pangolin. The binding energies were based on a 10 ns analysis section (1000 frames).


20. The 100 ns simulated structures of the ACE2 proteins from all species were compared against those generated by homology modeling, with the RMSD values for Cα alignments being between 0.5 and 0.8 Å, suggesting that any memory of the template has been removed or minimized.
21. The structures generated independently by homology modeling (Modeller) and HDOCK should then be compared; in the current case they agree very well, RMSD [...]

[...] 250 μL/well.
5. Dry all surfaces of plates with a paper towel and add 200 μL/well of Blocking Buffer.
6. Cover the plate with a plastic seal and incubate for at least 30 min at RT.
7. Dilute standard and samples serially in Blocking Buffer in a 96-well non-binding plate. Always prepare enough volume for a duplicate set per sample, as 100 μL/well of sample is required.
8. Flick off Blocking Buffer from plates and bang the plate onto absorbent pads to remove air bubbles and excess moisture.
9. Add 100 μL/well of samples.
10. Cover the plate with a plastic seal and incubate for 2 h at RT.
11. Gently flick contents of the microplate into the sink and wash plates six times with washing buffer, >250 μL/well each time (bang the plate onto absorbent pads to remove air bubbles and excess moisture between every two consecutive washes).
12. Dry all surfaces with a paper towel.
13. Dilute the biotinylated anti-mouse detection antibodies and streptavidin-HRP in Blocking Buffer (total volume required is 10 mL per plate). Prepare this just before starting to wash the plates.
14. Gently add 100 μL/well of the diluted detection antibody using a multichannel pipette without forming air bubbles.
15. Cover the plate with a plastic seal.
16. Incubate for 1 h at RT.
17. Wash the plate six times.


18. Mix equal volumes of Peroxidase Substrate and Peroxidase Substrate Solution B just before use; 10 mL is required per ELISA plate.
19. Add 100 μL/well of the TMB substrate solution using a multichannel pipette.
20. Incubate for 10 min at RT to allow color to develop.
21. Add 100 μL/well of stop solution using a multichannel pipette to stop the reaction.
22. Gently tap the plate to stop the reaction uniformly.
23. Measure OD at 450 nm with a VersaMax ELISA microplate reader (Molecular Devices, CA, USA) and analyze using SoftMax Pro Software.
24. For determination of ELISA end-point titers, absorbance cut-off values are established as the mean absorbance of eight negative-control wells containing sera of naive mice plus 3 × SD.
25. Absorbance values of test sera are considered positive if they are equal to or greater than the absorbance cut-off, and end-point titers are calculated as log10 of the reciprocal of the last dilution giving a positive absorbance value.

3.10 Assessment of SARS-CoV-2 Neutralizing Antibody Using Lentivirus Pseudotype Assay

1. ELISA measures total binding antibody or IgG, but this may not correlate with virus-neutralizing capacity, which must instead be measured using a more specific neutralizing antibody assay.
2. A replication-deficient SARS-CoV-2 spike-pseudotyped lentivirus-based neutralization assay (pVNT) enables convenient evaluation under BSL2 conditions. (Neutralization assays using wild-type SARS-CoV-2 virus must still be conducted in a BSL3 facility.)
3. This pVNT assay can measure the neutralizing activity of immune sera from any species, including mice and humans, because, unlike ELISA, it is not species specific.
4. A single-cell sorted HEK-293T stable cell line expressing human ACE2 on the plasma membrane [59] is maintained in DMEM containing 10% FCS (DMEM-10).
5. An expression cassette of human codon-optimized SARS-CoV-2 spike with a C-terminal 18-aa truncation, from the original strain and from different variants, is cloned into the pCAGGS vector.
6. Spike-pseudotyped lentiviral particles are produced by co-transfecting HEK-293T cells with the firefly luciferase-encoding third-generation lentiviral vector pCDH-EF1-Luc-IRES-Puro, the packaging plasmid psPAX2, and the spike-expressing plasmid pCAGGS-Spike using Lipofectamine 2000 according to the product manual; recombinant virus particles are harvested at 72 h post transfection.
7. The supernatant is centrifuged for 15 min at 1500 × g, filtered through a 0.45 μm syringe filter, and stored at -80 °C.
8. Neutralization activity of immune sera is then measured with a single-round transduction of 293T-hACE2 cells with spike-pseudotyped lentiviral particles.
9. Prior to infecting cells, immune sera are serially diluted in 50 μL and incubated with 50 μL of pseudotyped virus particles containing about 20,000 relative light units (RLU) for 1 h at 37 °C.
10. Then 50 μL of 293T-hACE2 cells, at 12,500 cells per well, are added in a white 96-well tissue culture plate.
11. The cells are then cultured at 37 °C for 72 h, after which the culture medium is removed and replaced with 30 μL of phenol red-free DMEM.
12. Then 30 μL of ONE-Glo EX (Promega) reagent is added to each well and incubated at RT with shaking at 450 rpm on a ThermoMixer before luciferase activity is read on a BMG FluoStar plate reader.
13. Neutralization is calculated as the percentage reduction in RLU relative to the pseudotyped-virus-alone group without any serum treatment.
14. Neutralizing antibody titers are calculated using the sigmoidal 4PL robust fit regression method in GraphPad Prism ver. 9.
15. If the SARS-CoV-2 vaccine is working ideally, it should induce as high a ratio of neutralizing antibody to total spike-binding antibody as possible.
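
The two serology readouts used above, the ELISA end-point titer (mean of the negative-control wells plus 3 × SD as cut-off) and the percentage neutralization in the pVNT assay (step 13), can be sketched as follows. This is a minimal illustration under stated assumptions rather than the analysis pipeline used by the authors: the function names and all numerical values are hypothetical, the sample standard deviation is assumed, and the walk through the dilution series is assumed to stop at the first negative well. The 4PL curve fit performed in GraphPad Prism is not reproduced here.

```python
import math
from statistics import mean, stdev


def elisa_cutoff(negative_control_ods):
    """Mean OD of the negative-control wells plus 3 x SD (sample SD assumed)."""
    return mean(negative_control_ods) + 3 * stdev(negative_control_ods)


def endpoint_titer_log10(reciprocal_dilutions, ods, cutoff):
    """log10 of the last reciprocal dilution whose OD is at or above the cut-off.

    The series is walked from the most to the least concentrated well and the
    walk stops at the first negative well; returns None if no well is positive.
    """
    last_positive = None
    for dilution, od in sorted(zip(reciprocal_dilutions, ods)):
        if od < cutoff:
            break
        last_positive = dilution
    return None if last_positive is None else math.log10(last_positive)


def percent_neutralization(sample_rlu, virus_only_rlu):
    """Percentage reduction in RLU relative to the virus-only wells."""
    return 100.0 * (1.0 - sample_rlu / virus_only_rlu)


# Hypothetical OD450 readings from eight naive-serum wells.
naive_ods = [0.05, 0.06, 0.05, 0.07, 0.06, 0.05, 0.06, 0.06]
cutoff = elisa_cutoff(naive_ods)

# Hypothetical fourfold dilution series of one immune serum.
dilutions = [100, 400, 1600, 6400, 25600]
sample_ods = [2.10, 1.30, 0.45, 0.12, 0.06]
titer = endpoint_titer_log10(dilutions, sample_ods, cutoff)

print(f"cut-off OD: {cutoff:.3f}")
print(f"end-point titer (log10): {titer:.2f}")
print(f"neutralization: {percent_neutralization(5000, 20000):.1f}%")
```

Reporting both values for the same serum makes it straightforward to compare neutralizing activity with total spike-binding antibody, as suggested in step 15.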

4 Notes

1. Small laboratory animals such as mice or hamsters are most convenient for such immunogenicity studies, being relatively low cost and able to provide reasonable group sizes for vaccine immunogenicity testing. Mice were not susceptible to the original SARS-CoV-2 virus isolates and hence were not useful for SARS-CoV-2 protection studies, which therefore needed to be performed in susceptible species such as hamsters, ferrets, cats, and monkeys.
2. Recombinant or inactivated protein vaccines are a safe and reliable approach, but generally suffer from weak immunogenicity unless formulated with an appropriate adjuvant [60]. Adjuvants induce higher and more durable immune responses and can also be used to impart a relevant T helper bias to the immune effector response. Advax-SM is a combination adjuvant developed by our team that consists of delta inulin polysaccharide particles (Advax™) formulated with a Toll-like receptor 9 (TLR9)-active molecule, CpG oligonucleotide. A similar adjuvant approach provided enhanced protection of recombinant spike protein vaccines against severe acute respiratory syndrome (SARS) and Middle East respiratory syndrome (MERS) coronaviruses [61, 62]. The Th1 bias imparted by the adjuvant also prevented the eosinophilic lung immunopathology otherwise seen after immunization with SARS spike protein alone or with alum adjuvant [61].

References

1. Macarron R, Banks MN, Bojanic D, Burns DJ, Cirovic DA, Garyantes T, Green DVS, Hertzberg RP, Janzen WP, Paslay JW, Schopfer U, Sittampalam GS (2011) Impact of high-throughput screening in biomedical research. Nat Rev Drug Discov 10(3):188–195. https://doi.org/10.1038/nrd3368
2. Crank MC, Ruckwardt TJ, Chen M, Morabito KM, Phung E, Costner PJ, Holman LA, Hickman SP, Berkowitz NM, Gordon IJ, Yamshchikov GV, Gaudinski MR, Kumar A, Chang LA, Moin SM, Hill JP, DiPiazza AT, Schwartz RM, Kueltzo L, Cooper JW, Chen P, Stein JA, Carlton K, Gall JG, Nason MC, Kwong PD, Chen GL, Mascola JR, McLellan JS, Ledgerwood JE, Graham BS, Team VRCS (2019) A proof of concept for structure-based vaccine design targeting RSV in humans. Science 365(6452):505–509. https://doi.org/10.1126/science.aav9033
3. Wang MY, Zhao R, Gao LJ, Gao XF, Wang DP, Cao JM (2020) SARS-CoV-2: structure, biology, and structure-based therapeutics development. Front Cell Infect Microbiol 10:587269. https://doi.org/10.3389/fcimb.2020.587269
4. Piplani S, Singh PK, Winkler DA, Petrovsky N (2021) In silico comparison of SARS-CoV-2 spike protein-ACE2 binding affinities across species and implications for virus origin. Sci Rep 11(1):13063. https://doi.org/10.1038/s41598-021-92388-5
5. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Zidek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596(7873):583–589. https://doi.org/10.1038/s41586-021-03819-2
6. Martinez-Flores D, Zepeda-Cervantes J, Cruz-Resendiz A, Aguirre-Sampieri S, Sampieri A, Vaca L (2021) SARS-CoV-2 vaccines based on the spike glycoprotein and implications of new viral variants. Front Immunol 12:701501. https://doi.org/10.3389/fimmu.2021.701501
7. Burley SK, Bhikadiya C, Bi C, Bittrich S, Chen L, Crichlow GV, Duarte JM, Dutta S, Fayazi M, Feng Z, Flatt JW, Ganesan SJ, Goodsell DS, Ghosh S, Kramer Green R, Guranovic V, Henry J, Hudson BP, Lawson CL, Liang Y, Lowe R, Peisach E, Persikova I, Piehl DW, Rose Y, Sali A, Segura J, Sekharan M, Shao C, Vallat B, Voigt M, Westbrook JD, Whetstone S, Young JY, Zardecki C (2022) RCSB Protein Data Bank: celebrating 50 years of the PDB with new tools for understanding and visualizing biological macromolecules in 3D. Protein Sci 31(1):187–208. https://doi.org/10.1002/pro.4213
8. Forrest LR, Tang CL, Honig B (2006) On the accuracy of homology modeling and sequence alignment methods applied to membrane proteins. Biophys J 91(2):508–517. https://doi.org/10.1529/biophysj.106.082313
9. Anfinsen CB (1973) Principles that govern the folding of protein chains. Science (New York, NY) 181(4096):223–230. https://doi.org/10.1126/science.181.4096.223
10. Chothia C, Lesk AM (1986) The relation between the divergence of sequence and structure in proteins. EMBO J 5(4):823–826. https://doi.org/10.1002/j.1460-2075.1986.tb04288.x
11. Martí-Renom MA, Stuart AC, Fiser A, Sánchez R, Melo F, Sali A (2000) Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 29:291–325. https://doi.org/10.1146/annurev.biophys.29.1.291
12. Arnold K, Kiefer F, Kopp J, Battey JND, Podvinec M, Westbrook JD, Berman HM, Bordoli L, Schwede T (2009) The Protein Model Portal. J Struct Funct Genom 10(1):1–8. https://doi.org/10.1007/s10969-008-9048-5
13. Pieper U, Webb BM, Dong GQ, Schneidman-Duhovny D, Fan H, Kim SJ, Khuri N, Spill YG, Weinkam P, Hammel M, Tainer JA, Nilges M, Sali A (2014) ModBase, a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res 42(Database issue):D336–D346. https://doi.org/10.1093/nar/gkt1144
14. Kiefer F, Arnold K, Künzli M, Bordoli L, Schwede T (2009) The SWISS-MODEL repository and associated resources. Nucleic Acids Res 37(Database issue):D387–D392. https://doi.org/10.1093/nar/gkn750
15. Weisel M, Proschak E, Schneider G (2007) PocketPicker: analysis of ligand binding-sites with shape descriptors. Chem Cent J 1(1):7. https://doi.org/10.1186/1752-153X-1-7
16. Huang B, Schroeder M (2006) LIGSITEcsc: predicting ligand binding sites using the Connolly surface and degree of conservation. BMC Struct Biol 6(1):19. https://doi.org/10.1186/1472-6807-6-19
17. Capra JA, Laskowski RA, Thornton JM, Singh M, Funkhouser TA (2009) Predicting protein ligand binding sites by combining evolutionary sequence conservation and 3D structure. PLoS Comput Biol 5(12):e1000585. https://doi.org/10.1371/journal.pcbi.1000585
18. Hendlich M, Rippmann F, Barnickel G (1997) LIGSITE: automatic and efficient detection of potential small molecule-binding sites in proteins. J Mol Graph Model 15(6):359–363, 389. https://doi.org/10.1016/s1093-3263(98)00002-3
19. Le Guilloux V, Schmidtke P, Tuffery P (2009) Fpocket: an open source platform for ligand pocket detection. BMC Bioinformatics 10(1):168. https://doi.org/10.1186/1471-2105-10-168
20. Laurie ATR, Jackson RM (2005) Q-SiteFinder: an energy-based method for the prediction of protein-ligand binding sites. Bioinformatics 21(9):1908–1916. https://doi.org/10.1093/bioinformatics/bti315
21. Ghersi D, Sanchez R (2009) EasyMIFs and SiteHound: a toolkit for the identification of ligand-binding sites in protein structures. Bioinformatics 25(23):3185–3186. https://doi.org/10.1093/bioinformatics/btp562
22. Henrich S, Salo-Ahen OMH, Huang B, Rippmann FF, Cruciani G, Wade RC (2010) Computational approaches to identifying and characterizing protein binding sites for ligand design. J Mol Recognit 23(2):209–219. https://doi.org/10.1002/jmr.984
23. Nisius B, Sha F, Gohlke H (2012) Structure-based computational analysis of protein binding sites for function and druggability prediction. J Biotechnol 159(3):123–134. https://doi.org/10.1016/j.jbiotec.2011.12.005
24. Xie Z-R, Hwang M-J (2015) Methods for predicting protein-ligand binding sites. Methods Mol Biol 1215:383–398. https://doi.org/10.1007/978-1-4939-1465-4_17
25. López-Vallejo F, Caulfield T, Martínez-Mayorga K, Giulianotti MA, Nefzi A, Houghten RA, Medina-Franco JL (2011) Integrating virtual screening and combinatorial chemistry for accelerated drug discovery. Comb Chem High Throughput Screen 14(6):475–487. https://doi.org/10.2174/138620711795767866
26. Heberlé G, de Azevedo WF (2011) Bio-inspired algorithms applied to molecular docking simulations. Curr Med Chem 18(9):1339–1352. https://doi.org/10.2174/092986711795029573
27. Halperin I, Ma B, Wolfson H, Nussinov R (2002) Principles of docking: an overview of search algorithms and a guide to scoring functions. Proteins 47(4):409–443. https://doi.org/10.1002/prot.10115
28. Meng X-Y, Zhang H-X, Mezei M, Cui M (2011) Molecular docking: a powerful approach for structure-based drug discovery. Curr Comput Aided Drug Des 7(2):146–157. https://doi.org/10.2174/157340911795677602
29. Yuriev E, Ramsland PA (2013) Latest developments in molecular docking: 2010–2011 in review. J Mol Recognit 26(5):215–239. https://doi.org/10.1002/jmr.2266
30. Buzko OV, Bishop AC, Shokat KM (2002) Modified AutoDock for accurate docking of protein kinase inhibitors. J Comput Aided Mol Des 16(2):113–127. https://doi.org/10.1023/a:1016366013656
31. Kellenberger E, Rodrigo J, Muller P, Rognan D (2004) Comparative evaluation of eight docking tools for docking and virtual screening accuracy. Proteins 57(2):225–242. https://doi.org/10.1002/prot.20149
32. Hansson T, Oostenbrink C, van Gunsteren W (2002) Molecular dynamics simulations. Curr Opin Struct Biol 12(2):190–196. https://doi.org/10.1016/s0959-440x(02)00308-1
33. Kuzmanic A, Zagrovic B (2010) Determination of ensemble-average pairwise root mean-square deviation from experimental B-factors. Biophys J 98(5):861–871. https://doi.org/10.1016/j.bpj.2009.11.011
34. Guterres H, Im W (2020) Improving protein-ligand docking results with high-throughput molecular dynamics simulations. J Chem Inf Model 60(4):2189–2198. https://doi.org/10.1021/acs.jcim.0c00057
35. Teodoro ML, Kavraki LE (2003) Conformational flexibility models for the receptor in structure based drug design. Curr Pharm Des 9(20):1635–1648. https://doi.org/10.2174/1381612033454595
36. Winkler DA (2020) Ligand entropy is hard but should not be ignored. J Chem Inf Model 60(10):4421–4423. https://doi.org/10.1021/acs.jcim.0c01146
37. Cross JB, Thompson DC, Rai BK, Baber JC, Fan KY, Hu Y, Humblet C (2009) Comparison of several molecular docking programs: pose prediction and virtual screening accuracy. J Chem Inf Model 49(6):1455–1474. https://doi.org/10.1021/ci900056c
38. Lape M, Elam C, Paula S (2010) Comparison of current docking tools for the simulation of inhibitor binding by the transmembrane domain of the sarco/endoplasmic reticulum calcium ATPase. Biophys Chem 150(1–3):88–97. https://doi.org/10.1016/j.bpc.2010.01.011
39. Charifson PS, Corkery JJ, Murcko MA, Walters WP (1999) Consensus scoring: a method for obtaining improved hit rates from docking databases of three-dimensional structures into proteins. J Med Chem 42(25):5100–5109. https://doi.org/10.1021/jm990352k
40. Ashoor D, Ben Khalaf N, Marzouq M, Jarjanazi H, Fathallah MD (2020) SARS-CoV-2 RBD mutations, ACE2 genetic polymorphism, and stability of the virus-receptor complex: The COVID-19 host-pathogen nexus. bioRxiv:2020.2010.2023.352344. https://doi.org/10.1101/2020.10.23.352344
41. Hoffmann M, Kleine-Weber H, Schroeder S, Krüger N, Herrler T, Erichsen S, Schiergens TS, Herrler G, Wu N-H, Nitsche A (2020) SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor. Cell
42. Li L, Honda-Okubo Y, Huang Y, Jang H, Carlock MA, Baldwin J, Piplani S, Bebin-Blackwell AG, Forgacs D, Sakamoto K, Stella A, Turville S, Chataway T, Colella A, Triccas J, Ross TM, Petrovsky N (2021) Immunisation of ferrets and mice with recombinant SARS-CoV-2 spike protein formulated with Advax-SM adjuvant protects against COVID-19 infection. Vaccine 39(40):5940–5953. https://doi.org/10.1016/j.vaccine.2021.07.087
43. Li L, Honda-Okubo Y, Baldwin J, Bowen R, Bielefeldt-Ohmann H, Petrovsky N (2022) Covax-19/Spikogen(R) vaccine based on recombinant spike protein extracellular domain with Advax-CpG55.2 adjuvant provides single dose protection against SARS-CoV-2 infection in hamsters. Vaccine 40(23):3182–3192. https://doi.org/10.1016/j.vaccine.2022.04.041
44. Tabynov K, Orynbassar M, Yelchibayeva L, Turebekov N, Yerubayev T, Matikhan N, Yespolov T, Petrovsky N, Tabynov K (2022) A spike protein-based subunit SARS-CoV-2 vaccine for pets: safety, immunogenicity, and protective efficacy in juvenile cats. Front Vet Sci 9:815978. https://doi.org/10.3389/fvets.2022.815978
45. Tabarsi P, Anjidani N, Shahpari R, Mardani M, Sabzvari A, Yazdani B, Roshanzamir K, Bayatani B, Taheri A, Petrovsky N, Li L, Barati S (2022) Safety and immunogenicity of SpikoGen(R), an Advax-CpG55.2-adjuvanted SARS-CoV-2 spike protein vaccine: a phase 2 randomized placebo-controlled trial in both seropositive and seronegative populations. Clin Microbiol Infect 28(9):1263–1271. https://doi.org/10.1016/j.cmi.2022.04.004
46. Tabarsi P, Anjidani N, Shahpari R, Roshanzamir K, Fallah N, Andre G, Petrovsky N, Barati S (2022) Immunogenicity and safety of SpikoGen(R), an adjuvanted recombinant SARS-CoV-2 spike protein vaccine as a homologous and heterologous booster vaccination: a randomized placebo-controlled trial. Immunology 167(3):340–353. https://doi.org/10.1111/imm.13540
47. Wu F, Zhao S, Yu B, Chen Y-M, Wang W, Song Z-G, Hu Y, Tao Z-W, Tian J-H, Pei Y-Y (2020) A new coronavirus associated with human respiratory disease in China. Nature 579(7798):265–269
48. Song W, Gui M, Wang X, Xiang Y (2018) Cryo-EM structure of the SARS coronavirus spike glycoprotein in complex with its host cell receptor ACE2. PLoS Pathog 14(8):e1007236
49. Wu K, Peng G, Wilken M, Geraghty RJ, Li F (2012) Mechanisms of host receptor adaptation by severe acute respiratory syndrome coronavirus. J Biol Chem 287(12):8904–8911
50. Yan Y, Tao H, He J, Huang S-Y (2020) The HDOCK server for integrated protein–protein docking. Nat Protoc 15(5):1829–1852. https://doi.org/10.1038/s41596-020-0312-x
51. Piplani S, Singh PK, Winkler DA, Petrovsky N (2020) In silico comparison of spike protein-ACE2 binding affinities across species; significance for the possible origin of the SARS-CoV-2 virus. arXiv preprint arXiv:2005.06199
52. Yan Y, Zhang D, Zhou P, Li B, Huang SY (2017) HDOCK: a web server for protein-protein and protein-DNA/RNA docking based on a hybrid strategy. Nucleic Acids Res 45(W1):W365–W373. https://doi.org/10.1093/nar/gkx407
53. Abraham MJ, Murtola T, Schulz R, Páll S, Smith JC, Hess B, Lindahl E (2015) GROMACS: high performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX 1–2:19–25
54. Baker NA, Sept D, Joseph S, Holst MJ, McCammon JA (2001) Electrostatics of nanosystems: application to microtubules and the ribosome. Proc Natl Acad Sci U S A 98(18):10037–10041. https://doi.org/10.1073/pnas.181342398
55. Kumari R, Kumar R, Open Source Drug Discovery Consortium, Lynn A (2014) g_mmpbsa–a GROMACS tool for high-throughput MM-PBSA calculations. J Chem Inf Model 54(7):1951–1962. https://doi.org/10.1021/ci500020m
56. Wang E, Sun H, Wang J, Wang Z, Liu H, Zhang JZH, Hou T (2019) End-point binding free energy calculation with MM/PBSA and MM/GBSA: strategies and applications in drug design. Chem Rev 119(16):9478–9508. https://doi.org/10.1021/acs.chemrev.9b00055
57. Shang J, Ye G, Shi K, Wan Y, Luo C, Aihara H, Geng Q, Auerbach A, Li F (2020) Structural basis of receptor recognition by SARS-CoV-2. Nature 581(7807):221–224. https://doi.org/10.1038/s41586-020-2179-y
58. Hutchinson G, Abiona O, Ziwawo C, Werner A, Ellis D, Tsybovsky Y, Leist S, Palandjian C, West A, Fritch E, Wang N, Wrapp D, Boyoglu-Barnum S, Ueda G, Baker D, Kanekiyo M, McLellan J, Baric R, King N, Graham B, Corbett K (2022) Nanoparticle display of prefusion coronavirus spike elicits S1-focused cross-reactive protection across divergent subgroups. Res Sq. https://doi.org/10.21203/rs.3.rs-2199814/v1
59. Crawford KHD, Eguia R, Dingens AS, Loes AN, Malone KD, Wolf CR, Chu HY, Tortorici MA, Veesler D, Murphy M, Pettie D, King NP, Balazs AB, Bloom JD (2020) Protocol and reagents for pseudotyping lentiviral particles with SARS-CoV-2 spike protein for neutralization assays. Viruses 12(5). https://doi.org/10.3390/v12050513
60. Perrie Y, Mohammed AR, Kirby DJ, McNeil SE, Bramwell VW (2008) Vaccine adjuvant systems: enhancing the efficacy of sub-unit protein antigens. Int J Pharm 364(2):272–280
61. Honda-Okubo Y, Barnard D, Ong CH, Peng B-H, Tseng C-TK, Petrovsky N (2015) Severe acute respiratory syndrome-associated coronavirus vaccines formulated with delta inulin adjuvants provide enhanced protection while ameliorating lung eosinophilic immunopathology. J Virol 89(6):2995–3007
62. Adney DR, Wang L, Van Doremalen N, Shi W, Zhang Y, Kong W-P, Miller MR, Bushmaker T, Scott D, de Wit E (2019) Efficacy of an adjuvanted Middle East respiratory syndrome coronavirus spike protein vaccine in dromedary camels and alpacas. Viruses 11(3):212

Chapter 27

Reverse Vaccinology for Influenza A Virus: From Genome Sequencing to Vaccine Design

Valentina Di Salvatore, Giulia Russo, and Francesco Pappalardo

Abstract

Reverse vaccinology (RV) consists of identifying, through in silico analysis of genomic information, the potentially protective antigens expressed by an organism, with the aim of promoting the discovery of new candidate vaccines against different types of pathogens. This approach uses bioinformatics techniques to screen the whole genomic sequence of a specific pathogen for the epitopes that could elicit the best immune response. In silico techniques dramatically reduce both the time and cost required to identify a potential vaccine, and they facilitate the laborious selection of antigens that would be impossible to detect or culture with a traditional approach. RV methodologies have been successfully applied to the identification of new vaccines against serogroup B meningococcus (MenB), Bacillus anthracis, Streptococcus pneumoniae, Staphylococcus aureus, Chlamydia pneumoniae, Porphyromonas gingivalis, Edwardsiella tarda, and Mycobacterium tuberculosis. As a case study, we describe in depth the application of RV techniques to Influenza A virus.

Key words Reverse vaccinology, In silico trials, Vaccine design, Epitopes prediction, Immune system response, Immune system simulation

1 Introduction

Since Rappuoli coined it in 2000 [1], the term “reverse vaccinology” has foreshadowed the massive revolution this innovative approach would bring. Thanks to recent advancements in genomic techniques, it is now possible to sequence practically every type of genome, proteome, or transcriptome, giving researchers access to a huge amount of valuable knowledge. This genome-based information can therefore be used, through specific bioinformatics tools, to identify the best epitopes for new and effective vaccine formulations against specific diseases.

The traditional vaccine development approach originated in the early eighteenth century and involved the use of attenuated or inactivated viruses. This kind of approach has been predominant

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_27, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


for a very long time, but the vaccines produced with this technique were often associated with several side effects; in many cases, for example, vaccinated individuals developed the very disease they had been vaccinated against [2]. Both traditional approaches require a long series of in vitro analyses to obtain attenuated virus strains or to identify protective antigens for vaccine formulation. In particular, the pathogen is usually cultured in a laboratory environment, and its components are then identified one by one in search of the most protective antigens. However, identifying potential antigens is not as easy as it may seem: it is possible only if the quantity of purified antigen is sufficient for vaccine testing. Unfortunately, most of the abundant proteins are usually not suitable for vaccine design, while the less common proteins can be very difficult to identify because of the lack of proper genetic tools. If the best antigen is successfully identified, the next step is its large-scale production, usually through in vitro cultures of the pathogen, after which the new molecule can be used for vaccine development. As is easy to guess, this kind of approach can take many years to yield the final vaccine formulation.

Since then, tremendous strides have been made: one of the most revolutionary was the use of modern recombinant DNA technology to produce subunit vaccines based on selected antigens [3]. Most of the vaccines produced in the last 20 years are based on this technique, including the hepatitis B and Bordetella pertussis vaccines. Nevertheless, although most subunit vaccines have proved very effective, for some vaccines, such as the Bacillus Calmette-Guérin (BCG) vaccine against tuberculosis, the mechanism behind the protective effect they seem able to induce remains unknown [2].

This lack of knowledge prevents a clear picture of all the interactions between pathogen and host, making the design of a vaccine against complex pathogens even more difficult. To date, several technological advancements have accelerated the early steps of vaccine discovery and design [4]. For example, the availability of complete genome sequences, combined with new molecular biology or microarray technologies, makes it possible to test within a week the capacity of every potential antigen to induce a protective immune response. It was in this context of technological change that the RV approach first appeared. RV studies were among the first to exploit the abundance of information generated by genome sequencing for vaccine development. The first clinically approved vaccine developed using an RV approach was MenB, against infection by meningococcal group B bacteria [5]. The use of genome-based information has led to the discovery of previously unknown and unreported proteins, which may represent potential vaccine


candidates, if properly characterized. The main advantage of the RV technique is that potentially protective antigens can be identified without culturing the pathogen in a laboratory and regardless of the quantity of purified antigen available. This translates into large savings in both time and resources.

1.1 Reverse Vaccinology Approach Overview

After the first application of RV technology to the design of the MenB vaccine [6], a usage protocol was defined, indicating all the steps and the proper tools needed to conduct a complete analysis. Today, two different types of algorithm can be applied to define the strategy for the RV approach [7]: decision tree (filtering) or machine learning (classifying) algorithms. The suitable tools vary with the chosen algorithm, although both share the same input, protein sequences, and the same purpose, the selection of a group of potential antigens. Both approaches are briefly described below, with some guidance on the tools best suited to each type of algorithm.
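To make the filtering (decision tree) strategy more concrete, the following is a minimal sketch of how a chain of filters narrows a set of proteins down to candidate antigens. It assumes that each protein has already been annotated by external tools; the field names (`surface_exposed`, `human_homolog`, `antigenicity`), the 0.5 cutoff, and the example proteins are hypothetical placeholders chosen for illustration and are not the filters of any specific RV program.

```python
from dataclasses import dataclass


@dataclass
class Protein:
    name: str
    surface_exposed: bool  # hypothetical annotation from a localization predictor
    human_homolog: bool    # hypothetical result of a similarity search against human proteins
    antigenicity: float    # hypothetical score between 0 and 1


def filter_candidates(proteins, min_antigenicity=0.5):
    """Apply the filters one after another, as in a decision tree."""
    kept = []
    for p in proteins:
        if not p.surface_exposed:
            continue  # filter 1: discard proteins not exposed to the immune system
        if p.human_homolog:
            continue  # filter 2: discard proteins similar to host proteins
        if p.antigenicity < min_antigenicity:
            continue  # filter 3: discard proteins with a low antigenicity score
        kept.append(p)
    return kept


# Toy input: four made-up proteins, one of which passes every filter.
proteins = [
    Protein("P1", True, False, 0.8),
    Protein("P2", False, False, 0.9),
    Protein("P3", True, True, 0.7),
    Protein("P4", True, False, 0.3),
]
candidates = [p.name for p in filter_candidates(proteins)]
print(candidates)
```

A machine learning approach would instead replace the fixed sequence of filters with a classifier trained on known protective antigens.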

Decision tree (filtering): A series of filters is used to identify the potential vaccine candidates among the protein sequences used as input. For this reason, this approach relies on flowchart-like programs representing the sequence of filters applied. Most studies using this kind of approach did not share a standard protocol [8]; therefore, different bioinformatic tools may be used, for instance Vaxign (http://www.violinet.org/vaxign/) or BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome). A first attempt to create an automated RV tool was the New Enhanced Reverse Vaccinology Environment (NERVE) [9].

Machine learning (classifying): Machine learning techniques can model the whole bacterial proteome and classify predicted antigens by their ability to elicit a protective effect, and hence by their potential as vaccine candidates [10]. VaxiJen (http://www.ddg-pharmfac.net/vaxijen/VaxiJen/VaxiJen.html) is one of the most commonly used tools for machine learning RV, and it was also the first RV software to adopt this kind of approach.

RV technology was used for the very first time within the Neisseria meningitidis serogroup B project, aimed at identifying a new and effective vaccine formulation against group B meningococcus (MenB). What most surprised the scientific community at the time was that all the antigens identified by the RV approach were completely different from those identified by the traditional approach, which explained the ineffectiveness of the vaccines then available against that specific strain of meningococcus. Most of the newly identified antigens were lipoproteins or surface proteins that are not commonly found on the bacterial surface and are therefore very difficult to find and identify with traditional methods [3]. From its first applications, therefore, the RV approach demonstrated its greatest strengths: it allows the discovery of possible new epitopes even where traditional methods fail, and it can deliver results in drastically less time, which is critical, especially when dealing with lethal pandemics.

1.2 A Case Study: Application of RV Methodology on the Design of the Influenza A Vaccine

Reverse vaccinology, together with computational modeling and simulation, can help accelerate the vaccine discovery process. One of the most common applications of this combined approach is the computer-aided design of multi-epitope vaccines, which usually consists of several stages ranging from data collection to the identification of vaccine candidates. Despite its undeniable usefulness in identifying the best epitopes, this approach still has a major flaw: it cannot demonstrate the therapeutic effectiveness of the vaccine formulation. The case study described below addresses this gap by using the Universal Immune System Simulator (UISS). UISS is an advanced agent-based model that can faithfully reproduce the immune system dynamics of the human body in response to precise stimuli [11, 12]. It uses a multilayer approach which includes the following:

• The physiological response of the immune system to a self/non-self entity (physiology layer)
• The dynamics related to the progression of the disease (disease layer)
• The effects induced by different treatments for that specific disease (treatment layer)

This multilevel organization makes UISS particularly flexible and easily adaptable to several biological scenarios, making it an indispensable tool for predicting the course of different pathologies and the effects of any new treatments [13]. As a working example, we focus on the identification of a multi-epitope vaccine for Influenza A virus.


Table 1 Bioinformatic tools used in the proposed workflow

Tool               Description                  Download
ClustalOmega       To perform MSA               http://www.clustal.org/omega/
Jalview            To analyze results of MSA    https://www.jalview.org/
NetCTL 1.2         To predict CTL epitopes      https://bio.tools/netctl
NetMHCIIpan 4.0    To predict HTL epitopes      https://services.healthtech.dtu.dk/
BepiPred           To predict B cell epitopes   http://tools.iedb.org/bcell/help/
VaxiJen v.2.0      To evaluate antigenicity     http://www.ddg-pharmfac.net/vaxijen/VaxiJen/VaxiJen.html
AllerTOP v.2.0     To evaluate allergenicity    https://www.ddg-pharmfac.net/AllerTOP/

2 Materials

2.1 Sequences

The HA and NA protein sequences have been retrieved from the National Center for Biotechnology Information (NCBI) database. In this particular case, protein sequences of eight different H5N1 Influenza strains, which were widespread from 1997 to 2005, have been collected.

2.2 Software

All the software used in this protocol is summarized in Table 1. In particular, ClustalOmega [14] is used to perform MSA, in combination with another tool, Jalview [15]. Cytotoxic T-lymphocyte (CTL) epitopes are predicted with the NetCTL 1.2 application [16], while Helper T-lymphocyte (HTL) epitopes are predicted with the NetMHCIIpan 4.0 server [17]. The BepiPred Linear Epitope Prediction server [18] is used to predict B cell epitopes, and the VaxiJen v.2.0 server [19] and AllerTOP v.2.0 [20] are used to evaluate the antigenicity and allergenicity of the selected epitopes.

3 Methods

3.1 General Workflow

The combined use of RV technologies and the UISS simulator has created a general workflow applicable for the identification of the best vaccine formulation against any pathogen, which is shown in Fig. 1. UISS-REVAX is the name of the particular implementation specific for REverse VAccinology added to the general framework of UISS.


[Figure 1: flowchart. Stages shown: selection of human pathogens and Multiple Sequence Alignment (MSA); CTL, HTL, and B cell epitopes prediction; antigenicity and allergenicity evaluation; peptides and epitopes selection; UISS-REVAX in silico trial (identification of optimal dosage, vaccine protection prediction, vaccine efficacy prediction); multi-epitope vaccine design.]

Fig. 1 Workflow of the proposed advanced RV pipeline for multi-epitope vaccine design

The workflow consists of the following:

(a) Selection of FASTA sequences of specific human pathogens.
(b) Multiple Sequence Alignment (MSA).
(c) Prediction of Cytotoxic T-lymphocyte (CTL) epitopes.
(d) Prediction of Helper T-lymphocyte (HTL) epitopes.
(e) Identification of linear B cell epitopes.
(f) Evaluation of antigenicity and allergenicity of all predicted epitopes.
(g) Efficacy prediction of selected peptides through the UISS-REVAX modeling and simulation platform.
(h) Multi-epitope vaccine formulation design.


3.2 Collection of Influenza A Protein Sequences

Influenza A virus is an RNA virus whose main structural components are the Hemagglutinin (HA) and Neuraminidase (NA) proteins. Influenza A virus is able to evolve, giving rise to variants that are more contagious and virulent and, therefore, more difficult to counter [13]. The first step in applying the workflow to identify a potential multi-epitope vaccine against Influenza A virus is the collection of HA and NA protein sequences from the National Center for Biotechnology Information (NCBI) database.
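Once the HA and NA sequences have been saved in FASTA format, they can be loaded for the following steps with a few lines of code. The sketch below is a minimal FASTA reader written with the Python standard library only; the record names and the short sequences in the example are hypothetical placeholders, not real H5N1 entries.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its full sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []  # start a new record
        elif header is not None:
            records[header].append(line)  # sequence lines may be wrapped
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical two-record FASTA text used only to exercise the parser.
example = """>HA_example_strain
MEKIVLLF
AIVSLVKS
>NA_example_strain
MNPNQKII
"""
records = parse_fasta(example)
print(records)
```

Wrapped sequence lines are joined, so each record ends up as a single string regardless of the line width used in the downloaded file.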

3.3 Multi-Sequence Alignment

Next, Multi-Sequence Alignment (MSA) is performed on the selected sequences. MSA analyzes the homology and correlation level between all the sequences under investigation to obtain a consensus sequence of residues (amino acids or nucleotides). To perform MSA, ClustalOmega software [14] is used (see Note 1), in combination with another tool, Jalview [15], a web-based application that analyzes the results of MSA and provides information about novel protein or RNA sequence groups and the relationships among them.
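The idea of a consensus sequence can be illustrated with a simple majority vote over the columns of an alignment. This is only a sketch under stated assumptions: it is not the algorithm used by ClustalOmega or Jalview, it expects sequences that are already aligned to the same length, it treats gap characters like any other symbol, and it resolves ties in favor of the residue seen first in the column. The four short sequences are made up.

```python
from collections import Counter


def consensus(aligned):
    """Return the most frequent residue of each alignment column."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    result = []
    for column in zip(*aligned):
        residue, _count = Counter(column).most_common(1)[0]
        result.append(residue)
    return "".join(result)


# Toy alignment of four hypothetical sequences.
aligned = ["MKTLV", "MKSLV", "MKTLI", "MRTLV"]
cons = consensus(aligned)
print(cons)
```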

3.4 Epitopes Prediction

The next step is the prediction of Cytotoxic T-lymphocyte (CTL) epitopes with the NetCTL 1.2 application [16] (see Note 2). This tool takes the consensus sequences from the MSA analysis in FASTA format as input, lets the user customize the default settings depending on the purpose of the analysis, and provides binding affinity scores as output. Subsequently, Helper T-lymphocyte (HTL) epitopes are predicted with the NetMHCIIpan 4.0 server [17], a web server created to predict binding between peptides and MHC-II molecules. Like NetCTL, this tool takes the consensus sequences as input and gives binding affinity scores as output. B cell epitopes are predicted using the BepiPred Linear Epitope Prediction server [18] (see Note 3). BepiPred takes consensus sequences in FASTA format as input and provides as output a table showing each residue along with its score and a significance threshold.
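A per-residue score table with a threshold, such as the one described above for B cell epitope prediction, can be turned into candidate linear epitopes by collecting runs of consecutive residues that reach the threshold. The sketch below shows this post-processing step only; the sequence, the scores, the 0.5 threshold, and the minimum length are hypothetical values, and the function does not call or reproduce BepiPred itself.

```python
def segments_above_threshold(sequence, scores, threshold, min_length=1):
    """Return (1-based start, peptide) for each run of residues scoring >= threshold."""
    if len(sequence) != len(scores):
        raise ValueError("one score per residue is required")
    segments = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            if i - start >= min_length:
                segments.append((start + 1, sequence[start:i]))
            start = None  # the run has ended
    # Close a run that extends to the end of the sequence.
    if start is not None and len(sequence) - start >= min_length:
        segments.append((start + 1, sequence[start:]))
    return segments


# Hypothetical sequence and per-residue scores.
sequence = "ACDEFGHIKL"
scores = [0.1, 0.6, 0.7, 0.8, 0.2, 0.3, 0.9, 0.9, 0.1, 0.6]
found = segments_above_threshold(sequence, scores, threshold=0.5, min_length=2)
print(found)
```

With `min_length=2`, the single high-scoring residue at the end of the toy sequence is discarded, since an isolated residue is not a useful peptide.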

3.5 Antigenicity and Allergenicity Evaluation

After all MHC-I, MHC-II, and B cell epitope predictions have been completed, the next step is to evaluate the antigenicity and allergenicity of the collected epitopes, so that only epitopes that are both antigenic and non-allergenic are selected as good candidates for vaccine formulation. The VaxiJen v.2.0 server [19] is used to evaluate the antigenicity of the selected epitopes and to predict protective antigens, while AllerTOP v.2.0 [20] is used to identify allergens among them.
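The selection rule above (antigenic and non-allergenic) amounts to combining the results of two independent tools. The following sketch assumes that the results have been copied by hand into two dictionaries keyed by peptide; the peptide labels, the scores, and the 0.4 antigenicity cutoff are hypothetical and should be replaced by the values and threshold actually used in the analysis.

```python
def select_candidates(antigenicity, allergenicity, threshold=0.4):
    """Keep peptides evaluated by both tools that are antigenic and not allergenic."""
    selected = []
    for peptide, score in antigenicity.items():
        if peptide not in allergenicity:
            continue  # no allergenicity result available, so the peptide is skipped
        if score >= threshold and not allergenicity[peptide]:
            selected.append(peptide)
    return sorted(selected)


# Hypothetical results: antigenicity scores and allergen calls (True = allergen).
antigenicity = {"PEP1": 0.9, "PEP2": 0.2, "PEP3": 0.8, "PEP4": 0.7}
allergenicity = {"PEP1": False, "PEP2": False, "PEP3": True}
selected = select_candidates(antigenicity, allergenicity)
print(selected)
```

Skipping peptides that lack one of the two results is a conservative design choice: a peptide is kept only when both criteria have actually been checked.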


3.6 Simulation on UISS-FLU

As a last, fundamental step, UISS-FLU, a specific implementation of the influenza disease layer in the UISS platform, is used (see Note 4). To improve the statistical significance of the results, also in terms of immunological variability, it is recommended that all simulations be performed using a cohort of 100 virtual patients with different immunological backgrounds. UISS allows the user to simulate several scenarios with different settings:

(a) Influenza virus challenge only on the virtual cohort.
(b) Multi-epitope vaccine administered to the virtual cohort without virus challenge.
(c) One injection of multi-epitope vaccine (with dosages ranging from 5000 to 1,500,000 LP per ml) and influenza challenge at 40, 60, and 120 days.
(d) One injection of multi-epitope vaccine (with a dosage of 500,000 LP per ml), a booster dose at 90 days (with the same dosage), and then influenza challenge at 120 days.
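When planning the simulations, it can help to write the scenarios down as plain data before running them. The sketch below does only that bookkeeping; it is not the UISS input format and does not call UISS. Several values are assumptions made for illustration: the challenge day of scenario (a), the dose used in scenario (b), and the choice of three dose levels in scenario (c), where the text gives only the range.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class Scenario:
    label: str
    dose_lp_per_ml: Optional[int]  # None means no vaccine
    booster_day: Optional[int]     # None means no booster
    challenge_day: Optional[int]   # None means no virus challenge


# Assumed dose levels: the two ends of the stated range plus the proposed dose.
doses = [5_000, 500_000, 1_500_000]
challenge_days = [40, 60, 120]

scenarios = [
    Scenario("a: challenge only", None, None, 0),         # day 0 is an assumption
    Scenario("b: vaccine only", 500_000, None, None),     # dose is an assumption
]
scenarios += [
    Scenario(f"c: dose {dose}, challenge at day {day}", dose, None, day)
    for dose, day in product(doses, challenge_days)
]
scenarios.append(Scenario("d: prime, booster, challenge", 500_000, 90, 120))

for s in scenarios:
    print(s.label)
```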

3.7 Results

The final result of the entire workflow is the prediction of the best vaccine formulation in relation to the following immune system dynamics (see Note 5):

– Lung epithelial infected cells and total lung epithelial cells population levels.
– IgM, IgG, and IgA concentration levels.
– IL-1, IL-2, IL-6, IL-12, IFN-γ, and TNF-α concentration levels.
– Neutrophils, MHC-II antigen presenting macrophages, MHC-I antigen presenting dendritic cells, and MHC-II antigen presenting dendritic cells population levels.
– Activated CD4+ Th1 cells, activated CD8+ T cells, total memory B cells, and total memory Th1 cells population levels.

The proposed epitopes combination is the following:

LYDKVRLQL (MHC-I) + DAINFESNGNFIAPE (MHC-II) + LLNDKHSNGTVKDRSP (B)

with a dosage of 500,000 LP per ml and a booster at 90 days (see Note 6).

4 Notes

1. ClustalOmega [14] can align up to 4000 sequences or a maximum file size of 4 MB; for larger sequences, it is necessary to fragment the original sequence and submit each shorter sequence one by one.
2. NetCTL [16] has several restrictions on the size of submitted data: at most 5000 sequences per submission, each sequence no more than 20,000 and no fewer than nine amino acids long, and at most 200,000 amino acids in total.
3. A new version of the BepiPred software [18] is currently available, solving most of the issues related to the first version, which was used in the suggested workflow: BepiPred-2.0 offers significantly improved predictive power compared to other available tools, including its first version.
4. The UISS simulator, used to assess the effectiveness of the indicated vaccine formulation and to select the best epitopes among those under examination, does not have a user-friendly graphical interface, which could limit its use by non-experts. The development team is currently working on solving this problem.
5. The suggested workflow for the design of a multi-epitope vaccine does not provide the complete vaccine formulation; it simply suggests the best combination of epitopes for maximum immune response. This also affects the evaluation of efficacy, which needs to be tested in the later stages of vaccine development.
6. The methodology applied for the design of the multi-epitope vaccine does not provide any indication about the most appropriate route of administration, which usually is the direct way: injection into the circulatory stream.
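The size limits listed in Note 2 can be checked locally before a submission is prepared. The sketch below encodes exactly the four limits stated in that note; the function name and the wording of the messages are illustrative choices, and the limits should be checked again against the server documentation in case they have changed.

```python
MAX_SEQUENCES = 5000   # maximum number of sequences per submission (Note 2)
MIN_LENGTH = 9         # minimum length of each sequence, in amino acids
MAX_LENGTH = 20_000    # maximum length of each sequence, in amino acids
MAX_TOTAL = 200_000    # maximum total number of amino acids per submission


def check_submission_limits(sequences):
    """Return a list of problems; the list is empty if all limits are respected."""
    problems = []
    if len(sequences) > MAX_SEQUENCES:
        problems.append(f"too many sequences: {len(sequences)} > {MAX_SEQUENCES}")
    for index, seq in enumerate(sequences, start=1):
        if len(seq) < MIN_LENGTH:
            problems.append(f"sequence {index} is shorter than {MIN_LENGTH} amino acids")
        elif len(seq) > MAX_LENGTH:
            problems.append(f"sequence {index} is longer than {MAX_LENGTH} amino acids")
    total = sum(len(seq) for seq in sequences)
    if total > MAX_TOTAL:
        problems.append(f"total length {total} exceeds {MAX_TOTAL} amino acids")
    return problems


ok = check_submission_limits(["LYDKVRLQL", "DAINFESNGNFIAPE"])
too_short = check_submission_limits(["LYDK"])
print(ok, too_short)
```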

Acknowledgments

The authors acknowledge partial support from University of Catania, internal grants.

References

1. Rappuoli R (2000) Reverse vaccinology. Curr Opin Microbiol [Internet] 3(5):445–450. Available from: https://linkinghub.elsevier.com/retrieve/pii/S1369527400001193
2. De Sousa KP, Doolan DL (2016) Immunomics: a 21st century approach to vaccine development for complex pathogens. Parasitology [Internet] 143(2):236–244. Available from: https://www.cambridge.org/core/product/identifier/S0031182015001079/type/journal_article
3. Adu-Bobie J (2003) Two years into reverse vaccinology. Vaccine [Internet] 21(7–8):605–610. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0264410X02005662
4. Black S, Bloom DE, Kaslow DC, Pecetta S, Rappuoli R (2020) Transforming vaccine development. Semin Immunol [Internet] 50:101413. Available from: https://linkinghub.elsevier.com/retrieve/pii/S1044532320300294
5. Palumbo E, Fiaschi L, Brunelli B, Marchi S, Savino S, Pizza M (2012) Antigen identification starting from the genome: a “reverse vaccinology” approach applied to MenB, pp 361–403. Available from: http://link.springer.com/10.1007/978-1-61779-346-2_21
6. Masignani V, Pizza M, Moxon ER (2019) The development of a vaccine against meningococcus B using reverse vaccinology. Front Immunol [Internet] 10. Available from: https://www.frontiersin.org/article/10.3389/fimmu.2019.00751/full
7. Dalsass M, Brozzi A, Medini D, Rappuoli R (2019) Comparison of open-source reverse vaccinology programs for bacterial vaccine antigen discovery. Front Immunol [Internet] 10. Available from: https://www.frontiersin.org/article/10.3389/fimmu.2019.00113/full
8. Heinson AI, Woelk CH, Newell M-L (2015) The promise of reverse vaccinology. Int Health [Internet] 7(2):85–89. Available from: https://academic.oup.com/inthealth/article-lookup/doi/10.1093/inthealth/ihv002
9. Vivona S, Bernante F, Filippini F (2006) NERVE: new enhanced reverse vaccinology environment. BMC Biotechnol [Internet] 6(1):35. Available from: https://bmcbiotechnol.biomedcentral.com/articles/10.1186/1472-6750-6-35
10. Heinson A, Gunawardana Y, Moesker B, Hume C, Vataga E, Hall Y et al (2017) Enhancing the biological relevance of machine learning classifiers for reverse vaccinology. Int J Mol Sci [Internet] 18(2):312. Available from: http://www.mdpi.com/1422-0067/18/2/312
11. Russo G, Pennisi M, Fichera E, Motta S, Raciti G, Viceconti M et al (2020) In silico trial to test COVID-19 candidate vaccines: a case study with UISS platform. BMC Bioinformatics [Internet] 21(17):1–16. Available from: https://doi.org/10.1186/s12859-020-03872-0
12. Russo G, Di Salvatore V, Sgroi G, Parasiliti Palumbo GA, Reche PA, Pappalardo F (2022) A multi-step and multi-scale bioinformatic protocol to investigate potential SARS-CoV-2 vaccine targets. Brief Bioinform [Internet]. [cited 2022 Apr 24];23(1). Available from: https://pubmed.ncbi.nlm.nih.gov/34607353/
13. Giulia R, Elena C, Avisa M, Di Salvatore Valentina PF (2022) Beyond the state of the art of reverse vaccinology: predicting vaccine e. Res Sq Prepr
14. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W et al (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol [Internet] 7(1):539. Available from: https://onlinelibrary.wiley.com/doi/10.1038/msb.2011.75
15. Procter JB, Carstairs GM, Soares B, Mourão K, Ofoegbu TC, Barton D et al (2021) Alignment of biological sequences with Jalview, vol 2231, pp 203–224. Available from: http://link.springer.com/10.1007/978-1-0716-1036-7_13
16. Larsen MV, Lundegaard C, Lamberth K, Buus S, Lund O, Nielsen M (2007) Large-scale validation of methods for cytotoxic T-lymphocyte epitope prediction. BMC Bioinformatics [Internet] 8(1):424. Available from: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-424
17. Reynisson B, Alvarez B, Paul S, Peters B, Nielsen M (2020) NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res [Internet] 48(W1):W449–W454. Available from: https://academic.oup.com/nar/article/48/W1/W449/5837056
18. Jespersen MC, Peters B, Nielsen M, Marcatili P (2017) BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes. Nucleic Acids Res [Internet] 45(W1):W24–W29. Available from: https://academic.oup.com/nar/article-lookup/doi/10.1093/nar/gkx346
19. Doytchinova IA, Flower DR (2007) VaxiJen: a server for prediction of protective antigens, tumour antigens and subunit vaccines. BMC Bioinformatics [Internet] 8(1):4. Available from: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-8-4
20. Dimitrov I, Bangov I, Flower DR, Doytchinova I (2014) AllerTOP v.2—a server for in silico prediction of allergens. J Mol Model [Internet] 20(6):2278. Available from: http://link.springer.com/10.1007/s00894-014-2278-5

Chapter 28

Immunoinformatics Vaccine Design for Zika Virus

Ana Clara Antonelli, Vinnycius Pereira Almeida, and Simone Gonçalves da Fonseca

Abstract

Zika virus (ZIKV) is an emerging virus from the Flaviviridae family and Flavivirus genus that has caused important outbreaks around the world. ZIKV infection is associated with severe neuropathology in newborns and adults. To date, there is no licensed vaccine available for ZIKV infection. Therefore, the development of a safe and effective vaccine against ZIKV is an urgent need. Recently, we designed an in silico multi-epitope vaccine for ZIKV based on immunoinformatics tools. To construct this in silico ZIKV vaccine, we used a consensus sequence generated from ZIKV sequences available in public databanks. Then, we selected CD4+ and CD8+ T cell epitopes from all ZIKV proteins based on the binding prediction to class II and class I human leukocyte antigen (HLA) molecules, promiscuity, and immunogenicity. ZIKV Envelope protein domain III (EDIII) was added to the construct, and B cell epitopes were identified. Adjuvants were added to increase immunogenicity. Distinct linkers were used to connect the CD4+ and CD8+ T cell epitopes, EDIII, and adjuvants. Several properties of the vaccine, such as antigenicity, population coverage, allergenicity, autoimmunity, and secondary and tertiary structures, were evaluated using various immunoinformatics tools and online web servers. In this chapter, we present the protocols with the rationale and detailed steps needed for this in silico multi-epitope ZIKV vaccine design.

Key words ZIKV, Vaccine, Immunoinformatics tools, Multi-epitope, Epitope prediction, Immunogenicity, CD4 epitopes, CD8 epitopes, B cell epitopes

1 Introduction

Zika virus (ZIKV) is a mosquito-borne virus that belongs to the Flaviviridae family and Flavivirus genus. ZIKV is a positive sense, single-stranded RNA virus that causes self-limited infection. Most cases present mild to moderate symptoms such as fever, rash, muscle and joint pain, and headache [1]; however, ZIKV is a neurotropic virus that can cause other pathologies such as Guillain-Barré Syndrome (GBS) and Congenital ZIKV Syndrome (CZS) [2, 3]. GBS is an autoimmune acute neuropathy characterized by demyelination of nerves and reduced myotatic reflexes, which can cause paralysis and even lead to death [4]. CZS consists of a set of

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_28, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


neurological malformations, such as microcephaly and eye and hearing damage, which can occur in fetuses of mothers infected by ZIKV during pregnancy [3, 5]. Prophylactic methods against ZIKV infections consist mainly of vector control strategies, since there are no licensed vaccines and no specific treatment [6]. ZIKV infection cases waned considerably after the outbreak in Brazil in 2015 and the three following years. However, the SARS-CoV-2 pandemic in 2020 is an example of why viruses, especially RNA viruses with high mutation rates, should be the target of constant scientific research, even those that cause mild symptoms and even during periods of reduced cases. Moreover, the World Health Organization reported in 2019 that 61 countries and territories, despite not having documented any cases of ZIKV infection, have evidence of the presence of the Aedes aegypti mosquito, which could lead to new outbreaks of ZIKV and/or other flaviviruses transmitted by this vector [7]. Several types of vaccines against ZIKV have been evaluated, including inactivated or live attenuated virus vaccines, DNA and RNA vaccines, and subunit and recombinant virus vaccines. Several of these vaccine candidates have progressed to phase I clinical trials; however, only one vaccine has progressed to phase II (NCT03110770) [8, 9]. This vaccine, similar to most ZIKV vaccines in current development, is based on the precursor membrane (prM) and envelope (E) proteins of ZIKV [8]. Besides whole virion or whole protein structures, peptides can also be used as immunogens in a vaccine structure. Peptide-based vaccines have gained attention in vaccine research after the development of bioinformatics tools that can predict critical epitopes in a protein sequence that are able to induce appropriate B and T cell immune responses [8, 10]. These tools can help reduce the time needed to develop a vaccine, from the identification of possible targets to the structure design of the vaccine.
Several in silico vaccine structures based on epitope prediction have been designed against ZIKV using different immunoinformatic tools [11–16]. Furthermore, Antonelli and colleagues (2022) tested synthetic peptides that were selected in silico and observed that they were able to induce the production of IFN-γ and/or IL-2 by CD4+ and CD8+ T cells when used to stimulate peripheral blood mononuclear cells (PBMCs) of individuals with previous ZIKV infections, showing the potential of immunoinformatics tools to predict epitopes that are actually immunogenic in vitro and are good candidates for in vivo validation [16]. In this chapter, we provide protocols of various immunoinformatic tools that can facilitate the identification of immunogenic T and B cell epitopes and the design of a ZIKV in silico vaccine structure, based on the strategies used in our in silico vaccine designed for ZIKV [16]. A general pipeline of the methods and rationale for the vaccine design is provided in Fig. 1.

Immunoinformatics Vaccine Design for Zika Virus


Fig. 1 Vaccine pipeline. Rationale of the design of an immunoinformatics based vaccine

2 Methods

2.1 Consensus Sequence

If the aim is to target epitopes conserved across different strains of ZIKV, the epitope prediction analysis should start from a consensus sequence, i.e., the protein chain containing the amino acid most commonly found at each position in the selected proteomes (see Fig. 2). In the case of ZIKV, as the genome is relatively conserved, the consensus sequence for strains isolated in Brazil and in the Americas is the same (as of March 2018) [16]. To obtain the consensus sequence of a specific virus, the protein sequences can be downloaded from NCBI (https://www.ncbi.nlm.nih.gov/protein) in FASTA format (see Fig. 3). The sequences of interest should then be collected in a single text file before the alignment. Each sequence must be preceded by a header line starting with the “>” symbol, which indicates that a new isolate sequence starts there. The alignment can be done using ClustalX (http://www.clustal.org/download/current/). In the software interface:

1. Click “File” and select “Load Sequences”.
2. Select the file with the sequences.
3. Click “Alignment” and select “Do Complete Alignment”.

After the alignment is done, a file in “aln” format is generated. This file is used to visualize the consensus sequence in Jalview (https://www.jalview.org/download/). Open the file generated by ClustalX in Jalview. All the sequences are shown already aligned, so the position of each amino acid can be seen, as well as the residues that differ from most strains (color > percentage identity). The consensus sequence is shown at the bottom of the page (see Fig. 4, in red). To copy this sequence, right-click on the word “consensus” and then select “copy consensus sequence”. Next, the proteome can be broken down into individual proteins for the downstream analysis (see Fig. 2).
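The consensus shown by Jalview can be understood as a per-column majority vote over the alignment. The following minimal Python sketch illustrates that idea only; it is not the Jalview implementation. It assumes the aligned sequences are already available as equal-length strings, and the function name `consensus` and the toy sequences are hypothetical.

```python
from collections import Counter


def consensus(aligned, gap="-"):
    """Return the most common residue in each column of an alignment.

    aligned: list of equal-length strings (one per isolate).
    Columns in which the gap symbol is the most common character are skipped.
    Ties are resolved in favor of the residue seen first in the column.
    """
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    result = []
    for column in zip(*aligned):
        residue, _count = Counter(column).most_common(1)[0]
        if residue != gap:
            result.append(residue)
    return "".join(result)


# Toy alignment for illustration only (not real ZIKV sequences).
toy_alignment = ["MKNPK", "MKNPK", "MRNPK"]
toy_consensus = consensus(toy_alignment)
```

A majority vote is the simplest possible rule; alignment viewers may additionally report the percentage support of each consensus residue.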


Fig. 2 ZIKV genomes and proteomes from several strains are available in NCBI. The consensus sequence gives the most common amino acid found at each position of the virus proteome. From this sequence, the individual proteins can be obtained and used for further downstream analysis. The ZIKV genome codes for four structural proteins and seven non-structural (NS) proteins. The structural proteins are the capsid protein C, its endoplasmic reticulum anchor (not shown in the figure, but found from amino acid 105 to 122), protein prM, which is cleaved into peptide pr (123–215 aa) and small envelope protein M (216–290 aa), and envelope protein E. The NS proteins are NS1, NS2A, NS2B, NS3, NS4A, NS4B, and NS5

2.2 CD8 Epitope Selection

The immune response mediated by CD8+ T cells is important in the control of ZIKV infection through viral clearance [17, 18]. Therefore, including CD8 epitopes in the vaccine construct may confer additional layers of protection against the virus. To predict the MHC-I restricted CD8+ T cell epitopes, several user-friendly online platforms are freely available. The Immune Epitope Database (https://www.iedb.org) and NetMHC (https://services.healthtech.dtu.dk/service.php?NetMHC-4.0) are examples of reliable tools (see Note 1). Examples of other services have been published elsewhere [19]. ZIKV, for example, did not present a very high diversity of CD8+ T cell epitopes when investigated using IEDB and Propred1 platforms [16]. In order to include more sequences with high likelihood of being immunogenic for CD8+ T cells, linear sequences, including overlapping epitopes or epitopes spaced by a few amino acids, can be included in the vaccine construct as one peptide chain.
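Joining overlapping or closely spaced epitopes into one linear stretch can be sketched as an interval-merging step. The Python example below is a minimal illustration under stated assumptions: epitope positions are 1-based and inclusive, and the function name `merge_epitopes`, the `max_gap` parameter, and its default of three residues are hypothetical choices, not values from the chapter.

```python
def merge_epitopes(protein, hits, max_gap=3):
    """Merge predicted epitopes into longer linear stretches.

    protein: amino acid sequence of the source protein.
    hits: list of (start, end) positions, 1-based and inclusive.
    max_gap: largest number of residues allowed between two epitopes
             for them to be joined into one stretch (hypothetical default).
    Returns a list of (start, end, sequence) tuples.
    """
    merged = []
    for start, end in sorted(hits):
        # Number of residues separating this hit from the previous stretch;
        # negative values mean the two overlap.
        if merged and start - merged[-1][1] - 1 <= max_gap:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e, protein[s - 1:e]) for s, e in merged]


# Toy protein and positions for illustration only.
toy_protein = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
toy_hits = [(1, 9), (3, 11), (14, 22)]
joined = merge_epitopes(toy_protein, toy_hits, max_gap=3)
separate = merge_epitopes(toy_protein, toy_hits, max_gap=1)
```

Each merged stretch should still be checked with the binding and immunogenicity tools, since joining epitopes also creates new junction peptides.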

2.2.1 IEDB MHC-I Binding

1. In the IEDB MHC-I Binding platform (http://tools.iedb.org/mhci/), the proteins saved in FASTA format can be pasted into the blank space or saved in a text document and uploaded through “choose file”.


Fig. 3 Example of how to download the proteomes in FASTA format from NCBI (https://www.ncbi.nlm.nih.gov/protein/1913350236). (Access date 09/03/2022)

2. Select the IEDB recommended method (see Note 2) or another method, as desired. Artificial Neural Network (ANN) [20] and Stabilized Matrix Method (SMM) [21] tools are also available. ANN and SMM provide a score for which smaller values indicate that the epitope has a higher potential to bind to the selected MHC.
3. MHC molecules from different species can be selected, depending on the desired application.
4. After selecting the species, a list of MHC alleles is made available. Choose the ones present in the vaccine target population, or choose all of them to ensure the highest population coverage.
5. The length of the peptide can also be selected; 9-mers are recommended in most applications, as prediction accuracy is highest for peptides of this size.
6. Specify the output format as desired (the default is recommended) and select “submit”.
7. An example of the usual settings is shown in Fig. 5.
8. The results page can be downloaded as an Excel sheet (see Note 3).
9. The peptides can be ranked based on their position and scores. Peptides with percentile rank values ≤2 are considered binders (see Note 4).
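The ranking and filtering in the last step can also be done outside a spreadsheet. The sketch below assumes the results were exported as comma-separated text; the column names (`allele`, `peptide`, `percentile_rank`) are assumptions and should be adapted to the headers of the downloaded file. The rows are toy values, not real predictions. The second function illustrates the advice in Note 4 of keeping only epitopes reported as binders by more than one method.

```python
import csv
import io


def select_binders(csv_text, rank_column="percentile_rank", cutoff=2.0):
    """Return rows whose percentile rank is at or below the cutoff, best first."""
    rows = csv.DictReader(io.StringIO(csv_text))
    binders = [row for row in rows if float(row[rank_column]) <= cutoff]
    return sorted(binders, key=lambda row: float(row[rank_column]))


def shared_binders(*tables):
    """(allele, peptide) pairs that appear in every table of binders."""
    keys = [{(row["allele"], row["peptide"]) for row in table} for table in tables]
    return set.intersection(*keys) if keys else set()


# Toy rows for illustration only; they are not real predictions.
TOY_RESULTS = """allele,start,end,peptide,percentile_rank
HLA-A*02:01,10,18,AAAAAAAAL,0.4
HLA-A*02:01,11,19,AAAAAAALV,3.1
HLA-B*07:02,40,48,APAAAAAAL,1.9
"""

binders = select_binders(TOY_RESULTS)
```

Running `select_binders` once per prediction method and passing the resulting tables to `shared_binders` gives the epitopes on which the methods agree.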


Fig. 4 Jalview interface showing sequences of ZIKV strains aligned using ClustalX. The percentage identity coloring was selected to show the amino acids that differ from most strains, which are relatively rare for ZIKV. The consensus sequence is shown in the red square at the bottom of the page, together with the quality scores and the conservation at each amino acid position

2.2.2 IEDB Class I Immunogenicity

The IEDB Class I immunogenicity tool [22] (http://tools.iedb.org/immunogenicity/) can be used to further validate the epitopes found with the binding prediction platforms. To study the immunogenicity of the selected epitopes:

1. Organize the epitopes of interest (see Note 5) as a list (no special characters between them) in a text file and upload it to the platform (or copy and paste the list, whichever is more convenient).
2. Select “submit”.
3. Peptides with a score greater than 0.15 are considered potentially immunogenic [16].
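Preparing the plain peptide list and applying the 0.15 cutoff can be scripted as follows. This is a minimal sketch: the function names are hypothetical, the scores are assumed to have been copied from the results page into a dictionary, and the example values are invented for illustration.

```python
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def as_peptide_list(peptides):
    """Format peptides one per line, rejecting anything but plain residues."""
    cleaned = []
    for peptide in peptides:
        peptide = peptide.strip().upper()
        if not peptide or set(peptide) - VALID_RESIDUES:
            raise ValueError(f"not a plain amino acid sequence: {peptide!r}")
        cleaned.append(peptide)
    return "\n".join(cleaned)


def potentially_immunogenic(scores, cutoff=0.15):
    """Return (peptide, score) pairs scoring above the cutoff, highest first."""
    kept = [(peptide, score) for peptide, score in scores.items() if score > cutoff]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy peptides and scores for illustration only.
upload_text = as_peptide_list([" aaaaaaaal ", "APAAAAAAL"])
selected = potentially_immunogenic(
    {"AAAAAAAAL": 0.31, "APAAAAAAL": 0.15, "AAAAAAALV": -0.2}
)
```

The comparison is strict, so a peptide scoring exactly 0.15 is not retained, matching the wording “greater than 0.15”.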

2.3 CD4 Epitope Selection

2.3.1 IEDB MHC-II Binding

1. In the IEDB MHC-II Binding prediction platform (http://tools.iedb.org/mhcii/), proteins saved in FASTA format can be pasted into the blank space or saved in a text document and uploaded through “choose file”.
2. In “Prediction Method”, select the IEDB recommended method (see Note 2).
3. MHC molecules from different species can be selected, depending on the desired application.


Fig. 5 IEDB MHC-I Binding Predictions page (http://tools.iedb.org/mhci/). (Access date 09/03/2022)

4. After selecting the species, a list of MHC alleles is made available. Choose the ones present in the vaccine target population, or choose all of them to ensure higher population coverage.
5. The length of the peptide can also be selected; 15-mers are recommended in most applications for CD4 epitopes.
6. Specify the output format as desired (the default is recommended) and select “submit”.
7. An example of the usual settings is shown in Fig. 6.
8. The results page can be downloaded as an Excel sheet (see Note 3).
9. The peptides can be ranked based on their position and scores. By default, prediction results show only the Percentile Rank and Adjusted Rank when the Consensus method is used. The table can be expanded to display the individual scores of the different methods by checking the box above the results table. Peptides with lower percentile ranks are considered better binders; peptides with IC50 ≤ 500 nM are considered to have intermediate affinity, and those with IC50 ≤ 50 nM high affinity.

Fig. 6 IEDB MHC-II binding prediction page (http://tools.iedb.org/mhcii/). (Access date 08/23/22)

2.3.2 IEDB CD4+ T Cell Immunogenicity Prediction

1. Organize the epitopes of interest (see Note 4) as a list (no special characters between them) in a text file and upload it to the platform at http://tools.iedb.org/CD4episcore/ (or copy and paste the list, whichever is more convenient).
2. Select the desired prediction method. The IEDB recommended method combines the two other methods available in the platform (7-allele and immunogenicity) [23].
3. Peptides with lower scores are considered more immunogenic [24].


Fig. 7 IFNepitope prediction page (http://crdd.osdd.net/raghava/ifnepitope/predict.php). (Access date 08/23/22)

2.4 IFN-γ-Inducing Epitopes on IFNepitope Server

Prediction of IFN-γ induction by the selected epitopes can be performed in order to better characterize their functional properties.

1. Organize the epitopes of interest as a list (no special characters between them) in a text file and upload it to the platform at http://crdd.osdd.net/raghava/ifnepitope/predict.php (or copy and paste the list, whichever is more convenient).
2. Select the desired method and model of prediction. An example of commonly used settings is shown in Fig. 7.
3. Peptides with positive scores are considered IFN-γ inducers.

2.5 Other Cytokines Induction Evaluation

Induction of pro-inflammatory cytokines can be predicted in order to check whether the vaccine peptides can induce a satisfactory immune response. Induction of regulatory cytokines such as IL-10 can also be predicted in order to evaluate whether the selected peptides could negatively affect the vaccine immunogenicity. Moreover, there are servers that predict whether a peptide is able to activate antigen-presenting cells (APCs). Prediction of IL-10 and IL-4 induction can be performed with the downloaded IL10pred platform [25] and with the IL-4Pred online server (https://webs.iiitd.edu.in/raghava/il4pred/index.php) [26], respectively. The capacity to activate APCs and to induce a pro-inflammatory response can be evaluated with the VaxinPAD online server (https://webs.iiitd.edu.in/raghava/vaxinpad/index.php) [27] and the ProInflam online server (http://metabiosys.iiserb.ac.in/proinflam/index.html) [28], respectively.

2.6 IEDB Population Coverage

Population coverage is a useful metric for estimating the frequency of individuals in a given area who carry MHC alleles predicted to bind the selected epitopes. In the IEDB population coverage interface (http://tools.iedb.org/population/):

1. Choose the number of epitopes to be analyzed by the tool.
2. Select the world areas of interest and choose which MHC class should be considered (class I, II, or both combined).
3. Add the epitopes to the epitope list, together with the MHC alleles they are predicted to bind (based on the predictions made with IEDB MHC-I Binding Predictions). An example is shown in Fig. 8.
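The idea behind such an estimate can be illustrated with a simplified calculation. The Python sketch below is not the IEDB tool's implementation: it assumes Hardy-Weinberg proportions at each HLA locus and independence between loci, and the allele frequencies in the example are invented for illustration. Under these assumptions, if the alleles predicted to bind at least one epitope have a summed frequency p at a locus, a fraction 1 − (1 − p)² of individuals carries at least one of them.

```python
def locus_coverage(covered_allele_freqs):
    """Fraction of individuals carrying at least one covered allele at a locus."""
    p = sum(covered_allele_freqs)
    if p < 0.0 or p > 1.0 + 1e-9:
        raise ValueError("allele frequencies at one locus must sum to at most 1")
    p = min(p, 1.0)
    # Under Hardy-Weinberg proportions, (1 - p)**2 of individuals carry
    # two alleles that are both outside the covered set.
    return 1.0 - (1.0 - p) ** 2


def combined_coverage(loci):
    """Coverage over several loci, assuming the loci are independent."""
    uncovered = 1.0
    for freqs in loci:
        uncovered *= 1.0 - locus_coverage(freqs)
    return 1.0 - uncovered


# Invented frequencies: two covered alleles at one locus, one at another.
example = combined_coverage([[0.20, 0.10], [0.50]])
```

HLA loci are in linkage disequilibrium in real populations, so the independence assumption is only a rough approximation; the dedicated tool should be used for reported figures.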

2.7 B Cell Linear Epitope Prediction

Protection against ZIKV can be achieved through the induction of neutralizing antibodies, particularly against domain III of the envelope protein E. Servers such as ABCpred (https://webs.iiitd.edu.in/raghava/abcpred/ABC_submission.html) and BepiPred 2.0 (https://services.healthtech.dtu.dk/service.php?BepiPred-2.0) can be used to predict B cell linear epitopes from the target protein sequence. B cell discontinuous epitopes can also be predicted from the tertiary structure of a protein, as discussed in Subheading 2.14, “B Cell Epitope Prediction from Protein Tertiary Structure”.

2.8 Vaccine Structure Design

After selecting the T and B cell peptides that will compose the vaccine, choosing the order in which to arrange them in the vaccine sequence is an important step that can affect the overall vaccine efficacy. Epitopes should be rearranged, and each arrangement tested for antigenicity, allergenicity, and autoimmunity. Moreover, it is important to select proper linkers to separate each functional domain. Adjuvants can be added to the final vaccine structure to boost the immune response, and their position in the vaccine structure (N- or C-terminal) should also be tested carefully.
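Generating candidate arrangements can be sketched in a few lines of Python. The example below is an illustration under stated assumptions, not the authors' pipeline: the linker strings are those named in Subheading 2.8.2 (AAY, GPGPG, KK, and DNTAN for the adjuvant), the layout (each epitope preceded by the linker of its own class) is one possible convention, the function name `build_construct` is hypothetical, and the epitope strings are placeholders. RS09 (APPHALS) is the synthetic adjuvant peptide mentioned in Subheading 2.8.1.

```python
from itertools import permutations

# Linkers named in Subheading 2.8.2; the layout below is an assumption.
LINKERS = {"cd8": "AAY", "cd4": "GPGPG", "bcell": "KK", "adjuvant": "DNTAN"}


def build_construct(adjuvant, cd8, cd4, bcell, adjuvant_n_terminal=True):
    """Concatenate epitopes with class-specific linkers and attach an adjuvant."""
    body = ""
    for kind, epitopes in (("cd8", cd8), ("cd4", cd4), ("bcell", bcell)):
        for epitope in epitopes:
            if body:
                # Each epitope after the first is preceded by its class linker.
                body += LINKERS[kind]
            body += epitope
    if not adjuvant:
        return body
    if adjuvant_n_terminal:
        return adjuvant + LINKERS["adjuvant"] + body
    return body + LINKERS["adjuvant"] + adjuvant


# Placeholder epitopes for illustration only.
cd8_epitopes = ["MMMM", "WWWW"]
construct = build_construct("APPHALS", cd8_epitopes, ["FFFF"], ["HHHH"])

# One candidate sequence per ordering of the CD8 epitopes.
candidates = [
    build_construct("APPHALS", list(order), ["FFFF"], ["HHHH"])
    for order in permutations(cd8_epitopes)
]
```

Each candidate sequence can then be submitted to the antigenicity, allergenicity, and autoimmunity checks described in the following subheadings. The number of permutations grows factorially, so for more than a handful of epitopes only a subset of orderings can realistically be tested.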

2.8.1 Adjuvants

Peptide vaccines offer significant advantages over whole-microorganism vaccines, since they are a safer and more specific alternative. However, they can be less immunogenic than classical inactivated and attenuated vaccines. Therefore, the use of adjuvants to improve the immune response is indispensable. Adjuvants are compounds or molecules administered alongside vaccines that are immunostimulatory and can boost the immune response by targeting APCs and innate immune receptors. Protein adjuvants, either derived from microorganisms or synthesized, can be recognized by Toll-like receptors (TLRs) and elicit potent innate immune responses [29]. Bacterial flagellin, for instance, is a TLR5 agonist that induces pro-inflammatory cytokine production [30]. TLR4 binders derived from Mycobacterium tuberculosis, such as heparin-binding hemagglutinin and 50S ribosomal protein L7/L12, or synthesized, such as peptide RS09 (APPHALS), are also good protein adjuvant options [31–34]. Protein adjuvants, however, may present different levels of toxicity, which have to be considered when designing a vaccine structure. The sequences of these proteins may be retrieved from NCBI or UniProt. Autoimmunity and allergenicity analyses should be performed for the vaccine-adjuvant conjugate in order to check its safety. These analyses, together with antigenicity evaluation, can help choose the best adjuvant for the vaccine structure. Moreover, the position of the adjuvant in the vaccine sequence (N- or C-terminal) should also be considered when testing safety and immunogenicity parameters.

Fig. 8 IEDB Population Coverage http://tools.iedb.org/population/. (Access date 09/03/2022)

2.8.2 Linkers

Linkers are important components of multi-epitope vaccines, since they connect peptides to each other and separate functional domains (e.g., CD8 epitopes, CD4 epitopes, and adjuvants). Linkers provide flexibility and stability and are important for efficient protein folding [35, 36]. Different linkers are usually used for CD8+ T cell peptides, for CD4+ T cell peptides, and for connecting the adjuvant to the vaccine sequence, because each linker has distinct properties. AAY linkers, for instance, are commonly used to separate CD8+ T cell peptides because they produce suitable sites for binding to the TAP transporter [35–37]. GPGPG linkers have been shown to induce helper T cell responses and are usually placed between CD4+ T cell peptides in the vaccine structure [36, 38, 39]. For B cell linear epitopes, KK linkers are the most used [36, 40]. DNTAN [41] and EAAK [36, 42] are good options to connect the adjuvant to the rest of the vaccine sequence because they provide flexibility to the protein structure.

2.9 Vaccine Safety

2.9.1 Vaccine Potential Allergenicity

The epitopes or the final protein construct can be pasted as a plain amino acid sequence into the AllergenFP website (http://ddg-pharmfac.net/AllergenFP/method.html). After selecting “get the result”, a page reports whether the submitted sequence is a “PROBABLE ALLERGEN” or a “PROBABLE NON-ALLERGEN”.

2.9.2 Evaluation of Autoimmune Induction

Evidence of autoimmunity induced by vaccines and/or adjuvants has been reported. Therefore, evaluating the potential of the vaccine sequence to induce autoimmunity is an important step toward vaccine safety. A simple way to do this is to run BLASTp to check whether the vaccine sequence has any similarity to human protein sequences.

1. Access the NCBI BLASTp (Protein BLAST) platform: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins
2. In the “Enter accession number(s), gi(s), or FASTA sequence(s)” box, insert the vaccine sequence or upload a file in FASTA format.


Fig. 9 Blastp search parameters (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins). (Access date 09/04/2022)

3. On the “Database” tab, select “Standard databases (nr)” and “Non-redundant protein sequences”.
4. On the “Organism” tab, type “human” and select “human taxid:9606”.
5. Select “Blastp” on the “Algorithm” tab. Figure 9 shows the parameters mentioned above.
6. Under “Algorithm parameters”, you can use the default search parameters or change parameters such as “word size”, “expect threshold”, and “matrix” according to your needs.
7. Click “BLAST” to obtain the results.
8. On the results page, you can see which amino acids of your sequence are similar to which human protein sequences. Stretches of more than 5–6 consecutive similar amino acids should be considered significant similarity; such sequences might be able to induce autoimmunity.
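The criterion in the last step can be illustrated with a simple window scan. The sketch below is a simplification and not a replacement for BLASTp: it only finds stretches of consecutive identical residues (BLAST also scores similar substitutions), and it assumes the human protein sequences of interest are already available locally as a dictionary. The function name `shared_stretches`, the window length of six, and the toy sequences are hypothetical.

```python
def shared_stretches(vaccine, human_proteins, k=6):
    """List k-residue windows of the vaccine found verbatim in a human protein.

    vaccine: vaccine amino acid sequence.
    human_proteins: dict mapping a protein name to its sequence.
    Returns (position, window, protein_name) tuples; positions are 1-based.
    """
    hits = []
    for start in range(len(vaccine) - k + 1):
        window = vaccine[start:start + k]
        for name, sequence in human_proteins.items():
            if window in sequence:
                hits.append((start + 1, window, name))
    return hits


# Toy sequences for illustration only.
toy_vaccine = "AAYGPGPGKKLLMMNN"
toy_human = {"toy_protein": "MSTGPGPGKDEL"}
toy_hits = shared_stretches(toy_vaccine, toy_human, k=6)
```

Windows reported by such a scan point to regions of the construct that deserve a closer look in the BLASTp alignment.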


2.10 Vaccine Antigenicity

Vaccine antigenicity indicates the capacity of the final vaccine construct to be recognized by the immune system. VaxiJen (http://www.ddg-pharmfac.net/vaxijen/VaxiJen/VaxiJen.html) takes an amino acid sequence as input and returns the predicted antigenicity of the protein. As we aim for a ZIKV vaccine, the target organism should be “virus”, and a threshold of 0.4 is recommended.

2.11 Physical and Chemical Properties

One limitation of the prediction tools mentioned above is that they do not take into account the solubility of the epitopes; poorly soluble peptides can be difficult to synthesize and challenging to test in vivo. One way to address this is to predict the physical and chemical properties of the selected epitopes and of the final protein construct. The ProtParam tool provides a range of useful information about the protein, including the theoretical pI, molecular weight, estimated half-life, and others. Paste the protein sequence in the box as instructed on the website and select “Compute parameters” [43].
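Two of these properties are simple enough to compute locally as a quick check before using the website. The Python sketch below is an independent illustration, not ProtParam's code: it computes the molecular weight from average residue masses and the grand average of hydropathicity (GRAVY) from the Kyte-Doolittle scale, where more negative values indicate a more hydrophilic sequence. The function names are hypothetical, and only the 20 standard residues are handled.

```python
# Kyte-Doolittle hydropathy values for the 20 standard residues.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Average residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528


def gravy(sequence):
    """Mean Kyte-Doolittle hydropathy over all residues."""
    if not sequence:
        raise ValueError("empty sequence")
    return sum(HYDROPATHY[residue] for residue in sequence) / len(sequence)


def molecular_weight(sequence):
    """Average molecular weight in daltons of a linear peptide."""
    if not sequence:
        raise ValueError("empty sequence")
    return sum(RESIDUE_MASS[residue] for residue in sequence) + WATER_MASS
```

For example, `gravy("AAY")` averages 1.8, 1.8, and −1.3. Solubility depends on more than hydropathy, so these numbers are only a first screen.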

2.12 Secondary Structure Prediction

Protein secondary structure refers to the local conformation of a protein's polypeptide backbone, which is stabilized by hydrogen bonds. Secondary structure is usually divided into three states, namely, helix (H), strand (E), and coil (C) [44]. However, a finer classification assigns secondary structure to eight states: α-helix (H), 3₁₀ helix (G), π-helix (I), β-strand (E), β-bridge (B), β-turn (T), bend (S), and loop or others (C) [45]. Prediction of protein secondary structure is important because it provides information about protein function and activity [46]. Several servers can predict the secondary structure of a protein, such as PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred/) and JPred (https://www.compbio.dundee.ac.uk/jpred/).
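The relation between the eight-state and three-state alphabets can be made explicit with a small mapping. The reduction used below (helices H, G, I to H; strand and bridge E, B to E; everything else to C) is one commonly used convention and is an assumption here, since other reductions exist. The function names are hypothetical.

```python
# One common eight-to-three state reduction (an assumed convention).
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # alpha-, 3-10, and pi-helix -> helix
    "E": "E", "B": "E",             # beta-strand and beta-bridge -> strand
    "T": "C", "S": "C", "C": "C",   # turn, bend, loop/other -> coil
}


def reduce_states(eight_state):
    """Convert an eight-state secondary structure string to three states."""
    return "".join(EIGHT_TO_THREE[state] for state in eight_state)


def composition(three_state):
    """Fraction of residues in each of the three states."""
    total = len(three_state)
    if total == 0:
        raise ValueError("empty secondary structure string")
    return {state: three_state.count(state) / total for state in "HEC"}


reduced = reduce_states("HGIEBTSC")
```

The composition of the reduced string gives a quick summary of how much of the construct is predicted to be helical, extended, or coil.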

2.13 Tertiary Structure, Refinement, and Validation

Determining the tertiary structure of the final construct makes it possible to predict antibody binding sites and to run docking analyses with receptors of the innate immune system, such as Toll-like receptors (TLRs). There is a range of online platforms that can predict protein tertiary structures. An example is the Protein Homology/analogY Recognition Engine V 2.0 (Phyre2), developed at Imperial College London [47] (http://www.sbg.bio.ic.ac.uk/~phyre2/html/page.cgi?id=index). Phyre2 also offers modeling against proteins deposited in the Protein Data Bank (PDB) and in the AlphaFold Protein Structure Database.

After obtaining the predicted tertiary structure in “pdb” format, it can be refined using, for instance, ReFOLD (https://www.reading.ac.uk/bioinf/ReFOLD/ReFOLD3_form.html), which further improves the quality of the model by identifying and correcting abnormal atom interactions and improbable angles. The software provides the top refined models, each with a p-value: the lower the p-value, the lower the probability that the model is incorrect. More details about the interpretation of the results can be found on the platform help page (https://www.reading.ac.uk/bioinf/ReFOLD/ReFOLD_help.html) [48]. To validate the final refined model, the backbone bond and angle geometry of each residue can be inspected using Ramachandran plots. The Ramachandran plot server (https://zlab.umassmed.edu/bu/rama/) is one such platform [49]. An example of a Ramachandran plot is shown in Fig. 10.

Fig. 10 Ramachandran plot (https://zlab.umassmed.edu/bu/rama/) showing a predicted model quality score. In green are the highly preferred conformations, in brown the preferred conformations, and in red the questionable observations. (Access date 10/04/2022)
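A Ramachandran plot displays, for each residue, the backbone torsion angles phi (C of the previous residue, N, CA, C) and psi (N, CA, C, N of the next residue). The Python sketch below shows how such angles can be computed from atomic coordinates. It is an illustration only: the function names are hypothetical, coordinates are assumed to have been read from the pdb file beforehand as (x, y, z) tuples, and no assessment of allowed regions is attempted.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees defined by four points, in (-180, 180]."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = (b1[0] / norm, b1[1] / norm, b1[2] / norm)
    # Components of the outer bonds perpendicular to the central bond.
    d0 = _dot(b0, b1)
    d2 = _dot(b2, b1)
    v = (b0[0] - d0 * b1[0], b0[1] - d0 * b1[1], b0[2] - d0 * b1[2])
    w = (b2[0] - d2 * b1[0], b2[1] - d2 * b1[1], b2[2] - d2 * b1[2])
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


def phi_psi(prev_c, n, ca, c, next_n):
    """Backbone phi and psi angles of one residue, in degrees."""
    return dihedral(prev_c, n, ca, c), dihedral(n, ca, c, next_n)


# Simple geometric check: a quarter turn around the central bond.
quarter_turn = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

Plotting psi against phi for every residue of the refined model reproduces the scatter underlying a Ramachandran plot; the classification into preferred and questionable regions is what the dedicated server adds.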


2.14 B Cell Epitope Prediction from Protein Tertiary Structure

Algorithms have been developed to predict discontinuous and continuous epitopes in a modeled protein supplied in pdb format. For instance, ElliPro (http://tools.iedb.org/ellipro/), another user-friendly IEDB platform, predicts linear and discontinuous epitopes by searching for regions that protrude from the protein “spheroid”, which are more likely to bind to B cell receptors [50]. In the server interface:

1. Upload the refined and validated model in pdb format.
2. The default minimum score and maximum distance, 0.5 and 6, respectively, can be used.
3. Submit the job; the predicted linear and discontinuous epitopes are then displayed.

3 Notes

1. Like any online platform, the prediction servers are subject to regular updates and changes, so the website address and the interface may differ from what is shown here. If a link we provide does not work, try searching for the platform name in a search engine to get an updated link. It is also possible that, if the page is not found, the service has been discontinued, in which case an alternative platform should be sought.
2. The methods are updated periodically, so choose the most recent version.
3. If the output exceeds a certain number of lines, the table option will not be available. In this case, reducing the size of the output by selecting a single epitope length or fewer MHC alleles is an option. If all the data are wanted, you can run two or more rounds of predictions with different parameters and then merge the tables in Excel.
4. The different scoring methods will not always converge, so it is recommended to select epitopes predicted as binders by multiple methods.
5. The page adds a note that “The tool was only validated for 9-mer peptides. However, predictions can be made for peptides of any length.”

It is important to mention that, during the design of an in silico vaccine using immunoinformatics tools, the scientist must choose the most suitable and precise servers for each step.


Funding

This work was funded by the Fundação de Amparo à Pesquisa do Estado de Goiás (FAPEG) under grant n. 201710267001260.

Conflict of Interest

The authors declare no conflict of interest.

References

1. Wang A, Thurmond S, Islas L et al (2017) Zika virus genome biology and molecular pathogenesis. Emerg Microbes Infect 6:1–6. https://doi.org/10.1038/emi.2016.141
2. Musso D, Nilles EJ, Cao-Lormeau VM (2014) Rapid spread of emerging Zika virus in the Pacific area. Clin Microbiol Infect 20:O595–O596. https://doi.org/10.1111/1469-0691.12707
3. Oliveira Melo AS, Malinger G, Ximenes R et al (2016) Zika virus intrauterine infection causes fetal brain abnormality and microcephaly: tip of the iceberg? Ultrasound Obstet Gynecol 47:6–7. https://doi.org/10.1002/uog.15831
4. Créange A (2016) Guillain-Barré syndrome: 100 years on. Rev Neurol (Paris) 172:770–774. https://doi.org/10.1016/j.neurol.2016.10.011
5. De Barros Miranda-Filho D, Martelli CMT, De Alencar Ximenes RA et al (2016) Initial description of the presumed congenital Zika syndrome. Am J Public Health 106:598–600. https://doi.org/10.2105/AJPH.2016.303115
6. Centers for Disease Control and Prevention (CDC) (2022) Zika virus prevention and transmission what we know
7. World Health Organization (2019) Zika epidemiology update, July 2019. pp 1–14
8. Pattnaik A, Sahoo BR, Pattnaik AK (2020) Current status of Zika virus vaccines: successes and challenges. Vaccines 8:1–19. https://doi.org/10.3390/vaccines8020266
9. Yeasmin M, Molla MMA, Al Masud HMA, Saif-Ur-Rahman KM (2022) Safety and immunogenicity of Zika virus vaccine: a systematic review of clinical trials. Rev Med Virol. https://doi.org/10.1002/rmv.2385
10. Soria-Guerra RE, Nieto-Gomez R, Govea-Alonso DO, Rosales-Mendoza S (2015) An overview of bioinformatics tools for epitope prediction: implications on vaccine development. J Biomed Inform 53:405–414. https://doi.org/10.1016/j.jbi.2014.11.003
11. Alam A, Ali S, Ahamad S et al (2016) From ZikV genome to vaccine: in silico approach for the epitope-based peptide vaccine against Zika virus envelope glycoprotein. Immunology 149:386–399. https://doi.org/10.1111/imm.12656
12. Dos Santos Franco L, Oliveira Vidal P, Amorim JH (2017) In silico design of a Zika virus non-structural protein 5 aiming vaccine protection against Zika and dengue in different human populations. J Biomed Sci 24:1–10. https://doi.org/10.1186/s12929-017-0395-z
13. Kumar Pandey R, Ojha R, Mishra A, Kumar Prajapati V (2018) Designing B- and T-cell multi-epitope based subunit vaccine using immunoinformatics approach to control Zika virus infection. J Cell Biochem 119:7631–7642. https://doi.org/10.1002/jcb.27110
14. Prasasty VD, Grazzolie K, Rosmalena R et al (2019) Peptide-based subunit vaccine design of T- and B-cells multi-epitopes against Zika virus using immunoinformatics approaches. Microorganisms 7. https://doi.org/10.3390/microorganisms7080226
15. Shahid F, Ashfaq UA, Javaid A, Khalid H (2020) Immunoinformatics guided rational design of a next generation multi epitope based peptide (MEBP) vaccine by exploring Zika virus proteome. Infect Genet Evol 80:104199. https://doi.org/10.1016/j.meegid.2020.104199
16. Antonelli ACB, Almeida VP, de Castro FOF et al (2022) In silico construction of a multiepitope Zika virus vaccine using immunoinformatics tools. Sci Rep 12:1–20. https://doi.org/10.1038/s41598-021-03990-6
17. Elong Ngono A, Vizcarra EA, Tang WW et al (2017) Mapping and role of the CD8+ T cell response during primary Zika virus infection in mice. Cell Host Microbe 21:35–46. https://doi.org/10.1016/j.chom.2016.12.010
18. Huang H, Li S, Zhang Y et al (2017) CD8+ T cell immune response in immunocompetent mice during Zika virus infection. J Virol 91:1–15. https://doi.org/10.1128/jvi.00900-17
19. Shiragannavar S, Madagi S (2022) In silico vaccine design tools. In: Vaccine development. IntechOpen. https://doi.org/10.5772/intechopen.100180
20. Lundegaard C, Lamberth K, Harndahl M et al (2008) NetMHC-3.0: accurate web accessible predictions of human, mouse and monkey MHC class I affinities for peptides of length 8–11. Nucleic Acids Res 36:509–512. https://doi.org/10.1093/nar/gkn202
21. Peters B, Sette A (2005) Generating quantitative models describing the sequence specificity of biological processes with the stabilized matrix method. BMC Bioinformatics 6:1–9. https://doi.org/10.1186/1471-2105-6-132
22. Calis JJA, Maybeno M, Greenbaum JA et al (2013) Properties of MHC class I presented peptides that enhance immunogenicity. PLoS Comput Biol 9. https://doi.org/10.1371/journal.pcbi.1003266
23. Paul S, Lindestam Arlehamn CS, Scriba TJ et al (2015) Development and validation of a broad scheme for prediction of HLA class II restricted T cell epitopes. J Immunol Methods 422:28–34. https://doi.org/10.1016/j.jim.2015.03.022
24. Dhanda SK, Karosiene E, Edwards L et al (2018) Predicting HLA CD4 immunogenicity in human populations. Front Immunol 9:1–14. https://doi.org/10.3389/fimmu.2018.01369
25. Nagpal G, Usmani SS, Dhanda SK et al (2017) Computer-aided designing of immunosuppressive peptides based on IL-10 inducing potential. Sci Rep 7:1–10. https://doi.org/10.1038/srep42851
26. Dhanda SK, Gupta S, Vir P, Raghava GP (2013) Prediction of IL4 inducing peptides. Clin Dev Immunol 2013:263952. https://doi.org/10.1155/2013/263952
27. Nagpal G, Chaudhary K, Agrawal P, Raghava GPS (2018) Computer-aided prediction of antigen presenting cell modulators for designing peptide-based vaccine adjuvants. J Transl Med 16:1–15. https://doi.org/10.1186/s12967-018-1560-1
28. Gupta S, Madhu MK, Sharma AK, Sharma VK (2016) ProInflam: a webserver for the prediction of proinflammatory antigenicity of peptides and proteins. J Transl Med 14:1–10. https://doi.org/10.1186/s12967-016-0928-3
29. Moyle PM (2017) Biotechnology approaches to produce potent, self-adjuvanting antigen-adjuvant fusion protein subunit vaccines. Biotechnol Adv 35:375–389. https://doi.org/10.1016/j.biotechadv.2017.03.005
30. Turley CB, Rupp RE, Johnson C et al (2011) Safety and immunogenicity of a recombinant M2e-flagellin influenza vaccine (STF2.4xM2e) in healthy adults. Vaccine 29:5145–5152. https://doi.org/10.1016/j.vaccine.2011.05.041
31. Jung D, Jeong SK, Lee CM et al (2011) Enhanced efficacy of therapeutic cancer vaccines produced by co-treatment with Mycobacterium tuberculosis heparin-binding hemagglutinin, a novel TLR4 agonist. Cancer Res 71:2858–2870. https://doi.org/10.1158/0008-5472.CAN-10-3487
32. Lee SJ, Shin SJ, Lee MH et al (2014) A potential protein adjuvant derived from Mycobacterium tuberculosis Rv0652 enhances dendritic cells-based tumor immunotherapy. PLoS One 9:1–11. https://doi.org/10.1371/journal.pone.0104351
33. Shanmugam A, Rajoria S, George AL et al (2012) Synthetic toll like receptor-4 (TLR-4) agonist peptides as a novel class of adjuvants. PLoS One 7. https://doi.org/10.1371/journal.pone.0030839
34. Li M, Jiang Y, Gong T et al (2016) Intranasal vaccination against HIV-1 with adenoviral vector-based nanocomplex using synthetic TLR-4 agonist peptide as adjuvant. Mol Pharm 13:885–894. https://doi.org/10.1021/acs.molpharmaceut.5b00802
35. Nezafat N, Ghasemi Y, Javadi G et al (2014) A novel multi-epitope peptide vaccine against cancer: an in silico approach. J Theor Biol 349:121–134. https://doi.org/10.1016/j.jtbi.2014.01.018
36. Dong R, Chu Z, Yu F, Zha Y (2020) Contriving multi-epitope subunit of vaccine for COVID-19: immunoinformatics approaches. Front Immunol 11. https://doi.org/10.3389/fimmu.2020.01784
37. Dolenc I, Seemüller E, Baumeister W (1998) Decelerated degradation of short peptides by the 20S proteasome. FEBS Lett 434:357–361. https://doi.org/10.1016/S0014-5793(98)01010-2
38. Livingston B, Crimi C, Newman M et al (2002) A rational strategy to design multiepitope immunogens based on multiple Th lymphocyte epitopes. J Immunol 168:5499–5506. https://doi.org/10.4049/jimmunol.168.11.5499
39. Ribeiro SP, Rosa DS, Fonseca SG et al (2010) A vaccine encoding conserved promiscuous HIV CD4 epitopes induces broad T cell responses in mice transgenic to multiple common HLA class II molecules. PLoS One 5:1–9. https://doi.org/10.1371/journal.pone.0011072
40. Jespersen MC, Peters B, Nielsen M, Marcatili P (2017) BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes. Nucleic Acids Res 45:W24–W29. https://doi.org/10.1093/nar/gkx346
41. Michalsky E, Goede A, Preissner R (2003) Loops In Proteins (LIP) – a comprehensive loop database for homology modelling. Protein Eng 16:979–985. https://doi.org/10.1093/protein/gzg119
42. Barh D, Misra AN, Kumar A, Azevedo V (2010) A novel strategy of epitope design in Neisseria gonorrhoeae. Bioinformation 5:77–82. https://doi.org/10.6026/97320630005077
43. Gasteiger E, Hoogland C, Gattiker A, Duvaud S, Wilkins MR, Appel RD, Bairoch A (2005) Protein identification and analysis tools on the ExPASy server. In: Walker JM (ed) The Proteomics Protocols Handbook. Humana Press, pp 571–607
44. Lyu Z, Wang Z, Luo F et al (2021) Protein secondary structure prediction with a reductive deep learning method. Front Bioeng Biotechnol 9:1–8. https://doi.org/10.3389/fbioe.2021.687426
45. Kabsch W, Sander C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577–2637
46. Ma Y, Liu Y, Cheng J (2018) Protein secondary structure prediction based on data partition and semi-random subspace method. Sci Rep 8:1–10. https://doi.org/10.1038/s41598-018-28084-8
47. Kelley LA, Mezulis S, Yates CM et al (2015) The Phyre2 web portal for protein modeling, prediction and analysis. Nat Protoc 10:845–858. https://doi.org/10.1038/nprot.2015.053
48. Adiyaman R, McGuffin LJ (2021) ReFOLD3: refinement of 3D protein models with gradual restraints based on predicted local quality and residue contacts. Nucleic Acids Res 49:W589–W596. https://doi.org/10.1093/nar/gkab300
49. Anderson RJ, Weng Z, Campbell RK, Jiang X (2005) Main-chain conformational tendencies of amino acids. Proteins Struct Funct Genet 60:679–689. https://doi.org/10.1002/prot.20530
50. Ponomarenko J, Bui HH, Li W et al (2008) ElliPro: a new structure-based tool for the prediction of antibody epitopes. BMC Bioinformatics 9:1–8. https://doi.org/10.1186/1471-2105-9-514

Chapter 29

Immunoinformatics Approaches in Designing Vaccines Against COVID-19

Ankita Chakraborty, Jagadeesh Bayry, and Suprabhat Mukherjee

Abstract

Since the onset of the COVID-19 pandemic, the scientific community has adopted a number of approaches to develop efficient vaccine candidates against SARS-CoV-2. Conventional vaccine development takes a long time and involves repeated trial and error, which limits its feasibility for delivering a dependable vaccine in an emergency such as the COVID-19 pandemic. To date, most of the available vaccines have been developed against a particular antigen of SARS-CoV-2, the spike protein in most cases, and these vaccines are not effective against all the pathogenic coronaviruses. In this context, immunoinformatics-based reverse vaccinology approaches enable the rapid and robust design of efficacious peptide-based vaccines against all the infectious strains of coronaviruses. In this chapter, we describe the methodological trajectory of developing a universal anti-SARS-CoV-2 vaccine, namely, “AbhiSCoVac,” through an advanced computational biology-based immunoinformatics approach and its in silico validation using molecular dynamics simulations.

Key words Immunoinformatics, Reverse vaccinology, Vaccine, SARS-CoV-2, Molecular dynamics simulation

1 Introduction

Vaccination is considered to be the most efficacious means to combat the infection and the immunopathological attributes of SARS-CoV-2. One of the striking features of the recent development of potential vaccines based on immunoreactive antigens from SARS-CoV-2 is the speed with which the designs have emerged. This has been achieved through the successful implementation of immunoinformatics approaches, which provide rapid, robust, and easy techniques for designing peptide-based vaccines. At the onset of the COVID-19 pandemic, the initial emphasis was on developing chemotherapeutics and exploring repurposed drugs such as remdesivir, favipiravir, hydroxychloroquine, ivermectin, and 2-deoxy glucose (2-DOG). However, the emergence of drug-resistant SARS-CoV-2 variants and the limited efficacy of some

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_29, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023



of the aforesaid medications prompted the scientific community to focus on developing immunotherapeutic intervention strategies, such as therapeutic monoclonal antibodies (mAbs) and vaccines. Although mAbs are known to possess specific binding affinity toward the viral antigens of interest [1], the excessively high cost of the available mAb-based treatments has limited the widespread use of this approach in treating COVID-19 patients. Therefore, vaccination using the whole SARS-CoV-2 virion, attenuated virus, viral surface proteins, and viral mRNA was initiated to provide an affordable option for counteracting the tide of infections and deaths due to COVID-19 [2, 3].

To date, a number of vaccine candidates have been designed using immunoinformatics. Choudhury et al. (2022) conceptualized a multi-epitope, multi-targeted chimeric vaccine candidate named AbhiSCoVac and proposed that it could have the requisite efficiency to inhibit the pathogenic attributes of all the virulent forms of the virus, and thus thwart the spread of the pandemic. This proposed vaccine candidate was designed using B and T cell epitopes of the peptide fragments occurring at the viral spike protein-TLR4 interface [4]. Molecular docking followed by molecular dynamics simulation studies indicated stable binding of the vaccine to human TLRs and MHCs, which provided preliminary support for the high immunogenicity of the vaccine without any significant allergenicity [4]. Another study by Ghorbani et al. (2020) reported the successful use of immunoinformatics approaches to design a novel virus-like particle with vaccine adjuvant properties by selecting a preliminary set of epitopes capable of inducing immune responses. This constructed particle is devoid of the genetic material of the native virus and is therefore considered safe from side effects [5]. Kumar et al. (2020) designed a T cell and B cell epitope-based subunit vaccine that is reported to be highly antigenic and immunogenic, with potential TAP affinity, ensuring a better capacity for antigen processing. Reportedly, the vaccine shows strong binding to TLR2 and the MHC receptors, and is efficacious in generating an adaptive immune response against the virus [6]. Behmard et al. (2020) reported formulating a safe and efficacious multi-epitope polypeptide vaccine against SARS-CoV-2 with the help of an array of immunoinformatics tools. The viral structural proteins were analyzed to screen out an initial set of epitopes, and the potential non-toxic and non-allergenic T cell and B cell binding epitopes, as well as cytokine-inducing epitopes, were selected on the basis of predictions. The selected epitopes were joined together with the help of linkers, and a suitable adjuvant was added to increase the immunogenicity of the vaccine construct. The affinity of the vaccine to TLR3 and the stability of the vaccine-receptor complex were confirmed by various studies, and the efficiency of the vaccine construct in eliciting a response

In-Silico Vaccine Design Against COVID-19


from the human immune system against the virus, while ruling out the possibility of negative side effects, was inferred [7]. Dong et al. (2020) reported the formulation of a multi-epitope vaccine by fusing B cell, HTL, and CTL epitopes with linkers, with its immunogenicity enhanced by adjoining a β-defensin amino acid sequence and pan-HLA DR binding epitopes to the N-terminus of the vaccine. A TAT sequence was fused to the C-terminus to enable intracellular delivery, and the probable 3D structure of the vaccine was predicted. The complexes between the vaccine and the immune receptors (such as TLR3, MHC-I, and MHC-II) were then evaluated by molecular docking. The mRNA of the vaccine was then optimized, and its secondary structure was generated, followed by in silico cloning [8]. Naz et al. (2020) analyzed two domains of the spike protein of SARS-CoV-2 and proposed two vaccine constructs containing T cell and B cell epitopes, which were assembled using linkers and adjuvants, followed by evaluation of their respective 3D structures on the basis of their physicochemical properties and their possible interactions with TLRs, ACE2, or HLA [9]. Abdelmageed et al. (2020) targeted the envelope protein of the 2019-nCoV virus to conceptualize a T cell epitope-based vaccine construct, which is claimed to be highly effective against the global pandemic [10]. On the basis of an extensive study, Das et al. (2021) developed a chimeric antibody by conjugating the CDRH3 of regdanvimab with the framework of sotrovimab, which, according to their claim, can combat variants of the virus that could escape mAb-mediated neutralization [1].

However, there is a need for a universal vaccine candidate that could counter all the pathogenic coronaviruses that infect humans and produce fatal outcomes. In fact, six such viruses, namely, hCoV-229E, hCoV-OC43, hCoV-HKU1, MERS-CoV, SARS-CoV, and SARS-CoV-2, have been reported so far to infect humans and cause death [4, 11]. Herein, we present the methodological trajectory of designing a novel universal vaccine against COVID-19 using the antigenic fragments obtained from the spike glycoprotein through an advanced immunoinformatics approach.

2 Materials

Immunoinformatics approaches are now considered indispensable aids in diagnostic and immunological development. Conventional approaches to immunological research are expensive and time-consuming, which is particularly undesirable in this field. Applying immunoinformatics to the steps of developing an efficacious vaccine reduces the experimental load. In this chapter, the stepwise methodology for designing the universal vaccine “AbhiSCoVac” is described in detail.


2.1 Software, Servers, and Databases

1. MEGA (https://www.megasoftware.net/): Molecular Evolutionary Genetics Analysis is user-friendly software that provides tools for analyzing DNA or protein sequence data from different species or populations from an evolutionary perspective [4].
2. PyMOL (https://pymol.org/2/): A highly customizable, Python-based, cross-platform software used for 3D visualization of macromolecules, electron densities, surfaces, and trajectories. It can also be used for editing molecules, analyzing macromolecules, protein–ligand docking, drug screening, molecular dynamics simulations, etc. [12].
3. ElliPro (http://tools.iedb.org/ellipro/): Reliable prediction of antibody epitopes is essential in immunodiagnostics. This tool helps in predicting and visualizing linear and discontinuous antibody epitopes in a given protein sequence or structure.
4. GROMACS (GROningen MAchine for Chemical Simulations) (https://www.gromacs.org/): This software is widely used by researchers for performing molecular dynamics simulations.
5. BcePred (http://crdd.osdd.net/raghava/bcepred/): This server is useful for predicting B cell epitopes on the basis of the physicochemical properties of amino acids.
6. ClusPro (https://cluspro.org/): A popular server for protein–protein docking that also allows the search to be modified. With this server, one can remove unstructured protein regions, apply attraction or repulsion, construct homo-multimers, and much more [13].
7. I-TASSER (https://zhanggroup.org/I-TASSER/): Iterative Threading ASSEmbly Refinement is software that predicts three-dimensional structure models of protein molecules from amino acid sequences, along with the properties of the protein molecules [14].
8. HawkDock (http://cadd.zju.edu.cn/hawkdock/): This web server is used for the prediction and analysis of the structures of various protein–protein complexes.
9. PRODIGY (http://milou.science.uu.nl/services/PRODIGY/): This server lets the user determine the binding affinity of protein–protein complexes.
10. BepiPred (http://cbs.dtu.dk/services/BepiPred): It predicts the location of linear B cell epitopes using a combination of a hidden Markov model and a propensity scale method.
11. SWISS-MODEL (https://swissmodel.expasy.org/): This server performs protein structure homology modeling and is accessible via the ExPASy web server.
12. ProPred (http://crdd.osdd.net/raghava/propred/): This server predicts the probable MHC class II binding regions in an antigen sequence, and thus assists in locating binding regions that could be useful in selecting vaccine candidates.
13. VaxiJen 2.0 (https://mybiosoftware.com/vaxijen-2-0/): It is the first server dedicated to the alignment-independent prediction of protective antigens, and classifies antigens based solely on the physicochemical properties of the proteins.
14. PolyView (https://polyview.cchmc.org): This server is used to determine the secondary structure of the vaccine.
15. Phyre2 server (http://www.sbg.bio.ic.ac.uk/phyre2): This server is used to model the tertiary structure of the vaccine. It uses advanced remote homology detection methods to build 3D models, predict ligand binding sites, and analyze the effect of amino acid variants.
16. SAVES (https://www.doe-mbi.ucla.edu/saves/): Short for Structure Analysis and Validation, SAVES is an interactive validation server that runs five programs commonly used in protein structure validation, namely, ERRAT, PROVE, VERIFY 3D, PROCHECK, and WHATCHECK.
17. ExPASy (https://www.expasy.org): It is the bioinformatics resource portal of the Swiss Institute of Bioinformatics, which provides access to various databases and software tools that are of great help in science and clinical research.
18. Pro-SA web server (https://prosa.services.came.sbg.ac.at): It evaluates the quality of a modeled protein structure based on the Z-score.
19. PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred): It provides a researcher with a wide range of methods for predicting the structure of proteins.
20. iMODS (https://imods.iqfr.csic.es/): It helps the user explore the collective motions of proteins and nucleic acids using normal mode analysis (NMA) in internal coordinates.
21. AllerTOP (https://www.ddg-pharmfac.net/AllerTOP/) and AllergenFP 1.0 (https://ddg-pharmfac.net/AllergenFP/): These tools are used to predict the allergenicity of a particular protein sequence.
22. HawkDock web server (http://cadd.zju.edu.cn/hawkdock/): This web-based server was used for quantitative determination of the binding energy and binding affinity for the protein–protein interaction between the vaccine peptide and the human immune receptors.
23. Immune Epitope Database (IEDB) (https://www.iedb.org/): It contains records of experimental observations regarding the adaptive immune response to epitopes, collected mainly from the literature [15].
24. C-IMMSIM server (http://www.cbs.dtu.dk/services/CImmSim-10.1): This web tool is a position-specific scoring matrix (PSSM)-based technique widely employed for conducting immune simulation experiments with in silico designed vaccines and for characterizing the immune response profile of the chimeric vaccine.
25. Java Codon Adaptation Tool (JCat) (http://www.jcat.de): It is used for codon optimization as well as reverse translation of the amino acid sequence to generate the cDNA sequence of the vaccine construct necessary for expressing the protein in a bacterial system.
26. SnapGene (https://www.snapgene.com/): A user-friendly software for DNA visualization, molecular cloning, and sequence editing.
27. NCBI protein database (https://www.ncbi.nlm.nih.gov): It includes a collection of sequences from a variety of sources, including translations of annotated coding regions from TPA, GenBank, and RefSeq, as well as records from SwissProt, PIR, PDB, and PRF.
28. RCSB protein data bank (https://www.rcsb.org/): It is the data center for the global Protein Data Bank (PDB), containing records of the 3D structure, function, and evolutionary updates of biological macromolecules such as proteins and nucleic acids.

3 Methods

3.1 Data Mining and Selection of Antigenic Protein

The aim of this study is to develop a universal vaccine that could be effective against all the SARS-CoV-2 strains that primarily infect humans and cause fatal outcomes. With this in view, the following steps are required for selecting the antigens.

1. Retrieve the amino acid sequences of the spike glycoproteins of six pathogenic coronaviruses, namely, hCoV-229E (Acc. No.: P15423.1), hCoV-OC43 (Acc. No.: P36334.1), hCoV-HKU1 (Acc. No.: AGW27872.1), MERS-CoV (Acc. No.: QBM11748.1), SARS-CoV (Acc. No.: AAR86775.1), and SARS-CoV-2 (Acc. No.: P0DTC2.1) in FASTA format from the NCBI Protein database (see Note 1).
2. Perform multiple sequence alignment using the ClustalW algorithm in the MEGA ver. 11 platform and save the alignment file in .txt format for further use.
3. Prepare an unrooted phylogenetic tree of the viruses using the maximum parsimony algorithm and check the statistical confidence of the phylogram through bootstrap analysis with 1000 replications (see Fig. 1). The numbers in the phylogenetic tree indicate bootstrap values, and a score above 80 is considered satisfactory [16]. Aligning the spike protein sequences and studying the phylogenetic relationships among the pathogenic coronaviruses is necessary to identify the conserved sequences present within the SARS strains. The spike protein sequence of SARS-CoV-2 shares significant homology with those of SARS-CoV, hCoV-OC43, and hCoV-HKU1, while MERS-CoV and hCoV-229E are distantly related.
4. Before conceptualizing a vaccine candidate against any disease, it is essential to have sound knowledge of the interactions between the causal agent and the host, and hence to review previous findings on the targeted disease. In this case, it has already been inferred that TLR4 is the prime pattern recognition receptor (PRR) that senses and interacts with the spike glycoproteins on the surface of different coronaviruses [4]. Thus, perform molecular docking to identify the region of the spike protein of each pathogenic coronavirus that interacts with TLR4. For molecular docking, retrieve the crystal structure of the human TLR4-MD2 complex (PDB ID: 3FXI) from the Protein Data Bank (PDB) and download the structure in .pdb format. Similarly, retrieve the structures of the spike glycoproteins of the coronaviruses listed in step 1 from the PDB by searching their PDB IDs: 6U7H, 5I08, 6Q05, 6NZK, 5XLR, and 6VYB. Perform molecular docking using ClusPro 2.0 and determine the binding affinity (ΔG) using PRODIGY [17] (see Note 2). Use PyMOL to visualize the interacting peptides/amino acid patches within the TLR4-spike interface, extract them, and save them in .txt format. Build the refined antigen by selecting the common conserved sequence occurring across all six viruses.

3.2 Screening of B and T Cell Epitopes and Prediction of Antigenicity

Peptide vaccines are based on identifying and chemically synthesizing the specific portions of antigens recognized by B cells and T cells, termed epitopes, which are potent in eliciting specific immune responses and are thus fundamental to vaccine development [18]. Immunoinformatics provides a researcher with extensive data and a wide array of tools for the prediction of B cell and T cell epitopes; the underlying databases are generally records of experimentally discovered epitopes [19, 20]. The following databases and servers can be used to predict B cell and T cell epitopes in the refined spike antigen computationally.

1. Determine the presence of B cell epitopes in the refined spike protein antigen using the Immune Epitope Database (IEDB) and BepiPred 2.0. Compare the results and select the peptides that appear in both analyses (see Note 3).
2. Determine the T cell epitopes, comprising both MHC-I and MHC-II epitopes, using the IEDB T cell epitope prediction tools and the ProPred server.
3. Check the antigenicity of the predicted epitopes using VaxiJen 2.0 and construct the vaccine following the steps described in the subsequent section.

Fig. 1 Phylogenetic analysis for studying the sequence homology amongst the spike proteins of different coronaviruses that infect humans. The unrooted phylogenetic tree is constructed using the maximum parsimony algorithm and post-analyzed with bootstrap statistics with 1000 replications. (Adapted from Choudhury et al. [4] with permission from the publisher)

3.3 Construction of the Vaccine Architecture and Physicochemical Characterization

This step involves linking the epitopes with linker sequences and attaching adjuvants to enhance the immunogenicity of the vaccine candidate [21]. The primary, secondary, and tertiary structures of the vaccine construct can be built and viewed with the help of various immunoinformatics servers and software. Two-dimensional and three-dimensional structures of the vaccine molecule are also to be obtained from the relevant servers and analyzed [21]. The following steps are to be performed for generating the vaccine construct:

1. Use the VaxiJen 2.0 server to assess the antigenicity of the construct relative to each coronavirus particle. Set the threshold score at 0.4. VaxiJen 2.0 uses an auto cross-covariance (ACC) transformation algorithm to estimate the antigenic strength of a given epitope directly from its sequence [22].
2. Use the ElliPro tool from the IEDB server to identify the conformational and/or discontinuous B cell epitopes occurring in the vaccine construct. This will generate a Protrusion Index (PI) for each detected discontinuous epitope, which reflects the proportion of amino acid residues lying inside versus outside the computed ellipsoids (see Note 4).
3. After confirming the antigenicity and epitopes, join the MHC I epitopes with glycine-proline rich GPGPG linkers and link the MHC II epitopes using AAY linkers. In addition, add the Cholera Toxin B (CTB) protein to the N-terminus of the vaccine through an EAAAK linker. CTB is a useful adjuvant for boosting the immunogenicity of the vaccine construct [23] (see Note 5).
4. Determine the secondary structure of the vaccine using PolyView and then model the tertiary structure using the Phyre2 server (http://www.sbg.bio.ic.ac.uk/phyre2).
5. Validate the 3D model of the vaccine for stereochemical quality and free energy. First, check it with the ERRAT tool available in the SAVES 6.0 server, followed by assessment with the Pro-SA web server, which evaluates the biophysical quality of the 3D-modeled structure based on the Z-score. Determine the Ramachandran plot for the generated vaccine structure using the SWISS-MODEL Structure Assessment tool from the ExPASy server. A model is considered stereochemically stable if >90% of its residues lie in the favored region, while an ERRAT score above 50 indicates good model quality [1, 23] (see Note 6).
6. Estimate the allergenicity of the antigen using the AllerTOP 2.0 and AllergenFP 1.0 servers. Compare the results and proceed to the physicochemical analyses.
7. Open the ExPASy portal and submit the peptide sequence of the vaccine to the ProtParam tool to assess various physicochemical attributes such as molecular weight, half-life, isoelectric point (pI), instability index, aliphatic index, and GRAVY score (see Note 7).

The design output and stereochemical attributes of the universal vaccine “AbhiSCoVac” are presented in Fig. 2.

3.4 Determination of Vaccine Efficacy in Triggering Innate and Adaptive Immunity

Both the innate and adaptive arms of the human immune system are equally significant when it comes to immunization [24]. To determine how efficacious a newly designed vaccine could be, it needs to be assessed on the basis of certain parameters, such as its stability under various conditions, its ability to elicit an immune response, the longevity of its effects, and its capacity to combat probable future variants that might turn out to be even more dangerous than the current ones. Hence, it is important to determine the physicochemical properties and immunogenicity of the vaccine construct. To study the interaction of the vaccine molecule with the target immune cell receptors and ensure effective binding, molecular docking of the vaccine peptide should be performed with prime receptors such as the cell surface TLRs (TLR1, 2, 4, 5, and 6), MHC I, and MHC II [4]. Parameters such as bonding interactions, specific residues, and the interaction zones involved should also be studied following the steps mentioned below.

1. Perform separate molecular docking of the vaccine peptide with the TLR4-MD2 protein and with the native TLR1/2 and TLR2/6 dimers, as well as with the MHC class I (PDB ID: 1I1Y) and MHC class II (PDB ID: 1KG0) receptors, using the ClusPro 2.0 server.
2. Record the comparative docking scores and the mode of biophysical interactions using PyMOL.
3. Use the PRODIGY server to determine the overall binding affinity (ΔG) of the vaccine molecule toward the target receptor.

Binding of the vaccine peptide to the human TLRs and MHCs is presented in Fig. 3.
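When comparing receptors on the basis of the ΔG values obtained in step 3, it can help to express each ΔG as an equilibrium dissociation constant through the standard thermodynamic relation ΔG = RT ln Kd. The following is only a minimal sketch of that conversion, assuming ΔG is given in kcal/mol and a temperature of 25 °C; the function name and the example ΔG values are hypothetical and are not results reported for AbhiSCoVac.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def dissociation_constant(delta_g_kcal: float, temp_c: float = 25.0) -> float:
    """Convert a binding free energy (kcal/mol) to Kd (mol/L) via dG = RT ln(Kd)."""
    temp_k = temp_c + 273.15
    return math.exp(delta_g_kcal / (R_KCAL * temp_k))


# Hypothetical docking results (illustrative values only, not from the study)
example_dg = {"TLR4-MD2": -12.0, "MHC-I": -9.5}

# Rank complexes from strongest (most negative dG) to weakest binding
for receptor, dg in sorted(example_dg.items(), key=lambda kv: kv[1]):
    kd = dissociation_constant(dg)
    print(f"{receptor}: dG = {dg:.1f} kcal/mol, Kd = {kd:.2e} M")
```

A more negative ΔG corresponds to a smaller Kd, that is, tighter predicted binding, so ranking by either quantity gives the same order.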


Fig. 2 3D structure and physicochemical analyses of the universal COVID-19 vaccine “AbhiSCoVac.” (a) Surface topology of the tertiary structure of the peptide vaccine. (b) Kolaskar and Tongaonkar plot showing the antigenicity profile of the vaccine construct across the amino acid sequence of the peptide. (c) Map of the different secondary structural components in the sequence of the vaccine. (d) Ramachandran plot and (e) X-ray-NMR plot showing the stereochemical stability of the designed vaccine. (Adapted from Choudhury et al. [4] with permission from the publisher)
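Two of the physicochemical attributes mentioned for the construct, the GRAVY score and the aliphatic index, follow simple published formulas and can be checked by hand alongside the ProtParam output. The sketch below first assembles a toy construct with the linkers named in Subheading 3.3 (EAAAK after the adjuvant, GPGPG between MHC I epitopes, AAY between MHC II epitopes) and then computes both indices. All sequences are short placeholders rather than real epitopes or the CTB sequence, the function names are hypothetical, and the order of the epitope blocks and the linker bridging them are assumptions, since the text does not specify them.

```python
# Kyte-Doolittle hydropathy values used for the GRAVY score
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def assemble_construct(adjuvant, mhc1_epitopes, mhc2_epitopes):
    """Join adjuvant and epitopes with the linkers named in the protocol.

    Assumption: the MHC I block precedes the MHC II block and the two
    blocks are bridged by a GPGPG linker; the source does not state this.
    """
    mhc1_block = "GPGPG".join(mhc1_epitopes)
    mhc2_block = "AAY".join(mhc2_epitopes)
    return adjuvant + "EAAAK" + mhc1_block + "GPGPG" + mhc2_block


def gravy(seq):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    return sum(KD[aa] for aa in seq) / len(seq)


def aliphatic_index(seq):
    """Aliphatic index: mole % of Ala + 2.9 * Val + 3.9 * (Ile + Leu)."""
    def pct(aa):
        return 100.0 * seq.count(aa) / len(seq)
    return pct("A") + 2.9 * pct("V") + 3.9 * (pct("I") + pct("L"))


# Placeholder sequences for illustration only (not real epitopes or CTB)
construct = assemble_construct(
    adjuvant="MKLT",
    mhc1_epitopes=["VLAIY", "GKWTF"],
    mhc2_epitopes=["LLVAAG", "STNDE"],
)
print(construct)
print(f"GRAVY = {gravy(construct):.3f}")
print(f"Aliphatic index = {aliphatic_index(construct):.2f}")
```

A negative GRAVY value indicates an overall hydrophilic sequence, and a higher aliphatic index is generally read as a sign of greater thermostability.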


Fig. 3 Binding of the vaccine molecule to human (a) TLR2, (b) TLR4, (c) MHC-I, and (d) MHC-II. Molecular docking shows the protein–protein interaction between the vaccine and the target receptor. (Adapted from Choudhury et al. [4] with permission from the publisher)

3.5 Determination of Stability of Binding of Vaccine to the Target Proteins

After the vaccine molecule is docked with the immune receptors, the docked complexes are examined for molecular dynamics simulation trajectory analyses (see Note 8). The simulation can be performed using the following steps: 1. Take the docked complexes comprising the vaccine peptide and immune receptor (TLR4, TLR2, MHC I, and MHC II) in . pdb format and submit in the iMod, Amber99sb force field and GROMACS 5.1.4 for performing the simulation studies. 2. Initiate the simulation by running the Normal Mode Analysis (NMA) for assessing the atomic and molecular motion and flexibility of the vaccine-TLR and vaccine-MHC complexes using iMODS (see Fig. 4). The degree of deformability, the non-rigid portion of the simulated structure demonstrates the presence of coil or domain linkers contributing flexibility to the protein structure [1]. The stability of a protein–protein complex in NMA is expressed by eigenvalue, while instability is presented by B-value [21]. The domain motion in the protein complex comprising the vaccine peptide and human TLR4 is represented by a curved arrow (see Fig. 4a). As given in Fig. Xb, a high degree of deformability has been observed throughout the chain of AbhiSCoVac-immune receptor (TLR or MHC)

In-Silico Vaccine Design Against COVID-19

443

Fig. 4 NMA-based determination of the stability of binding of the vaccine molecule to the immune receptor of human. (a) Modal vibrations of vaccine-receptor complex. (b) Eigenvalues showing the stability and (c) represents deformability of the protein complexes. (d) Covariance matrices showing the direction and impact of different atomistic and molecular motion within the vaccine-receptor complexes comprising (I) MHC I, (II) MHC II, (III) TLR2, and (IV) TLR4-MD2

444

Ankita Chakraborty et al.

complex, shown in hinges [25]. This deformation is due to the high helical content in the two interacting proteins. The relative amplitude of atomic displacements in the protein–protein interaction is presented by B-factor values that reveals a higher deformability in the protein, indicating toward a greater flexibility of AbhiSCoVac-immune receptor complex (see Fig. 4c). Eigenvalue indicates the impact of each deformation movement in the total protein motion [26]. 3. After initial confirmation of the stability, MD simulation should be performed. MD simulation determines the thermodynamic stability, conformational changes, and secondary structural attributes of the vaccine-receptor complex through calculation of various parameters like root mean square deviation (RMSD), root mean square fluctuation (RMSF), radius of gyration, (Rg), hydrogen bond, solvent accessible surface areas (SASA), and binding free energy (see Fig. 5), which are easy to be determined through MD simulations [21]. For successful simulation, the following steps need to be executed: (a) In a cubic box of simple point charge (SPC) water molecules, immerse the protein complexes. Add number of Na+ and Cl- ions to neutralize the system. (b) Perform energy minimization using the steepest descent method for at least 50,000 steps for omitting the shortrange bad contacts. (c) Equilibrate the systems for 5 ns followed by 50 ns MD production runs at a temperature of 298 K and pressure 1 bar with a 2 fs time step (see Note 9). (d) During the course of MD simulation, determine the conformational changes and stability of vaccine bound to different immune receptor, the root-mean-square deviation (RMSD), root-mean-square fluctuation (RMSF), the radius of gyration (Rg), hydrogen bond, and solvent accessible surface area (SASA). 4. Use Molecular Mechanics Poisson-Boltzmann Surface Area (MMPBSA) [27] method to calculate the binding free energy. 
Calculate the binding free energies of the complexes through 5000 snapshots that should be taken at an equal interval of time from 50 ns simulation period. The efficacy of the vaccine is predicted quantitatively by determining the binding free energy (ΔG) for the docked protein complex comprising the vaccine candidate and the target receptor protein before and after the course of simulation using HawkDock web server that exploits the Molecular Mechanics/Generalized Born Surface Area (MM/GBSA) approach [28]. The MM/GBSA approach estimates both binding energy and binding affinity through

In-Silico Vaccine Design Against COVID-19

445

Fig. 5 Determination of the stability of the protein–protein interactions between the vaccine molecule and immune receptor of a human through Molecular Dynamics Simulation. (a) Comparative biophysical stability of

446

Ankita Chakraborty et al.

analyzing the gas-phase energy (MM), polar electrostatic solvation energy (GB), and non-polar (SA) solvation energy [29] that are fitted in the following equation mentioned below: ΔG bind = ΔE MM þ ΔG GB þ ΔG SA - T ΔS (ΔGbind: Binding free energy; TΔS: entropy; ΔEMM: Total gas-phase energy; ΔGGB: Polar solvation energy; ΔGSA: non-polar solvation energy) The total gas phase energy (ΔEMM) is usually calculated based on the internal bond energies (ΔEinternal), electrostatic energy (ΔEelectrostatic), and van der Waals energy (ΔEvdw). ΔE MM = ΔE internal þ ΔE electrostatic þ ΔE vdw Quantitative value of each form of interaction energy, binding affinity, and free energy can be calculated from the simulation platform and can be depicted as a table (see Table 1). 3.6 Immune Simulation with the Vaccine

Immune simulation follows the MD simulations and evaluates the specific immune response and the immunogenicity of the vaccine construct [4, 30]. Immunogenicity is the key parameter for the success of a vaccine construct, while lack of allergenicity is considered an added advantage. The potency of the designed vaccine can be examined by immune simulation following the steps below:
1. Complete the registration in the C-IMMSIM server for performing the immune simulation.
2. Set the simulation duration to 1000 days, i.e., 3000 time steps, where one time step corresponds to 8 h.
3. Administer the vaccine doses through the in silico injection tool available in the server. The minimum gap between two doses of the vaccine should be set to 4 weeks. Inject 12 consecutive doses of vaccine, each comprising 1000 peptide molecules, on days 10, 94, 178, 262, 346, 430, 514, 598, 682, 766, 850, and 934 (see Note 10).
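The dosing schedule in step 3 is given in days, whereas the simulation in step 2 is defined in 8-h time steps. The following minimal Python sketch converts one to the other and checks the 4-week minimum gap. The names `day_to_step`, `DOSE_DAYS`, and `MIN_GAP_DAYS` are introduced here for illustration only and are not part of the C-IMMSIM server.

```python
# Illustrative sketch: convert the dosing days of step 3 into 8-h time steps.
# All names below are hypothetical helpers, not C-IMMSIM parameters.

HOURS_PER_STEP = 8                      # step 2: one time step corresponds to 8 h
STEPS_PER_DAY = 24 // HOURS_PER_STEP    # 3 time steps per simulated day
MIN_GAP_DAYS = 28                       # step 3: minimum gap of 4 weeks between doses

DOSE_DAYS = [10, 94, 178, 262, 346, 430, 514, 598, 682, 766, 850, 934]


def day_to_step(day: int) -> int:
    """Return the time step that corresponds to a given simulation day."""
    return day * STEPS_PER_DAY


# Time steps at which each dose would be injected
dose_steps = [day_to_step(day) for day in DOSE_DAYS]

# Gaps (in days) between consecutive doses
gaps = [later - earlier for earlier, later in zip(DOSE_DAYS, DOSE_DAYS[1:])]

if __name__ == "__main__":
    print("Total steps for 1000 days:", day_to_step(1000))
    print("Dose time steps:", dose_steps)
    print("Smallest gap (days):", min(gaps))
    print("Schedule respects minimum gap:", min(gaps) >= MIN_GAP_DAYS)
```

With this conversion, the 1000-day run corresponds to the 3000 time steps stated in step 2, and every dose falls inside the simulated period.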

Fig. 5 (continued) the different vaccine-receptor protein complexes in the simulation environment is shown by (I) RMSD (in nm) and (II) Rg (in nm), (b) (I) SASA (in nm²) and (II) hydrogen bonding, and (c) (I) RMSF (in nm) and (II) the contributions of other non-covalent interactions such as van der Waals forces, electrostatic, polar solvation, and SASA contributing to the total binding energy (in kJ/mol). The MHC-I-vaccine, MHC-II-vaccine, TLR2-vaccine, and TLR4-vaccine complexes are, respectively, represented by black, red, blue, and green colors. Molecular docking showing the protein–protein interaction between the vaccine and the target receptor, i.e., (a) TLR2, (b) TLR4, (c) MHC-I and MHC-II. (Adapted from Choudhury et al. [4] with permission from the publisher)


Table 1 Quantitative determination of different non-covalent forces stabilizing the vaccine-immune receptor complexes. Columns are the complexes of the vaccine peptide with the different sensor proteins

| Stabilization parameters | MHC-I | MHC-II | TLR2 | TLR4 |
| --- | --- | --- | --- | --- |
| van der Waals energy | -213.6 ± 5.0 | -134.6 ± 5.1 | -251.9 ± 10.1 | -244.1 ± 9.1 |
| Electrostatic energy | -326.7 ± 16.2 | -55.5 ± 5.0 | -249.3 ± 16.3 | -327.6 ± 14.1 |
| Polar solvation energy | 435.0 ± 14.0 | 152.5 ± 6.0 | 333.6 ± 25.9 | 387.6 ± 20.7 |
| SASA energy | -35.6 ± 0.6 | -17.8 ± 0.5 | -31.5 ± 1.1 | -30.3 ± 0.6 |
| Total binding free energy (kJ/mol) | -141.4 ± 11.7 | -55.2 ± 8.0 | -199.4 ± 18.9 | -213.8 ± 19.2 |
| Average RMSD (nm) | 0.97 | 1.1 | 1.09 | 1.7 |
| Average Rg (nm) | 4.5 | 5.8 | 5.1 | 5.1 |
| Average SASA (nm²) | 653.3 | 541.3 | 793.1 | 995.1 |
| Average H-bond | 6 | 9 | 5 | 6 |
| Average RMSF (nm) | 0.57 | 0.85 | 1.1 | 1.7 |

Adapted from Choudhury et al. [4] with permission from the publisher
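As a quick consistency check on Table 1, the four energy terms of each complex can be summed and compared with the reported total binding free energy. The short Python sketch below does this with the mean values from the table. It assumes that the reported totals are the sum of the van der Waals, electrostatic, polar solvation, and SASA terms with no separate entropy term; this is an inference from the numbers, not something the table states. The names `COMPONENTS`, `REPORTED_TOTAL`, and `summed_total` are illustrative.

```python
# Illustrative check: do the energy terms of Table 1 add up to the reported totals?
# Values are the table means; the additive model without an entropy term is an
# assumption made for this sketch.

# (van der Waals, electrostatic, polar solvation, SASA) per complex
COMPONENTS = {
    "MHC-I": (-213.6, -326.7, 435.0, -35.6),
    "MHC-II": (-134.6, -55.5, 152.5, -17.8),
    "TLR2": (-251.9, -249.3, 333.6, -31.5),
    "TLR4": (-244.1, -327.6, 387.6, -30.3),
}

# Total binding free energy as reported in Table 1 (kJ/mol)
REPORTED_TOTAL = {
    "MHC-I": -141.4,
    "MHC-II": -55.2,
    "TLR2": -199.4,
    "TLR4": -213.8,
}


def summed_total(terms):
    """Sum the individual energy terms, rounded to one decimal place."""
    return round(sum(terms), 1)


if __name__ == "__main__":
    for name, terms in COMPONENTS.items():
        total = summed_total(terms)
        difference = total - REPORTED_TOTAL[name]
        print(f"{name}: sum = {total:.1f}, reported = {REPORTED_TOTAL[name]:.1f}, "
              f"difference = {difference:+.1f}")
```

The sums agree with the reported totals to within about 1 kJ/mol, which is well inside the stated standard deviations; small differences are expected because each mean is rounded independently.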

4. Administer a live self-replicating virus at the 1100th time step to check the efficacy of the vaccination. In a control experiment, i.e., without any prior vaccination, inject the virus on the 366th day. This is required to simulate and compare the ability of the host to counter the infection with and without vaccination.
5. Measure the serum titers/levels of the different antibodies, cytokines, chemokines, and interferons through graphical plots. The IgM level reflects the immediate response to the antigen, while IgG represents the long-lasting immunoglobulins. The abundance of memory B and T cells and their persistence over time should be taken into consideration while developing a vaccine in silico (see Fig. 6).

3.7 Cloning of the Vaccine Construct

Cloning of the vaccine for production of the recombinant antigen is of major importance in developing any vaccine. An example of in silico cloning of AbhiSCoVac in the plasmid vector pET-28a(+) is given in Fig. 7. In silico cloning of the vaccine construct can be achieved through the following steps:
1. First, convert the amino acid sequence of the vaccine to cDNA, with a codon optimization profile suitable for propagation in a bacterial system, using the JCat online tool.

Fig. 6 Immune simulation showing the efficacy of the vaccine molecule in inducing immunity against coronavirus. (a) Immunogenicity profile of the peptide vaccine after administration of live virus on the 366th day. (b) Control experimental set depicting the immune response induced after administering the virus on the 366th day without prior vaccination. In each set of experimental outcomes, (I) shows antigen count per ml, (II) antibody titers, (III) B cell population per mm³, (IV) helper T cell population per mm³, (V) macrophage population per state per mm³, and (VI) cytotoxic T lymphocyte population per mm³


Fig. 7 Map of the cloned cDNA of the vaccine “AbhiSCoVac” in expression vector pET-28a(+)

2. Select an expression vector (e.g., the pET-28a(+) plasmid for AbhiSCoVac) for expressing the vaccine candidate in the E. coli K-12 strain.
3. Insert the optimized cDNA encoding the vaccine peptide into the plasmid vector using the SnapGene tool 5.2.1 (see Note 11).
Taken together, a scheme of the workflow for the immunoinformatics and reverse vaccinology-based design of the universal vaccine AbhiSCoVac is presented in Fig. 8.
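To make the reverse translation and GC-content ideas in the cloning steps concrete, the following Python sketch back-translates a short peptide using one codon per amino acid and reports the GC content of the result. It is only an illustration: the codon table lists codons commonly regarded as frequent in E. coli and is an assumption of this sketch, the test peptide is hypothetical, and the procedure is not the algorithm used by JCat, which works from host codon usage tables and also reports the codon adaptation index (CAI).

```python
# Illustrative back-translation with one assumed "preferred" codon per residue.
# This is NOT the JCat algorithm; it only shows the principle of reverse
# translation and how GC content is computed from the resulting sequence.

PREFERRED_CODON = {
    "A": "GCG", "R": "CGT", "N": "AAC", "D": "GAT", "C": "TGC",
    "Q": "CAG", "E": "GAA", "G": "GGC", "H": "CAT", "I": "ATT",
    "L": "CTG", "K": "AAA", "M": "ATG", "F": "TTT", "P": "CCG",
    "S": "AGC", "T": "ACC", "W": "TGG", "Y": "TAT", "V": "GTG",
}


def back_translate(protein: str) -> str:
    """Return a DNA sequence encoding the protein, one codon per residue."""
    codons = []
    for residue in protein.upper():
        if residue not in PREFERRED_CODON:
            raise ValueError(f"Unsupported residue: {residue!r}")
        codons.append(PREFERRED_CODON[residue])
    return "".join(codons)


def gc_content(dna: str) -> float:
    """Return the percentage of G and C bases in a DNA sequence."""
    if not dna:
        raise ValueError("Empty sequence")
    dna = dna.upper()
    return 100.0 * (dna.count("G") + dna.count("C")) / len(dna)


if __name__ == "__main__":
    peptide = "MKTAY"  # hypothetical test peptide, not part of AbhiSCoVac
    dna = back_translate(peptide)
    print(dna, f"GC content = {gc_content(dna):.1f}%")
```

In practice, the GC content and CAI reported by the codon optimization tool itself should be used, as described in Note 11.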

4 Notes
1. All amino acid and cDNA sequences should be prepared in FASTA format and saved as .txt files.
2. While studying molecular docking with the TLRs, only the interaction of the vaccine peptide with the extracellular domain of the TLRs is taken into account. The complexes showing the most negative binding free energy are to be considered for further study.

Fig. 8 Scheme depicting route map for designing a multiepitope peptide vaccine through reverse vaccinology-based immunoinformatics approach. Panel labels recovered from the figure (steps 1–10 in the original): sequence retrieval of SARS-CoV-2 spike glycoproteins from open source databases for phylogenetic analysis of different coronavirus strains that infect humans; modelling of 3D structures of spike protein by SWISS-MODEL Workspace or MODELLER; molecular docking of spike protein with TLR4; identification of the major interacting patches; exploration of B-cell and T-cell epitopes within the antigenic peptide fragments; construction of sequence of vaccine candidate through the sequential addition of antigenic peptides, linkers and adjuvants; generation of 3D crystal structure of vaccine candidate and validation of the modelled structure through analyzing the stereochemical properties; assessment of efficacy of vaccine in binding human TLRs (TLR2, TLR4) and encountering the mediators of antigen presentation (MHC-I and MHC-II); determination of the vaccine in eliciting adaptive immune responses including differential generation of different antibody classes through immune simulation approaches; reverse translation of the amino acid sequences and in silico cloning of the vaccine peptide

3. While screening the B and T cell epitopes, more than one web-based server should be explored, and the results must be compared before finalizing the epitopes.
4. During selection of B and T cell epitopes, only epitopes with higher binding scores are to be shortlisted.
5. Appropriate adjuvant(s) and linkers need to be used based on the level of immunogenicity desired in the antigen.
6. The 3D structure of the vaccine should be modeled using both program-based and automated modeling servers. The stereochemical quality of the developed model(s) must be verified by Ramachandran plot, X-ray, and NMR using computational tools.
7. Physicochemical characterization of the vaccine peptide, including the assessment of hydrophobicity, is mandatory before checking immune behavior.
8. Stability of the vaccine-immune receptor complexes must be examined by both NMA and MDS.
9. Molecular dynamics simulation must be run for at least 50 ns. For larger complexes, the simulation needs to be run for 100 ns.
10. Immune simulation should be performed using multiple injections of the antigen at different time points, and the efficacy of the vaccine in eliciting antibody responses must be compared with that of whole virus as well as mock injection.


11. Molecular cloning has to be done after selecting a suitable expression vector and host system. During cloning, the GC content and codon adaptation index (CAI) score should be determined to predict the protein expression levels.
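The MD settings given in the Methods (5 ns equilibration, 50 ns production, 2 fs time step, 5000 snapshots for the binding free energy calculation) and the minimum run length in Note 9 translate directly into integration step counts and a snapshot interval. The Python sketch below shows this arithmetic; the function and variable names are illustrative and are not options of any particular MD package.

```python
# Illustrative arithmetic: convert simulation lengths into integration steps
# and derive the spacing of the snapshots used for the free energy calculation.
# Names are hypothetical and do not correspond to any MD package settings.

TIME_STEP_FS = 2            # integration time step stated in the Methods
FS_PER_NS = 1_000_000       # femtoseconds per nanosecond
FS_PER_PS = 1_000           # femtoseconds per picosecond


def ns_to_steps(nanoseconds: int) -> int:
    """Return the number of integration steps needed for a given length in ns."""
    return nanoseconds * FS_PER_NS // TIME_STEP_FS


equilibration_steps = ns_to_steps(5)     # 5 ns equilibration
production_steps = ns_to_steps(50)       # 50 ns production run

N_SNAPSHOTS = 5000                       # snapshots taken at equal intervals
snapshot_interval_steps = production_steps // N_SNAPSHOTS
snapshot_interval_ps = snapshot_interval_steps * TIME_STEP_FS / FS_PER_PS

if __name__ == "__main__":
    print("Equilibration steps:", equilibration_steps)
    print("Production steps:", production_steps)
    print("Save a snapshot every", snapshot_interval_steps, "steps",
          f"({snapshot_interval_ps:.0f} ps)")
```

For the 100 ns runs suggested for larger complexes, `ns_to_steps(100)` gives the corresponding step count with the same time step.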

Acknowledgments
JB and SM acknowledge the Department of Science & Technology (DST)-Science & Engineering Research Board (SERB), Govt. of India (Sanction no. CRG/2021/002605), for providing a core research grant. All articles that could not be cited due to space limitations are respectfully acknowledged. We acknowledge the use of BioRender.com for making the illustrations.

References
1. Das NC, Chakraborty P, Bayry J, Mukherjee S (2021) In silico analyses on the comparative potential of therapeutic human monoclonal antibodies against newly emerged SARS-CoV-2 variants bearing mutant spike protein. Front Immunol 12:782506. https://doi.org/10.3389/fimmu.2021.782506
2. Han X, Xu P, Ye Q (2021) Analysis of Covid-19 vaccines: types, thoughts, and application. J Clin Lab Anal 35(9):e23937. https://doi.org/10.1002/jcla.23937
3. Ndwandwe D, Wiysonge CS (2021) Covid-19 vaccines. Curr Opin Immunol 71:111–116. https://doi.org/10.1016/j.coi.2021.07.003
4. Choudhury A, Sen Gupta PS, Panda SK, Rana MK, Mukherjee S (2022) Designing AbhiSCoVac - a single potential vaccine for all ‘Corona culprits’: immunoinformatics and immune simulation approaches. J Mol Liq 351:118633. https://doi.org/10.1016/j.molliq.2022.118633
5. Ghorbani A, Zare F, Sazegari S, Afsharifar A, Eskandari MH, Pormohammad A (2020) Development of a novel platform of virus-like particle (VLP)-based vaccine against COVID-19 by exposing epitopes: an immunoinformatics approach. New Microbes New Infect 38:100786. https://doi.org/10.1016/j.nmni.2020.100786
6. Kumar N, Sood D, Chandra R (2020) Design and optimization of a subunit vaccine targeting COVID-19 molecular shreds using an immunoinformatics framework. RSC Adv 10(59):35856–35872. https://doi.org/10.1039/d0ra06849g

7. Behmard E, Soleymani B, Najafi A, Barzegari E (2020) Immunoinformatic design of a COVID-19 subunit vaccine using entire structural immunogenic epitopes of SARS-CoV-2. Sci Rep 10(1):20864. https://doi.org/10.1038/s41598-020-77547-4
8. Dong R, Chu Z, Yu F, Zha Y (2020) Contriving multi-epitope subunit of vaccine for COVID-19: immunoinformatics approaches. Front Immunol 11:1784. https://doi.org/10.3389/fimmu.2020.01784
9. Naz A, Shahid F, Butt TT, Awan FM, Ali A, Malik A (2020) Designing multi-epitope vaccines to combat emerging coronavirus disease 2019 (COVID-19) by employing immunoinformatics approach. Front Immunol 11:1663. https://doi.org/10.3389/fimmu.2020.01663
10. Abdelmageed MI, Abdelmoneim AH, Mustafa MI, Elfadol NM, Murshed NS, Shantier SW, Makhawi AM (2020) Design of a multi epitope-based peptide vaccine against the E protein of human COVID-19: an immunoinformatics approach. Biomed Res Int 2020. https://doi.org/10.1155/2020/2683286
11. Choudhury A, Das NC, Patra R, Bhattacharya M, Ghosh P, Patra BC, Mukherjee S (2021) Exploring the binding efficacy of ivermectin against the key proteins of SARS-CoV-2 pathogenesis: an in silico approach. Future Virol 16(4):277–291. https://doi.org/10.2217/fvl-2020-0342
12. Yuan S, Chan HCS, Hu Z (2017) Using PyMol as a platform for computational drug design. WIREs Comput Mol Sci 7(2):1298. https://doi.org/10.1002/wcms.1298


13. Kozakov D, Hall DR, Xia B, Porter KA, Padhorny D, Yueh C, Beglov D, Vajda S (2017) The ClusPro web server for protein–protein docking. Nat Protoc 12(2):255–278. https://doi.org/10.1038/nprot.2016.169
14. Roy A, Kucukural A, Zhang Y (2010) I-TASSER: a unified platform for automated protein structure and function prediction. Nat Protoc 5(4):725–738. https://doi.org/10.1038/nprot.2010.5
15. Vita R, Mahajan S, Overton JA, Dhanda SK, Martini S, Cantrell JR, Wheeler DK, Sette A, Peters B (2019) The Immune Epitope Database (IEDB): 2018 update. Nucleic Acids Res 47(D1):D339–D343. https://doi.org/10.1093/nar/gky1006
16. Mesel-Lemoine M, Millet J, Vidalain PO, Law H, Vabret A, Lorin V, Escriou N, Albert ML, Nal B, Tangy F (2012) A human coronavirus responsible for the common cold massively kills dendritic cells but not monocytes. J Virol 86(14):7577–7587. https://doi.org/10.1128/JVI.00269-12
17. Xue LC, Rodrigues JP, Kastritis PL, Bonvin AM, Vangone A (2016) Prodigy: a web server for predicting the binding affinity of protein–protein complexes. Bioinformatics 32(23):3676–3678. https://doi.org/10.1093/bioinformatics/btw514
18. Patronov A, Doytchinova I (2013) T-cell epitope vaccine design by immunoinformatics. Open Biol 3(1):120139. https://doi.org/10.1098/rsob.120139
19. Galanis KA, Nastou KC, Papandreou NC, Petichakis GN, Pigis DG, Iconomidou VA (2021) Linear B-cell epitope prediction for in silico vaccine design: a performance review of methods available via command-line interface. Int J Mol Sci 22(6):3210. https://doi.org/10.3390/ijms22063210
20. Vashi Y, Jagrit V, Kumar S (2020) Understanding the B and T cell epitopes of spike protein of severe acute respiratory syndrome coronavirus-2: a computational way to predict the immunogens. Infect Genet Evol 84:104382. https://doi.org/10.1016/j.meegid.2020.104382
21. Gorai S, Das NC, Gupta PS, Panda SK, Rana MK, Mukherjee S (2022) Designing efficient multi-epitope peptide-based vaccine by targeting the antioxidant thioredoxin of bancroftian filarial parasite. Infect Genet Evol 98:105237. https://doi.org/10.1016/j.meegid.2022.105237

22. Dalsass M, Brozzi A, Medini D, Rappuoli R (2019) Comparison of open-source reverse vaccinology programs for bacterial vaccine antigen discovery. Front Immunol 10:113. https://doi.org/10.3389/fimmu.2019.00113
23. Hou J, Liu Y, Hsi J, Wang H, Tao R, Shao Y (2014) Cholera toxin B subunit acts as a potent systemic adjuvant for HIV-1 DNA vaccination intramuscularly in mice. Hum Vaccin Immunother 10(5):1274–1283. https://doi.org/10.4161/hv.28371
24. Clem AS (2011) Fundamentals of vaccine immunology. J Glob Infect Dis 3(1):73–78. https://doi.org/10.4103/0974-777X.77299
25. Das NC, Sen Gupta PS, Biswal S, Patra R, Rana MK, Mukherjee S (2022) In-silico evidences on filarial cystatin as a putative ligand of human TLR4. J Biomol Struct Dyn 40(19):8808–8824. https://doi.org/10.1080/07391102.2021.1918252
26. López-Blanco JR, Aliaga JI, Quintana-Ortí ES, Chacón P (2014) iMODS: internal coordinates normal mode analysis server. Nucleic Acids Res 42(Web Server issue):W271–W276. https://doi.org/10.1093/nar/gku339
27. Wang J, Morin P, Wang W, Kollman PA (2001) Use of MM-PBSA in reproducing the binding free energies to HIV-1 RT of TIBO derivatives and predicting the binding mode to HIV-1 RT of Efavirenz by docking and MM-PBSA. J Am Chem Soc 123(22):5221–5230. https://doi.org/10.1021/ja003834q
28. Weng G, Wang E, Wang Z, Liu H, Zhu F, Li D, Hou T (2019) Hawkdock: a web server to predict and analyze the protein–protein complex based on computational docking and MM/GBSA. Nucleic Acids Res 47(W1):W322–W330. https://doi.org/10.1093/nar/gkz397
29. Ylilauri M, Pentikäinen OT (2013) MMGBSA as a tool to understand the binding affinities of filamin–peptide interactions. J Chem Inf Model 53(10):2626–2633. https://doi.org/10.1021/ci4002475
30. Kar T, Narsaria U, Basak S, Deb D, Castiglione F, Mueller DM, Srivastava AP (2020) A candidate multi-epitope vaccine against SARS-CoV-2. Sci Rep 10(1):10895. https://doi.org/10.1038/s41598-020-67749-1

Chapter 30
A Sample Guideline for Reverse Vaccinology Approach for the Development of Subunit Vaccine Using Varicella Zoster as a Model Disease
Elif Cireli and Levent Çavaş

Abstract
For the development of a multi-peptide vaccine, identification of antigenic epitopes is crucial. If done using wet lab techniques, the identification process can be time-consuming, laborious, and cost-intensive. In silico tools, on the other hand, enable researchers to predict potential epitopes at little to no cost for further in vivo and in vitro testing. Rapid identification using in silico tools helps in responding to health emergencies faster. Developing an efficient, high-coverage vaccine is one of the ways to reduce the morbidity and mortality rates of diseases and protect the affected populations. In this chapter, we introduce the necessary tools and methodology for the identification and characterization of antigenic epitopes to design a multi-epitope vaccine, using varicella-zoster virus as an example model.
Key words IEDB, Immunoinformatics, In silico, Peptide vaccine

1 Introduction
With the COVID-19 pandemic, the importance of a rapid response to emerging diseases and outbreaks is better understood. One of the most efficient and fastest ways to respond to a disease is through vaccination. There are various vaccinology techniques to prevent diseases from spreading to the general population. In this chapter, we introduce the reverse vaccinology approach to identify putative epitopes using immunoinformatics tools. Reverse vaccinology is an approach in which the whole protein sequence(s) is used for the identification of possible antigens [1]. This methodology also employs immunoinformatics tools to predict putative epitopes, thus saving the time, money, and labor needed to identify the epitopes. In contrast to traditional approaches, peptide subunit vaccines do not contain the whole organism, but combine antigenic regions of a pathogen, such as T and B cell epitopes, into a single vaccine formulation [2]. For this

Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_30, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023


reason, identification of B and T cell epitopes plays a crucial role in the design of subunit vaccines that elicit both humoral and cell-mediated immune responses [3].
Immune Epitope Database and Analysis Resource (IEDB) is a free, publicly available database that provides the scientific community with epitope and assay information related to infectious diseases, autoimmune diseases, and allergy [4]. IEDB also contains various in silico analysis tools for continuous and discontinuous B cell epitope prediction, T cell epitope prediction, population coverage, and other relevant epitope analyses.
B cells are the main component of the humoral immune response and are responsible for antigen-specific antibody production [5]. Prediction of antigenic B cell epitopes is therefore of high importance for eliciting humoral immune responses and the production of antibodies that neutralize pathogens. The B cell epitope prediction tool in IEDB provides researchers with different prediction methods based on, for example, hydrophilicity, physicochemical properties of amino acid residues, and turn scales. These methods use independent algorithms; however, a consensus of their results can also be used.
Antigen-specific T cells recognize antigens when they form a complex with major histocompatibility complex (MHC) molecules [6]. Through the antigen processing pathway, antigenic peptides are bound to MHC molecules and presented to T cells [7]. CD8+ T cells recognize peptides bound to MHC-I molecules, which are present on almost all nucleated cells [8, 9]. The peptide binding groove of MHC-I molecules is closed at both ends; peptides that are 8–15 amino acid residues long can bind, but 9-mer peptides are predominant [10]. On the other hand, MHC-II molecules are present only on antigen-presenting cells and have an open binding groove, which enables them to accommodate longer peptides of 13–25 amino acid residues, including the core and flanking regions [11].
The MHC-I prediction tool in IEDB offers a wide range of prediction methods including ANN [12–17], SMM [18], SMMPMBEC [19], CombLib [20], Consensus [21], NetMHCpan [22–26], NetMHCcons [27], PickPocket [28], and NetMHCstabpan [29]. The tool also enables researchers to choose between different hosts, including pig, dog, and cow. If “human” is selected as the host, IEDB provides a reference set of 27 alleles that covers 97% of the world population [30]. The MHC-II prediction tool also offers a variety of prediction methods, such as NN-align [31, 32], SMM-align [33], Combinatorial library [20], Sturniolo [34], and NetMHCIIpan [22, 31, 35]. Users can specify the loci DR, DP, and DQ for human and H-2-I for mouse. A common allele set is also available for MHC-II prediction, and users can also proceed with the 7-allele set [36].

Development of Subunit Vaccine by Immunoinformatics Tools


In most cases, the epitope prediction process is uncomplicated; however, the predicted peptides should also be tested for their antigenicity, allergenicity, and toxicity. Weak immunogenicity is one of the drawbacks of peptide-based vaccines, so the candidate peptides to be included in the vaccine construct should be tested for their antigenicity [37]. Vaxijen 2.0 is a bioinformatics tool that uses an alignment-free approach to predict the antigenicity of peptides based on the target organism [38]. Although the allergic response to peptide vaccines is reported to be minimal, predicted peptides should be screened for their allergenicity to achieve maximal safety [39]. AllerTOP v. 2.0 is a bioinformatics server that predicts the allergenicity of peptides [40]. Toxicity is one of the disadvantages of peptide-based therapies [41]. Identifying the non-toxic peptides, while retaining their non-allergenic and antigenic properties and other functionalities, helps to minimize side effects. ToxinPred is a bioinformatics tool that predicts the toxicity of peptides using a support vector machine and a motif-based algorithm [42].
To cover the tools and methodology listed above, we selected varicella zoster as a model disease. Varicella-zoster virus (VZV) is a common human alphaherpesvirus that causes varicella on primary infection and shingles later in life [43]. VZV spreads through direct contact with skin lesions or through respiratory aerosols [44]. Primary infection with VZV causes acute varicella, commonly known as “chickenpox,” and results in lifelong latent infection of the trigeminal and dorsal root ganglia [45]. Although varicella is primarily considered a childhood disease, reactivation of latent virus can cause herpes zoster, or shingles, especially in elderly people [46]. VZV infection is not considered life-threatening; however, in some cases it can cause severe complications such as meningitis, encephalitis, cerebellitis, and myelopathy [47]. Therefore, in this chapter, we propose a method to design a candidate peptide-based vaccine with high population coverage by using immunoinformatics tools and varicella zoster as a model disease.

2 Materials
All the bioinformatics tools used in this chapter and their access links are given in the list below. The related instructions are available at the given links.
1. UniProt is a publicly available database for the retrieval of protein sequences, available at (http://www.uniprot.org) [48].
2. Clustal Omega multiple sequence alignment tool is available at (https://www.ebi.ac.uk/Tools/msa/clustalo/) [49].


3. BepiPred 2.0 linear B cell analysis tool is available in IEDB database at (http://tools.iedb.org/bcell/) [50].
4. Emini surface accessibility analysis tool is available in IEDB database at (http://tools.iedb.org/bcell/) [51].
5. Kolaskar and Tongaonkar antigenicity analysis tool is available in IEDB database at (http://tools.iedb.org/bcell/) [52].
6. Vaxijen 2.0 antigenicity prediction server is available at (http://www.ddg-pharmfac.net/vaxijen/VaxiJen/VaxiJen.html) [38].
7. ToxinPred toxicity analysis tool is available at (https://webs.iiitd.edu.in/raghava/toxinpred/design.php) [42].
8. AllerTOP v. 2.0 allergenicity prediction server is available at (https://www.ddg-pharmfac.net/AllerTOP/) [40].
9. MHC-I prediction tool is available in IEDB database at (http://tools.iedb.org/mhci/) [22].
10. MHC-II prediction tool is available in IEDB database at (http://tools.iedb.org/mhcii/) [53, 54].
11. Interferon gamma inducing MHC-II peptide prediction tool is available at (http://crdd.osdd.net/raghava/ifnepitope/design.php) [55].
12. Population coverage tool in IEDB database is available at (http://tools.iedb.org/population/) [56].

3 Methods
In this section, we explain how to identify the B and T cell epitopes of a given antigen sequence, analyze the predicted peptides, and calculate population coverage. The complete workflow of this chapter is given in Fig. 1.

3.1 Retrieval of VZV Glycoprotein B Sequences

The desired antigenic protein sequence can be retrieved from UniProt, which is available at (http://www.uniprot.org) [48]. In this chapter, the model antigen sequences, those of varicella-zoster virus envelope glycoprotein B, are retrieved from UniProt (see Note 1). For your particular antigenic sequence of interest, follow the steps listed below:
1. Enter the protein name in the search box.
2. Select the specific proteins relevant to your research from the search results by checking the box to the left of each sequence entry.
3. Click download and select the FASTA (canonical) format to obtain the selected sequences in FASTA format.
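Once the FASTA file has been downloaded in step 3, it is often convenient to load the sequences programmatically before alignment and epitope prediction. The following Python sketch parses multi-FASTA text into a dictionary keyed by the first word of each header line. The function name `parse_fasta` and the two example records are hypothetical; the sequences shown are placeholders, not VZV glycoprotein B.

```python
# Illustrative FASTA parser: maps each record identifier to its sequence.
# The example records below are placeholders, not real UniProt entries.

from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse multi-FASTA text into {identifier: sequence}."""
    records: Dict[str, str] = {}
    identifier = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if identifier is not None:
                records[identifier] = "".join(chunks)
            # Use the first whitespace-delimited word of the header as the key
            identifier = line[1:].split()[0]
            chunks = []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        records[identifier] = "".join(chunks)
    return records


if __name__ == "__main__":
    example = """>entry1 hypothetical glycoprotein, strain A
MKTAYIAKQR
QISFVKSH
>entry2 hypothetical glycoprotein, strain B
MKTAYIAKQK
"""
    for name, sequence in parse_fasta(example).items():
        print(name, len(sequence), sequence)
```

Sequences that are wrapped over several lines in the file are joined into a single string, which is the form expected by most downstream tools.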


Fig. 1 Complete workflow of this chapter

3.2 Multiple Sequence Alignment and Identification of Conserved Regions

Multiple sequence alignment is important for identifying the type and position of mutations in protein sequences. It is also a very important tool for the comparison of protein and DNA sequences. Clustal Omega, which is publicly available at (https://www.ebi.ac.uk/Tools/msa/clustalo/), can be used for multiple sequence alignment [49]. Clustal Omega uses seeded guide trees to align three or more sequences. The output generated by Clustal Omega can be read as follows: an asterisk (*) marks positions that have a single, fully conserved residue; a colon (:) indicates residues with highly similar properties; and a period (.) indicates residues with weakly similar properties. This step is especially important for the identification of conserved regions and the selection of the sequence(s) to be analyzed further (see Note 2). To perform multiple sequence alignment using the Clustal Omega tool, follow these steps:


1. Paste the amino acid sequences retrieved in the first step.
2. Type “>” followed by the name of each entry. Name the entries either by sequence ID or by the way they are going to be recognized.
3. Run the multiple sequence alignment.
4. Identify conserved regions and mutations.
5. Select the protein sequence to be analyzed further if you are going to use a single amino acid sequence.

3.3 Linear B Cell Epitope Analysis

The prediction of B cell epitopes is important for eliciting a strong immune response. In this chapter, we show how to predict B cell epitopes using tools available in IEDB and screen the epitopes for their antigenicity, toxicity, and allergenicity. Various prediction methods are available on the IEDB website for the identification of B cell epitopes, but a consensus of methods is recommended (see Note 3). For this purpose, select multiple prediction methods and identify the common epitope residues for further analysis. In this chapter, for the identification of B cell epitopes, we use the BepiPred 2.0 linear B cell analysis, Emini surface accessibility, and Kolaskar and Tongaonkar antigenicity prediction tools [50–52]. The B cell prediction can be done as follows:
1. Go to “Antigen Sequence Properties” under “B Cell Epitope Prediction” on the IEDB website (https://www.iedb.org/).
2. Paste your antigenic sequence in the search box.
3. Select the desired prediction method by clicking on the check box.
4. Click on submit.
5. Use the default threshold value unless your query calls for a specific one.
6. Repeat the process for the other B cell prediction tools.
7. After obtaining the results, identify the epitopes commonly predicted by the different prediction methods.
After the identification of common regions, analyze the epitopes for their antigenicity using the Vaxijen 2.0 tool. Vaxijen is an alignment-free prediction tool that predicts the antigenicity of peptides based on their physicochemical properties and the selected target organism [38].
1. Go to the http://www.ddg-pharmfac.net/vaxijen/VaxiJen/VaxiJen.html website.
2. Paste your previously identified common epitopes.
3. Choose your target organism from the dropdown menu.


4. Run analysis.
5. Select the B cell epitopes that are identified as antigenic.
Toxicity prediction is important for minimizing side effects and designing a vaccine with non-toxic components. To predict the toxicity of the peptides, the ToxinPred server, available at (https://webs.iiitd.edu.in/raghava/toxinpred/design.php), can be used [42]. This server has various prediction methods and features. If you would like to see how different point mutations affect the toxicity of a given peptide, click “Design Peptide” on the main menu and analyze a single peptide sequence and its possible point mutations. If your search query contains multiple epitopes, which is generally the case in designing a multi-epitope vaccine, then you should use “Batch Submission”.
1. Go to https://webs.iiitd.edu.in/raghava/toxinpred/index.html and select “Batch Submission” from the menu above.
2. Start entering your search query by typing “>” followed by the name of the epitope (e.g., ep1, epitope2). Move to the next line and paste your epitope sequence. Your search query should look like the entry below:
>Ep1
YVYYEDY
>Ep2
AERQESKARKKNK
>Ep3
KSQDAETKP
3. Select either the support vector machine (SVM) or the quantitative matrix method (the default setting is SVM).
4. If applicable, choose the desired physicochemical properties to be displayed.
5. Run analysis.
6. Select the epitopes that are identified as non-toxic.
Finally, analyze the allergenicity of the predicted epitopes using the AllerTOP v. 2.0 server to avoid unwanted allergic reactions [40].
1. Go to the https://www.ddg-pharmfac.net/AllerTOP/ server.
2. Enter your peptides in the search box in one-letter amino acid code.
3. Run the analysis.
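The batch query format shown for ToxinPred and the final selection of peptides that pass all three screens can be prepared with a few lines of Python. The sketch below builds the ">name / sequence" text from a dictionary of epitopes and keeps only those peptides flagged as antigenic, non-toxic, and non-allergenic. The functions `to_batch_query` and `select_candidates` are hypothetical helpers, and the screening flags in the example are invented placeholders, not results from Vaxijen, ToxinPred, or AllerTOP.

```python
# Illustrative helpers for preparing a batch query and combining screening results.
# The flags in `example_results` are placeholders, not real server output.

from typing import Dict, List


def to_batch_query(epitopes: Dict[str, str]) -> str:
    """Format epitopes as '>name' header lines, each followed by its sequence."""
    lines = []
    for name, sequence in epitopes.items():
        lines.append(f">{name}")
        lines.append(sequence)
    return "\n".join(lines)


def select_candidates(results: Dict[str, Dict[str, bool]]) -> List[str]:
    """Keep epitopes that are antigenic, non-toxic, and non-allergenic."""
    return [
        name
        for name, flags in results.items()
        if flags["antigenic"] and not flags["toxic"] and not flags["allergen"]
    ]


if __name__ == "__main__":
    # Epitope names and sequences taken from the example query in the text
    epitopes = {"Ep1": "YVYYEDY", "Ep2": "AERQESKARKKNK", "Ep3": "KSQDAETKP"}
    print(to_batch_query(epitopes))

    # Hypothetical screening outcomes, entered by hand from the three servers
    example_results = {
        "Ep1": {"antigenic": True, "toxic": False, "allergen": False},
        "Ep2": {"antigenic": True, "toxic": False, "allergen": True},
        "Ep3": {"antigenic": False, "toxic": False, "allergen": False},
    }
    print("Selected:", select_candidates(example_results))
```

Recording the outcome of each server in one structure like this makes it easier to trace why a given peptide was retained or discarded.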


After completing the prediction and characterization of B cell epitopes, select the peptides that are predicted to be B cell epitopes by the IEDB analysis tools and predicted to be antigenic, non-allergenic, and non-toxic by Vaxijen 2.0, AllerTOP, and ToxinPred, respectively [38, 40, 42] (see Note 10).

3.4 Prediction of MHC Class I Epitopes

T cells recognize antigens when they are displayed on MHC molecules. MHC-I and MHC-II molecules present the antigens to CD8+ and CD4+ cells, respectively [11]. Upon recognition, the naïve T cells become activated and induce T cell-mediated immunity [57]. Although MHC-binding epitopes are attractive targets for designing peptide vaccines, MHC molecules are highly polymorphic and experimental identification is cost-intensive. In this case, in silico prediction tools come in handy for the selection of MHC epitopes. Here, we describe prediction with the MHC-I binding prediction tool in IEDB database, available at (http://tools.iedb.org/mhci/) (see Note 6).
1. Go to “MHC-I binding” under “T cell epitope prediction” on the IEDB website (https://www.iedb.org/).
2. Paste the antigenic sequence in the search box.
3. Choose the prediction method (see Note 5).
4. Choose the MHC source species.
5. Choose the MHC alleles or reference set and the length of the peptides.
6. Specify the output by selecting how the peptides are sorted and the output format.
7. Select the epitopes that have a percentile rank lower than 1.0.
Screen the peptides that fall within the cut-off value for antigenicity, allergenicity, and toxicity as described before for B cell analysis, and select the peptides with the desired characteristics. More detail on the use of the MHC-I prediction tool is available on the IEDB website at (http://tools.iedb.org/mhci/help/).
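The percentile rank cut-off in step 7 can be applied to a downloaded result table with a short script. The Python sketch below reads comma-separated text and keeps the rows whose percentile rank is below a chosen cut-off (1.0 here for MHC-I). The column names `allele`, `peptide`, and `percentile_rank`, as well as the example rows, are assumptions made for illustration; the headers in an actual IEDB export may differ and should be checked before use.

```python
# Illustrative filter for prediction results exported as CSV text.
# Column names and example rows are hypothetical, not actual IEDB output.

import csv
import io
from typing import Dict, List


def filter_by_rank(csv_text: str, cutoff: float = 1.0,
                   rank_column: str = "percentile_rank") -> List[Dict[str, str]]:
    """Return the rows whose percentile rank is strictly below the cut-off."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if float(row[rank_column]) < cutoff]


if __name__ == "__main__":
    example_csv = (
        "allele,peptide,percentile_rank\n"
        "HLA-A*02:01,PEPTIDEAA,0.4\n"
        "HLA-A*02:01,PEPTIDEBB,2.5\n"
        "HLA-B*07:02,PEPTIDECC,0.9\n"
    )
    for row in filter_by_rank(example_csv, cutoff=1.0):
        print(row["allele"], row["peptide"], row["percentile_rank"])
```

The cut-off is passed as a parameter, so the same function can be reused with a different threshold for other prediction tools.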

3.5 MHC-II Binding Prediction

The IEDB also contains an MHC-II binding prediction tool, which works in the same way as the MHC-I binding prediction tool described previously. The instructions are very similar, and the steps are listed below.

1. Go to “MHC II binding” under “T cell epitope prediction” on the IEDB website (https://www.iedb.org/).
2. Paste the antigenic sequence into the search box.
3. Choose the prediction method.
4. Choose the MHC source species/locus.

Development of Subunit Vaccine by Immunoinformatics Tools


5. Choose the MHC alleles or reference set and the length.
6. Specify the output by selecting how the peptides are sorted and the output format.
7. Select the epitopes with a percentile rank lower than 10.0. The lower the percentile rank, the better the binding affinity (see Notes 7 and 8).

CD4+ helper T cells activated by MHC-II molecules play an important role in the production of IFN-γ, which has pleiotropic immunoregulatory functions [58]. IFN-γ-inducing epitopes can be screened as follows:

1. Go to the http://crdd.osdd.net/raghava/ifnepitope/design.php website.
2. Enter the peptide sequence.
3. Select the prediction method.
4. Run the analysis.
5. Select the peptides that can trigger the production of IFN-γ.

3.6 Population Coverage Analysis

High population coverage is important for designing an effective vaccine. Therefore, we describe here how to calculate the population coverage of the selected epitope set. After identifying the MHC-I and MHC-II epitopes, subject them to population coverage analysis with the IEDB population coverage tool, available at http://tools.iedb.org/population/ [56].

1. Go to http://tools.iedb.org/population/.
2. Set the total number of epitopes you have.
3. Select the areas or populations you want to target. You can also select “World” to analyze the world coverage of the epitopes.
4. MHC-I and MHC-II epitope sets can be analyzed separately or collectively; select the appropriate calculation option by checking the box on the left.
5. Type in your epitopes and their MHC-restricted alleles and run the analysis (see Notes 4 and 9).
6. The results section gives the percent coverage and the individual epitope hits (see Note 11).

4 Notes

1. Although the protein name is given correctly, UniProt [48] sometimes extends the search results. For example, even when you search for envelope glycoprotein B, envelope glycoprotein E can appear in the search results. Therefore, always check the name, reference, and protein length. Avoid selecting protein sequences that contain unknown amino acids, denoted as X. If your desired amino acid sequence is not available in UniProt, you can also use the NCBI [59] database, available at https://www.ncbi.nlm.nih.gov/. If you use the NCBI database, make sure to check for any updates to the protein sequence.
2. Clustal Omega does not accept sequences that have the same name or whose names are identical up to the first space. When you download sequences in FASTA format, their names may differ from those shown in the database. Therefore, always name your sequences the way you want them to be recognized in your results. Nomenclature becomes more important as the number of sequences to be analyzed increases. Additionally, for clearer results, try to analyze proteins of similar lengths; otherwise, the similarity between the proteins decreases and mutations become harder to detect.
3. Sometimes single amino acids appear in the results table; these should not be considered B cell epitopes on their own, since single amino acid residues cannot trigger a strong immune response. However, they can be searched for within the epitope sequences predicted by other methods. You can choose to use a single B cell epitope analysis tool, but IEDB recommends a consensus of the methods. Changing the threshold changes the sensitivity and specificity of the tool. Sensitivity and specificity as a function of the threshold can be found in the help section of the tool (http://tools.iedb.org/bcell/help/). The results may change as new algorithms and prediction methods are developed.
4. If you work on an endemic disease, investigate the common alleles in that region and specifically include them in your allele list when performing the MHC-I and MHC-II analyses. This is important both for reaching high coverage in the targeted population and for identifying the epitopes of these MHC molecules.
5.
If you would like to proceed with another prediction method other than IEDB recommended and consensus, the results will be listed based on their IC50 values. In this case, the peptides with IC50 values results.txt

BepiPred scores each position with the propensity of being part of a B cell epitope. By default, the threshold is set at 0.5, so any position with a score above that will be considered as part of an epitope. 13. To make sure that the predicted epitopes are indeed located on the surface of a protein, we will calculate the accessibility of each residue. In this protocol, we will use NACCESS to calculate the relative solvent accessibility of each position. It only requires a PDB file as an input, for which we will use the T. cruzi AlphaFold models. To obtain and install NACCESS, follow the instructions on its site (see Notes 30 and 31). It is simply used by the command:


Albert Ros-Lucas et al.

This will output several files, but we are only interested in the .rsa results. Each residue shows several values in different columns, but we will only use the “All-atoms REL” column, which gives the relative accessibility, in percent, of all the atoms in a residue. Residues with at least 50% RSA will be considered good candidates for being part of B cell epitopes.

14. Finally, we will compare the results of the entropy calculation (conservation), the RSA (accessibility), and BepiPred (epitope prediction). Since the entropy has been calculated from the multiple sequence alignments, while the RSA and BepiPred results have been obtained from the unaligned sequences, we must now “realign” these results. You can either add gaps using the aligned reference sequence as a model, or align the sequence into the MSA by using MUSCLE’s profile–profile alignment (see Note 32). Finally, analyze every position of each sequence. Regions that fulfill all three requirements will be considered potential B cell epitopes. While B cell epitopes do not have a canonical length, for practical purposes we will keep sequences with a minimum length of 7 amino acids.

3.7 Homology Analysis

Given that epitope sequences similar to host proteins could trigger cross-reactivity or show low immunogenicity due to mechanisms of immunotolerance, it is useful to study the sequence homology between potential epitopes and the proteome of Homo sapiens.

1. Launch a BLASTP search of the predicted epitopes to assess the similarity between these sequences and those of the host. You can run this search either locally or on the NIH BLAST site. Search against the human non-redundant (nr) protein sequences database with the following parameters: PAM30 matrix, word size of 2, gap open and extend costs of 9 and 1, no compositional adjustments, and an expected threshold of 10,000.
2. If you have BLAST installed locally, you can also run a BLASTP search against the human microbiome, since it heavily influences the immune response. For this, create a new BLAST database using sequences from the Human Microbiome Project (HMP) (see Note 33). Run the BLASTP search locally using this database and the same parameters as above.
3. An epitope’s identity can be measured as the number of identical positions over the aligned query length. If any epitope finds a hit with an identity higher than 70% in either the human or the HMP BLAST searches, discard it.
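The identity filter of step 3 can be sketched in a few lines once the aligned query and subject strings of each hit are available. This is an illustrative sketch under stated assumptions: “aligned query length” is interpreted here as the number of non-gap query positions in the alignment, the input structure (a dictionary of alignment pairs per epitope) is hypothetical, and parsing of the BLAST output itself is not shown.

```python
# Sketch: percent identity over the aligned query length, then a 70% filter.
# The input layout and function names are assumptions for illustration.

def identity_over_query(aligned_query, aligned_subject):
    """Percent of aligned (non-gap) query positions identical to the subject."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned strings must have the same length")
    query_positions = sum(q != "-" for q in aligned_query)
    if query_positions == 0:
        return 0.0
    matches = sum(
        q == s and q != "-" for q, s in zip(aligned_query, aligned_subject)
    )
    return 100.0 * matches / query_positions


def keep_epitopes(hits, cutoff=70.0):
    """Keep epitopes whose best hit does not exceed the identity cutoff.

    hits maps each epitope to a list of (aligned_query, aligned_subject) pairs;
    an epitope without hits is kept.
    """
    kept = []
    for epitope, alignments in hits.items():
        best = max((identity_over_query(q, s) for q, s in alignments), default=0.0)
        if best <= cutoff:
            kept.append(epitope)
    return kept
```

Epitopes are discarded only when the identity is strictly higher than the cutoff, mirroring the wording of step 3; the same function can be applied to the human and the HMP searches in turn.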

Computational Prediction of Trypanosoma cruzi Epitopes Toward the. . .

4 Notes

1. A script can be used instead of downloading each strain manually. For this purpose, you can design a web scraper using the https://tritrypdb.org/common/downloads/Current_Release/ base URL instead of the web app.
2. You can use the strain name and the gene ID (e.g., >TcruziCLBrener_TcCLB.403481.10). It is important not to use spaces in the headers, so use underscores (“_”) or similar to join both identifiers.
3. For gff files, the strain name and the locus tag can be used as headers (e.g., >TcruziBrazilA4_TcBrA4_Chr1-6-10737971073507).
4. You can install conda using the Anaconda or Miniconda installers. Anaconda is more user-friendly, but has a much heavier installation that contains many packages. Miniconda is used entirely via the command prompt, but is much more lightweight.
5. The identity cutoff and the word size can be set to lower thresholds, which will result in fewer clusters, but with more sequences in them. For a guide on which word size to use for each cutoff, consult the CD-HIT manual.
6. If the input fasta is too large and the program collapses your computer, try adjusting the and parameters to limit the resources consumed.
7. Additional parameters can be useful, and the full list can be displayed from the command line. For example, the s parameter (see Note 12) can specify that the shorter sequences of a cluster need to be at least 75% of the length of the representative sequence in that cluster. This can be very useful if there are many protein fragments to discard, which would otherwise distort the alignments.
8. This file is usually found in the installation folder of CD-HIT, and you can also find it in its GitHub repository. However, you will need to install Perl on your system.
9. Given the large number of files, some scripting will be necessary to align all the fasta files. Parallelization, if possible, is recommended.
10. The BioPython library for Python contains the pairwise2 module, which you can use to perform quick global pairwise alignments and return a score for each one.
11.
The Shannon entropy equation can be implemented manually, or using an existing library or function for the programming language of choice. In Python, the SciPy library has an entropy function that can be used (remember to set the log to base 2).
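As an illustration of the manual implementation mentioned in Note 11, the per-column Shannon entropy (log base 2) of a multiple sequence alignment can be sketched with the standard library only. This sketch treats the gap character as one more symbol, which is an assumption; terminal gaps caused by fragments may need separate handling (see Note 12). The function names and the toy alignment are hypothetical.

```python
# Sketch: Shannon entropy (in bits) for each column of an alignment.
# Gaps ("-") are counted as an additional symbol, which is an assumption.
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy, in bits, of the symbols in one alignment column."""
    counts = Counter(column)
    total = sum(counts.values())
    # p * log2(1/p) is the same as -p * log2(p), summed over observed symbols.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def alignment_entropy(aligned_seqs):
    """Return one entropy value per alignment column."""
    if len({len(seq) for seq in aligned_seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [column_entropy(col) for col in zip(*aligned_seqs)]


toy_alignment = ["AK", "AR", "AK", "AR"]  # hypothetical two-column alignment
print(alignment_entropy(toy_alignment))
```

A fully conserved column gives an entropy of 0, and the value grows with variability, so positions can be masked by comparing each value with the chosen entropy threshold.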


12. Multiple sequence alignments that include proteins shorter than the rest (e.g., short ORFs or other protein fragments) will cause strange results in the N- and/or C-terminal regions. To avoid this, you can exclude these sequences when clustering using the s parameter, or disregard the artificial gaps generated by these sequences at both termini. Genuine gaps arising from indels should still be taken into consideration.
13. The output from the entropy calculation can be stored as the entropy for each position (e.g., in a CSV file) or as an aligned sequence with masked positions. The former can be useful if you want to try different entropy thresholds.
14. Given that the allowed number of sequences for the online site is 100, it is better to install the software and run it locally with a single fasta file. Alternatively, you can split the sequences into several fasta files.
15. Since one protein can provide several k-mers, it is important to keep track of the origin of each of them. A good practice is to create a new fasta file with the newly generated 9-mers, adding underscores with enumeration to the header of each one and keeping the original header information intact.
16. Do not include headers, and keep the same order as in the fasta file generated as mentioned in Note 15.
17. These scores follow the same order as the input file, so they can be easily correlated with each 9-mer and thus with its protein of origin.
18. Duplicated 9-mers from different (or the same) proteins may appear. The MHC I predictor scores identical sequences equally, so if you have many repeated 9-mers you can predict the score of just one of them.
19. Alternatively, you can use specific HLA alleles for a more regional approach. You can consult allele frequencies at http://www.allelefrequencies.net.
20.
You can launch 27 different commands (if you are using the 27 reference alleles), with a different allele each time, which is useful if you are interested in parallelizing execution. Alternatively, you can launch a single command specifying all the alleles and lengths separated by commas. Remember that each allele must be matched by a specified epitope length:
./src/predict_binding.py IEDB_recommended HLA-A*01:01,HLA-A*02:01 9,9 input.fasta > results.txt
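The bookkeeping described in Notes 15 and 18 (enumerated headers for each 9-mer, and scoring each distinct 9-mer only once) can be sketched as follows. The function names and the example sequence are hypothetical; reading and writing of the fasta files is left out.

```python
# Sketch: split proteins into k-mers with enumerated headers (Note 15) and
# group identical k-mers so that each is predicted only once (Note 18).
# Function names and the example sequence are assumptions for illustration.

def kmers_with_headers(records, k=9):
    """Yield (header_with_index, kmer) for every k-mer of each (header, seq)."""
    for header, seq in records:
        for start in range(len(seq) - k + 1):
            # The original header is kept intact; "_<n>" marks the k-mer order.
            yield f"{header}_{start + 1}", seq[start:start + k]


def group_identical(pairs):
    """Map each distinct k-mer to the list of headers where it occurs."""
    origins = {}
    for header, kmer in pairs:
        origins.setdefault(kmer, []).append(header)
    return origins


records = [("TcruziCLBrener_TcCLB.403481.10", "MKTAYIAKQRQ")]  # toy sequence
pairs = list(kmers_with_headers(records, k=9))
for header, kmer in pairs:
    print(f">{header}\n{kmer}")
```

Predicting only the keys of the dictionary returned by `group_identical` and copying the scores back to every listed header avoids repeated predictions while preserving the link to the protein of origin.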


21. Some peptides might score better percentiles in some alleles, but not in others. Strong binders in several alleles are preferable, since they will theoretically be more widely recognized in the population.
22. Since the recommended method for each allele differs, it is better to always specify as method.
23. Adjust the number of alleles needed to accommodate a reasonable number of CD4+ T cell epitopes. Remember that space in a genetic construct is limited, so strike a balance between the different epitope types and space. In addition, it is also recommended to select CD4+ epitopes that coincide with B cell epitopes.
24. It is suggested to install SignalP in a separate Python environment, to avoid conflicts. You can easily create them using or
25. If there are too many entries and the prediction fails, split your fasta file accordingly.
26. For convenience, you can edit the names of every step and nested strategy.
27. Further filtering could be done by, for example, excluding genes that are upregulated in epimastigote stages, and/or selecting those that show high evidence of expression in trypomastigote and amastigote stages.
28. BioPython can parse PDB files using the Bio.PDB.PDBParser function. Iterate through each atom of each residue, check if they are alpha carbons (“CA”) with the .get_id() method, and obtain the B value using .get_bfactor().
29. Follow the readme instructions of each package to install it; it is also highly recommended to create a new virtual environment just for BepiPred with Python version 3.6 and the specific versions of the packages described in the BepiPred installation instructions. You will also need to install NetSurfP [33], specifically the 1.0d version, along with its data.tar.gz and nr70_db.tar.gz files to work. They can all be downloaded from https://services.healthtech.dtu.dk/service.php?NetSurfP-1.0. In addition, you will need a blastpgp installation from the unsupported BLAST [24] suite version 2.2.18, which you can download from https://ftp.ncbi.nlm.nih.gov/blast/executables/legacy.NOTSUPPORTED/2.2.18/.
30. To obtain NACCESS, you will have to contact Prof. Simon Hubbard and send a signed agreement via postal mail to The University of Manchester, in the United Kingdom. Since this can take time, take it into account when planning your project.


31. Some problems are known to occur in systems with modern Fortran compilers. Guidelines on how to solve this are included in the NACCESS readme instructions.
32. More information on profile–profile alignment can be found in the MUSCLE manual and/or at https://drive5.com/muscle/manual/addtomsa.html.
33. Upon installing BLAST locally, you will also have access to the makeblastdb command. The easiest way to use it is to create a single fasta file with all the sequences you want to include in this database.
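Note 28 describes reading the B value of each alpha carbon with BioPython. As a dependency-free illustration of the same idea, the fixed-column ATOM records of a PDB file can be read directly; in AlphaFold models, this column holds the per-residue confidence score (pLDDT). This is a minimal sketch, not a replacement for a full parser: the function name is hypothetical, and alternate locations, insertion codes, and multiple models are not handled.

```python
# Sketch: read the B-factor column of alpha carbons from PDB ATOM records.
# Column positions follow the fixed-width PDB format; the function name is an
# assumption, and only the simplest case (one model, no altlocs) is handled.

def ca_bfactors(pdb_lines):
    """Return (chain, residue_number, residue_name, b_factor) per CA atom."""
    values = []
    for line in pdb_lines:
        # Atom name: columns 13-16; residue name: 18-20; chain: 22;
        # residue number: 23-26; B-factor: 61-66 (1-based columns).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(
                (line[21], int(line[22:26]), line[17:20].strip(), float(line[60:66]))
            )
    return values


# Example usage with a hypothetical file name:
# with open("model.pdb") as handle:
#     for chain, number, name, b in ca_bfactors(handle):
#         print(chain, number, name, b)
```

Because only ATOM records are read, calcium ions stored as HETATM records (whose atom name is also “CA”) are not mistaken for alpha carbons.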

Acknowledgments

We acknowledge support from the Spanish Ministry of Science and Innovation and State Research Agency through the “Centro de Excelencia Severo Ochoa 2019-2023” Program (CEX2018-000806-S), and support from the Generalitat de Catalunya through the CERCA Program. This research was also supported by CIBER (Consorcio Centro de Investigación Biomédica en Red, CB 2021), Instituto de Salud Carlos III, Ministerio de Ciencia e Innovación and Unión Europea, NextGenerationEU. A.R-L., J.G. and J.A-P. belong to the AGAUR Grup de Recerca Consolidat 2021SGR01562.

References

1. Beaumier CM, Gillespie PM, Strych U et al (2016) Status of vaccine research and development of vaccines for Chagas disease. Vaccine 34:2996–3000. https://doi.org/10.1016/j.vaccine.2016.03.074
2. Pinazo M-J, Gascon J (2015) Chagas disease: from Latin America to the world. Rep Parasitol 4:7–14. https://doi.org/10.2147/RIP.S57144
3. Gascon J, Bern C, Pinazo M-J (2010) Chagas disease in Spain, the United States and other non-endemic countries. Acta Trop 115:22–27. https://doi.org/10.1016/j.actatropica.2009.07.019
4. Aldasoro E, Posada E, Requena-Méndez A et al (2018) What to expect and when: benznidazole toxicity in chronic Chagas’ disease treatment. J Antimicrob Chemother 73:1060–1067. https://doi.org/10.1093/jac/dkx516
5. Jackson Y, Alirol E, Getaz L et al (2010) Tolerance and safety of nifurtimox in patients with chronic Chagas disease. Clin Infect Dis 51:e69–e75. https://doi.org/10.1086/656917
6. Teh-Poot C, Dumonteil E (2019) Mining Trypanosoma cruzi genome sequences for antigen discovery and vaccine development. In: Gómez KA, Buscaglia CA (eds) T. cruzi infection. Springer New York, New York, pp 23–34
7. Vita R, Mahajan S, Overton JA et al (2019) The Immune Epitope Database (IEDB): 2018 update. Nucleic Acids Res 47:D339–D343. https://doi.org/10.1093/nar/gky1006
8. Galanis KA, Nastou KC, Papandreou NC et al (2021) Linear B-cell epitope prediction for in silico vaccine design: a performance review of methods available via command-line interface. Int J Mol Sci 22:3210. https://doi.org/10.3390/ijms22063210
9. Sullivan NL, Eickhoff CS, Sagartz J, Hoft DF (2015) Deficiency of antigen-specific B cells results in decreased Trypanosoma cruzi systemic but not mucosal immunity due to CD8 T cell exhaustion. J Immunol 194:1806–1818. https://doi.org/10.4049/jimmunol.1303163
10. Buschiazzo A, Muiá R, Larrieux N et al (2012) Trypanosoma cruzi trans-sialidase in complex with a neutralizing antibody: structure/function studies towards the rational design of inhibitors. PLoS Pathog 8:e1002474. https://doi.org/10.1371/journal.ppat.1002474
11. Varadi M, Anyango S, Deshpande M et al (2022) AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res 50:D439–D444. https://doi.org/10.1093/nar/gkab1061
12. Aslett M, Aurrecoechea C, Berriman M et al (2010) TriTrypDB: a functional genomic resource for the Trypanosomatidae. Nucleic Acids Res 38:D457–D462. https://doi.org/10.1093/nar/gkp851
13. Amos B, Aurrecoechea C, Barba M et al (2022) VEuPathDB: the eukaryotic pathogen, vector and host bioinformatics resource center. Nucleic Acids Res 50:D898–D911. https://doi.org/10.1093/nar/gkab929
14. The NIH HMP Working Group, Peterson J, Garges S et al (2009) The NIH human microbiome project. Genome Res 19:2317–2323. https://doi.org/10.1101/gr.096651.109
15. Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22:1658–1659. https://doi.org/10.1093/bioinformatics/btl158
16. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797. https://doi.org/10.1093/nar/gkh340
17. Nielsen M, Lundegaard C, Lund O, Keşmir C (2005) The role of the proteasome in generating cytotoxic T-cell epitopes: insights obtained from improved predictions of proteasomal cleavage. Immunogenetics 57:33–41. https://doi.org/10.1007/s00251-005-0781-7
18. Besser H, Louzoun Y (2018) Cross-modality deep learning-based prediction of TAP binding and naturally processed peptide. Immunogenetics 70:419–428. https://doi.org/10.1007/s00251-018-1054-6
19. Reynisson B, Alvarez B, Paul S et al (2020) NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data. Nucleic Acids Res 48:W449–W454. https://doi.org/10.1093/nar/gkaa379
20. Teufel F, Almagro Armenteros JJ, Johansen AR et al (2022) SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat Biotechnol 40:1023–1025. https://doi.org/10.1038/s41587-021-01156-3
21. Gíslason MH, Nielsen H, Almagro Armenteros JJ, Johansen AR (2021) Prediction of GPI-anchored proteins with pointer neural networks. Curr Res Biotechnol 3:6–13. https://doi.org/10.1016/j.crbiot.2021.01.001
22. Jespersen MC, Peters B, Nielsen M, Marcatili P (2017) BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes. Nucleic Acids Res 45:W24–W29. https://doi.org/10.1093/nar/gkx346
23. Hubbard SJ, Thornton JM (1993) ‘NACCESS’, computer program. Department of Biochemistry and Molecular Biology, University College, London
24. Altschul S (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402. https://doi.org/10.1093/nar/25.17.3389
25. Resource Coordinators NCBI, Agarwala R, Barrett T et al (2018) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 46:D8–D13. https://doi.org/10.1093/nar/gkx1095
26. Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:623–656. https://doi.org/10.1002/j.1538-7305.1948.tb00917.x
27. Alonso-Padilla J, Lafuente EM, Reche PA (2017) Computer-aided design of an epitope-based vaccine against Epstein-Barr virus. J Immunol Res 2017:1–15. https://doi.org/10.1155/2017/9363750
28. Stewart JJ, Lee CY, Ibrahim S et al (1997) A Shannon entropy analysis of immunoglobulin and T cell receptor. Mol Immunol 34:1067–1082. https://doi.org/10.1016/S0161-5890(97)00130-2
29. Weiskopf D, Angelo MA, de Azeredo EL et al (2013) Comprehensive analysis of dengue virus-specific responses supports an HLA-linked protective role for CD8+ T cells. Proc Natl Acad Sci U S A 110:E2046–E2053. https://doi.org/10.1073/pnas.1305227110
30. Greenbaum J, Sidney J, Chung J et al (2011) Functional classification of class II human leukocyte antigen (HLA) molecules reveals seven different supertypes and a surprising degree of repertoire sharing across supertypes. Immunogenetics 63:325–335. https://doi.org/10.1007/s00251-011-0513-0
31. Jumper J, Evans R, Pritzel A et al (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596:583–589. https://doi.org/10.1038/s41586-021-03819-2
32. Tunyasuvunakool K, Adler J, Wu Z et al (2021) Highly accurate protein structure prediction for the human proteome. Nature 596:590–596. https://doi.org/10.1038/s41586-021-03828-1
33. Petersen B, Petersen TN, Andersen P et al (2009) A generic method for assignment of reliability scores applied to solvent accessibility predictions. BMC Struct Biol 9:51. https://doi.org/10.1186/1472-6807-9-51

Chapter 33

Computational Vaccine Design for Common Allergens

Nandini Ghosh, Gaurab Sircar, and Sudipto Saha

Abstract

In this chapter, the steps of designing candidate vaccine molecules for allergen-specific immunotherapy (AIT) using immunoinformatics are described. The most modern approach to AIT deals with carrier-bound B cell epitope and multi-epitope vaccine molecules. The strategy for designing these molecules and the bioinformatics tools and servers used for that purpose are discussed in detail here.

Key words Vaccine, Non-anaphylactic peptide, Multi-epitope, Allergen-specific immunotherapy, Carrier-bound B cell epitope, Hypoallergenic

Nandini Ghosh and Gaurab Sircar contributed equally.
Pedro A. Reche (ed.), Computational Vaccine Design, Methods in Molecular Biology, vol. 2673, https://doi.org/10.1007/978-1-0716-3239-0_33, © The Author(s), under exclusive license to Springer Science+Business Media, LLC, part of Springer Nature 2023

1 Introduction

Allergy is a chronic disease with variable symptoms, including rhinitis, shortness of breath, atopic dermatitis, and even life-threatening anaphylaxis, that affects a large proportion of the world population. Allergic individuals produce specific IgE against environmental allergens, whereas non-allergic subjects produce IgG without developing allergic inflammation upon allergen exposure. Currently, the treatment of allergy relies on anti-histamines and corticosteroids for immediate relief, but the only disease-modifying approach is allergen-specific immunotherapy (AIT), which is often termed “therapeutic vaccination.” The difference between AIT and prophylactic vaccines is that prophylactic vaccines generate antibody and T cell-mediated adaptive immune responses well in advance, so that the body can fight the pathogen during a future infection, whereas in allergy the immune system is hyper-reactive, causing inflammation and allergy. The aim of AIT is to generate counter-immune responses by producing neutralizing IgG antibodies. Several strategies for AIT have been used, of which the use of crude antigenic extracts came first, but the major problems of using


Nandini Ghosh et al.

crude extracts are the risk of life-threatening anaphylaxis, reduced specificity, batch-to-batch variation of allergens, and the presence of unwanted substances in the natural extracts. The first attempt to improve the quality and reduce the side effects of AIT was made by Sledge in 1938 [1]. He noticed that the controlled release of allergens in the body, achieved by adsorbing them onto the surface of aluminum hydroxide, reduced the risk of severe systemic anaphylaxis. Subsequently, several experiments were done to reduce allergenicity by chemical modifications, such as conjugation of allergen extracts to polyethylene glycol (PEG) [2] or denaturation of allergenic proteins to produce allergoids. Afterward, molecular approaches were taken for AIT. First, short allergen-specific T cell epitopes that are not IgE reactive were used for AIT, followed by recombinant hypoallergens, native-like recombinant allergens, immunomodulatory compound-coupled allergens, virus-like particle-coupled allergens, allergens encoded by nucleic acids, and finally carrier-bound B cell epitope peptides [3]. In the case of a carrier-bound B cell epitope, a hypoallergenic IgE epitope is attached to a carrier molecule such as the PreS protein of hepatitis B virus. The major advantage of carrier-bound B cell epitope peptides over other methods is that they only induce neutralizing IgG antibodies, recruiting neither IgE nor T cells, thereby ameliorating the immediate and the T cell-mediated late-phase side effects of AIT. As an example, BM32 is a carrier-bound B cell epitope vaccine candidate against grass pollen allergy that has given a very good response in clinical trials [4, 5]. Hypoallergenic/non-anaphylactic peptides (NAPs) are designed by several methods, such as truncation of epitopes, mutation, and oligomerization. Bioinformatic tools were used for the prediction of epitopes and of surface-exposed residues during the design of BM32 [6, 7].
A hypoallergenic candidate vaccine against the fungal allergen Rhi o1 has been designed based on the prediction of B cell epitopes using the ABCpred and BCEpred servers, followed by further experimental procedures [8]. Another emerging approach in AIT is the use of multi-epitope vaccines [9]. Multi-epitope vaccine constructs have already been designed against four pan-allergen families using computer-aided tools [9]. IgE epitopes are retrieved from databases, and the non-anaphylactic peptides are designed and joined together with a carrier molecule; the resulting construct is tested bioinformatically for appropriate folding, expression in mammalian cells and bacteria, mRNA folding, and tolerogenic effect through the induction of IL-10. Although multi-epitope vaccines are still at a preliminary stage of development, their successful implementation may solve allergies caused by pan-allergens. In this chapter, the immunoinformatic methods for designing carrier-bound B cell epitope and multi-epitope vaccine molecules are discussed.

Computational Vaccine Against Allergens

2 Materials and Methods

2.1 In Silico Designing of Carrier-Bound Hypoallergenic B Cell Epitope Vaccine

Carrier-bound hypoallergenic B cell epitope vaccines, or non-anaphylactic peptide (NAP) vaccines, may be designed against a particular allergen or, in the case of pan-allergens, against an allergen family. The steps for designing NAPs are described below, taking Bet v 1 as the representative allergen of the Bet v 1 allergen family:

1. Select the allergen (see Note 1) from allergen databases such as Allfam or the WHO/IUIS allergen nomenclature sub-committee database (e.g., allergen family name: Bet v 1 family; Allfam ID: AF069; number of IUIS-approved allergens in this family: 26).
2. Retrieve experimentally determined IgE epitope information for the selected allergen from databases such as Allerbase or IEDB, or from research articles (e.g., IgE epitopes of Bet v 1 (UniProt ID: P15494): EQVKASKEMGETLLRAVESYLLA, GDNLFPKVAPQAIS, ISFPEGFPFKYVKD, DGDNLFPKVA, ISFPEGFPFKYVK, RVDEVDHTNFKYNY).
3. If experimentally determined IgE epitope information is unavailable, predict B cell epitopes using the ABCpred and BCEpred servers; such predictions need further experimental validation.
4. Perform a multiple sequence alignment of the allergens of the selected family and manually check (see Note 2) the conservancy of the IgE epitopes, or check epitope conservancy using the IEDB database (see Fig. 1) [e.g., based on epitope conservancy, two epitopes of Bet v 1 are selected for further study:
(a) Peptide 1 (P1): EQVKASKEMGETLLRAVESYLLA (73%).
(b) Peptide 2 (P2): PEGFPFKYVKDRVDEVDHTNFKYNY (62%) (in this case two epitopes are merged)].
5. Retrieve experimentally determined T cell epitope (see Note 3) information for the selected allergens and check whether these epitopes overlap with the selected IgE epitopes that are used for designing NAPs (e.g., T cell epitopes overlapping with P1 and P2 are underlined: EQVKASKEMGETLLRAVESYLLA, PEGFPFKYVKDRVDEVDHTNFKYNY).
6. Replace the critical amino acid residues, responsible for the epitope-paratope interaction, in the selected IgE epitopes. To achieve this, replace the charged residues with oppositely charged residues. If more than one charged residue is present in a row, replace them with uncharged residues to


Fig. 1 Screenshot of IEDB epitope conservancy analysis tool

circumvent the formation of an H-bond. Adjacent non-antigenic regions may also be added (e.g., the NAPs corresponding to P1 and P2 are as follows: P1′: GDTLEKISNEIKIVATPDGGSILKISNKYHTKGDHEVKAKQVDASGGMGKTLLDAVKSYLLAHSDN; P2′: NGGPGTIKKISFP PKGFPFDYVAAAVAAVKHTNFDYNYSVIEGGPIGD. In P1′ and P2′, changed residues are marked in red and adjacent non-antigenic regions are italicized).
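The residue replacement of step 6 can be sketched as a simple substitution rule. This is only an illustration inferred from the P1 and P2 examples above, not a description of how those peptides were actually obtained: the swap table (D, E to K; K, R to D), the treatment of histidine as uncharged, and the choice of the neutral residue for runs of charged residues are all assumptions, and flanking regions are not added.

```python
# Sketch: swap isolated charged residues for oppositely charged ones and
# replace runs of two or more charged residues with an uncharged residue.
# The swap table and the neutral residue are assumptions for illustration.
import re

CHARGE_SWAP = {"D": "K", "E": "K", "K": "D", "R": "D"}
CHARGED_RUN = re.compile(r"[DEKR]+")  # histidine is not treated as charged


def soften_epitope(sequence, neutral="A"):
    """Return the epitope with its charged residues replaced as in step 6."""

    def replace_run(match):
        run = match.group(0)
        if len(run) == 1:
            return CHARGE_SWAP[run]      # isolated residue: opposite charge
        return neutral * len(run)        # consecutive residues: uncharged

    return CHARGED_RUN.sub(replace_run, sequence)


print(soften_epitope("PEGFPFKYVKDRVDEVDHTNFKYNY", neutral="A"))
print(soften_epitope("EQVKASKEMGETLLRAVESYLLA", neutral="G"))
```

With alanine as the neutral residue for P2 and glycine for P1, this rule reproduces the modified cores that appear inside P2′ and P1′. Any candidate produced this way still needs the T cell epitope check and the experimental validation described in the protocol.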


7. If the T cell epitopes overlapping with the selected IgE epitopes remain unchanged even after changing the residues, delete these portions from the designed NAPs to eliminate the chance of a late-phase allergic reaction (e.g., in the designed NAPs, the T cell epitopes are not present in intact form. In the following constructs, the portions corresponding to T cell epitopes are underlined. As the amino acid residues are changed, they no longer act as T cell epitopes. P1′: GDTLEKISNEIKIVATPDGGSILKISNKYHTKGDHEVKAKQVDASGGMGKTLLDAVKSYLLAHSDN; P2′: NGGPGTIKKISFP PKGFPFDYVAAAVAAVKHTNFDYNYSVIEGGPIGD).
8. Couple the NAPs with carrier molecules such as a stretch of the human hepatitis B virus-derived PreS antigen, cholera toxin B (CTB), or portions of tetanus toxoid fragment C (TTFrC) (see Note 4) [e.g., the constructs prepared with P1′ and P2′ are represented in Fig. 2a].
9. Check the antigenicity of the construct using the ANTIGENpro server (http://scratch.proteomics.ics.uci.edu/) (see Fig. 3) (e.g., the antigenicity of construct 1 is 93.68% and that of construct 2 is 85.96%, so these molecules can be used as candidate vaccines).

2.2 In Silico Design of Multi-Epitope Vaccine

To design a multi-epitope vaccine (see Note 5), NAPs are designed first; two or more NAPs are then joined to each other and to a carrier protein to form a synthetic protein. The steps are as follows:

Fig. 2 Constructs of NAPs ((a), i–ii) and multi-epitope vaccines ((b), iii–vii) are represented as a line diagram


Fig. 3 Screenshot of the antigenicity analysis server

1. To design a NAP, first select the allergen family, then retrieve the IgE epitopes or predict B cell epitopes. Next, retrieve the T cell epitope sequences, and finally design the NAPs. These are IgE epitopes in which the anaphylactic residues are replaced and which are devoid of any intact T cell epitope (e.g., P1′: GDTLEKISNEIKIVATPDGGSILKISNKYHTKGDHEVKAKQVDASGGMGKTLLDAVKSYLLAHSDN; P2′: NGGPGTIKKISFP PKGFPFDYVAAAVAAVKHTNFDYNYSVIEGGPIGD).
2. Attach two or more NAPs with a linker molecule and then join them with a carrier molecule in different combinations [e.g., the sequences of the constructs are represented in Fig. 2b].
3. Finally, check the physicochemical properties of each of the constructs and use the best construct as a vaccine molecule.

2.3 Assessment of Physicochemical Properties of the Synthetic Candidate Vaccine Using Computational Tools

The physicochemical properties of each of the constructs of the multi-epitope vaccine are tested through the following process: 1. Check the molecular weight, pI, half-life, instability index, aliphatic index, and grand average of hydropathicity (GRAVY) score of each construct with the ProtParam server [10]. The preferable molecular weight is