Computational Life Sciences: Data Engineering and Data Mining for Life Sciences 3031084101, 9783031084102


Table of contents:
Preface
Contents
Contributors
Solving Problems in Life Sciences: On Programming Languages and Basic Concepts
Interesting Programming Languages Used in Life Sciences
1 Introduction
2 Julia
2.1 The Theoretical Appeal
2.2 A Real World Example
3 Perl/Raku
3.1 The Theoretical Appeal
3.2 A Real World Example
4 Spreadsheets
4.1 The Theoretical Appeal
4.2 A Tale of Algorithm Distribution
4.3 How Vlookup Solved a Data Protection Issue
5 (Not Your Grandparent's) SQL
5.1 The Theoretical Appeal
5.2 Two Small Epiphanies
6 R
6.1 The Theoretical Appeal
6.2 An Ad-hoc Overview for a Messy Dataset
7 Python
7.1 The Theoretical Appeal
7.2 Tying it All Together in One Small Application
8 Java
Introduction to Java
1 Getting Ready
1.1 Installation of JDK
1.2 Running Java from Command Line
1.3 Set Up Your Environment: Eclipse
1.4 Installing Version Control Software: SVN and/or Git
1.5 Installing and Using Maven
2 Your First Java Project
2.1 Creating a New Project
2.2 Creating a New Java Class
2.3 Sharing a Project via SVN
2.4 Sharing a Project via Git
3 Java Basics
3.1 Variable Declaration
3.2 Comparison
3.3 Arrays
3.4 Loops
3.5 Adding Extensions
3.6 Exceptions
4 Adding External Libraries
4.1 Adding an External Jar-File
4.2 Building External Libraries
Basic Data Processing
1 Data Architecture and Data Modeling
1.1 A Primer on Object-Oriented Programming
1.2 How Objects Are Represented in Java and Can Be Used to Store Data
1.3 Classes: How to Create an Object from a Class
1.4 More Information on Class Inheritance
2 Using Lists and Other Data Structures
2.1 Using ArrayList
2.2 Using LinkedList
2.3 Using Collections and Stack
2.4 Sorting a Collection
3 Handling Parameters
4 Reading and Writing Files and Data
4.1 Text Files
4.2 Tables
4.3 Pictures and Other Binary Data
5 Basic Mathematics and Statistics
Algorithm Design
1 A Simple Algorithm
2 Modeling Real World Problems
3 Running Times of Algorithms
3.1 The Big O Notation
3.2 Calculation Rules for the Big O Notation
3.3 Determining the Asymptotic Running Time of an Algorithm
4 A Faster Search Algorithm
5 Introduction to Complexity Theory
5.1 NP-complete Problems
6 Basic Concepts of Algorithm Design
6.1 Divide and Conquer
6.2 Dynamic Programming
6.3 Recursion
6.4 Greedy Heuristics
References
Data Mining and Knowledge Discovery
Data and Knowledge Management
1 Data, Information, Knowledge, Wisdom
1.1 Data Processing and Workflows
1.2 Scientific Data and Data Life Cycle
2 Data Engineering Techniques
2.1 Data Collection
2.2 Data Processing
2.3 Data Analyses
2.4 Data Storage
2.5 Data Re-use
3 Technical, Ethical and Social Issues
3.1 Technical Issues
3.2 Ethical and Social Issues
References
Databases and Knowledge Graphs
1 Introduction
1.1 Relational Database Concepts
2 Java: JDBC
2.1 SQLite
2.2 H2
3 Knowledge Graphs and noSQL Databases
3.1 A Primer on Knowledge Graphs
3.2 A Primer on noSQL Databases
3.3 Python: Neo4J
3.4 Link Prediction on Large Scale Knowledge Graphs
3.5 Machine Learning
Knowledge Discovery and AI Approaches for the Life Sciences
1 Knowledge Representation: Describing Complex Objects and Data
1.1 Structured Data
1.2 Unstructured Data
1.3 Problems
1.4 Using XML
1.5 Using JSON
1.6 Ontology Engineering for the Semantic Web: RDF and OWL
2 Knowledge Discovery: Methods for Efficient Data Processing on the Web
3 Basic Descriptive Statistics
3.1 Scales of Measurement
3.2 Frequencies and Statistical Value
3.3 Bivariate Statistics
3.4 Basic Methods of Inferential Statistics: t-Test, Analysis of Variance, Regression
4 AI Approaches for Life Sciences
4.1 Classification and Clustering
4.2 Binning
4.3 Hashing
4.4 Machine Learning Approaches for Classification
5 Personalized Medicine
5.1 Unmet Medical and Patient Needs
5.2 Properties of Biomedical Data and Challenges
5.3 Standardization and Harmonization
5.4 Perspectives on Personalized Medicine
References
Longitudinal Data
1 Sparse Data
1.1 Are we Longitudinal Yet?
1.2 Removing Mean Effects and Accessing Variability
2 Smoothing and Modelling Data
2.1 Locally Smoothing Data
2.2 Modelling Data and Adjusting for Phase Variation
2.3 Maximum Likelihood and Bayesian Approaches
2.4 Obtaining Your Model and Further Analysis
3 A Playful Dataset
Distributed Computing and Clouds
Computational Grids
1 Early Beginnings
2 Grid Computing
2.1 Grid Middleware
2.2 Site Autonomy
2.3 Using Resources of More Than One Grid: Grid Federation
3 Using Grid Technology in Life Sciences
3.1 Text Mining in Grids
3.2 Drug Discovery in Grids
4 How to Use Grid Resources as of 2021
5 Summary
References
Cloud Computing
1 Cloud Computing in a Nutshell
1.1 Commercial Cloud Providers
1.2 Cloud Access Patterns
1.3 Open Cloud Middleware
1.4 Service Level Agreements
2 Using Cloud Technology in Life Sciences
2.1 Life Sciences Applications in EGI
2.2 Life Sciences Applications in Helix Nebula Science Cloud
2.3 Life Sciences Applications in EOSC
2.4 Life Sciences Applications in ELIXIR
3 Summary
References
Standards
1 Grid Standards
1.1 Working Groups
1.2 Recommendations
2 Cloud Standards
2.1 Institute for Electrical and Electronics Engineers
2.2 International Organization for Standardization/International Electrical Commission ISO/IEC
2.3 International Telecommunication Union—Telecommunication Standardization Sector
2.4 National Institute of Standards and Technology
2.5 Open Grid Forum
2.6 OASIS Open
2.7 European Telecommunications Standards Institute
2.8 DMTF
2.9 ATIS
2.10 Global Inter-Cloud Technology Forum
2.11 SNIA
2.12 TIA
3 Summary
References
Advanced Topics in Computational Life Sciences
Network Analysis: Bringing Graphs to Java
1 Directed Graphs
1.1 Food Chains
1.2 Social Relation and Between-Species Interaction
2 Undirected Graphs
2.1 Protein Interaction Network
2.2 Similarity Graph
3 Some More Examples
3.1 Substructure and Maximal Common Substructure Searching
3.2 Random Graphs
3.3 Social Networks
3.4 Directed Protein Interaction Networks
References
Optimization
1 Linear Optimization
1.1 Formulation of an LP
1.2 Solving an LP with lpsolve
1.3 Possible States of an LP
1.4 Geometrical Approach
1.5 Algorithmic Aspects
2 Combinatorial Optimization
2.1 Integer Programs
2.2 Dynamic Programming
2.3 Branch-and-Bound
2.4 Local Search Heuristics
2.5 Hill Climbing
2.6 Simulated Annealing
2.7 Concluding Remarks
References
Image Processing and Manipulation
1 Using ImageJ as a Library
1.1 Reading and Writing Pictures
1.2 Using the ImageProcessor
1.3 Creating New Images and Destroying Images
1.4 Basic Image Manipulations
1.5 Particle Analyses
1.6 Classifying Objects
1.7 Colour Analysis
2 Other Libraries
3 Building an Analysis Pipeline
3.1 Bash Scripts
3.2 Parallel Environments
References
Sequence Analysis
1 Basics in Sequence Analysis
1.1 Of Molecules and Codes
1.2 From Subsequences to Functions
1.3 Molecular Genetics and Beyond
1.4 Computing on Biological Sequences
2 Introduction to BioJava
3 Reading and Writing FASTA
4 Database Search
5 NGS Sequences in Java
6 Sequence Alignment
6.1 Multiple Sequence Alignment
6.2 BLAST
7 Summary
References
Applications and Emerging Technologies
NGS Data Analysis with Apache Spark
1 Next-Generation Sequencing
1.1 Definition
1.2 Illumina Sequencing
1.3 NGS File Formats
2 FASTQC Software
2.1 Introduction
2.2 Interpretation of the FastQC Report
3 Introduction to Apache Spark
3.1 Apache Spark—Main Concepts
3.2 Apache Spark Main Features
3.3 Using the Best of Apache Spark
3.4 Spark Versus Hadoop
3.5 Spark Installation in Standalone Mode in Ubuntu
4 Implementation
5 Results
6 Conclusion
References
Plant Image Analysis
1 Introduction
2 Materials and Methods
2.1 Data
2.2 Segmentation
2.3 Object Recognition
2.4 Object Analysis
2.5 Explorative Data Analysis
3 Results
3.1 Comparison of Leaf Count to Ground Truth for the A1 and A2 Datasets
3.2 Overview of the A2 Dataset
3.3 Analyses of Correlations
4 Discussion
4.1 Challenges in the Identification of Plants
4.2 Object Analysis
4.3 Quality of the Analysis of the A1 Dataset
4.4 Correlations in Leaf Characteristics
5 Conclusion
References
Anonymization of Electronic Health Care Records: The EHR Anonymizer
1 Introduction: Electronic Health Care Data
1.1 EHR's and the Problem of Data Privacy
1.2 Anonymization
1.3 The EHR Anonymizer
2 Methods
2.1 File Handling
2.2 EHR and Annotations
2.3 BRAT Rapid Annotation Tool
2.4 Design of the GUI
2.5 Feedback Loop
3 Results
3.1 Annotation Performance
3.2 Statistical Metrics
4 Discussion
4.1 Aim of the Project
4.2 Results Interpretation
4.3 Future Work
References
Metadata-Enriched Image Database: A Tool for Generating and Interacting with an Image Database Using Metadata
1 Introduction
2 User Understanding
2.1 Determine User Objectives
2.2 Situation Assessment
2.3 Application Goal
2.4 Project Plan
3 Data Understanding
3.1 Data Exploration
3.2 Project Plan Extension
4 Background
4.1 SQLite Database Engine
4.2 RESTful API with Spring
5 Implementation
5.1 Command Line Interface
5.2 SQLite Databases
5.3 Documentation About Tables and Data Structures
5.4 Web Service
6 Evaluation
6.1 Result Evaluation
6.2 Process Review
6.3 Future Steps
7 Deployment
References
Biomedical Knowledge Graphs: Context, Queries and Complexity
1 Background
1.1 Preliminaries
1.2 Method
2 Results
2.1 Real World Usecases for Testing
2.2 Storing the Knowledge Graph
2.3 Polyglot Persistence Systems
2.4 Graph Queries
3 Discussion
3.1 Knowledge Discovery on Custom Layers
3.2 Missing Data
3.3 Performance
3.4 Context Based NLP
3.5 Answering Semantic Questions and FAIRification of Data
3.6 Perspectives for Personalised Medicine
4 Conclusion
References
Classification of Images from Biomedical Literature
1 Introduction
2 Background
2.1 Pre-processing of the Data
2.2 Logistic Regression
2.3 RESTful API
3 Workflow
3.1 Data Acquisition
3.2 Data Storage
3.3 Pre-processing of the Data
3.4 Machine Learning
3.5 RESTful API
3.6 Command-Line Application
4 Results
4.1 Pre-processing of the Data
4.2 Machine Learning
4.3 Command-Line Application
4.4 Web Application
5 Conclusion and Outlook
5.1 Pre-processing of the Data
5.2 Machine Learning
5.3 RESTful API
References
Index

Citation preview

Studies in Big Data 112

Jens Dörpinghaus · Vera Weil · Sebastian Schaaf · Alexander Apke (Editors)

Computational Life Sciences Data Engineering and Data Mining for Life Sciences

Studies in Big Data Volume 112

Series Editor Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Big Data” (SBD) publishes new developments and advances in the various areas of Big Data—quickly and with a high quality. The intent is to cover the theory, research, development, and applications of Big Data, as embedded in the fields of engineering, computer science, physics, economics and life sciences. The books of the series refer to the analysis and understanding of large, complex, and/or distributed data sets generated from recent digital sources coming from sensors or other physical instruments as well as simulations, crowd sourcing, social networks or other internet transactions, such as emails or video click streams, and others. The series contains monographs, lecture notes and edited volumes in Big Data spanning the areas of computational intelligence including neural networks, evolutionary computation, soft computing, fuzzy systems, as well as artificial intelligence, data mining, modern statistics and operations research, as well as self-organizing systems. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution, which enable both wide and rapid dissemination of research output. The books of this series are reviewed in a single blind peer review process. Indexed by SCOPUS, EI Compendex, SCIMAGO and zbMATH. All books published in the series are submitted for consideration in Web of Science.

Jens Dörpinghaus · Vera Weil · Sebastian Schaaf · Alexander Apke (Editors)

Computational Life Sciences Data Engineering and Data Mining for Life Sciences

Editors

Jens Dörpinghaus, Federal Institute for Vocational Education and Training (BIBB), Bonn, Germany; German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany

Vera Weil, Department for Mathematics and Computer Science, University of Cologne, Cologne, Germany

Sebastian Schaaf, German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany

Alexander Apke, Department for Mathematics and Computer Science, University of Cologne, Cologne, Germany

ISSN 2197-6503 ISSN 2197-6511 (electronic) Studies in Big Data ISBN 978-3-031-08410-2 ISBN 978-3-031-08411-9 (eBook) https://doi.org/10.1007/978-3-031-08411-9 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

The life sciences have long been considered a descriptive science—10 years ago, the field was relatively data poor, and scientists could easily keep up with the data they generated. But with advances in genomics, imaging and other technologies, biologists are now generating data at crushing speeds. —Emily Singer, 2013

In a nutshell, these words from Emily Singer¹ describe the starting point of our book: Nowadays, most life scientists have to handle enormous amounts of data. That is, they must be able to use techniques and apply adapted tools to their specific problem at hand in order to approach its solution. In other words, possessing and generating (big) data is one thing, but the heart of the matter is to retrieve information out of this data—efficiently and reliably.

Why Life Sciences and Data

Let's leave the nutshell: Biology is an exciting field, which has recently developed into what we call life sciences. On top of at least 500 years of exciting science and development, life sciences face additional, computer-related issues. Every day and world-wide, the field of life sciences is changing in terms of software, algorithms, file formats and much more. Regarding this, together with the already mentioned issue of handling a great amount of rapidly increasing data, it is nearly unavoidable to develop and apply new, efficient, robust and thus reliable heuristics and algorithms. In other words, a lot of problems in life sciences and bioinformatics have to be approached with algorithmic solutions, presuming technical understanding and, of

¹ See https://www.wired.com/2013/10/big-data-biology/: Emily Singer: Biology's Big Problem: There's Too Much Data to Handle, www.wired.com, 10.11.2013.


course, biological expertise. This applies to students, researchers and practitioners to the same extent. And here lies the entry point of our book.

Book Purpose

The purpose of this book is to offer you theoretical knowledge as well as practical advice on diverse yet fundamental topics of computational life sciences. This includes at least a sketch (and often much more than this) of the theoretical foundations of a specific field as well as the practical aspects evolving from them. In some cases, this leads up to the presentation of an (implemented) solution found for a problem that arose in a specific application. Hence, every chapter of this book is either of high practical relevance or of great scientific interest, or both. If you are interested in Data Science and Life Sciences in general, or if you want to consolidate your knowledge in these topics, or if you are interested in applications and evolving technologies in this field, or if you just need an inspiration in order to approach your own problem at hand, we believe that this book offers you a strong helping hand. Or, to make it short: We believe that students, junior and senior researchers benefit from this book as well as teachers and practitioners.

Book Overview

This book is divided into five parts. The first part, Solving Problems in Life Sciences: On Programming Languages and Basic Concepts, offers foundations on different topics. Nowadays, issues being bound to a certain language are rare, hence the first chapter considers the most common programming languages used in the Life Sciences. It closes with a short explanation why we chose Java to play a main role in most of our examples. The first chapter is followed by an introduction to the programming language Java, including information on using collaborative tools like Git. This introduction offers a quick start guide to Java, and goes hand in hand with the third chapter, Basic Data Processing. Amongst other topics, common data structures and their usage as well as aspects of the object-oriented programming paradigm are introduced. The part closes with the chapter Algorithm Design, in which we consider the modeling of real world problems as well as fundamental algorithmic principles. The second part, Data Mining and Knowledge Discovery, starts with an introductory chapter on the management of data and knowledge. This is followed by the chapter on databases and how the contained information can be structured using knowledge graphs. Applied statistics and AI approaches with regard to the applications in life sciences and medicine are the core elements of the third chapter. The part closes with a chapter on longitudinal data, that is, roughly speaking, data repeatedly collected over an extended period of time.


The third part on Distributed Computing and Clouds offers insights on computational grids as well as on cloud computing. Both of these chapters are flanked by examples emerging from applications in the life sciences. The part closes with a chapter on standards, which help to create interoperable solutions that are then able to interact with other solutions using the same standards. Working and doing research in the life sciences often requires knowledge in at least one of the following topics: graphs, optimization, image processing and sequence analysis. Hence, the fourth part, Advanced Topics in Computational Life Sciences, starts with a chapter on graphs and how to implement and use them with Java. It is followed by a chapter in which the fields of linear programs and combinatorial optimization are illuminated. Images play a crucial role in the life sciences and medicine, hence we dedicated the third chapter to image processing and image manipulation. We close the fourth part with a chapter on sequence analysis, an almost classical field in the area of life sciences and a standard example that shows the connection between the decoding of DNA-sequences and text mining problems in computer science. As one of the key features, in the last part of the book, Applications and Emerging Technologies, you will also find some more sophisticated applications arising from scientific projects. These do not only show interesting results but might also be helpful guidelines for your own upcoming programming project.

Contributors

Especially the creation of the last part of the book could only be accomplished with the help of the many contributors who authored those chapters. You will find the respective authors listed at the beginning of each chapter. In the other parts of this book you will also find some chapters written by additional contributors. Whenever this is the case, those contributors are mentioned explicitly at the beginning of the corresponding chapter. All the chapters with no explicitly mentioned authors are contributed by the editors. Thanks to this variety of contributors and their scientific backgrounds, you will find a wide range of different styles of writing and thematic focuses throughout this book. This leads to a rather dynamic than monotonous ductus that hopefully makes it even more fun to read. We would like to thank all these authors, each of whom contributed a significant part of this book.

Acknowledgements

We acknowledge the many people who helped to improve this book. It is our pleasure to explicitly mention Christof Meigen and Dr. Wolfgang Ziegler, who played an important role. In addition, we thank all students from our teaching at the Universities of Cologne and Bonn who helped to improve the quality of several chapters, in


particular Regina Wehler, Colin Birkenbihl and Olivier Morelle. We would like to extend our thanks to Springer for their help and patience through the publication process of this book. This work is dedicated to our families and children.

Cologne and Bonn, Germany
January 2021

Alexander Apke
Jens Dörpinghaus
Sebastian Schaaf
Vera Weil

Contents

Solving Problems in Life Sciences: On Programming Languages and Basic Concepts

Interesting Programming Languages Used in Life Sciences (Christof Meigen) 3
Introduction to Java (Jens Dörpinghaus, Vera Weil, Sebastian Schaaf, and Alexander Apke) 21
Basic Data Processing (Jens Dörpinghaus, Vera Weil, Sebastian Schaaf, and Alexander Apke) 55
Algorithm Design (Alexander Apke, Vera Weil, Jens Dörpinghaus, and Sebastian Schaaf) 79

Data Mining and Knowledge Discovery

Data and Knowledge Management (Christof Meigen, Jens Dörpinghaus, Vera Weil, Sebastian Schaaf, and Alexander Apke) 101
Databases and Knowledge Graphs (Tobias Hübenthal) 121
Knowledge Discovery and AI Approaches for the Life Sciences (Alexander Apke, Vera Weil, Jens Dörpinghaus, and Sebastian Schaaf) 183
Longitudinal Data (Christof Meigen) 231

Distributed Computing and Clouds

Computational Grids (Wolfgang Ziegler) 247
Cloud Computing (Wolfgang Ziegler) 277
Standards (Wolfgang Ziegler) 301

Advanced Topics in Computational Life Sciences

Network Analysis: Bringing Graphs to Java (Jens Dörpinghaus, Vera Weil, Sebastian Schaaf, and Alexander Apke) 327
Optimization (Jens Dörpinghaus, Vera Weil, Sebastian Schaaf, and Alexander Apke) 361
Image Processing and Manipulation (Jens Dörpinghaus, Vera Weil, Sebastian Schaaf, and Alexander Apke) 397
Sequence Analysis (Jens Dörpinghaus, Vera Weil, Sebastian Schaaf, and Alexander Apke) 415

Applications and Emerging Technologies

NGS Data Analysis with Apache Spark (Avirup Guha Neogi, Ashraf Eltaher, and Astghik Sargsyan) 441
Plant Image Analysis (Francèl Lamprecht, Christine Robinson, and Regina Wehler) 469
Anonymization of Electronic Health Care Records: The EHR Anonymizer (Thomas Lordick, Alexander Hoch, and Bryce Fransen) 485
Metadata-Enriched Image Database: A Tool for Generating and Interacting with an Image Database Using Metadata (Shreya Kapoor, Sophia Krix, and Gemma van der Voort) 501
Biomedical Knowledge Graphs: Context, Queries and Complexity (Jens Dörpinghaus, Carsten Düing, and Andreas Stefan) 529
Classification of Images from Biomedical Literature (Tharindu Madhusankha Alawathurage, Bharadhwaj Vinay, and Gadiya Yojana) 569

Index 597

Contributors

Alawathurage Tharindu Madhusankha, Bonn-Aachen International Center for Information Technology (b-it), Bonn, Germany
Apke Alexander, Department for Mathematics and Computer Science, University of Cologne, Cologne, Germany
Dörpinghaus Jens, BIBB Bundesinstitut für Berufsbildung, Bonn, Germany
Düing Carsten, Mathematical Institute, University Koblenz-Landau, Koblenz, Germany
Eltaher Ashraf, Bonn-Aachen International Center for Information Technology (b-it), Bonn, Germany
Fransen Bryce, Bonn-Aachen International Center for Information Technology (b-it), Bonn, Germany
Guha Neogi Avirup, Bonn-Aachen International Center for Information Technology (b-it), Bonn, Germany
Hoch Alexander, Bonn-Aachen International Center for Information Technology (b-it), Bonn, Germany
Hübenthal Tobias, Department Mathematik und Informatik, Abteilung Informatik, Cologne, Germany
Kapoor Shreya, Bonn-Aachen International Center for Information Technology (b-it), Bonn, Germany
Krix Sophia, Bonn-Aachen International Center for Information Technology (b-it), Bonn, Germany
Lamprecht Francèl, Bonn-Aachen International Center for Information Technology (b-it), Bonn, Germany
Lordick Thomas, Bonn-Aachen International Center for Information Technology (b-it), Bonn, Germany


Meigen Christof, LIFE Child, LIFE Leipzig Research Center for Civilization Diseases, Leipzig University, Leipzig, Germany
Robinson Christine, Bonn-Aachen International Center for Information Technology (b-it), Bonn, Germany
Sargsyan Astghik, Bonn-Aachen International Center for Information Technology (b-it), Bonn, Germany
Schaaf Sebastian, German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany
Stefan Andreas, Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing SCAI, Schloss Birlinghoven, Sankt Augustin, Germany; Bonn-Rhein-Sieg University of Applied Sciences, Sankt Augustin, Germany
Vinay Bharadhwaj, Bonn-Aachen International Center for Information Technology (b-it), Bonn, Germany
van der Voort Gemma, Bonn-Aachen International Center for Information Technology (b-it), Bonn, Germany
Wehler Regina, Bonn-Aachen International Center for Information Technology (b-it), Bonn, Germany
Weil Vera, Department of Mathematics and Computer Science, University of Cologne, Cologne, Germany
Yojana Gadiya, Bonn-Aachen International Center for Information Technology (b-it), Bonn, Germany
Ziegler Wolfgang, Fraunhofer Institute for Algorithms and Scientific Computing SCAI, Department of Bioinformatics, Schloss Birlinghoven, Sankt Augustin, Germany; z-rands, Hennef, Germany

Solving Problems in Life Sciences: On Programming Languages and Basic Concepts

Interesting Programming Languages Used in Life Sciences

Christof Meigen

Abstract In this chapter we present seven programming languages and tools which have contributed a lot to—and were improved based on the demands of—various fields of Life Sciences ranging from epidemiology to genetics, from simulations to image processing. We cannot give a detailed introduction to each language, but rather provide some context and small examples to whet your appetite and make you aware of all the beautiful and powerful tools available.

1 Introduction

The limits of my language mean the limits of my world.
Ludwig Wittgenstein

While it certainly has its merits to stick to common, tried-and-true programming environments which have been taught at universities across the globe for decades, sometimes domain-specific programming languages offer very effective ways of getting certain things done, or provide uncommon perspectives on problems that would be hard to solve in a traditional programming language. Life Sciences in particular have a long tradition of domain experts creating and using software tools they like, even if these are frowned upon by trained software engineers. The statistical language R, for example, prides itself on being created by statisticians, and indeed it features many concepts that are unusual in mainstream programming languages. The sections are divided into a small introduction to the language, and a real world example based on code from one of the author's projects. Of course, we commit many sins of omission. Why didn't we include Scala, Matlab, SPSS, SAS, C++, Rust, JavaScript, Bash or your favorite language? We wanted to somehow feature languages on the convex hull of the solution space (SPSS, for example, lies somewhere


public class BellCurve {
    public static double pdf(double x, double mu, double sigmaSquared) {
        return 1 / Math.sqrt(2 * Math.PI * sigmaSquared)
                * Math.exp(-Math.pow(x - mu, 2) / (2 * sigmaSquared));
    }

    public static void main(String[] args) {
        System.out.printf("%f", pdf(1.0, 0.0, 1.0));
    }
}

Fig. 1 Java program calculating the density function of the normal distribution

between Excel and R), interesting outliers in various directions, but ultimately, the choice is personal. Note: These systems are usually very easy to download and install, but there are also various online services like repl.it, where you can try out examples in your web browser.

2 Julia

Programs must be written for people to read, and only incidentally for machines to execute.
Harold Abelson

2.1 The Theoretical Appeal

Even people not much involved in mathematics have usually heard of the Gaussian bell curve, which is the probability density function (pdf) of the normal distribution. The formula for it in most math textbooks (and on Wikipedia) looks like this:

$$\operatorname{pdf}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

For a trained programmer it's usually no problem to implement this function in her favorite programming language, for example in Java, and to output the value of pdf(1, 0, 1). Looking at Fig. 1, it seems as if Goethe's quip "Whatever you say to [mathematicians] they translate into their own language and forthwith it is something entirely different" holds true for programmers, too. The Java program bears little resemblance to our original definition. If you want to read, understand and spot bugs in the above code, you have to be trained in mathematics as well as in Java, and constantly switch between the two.

Fig. 2 Julia program calculating the density function of the normal distribution:

pdf(x, σ², μ) = 1/√(2π*σ²) * exp(-(x-μ)^2 / (2σ²))
println(pdf(1, 1, 0))

Fig. 3 Julia program calculating the scaling factor of a linear transformation:

using LinearAlgebra
scaling(A) = √det(A'*A)
A = [1 1; -1 1]
println(scaling(A))

Now look at the equivalent, complete program in Julia (https://www.julialang.org) in Fig. 2—that is actually how the code looks in your editor. Being able to write 2π instead of 2*Math.PI, and √2 instead of Math.sqrt(2), is more than a matter of taste, it's vital to immediately recognise the mathematical meaning of the expression (the input on the keyboard is done by typing \pi, and it gets auto-completed to π). Let's do another example. In linear algebra, the most important concepts are those of matrices and their determinants. The first paragraph of the Wikipedia article about determinants states that determinants can be used to calculate scaling factors for volumes after linear transformations defined by a matrix A, with the actual factor being the square root of the determinant of the product of A transposed and A. That was a long sentence. The corresponding definition of the scaling factor in Julia in Fig. 3—doing actual matrix transposing and matrix multiplication—makes the definition very easy to grasp. You might be wondering about the cost of this niceness. Surely, if the language does not require you to specify any data type, and allows very abstract notations, this comes with a heavy performance penalty when the computer tries to figure out how to perform the actual calculations. Nothing could be further from the truth. Julia was designed from the ground up for high performance numeric computations, and is heavily used in high performance computing. The matrix multiplication in Fig. 3 will choose an appropriate algorithm for sparse, diagonal and a couple of other matrices, and any Julia type with addition and multiplication can be used inside the matrix, for example a 24-bit color or an 8-bit integer. In fact, you can view the generated machine code directly after entering the function definition (see Fig. 4). The assembly code goes on for 72 more lines, and is probably of limited interest for most readers. It's meant to illustrate that—unlike in most dynamic languages—basic functions like comparison and multiplication are implemented in Julia itself and produce optimized machine code which is readily available for inspection and analysis. Spanning several layers of abstraction from machine code to mathematical notation in one coherent language is one of the things that make Julia unique.

Fig. 4 Generated machine code (excerpt) for the pdf function:

julia> @code_native pdf(0, 1, 0)
    .text
; Function pdf {
; Location: none:1
[...]
    pushq   %rax
    vcvtsi2sdq  %rsi, %xmm0, %xmm0
    movabsq $139630073418848, %rax  # imm = 0x7EFE28ED1460
; }}}}
; Function * ; {
; Location: float.jl:399
    vmulsd  (%rax), %xmm0, %xmm0
; }}
; Function sqrt ; {
; Location: math.jl:479
; Function [...]

[...]

rows_DatAll_window = findall(x -> x != 0, DatAll[:, row_window])
DatAll_SAC = DatAll[rows_DatAll_window, :]
sac_frequ = length(EventAll[:Sac][:Start] ∩ DatAll_SAC[:, 1])
sac_frequ_sec = sac_frequ / size(DatAll_SAC, 1) * 1000

Fig. 6 Same fragment of the eyetracker analysis code in Julia

[...] strategy is totally different, with Julia analyzing the whole program and generating efficient machine code.

Take-away: If you have complex code, especially if it includes some math, and it needs to run fast and still be easy to debug, you should give Julia a try (julialang.org).

3 Perl/Raku

The problem with being consistent is that there are lots of ways to be consistent, and they're all inconsistent with each other.
Larry Wall

3.1 The Theoretical Appeal

Raku (www.raku.org) is the completely redesigned sister language of Perl, a language that, among other things, famously "saved the Human Genome Project". Raku is meant to include a lot of concepts from different programming paradigms, while keeping the famous Perl virtue of whipuptitude, which is, as explained by the Raku language designer, the "aptitude to whip things up". More than most other languages Raku is defined by its culture and spirit rather than language features—and it's a very geeky and accepting culture. It simply is a certain kind of people that like words like whipuptitude, name the method for reading a file at once slurp or make the language accept the Roman numeral Ⅽ (Unicode #216D) as a valid input for the value 100. People like the first implementor of Raku, a high-school dropout with fluid gender identity who later went on to become the first Minister for Digital Affairs of Taiwan. So, a lot of things can be expressed succinctly in plain Raku (see code in Fig. 7). Some less common concepts include junctions, meaning variables that can hold


# least common multiple of the numbers from 1 to 10?
[lcm] 1..10
# the 10001st prime number
((1..*).grep: *.is-prime)[10000]
# random password
('0'..'z').roll(15).join
# infinite list of Friday 13th's since the movie Friday 13th (1980)
(Date.new(year => 1980) .. *).grep({.day-of-week == 5 && .day == 13})
# easy introspection: methods callable on the value 1
1.^methods

Fig. 7 Raku examples (some from Andrew Shitov’s collection “Raku One-liners”)

multiple values at once, infinite lists, easily parallelizable operators and metaoperators. As for Perl, it is so closely embedded in sysadmin culture and has cared so much about portability and backwards compatibility that you can be sure your Perl script just runs basically anywhere there is a command line—no package install, docker image or virtual environment required. But it seems you cannot have your cake and eat it, too: Raku (and Perl) thrive when there are new problem spaces to be explored like the early internet (the first generation amazon.com website was built with Perl) and early bioinformatics, but as processes become more standardized, the very flexibility that allowed the implementation of new ideas is often viewed as a burden, and more strict approaches take over.

3.2 A Real World Example

While large projects the author contributed to—like CrescNet, a network for preventive care in paediatrics, with a central database of more than 5 million visits to the doctor—still run exclusively on Perl, its use in newer projects is often limited to smaller tasks. Suppose you have—as so happened in the LIFE Child study in Leipzig—a couple of hundred files with results from spirometry (a lung function test), looking like the listing in Fig. 8. That is, you have per file a main header, with basic demographic data like sex, date of birth and an ID number, and then per measurement a block. That block consists of an examination-specific subheader, containing data relevant to the examination like date, type of measurement, height and weight at the time of the examination. This subheader is followed by a table of results, with columns for the name of the parameter, unit, measured value and norm value. Keep in mind that you can have several examination blocks per file. In linguistics and computer science, what we just described is called a grammar describing the structure of textual information, and a more formal way of writing the previous paragraph (with some more specific markers for vertical and horizontal space) is shown in Fig. 9.

IdNr        SIC015465664
Geschlecht  M
GebDatum    02.08.2007
Bemerkung
Datum       21.07.2018
Zeit        11:33
Groesse     132
Gewicht     29
Messtyp     Spiro/FV
LfdNr       1
uflags      G0
Parameter  Einheit  Soll  Ist
IVC        l        2.21  1.62
ERV        l        0.69  0.61
IRV        l        1.03  0.59
TV         l        0.37  0.47
IC         l        1.52  1.06
VCex       l              1.73

Fig. 8 Example of Spirometry data in a device-specific text format

grammar spiro {
    token TOP          { <mainheader>+ <block>+ }
    token mainheader   { <headerkey> \h+ <value> \v+ }
    regex headerkey    { IdNr|Geschlecht|GebDatum|Bemerkung }
    regex value        { \V* }
    regex block        { <subheader>+ <title> <result>+ }
    regex title        { Parameter \h+ Einheit \h+ Soll \h+ Ist \h*\v+ }
    token subheader    { <subheaderkey> \s+ <value> \v+ }
    token subheaderkey { Datum|Zeit|Groesse|Gewicht|Messtyp|LfdNr|uflags }
    regex result       { <resultkey> \h+ <einheit> \h+ <soll> \h+ <ist> \v+ }
    regex resultkey    { \S+ }
    regex einheit      { \S* }
    regex soll         { \S* }
    regex ist          { \S* }
}

Fig. 9 Grammar for the Spirometry data

This grammar is actually valid Raku code, and if you have a file named "example-spiro.txt", you can just write

my $m = spiro.parse("example-spiro.txt".IO.slurp);
print($m<block>[0]<result>[2]<soll>);

With the data file above, this would print 1.03, but you have all the data available for processing in a well-defined data structure that mimics the structure of the file.


This is a nice example of the usability of Raku: a small task for which it would be overkill to use tools like parser generators, but which can be implemented directly using only the built-in functionality of Raku.

Take-away: Raku (and its sister language Perl) are developed by an enthusiastic group of hackers who favor flexibility, cleverness and features over strictness. If you have a small task that needs an unorthodox solution, or are stuck with a boring task that would become exciting through a clever approach, Raku might be just the swiss-army chainsaw you need (raku.org).

4 Spreadsheets

[...] imagine if I had a heads-up display, like in a fighter plane, where I could see the virtual image hanging in the air in front of me. I could just move my mouse/keyboard calculator around on the table, punch in a few numbers, circle them to get a sum [...]
Dan Bricklin, inventor of VisiCalc

4.1 The Theoretical Appeal

Ever since the publication of VisiCalc in 1979, spreadsheet software like Lotus 1-2-3, MS Excel, LibreOffice Calc or Google Sheets has enabled non-programmers to easily manage tabular data, doing a variety of operations from simple calculations to sophisticated reports. In science, spreadsheets have a well-deserved bad reputation, since they don't separate code and data, don't impose rules on tabular data (allowing values in one column to have different data types, allowing merged cells to span several columns and rows), make it easy to tamper with data and make it non-obvious in which order calculations are performed. They are, however, very easy to set up, easy to use and easy to archive—and modern spreadsheet software has functionality to transform and analyze data that easily matches what beginners can do in any statistics software.

4.2 A Tale of Algorithm Distribution

When Tim Cole and others popularized the Body Mass Index (BMI) as a measure for obesity, it soon became clear that age-specific cut-offs had to be calculated for children. Unfortunately, the BMI is not normally distributed, since you can weigh more than double the median weight, but not less than zero. At first, it was proposed to create age groups and calculate parameters (with published formulas) for the median,


standard deviation and skewness separately, but this resulted in rather bumpy curves. A second paper followed, with a more complicated approach that was hard to implement given the description alone, so the authors started to distribute Fortran code for the calculations upon request. As apparently not everyone was comfortable compiling and adapting Fortran code, a Windows application was published which allowed data import and setting smoothing parameters in graphical dialogs. That application was widely used for generating smooth centiles, but it did not fit well in automated data processing pipelines. So, nowadays, if you want to calculate centiles for BMI, you use the R package gamlss, created based on ideas from Tim Cole and others. This odyssey of distributing the same algorithm (mathematical notation, low-level programming language, self-contained application, high-level programming language) was what Prof. Michael Hermanussen had in the back of his head when he asked for providing small algorithmic examples for core concepts of his book "Auxology: Studying Human Growth and Development", without requiring the reader to install and familiarize themselves with statistical software. For this project, we did not have the resources to develop a full-scale application with data import/editing capabilities. In the past, we had created a website childhealth24.com, which provided some evaluations on an individual level, but for the book we wanted people to be able to play around with their data. After some discussions, we decided to provide algorithms as user-defined functions in Google Sheets (see Fig. 10). This way, users can operate in the spreadsheet-like system they are used to (as opposed to R or Python), yet can add advanced functionality (written in JavaScript) directly in their worksheets. This is certainly no general solution, but compared to providing R packages, creating special-purpose applications or web services, or providing source code for people to compile themselves, this solution seems—even 6 years after publishing "Auxology: Studying Human Growth and Development"—like a really nice way to distribute algorithms that people can use on their data.

Fig. 10 Google Sheets document with included auxology functionality


4.3 How Vlookup Solved a Data Protection Issue

In the Rhineland Study, we got lists of (tens of thousands of) people to contact as possible participants from the municipality. This list included names and addresses, but not the block number that defined a statistical region, by which we selected addresses for a mailing. In another table, we got the addresses (without names) together with the block numbers. Staff from the contact management would select the addresses and upload the selected addresses to the database. Merging these tables would be a two-line operation in Raku or R, but that would not account for manual data curation, which was necessary in quite a few cases, and also we did not want to install additional software on the contact management's desktop computers. Finally, the data management staff was not allowed to perform this task, as they should not handle clear names of participants for data protection reasons. I'm a bit ashamed to admit how surprised I was to find out that this was trivial to do with Excel, since the included vlookup function is basically a left join in database terms (in recent versions of Excel, it is recommended to use the combination of index and match for that task). That solution worked well since people were already familiar with Excel, and could easily import, copy and filter data. Also manually looking up cases where no match was found and correcting data directly was trivial. Doing any of these things in a script-based solution would be a challenging task for non-programmers. That we, however, also did our shift plan in a worksheet-from-hell with many VBA macros is a story for a different day.

Take-away: Spreadsheets are much more powerful than programmers may think, and spreadsheet-based solutions are usually well-accepted by users. Google Sheets allows custom functions to be easily incorporated by other users (docs.google.com, libreoffice.org).
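For readers more at home in code than in spreadsheets, the left-join nature of vlookup can be made literal. The following pandas sketch is only an illustration of the matching step described above: the file and column names are invented, and the study itself deliberately used Excel instead of a script.

# Hypothetical inputs: contacts.csv (name, address) and blocks.csv
# (address, block) -- stand-ins for the two tables described above.
import pandas as pd

contacts = pd.read_csv("contacts.csv")
blocks = pd.read_csv("blocks.csv")

# The vlookup step as a left join: keep every contact and attach the
# block number wherever the address matches.
merged = contacts.merge(blocks, on="address", how="left")

# Rows without a match are what vlookup shows as #N/A; they would go
# to manual curation.
needs_review = merged[merged["block"].isna()]
print(len(needs_review), "addresses need manual review")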

5 (Not Your Grandparent's) SQL

5.1 The Theoretical Appeal

SQL has been the default query language for relational databases for more than half a century, and as such is involved in many data processing workflows. You will get a very brief introduction to it in Chap. 6. The nice thing about SQL is that it makes no assumptions about how the data is stored, and looking at an SQL query will tell you little about how the data is actually processed. A little less known than simple selections from tables is the fact that with features like common table expressions a.k.a. WITH RECURSIVE it is easily possible to chain SQL queries to achieve what would be called pipelines in other languages, feeding the results from one processing step into the next query. In addition, window functions and pivot tables (called crosstab in PostgreSQL) give you a lot of flexibility to summarize and transform data. It remains a matter of taste and practical considerations (is your data already in a relational database?) whether you make use of these features, but SQL offers query features on par with modern data processing packages like Pandas.
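To make the chaining idea concrete, here is a small Python sketch that runs a common table expression feeding into a window function on an in-memory SQLite database; window functions require SQLite 3.25 or newer, and the table and column names are made up for illustration.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE visits (patient_id INTEGER, visit_date TEXT, bmi REAL);
    INSERT INTO visits VALUES
        (1, '2020-01-10', 22.1), (1, '2020-06-02', 22.9),
        (2, '2020-02-14', 30.5), (2, '2020-07-21', 29.8);
""")

# The CTE keeps each patient's most recent visit; the outer query then
# ranks patients by BMI -- one processing step feeding the next.
rows = con.execute("""
    WITH latest AS (
        SELECT patient_id, bmi,
               ROW_NUMBER() OVER (PARTITION BY patient_id
                                  ORDER BY visit_date DESC) AS rn
        FROM visits
    )
    SELECT patient_id, bmi, RANK() OVER (ORDER BY bmi DESC) AS bmi_rank
    FROM latest
    WHERE rn = 1
""").fetchall()
print(rows)  # e.g. [(2, 29.8, 1), (1, 22.9, 2)]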

5.2 Two Small Epiphanies

5.2.1 Cloud SQL

Preparing a talk about the influences of social trends on medical decisions, I once thought it would be a good idea to look at all the comments from reddit.com (a popular social networking site) that somehow deal with "gluten": What is the context, a social graph of the authors of these comments etc.—nothing too fancy. The comments of reddit are public, and people scrape them all the time and provide them as downloads, usually in a "one JSON per line" format. As it turns out, processing tens of billions of these comments is not that fast, and would require some cluster solution which I had no intent to set up for a few queries. When I discovered that these huge datasets had been uploaded to Google BigQuery and made available for querying directly via SQL… let me take a brief detour. I learned programming in BASIC on a computer with 3.3 kB of RAM. The difference in experience from that to a modern day desktop computer felt as dramatic as the difference between struggling for days with the Reddit dataset on my own hardware and just typing a SQL query and having an unknown number of Google servers do all the data processing within a few seconds, without me having to care where the data is and what techniques are used. Amazon of course offers similar capabilities with Redshift (which advertises "no table size limit"), and other cloud providers surely do too. Especially when it comes to publicly available data interesting for science, a lot of datasets are already preloaded in these services and ready to be queried.
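For illustration, such a query can be run from Python with the google-cloud-bigquery client library. The sketch below is not the query used for the talk; the table name is an assumption (the public Reddit snapshots have been hosted under changing dataset names), and running it requires a Google Cloud project.

from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT subreddit, COUNT(*) AS n
    FROM `fh-bigquery.reddit_comments.2019_08`
    WHERE LOWER(body) LIKE '%gluten%'
    GROUP BY subreddit
    ORDER BY n DESC
    LIMIT 10
"""
# Iterating the job waits for the result; the heavy lifting happens
# on Google's servers, not on your machine.
for row in client.query(query):
    print(row.subreddit, row.n)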

5.2.2 Import First

This section is about SQL, but in order to use SQL your data has to be in a database first. The task of importing the data is usually done via some script that reads from a CSV, XML, JSON or text file, does some data cleaning, checks for inconsistencies and finally transforms the data so it can be imported into various database tables. Often you need data from the database to check the validity of data, for example if labels are from a predefined set, or whether some datasets were already imported. On many occasions it might be useful to import the—unclean, poorly structured—data first into the database, and then do all the processing steps nicely in SQL within the database. In fact, all modern relational databases support reading CSV files


directly, or parsing and destructuring JSON and XML. And—in contrast to other languages—all SQL databases by default deal with data that does not fit into your computer's memory.

Take-away: Cloud services offer insane scalability, and SQL can be used not just for simple queries, but also to clean, import and analyze data (cloud.google.com/bigquery, aws.amazon.com/redshift).
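As a concrete illustration of the import-first pattern just described, here is a minimal, self-contained Python sketch using the built-in sqlite3 module; all file, table and column names are invented, and the CSV is assumed to have exactly three columns.

import csv
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE staging (id TEXT, label TEXT, value TEXT);
    CREATE TABLE allowed_labels (code TEXT);
    CREATE TABLE measurements (id TEXT, label TEXT, value REAL);
    INSERT INTO allowed_labels VALUES ('IVC'), ('ERV');
""")

# Step 1: load the raw, possibly messy rows into the database as-is.
with open("raw-data.csv", newline="") as f:
    con.executemany("INSERT INTO staging VALUES (?, ?, ?)", csv.reader(f))

# Step 2: all cleaning happens in SQL inside the database: trim
# whitespace, keep only predefined labels, cast values to numbers.
con.execute("""
    INSERT INTO measurements
    SELECT TRIM(id), TRIM(label), CAST(value AS REAL)
    FROM staging
    WHERE TRIM(label) IN (SELECT code FROM allowed_labels)
""")
con.commit()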

6 R

6.1 The Theoretical Appeal

If, as a programmer, you were told to primarily work with a language that has no data type for single numbers or strings, that, while supporting them, strongly discourages the use of looping constructs like for or while, and whose main data structure—the so-called data frame—has no equivalent in mainstream programming languages: Would you giggle with intellectual curiosity, or turn away in horror? We are talking about R, a language and environment for statistical computing, or, as you should call it on your CV, data science. Let's dissect the claims in the previous paragraph. Of course, you can write something like

x %>% select(id, starts_with("actino")) %>% write_csv("gut-filtered.csv")

One could spend time arguing whether using SQL or LINQ within a program, or using separate libraries with their own mini-languages like pandas for Python would provide a similar ease, but those are more like separate languages, and we would be missing two other major points: interactive usage and R as lingua franca of statistics. Much more than in other languages, development in R happens interactively, most commonly in the RStudio graphical user interface, where you type in commands and can inspect all variables/datasets and view generated plots directly. R programs often require just a few lines to produce meaningful results, which makes such incremental development possible and enjoyable (of course, it comes with a danger to further use a result without saving the code that generated it). In this regard, working in R much more resembles working in a notebook-like environment like Mathematica/Wolfram or Jupyter, while also borrowing the concept of an "image file" from Smalltalk/Lisp with the complete state of all variables and definitions being saved by default upon exiting R. And finally, getting statistics right is complicated. Advanced algorithms for multidimensional modelling and testing are too complicated to be easily described with a few formulas. For a large part of statistics, R has become a language in which new ideas are primarily published, mostly as R packages in the Comprehensive R Archive Network (CRAN)—so the R version by default is the definitive implementation.

6.2 An Ad-hoc Overview for a Messy Dataset

As stated, the true value of R lies in its community codifying decades of statistical knowledge into easy-to-use libraries, for example lme4 for mixed-effects models or gamlss for modelling distributions. But keeping in line with the spirit of this chapter, we want to demonstrate how R can help you with smaller challenges. An aspiring Ph.D. student was once tasked with a data analysis where, as it turned out, the provided dataset was a little less than perfect. It was a table with columns for the variables and rows for the participants. A row was created for each study visit (so there were multiple rows per participant). The visits were usually a few days apart and the data should be analyzed together. While the study protocol stated which questionnaires and examinations were supposed to be done at which visit, this had not been followed rigorously (especially half-answered questionnaires were completed at another visit—sometimes repeating a few questions). Combining rows where in every column at most one value was not missing, or where all the not


Fig. 12 Problematic dataset with contradictory data, two requested summaries

## df is our dataset, prepare a long form
df2 <- df %>%
  pivot_longer(-id, names_to = "var", values_to = "val") %>%
  filter(!is.na(val)) %>%
  distinct()
## create first summary
df2 %>% count(id, var) %>% filter(n > 1) %>% count(var)
## create the second summary
df2 %>%
  group_by(id, var) %>%
  summarize(values = paste(val, collapse = ","), n = n()) %>%
  filter(n > 1) %>%
  group_by(id) %>%
  summarize(diffs = paste(var, " (", values, ")", sep = "", collapse = "/"))

Fig. 13 R code to create the two summaries

missing values were equal was no problem. But, in a dataset with thousands of rows and several hundred columns, how can the problematic cases be found? Specifically, we want a list of variables where this occurs (with the number of participants), and, for manual inspection, also a list of participants, which shows for each variable the contradictory values. Looking at the table in Fig. 12, we want to produce the two summaries shown next to it. What we want to achieve in the first summary can be described as: Starting with our data, we want a list of values and variables per id, and then we remove all the missing values, and then we remove all duplicate rows (we save this intermediate result for later as df2). Then we count the number of cases per id and variable, and then we filter where n > 1, and then we count the number of cases per variable. This can be directly entered into R, usually without even creating a new script file (see Fig. 13). For the second summary, we start with our intermediate df2 and group it by id and variable, create a comma-separated string of all the values and calculate the number n of values, then we filter for n > 1, then we group by id and create a "/"-separated string of the variables with all the values in parentheses.

Take-away: For working with tabular datasets that fit into your computer's memory there is no more elegant way than using R, especially with the tidyverse libraries and working in RStudio. This is especially true if you have to apply advanced statistics (r-project.org, rstudio.com).

7 Python

7.1 The Theoretical Appeal

It can be argued that Python is the second best language at everything: Rails (in Ruby) is a nicer web framework than Django (in Python), the tidyverse libraries (in R) are better for data science than pandas (in Python), Perl is better at scripting Unix tasks, Lua is better for tight integration with C, and of course any compiled language is way faster than Python. So if you only ever want to do one thing and one thing only, Python may not be your best choice. But few of us rise to the status of a master, and we have to get by being a jack of all trades: quickly set up a web application, automate some system tasks, set up a data processing pipeline to produce some plots, interface with some new and fancy library, preferably by the end of the day. Python is available to extend your database (plpython in PostgreSQL), it is available as a scripting language in 2d and 3d graphics programs (Gimp and Blender), it has been around longer than Java, ranks number one by far in the "Popularity of Programming Languages" index 2020 (based on Google searches), was the language most in demand in the US job market in January 2020 with 74,000 job openings according to Indeed, and is used by Google's director of research, Peter Norvig, to solve the annual Advent of Code puzzles. So how does Python do it? Python's informal design principles were codified in the "Python enhancement proposal (PEP) 20", where 14 of the 19 aphorisms are some variation of "Simple is better than complex",


int countNegative(double a[]) {
    int count = 0;
    for (double d : a) {
        if (d < 0) {
            count++;
        }
    }
    return count;
}

def countNegative(a):
    count = 0
    for d in a:
        if d < 0:
            count += 1
    return count

Fig. 14 Usage of significant whitespace in Python (right) versus explicit nesting with curly brackets in Java (left)

like "There should be one -- and preferably only one -- obvious way to do it." or "Readability counts." The most obvious, and most controversial, incarnation of these principles is the use of indentation for nesting definitions, conditions and loops (see Fig. 14). The consequence of indentation being significant is that you have to format your code properly, and it is practically impossible to have more than 4 or 5 levels of nesting, meaning you are forced to write simple functions. One of the biggest differences between beginners and seasoned programmers is that beginners tend to write complicated code, copy-pasting more and more apparently working snippets together, while seasoned programmers know the pain of finding bugs in complicated and hard-to-read functions. So, by mandating simplicity, Python forces you to be a better programmer. Compare the code in Fig. 14 to the Julia solution countNegative(a) = length(filter(x -> x < 0, a)).

<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>ObjectExamples</groupId>
    <artifactId>ObjectExamples</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <build>
        <sourceDirectory>src</sourceDirectory>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>



This file (the pom.xml) contains all dependencies for the project. Whenever we give you snippets for Maven dependencies, you can paste them here. If you want to build your project, right click on it and choose Run as... → Maven install.
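For illustration, a dependency snippet looks like the following; this example pulls in the JUnit testing library (any artifact from a Maven repository follows the same groupId/artifactId/version pattern) and would be pasted inside a <dependencies> element of the pom.xml:

<dependencies>
    <!-- example dependency: JUnit for unit testing -->
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.13.2</version>
    </dependency>
</dependencies>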

2 Your First Java Project

2.1 Creating a New Project

First, we need to create a new Java project in Eclipse. You will find this in the menu File → New → Java Project:


Fig. 3 How to create a new Java class

Give the project a name and click on finish. Afterwards you will see your new project in the Package Explorer on the left-hand side. It has a src folder which contains the source code.

2.2 Creating a New Java Class

You will find the option to create a new Java class in a similar place: File → New → Class. The following dialogue appears (Fig. 3): Use "Test" as the new class name and click on finish.


We will start with this Java class:

Listing 2 Test.java
// Test.java
public class Test {
    public static void main(String[] args) {
        System.out.println("Hello World");
    }
}

The first line starts with //, which indicates a comment line. The two slashes may start at an arbitrary position within a line, so that everything in this line right of // is a comment (and therefore programmatically ignored). A comment gives you the possibility to add "normal", human-readable notes and information to your source code. This is extremely useful (and important!) in order to make your code readable and to remember decisions you made. It is also useful if you are working together on code to share information. Finally, source code can be deactivated by simply transforming it into a comment instead of deleting it. Hereby, it may easily be reactivated.

The second line in Listing 2 indicates the definition of a class. With public, we indicate that the following definition is public; we will discuss the meaning of "public" later. The keyword class is used to define a class. Keywords are reserved words that you should not use for other purposes, for example to name variables. The name of the class follows: Test. The class starts and ends with curly braces { and }. The definition range of a class is usually called a block. Within the class of Listing 2 we find another block, defined by a public and static method called main. You will see public static void main(String[] args) in a lot of source code. This is the main entry method for a class. If we execute the program, your computer will jump to that method and execute the code. We will discuss the definition in more detail later. Until now it is only important that after we start this application, the following line will be executed: System.out.println("Hello World");. As you can see, every statement or program command ends with a semicolon ;. Moreover, a text value, a so-called String, is set within quotes ".

If you save your text file, make sure the filename corresponds to the class name: public class Test → Test.java. Otherwise you will receive a compiler error. We will now run the application. If you find your Test.java file in the Package Explorer, you can use the right mouse key to open the context menu. Choose Run as... → Java Application and you will see the output in the Console shown at the bottom, as shown in Fig. 4. You can also choose Run → Run from the menu or press Ctrl+F11.


Fig. 4 Hello World - application

2.3 Sharing a Project via SVN

Now there is only one thing left: sharing your project via SVN. Choose your project in the Package Explorer, open the context menu and choose Team → Share Project.

Choose “SVN” to share your project.


If you have already created a repository location, you can choose it from the list at the bottom. Otherwise you can create a new location.

Use the repository location that was provided to you.


In the last step, you can choose whether you want to use the project name as the folder name or store it in a different location. Please choose the folder location carefully, or follow your system administrator's recommendations.

The following question can be answered with "No", as team synchronization is not yet our topic. You can once again open the context menu of your project in the Package Explorer. Choose Team → Commit.... You will be asked for a comment on your commit. Please take this seriously, because it will make it easier to reproduce who has committed what and for what purpose. After that, you will see a message like Transmitting file data ... Committed revision 2764 in the Console. We can now open the SVN perspective. Perspectives are used in Eclipse to change the environment for different purposes. What you are already working with is the Java perspective. Its purpose is Java programming. In the top right corner of your Eclipse window you will see the following icons:


If you do not have an SVN icon, open the "Open Perspective" window by clicking on the respective icon. You can choose "SVN Repository Exploring". Next time, the icon will appear directly. After choosing this perspective, you will see all of your SVN repositories and all files accessible by your user, that is, by the name associated with your account. See Fig. 5, where jdoerping is the name associated with our sample user account. This perspective is extremely helpful for moving, organizing and evaluating your commits. For example, it is easy to see the history of your files and projects.

2.4 Sharing a Project via Git

2.4.1 Managing Your Projects in GitLab

There are different services for sharing a project via git. You may for example use the (free) github.com service by creating an account and hosting your project on their platform. Bitbucket and GitLab are further platforms of this kind. We demonstrate the use of git in a GitLab environment. Observe that some properties of your project might require different settings. However, this section should guide you through the essential steps that are required in order to manage and share your projects with git, using GitLab as an example. If your institution also uses GitLab, you probably have an account created for you, and a website that allows you to follow your shared git projects. Before being able to push any files via SSH you will need to add your ssh key to the

Fig. 5 The SVN repository exploring perspective in Eclipse


service (and create one if it does not exist). Connect to GitLab and select settings from your icon in the upper right corner (see Fig. 6). Then select "SSH Keys" from the top menu and follow the instructions.

Note: You cannot use SSH connections from all networks. Choose HTTPS instead if you are not familiar with SSH.

If you want to create a new project, click the green "New project" button, as shown in Fig. 7. Please choose an appropriate project name, enter a proper description and decide whether to make the project public, cf. Fig. 8. After that, add all members of the team that you want to share your project with using Settings → Members. Choose the role "Master", otherwise the other members will not be able to commit to master (Fig. 9). Next you can pull the Git project assigned to you or created by you. In the GitLab interface, go to the main project page and copy the SSH address as shown in Fig. 10. We will need this address when cloning the repository, as described in the next section.

Fig. 6 The user menu in GitLab

Fig. 7 Creating a new project in GitLab


Fig. 8 Choose project name, describe the project and choose the visibility level

Fig. 9 Add members to your project

Fig. 10 Copy the SSH address


Fig. 11 The Git repository exploring perspective in Eclipse

2.4.2 Cloning a Repository

We can now open the Git perspective in Eclipse. Recall that in Eclipse, perspectives are used to change the environment for different purposes. What you are already working with is the Java perspective. Its purpose is Java programming. In the top right corner of your Eclipse window you will see the following icons:

If you do not have a Git icon, open the "Open Perspective" window by clicking on the respective icon and choose "Git"; next time, the icon will appear directly. You can also choose "Git" or "Other" from Window → Perspective → Open Perspective:

After choosing this perspective, you will see all of your Git repositories and all files accessible by your user. See Fig. 11. This perspective is extremely helpful to


Fig. 12 Paste the SSH address in URI form and select SSH or https protocol

move, organize and evaluate your commits. For example, it is easy to see the history of your files and projects. You can now clone a Git repository by clicking on the link or choosing it from the menu. Or go to Eclipse File → Import... → Git → Projects from Git → Clone URI and paste the copied address into the URI field as shown in Fig. 12. The https protocol is chosen automatically. Depending on the Git repository that you clone, Eclipse will warn that the project is empty. That is okay, just click next and import it as a general project. Please be careful: after selecting the Git folder you have to add a subfolder with the project's name in order to avoid conflicts (Fig. 13). In order to commit changes and push to GitLab you have to go to the Git perspective in the same way as described for the SVN perspective. Select the Git project on the left side, and in the lower right part of Eclipse under Git staging you can see the unstaged changes. These are all the changes you have not committed yet. Select the files that you want to commit (or all of them), right click and select Add to index. Write a commit message, for example describing what changes you introduced ('added new project' or 'updated project by modifying ...'), and then select Commit and push... as shown in Fig. 14.


Fig. 13 Select the .git folder as the save location of the new project

2.4.3 Sharing an Existing Project with Git

If you have a new local project which has not yet been added to your Git repository, you can use the context menu by choosing Team → Share Project...


Fig. 14 Add files to index, commit and push

You can now choose the previously configured repository and add a new, suitable path, for example handout1/task1.


Fig. 15 Create and push a tag

2.4.4 Importing an Existing Project into Your Workspace

If you refresh a repository location in the Git perspective, you will see everything other people have already pushed to master:

You can now import those projects by choosing “Import” from the context menu.

2.4.5 Tagging a Git Project in Eclipse and Pushing the Tags

There are different reasons to use tagging in a Git project. Tags are often used to mark special milestones of your software, such as release versions. The procedure to be followed in Eclipse is:

1. Switch to the EGit perspective.
2. Select the project you want to tag and expand it.
3. Locate the Tags node in the project tree, right-click it and select Create Tag....
4. Enter a tag name and optionally a tag message.
5. Right-click Tags and select Push Tag....
6. Select the desired tag and push.

After getting ready and setting up the environment we can now focus on the foundations of Java programming. First of all, there are some sections that describe the basics of Java: variables, comparison, arrays, loops and how to import other packages. You may skip these topics if you are already familiar with Java (Fig. 15).


3 Java Basics

Java is a very popular programming language. And like any other programming language, it has its own structure, syntax and of course paradigms. The most important topic is the concept of OOP, object-oriented programming. You have already seen that code blocks are organized using the brackets { and }. It is good practice to use indentation for each block. There are also some reserved keywords. These words are reserved for the compiler, which recognizes them as special signaling words. They should not be used for naming other things in Java. See Table 1 for an overview. Java code is organized in packages and classes, and classes contain methods, variables, constants, nested classes and so on. You have already seen and compiled a .class file. Classes within a subfolder belong to the same package. A jar file can contain multiple packages. We will discuss this in more detail later. But this is how Java code is structured. A class is the framework for an object in Java. First of all, it contains, if necessary, the package, then the imports and then the class definition:

package xyz;
import zyx;
class yzx {
    ...
}

The good news is that Eclipse usually takes care of these lines. Eclipse will also tell you if you are mixing up the naming conventions of Java. Classes, for example, should consist of one or more nouns and start with a capital letter. Methods should contain a verb and start with a lowercase letter. Variables and instances of objects should start with a lowercase letter. Constants should be written in capital letters.
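As a quick illustration of these conventions, consider the following sketch (all names are made up for this example):

public class PatientRecord {              // class name: noun, capitalized
    static final int MAX_ROOMS = 100;     // constant: all capital letters
    private int roomNumber;               // variable: starts lowercase
    public int getRoomNumber() {          // method: verb, starts lowercase
        return roomNumber;
    }
}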

Table 1 Reserved keywords in Java

abstract     assert        boolean      break        byte
case         catch         char         class        const
continue     default       do           double       else
enum         extends       final        finally      float
for          goto          if           implements   import
instanceof   int           interface    long         native
new          package       private      protected    public
return       short         static       strictfp     super
switch       synchronized  this         throw        throws
transient    try           void         volatile     while


3.1 Variable Declaration

Variables are the place to store any kind of value. They are a storage location together with an associated symbolic name which identifies the variable. Variables need to be declared before they are used. This concept ensures that each variable is of exactly one type. If we declare a variable, we must state what kind of data we want to store.

3.1.1 Characters and Text

If we want to store text, we need to define a String variable:

Listing 3 Variable Example
String name = "Hubert";
System.out.println("Hello " + name);
name = "Fritz";
System.out.println("Hello " + name);

The first line indicates that we are declaring a String variable called "name". The equals sign indicates that we assign the value "Hubert" to this variable. Thus, in the next line we print out "Hello" concatenated with the value stored inside the variable name. In the third line we change the value of the already declared variable. We do not need to state the type again, but can simply use the variable name. Execute the code example above and take a look at the output:
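The output (reconstructed from the code above) should be:

Hello Hubert
Hello Fritz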

Strings and characters can be concatenated using the plus sign. A single character is of type char and is set within single quotes:

char c = 'a';

As you can imagine, you can concatenate both types. Characters are seldom used but need much less memory than strings.

44

J. Dörpinghaus et al.

3.1.2 Numbers

Sometimes you might also want to store data that is not text. Take a look at the following example:

Listing 4 Variable Example 2
int age;
age = 22;
int nextage;
nextage = age + 1;
System.out.println("Your age next year: " + String.valueOf(nextage));

An int variable stores integer values, that is, signed whole numbers from Z. As you can see, we have defined two variables, "age" and "nextage". You can also do basic arithmetic using +, -, *, /; see Table 2. If we want to represent numbers from Z, we have to check how big these numbers can be. Java reserves a fixed number of bits to store the values. See Table 3 for an overview. In general, if you have n bits, you can use the formula −2^(n−1) to 2^(n−1) − 1 to calculate the range. It is also useful to take into account that there is no unsigned number representation for N in Java: there is always an algebraic sign. What is the best way to choose one of these data types? You have to find the best balance between the usage of memory space and your own needs. The most used type is int. You might think that memory is not a big issue. But if you have big data, you will see that your main memory melts like ice in the sun. We will discuss that in one of the next chapters.

Table 2 Mathematical operators for arithmetic

Operator   Function
+          Addition
-          Subtraction
*          Multiplication
/          Division
%          Modulo

Table 3 Variables for natural numbers

Datatype   Bits   Range
byte       8      −128 to 127
short      16     −32,768 to 32,767
int        32     −2,147,483,648 to 2,147,483,647
long       64     −9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

3.1.3 Floating-Point Numbers

The world is not discrete, whereas the number representation described above is. But we also want to calculate with real numbers. This works similarly to natural numbers if we use bits to represent fractions of the form 1/2^i and represent a real number as a sum of these values. This means x = x1 · 1/2^1 + x2 · 1/2^2 + ... + xi · 1/2^i for i ∈ N.

Example: 0.390625 is equivalent to 1/4 + 1/8 + 1/64, that is, 0.011001 as a binary number, since x2 = x3 = x6 = 1.

Thus, displaying floating-point numbers is not only an issue of range, but also of precision. The IEEE (Institute of Electrical and Electronics Engineers) has designed several standards for floating-point numbers. They include the arithmetic formats, interchange formats, rounding rules and operations. Floating-point numbers are a complex topic.

Example: 0.4 is, according to IEEE Standard 754-1985, defined as a periodic sequence. Thus, using a 32-bit floating-point number, the value stored in the computer is approximately 0.3999998.

Thus, keep in mind that the computation of floating-point numbers is not always precise. For example, when calculating n · m for a small n and a large m, the calculation will not be numerically stable. See Table 4 for the different floating-point types in Java.
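A one-line demonstration of this imprecision with Java's double type:

System.out.println(0.1 + 0.2);   // prints 0.30000000000000004, not 0.3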

3.1.4 True or False? Boolean

To differentiate between true and false we can use booleans:

boolean bool = false;
boolean bool2 = true;

There are a lot of examples where boolean values occur, even if you do not define them. Whenever you have a condition, a boolean value is returned:

int a = 4;
int b = 6;
boolean bool = (a == b);
boolean bool2 = (a != b);

Table 4 Variables for floating-point numbers

Datatype   Bits   Smallest positive number   Greatest number
float      32     1.401 · 10^−45             3.403 · 10^38
double     64     4.941 · 10^−324            1.798 · 10^308


We have defined two different integer values for a and b. The double equals sign == checks for equality. Since a and b are not equal, the first boolean will be false. != checks for inequality; thus, the second statement is true. See the next section for more details on logical expressions. It is important to realize that booleans can be inverted with the ! sign:

boolean bool = false;
boolean bool2 = !bool;

Now, bool2 has the value true.

The Cheat Sheet
char          Variable type to store a single character:       char letter = 'a';
String        Variable type to store text:                     String text = "example";
int           Variable type to store natural numbers:          int myNumber = 5;
float/double  Variable type to store floating-point numbers:   double myNumber = 5.5;
boolean       Variable type to store a boolean:                boolean bool = false;

3.2 Comparison

It is often necessary to check conditions within the program. For example, if n = 0 we cannot calculate 1/n. The syntax is quite easy; we use the word if:

Listing 5 Condition Example
int age = 0;
int b = 1;
if (age > 0) {
    b = b / age;
}

Thus, the line calculating b/age will only be reached if age > 0. As mentioned above, a condition always returns a boolean, and the if checks whether the statement is true. See Table 5 for a list of operators to compare values. You can also combine if-statements with and (&&) and or (||). For example

if ((age > 0) && (age < 45))

will hold only if the value of age > 0 and age < 45. It is true for all ages from 1 to 44.

Table 5 Operators for comparing values

Operator   Function
==         Equality
!=         Inequality
<          Less than
>          Greater than
<=         Less than or equal
>=         Greater than or equal

3.3 Arrays

Arrays are special data types that contain multiple instances of variables. They can be compared with lists. They have an index that counts all elements inside the array. For example, if we want a list of strings to save some names, we can declare this array with

String[] names;

This indicates that we have a new variable called "names" that is a field and thus has an index to access the elements inside, which have the type String. But it is still empty; in technical terms, no object was created. We can, however, declare and instantiate the object in one step:

String[] names = {"Hans", "Klaus", "Bob", "Charles"};

This creates an array with length four because it has four elements inside. We cannot change the size of an array after creating it. But we can read the length with length:

System.out.println(names.length);

This will return 4. If we want to access elements stored in a field, we can use [ ] and the index:

System.out.println(names[0]);                   // First element
System.out.println(names[names.length - 1]);    // Last element

As you can see, the index of the first element is not one, but zero. Now we have seen how to create a new object with values. But sometimes we may need to create an object without assigning values. We can do this with

String[] names;
names = new String[4];

or in one line

String[] names = new String[4];

which will create a new array field with length four. If we want to iterate over all fields, we need loops.


3.4 Loops

Loops are used to execute some commands for a certain range or time. A loop always declares a block of statements that are executed by the loop, and it always contains a condition. If this condition is true, the loop will execute the statement block once again. Java has four loops: while, do-while, for and for-each. If the condition never becomes false, we have created an infinite loop and the program will never terminate. A while-loop checks the condition before executing the loop block. This means that the code block may not be executed at all if the condition was already false before coming to the loop. The do-while-loop checks the condition after executing the code block; thus, the code block is executed at least once! See Fig. 16 for an illustration. We will start with a little example that prints all values from 20 down to 1 on the screen:

int count = 20;
while (count > 0) {
    System.out.println(String.valueOf(count));
    count = count - 1;
}

We instantiate the integer count with 20 and the loop condition is count > 0. The same example using a do-while loop would look like this:

int count = 20;
do {
    System.out.println(String.valueOf(count));
    count = count - 1;
} while (count > 0);

You see the differences between both loops if you change the initialization of count to −20. The while loop will not execute the code block since the condition is false. The do-while loop will not check the condition in the beginning, thus the code block will be executed and it will print -20 on the screen. Afterwards it will check the condition and terminate.

Fig. 16 Differences between while and do-while-loops: the condition is checked either before or after executing the loop code block



A special loop is the for-loop. It is typically used for counting over natural numbers. As with while, the code block will only be executed if the condition is true. Following the above example, we get:

for (int count = 20; count > 0; count--) {
    System.out.println(String.valueOf(count));
}

As we can see, a for loop has three parts inside the condition block. The first one is used to instantiate a counter variable. The next one is the condition: if it is true, the code block will be executed, otherwise not. The last one is the operation on the counter variable. Usually people use short variants, which look like this:

i++; // is the same as i = i + 1;
i--; // is the same as i = i - 1;

Please notice that i++ is not the same as ++i. Both increment the number, but the latter increments the number before the current expression is evaluated, while the former increments the number after the expression is evaluated. For-loops are usually (but not only!) used within the context of lists or arrays. If we want to iterate over all fields of an array, we can use for-loops:

String[] names = {"Hans", "Klaus", "Bob", "Charles"};
for (int i = 0; i < names.length; i++) {
    System.out.println(names[i]);
}

This will assign the values from 0 to names.length − 1 to i; thus, the println line will be executed for every item in the array and will print all elements of the array on the screen.
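The fourth loop type mentioned above, the for-each loop, was not shown yet. A minimal sketch: it iterates directly over the elements of an array or collection, without an explicit index:

String[] names = {"Hans", "Klaus", "Bob", "Charles"};
for (String name : names) {
    System.out.println(name);   // prints each element in order
}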

3.5 Adding Extensions

Sometimes we need to import extensions that were programmed or provided by other people. Or we have structured our source code in different packages and want to import them. A package is a construct in Java that allows users to organize classes, interfaces and methods. If you take a look at Eclipse, you will see the packages as folders inside your project. See Fig. 17 for a screenshot. Packages are a really important concept for organizing your applications. They are always written in lowercase letters. Usually they start with the reversed internet domain name: net.exampleName.package. Or in the above example: de.fraunhofer.scai.bio.bionlpsharedtaskutils.configuration.

This indicates the package configuration within the project de.fraunhofer.scai.bio.bionlpsharedtaskutils. Exceptions are only allowed for Java built-in packages, which start with java.


Fig. 17 A list of packages inside of a project

The keyword is import, and it has to be written in the first lines of the source code, before the line defining the class. Java comes with a lot of packages providing basic functionality. For example, the class Scanner in package java.util provides an easy way to read input from the command line:

Listing 6 InputTest.java
import java.util.Scanner;

public class InputTest {

    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        System.out.print("Please enter your name: ");
        String input = sc.next();
        System.out.println("Hello " + input + ".");
    }
}

The program above will first of all import the java.util.Scanner package. The main method creates a new Scanner object with the name sc. The String input will be assigned from the user's input.
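An example console interaction (the entered name is, of course, arbitrary):

Please enter your name: Hubert
Hello Hubert.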

Introduction to Java

51

As you can see, it is also possible to import single classes from a package. The above example includes the class Scanner from the package java.util. It is also possible to use patterns like

import java.util.*;

which will include all classes from the package java.util.

3.6 Exceptions

Exceptions are an easy way to handle failures and problems which occur in your application. For example, division by zero is forbidden. If you have user input and the user inputs zero, you need to catch that problem. Java offers a very nice option for doing that: throwing and catching an exception. You may have seen an application exit with an exception: java.lang.NullPointerException: Cannot get property 'statements' on null object at .... If you do not want your application to terminate, you need to catch that exception:

try {
    // Exception may occur here
    ...
} catch (Exception e) {
    e.printStackTrace();
}

The exception will be thrown as an object of type Exception. This is the most general exception and will catch all exceptions. You can also directly catch specific exceptions like IOException, FileNotFoundException and so on. See Table 6 for a non-complete list. You can also catch multiple exceptions:

try {
    // Reading data and doing computation
    ...
} catch (FileNotFoundException e) {
    System.out.println("File not found!");
} catch (ArithmeticException e) {
    System.out.println("Division by zero!");
}


Table 6 Most common Java exceptions. The last two may occur in all situations

Exception                        Occurs while...
Exception                        Most general exception
ClassCastException               Casting
ArrayIndexOutOfBoundsException   Iterating over arrays
NullPointerException             Iterating over arrays
NumberFormatException            Computing numbers
IOException                      Reading or writing data
FileNotFoundException            Reading or writing data
EOFException                     Reading or writing data
NullPointerException             (any situation)
IllegalArgumentException         (any situation)

If there is something you really need to do after an exception is raised, you can use the finally-block:

try {
    // Exception may occur here during reading a file
    ...
} catch (IOException e) {
    e.printStackTrace();
} finally {
    // Close file handler, etc.
    ...
}

This is basically all you need to know to catch exceptions and handle errors that may occur. Java also forces you to put functions that may raise an exception in try-catch-blocks. But there are some very bad ideas when handling exceptions. First, a catch-block should never be empty. At the very least you should emit a warning, but it is best to really handle the error; otherwise your application may not terminate in a proper way. And remember: all commands that are written after the point of the exception will not be executed! The second bad idea is not using the finally-block when you catch specific exceptions. If you write methods, you may either forward your exceptions to the calling method or throw your own exceptions. Simply add throws to your declaration:

public static int division(int a, int b) throws Exception {
    return a / b;
}

This function forwards the exception that may be thrown during division to the calling class. You may also throw a new, possibly more specific, exception:

Introduction to Java

53

public static int division(int a, int b) throws Exception {
    try {
        return a / b;
    } catch (Exception e) {
        throw new NumberFormatException("Division by zero!");
    }
}

To finalize this section: it is often a good idea to create your own exceptions. You can simply extend the master class Exception:

public class MyNewSuperException extends Exception {
    public MyNewSuperException(String message) {
        super(message);
    }
}

After importing that class, you can use it within your code. It is a good idea to be as specific as possible but as general as needed.
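A brief usage sketch for such a custom exception (the message text is arbitrary):

try {
    throw new MyNewSuperException("Something went wrong");
} catch (MyNewSuperException e) {
    System.out.println(e.getMessage());   // prints the message set above
}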

4 Adding External Libraries

External libraries extend your Java projects with new classes, methods and functions. We have already discussed in Sect. 1.5 how to automatically add and build dependencies that are available as Maven resources. Sadly, not all libraries are available in such a way.

4.1 Adding an External Jar-File

Some libraries are shipped as a simple jar-file. You can simply download this file and add it as a dependency to your project. Check your project's Properties, click on Java Build Path and choose the tab Libraries, see Fig. 18. Choose

• Add JARs... if you have a jar-file within your Eclipse workspace and
• Add External JARs... if you have a jar-file within your file system and outside of your Eclipse workspace (for example, if you have downloaded it).

In this dialogue you can also handle your system libraries and add external class folders.


Fig. 18 The Java Build Path dialogue

4.2 Building External Libraries

It is often necessary to build external libraries if they are not shipped as a jar-file. Please consult the instructions for each library. The most common cases are:

• Make-files are a Unix way of automatically building packages. Use configure, make, make install.
• Ant is a build tool similar to Maven. Use ant on the command line or use it in Eclipse.

Afterwards you have created a jar-file and can proceed as described in Sect. 4.1. Another possibility is to install libraries with your system-wide package manager (yum, aptitude, ...). This is very easy, but sometimes you have to add the class path to your project as described in Sect. 4.1.

Basic Data Processing

Jens Dörpinghaus, Vera Weil, Sebastian Schaaf, and Alexander Apke

Abstract This chapter goes hand in hand with the second chapter, providing the reader with more information about Java. In particular, it focuses on basic data processing. Amongst other topics, common data structures and their usage as well as aspects of the object-oriented programming paradigm are introduced.

The first section, Data Architecture and Data Modeling, is a short re-introduction to object-oriented programming with a focus on data processing. After that we will focus on the concepts of data representation using lists and other data structures. We will then describe how parameters can be used to make applications more flexible and to handle and manipulate data processing. In the fourth section, we will discuss how data can be read from and written to files. In the last section, we will give a preliminary introduction to mathematical computation and basic statistics. This chapter describes the basic foundations that are used in the next chapters.

J. Dörpinghaus (B) Federal Institute for Vocational Education and Training (BIBB), Bonn, Germany e-mail: [email protected] J. Dörpinghaus · S. Schaaf German Center for Neurodegenerative Diseases (DZNE), Bonn, Germany e-mail: [email protected] V. Weil Department of Mathematics and Computer Science, University of Cologne, Albertus-Magnus-Platz, 50923 Cologne, Germany e-mail: [email protected] A. Apke Department for Mathematics and Computer Science, University of Cologne, Cologne, Germany e-mail: [email protected] © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 J. Dörpinghaus et al. (eds.), Computational Life Sciences, Studies in Big Data 112, https://doi.org/10.1007/978-3-031-08411-9_3


1 Data Architecture and Data Modeling

Before handling any problem, we need to consider the data architecture of a corresponding program. Since an application usually has some input and output data, we need to know in which format the input and output have to be and how we want to represent the data inside our application. Good news: Java is object-oriented and thus makes it very easy to store data. If we want to go a step further, we will see that data modeling is a complex thing. We may think about highly re-usable designs that are open for other purposes ('generalization'). If the data representation gets very complex, we need to think about scalability and computational feasibility. Object technology as used in Java has the advantage of improving both the quality of your code and your productivity, because it is easy to read and to write. But compared to more 'classical' programming paradigms it is a different way of thinking, so we need to go ahead with some examples.

1.1 A Primer on Object-Oriented Programming

This little section is really a primer. There are various resources available in literature, books and on the web that do a good job of explaining object-oriented programming in detail. So this is "object-oriented programming in a nutshell" for those who have never heard about it, or simply a recapitulation for those coming back to this topic. The opposite of object-oriented programming is structural programming. This is used, for example, in (historically) older languages like C or COBOL. Most newer languages since the 1980s follow object-oriented paradigms, somewhat more obviously as in Java, or more hidden as in Python. Structural programming is based on the data structure in the background, because the data needs to be defined beforehand. Objects combine methods with the program instructions and the data with its attributes; that is, data and behavior are packed into one single unit. This principle is called encapsulation. Objects communicate with other objects via methods available from outside. A method or attribute can be public, and thus available from outside, or private and only accessible from inside the object. An object can have parent objects and child objects. A parent object is more general than its child, whereas the child is more specific. A child re-uses all attributes, data and methods coming with the parent object. Thus, if you write code for a parent object, which is more general, these implementations are available for all child objects. A child object extends the parent object, which is also the keyword used in Java. This principle is called inheritance. A special way of inheriting objects is the interface or abstract object. An abstract class can contain abstract methods, which hide the implementation. If a method is abstract, it has no content. Each object extending this object must


implement it, since it is not implemented yet. An interface is a class where all methods are abstract. It only gives the framework for all objects implementing this interface. This concept is widely used if an object needs to define methods but cannot know how they should be implemented. A class is defined with the keyword class:

public class Thing {
}

We need to give it some attributes. A thing might have a name and an owner. We will allow text:

public class Thing {
    String name;
    String owner;
}

If we create a new instance or object of class Thing, we can do it with new and access its attributes like this:

Thing one = new Thing();
one.name = "Random name";

We will discuss some more examples in the next sections. It is important to mention the keyword this. It can be used to distinguish between local attributes living in the object and those coming from outside. A little example:

public class Thing {
    String name;
    String owner;
    void setName(String name) {
        this.name = name;
    }
}

Within the method setName we have two instances of the variable name: one local, one coming with the method as a parameter. The keyword this indicates which one is meant. If an object contains methods or attributes that are connected to the class or the object itself, but not to the instances created from this class, we use the keyword static. For example, Integer.MAX_VALUE returns the highest possible value of any integer number defined in Java. It is a value that is connected to the class Integer and is not only of relevance for one specific instance of Integer. Hence the constant Integer.MAX_VALUE is a static attribute of the class Integer. Another example is Math.max(): it returns the maximum of two numbers and is static because it is not related to data stored in Math. Thus static indicates that we are using the class itself, not a distinct instance of it.
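Both examples can be tried directly, since static members are accessed via the class name and no instance is needed:

System.out.println(Integer.MAX_VALUE);   // 2147483647
System.out.println(Math.max(3, 7));      // 7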


1.2 How Objects Are Represented in Java and Can Be Used to Store Data

Objects are used to create an object-centered representation of real-world data and entities. This means an object is a representation of something which has attributes and functions. For example, a cup can be used to drink. It has attributes covering the filling, the size and the filling height (empty or full). And you may also imagine two methods: filling something in and drinking from it. See Fig. 1 for an illustration. This directly indicates a great benefit of objects: if you have already implemented a cup, you can also use this work to implement a pot. It also has some filling inside; you only need some more attributes like temperature and so on. Thus take care that your code remains re-usable. Every class has

• attributes, which define what an object has, and
• operations, which define what an object can do.

If we once again turn to Fig. 1, we can see that the attributes usually live inside an object and are not accessible from outside unless you use the operations or methods to change them. For example, we will create a new class called Patient. We want to manage some patients in a hospital. Let us assume we have their name and the room number. We can create something like:

Listing 3.1 Patient.java
public class Patient {

    private String name;
    private int room;

}

Now we have a new class with two attributes: name and room. This class currently has no operations. As you can see, there is a new keyword: private. It makes a variable or method only accessible from inside the class. If we change private to public we could modify the variable from outside—that is, with methods not necessarily defined within the class. But let’s start by creating an object from this class to illustrate the idea of public and private elements.

Fig. 1 An example object: a cup with three attributes (filling, size, filling height) and two operations or methods (fill in, drink)


1.3 Classes: How to Create an Object from a Class

We will save that class in its own Java file and our main test code, with the main method, in a different file. In this file, we can create a new instance of Patient, which means we create an object from a class:

Listing 3.2 ListTest.java
public class ListTest {

    public static void main(String[] args) {
        Patient newPatient = new Patient();
    }
}

But how do we set name and room? As you can see, we have defined the variables inside the class as private. We cannot access them from outside. This is a coding convention, since technically it is also possible to make them public. But we want more control over what happens and to catch errors, like assigning a text to a number. Thus we define public methods, accessible from outside, that get and set the variables. For example,

public String getName() {
    return name;
}
public void setName(String name) {
    this.name = name;
}

defines a method getName that returns the name stored inside this object. And setName sets the name inside the object to a variable passed to this function. The keyword this refers to the internal variable. Every class has a constructor method. It is usually public and has the same name as the class, in our case Patient. We can pass arguments to it so that we can initialize some internal variables. For example, we do not want to instantiate a new patient without a name. So, the class will look like this:

Listing 3.3 Patient.java
public class Patient {

    private String name;
    private int room;

    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public int getRoom() {
        return room;
    }
    public void setRoom(int room) {
        this.room = room;
    }


    public Patient(String name, int room) {
        this.name = name;
        this.room = room;
    }
}

We have the internal variables which we already discussed, and we have four getter and setter methods for the two variables. The last method is the constructor, which takes two variables and initializes both. We can create a new patient with this code:

Patient newPatient = new Patient("Hans Meier", 23);

which will create a new patient with name “Hans Meier” and the room number 23.

1.4 More Information on Class Inheritance

There is a lot more to say about object inheritance and construction. We will roughly discuss two main topics. You can use extends to extend an existing class. For example

public class SuperPatient extends Patient

will have all already defined properties and methods of the class Patient. But we may add more properties and methods, and even override existing ones. As you can see, this is a good way to save time. Another useful concept is introduced with interface. An interface declares properties and methods that have to be implemented by all classes that implement this interface. For example, we could imagine an interface "Human" (or "Person") for our hospital management software. We can then add more specialized classes for employees, patients, doctors and so on, which inherit common attributes like 'date_of_birth' from the superclass 'Human'. Then we can save a lot of work because we can re-use every function that works for "Human". Imagine further that we have a method called sayHello in the interface "Human". Having an instance of an object implementing this interface and representing a tourist from Germany, sayHello could be implemented in such a way that the word Hello would be written on the console. Another instance representing a tourist from France would write Bonjour on the console. This principle of methods somehow adjusting to the implementing instance is called polymorphism.
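A minimal sketch of this idea (the tourist class names are illustrative, not part of the hospital example above):

public interface Human {
    void sayHello();
}

class GermanTourist implements Human {
    public void sayHello() {
        System.out.println("Hello");
    }
}

class FrenchTourist implements Human {
    public void sayHello() {
        System.out.println("Bonjour");
    }
}

Calling sayHello() on a Human reference then behaves according to the actual instance: new GermanTourist().sayHello() prints Hello, while new FrenchTourist().sayHello() prints Bonjour.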


2 Using Lists and Other Data Structures

We have already discussed arrays in the last chapter. But now we will discuss lists, which can handle new entries while the application is running, and some other data structures like stacks. A list stores data in sequence, and this sequence has a fixed ordering. All lists implement java.util.List, which already predefines some methods. Every field can hold any object (as in fact it just references it). We will now discuss several special classes which implement java.util.List.

2.1 Using ArrayList

We will first focus on ArrayList, which is provided by the class java.util.ArrayList. We need

import java.util.ArrayList;

to import it. The creation of a new ArrayList works exactly the same way as the creation of a new object or variable. But there is a small difference: lists need an additional type parameter that describes what kind of variable or object we want to store inside the list. For example

ArrayList<String> liste = new ArrayList<String>();

creates a new ArrayList called liste that can store Strings. This can also be done using

List<String> liste = new ArrayList<String>();

An ArrayList provides some methods, most of them inherited from java.util.List. For example, add adds a new element to the list, and get(i) returns the entry at the ith position. We will also discuss remove and size.

ArrayList<String> liste = new ArrayList<String>();
System.out.println(String.valueOf(liste.size()));
liste.add("This is a String");
System.out.println(liste.size());
System.out.println(liste.get(0));
liste.add("This is already my second String");
System.out.println(liste);
liste.remove(1);
System.out.println(liste);

We have already discussed the creation of a new ArrayList. The second line prints out the size of the list. Remember: String.valueOf converts an integer value to a string value which can be printed with System.out.println. However, since the class ArrayList has the toString method already implemented, we could as well just omit String.valueOf. In both cases, the output will be 0. We then add a new String with the value "This is a String", which


will cause the next line (here, String.valueOf is omitted) to print '1' because we now have one element in the list. The next line will print the value of the string stored at the first position. Lists in Java, like in the majority of programming languages, start counting their elements with 0 (they are "0-based"). So the first element has index 0, the second 1 and so on. We will then add another string value and print out liste. This works since an ArrayList automatically converts to a string if needed (in technical terms, it implements toString()). In the next line we remove the variable stored at index 1 in the list, which means we delete the second item. The expected output of this application is thus:

0
1
This is a String
[This is a String, This is already my second String]
[This is a String]

But coming back to our Patient: adding a new Patient would look like this:

ArrayList<Patient> liste = new ArrayList<Patient>();
Patient newPatient = new Patient("Hans Meier", 24);
liste.add(newPatient);

There are several keywords and functions you should know when using Lists and ArrayLists. For example, there is an easy way to check if an element is contained in a list:

if (list1.contains(el)) { ...

The method contains returns true if the parameter exists within the list. Now we know how to create a new ArrayList, write classes, create objects and store data inside them.

2.2 Using LinkedList

A LinkedList also implements the interface java.util.List, and thus most things work exactly the same way as we have seen with ArrayList. First of all, we need to import the package with

import java.util.LinkedList;

Now we can create a new, empty LinkedList using

List<String> liste = new LinkedList<String>();

The difference between LinkedList and ArrayList is more a technical one: an ArrayList stores the data in an internal array, whereas LinkedList uses the technology of linked lists. If you need to randomly access objects within the sequence, ArrayList is much faster. But if you want to delete or add objects



somewhere in the middle of a list, LinkedList is much faster. Thus you should choose the data structure carefully depending on your needs. The rule of thumb is that arrays have to be initialized with a certain size; if this needs to be extended at runtime, a new array of suitable size has to be created and filled with the values of the old one plus the additional values, and the old array has to be deleted. Linked lists are dynamic in size, as one element just references its successor (or even its predecessor in the case of doubly linked lists). These references can be dynamically redirected, so adding or removing elements is fast and memory efficient. In turn, random access to a certain element of a linked list requires running over all preceding (or, reversely, succeeding) elements, while the array can be accessed directly at every position. (In fact, the Java compiler may overcome these issues at compilation time, as code optimization tries to detect and improve possible sub-optimal code introduced by a developer... But it is always better to keep an eye on this!) We have already discussed static arrays in Sect. 2.1. They do not naturally fit dynamic data structures like lists. But from time to time it is necessary to convert between both. This can be done using Arrays.asList():

String[] names = {"Hans", "Klaus", "Bob", "Charles"};
List<String> namesList = Arrays.asList(names);

2.3 Using Collections and Stack

More general than lists are data structures called collections. They are dynamic collections of objects. All collections inherit from java.util.Collection; for example, List is a Collection. This is why methods like add, addAll, clear, isEmpty, remove and so on are also available in collections. A stack is a data type which has two special operations: push and pop. With push an element is stored on top of the stack; with pop the element which was most recently added is returned and removed. This is like stacking things one on top of the other: the most recent one can be taken back first (stack = 'Last In, First Out' = LIFO). Thus, an easy example is the following:

Stack<String> stack = new Stack<String>();
stack.push("Element 1");
stack.push("Element 2");
System.out.println(stack.pop());

which will print "Element 2" on the screen.


2.4 Sorting a Collection

If you have an object of type Collection (and, as stated previously, most lists, for example ArrayLists, are), it can be sorted very easily. Just use Collections.sort():

Stack<String> stack = new Stack<String>();
Collections.sort(stack);

This works well for all simple variable types like strings and integers. In other words: this works for all kinds of objects that implement the interface Comparable. This interface imposes a total ordering on the objects and makes sure that these objects have a function called compareTo to compare an object with another one. Another example using ArrayList and strings:

ArrayList<String> liste = new ArrayList<String>();
liste.add("Xavier");
liste.add("Sebastian");
liste.add("Anton");
liste.add("Wladi");
liste.add("New String 2");
liste.add("New String 22");
Collections.sort(liste);
for (String elem : liste) {
    System.out.println(elem);
}

This example will output the following order:

Anton
New String 2
New String 22
Sebastian
Wladi
Xavier

The method Collections.sort() takes an optional second parameter: an object of type Comparator. You can build your own comparators. This is needed, for example, if you have a list with custom objects which do not have an existing Comparator. Coming back to the object Patient we discussed in a previous section:

public class Patient {

    private String name;
    private int room;

    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public int getRoom() {
        return room;
    }
    public void setRoom(int room) {

Basic Data Processing

65

16 this.room = room; 17 } 18 19 public Patient(String name, int room) { 20 this.name = name; 21 this.room = room; 22 } 23 }

We could either sort it by name or by room. A custom comparator for the name would thus look like this:

public class PatientNameComparator implements Comparator<Patient> {
    @Override
    public int compare(Patient o1, Patient o2) {
        return o1.getName().compareTo(o2.getName());
    }
}

Since getName returns a string value, it has a built-in method called compareTo. Thus the comparison is simply delegated to the comparison of strings. This would work similarly for integers as used in getRoom. An ArrayList of Patient could thus be sorted using

Collections.sort(patientsList, new PatientNameComparator());

If you do not want to, or simply cannot, fall back on compareTo, the return value of compare should be a negative integer (if the first argument is less than the second one), zero (if both are equal) or a positive integer (if the first argument is greater than the second).
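As an illustration of this contract, a comparator sorting patients by room number could be written as follows (PatientRoomComparator is a hypothetical name, not taken from the text above):

import java.util.Comparator;

public class PatientRoomComparator implements Comparator<Patient> {
    @Override
    public int compare(Patient o1, Patient o2) {
        // negative if o1's room is smaller, zero if equal, positive otherwise
        return Integer.compare(o1.getRoom(), o2.getRoom());
    }
}

It can be used in exactly the same way: Collections.sort(patientsList, new PatientRoomComparator());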

3 Handling Parameters

Starting an application is usually done on the command line; even desktop or start menu icons are shortcuts for a command line command. To pass information to the application, command line arguments or parameters are used. Thus, every application may have one or more parameters. For example, to start vim and open a file called test.tex from the UNIX command line you can execute

vim test.tex

Or to copy this file to /home/jens/folder you can use

cp test.tex /home/jens/folder

But some values are also passed as flags, either as letters (-l) or as written words (--all). For example, ls -a and ls --all do the same. It is also possible to add further information to these flags, for example --min 20 --max 50. Over time, conventions turned out to be helpful and were therefore accepted world-wide as good programming practice. So, if a program comes with positional parameters (mostly, they are mandatory) these are stated first (e.g., an input file) and in a


dedicated order, while non-positional parameters come with so-called 'switches' introducing the following element as a certain input. For example, an input file path is often declared as -i [path_to_file] (short form; single dash) or --input [path_to_file] (long form; double dash). Although there are exceptions (e.g. 'find -name') for historic reasons, this "GNU convention" should be respected. A single dash followed by multiple characters is expected to be interpreted as multiple single-character options (e.g. ls -art = ls -a -r -t). Often, option parsers (classes which read the command line call) are designed to provide both short and long form options for the same input parameter. Also, many option parser classes provide the automatic generation of one of the most important sources of help for a user who does not know how to use a particular tool on the command line: the synopsis. The usual action carried out when starting a tool with (a) no arguments at all (although some are mandatory), (b) missing mandatory arguments, (c) misconfiguration or simply (d) calling XYZ -h or XYZ --help is to print the synopsis (a short, structured 'how-to') and maybe a more dedicated error message. The synopsis usually looks like the following:

USAGE: myTool positional_parameter_1 -i mandatory_argument [-o optional_argument]

Providing such a synopsis is the absolute minimum a developer has to provide to a user when distributing executable code (and even for yourself—nobody wants to analyze code in order to find out how to use it...). Coming back to Java, every main method has String[] args as parameter. This is the easiest and most native way to manage parameters. You can iterate over all parameters, for example with a for-loop:

for (String param : args) {
    System.out.println(param);
    // or something different
}
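If a flag is followed by a value, a hand-rolled parser already becomes clumsy. A minimal sketch (the flag -a and its error handling are made up for illustration):

int a = 0;
for (int i = 0; i < args.length; i++) {
    if (args[i].equals("-a")) {
        // the value belongs to the flag, so the next argument must be consumed as well
        if (i + 1 >= args.length) {
            System.err.println("Missing value for -a");
            System.exit(1);
        }
        a = Integer.parseInt(args[++i]);
    }
}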

Providing such synopsis is the absolute minimum a developer has to provide to a user when distributing executable code (and even for yourself—nobody wants to analyze code in order to find out how to use it...). Coming back to Java, every main method has String[] args as parameter. This is the easiest and most native way to manage parameters. You can iterate over all parameters for example, with a for-loop: 1 for (String param: args) { 2 System.out.println(param); 3 // or something different 4 }

But this approach makes it very complicated to parse dependent parameters like -a 20. This is the reason why there are so many libraries covering parameter parsing. We will focus on a single one: Apache Commons CLI (see http://commons.apache.org/proper/commons-cli/). This dependency can be resolved with Maven by adding it to your pom file:

<dependency>
    <groupId>commons-cli</groupId>
    <artifactId>commons-cli</artifactId>
    <version>1.4</version>
</dependency>

The command line parameters will be stored as options in an object called Options. Thus we import the two classes:

import org.apache.commons.cli.Option;
import org.apache.commons.cli.Options;

We can now create this new Options-object and an additional Option-object which we can add to the list of options:



Options options = new Options();

Option input = new Option("c", "cut", true, "cut-off value");

options.addOption(input);

As you can see, the constructor of Option takes four values: a short and a long form for the parameter (usage -c or --cut), a Boolean flag indicating whether the option is followed by a value (here it is set to true, thus it expects something like -c 20) and a description (which goes into an automatically generated help text). We can also flag an option to be required:

input.setRequired(true);

Now your application will terminate if this mandatory parameter is not set. But we need some additional imports to do the parameter parsing:

import org.apache.commons.cli.CommandLine;
import org.apache.commons.cli.CommandLineParser;
import org.apache.commons.cli.DefaultParser;
import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.ParseException;

We can now use a CommandLineParser, a HelpFormatter and a CommandLine object to parse the parameters received by args:

CommandLineParser parser = new DefaultParser();
HelpFormatter formatter = new HelpFormatter();
CommandLine cmd;

try {
    cmd = parser.parse(options, args);
} catch (ParseException e) {
    System.out.println(e.getMessage());
    formatter.printHelp("parsingtest", options);

    System.exit(1);
    return;
}

As you can see, a ParseException can be thrown and then some help messages will be printed. The first argument passed to printHelp describes the application name, the second needs an Options object to gather the information to display. If you execute this application without the mandatory parameter, it will result in this message:

Missing required option: c
usage: parsingtest
 -c,--cut <arg>   cut-off value

On the command line, you could simply append parameters to the call. In Eclipse, this can be done using Run as → Run Configurations; in the tab Arguments you can add parameters, see Fig. 2. You can now process options with cmd.getOptionValue, passing the long name of the option:


Fig. 2 “Run Configurations” window. Here you can add program arguments

Integer cut = Integer.parseInt(cmd.getOptionValue("cut"));
System.out.println(cut.toString());

As you can see, you need to take care of the proper conversion of the arguments: all arguments are parsed as String by default. We will now discuss another, more extended example:

Options options = new Options();

Option inputProt = new Option("p", "protein", true, "protein name to filter");
options.addOption(inputProt);
Option inputInt = new Option("i", "interaction", true, "restricts output to interactions of type");
options.addOption(inputInt);

CommandLineParser parser = new DefaultParser();
HelpFormatter formatter = new HelpFormatter();
CommandLine cmd;

try {
    cmd = parser.parse(options, args);
} catch (ParseException e) {
    System.out.println(e.getMessage());
    formatter.printHelp("protein-protein interactions", options);

    System.exit(1);
    return;
}

String protein = cmd.getOptionValue("protein", "");
String interaction = cmd.getOptionValue("interaction", "");

This example takes two parameters for filtering protein-protein interaction data. The first one, stored in the object inputProt, takes an arbitrary protein name to filter


the output. The second one, stored in the object inputInt, takes another string value to restrict the interactions to a distinct type. This example also introduces a second parameter for getOptionValue: the default value used if a non-mandatory parameter is not passed. Thus both variables will contain an empty string if -p or -i are not passed. Make sure that the default behavior of your application is suitable.

4 Reading and Writing Files and Data

Reading and writing files are basic tasks for any kind of data processing. Usually data is stored in files. Another possibility is that it comes in streams: you may either think of network streams (for example, if you get data from a measurement device) or, if the streams are behind an API, you get the data directly as data objects (for example, if you are using a web application). First of all, it is necessary to consider the type of data you want to read or write. Most files are—more or less sorted or arranged—text files. Text files have nothing in common with documents coming from a word processor or a spreadsheet program, although these applications can export text files. For example, LibreOffice or Microsoft Word use special formats (odt or doc) that store additional information about layout and formatting. So, it is better to use a real text editor, or simply cat or less on the command line, to take a look at the file content. Some data—like images and videos—is stored as binary data, which is by its nature not printable to a text-driven terminal.

It is history, but some files are still encoded using ASCII (American Standard Code for Information Interchange; limited to 128 characters in its original 7-bit form, or 256 characters in its extended 8-bit variants). Since a lot more characters were needed, this character encoding table was extended. Nowadays text files are usually encoded as UTF-8, which is backward compatible to ASCII (the first 128 characters of both systems match). Reading UTF-8 text as plain ASCII, however, will mess up your strings, as complex characters cannot be displayed in the simple ASCII encoding—artifacts due to chopped characters are the consequence. If you process a lot of data, you will also get in touch with other standards like the ISO encodings, and some problems may occur while processing data saved on other operating systems like Windows. For example, a line break is realized as "line feed" (LF; ASCII character 10) in UNIX, Android, macOS, AmigaOS, BSD, ..., while Windows, DOS, OS/2, Atari, ... use the two-character sequence carriage return (CR; ASCII character 13) followed by line feed. In other, historic operating systems even carriage return alone was used. As the names already suggest, these concepts date back to the time of (later electric) typewriters, which required a turn of the platen cylinder and a return of the carriage (the typing head) to the first (left) position. Also, most other contemporary keyboard concepts date back to (electromechanical) typewriters. Nowadays, the different implementations of line breaks still trouble data scientists' lives, as data is produced and handled in mixed


environments. So, always be aware of encoding and end-of-line notation. Converters are available and belong to the "default tools" of everyday work... Another difference concerns the permissions granted to files, which are handled in a completely different way on *nix operating systems than on other operating systems. Unix and Linux follow the concept of "Everything is a File" (or a stream—which is convertible in either direction). Also, the file system tree differs: on Windows you will find constructions with drive letters (C:\directory\...), whereas the Unix file system tree starts with a single root /. A lot of functions to check and verify directories and files can be found in the class java.io.File. In a portable environment like Java, these differences may result in major issues when using a tool on another OS than the one on which it has been developed...
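As a small sketch of writing portable file-handling code (the file name is made up), java.io.File can be used to build paths in an OS-independent way and to check them before reading:

import java.io.File;

public class FileCheck {
    public static void main(String[] args) {
        // File.separator is "/" on Unix-like systems and "\\" on Windows
        File f = new File("data" + File.separator + "input.txt");
        System.out.println(f.getAbsolutePath());
        System.out.println("exists: " + f.exists() + ", readable: " + f.canRead());
    }
}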

4.1 Text Files

Usually files are read and written using classes of types 'stream' or 'buffered'. The package java.io offers a lot of functionality for this purpose. If you want to use character-based text files, you can use Reader and Writer, whereas InputStream and OutputStream are used for binary data, see Sect. 4.3. If we want to read from files, we usually use the classes FileReader and FileWriter:

Listing 3.4 ReaderExample.java

import java.io.*;

public class ReaderExample {

    public static void main(String[] args) {
        Reader reader = null;

        try {
            reader = new FileReader("input");
            for (int c; (c = reader.read()) != -1; ) {
                System.out.print((char) c);
            }
        }
        catch (IOException e) {
            e.printStackTrace();
        }
        finally {
            try {
                reader.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

This will load the file input and print its content to the screen. See Fig. 3 for an overview of the project. The package throws different exceptions like FileNotFoundException while opening the file, or the mentioned and caught


Fig. 3 Package overview for the ReaderExample class

IOException. The read method returns an int, thus we need to cast it to char—which implies that this reading method reads every single character, one after another. If we want more control over our reading process, we may use the class BufferedReader. It needs a FileReader to be created and offers methods for reading an input line by line:

Listing 3.5 ReaderExampleLine.java

import java.io.*;

public class ReaderExampleLine {

    public static void main(String[] args) {
        Reader reader = null;
        try {
            reader = new FileReader("input");
            BufferedReader br = new BufferedReader(reader);
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        }
        catch (IOException e) {
            e.printStackTrace();
        }
        finally {
            try {
                reader.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

One more word on character encoding: the reader object only knows the correct encoding if you tell it. But not all readers let you do this: FileReader, for example, does not! Thus, if you want to change the character encoding you have to use InputStreamReader, as sketched below.
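A minimal sketch (the file name and the choice of UTF-8 are illustrative):

import java.io.*;
import java.nio.charset.StandardCharsets;

public class EncodingAwareReader {
    public static void main(String[] args) {
        // InputStreamReader lets us state the encoding explicitly
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream("input"), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                System.out.println(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Writing files using BufferedWriter works much the same way: we need a FileWriter with a filename and can use it to create the BufferedWriter.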


Listing 3.6 WriterExample.java

1  import java.io.*;
2
3  public class WriterExample {
4
5      public static void main(String[] args) {
6          BufferedWriter bw = null;
7          FileWriter fw = null;
8          try {
9              String content = "New content to file \n And a new line";
10             fw = new FileWriter("newfile");
11             bw = new BufferedWriter(fw);
12             bw.write(content);
13         } catch (IOException e) {
14             e.printStackTrace();
15         } finally {
16             try {
17                 bw.close();
18                 fw.close();
19             } catch (Exception e) {
20                 e.printStackTrace();
21             }
22         }
23     }
24 }

You can see that the exception handling is quite the same as in the reading process. But there are some interesting new points. For example, you will find \n in the string that is written in line 9. It refers to the line break we already discussed—\n is the newline character, and \r is the carriage return. On *nix operating systems you usually only use \n. If you do not see the new file in Eclipse, press F5 to refresh the project view. Now we have everything together to read and write data from files. We will next discuss how to read and write tables.

4.2 Tables

Data in tables usually comes as CSV files, a standard data exchange format. CSV refers to comma-separated values, which means the tabular data is arranged line by line and the fields are separated by a comma (or another delimiter, for example a tabulator sign \t). Thus the format is highly configurable—or, vice versa, not strictly standardized—and can store text as well as numbers (which in fact are technically stored as characters!). For example, if we have records holding user IDs, names and heights, a file may look like this:

10,Henry Ford, 182
11,Norbert Smith, 176
14,Dan Olsen, 188

Here the delimiter is the comma ",". Any field may be quoted, and as you will see in the next example, some fields must be quoted:


1 10,"Ford, Henry", 182 2 11,"Smith, Norbert", 176 3 14,"Olsen, Dan", 188

If the text fields were not quoted, the file would be corrupted: last and first name would be seen as separate fields. Blank spaces outside of quotes are not forbidden, but very uncommon. For usability reasons you will often see people using the semicolon ; as a delimiter. But take care, there is no mandatory rule for that. How can we read a CSV file with Java? Of course, we can read it line by line and split it at the delimiter with some complicated rules. Or we can let another package do all the magic (and take care of the common pitfalls). We will use OpenCSV, see http://opencsv.sourceforge.net/, and let Maven automatically resolve the dependencies, download the package and build our software. Simply add

<dependency>
    <groupId>com.opencsv</groupId>
    <artifactId>opencsv</artifactId>
    <version>3.9</version>
</dependency>

to your pom.xml and run Maven. We can now take the example from the last section and change it a little bit to read a CSV file:

Listing 3.7 CSVReaderExample.java

import com.opencsv.CSVReader;
import java.io.*;

public class CSVReaderExample {

    public static void main(String[] args) {
        Reader reader = null;
        try {
            reader = new FileReader("names.csv");
            CSVReader cr = new CSVReader(reader);
            String[] line;
            while ((line = cr.readNext()) != null) {
                System.out.println("id= " + line[0] + ", name= " + line[1] + ", height= " + line[2]);
            }
        }
        catch (IOException e) {
            System.out.println(e);
        }
        finally {
            try {
                reader.close();
            } catch (Exception e) {
                System.out.println(e);
            }
        }
    }
}


As you can see, we are now reading a String array which contains all elements of one line. We explicitly need to cast integers or floats since all values are read as strings. Once again: before reading a CSV file we need a deeper understanding of the format. The constructor of the CSVReader class has two optional parameters for the delimiter and the quotation sign. Usually the " sign is used for quotations, but if you need to change it to ' and the delimiter to a tabulator sign, which is written as \t, you can use

CSVReader cr = new CSVReader(reader, '\t', '\'');

The same options can be passed if we use the CSVWriter class to write CSV files. This works in much the same way as writing normal files:

Listing 3.8 CSVWriterExample.java

import java.io.*;
import com.opencsv.CSVWriter;

public class CSVWriterExample {

    public static void main(String[] args) {
        FileWriter fw = null;
        CSVWriter cw = null;
        try {
            String[] content = {"010", "Gus Miller", "172"};
            fw = new FileWriter("newcsvfile");
            cw = new CSVWriter(fw, '\t');
            cw.writeNext(content);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                cw.close();
                fw.close();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

As you can see, we are now using a CSVWriter object with a tabulator delimiter \t. The method writeNext expects a String array, and we can write line by line with it. You can, for example, use a for-loop to iterate over a list of rows, as sketched below.
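A minimal sketch (the row contents are made up, and cw is assumed to be the CSVWriter from the listing above):

// import java.util.ArrayList; import java.util.List;
List<String[]> rows = new ArrayList<>();
rows.add(new String[]{"011", "Ada Smith", "169"});
rows.add(new String[]{"012", "Bob Jones", "181"});
for (String[] row : rows) {
    cw.writeNext(row); // one CSV line per array
}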

4.3 Pictures and Other Binary Data

If we want to read and write pictures or other binary data, we can use InputStream and OutputStream from the package java.io. It is rarely necessary to write a custom binary data reader, because for most cases existing dedicated classes are sufficient.


Since Java 1.4 there is a package javax.imageio that offers very easy functions to read and write pictures as a BufferedImage. There are a lot of different possibilities to read and write images in Java because the Java API was expanded more and more over the years. You will find some functions in the class Toolkit or in the package com.sun.image.codec.jpeg for reading and writing JPEG files. But since javax.imageio is the easiest, we will first read and write a PNG picture with it.

Listing 3.9 SimpleImageReaderExample.java

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class SimpleImageReaderExample {

    public static void main(String[] args) {
        try {
            File file = new File("scai_logo.png");
            BufferedImage image = ImageIO.read(file);
            File file2 = new File("scai_logo_copy.png");
            ImageIO.write(image, "png", file2);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

First of all there are two File objects which manage the two file names. We load a picture as a BufferedImage and write it to a different file. Catching IOException is also necessary. That is all. ImageIO.read() can also handle URLs and sockets. We will now do some basic image manipulation, and we will see that there is some confusion with the classes: BufferedImage extends the class Image and also implements RenderedImage. Since ImageIO.write needs a RenderedImage and Image does not implement it, after working with an Image object we need to convert it back before writing. This can be done with this little helper function:

private static BufferedImage convertToBufferedImage(Image src) {
    int width = src.getWidth(null);
    int height = src.getHeight(null);
    int type = BufferedImage.TYPE_INT_RGB;
    BufferedImage dest = new BufferedImage(width, height, type);
    Graphics2D g2 = dest.createGraphics();
    g2.drawImage(src, 0, 0, null);
    g2.dispose();
    return dest;
}


Table 1 Mathematical methods offered by java.lang.Math

Method                      Functionality
abs(x)                      Absolute value |x|
signum()                    Sign function
max(a,b), min(a,b)          Maximum/minimum of a, b
ceil(), floor(), round()    Rounding of numbers
sqrt()                      Square root
exp(), log()                Exponential, logarithm

We can now use getScaledInstance() from the class Image to resize or scale an image. Since BufferedImage extends the class Image we can also call it on a BufferedImage, but it will return an Image. An image always offers the methods getWidth() and getHeight() to get the width and height in pixels, from which we can calculate a new height or width. There are also various scaling algorithms: Image.SCALE_SMOOTH offers the best quality, whereas Image.SCALE_FAST is faster. Now the code has changed to this:

try {
    File file = new File("scai_logo.png");
    BufferedImage image = ImageIO.read(file);
    Image scaledImage = image.getScaledInstance(
        (image.getWidth() * 50) / 100,
        (image.getHeight() * 50) / 100,
        Image.SCALE_SMOOTH);
    File file2 = new File("scai_logo_scaled.png");
    ImageIO.write(convertToBufferedImage(scaledImage), "png", file2);
} catch (IOException e) {
    e.printStackTrace();
}

It will resize the picture to 50% and save it as a new file. You will find many more manipulation functions and methods in the API documentation. We will discuss image manipulation and analysis in a later chapter.

5 Basic Mathematics and Statistics

You can do basic mathematics without any additional class. But the internal class java.lang.Math offers some additional utilities and mathematical constants. See Tables 1 and 2 for an incomplete list of the most useful methods and constants. Before using any of these additions, we can use basic arithmetic to do basic statistics. If we want to calculate the average of some floats stored in a List, we


Table 2 Mathematical constants offered by java.lang.Math

Attribute    Functionality
Math.E       Euler's number
Math.PI      Pi

can use a for-loop to iterate over the values, sum them up and divide by the list size:

static float average(List<Float> list) {
    float sum = 0;
    for (int i = 0; i < list.size(); i++) {
        sum += list.get(i);
    }
    return sum / list.size();
}

O(f) = {g | ∃ c, n₀ ∈ ℕ ∀ n > n₀ : g(n) ≤ c · f(n)}     (1)

Roughly speaking, this means that O(f) is the set of all functions that do not grow faster than f. We use the letter O because we also refer to this set as the order of the function f. For example, we have 4n ∈ O(10n) because 4n ≤ c · 10n holds for all n > n₀ if we choose e.g. c = 1 and n₀ = 0. We therefore say 4n is in the order of 10n. But contrariwise, we also have 10n ∈ O(4n) because it also holds that 10n ≤ c · 4n for all n > n₀ if we choose e.g. c = 5 and n₀ = 0. As we see, functions that only differ in a constant factor are of the same order. Consequently, we write 4n ∈ O(n) and 10n ∈ O(n). Also the addition of a constant value k does not affect the order of a function. Assume that 4n ≤ cn for all n ≥ n₀ holds. We can then simply define a new, greater value c₂ = c + k and get 4n + k ≤ c₂n for all n ≥ max(n₀, 1). Hence, also 4n + k ∈ O(n). So, what does affect the order of a function? For example, we have g(n) = 4n² ∉ O(n). We will see why. The question is whether we can find two constant values c, n₀ ∈ ℕ such that 4n² ≤ cn for all n > n₀. We can choose c arbitrarily large. But no matter how large we choose c, we always get 4c² > c · c and, more generally, 4n² > c · n for all n > c. We are therefore unable to find an n₀ ∈ ℕ such that the inequality from Definition 1 is satisfied, and hence 4n² ∉ O(n).
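As another worked example in the same style (the function is chosen by us for illustration): for g(n) = 3n² + 5n + 7 we can choose c = 15 and n₀ = 0, since for all n ≥ 1 we have 5n ≤ 5n² and 7 ≤ 7n², and therefore 3n² + 5n + 7 ≤ 15n². Hence g ∈ O(n²).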


Table 1 The hierarchy of the complexity classes

Complexity class    Description
O(1)                Constant functions
O(log(n))           Logarithmic functions
O(n)                Linear functions
O(n · log(n))       Linear-logarithmic functions
O(n²)               Quadratic functions
O(n^k)              Polynomial functions (with k ∈ ℕ)
O(k^n)              Exponential functions (with k ∈ ℕ)

public int[] countCommon(List<Integer> L1, List<Integer> L2) {

    int evencount = 0;
    int oddcount = 0;
    int[] output = new int[2];

    for (int x : L1) {

        if (search(L2, x)) {

            if (x % 2 == 0) {
                evencount++;
            }
            else {
                oddcount++;
            }
        }
    }
    output[0] = evencount;
    output[1] = oddcount;

    return output;
}

Fig. 5 An algorithm to count even and odd numbers that two lists have in common

by one, the running time is O(1) each. The operator % executes the modulo operation, which yields the remainder left after division. This means that by x%2==0 we check whether the remainder after dividing x by 2 is 0; in other words, we check whether x is even. For checking this condition, we have to perform an arithmetic operation and then compare the calculated value to 0. Both steps can be done in constant time, so the running time for checking the condition is O(1). In total, we get O(1) + O(1) = O(1) for the inner if-statement. For the next if-statement, we can calculate the running time by adding the inner running time and the time for checking the condition search(L2, x) by (8). We just calculated the inner running time of O(1), and from Sect. 1 we know that the running time of our search algorithm is O(|L2|) = O(m). Hence, the total running time of the second if-statement is O(1) + O(m) = O(m). For the entire algorithm countCommon(L1, L2) we therefore get the running time O(1) + n · O(m) by applying (7) and (11). Together with the calculation rules from Sect. 3.2, this gives the total asymptotic running time O(n · m).


public boolean binarySearch(List<Integer> L, int v) {

    int left = 0;
    int right = L.size() - 1;

    while (left <= right) {
        int middle = (left + right) / 2;
        if (L.get(middle) == v) {
            return true;
        }
        else if (L.get(middle) < v) {
            left = middle + 1;
        }
        else {
            right = middle - 1;
        }
    }
    return false;
}

For SQLite, the JDBC driver can be added as a Maven dependency:

<dependency>
    <groupId>org.xerial</groupId>
    <artifactId>sqlite-jdbc</artifactId>
    <version>3.18.0</version>
</dependency>

Before we continue with a little example, we have to load the driver. This can be done with the following lines of code:

Listing 6.2 DatabaseExample.java

import java.sql.*;

public class DatabaseExample {
    public static void main(String[] args) {
        Connection con = null;
        try {
            Class.forName("org.sqlite.JDBC");
            con = DriverManager.getConnection("jdbc:sqlite:test.db");
        } catch (ClassNotFoundException | SQLException e) {
            e.printStackTrace();
        }

These lines will also set up a new connection to the SQLite database stored in the file test.db. If the file does not exist, it will be created. We will now create a simple table with only two columns storing hospitals (name and id):

CREATE TABLE HOSPITAL (ID INT PRIMARY KEY NOT NULL, NAME TEXT NOT NULL)

Running the application for the first time will create the table; running it a second time will return an error message. Why? Because the table already exists. Thus we can use IF NOT EXISTS to prevent this:

CREATE TABLE IF NOT EXISTS HOSPITAL (ID INT PRIMARY KEY NOT NULL, NAME TEXT NOT NULL)

Now every communication with the database can be done using a Statement object, which can be created with con.createStatement(). This statement has two


methods: executeUpdate() and executeQuery(). The latter returns a ResultSet object which is iterable and can be used to parse the results from the database. A fully working example would look like this:

try {
    Statement stmt = con.createStatement();
    String sql = "CREATE TABLE HOSPITAL " +
                 "(ID INT PRIMARY KEY NOT NULL, " +
                 " NAME TEXT NOT NULL)";
    stmt.executeUpdate(sql);
    stmt.close();

    stmt = con.createStatement();
    sql = "INSERT INTO HOSPITAL (ID, NAME) " +
          "VALUES (1, 'Paul McArthur');";
    stmt.executeUpdate(sql);

    stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery("SELECT * from HOSPITAL");
    while (rs.next()) {
        int x = rs.getInt("ID");
        String s = rs.getString("NAME");
        System.out.println(s);
    }
} catch (Exception e) {
    e.printStackTrace();
} finally {
    try {
        if (con != null) {
            con.close();
        }
    } catch (SQLException ex) {
        System.out.println(ex.getMessage());
    }
}
}
}

stmt = con . c r e a t e S t a t e m e n t () ; R e s u l t S e t rs = stmt . e x e c u t e Q u e r y ( " S E L E C T * from H O S P I T A L " ) ; while ( rs . next () ) { int x = rs . g e t I n t ( " ID " ) ; S t r i n g s = rs . g e t S t r i n g ( " NAME " ) ; S y s t e m . out . p r i n t l n ( s ) ; } } catch ( E x c e p t i o n e ) { e . p r i n t S t a c k T r a c e () ; } finally { try { if ( con != null ) { con . close () ; } } catch ( S Q L E x c e p t i o n ex ) { S y s t e m . out . p r i n t l n ( ex . g e t M e s s a g e () ) ; } } } }

Here we see three statements, that is, objects of type Statement. In the end the connection is closed. In general a database connection should be kept open until the application terminates. Several SQLite examples close the connection after each statement; this ensures the connection is closed in the end, but it also causes many unnecessary operations, so don't do it like this. In general a database connection should be closed when it is no longer needed. SQLite is somewhat special, since it allows both an explicit (by code) and an implicit (program terminates) closing of connections. All outstanding transactions will be rolled back. But usually there are no outstanding transactions, so there is no need to worry too much about this issue when using SQLite.
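A common way to guarantee that a connection is closed is Java's try-with-resources construct. A minimal sketch (table and file names taken from the example above):

try (Connection c = DriverManager.getConnection("jdbc:sqlite:test.db");
     Statement s = c.createStatement()) {
    // both resources are closed automatically at the end of the block
    s.executeUpdate("CREATE TABLE IF NOT EXISTS HOSPITAL (ID INT PRIMARY KEY NOT NULL, NAME TEXT NOT NULL)");
} catch (SQLException e) {
    e.printStackTrace();
}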


There is also a more specific Statement class called PreparedStatement. It is often easier to use, since it stores the statement without values and thus reduces the execution time. In technical terms: the SQL statement is precompiled, whereas normal statements will be compiled within the DBMS. The above example with an INSERT statement will look like this:

String sql = "INSERT INTO HOSPITAL (ID, NAME) VALUES (?,?)";
PreparedStatement pstmt = con.prepareStatement(sql);
pstmt.setInt(1, 1);
pstmt.setString(2, "Duckburg Hospital");
pstmt.executeUpdate();

This approach can be used for an arbitrary number of parameters, each described by a question mark. It is also convenient to use prepared statements in a loop (see the sketch after the DELETE example below). The parameters can be set anywhere, for example in a WHERE clause. See the following UPDATE example:

S t r i n g sql = " U P D A T E H O S P I T A L SET NAME = ?

WHERE ID = ? " ;

P r e p a r e d S t a t e m e n t pstmt = conn . p r e p a r e S t a t e m e n t ( sql ) ; p s t m t . s e t S t r i n g (1 , " D u c k b u r g H o s p i t a l II " ) ; p s t m t . s e t I n t (2 , 1) ; p s t m t . e x e c u t e U p d a t e () ;

To complete the list of actions, a DELETE statement might look like this:

S t r i n g sql = " D E L E T E FROM H O S P I T A L W H E R E id = ? " ;

More things could be written about transaction management and JDBC. But since JDBC can be used for several database systems, we will continue with another DB system called H2.

2.2 H2

H2 is a popular relational database written in Java. Its main advantage is that it can either run as a classical server-side database or be embedded in a Java application. It can be used in nearly the same way as SQLite—this is the main advantage of JDBC. H2 can be used disk-based but also as an in-memory database. It has strong security options, allowing for example read-only databases and encrypted databases. It is also worth knowing that H2 supports updatable and scrollable large result sets as well as external functions. Once again the dependency can be added to Maven by dropping the following lines into pom.xml:

<dependency>
    <groupId>com.h2database</groupId>
    <artifactId>h2</artifactId>
    <version>1.4.197</version>
    <scope>test</scope>
</dependency>

And the previous example will run by changing the driver and getConnection:

Class.forName("org.h2.Driver");
con = DriverManager.getConnection("jdbc:h2:~/test", "sa", "");

Here a local database will be stored in the file ~/test. The next parameters are user and password. To be clearer, one may use helper variables:

static final String JDBCDRIVER = "org.h2.Driver";
static final String DBURL = "jdbc:h2:~/test";
static final String USER = "sa";
static final String PASS = "";

Class.forName(JDBCDRIVER);
conn = DriverManager.getConnection(DBURL, USER, PASS);

Good news: that was basically all the new stuff. As seen in the previous section, all the JDBC machinery can be re-used. All statements follow this structure:

Statement st = con.createStatement();
st.executeUpdate("sql statement");
con.close();

We will continue with the little hospital example. We already talked about SELECT, INSERT, UPDATE and DELETE and created tables. What is left? Deleting tables. But stop... before deleting anything we should make a backup. H2 has a built-in function which is very helpful here:

sql = " B A C K U P TO ’ b a c k u p . zip ’; " ; stmt . e x e c u t e U p d a t e ( sql ) ; stmt . close () ;

This creates a zip file with a complete backup of the database in SQL form. Now we can drop the table HOSPITAL:

sql = " DROP TABLE H O S P I T A L ; " ; stmt . e x e c u t e U p d a t e ( sql ) ; stmt . close () ;

This deletes the table itself together with all data in it. Deleting only the data, while keeping the table, can be done using TRUNCATE:

sql = " T R U N C A T E T A B L E H O S P I T A L ; " ; stmt . e x e c u t e U p d a t e ( sql ) ; stmt . close () ;

Using different SQL databases is easy with JDBC; in general it is possible to switch from one database backend to another without problems. We will now continue with a different database paradigm called noSQL.


3 Knowledge Graphs and noSQL Databases

After quite a lot of theoretical material in the previous chapters and sections, it is time for more examples of practical application. Unlike in most chapters before this one, we will not be using Java but Python for the most part. As you have most likely read in Chap. 1, there are a lot of pros and cons to each programming language. Since we would like to have a broad field of possible applications and focus on results rather than optimising each and every little bit, we will go for Python. However, once you have encountered several object-oriented languages, you will notice that their cores do not vary too much.

3.1 A Primer on Knowledge Graphs

Besides a programming language we will need a few other tools. First and foremost this is the graph. The underlying concept will later be explained in more detail in Chap. 13, Network Analysis: Bringing Graphs to Java. But as we already need it here, we will go through a few core definitions to know what we are working with.

Definition A directed graph D = (V, A) consists of a non-empty set V containing the nodes and a finite set A holding ordered pairs a = (vᵢ, vⱼ) of mutually distinct nodes vᵢ, vⱼ ∈ V.

The definition is quite mathematical and abstract. However, we can take a more heuristic look at a concrete example. In Fig. 2 we can see a small graph containing eight nodes and eight edges. The set of nodes here is V = {A, B, C, D, E, F, G, H} and the set of edges is A = {(A, C), (B, H), (C, B), (C, D), (E, C), (E, F), (E, G), (F, H)}. You can see that the graph is directed, as the edges do not go in both directions. Every social network is a good example of a huge graph: the people using the network represent the nodes and their relations represent the edges. Often, in the context of network analysis, edges are equivalent to relations. But let's go back to our example at hand. When we consider the nodes to be persons and the edges to be relations, we can easily imagine both carrying additional information. A person most likely has a name, an age and probably some other information. Quite similarly, relations can contain further information: for example, two colleagues can have a professional relationship but not be friends, whereas other people are friends but don't work together. Besides those attributes, the nodes can have labels. When, for example, we are looking at books, each book can get a label, e.g. crime, thriller, romance, sci-fi, etc. Looking back at our social network, we can see that it might not only consist of people: organisations, clubs, businesses etc. are represented there as well. People can follow their favourite musician or their local bookstore. So the elements of our graph can obtain a label, e.g. person, store, public personality etc. This line of thought easily leads to the following definition of a property graph:



Fig. 2 A small example of a graph

Definition A property graph G is a seven-tuple G = (V, A, L, P, X, λ, σ). Here, L, P, X are additional sets, whereas λ and σ are functions: λ : V → L attaches labels l ∈ L to the nodes v ∈ V, and σ : (V ∪ A) × P → X assigns properties p ∈ P with values x ∈ X to nodes and edges.

As a toy instance of this definition (all names and values are made up for illustration): take V = {v₁, v₂} and A = {(v₁, v₂)} with λ(v₁) = person, λ(v₂) = store, σ(v₁, name) = "Alice", σ(v₂, name) = "Corner Books" and σ((v₁, v₂), type) = "follows". The labels classify the elements, while the properties attach concrete data to them.

Having this concept in mind, we are nearly done with the graph theory part. What we really want to look at is the knowledge graph. A knowledge graph is a property graph that satisfies certain constraints. First of all, the model has to describe real-world or at least realistic entities and their relations to one another; note that our nodes now represent entities. Then, the classes of the nodes and relations follow a given schema. Furthermore, it has to be possible to link any two entities via a relation. And last but not least, we have to cover multiple topics. This last point is actually quite intuitive, as we aim to obtain new knowledge with the help of the graph, and therefore we need to think outside the box and look at the bigger picture. Thus, we formally define knowledge graphs as follows:

Definition A knowledge graph G = (E, R) is a graph where the set of nodes consists of entities and the set of edges consists of relations among them. E = {E₁, E₂, ..., Eₙ} is a union of structured ontologies Eᵢ. Analogously, R = {R₁, R₂, ..., Rₘ} is a union of inter- and intra-ontological relations. Additionally, every entity can be connected to further context Cᵢ ∈ C, C = {C₁, C₂, ..., Cₖ}, where the contexts in general are entities themselves.

For now these should be all the tools we need from graph theory, and although this concept seems quite abstract, we will soon see it come to life in the practical application. Before we get there, however, we need to look at the technical possibilities for storing such a possibly large-scale knowledge graph.


3.2 A Primer on noSQL Databases

As we already figured out, we need some space to store our data, and as we have seen at the beginning of this chapter, there are a lot of options to choose from. There, the focus was on SQL databases, whereas here we want to take a look at noSQL databases. Why, one might ask, as the introduction to SQL databases sounds great and seems quite practical? Of course there are many good reasons why SQL databases are so prevalent. However, there is also good reason to have a closer look at other options. So the first thing we will do is look at possible problems arising from SQL databases and how noSQL databases circumvent some of them. These days, big data is a really important topic, as in all parts of life we are surrounded by huge amounts of data. Especially in the life sciences, tons of data are produced every day. But before we dive deeper, let's have a quick look at big data and what actually hides behind the popular buzzword. The topic big data is oriented towards the three v's:

• volume: you are confronted with a very large amount of data,
• variety: the data you are provided with consists of a vast range of datatypes,
• velocity: data arrives at very high speed and therefore needs to be processed at the same speed.

Furthermore, there are two more v's one might consider: value and veracity. Above a certain amount of data, relational databases are no longer suitable. The range of the different data types (variety) does not agree with the data schemes of relational databases, which are predefined in advance. In addition, there is also a problem with the architecture: data volumes of 100 terabytes usually need to be distributed across several physical machines (also known as horizontal scaling), which increases complexity and can jeopardise stability. The alternative of vertical scaling is possible by acquiring new and improved hardware, which is associated with significantly higher costs. This leads to noSQL databases, which scale very well horizontally. noSQL (not (only) SQL) databases are non-relational databases that fulfil two conditions:

• Data is not stored in classic tables as in relational databases.
• The database language is not (only) SQL, so relational database techniques are not used exclusively.

The data within a noSQL database is stored in a highly distributed data storage architecture: as key-value pairs, in columns, in document stores or in graphs. Key-value databases are the simplest form. Data is stored using a data object as the identification key and a data object as the value. The key space does not support any structure other than the special characters used. The database is schema-free, which means that data objects can be stored in any form. An extension of this simple principle is offered by column-oriented databases. For improved read access, the data is stored in columns, since not all columns are


Fig. 3 Exemplary representation of two generic nodes in Neo4j that are connected by an edge

usually needed for a row but, conversely, there are frequently used columns. The data objects are then identified via row keys and the object attributes via column keys. Columns of the same type can be combined into column families; it is assumed that they can be read together. The schemata of a table only refer to column families, not to entire tables.

Document databases combine freedom of schema with the possibility to structure stored data, which is only possible to a very limited extent with key-value databases. Such structured data sets are called documents. The freedom from schema gives great flexibility in data storage, which makes them well suited for big data. At the outermost level, they are key-value databases whose stored values correspond to the documents. At a second level, these contain their own document-specific structure, usually consisting of recursively nested attribute-value pairs. These structures are also schema-free.

The graph database is clearly different from the other models. The data is stored as nodes and edges, each belonging to a type or label, and they themselves contain data as values for their attributes. The data of a graph database is represented as a graph; an example is shown in Fig. 3. The schema of the database is implicit, which means that new data can be added to the database without first declaring its type. This is then created by the database itself. Graph databases are used when the relations within the data are important; examples are social networks and the linking of websites to one another. A particular advantage is index-free adjacency: it is possible to determine the direct neighbours of a node without having to consider all edges, as would be the case in a relational database. The effort of querying a relation is therefore independent of the amount of data, whereas in a relational database it grows with the number of references searched. As you might have already guessed, the database we are going to use is Neo4j, a popular graph database. The corresponding query language is Cypher.
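The later examples use Python, but the same ideas can be sketched with the official Neo4j Java driver. A minimal sketch (connection details, credentials and data are made up for illustration):

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;

public class Neo4jSketch {
    public static void main(String[] args) {
        // hypothetical connection details for a local Neo4j instance
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            // create two labelled nodes and a relation between them
            session.run("CREATE (:Person {name: 'Alice'})-[:FOLLOWS]->(:Store {name: 'Corner Books'})");
            // index-free adjacency: ask directly for Alice's neighbours
            Result result = session.run(
                "MATCH (p:Person {name: 'Alice'})-[:FOLLOWS]->(n) RETURN n.name AS name");
            while (result.hasNext()) {
                System.out.println(result.next().get("name").asString());
            }
        }
    }
}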


3.3 Python: Neo4J

Now that we have all the necessary tools, we can start building a first real project. This chapter presents two similar but distinct projects to show the practical application in the life sciences. We want to present a comprehensible approach and use these projects to show the great potential, and also the restrictions, that lie in the combination of graph theory, life sciences and programming. The first example is motivated by the studies on integrative data semantics for research on neurodegenerative diseases at the Fraunhofer Institute SCAI in Sankt Augustin. The project is a collaboration with the German Center for Neurodegenerative Diseases (DZNE) and the university hospital in Bonn (UKB). It contains data from patients all over Germany suffering from dementia. The data is given in the form of a MongoDB database. The idea is now to transfer this data, in a suitable form, into a knowledge graph. This graph shall then be stored and queried in a Neo4j database instance. Later we will follow a similar idea and enhance it with link prediction on the graph. But for now let's take a look at what we have to do to realise this first project. Due to the sensitivity of personal clinical data we use a sample of artificial data. It was created to replace the real-world data, so its structure is quite similar to what we find in reality.

3.3.1 Arranging Clinical Data in a Knowledge Graph

We base our model and our work on an already existing knowledge graph which contains data from the PubMed database (https://pubmed.ncbi.nlm.nih.gov), which means it holds references and other data on biomedical topics and life sciences. This graph originates from an earlier project by scientists at the Fraunhofer Institute SCAI as well. The idea and concept are presented in the paper Towards context in large scale biomedical knowledge graphs (https://arxiv.org/abs/2001.08392). The topic is quite interesting and I would like to encourage the curious reader to take the time and have a look at the referenced project. The fundamental idea behind the graph was making use of the countless relations between the literature and being able to query the graph: it helps finding literature from the same author, the same journal or on the same topic. To be able to query the graph in a sensible way, the authors of the above-mentioned paper gathered possible questions from medical staff that can then be transformed into graph queries. These queries can be quite simple but also very complex. Thanks to the query language Cypher, the user has a broad array of possibilities when trying to extract knowledge from the data. And this is something a graph really shines at: finding related information. The knowledge graph for the PubMed database was built on a data schema which we take a quick look at in Fig. 4. We can see that the core node is the document itself, containing several attributes. Then we have some further nodes, for example the publication type, the author and the journal. Moreover, we have the entity node.


Fig. 4 Underlying data schema for the PubMed graph

It includes nodes of all kinds that can be viewed as an entity in the biomedical world. This generic approach makes the graph a lot easier to comprehend and allows for further application and extension, as we will see soon. Now, based on our given data, the question arises how to arrange them in a suitable schema and thus store them efficiently in a knowledge graph, so that the clinical questions can be answered with the help of graph queries. In doing so, the subsequent connection to the graph of the PubMed database from the paper mentioned above should be made possible. For this purpose, the schema used there, which is shown in Fig. 4, is to be extended into a new schema. The given data from the IDSN study, or the artificial data used here, will then be transformed into a knowledge graph using the new schema. This is done via a mapping between the entities and several ontologies with context. Here, mapping means a function M : E(D) → E(O), where E(D) represents the entities of the given data and E(O) the entities of the ontologies used. The mapping thus consists of edges that create a logical relation between biomedical data and knowledge in the knowledge graph.
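As a concrete illustration of such a mapping (the entities are made up): if the clinical data contains a diagnosis entry for Alzheimer's disease, the mapping M would connect this data entity to the matching disease entity in one of the used ontologies, so that every patient with this diagnosis becomes reachable from the ontology side of the graph and vice versa.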

3.3.2 Queries and Their Categorisation

When looking at the possible questions we want to ask, we need to categorise them for our queries. In the paper we base our work on, the authors have already proposed a categorisation for their queries, which is presented in Fig. 5. The scheme in Fig. 5 first distinguishes between two different forms of queries. Those that refer to local structures of the graph are given an entry point and search locally in a k-neighbourhood around it. The second category contains both global and local queries. Theoretically, the queries classified there can search the entire graph. However, in the application case, they can be restricted by exclusively considering a subgraph or given entry and exit points in such a way that only local structures are searched. An example of this is the subcategory aggregation: there, it is possible to search for the average node degree in the entire graph G = (V, E) as well as for the node degree of a single node v ∈ V. Local structures include the three areas of graph navigation, adjacency query and pattern matching. In the first grouping, the following categories are named:

• RPQ (regular path query): A pair of nodes (u, v) is sought such that a path exists between the nodes whose sequence of edge labels matches a given pattern, given as a regular expression. An RPQ thus has the form ans(u, v) ← (u, r, v), where r is a regular expression over a finite alphabet Σ and ans(u, v) denotes the return of the query.
• Shortest path: A shortest path is searched between a given pair of nodes (u, v).

In adjacency queries, conjunctive queries, abbreviated CQ, can be found. These have the form ans(z₁, ..., zₙ) ← ⋀ᵢ₌₁ᵐ (xᵢ, aᵢ, yᵢ) over a finite alphabet Σ. Here, each xᵢ and yᵢ is a node variable or a constant. Furthermore, we have aᵢ ∈ Σ, and each zᵢ is a node variable xⱼ or yⱼ, j ∈ {1, ..., m}. An example of an application would be to search for a patient who has received two different diagnoses. CRPQs and ECRPQs are then found under the grouping pattern matching. The former are defined in exactly the same way as

Fig. 5 Categorisation of graph queries from the paper Towards context in large scale biomedical knowledge graphs


CQs, except that a regular expression rᵢ over Σ is used instead of the single symbol aᵢ ∈ Σ. In the second category, which considers local and global structures, one finds four subcategories. Aggregation has already been discussed above. Pathfinding considers the following three groups:

• All pair shortest path: For a graph G = (V, E), a shortest path is searched between each pair of nodes in the graph.
• Minimum spanning tree: A minimum spanning tree is searched for the graph.
• Random walk: For a graph G = (V, E), starting from a start node, a random adjacent node is selected and added to the path P. From this node, a random adjacent node is selected again. This continues until a previously determined path length for P is reached.

The group centrality, as the name suggests, looks at different measures of centrality:

• Degree centrality: Degree centrality measures the node degrees of the nodes of a graph. The more central a node is, the higher its degree.
• Closeness centrality: For a graph G = (V, E), the nodes that lie as centrally as possible in G are sought. For this purpose, the lengths of the shortest paths from a node u to all other nodes are calculated, summed up and then inverted: C(u) = 1 / Σᵥ d(u, v), where d(u, v) represents the distance between two nodes u, v. To compare graphs of different size, this can be normalised by multiplying it with n.
• Betweenness centrality: This is a measure of the importance of a node v for the entire graph. It measures how many shortest paths between any two other nodes the node v lies on.
• Page rank: The term was originally introduced by Google and provides information on how strongly a webpage is linked to other pages. The greater the number and the more important the pages that link to it, the higher the page rank. Transferred to graph theory, the relevance of a node can thus be measured by the edges leading to and away from it.

Breadth-first search and depth-first search are also considered in the diagram. Since they are widely used and well known, we will not go into detail on them. Finally, the community category is divided into the following sub-areas: triangle count, strongly connected components, label propagation and louvain modularity. However, since we won't make use of those, we'll leave it up to you to find further information if need be. Interestingly, many of the implementations of these algorithms in Neo4j have poor runtimes on large networks. It is much more efficient to outsource the algorithms and communicate with the graph database using simple queries. However, this is nothing we want to go into further detail on. After having gathered and categorised the clinical questions, we have to adapt a little to the usage of artificial data. Not all the information from the real-world data can be found in our artificial subset, so we have to change the queries accordingly. Note that this is done while preserving each question's category. The result can be seen in Table 1.


Table 1 Clinical questions, their complexity class and replacements for testing purposes. Here DC refers to Degree Centrality, SP to Shortest Path and BC to Betweenness Centrality.

1 (RPQ). Question: For which patients do complete neuropsychological tests exist? Replacement: Which patients have the most distinct HGNC values?
2 (RPQ). Question: Which measurement values are most common in the context of {diagnosis1}? Replacement: Which patients are found most often in the context of a risk group {RiskGroup1}?
3 (RPQ). Question: Which measurements are collected at the same time? Replacement: Which HGNC values are most commonly collected during a certain visit {visit1}?
4 (CRPQ). Question: What does the chronological order of the measured values of entity {entity1} and risk group {RiskGroup2} for a patient {patient1} look like? Replacement: What does the chronological order of the measured HGNC values {entity1} and risk group {RiskGroup2} for a patient {patient1} look like?
5 (RPQ). Question: How many patients received a diagnosis within two days of their visit? Replacement: How many patients received a diagnosis on their first visit?
6 (CRPQ). Question: Which patients diagnosed with {diagnosis2} underwent neuropsychological testing {number1} days beforehand? Replacement: Which patients diagnosed with {diagnosis2} at visit {visit2} had the HGNC value {HGNC_value} at their previous visit?
7 (RPQ). Question: Do people of age {age1} come for examination more often than others? Replacement: Do people of sex {sex1} come for examination more often than others?
8 (CRPQ). Question: Which entity {entity2} do patients without any diagnosis have in common? Replacement: Which patients are diagnosed with exactly {number} distinct diagnoses and to which risk groups do they belong?
9 (DC). Question: How often does an allel tuple {allel tuple} appear amongst all patients? Replacement: Which risk group is most common amongst all patients?
10 (ECRPQ). Question: What literature {literature1} can be found for patient {patient2} diagnosed with {diagnosis3}? Replacement: Which HGNC values {HGNC_value2} can be found for patient {patient2} diagnosed with {diagnosis3}?
11 (RPQ). Question: How many patients underwent neuropsychological testing {npt1} and at the same time have laboratory value {LAB_value1}? Replacement: How many patients are diagnosed with {diagnosis4} and at the same time have an HGNC value {HGNC_value}?
12 (RPQ). Question: How many patients suffer from disturbance {disturbance1} and what sex are they? Replacement: How many patients are diagnosed with {diagnose5} and what sex are they?
13 (SP). Question: – Replacement: What is the shortest path between entity {entity4} and entity {entity5} and what is on this path?
14 (BC). Question: – Replacement: Which patient connects entities most strongly?

3.3.3 Structure of the Graph

To solve the given problem, the clinical data of the given data model must be arranged according to a suitable schema. This schema is then generalised to act as the data schema for the graph in Neo4j. First, the variables from the data model are considered. They must be divided into classes, relations and associated attributes in order to later enable uncomplicated queries in the graph within as many subgroups of the patients, or subgraphs, as possible. For this purpose, separate classes are created for most variables instead of storing them as attributes. For example, the patient, who can be considered the centre of his person-specific data set, does not receive his gender or his genetically determined ApoE type as an attribute of the patient class; instead, separate classes are created for both. This procedure allows later subgraphs, such as patients of a certain gender or genotype, to be viewed more easily. There are also two other advantages: On the one hand, querying for nodes in Neo4j is more efficient than querying for attributes of nodes, since an additional file must be read to access the attributes. This requires further computational effort and slows down usage. On the other hand, by outsourcing the attributes to their own classes and thus nodes, it is possible to use more differentiated and specific relations to describe the relationships between nodes.

From the given data we can build the data schema for our knowledge graph. It is presented in Fig. 6.

Fig. 6 Detailed data schema for the knowledge graph

But let's take a closer look at the classes, relations and attributes used. The first and central class is the patient himself. It contains no attributes apart from an id, as these are outsourced for the reasons mentioned above. The second class considered is the gender. It classifies the patients into male and female subgroups. In principle, of course, any number of different genders can be distinguished here. The third class is the ApoE risk type. It describes a genetic condition that is closely associated with the risk of developing Alzheimer's disease. For further information on ApoE risk types I recommend having a look at SNPedia (https://www.snpedia.com). From the ApoE risk type we can derive the risk group the patient belongs to. It has to do with the way their ApoE risk type is built and can also be found on SNPedia. But for the sake of a little simplicity we'll leave it at that for now. Next, the measurements of clinical examinations and studies are considered. These include LAB, LIQ, NPT, ApoE, SCA and diagnoses. They are collected in the class of attributes and belong to the superordinate category topic. The attributes are assigned values during the examinations of the patients. Like the topics, these are created as separate objects of the class unstructured and linked to the attributes class via relations. The values themselves are not initially assigned a unit. Therefore, another class named unit is created. It stores the corresponding unit for each measured value. As mentioned above, this again allows for a wider range of viewing and querying options for the graph. In addition, the units can be linked via appropriate relations to a suitable ontology, for example UCUM (Unified Code for Units of Measure, https://ucum.org/trac). Furthermore, one or more diagnoses can be assigned to a patient. These are also defined as an independent class and are stored with a diagnosis code. This corresponds to the diagnosis codes of the Disease Ontology (https://disease-ontology.org) in its descriptive use. In addition, there is also the class time. Every measured value and every examination is provided with a time stamp. This makes it possible to determine a temporal hierarchy for any subset of the total data and, under certain circumstances, it is even possible to identify trends in the course of the disease. The time stamps can be the date of an examination but also the date of the first appearance of a symptom. The last classes used are Source and SourceAll. Source serves as the source of a data set of a specific clinical institute. The location where the data was collected is also stored. This is used for local delimitation of the data and allows data sets from different sources to be combined in one graph without losing their affiliation. The class SourceAll, which inherits from Source, is almost identical to it, but has an additional attribute named provenance. This makes it possible to unambiguously define and record the relationships between different instances of classes within a given data set. The naming of the edges is based on Dublin Core (https://dublincore.org), a set of simple standards of the Dublin Core Metadata Initiative for data formats of documents or objects. Some of the vocabulary listed there has been replaced by more domain-specific expressions, but the basic structure has been retained. An example of this is the predefined term hasFormat, from which the relation hasSex is created in a structure-preserving manner in order to represent the patient-sex relationship. In addition to the designation or label of an edge, further information is required.

Each relation between two classes receives a timestamp that is stored in the attribute time. This can be given explicitly by a concrete date or implicitly by the relationship of the classes and relations to each other. In addition, the edges receive an attribute provenance. This correlates with the attribute of the node SourceAll and assigns an internal affiliation to relationships between several classes. The attribute provenance, regardless of whether it is within a node class or as an attribute of a relation, prevents data ambiguity. However, this schema is too detailed for the graph database. Here, an extension of the already existing data schema of the PubMed database is to be created, which abstracts what has been conceived so far by combining entities. Subsequently, an intersection must be found between the schema generalised in this way and the PubMed model. If we look at the classes in Fig. 6, the ones highlighted in blue can be grouped under the concept of an entity. Thus, the core of the intersection with the already existing graph is found. This also contains, underlined in orange, unstructured data, which, as the name suggests, represents unordered data that is considered as a supplementary context to entities and patients. The resulting schema is shown in Fig. 7.

Fig. 7 Generalised data schema for the knowledge graph
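As a quick illustration of the pay-off of modelling attributes such as gender as nodes, a subgroup query then only follows relationships. The following sketch is ours, not part of the original text; it uses the official neo4j Python driver with made-up connection details and the node labels that appear in the query tables later on (Patients, Sex, RiskGroups):

# Sketch only: assumes a running Neo4j instance with the schema above;
# connection details and exact labels/relationship names may differ.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (r:RiskGroups)-[:hasPatient]->(p:Patients)-[:hasSex]->(s:Sex {sex: 'female'})
RETURN r.riskgroup AS riskGroup, COUNT(DISTINCT p) AS patients
"""

with driver.session() as session:
    for record in session.run(query):
        print(record["riskGroup"], record["patients"])

driver.close()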

3.3.4 Programming

There are several ways to load data into the Neo4j database. For a small amount of data, nodes and edges including attributes can be created manually. However, this quickly becomes inefficient or even impossible. An alternative is the Neo4j command LOAD CSV. According to the Neo4j documentation, this is particularly suitable for importing medium-sized data sets up to a size of about ten million entries. The scope of the test data covered in this paper does not exceed this number, but to achieve a faster import and easier scalability to larger datasets, a third method is used: neo4j-admin import. This variant allows a mass import of very large data sets, but only once per database. Two CSV files are created for each relation and node type of the graph. The first represents a header file that contains only one line, which stores the column headings. The second CSV file then contains the actual nodes or edges with their attributes. The entities and relations of the ontologies used were already available as suitable CSV files. The programme presented here thus only creates the CSV files and headers needed outside the ontologies. The above-mentioned artificial data are given in a CSV file that corresponds to the pattern shown in Listing 6.3.
"Visit","sex","rg","diagnosis","e1","e2","HG"
"0_BN2_0",0,"3","DOID:9841",0,3,"['HGNC_39962', 'HGNC_42531', 'HGNC_46189', 'HGNC_19562']"
"0_BN2_1",0,"4","DOID:5831",2,2,"[]"
"1_COL_0",0,"2","DOID:3229",1,0,"['HGNC_17062', 'HGNC_8949']"
"1_COL_1",0,"4","DOID:7207",1,0,"['HGNC_39664', 'HGNC_30311', 'HGNC_7177', 'HGNC_15305', 'HGNC_47431', 'HGNC_13743', 'HGNC_35254']"
"1_COL_2",0,"2","DOID:2073",1,2,"['HGNC_39378', 'HGNC_17884', 'HGNC_11099', 'HGNC_24460', 'HGNC_3048', 'HGNC_35038', 'HGNC_18669']"
"1_COL_3",0,"2","DOID:2814",2,3,"['HGNC_21343', 'HGNC_28609', 'HGNC_28003', 'HGNC_18359', 'HGNC_49738', 'HGNC_2864']"
"2_BN2_0",0,"4","DOID:9861",1,1,"['HGNC_17147']"


Listing 6.3 Pattern of the CSV file
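Note that Listing 6.3 shows the input; the files the programme produces for neo4j-admin import look different. As a hypothetical illustration (ours, with made-up file names), a header/data pair for patient nodes could look like this, with the invocation of the import tool sketched in the comments (the exact command-line options should be checked against the Neo4j documentation of the version in use):

# Hypothetical header/data pair for patient nodes (our sketch, made-up
# file names) in the format expected by neo4j-admin import.
import csv

with open('patients_header.csv', 'w', newline='') as f:
    csv.writer(f).writerow(['id:ID', ':LABEL'])   # the one-line header file
with open('patients.csv', 'w', newline='') as f:
    csv.writer(f).writerows([['0', 'Patients'], ['1', 'Patients']])

# The one-time bulk import then looks roughly like this (edge files are
# passed analogously; check the exact options of your Neo4j version):
#   neo4j-admin import --nodes=patients_header.csv,patients.csv \
#                      --relationships=hasSex_header.csv,hasSex.csv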

In our programme, we first need to specify the paths we want to use for our source data as well as for our output. Then we can do some preliminary work. Python offers a variety of possibilities to store data:

• List: Lists are arrays that have an intrinsic order, allow duplicates and are mutable.
• Tuple: A tuple is very similar to a list, but does not allow any changes.
• Dictionary: A dictionary uses key-value pairs to store data and does not allow duplicate keys (since Python 3.7 it preserves insertion order).
• Set: Sets are very similar to dictionaries, but they only store keys and no associated values.

This programme uses lists and sets as well as a dictionary, as the short illustration below shows. First of all, empty sets and lists are created that serve different purposes. The group of sets serves to prevent duplicates: Patients can appear more than once in the file to be read, as can localities or visit indicators. The same applies to entities such as HGNC values and diagnoses. However, these are already covered by the HGNC ontology and the disease ontology and therefore do not need to be created as separate classes and output CSV files. In order to avoid duplicates, the data already considered are stored in these sets. Before creating a new object, it is then ensured by querying the corresponding set that, for example, the patient has not yet been created. The empty lists are used to store the created edges and nodes, which is necessary to be able to write the objects into the CSV files at the end. Furthermore, a dictionary is created to map the ε-tuples to the risk groups (cf. SNPedia). In the file to be read in, not only the ε-values but also the risk groups are given. However, since the latter are implicitly derived from the epsilons, this dictionary is used to determine the corresponding risk group and the given values of this column are ignored. The actual programme first creates the directory specified by the above paths with the method createPath() in order to later store the output there. Afterwards, the gender nodes are created with the method createSex(). This can happen before the actual import, as the number of genders is finite. Considering the runtime of our programme, this is faster than doing it in the main for-loop later.
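The duplicate-prevention pattern built from these structures can be condensed into a few lines (our toy example, with made-up ids):

# Minimal illustration (made-up values): a set guards against creating
# the same patient node twice, a dict maps epsilon tuples to risk groups.
personId = set()

def createPerson(pid):
    print('creating patient node', pid)

for pid in ['0', '1', '0']:      # '0' appears twice in the input
    if pid not in personId:      # O(1) membership test on a set
        personId.add(pid)
        createPerson(pid)        # runs only once per patient

epsRgDict = {'epsilon4epsilon4': 'high'}
print(epsRgDict['epsilon4epsilon4'])  # -> 'high'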

# --------------------------------------------------
# sets for temporal storage of values from csv file (prevent duplicates)
personId = set()
visitSite = set()
visitNo = set()
sex = set()
# --------------------------------------------------
# store objects from file reader
# stores sexNodes
sexNodes = []
# store patient objects
patientNodes = []
# store entity nodes:
# all entities together
entityNodes = []
# separated by label
rgNodes = []
hgNodes = []
diagnNodes = []
# store unstructured values
unstructuredNodes = []
# store source nodes
sourceNodes = []
sourceAllNodes = []
# hasSourceTogether = []
# relation types for edges (inspired by dublin core schema)
relType = ['hasSex', 'patientHasSource', 'hasPatient', 'hasDiagnosis',
           'hasGeneNomenclature', 'hasValue', 'isSubSource',
           'entityHasSourceAll', 'unstructuredHasSourceAll',
           'patientHasSourceAll', 'hasEpsCombination', 'rgHasSourceAll',
           'epsilonCombinationHasSourceAll']
# edge patientNode -[hasSex]-> sexNode
patSex = []
# edge patientNode -[hasSource(All)]-> source(All)Node
patSourceAll = []
patSource = []
# edges patientNode - entityNode
patRg = []
patHg = []
patDiagn = []
patEpsComb = []
# edge patientNode -[hasValue]-> unstructuredNode
patUn = []
# edge entityNode -[hasSource]-> sourceAllNode
rgSourceAll = []
diagnSourceAll = []
hgSourceAll = []
epsCombSourceAll = []
# edge unstructuredNode -[hasSource]-> sourceAllNode
unSourceAll = []
# edge sourceAll -[isSubSource]-> source
sASource = []
# --------------------------------------------------
# dictionary for risk groups:
epsRgDict = {'epsilon1epsilon1': 'low',
             'epsilon1epsilon2': 'low',
             'epsilon1epsilon3': 'low',
             'epsilon2epsilon2': 'low',
             'epsilon2epsilon3': 'low',
             'epsilon3epsilon3': 'low',
             'epsilon1epsilon4': 'medium',
             'epsilon2epsilon4': 'medium',
             'epsilon3epsilon4': 'medium',
             'epsilon4epsilon4': 'high'}
# --------------------------------------------------
# node classes
# uri not given
class nodeEntity:
    def __init__(self, _entityID, _source, _id, _preferredLabel, _uri):
        self.entityID = _entityID
        self.source = _source
        self.id = _id
        self.preferredLabel = _preferredLabel
        # self.uri = _uri

class sexNode:
    def __init__(self, _sex):
        self.sex = _sex

class unstructuredNode:
    def __init__(self, _unstructuredID, _value, _uri):
        self.unstructuredID = _unstructuredID
        self.value = _value
        # self.uri = _uri

class patientNode:
    def __init__(self, _id):
        self.id = _id

class sourceNode:
    def __init__(self, _sourceID, _source, _site):
        self.sourceID = _sourceID
        self.source = _source
        self.site = _site

class sourceAllNode:
    def __init__(self, _sourceAllID, _source, _site, _provenance):
        self.sourceAllID = _sourceAllID
        self.source = _source
        self.site = _site
        self.provenance = _provenance

# --------------------------------------------------
# edge class
class edge:
    def __init__(self, _type, _provenance, _startNode, _endNode, _time):
        self.provenance = _provenance
        self.type = _type
        self.time = _time
        self.startNode = _startNode
        self.endNode = _endNode
        # source / sourceAll

Listing 6.4 Beginning of the CSV import

Then, the method impCSVFile(_filename) is called. It represents the core of the programme. _filename is used here as a placeholder for the name of a CSV file to be read in. Listing 6.5 shows that the method first reads the given file and then creates some auxiliary variables to be used later. provenance is initially set to −1. It is incremented by one for each line read from the CSV file to prevent data ambiguity, as mentioned above.

def impCSVFile(_filename):
    with open(_filename) as csvfile:
        readCSV = csv.reader(csvfile, delimiter=',')
        # provenance dummy for test data
        provenance = -1
        # placeholder uri
        uriPlaceholder = 0
        # check first line
        firstLine = True
        # create source:
        source = _filename
        # iterate through all rows
        for row in readCSV:
            # helping variables
            tempPatientId = None
            tempVisitNumber = None
            tempSex = 'female'
            tempSite = None
            tempProv = None
            tempRg = None
            # tempRiskGroup = None
            tempDiagn = None
            tempEpsCombId = None
            avoidDoubleRel = True
            # skip first row (header)
            if firstLine:
                firstLine = False
                continue
            # raise provenance for this row
            provenance += 1
            # split first entry into three parts:
            # visit = [patientID, visitSite, visitNumber], e.g. 0_BN_1
            visit = row[0].split('_')
            # split last entry into parts (split HGNC markers)
            l = len(row)
            # "['HGNC_XXXX', 'HGNC_YYYY', 'HGNC_ZZZZ']"
            # listHGNCValues = [HGNC_XXXX, HGNC_YYYY, ...]
            listHGNCValues = row[l-1].strip('][').split(', ')
            # check for doubles and otherwise create person and add it to duplicates and nodes
            # visit[0] is patient-id
            tempPatientId = visit[0]
            if tempPatientId not in personId:
                avoidDoubleRel = False
                personId.add(visit[0])
                createPerson(visit[0])
            # visit[1] is site
            # create attributes for sourceAll node
            tempSite = 'site_' + str(visit[1])
            tempProv = 'provenanceID_' + str(provenance)
            # as each line in the csv file is assigned to a unique provenance,
            # there is no need to filter with an if-clause
            # creates sourceAll node and stores it in list,
            # sourceAll-ID can therefore be the provenance
            createSourceAll(tempProv, source, visit[1], provenance)
            # check for doubles and otherwise create site and add it to duplicates and nodes
            if visit[1] not in visitSite:
                visitSite.add(visit[1])
                # here siteID is used as sourceID because of the limited given set of data.
                # Normally data sets would come from different locations, creating
                # different sources which could then be categorized by their siteID
                # as a sourceID as well.
                # createEntity(sourceID, source, visit[1], 'visitSite', uriPlaceholder)
                createSource(tempSite, source, visit[1])
            # visit[2] is visit number
            tempVisitNumber = 'visitNumber_' + str(visit[2])
            if visit[2] not in visitNo:
                visitNo.add(visit[2])
                createUnstData(tempVisitNumber, visit[2], uriPlaceholder)
            # at the beginning sex is set to female, in other cases change it
            if row[1] == '0':
                tempSex = 'male'
            tempDiagn = 'DO_' + str(row[3])
            # store epsilons (given as 0-3, needed as 1-4)
            e1 = int(row[4]) + 1
            e2 = int(row[5]) + 1
            # create tempEpsCombId for edge to eps1/eps2 combination
            if e1 > e2:
                tempEpsCombId = 'epsilon' + str(e2) + 'epsilon' + str(e1)
            else:
                tempEpsCombId = 'epsilon' + str(e1) + 'epsilon' + str(e2)
            # find corresponding risk group
            tempRg = epsRgDict[tempEpsCombId]
            # create edges
            # edge patientNode -[hasSex]-> sexNode
            patSex.append(createEdges(relType[0], provenance, tempPatientId, tempSex, visit[2]))
            # edge patientNode -[patientHasSourceAll]-> sourceAllNode
            patSourceAll.append(createEdges(relType[9], provenance, tempPatientId, tempProv, visit[2]))
            # edges patientNode - entityNode (rg, epsilon, diagn, hg)
            # patient - riskgroup
            patRg.append(createEdges(relType[2], provenance, tempRg, tempPatientId, visit[2]))
            # patient - epsilon combination
            patEpsComb.append(createEdges(relType[10], provenance, tempPatientId, tempEpsCombId, visit[2]))
            # patient - diagnosis
            patDiagn.append(createEdges(relType[3], provenance, tempPatientId, tempDiagn, visit[2]))
            # patient - hgnc
            if listHGNCValues != ['']:
                for entry in listHGNCValues:
                    entry = entry.strip("'")
                    patHg.append(createEdges(relType[4], provenance, tempPatientId, entry, visit[2]))
                    # hgnc - sourceAll
                    hgSourceAll.append(createEdges(relType[7], provenance, entry, tempProv, visit[2]))
            # edge patientNode -[hasValue]-> unstructuredNode (visit no)
            patUn.append(createEdges(relType[5], provenance, tempPatientId, tempVisitNumber, visit[2]))
            # edge unstructuredNode (visit no) -[hasSourceAll]-> sourceAllNode
            unSourceAll.append(createEdges(relType[8], provenance, tempVisitNumber, tempProv, visit[2]))
            # edge entityNode -[hasSourceAll]-> sourceAllNode (rg, diagn, epsilon)
            # rg - sourceAll
            rgSourceAll.append(createEdges(relType[7], provenance, tempRg, tempProv, visit[2]))
            # diagn - sourceAll
            diagnSourceAll.append(createEdges(relType[7], provenance, tempDiagn, tempProv, visit[2]))
            # epsilon combination - sourceAll
            epsCombSourceAll.append(createEdges(relType[7], provenance, tempEpsCombId, tempProv, visit[2]))
            # patient - source
            if avoidDoubleRel == False:
                patSource.append(createEdges(relType[1], provenance, tempPatientId, tempSite, visit[2]))
                # sourceAll - source
                sASource.append(createEdges(relType[6], provenance, tempProv, tempSite, visit[2]))

Listing 6.5 Main part of the import file

The entities within the data schema use a uri that could be used at this point, but it is not explicitly given for the artificially generated data. The name of the file read in is stored as source to serve as the later id of the source nodes and as an attribute of the SourceAll nodes. If, for example, different source files were given, the source could be used to store and later retrieve the affiliation. Afterwards, each row of the CSV file is processed individually. Within such a pass, it is first checked whether the current row is the first row, since this only contains the column headings of the CSV file and should not be used. Then the values of the visit from row[0] are separated from each other and stored in an array named visit.


In addition, the last entry of the row, which consists of a string that in turn contains a list of strings of HGNC values, is split and the values are stored. In the process, the square brackets are removed. The patient node is then to be created, but only if it does not already exist. The patient ids read so far are stored in sets. For each line read, an if-query is used to check whether the patient with the current id has already been created as an object or not. The effect of this query on the runtime is examined later. Subsequently, source nodes are created. As described above, a distinction is made between sourceAll and source. In the former, the provenance is also stored. Furthermore, the information about the number of visits of the patient is stored as unstructured. Since no explicit times are given in the artificial data, this node is used as implicit time information. At the beginning of each row that is run through by the for loop, the programme sets the gender to female. If another gender, in this case male, is then found when the row is read, the programme sets the variable to the corresponding value. This will be needed later for the edges of the graph that assign the patients their gender. If the gender is not only a binary specification, this variable can be set using a finite dictionary with all genders. After that, the ε-values of the patient are considered. It is checked whether the two epsilons are in the correct order. If not, they are stored interchanged. The associated risk group can then be determined from the combination via a dictionary. Finally, the mapping, the actual core of the programme, must be developed. To do this, all the required edges with their attributes are created as objects. The edges created are given a relation type, the corresponding provenance, a start and a target node as well as a time parameter. Afterwards, these objects are stored in separate lists, which are later run through for our output CSV files. For the edges between patients and HGNC values, the current list listHGNCValues of the HGNC values belonging to the current row is processed entry by entry, but only if it is not empty. This condition is important because otherwise empty fields appear in the CSV file and cause problems during import into the database: they would lead to edges that contain a start node but no end node, which would cause the subsequent import into the graph database to fail. As mentioned above, this principle can also be applied if there are other attributes in addition to HGNC values in multiple versions. After all possible edges have been created as objects and stored in the lists, the method is finished. At the end there are two methods named nodeCreation() and edgeCreation(), which work very similarly. Both call several methods, which in turn read each list filled with graph objects, be they nodes or edges, and write the individual objects to lines of an object-specific CSV file. Superfluous attributes are ignored when writing. Finally, the last two methods headerNodes() and headerEdges() are called. These create a suitable header for each node and edge type for the database. Each header consists of only one line, which allows Neo4j to read in the tabulated CSV files of edges and nodes correctly. These lines of code are not shown here as they are rather boring and really only do exactly what is described above.
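Since those writer methods are omitted, here is a rough idea of what they boil down to. This is our own minimal sketch with assumed attribute and file names, not the original code:

# Our own sketch (assumed names) of what edgeCreation() and headerEdges()
# amount to for one edge list: a data CSV plus a one-line header CSV.
import csv

def writeEdgeHeader(filename):
    # single line of column headings, as required by neo4j-admin import
    with open(filename, 'w', newline='') as f:
        csv.writer(f).writerow([':START_ID', ':END_ID', ':TYPE', 'provenance', 'time'])

def writeEdges(edgeList, filename):
    with open(filename, 'w', newline='') as f:
        writer = csv.writer(f)
        for e in edgeList:
            # superfluous attributes of the edge objects are simply not written
            writer.writerow([e.startNode, e.endNode, e.type, e.provenance, e.time])

# e.g. for the patient-sex edges:
# writeEdgeHeader('patSex_header.csv')
# writeEdges(patSex, 'patSex.csv')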


The programme can of course be extended to more extensive data. The nodes are stored as separate objects, which makes it possible to extend them as well, for example by attributes and even class-specific functions. In addition, the programme retains the flexibility to quickly change or remove data types. The same applies to the created edges. Here, too, they are objects that can be extended, modified or shortened at any time. For the sparse artificial test data provided, this would not have been necessary, but the procedure ensures the generic applicability of the script and serves as an example for working with a great variety of data.

3.3.5 Import into Neo4j and Results

The import into the graph database follows the documentation that can be found on the Neo4j website under admin-import, which is why we will not go into detail about it. Now we want to have a look at the results. First, we evaluate our script from above with regard to runtimes. All the preliminary steps prior to the main for-loop run in O(1), e.g. creating sets, lists, dictionaries and a finite number of other objects. Then, our for-loop runs in O(n), where n represents the size of our input data. It is important to use sets there for checking for duplicates: the lookup time for lists is O(n), as in the worst case the programme would iterate over the whole list. Sets, however, work with hash functions to create keys, so whenever a new string is to be checked, the programme can simply look up the hashed key instead of going through the whole set. And at the end, when printing everything to CSV files, we have O(n) again, since we print one list after the other and all of them can at most have the same size as the input. So the whole script runs in O(n), which is quite convenient considering possibly large data inputs. The practical runtimes are shown in Fig. 8, and there one can see that the script scales as we expected from the theoretical time complexity.
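The set-versus-list argument is easy to verify yourself. A quick measurement along these lines (our sketch, with arbitrary sizes; absolute numbers will vary by machine) shows the difference:

# Quick check (ours) of the membership-test argument: list lookup is O(n),
# set lookup is expected O(1).
import timeit

n = 100_000
data_list = [str(i) for i in range(n)]
data_set = set(data_list)

print(timeit.timeit(lambda: 'absent' in data_list, number=100))  # scans the list
print(timeit.timeit(lambda: 'absent' in data_set, number=100))   # hash lookup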

Fig. 8 Measured practical runtimes of the import script (time in seconds over ten instances; series for input sizes 1400, 36000 and 135000)

Table 2 RPQs in Cypher

Query 1:
MATCH (e:Entity {source:'HGNC'})<-[:hasGeneNomenclature]-(p:Patients)
RETURN COUNT(DISTINCT p) AS numberOfPatients

Query 2: –

Query 3:
MATCH (e:Entity {source:'HGNC'})<-[:hasGeneNomenclature]-(p:Patients)-[:hasValue]->(v:Unstructured {value:'2'})
RETURN e.preferredLabel AS HGNC_Wert, COUNT(e) AS number ORDER BY number DESC LIMIT 10

Query 5:
MATCH (e:Entity {source:'HGNC'})<-[:hasGeneNomenclature]-(p:Patients)-[:hasValue]->(v:Unstructured {value:'0'})
RETURN COUNT(DISTINCT p) AS p_count

Query 7:
MATCH (s:Sex)<-[:hasSex]-(p:Patients)-[:hasValue]->(v:Unstructured) WHERE v.value > '6'
RETURN DISTINCT s.sex AS patientSex, COUNT(s) AS sex_count

Query 11:
MATCH (h:Entity {identifier: '37785'})<-[:hasGeneNomenclature]-(p:Patients)-[:hasDiagnosis]->(e:Entity {identifier: 'DOID:14332'})
RETURN DISTINCT p.patient AS Patient, e.identifier AS Diagnosis, h.identifier AS HGNC_Value

Query 12:
MATCH (s:Sex)<-[:hasSex]-(p:Patients)-[:hasDiagnosis]->(e:Entity {identifier: 'DOID:0040021'})
RETURN DISTINCT p.patient, s.sex

After checking the theoretical runtime of our script, let's have a look at the runtimes of our queries in Neo4j. First we look at the RPQs. According to Table 1 these are queries 1, 2, 3, 5, 7, 11 and 12. The queries are translated into the database language Cypher and executed in Neo4j. In Table 2 you can see the Cypher statements corresponding to the clinical questions in Table 1. Important to notice: As technology changes very fast, by the time you read this book the syntax might have changed a little. Please have a look at the Neo4j documentation; it explains the current syntax very well. The resulting runtimes are shown in Fig. 9. Due to the transformation into a finite automaton, these queries should have a polynomial runtime. Then we can have a look at the runtimes of our CRPQs and ECRPQs. They are thrown together in one diagram as they are not as numerous as the RPQs. The class of CRPQs is NP-complete. Therefore, it can probably not be solved efficiently.

Fig. 9 Runtimes of different RPQ queries in milliseconds

Fig. 10 Runtimes of different (E)CRPQ queries in milliseconds

This is somewhat reflected in our measurements, which are shown in Fig. 10. Especially Q8 has a horrible runtime. The corresponding queries are presented in Table 3. But if you think that this is bad, wait for the algorithms to steal the (E)CRPQs' thunder! You can see the Cypher queries in Table 4. As you can see in Fig. 11, Q13 has a really high runtime. Depending on the structure of the graph and the amount of data, it might be a lot faster to outsource these algorithms to Python scripts. However, this requires a little bit of implementation, especially considering the import of data from Neo4j. But this is something we will see later.
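To already give a flavour of such outsourcing, the following sketch (ours, with made-up connection details) exports a plain edge list from Neo4j via the Python driver and hands it to networkx, where shortest paths or betweenness can then be computed locally:

# Rough sketch (not from the book): export edges with a simple query and
# run the expensive algorithm outside the database.
import networkx as nx
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

G = nx.Graph()
with driver.session() as session:
    # a simple query instead of an in-database graph algorithm
    for rec in session.run("MATCH (a)-[r]->(b) RETURN id(a) AS a, id(b) AS b"):
        G.add_edge(rec["a"], rec["b"])
driver.close()

# Q13/Q14-style computations, now done locally, e.g.:
# nx.shortest_path(G, source_node, target_node)
# nx.betweenness_centrality(G)
print(G.number_of_nodes(), G.number_of_edges())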

Table 3 (E)CRPQs in Cypher

Query 4:
MATCH (u:Unstructured)<-[:hasValue]-(p:Patients)-[:hasGeneNomenclature]->(e:Entity {source: 'HGNC'})
MATCH (r:RiskGroups)-[:hasPatient]->(p)
RETURN e.preferredLabel AS HGNC, r.riskgroup AS RiskGroup, u.value AS VisitNo ORDER BY u.value DESC

Query 6:
MATCH (:Unstructured {value:'2'})<-[:hasValue]-(p:Patients)-[:hasDiagnosis]->(:Entity {preferredLabel: "Alzheimer's disease"})
WITH p
MATCH (:Unstructured {value:'1'})<-[:hasValue]-(p)-[:hasGeneNomenclature]->(:Entity {identifier:'35357'})
RETURN p LIMIT 10

Query 8:
MATCH (e:Entity {source: 'HGNC'})