COVID-19 Experience in the Philippines: Response, Surveillance and Monitoring Using the FASSSTER Platform (Disaster Risk Reduction) 9819931525, 9789819931521

This book provides an overview of the extensive work that has been done on the design and implementation of the COVID-19


English Pages 179 [169] Year 2023


Table of contents:
Foreword by Dr. Jaime C. Montoya
Foreword by Dr. Alethea R. De Guzman
Foreword by Dr. Selva Ramachandran
Preface
Acknowledgements
Contents
Contributors
Part I COVID-19 Disease Surveillance System in the Philippines
1 Origins of FASSSTER
1.1 Building Blocks of Syndromic Surveillance
1.1.1 The FASSSTER Framework
1.1.2 Spatio-Temporal Epidemiological Modeler (STEM)
1.1.3 Localizing Parameter Estimation
1.2 Use of Electronic Medical Records for Symptoms Monitoring and Reporting
1.3 Infodemiology: Use of Crowdsourced Health-Related Information
1.3.1 Extracting Data from Tweets
1.4 Spatio-Temporal Visualization of Health and Syndromic Surveillance Data
1.5 Insights on the Feasibility Analysis of an Online Syndromic Surveillance Platform Using Spatio-Temporal Visualization
References
2 Management of COVID-19 Data for the FASSSTER Platform
2.1 Data Sources
2.1.1 The COVID-19 Line List
2.1.2 DOH Data Collect
2.1.3 Mobility Data
2.2 Data Cleaning
2.2.1 Data Cleaning Process for COVID-19 Line List
2.2.2 Data Imputation
2.3 Utilization of Linelist Data for Analytics and Modeling
2.3.1 Case Statistics
2.3.2 Epidemic Curves
2.3.3 Growth Rate
2.3.4 Risk Classification
2.3.5 Barangay Hotspots
2.3.6 Deaths over Time
2.4 Daily Hospital Report
2.4.1 Health Facility Bed Capacity Utilization Rates (Regional, Provincial/HUC/ICC)
2.4.2 Province/HUC/ICC Risk Classification
2.4.3 Testing Aggregates
3 FASSSTER Data Pipeline and DevOps
3.1 Description of Data Sources for the COVID-19 FASSSTER Platform
3.2 Preprocessing
3.2.1 Producing the Compartmental Model (SEIR)
3.2.2 Producing EpiNow and EpiNow2
3.2.3 Producing the Spatio-Temporal Model
3.3 Interoperability Between R Models and Python Scripts
3.4 Data Visualization
Part II FASSSTER Analytics and Models
4 Disease Surveillance Metrics and Statistics
4.1 Risk Classifications and Community Quarantine Protocols
4.2 Phase 1: LGU Epidemic Response Framework
4.2.1 Case Doubling Time and Mortality Doubling Time
4.2.2 Critical Care Utilization Rate
4.3 Phase 2: Average Daily Attack Rate, Two-Week Growth Rate, and Healthcare Utilization Rate
4.3.1 Average Daily Attack Rate
4.3.2 Two-Week Growth Rate
4.3.3 Healthcare Utilization Rate
4.4 Other Metrics
4.4.1 Social Risk Rating and Classification
4.4.2 Economic Risk Classification
4.4.3 Security Risk Classification
4.5 Other Surveillance Visualizations
4.5.1 Epidemic Curve by Location
4.5.2 COVID-19 Positivity Rate
4.5.3 7-Day Moving Average and Growth Factor
4.5.4 Number and Percentage of Barangays with New Cases in Last 14 Days
References
5 Effective Reproduction Number Rt
5.1 The COVID-19 Pandemic and the Reproduction Number
5.2 Basic Versus Effective Reproduction Number
5.3 Computing Rt in FASSSTER
5.3.1 EpiEstim
5.3.2 EpiNow2
5.4 Conclusion
References
6 The FASSSTER SEIR Model
6.1 The SEIR Equations
6.2 The Basic Reproduction Number
6.3 Model Implementation
6.3.1 Model Parameters from Data
6.3.2 Model Parameters from Literature
6.3.3 Model Fitting to Data
6.3.4 Historical Changes in the Model Parametrization
6.4 Applications of the SEIR Model
6.4.1 Community Quarantine Policies and PDITR
6.4.2 Healthcare Costs and Economic Losses
6.4.3 Healthcare Requirements
6.5 Conclusion
References
7 Geospatial and Spatio-Temporal Models
7.1 Introduction
7.2 Hot Spots and Attack Rate
7.3 Local Indicator of Spatial Autocorrelation (LISA)
7.3.1 Motivation and Relevance
7.3.2 Method
7.3.3 Sample Results and Interpretation
7.4 Bayesian Modeling of Spatio-Temporal Risk
7.4.1 Motivation and Relevance
7.4.2 Model Design
7.4.3 Method
7.4.4 Model Selection
7.4.5 Sample Results and Interpretation
7.5 Concluding Remarks
References
Appendix FASSSTER Data Dictionary


Disaster Risk Reduction Methods, Approaches and Practices

Maria Regina Justina Estuar · Elvira De Lara-Tuprio, Editors

COVID-19 Experience in the Philippines: Response, Surveillance and Monitoring Using the FASSSTER Platform

Disaster Risk Reduction Methods, Approaches and Practices

Series Editor Rajib Shaw, Keio University, Shonan Fujisawa Campus, Fujisawa, Japan

Disaster risk reduction is a process that leads to the safety of communities and nations. After the 2005 World Conference on Disaster Reduction, held in Kobe, Japan, the Hyogo Framework for Action (HFA) was adopted as a framework for risk reduction. Academic research and higher education in disaster risk reduction have made, and continue to make, a gradual shift from pure basic research to applied, implementation-oriented research. More emphasis is being given to multi-stakeholder collaboration and multi-disciplinary research. Emerging university networks in Asia, Europe, Africa, and the Americas have urged process-oriented research in the disaster risk reduction field. With this in mind, this new series will promote the output of action research on disaster risk reduction, which will be useful for a wide range of stakeholders including academicians, professionals, practitioners, and students and researchers in related fields. The series will focus on emerging needs in the risk reduction field, starting from climate change adaptation, urban ecosystem, coastal risk reduction, education for sustainable development, community-based practices, risk communication, and human security, among other areas. Through academic review, this series will encourage young researchers and practitioners to analyze field practices and link them to theory and policies with logic, data, and evidence. In this way, the series will emphasize evidence-based risk reduction methods, approaches, and practices.


Editors
Maria Regina Justina Estuar, Ateneo de Manila University, Quezon City, Philippines
Elvira De Lara-Tuprio, Ateneo de Manila University, Quezon City, Philippines

ISSN 2196-4106
ISSN 2196-4114 (electronic)
Disaster Risk Reduction
ISBN 978-981-99-3152-1
ISBN 978-981-99-3153-8 (eBook)
https://doi.org/10.1007/978-981-99-3153-8

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

This book tells the story of how we applied mathematics, computer science and data science in the design, development and deployment of a localized syndromic surveillance and scenario-based analytics and modeling tool that served as a guide for policy making during the COVID-19 pandemic in the Philippines. We dedicate this work to our past, present and future students. May our experience serve to inspire them to continue using science in serving and building the nation.

Foreword by Dr. Jaime C. Montoya

Among the many important facets of public health are disease surveillance systems that generate data and enable experts to gain a profound understanding of disease outbreaks and the health-related risks they may pose to every individual. In development since 2016, the Feasibility Analysis of Syndromic Surveillance using Spatio-Temporal Epidemiological Modeler, or FASSSTER, started as a platform used in the surveillance of dengue, measles, and typhoid outbreaks in the Philippines. Recognizing the need for a surveillance system that would allow health experts and policymakers to understand the pandemic and assess the effects of the strategies in place, the Department of Science and Technology-Philippine Council for Health Research and Development (PCHRD) provided support to recalibrate the FASSSTER Platform in 2020 to monitor the spread of COVID-19. Since then, it has become one of the government’s decision-making tools in managing the pandemic. This book presents a unique and valuable window on the building blocks of the FASSSTER Platform as a product of rigorous studies aimed at providing evidence-based forecasts on the local transmission of a highly infectious disease. The PCHRD acknowledges the commitment and hard work of the developers of FASSSTER, led by Dr. Maria Regina Justina E. Estuar. The emerging needs of public health require innovative approaches to combat present-day and future threats. May this book serve as a useful resource and inspire proactive practices in enhancing disease surveillance systems directed at establishing effective strategies when faced with public health risks.

Taguig City, Philippines
February 2022

Dr. Jaime C. Montoya


Foreword by Dr. Alethea R. De Guzman

The Epidemiology Bureau of the Philippine Department of Health has the mandate to provide strategic epidemiologic information to guide policy and strategy development. This essential mandate became even more important as we faced the COVID-19 pandemic and critical decisions had to be made to save lives. While health has been foremost in the country’s response, it was imperative that we also considered how our decisions ensured the recovery not just of individuals but also of communities and the economy. In our ongoing response, we saw greater appreciation of the value of data analytics and disease modeling in guiding actions so that these goals are achieved. Throughout our journey, the FASSSTER team has been our constant, committed partner. The development of the FASSSTER platform made data generation more efficient, giving our data analytics team more opportunity to provide a more comprehensive analysis and understanding of the COVID-19 situation. It was also a means to better visualize our data and make it easier for stakeholders to comprehend data that are often highly technical in nature. The disease models developed were also crucial to our goal of balancing the value placed on health and the economy. These models provided estimates or forecasts used to ensure that capacities would be adequate. They were also crucial in showing, through various scenarios, how policy decisions would affect not only our health systems but also the efforts to rehabilitate our economy. The impact of decisions made through the use of this information is best exemplified by our numbers, which showed that while we have seen several large case increases, we have also shown large improvements in our outcomes.

Lastly, having seen the practical value of these platforms and models, this book will provide clear guidance on how the methodologies used by the FASSSTER team can be replicated and tailored by other professionals and institutions in creating targeted and appropriate strategies and interventions for programs other than the COVID-19 response. It has indeed been a challenging time for us as members of our families, communities, and institutions leading the COVID-19 response. While a perfect response is improbable, we always aimed at making the best decisions with the best information available to us. Through our partnership with the FASSSTER team, we showed how sound and excellent data analytics have enabled the country to consistently recover from this pandemic and guided us to not just a new but a better normal.

Manila, Philippines
February 2022

Dr. Alethea R. De Guzman

Foreword by Dr. Selva Ramachandran

In 2020, the world was gripped by the COVID-19 pandemic. As with any crisis, speed of action was of paramount importance, and the only way by which countries could adequately handle the pandemic was to ensure that their actions were informed by accurate, real-time data. The FASSSTER team’s visualizations enabled us to “see” the extent of COVID-19 transmission, and their advanced mathematical models gave both national and local governments the ability to anticipate possible surges that could occur 3 months ahead of time. This information has proven pivotal in the Philippines’ response and recovery strategy, with a cabinet secretary estimating that the FASSSTER projections informed the decisions made by the Inter-Agency Task Force Against COVID-19 that helped prevent between 1.5 and 3.5 million infections during the first few months of the pandemic. UNDP is proud to have supported the FASSSTER team’s hard work under its Pintig Lab initiative, and we have been pleased to see the Department of Health now own the platform, with the ambition to harness its capabilities for other diseases under its universal healthcare programme. This publication is an important step towards realizing the goal of a more resilient universal healthcare system for the country. May it serve as a source of inspiration and guidance for our leaders of today and tomorrow.

Manila, Philippines
February 2022

Dr. Selva Ramachandran


Preface

FASSSTER stands for Feasibility Analysis on Syndromic Surveillance using Spatio-Temporal Epidemiological Modeler. In this book, we write about the building blocks of the homegrown syndromic surveillance and scenario-based analytics and modeling tool built in the FASSSTER Platform, which served as a decision-making tool of the Philippine government in managing the COVID-19 pandemic. We begin with a brief background on its pre-pandemic origins as a scenario-based syndromic surveillance platform designed for disease monitoring in rural health units. This is followed by chapters that describe the main components of the tool, which were put together in a single operational dashboard. The book is divided into two main parts. Part I describes the backbone of the system, which serves as a framework for designing and deploying disease surveillance tools. The chapter on Data Management describes how COVID-19 data was managed, including the data sources, the data cleaning process, and data imputation. The chapter on Data Operations presents the data pipeline, preprocessing, interoperability solutions, and data visualization aspects of the system. Part II of the book focuses on the analytical tools and models that were developed for use in the surveillance and monitoring of COVID-19 cases. The chapter on disease surveillance metrics covers the risk classifications and community quarantine protocols that were implemented in the Philippines, metrics including the average daily attack rate and healthcare utilization rate, other metrics, and surveillance visualization tools. The chapter on the effective reproduction number provides an example of how open-source tools such as EpiEstim and EpiNow were localized for use at the national and regional levels. The chapter on the FASSSTER SEIR Model covers its design, implementation, and application to policy. The final chapter presents geospatial and spatio-temporal models designed and used for localized decision-making.

This book is written for country-wide, regional, or city-level disease surveillance units that are planning to set up a similar environment for monitoring COVID-19 or other diseases. This book also serves as a reference for students who are being trained in epidemiological and mathematical disease modeling, computer science, and data science with a focus on public health informatics.


We hope that this book can serve as a foundation for a data-driven approach in managing highly transmissible diseases for low middle income countries at the local and country level.

Quezon City, Philippines
February 2023

Reena Estuar

Acknowledgements

We would like to acknowledge the following institutions for providing us with the opportunity to contribute to the management of the COVID-19 pandemic in the Philippines:

Epidemiology Bureau, Department of Health
National Economic Development Authority
Australian Tuberculosis Modeling Network
World Health Organization

We would like to thank the following institutions for providing us with resources and support:

Office of the President, Ateneo de Manila University
Ateneo Center for Computing Competency and Research, Ateneo de Manila University
Department of Information Systems and Computer Science, Ateneo de Manila University
Department of Mathematics, Ateneo de Manila University
Department of Economics, Ateneo de Manila University
School of Science and Engineering, Ateneo de Manila University
Philippine Council for Health Research and Development
Department of Science and Technology
United Nations Development Programme

We are grateful for family, friends, and faith who have been our source of strength from the beginning of our journey up to the present as we remain steadfast in our commitment to serve the Filipino people through science and technology.


Contents

Part I COVID-19 Disease Surveillance System in the Philippines

1 Origins of FASSSTER
Maria Regina Justina E. Estuar and Kennedy E. Espina

2 Management of COVID-19 Data for the FASSSTER Platform
Maria Regina Justina E. Estuar, Lenard Paulo V. Tamayo, Jay-Arr Buhain, Jillian Yasmin Chua, Daniel Joseph Benito, Lean Franzl Yao, and Raymond Francis Sarmiento

3 FASSSTER Data Pipeline and DevOps
Lenard Paulo Tamayo, Christian Pulmano, Romel John Santos, Jay-Arr Buhain, and Raven Ico

Part II FASSSTER Analytics and Models

4 Disease Surveillance Metrics and Statistics
Christian Pulmano, Elvira de Lara-Tuprio, Maria Regina Justina Estuar, and Lenard Paulo V. Tamayo

5 Effective Reproduction Number Rt
Timothy Robin Y. Teng and Raven D. Ico

6 The FASSSTER SEIR Model
Elvira de Lara-Tuprio, Jay Michael R. Macalalag, and Carlo Delfin S. Estadilla

7 Geospatial and Spatio-Temporal Models
Daniel Joseph Benito, Lean Franzl Yao, Joshua Uyheng, Elvira de Lara-Tuprio, Christian Pulmano, and Maria Regina Estuar

Appendix: FASSSTER Data Dictionary


Contributors

Daniel Joseph Benito, Ateneo de Manila University, Quezon City, Philippines
Jay-Arr Buhain, Cavite State University, Cavite City, Philippines; Ateneo de Manila University, Quezon City, Philippines
Jillian Yasmin Chua, Ateneo de Manila University, Quezon City, Philippines
Elvira de Lara-Tuprio, Ateneo de Manila University, Quezon City, Philippines
Kennedy E. Espina, Ateneo de Manila University, Quezon City, Philippines
Carlo Delfin S. Estadilla, Ateneo de Manila University, Quezon City, Philippines
Maria Regina Justina Estuar, Ateneo de Manila University, Quezon City, Philippines
Maria Regina Justina E. Estuar, Ateneo de Manila University, Quezon City, Philippines
Raven Ico, Ateneo de Manila University, Quezon City, Philippines
Raven D. Ico, Ateneo de Manila University, Quezon City, Philippines
Jay Michael R. Macalalag, Caraga State University, Agusan del Norte, Philippines
Christian Pulmano, Ateneo de Manila University, Quezon City, Philippines
Romel John Santos, Ateneo de Manila University, Quezon City, Philippines
Raymond Francis Sarmiento, Ateneo de Manila University, Quezon City, Philippines
Lenard Paulo Tamayo, Ateneo de Manila University, Quezon City, Philippines
Lenard Paulo V. Tamayo, Ateneo de Manila University, Quezon City, Philippines
Timothy Robin Y. Teng, Ateneo de Manila University, Quezon City, Philippines
Joshua Uyheng, Carnegie Mellon University, Pittsburgh, USA
Lean Franzl Yao, Ateneo de Manila University, Quezon City, Philippines; Social Computing Lab, Nara Institute of Science and Technology, Nara, Japan

Part I

COVID-19 Disease Surveillance System in the Philippines

In 2014, the National Epidemiology Center of the Department of Health launched the Philippine Integrated Disease Surveillance and Response System (PIDSR) in response to the call of the 2005 International Health Regulations for Member States to adopt an integrated disease surveillance system (DOH, 2014). Instead of managing several disease surveillance systems that use separate data collection, reporting, and analytical tools, PIDSR provided a framework that integrated detection, registration, reporting, confirmation, analysis, and feedback in one platform. Since 2015, PIDSR has been deployed in all registered hospitals and health facilities, monitoring over 25 diseases and syndromes that are classified as having the potential to evolve into outbreaks. In 2015, the Ateneo Center for Computing Competency and Research or ACCCRe (then named Ateneo Java Wireless Competency Center or AJWCC) proposed to build a web-based syndromic surveillance platform that would collect real-time information from various data sources as input to the analytics and models needed for surveillance, modeling, and monitoring of high-priority diseases. The name FASSSTER was derived from the project title of the proposal submitted to the Philippine Council for Health Research and Development (PCHRD) for funding. The Feasibility Analysis on Syndromic Surveillance using Spatio-Temporal Epidemiological Modeler, or FASSSTER, proposed the development of interfaces for defining and accessing different data sources, the development of a container for different types of disease models, and the development of visualization dashboards that serve as input to decisions and planning. The three-year journey resulted in a robust framework that served as an accessible tool for epidemiological surveillance teams that needed to use an online environment for scenario-based disease modeling and analytics.
Managing big data requires careful planning of the data architecture, from data extraction and data cleaning to data processing, visualization, and storage. Access to reliable data is very important in COVID-19 disease surveillance. Hence, data management, which includes data collection and data cleaning scripts, is set up such that there is regular access to data that serves as input to the analytics and models generated and viewed in the FASSSTER platform. Part I provides detailed information on the first three stages of the data pipeline: identifying data sources, critical inspection of data, and performing data cleaning and imputation. Chapter 1 presents the background and origins of FASSSTER, with highlights on components that were deemed relevant during its three-year implementation. Chapter 2 establishes the groundwork on data management, describing sources and characteristics of data sets used in the FASSSTER platform. Chapter 3 describes the development and operations environment of the FASSSTER platform, highlighting a practical approach in the implementation of data science techniques.

Reference

DOH (2014) Manual of procedures for the Philippine Integrated Disease Surveillance and Response. https://doh.gov.ph/sites/default/files/publications/PIDSRMOP3ED_VOL1_2014.pdf

Chapter 1

Origins of FASSSTER

Maria Regina Justina E. Estuar and Kennedy E. Espina

Abstract Early detection of local disease transmission relies on access to real-time case reports. Automated health information systems such as electronic medical records and hospital information systems are primary data sources for surveillance information. However, other sources of information, including symptoms reported by households to primary care facilities and even those posted by the public on social media platforms, may be considered additional inputs in increasing the reliability and validity of forecasts on potential outbreaks. In 2016, a cloud-based platform for syndromic surveillance and localized disease modeling was designed to complement the Philippine Integrated Disease Surveillance and Response System (PIDSR). The Feasibility Analysis on Syndromic Surveillance using Spatio-Temporal Epidemiological ModeleR or FASSSTER was developed to extract data from different data sources, including electronic medical records, case reports submitted by hospitals, and symptoms posted on social media. The platform was also designed for online scenario-based disease modeling, time series, and spatio-temporal forecasting using STEM (IBM, Spatiotemporal epidemiological modeler project. http://www.eclipse.org/stem/, 2022). The platform was developed as a social networking platform providing the ability for users to share, view, simulate, and visualize model output. This chapter discusses the origins of FASSSTER, specifically, the framework used in the development of the FASSSTER platform during its first three years. The chapter presents a multi-dimensional approach to the design and development of an online disease surveillance platform using data from disease surveillance systems, electronic medical records, and health-related reports found online.

Keywords Disease surveillance platform · Disease modeling · Infodemiology

M. R. J. E. Estuar (✉) · K. E. Espina
Ateneo de Manila University, Loyola Heights, Katipunan Ave, Quezon City, Philippines

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
M. R. J. Estuar and E. De Lara-Tuprio (eds.), COVID-19 Experience in the Philippines, Disaster Risk Reduction, https://doi.org/10.1007/978-981-99-3153-8_1


1.1 Building Blocks of Syndromic Surveillance

Syndromic surveillance aims to provide early detection of the spread or outbreak of a particular disease. A high number of reported cases from hospitals or symptoms reported from electronic medical records may be indicators of localized transmission or outbreak. Regular tracking of symptoms may serve as an early warning of a possible outbreak in a community (CDC 2022). What distinguishes syndromic surveillance from other disease monitoring systems is its use of the frequency of illness with a specific set of non-routine clinical features (Henning 2004). Aside from direct monitoring of symptoms in emergency wards, another source of syndromic data is electronic medical records indicating the frequency of visits, the count of prolonged symptoms, or the count of symptoms clustering in a specific location. With the availability of online information, search engines reporting an increase in search traffic related to home remedies for a certain illness may also be an indicator of a potential outbreak. With the proliferation of social media platforms, an unusual increase in reports of flu, for example, by citizens in a particular area, or in reported absences because of prolonged fever, may also be a sign of an impending outbreak. In the Philippines, eHealth solutions in the 8 years since the launch of the Philippine eHealth Strategic Framework and Plan in 2014 have been geared toward the development and adoption of eHealth technologies that will facilitate the implementation of universal health care (eHe 2022). Within the context of any type of disaster, access to health information is relevant, especially in disaster-affected areas where paper records are lost or destroyed.
More so, immediate and real-time reports on the health status of a community serve as input to epidemiologists and public health managers who are tasked to recommend interventions that will prevent further transmission of the disease and reduce mortality. Since syndromic data can come from a variety of data sources, data that are deemed significant, like daily reporting of symptoms, the initial assessment of the primary care physician, weather information, and cases reported from emergency rooms, should be collected and placed in digital storage accessible to epidemiology and surveillance units for real-time monitoring of outbreaks. The system that collects or extracts syndromic data should also have the capability to produce statistical computations and models that are relevant in health monitoring. Disease modeling plays a significant role in understanding the effect of government-provided interventions on the spread of the disease in a community, as well as the acceptance and adoption of these programs by the community. The disease model is set up so that it closely mimics real scenarios and allows for the generation of alternative scenarios in relation to population behavior and transmission behavior. A disease modeling platform usually contains a variety of disease models addressing various aspects of disease monitoring and surveillance. By doing so, the system is able to capture patterns and trends that can define how the community behaves and responds to the management of the outbreak. In most cases, not many epidemiology and surveillance units have disease modeling experts as part of their team. Instead, disease modelers from academic and research institutions are tapped when there is a requirement to perform computations for scenario-based time series projections as well as other epidemiological parameters that can be used as health indicators during an outbreak. Disease surveillance systems should, therefore, also consider different types of users to accommodate collaborative work between decision-makers, public health experts, and modelers. The FASSSTER framework was developed to consider these elements. Specifically, it includes three layers, namely: the data layer, which houses all the data coming from different sources; the modeling layer, where all code that produces output from disease models is stored and updated; and the visualization layer, where users are able to select data sources and statistical analyses and view outputs in graphs, charts, and maps.
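The early-warning idea described in this section, where an unusual rise in reported symptoms may signal a possible outbreak, can be illustrated with a minimal sketch. The rule below (flag a day whose count exceeds the mean of the preceding days plus a multiple of their standard deviation) is a generic aberration-detection heuristic assumed here for illustration only; it is not taken from the FASSSTER platform, and the function name, window length, multiplier, and sample counts are all hypothetical.

```python
from statistics import mean, stdev


def flag_unusual_days(daily_counts, baseline_days=7, k=2.0):
    """Return indices of days whose count exceeds the baseline mean + k * std.

    Hypothetical helper: the baseline is the `baseline_days` counts
    immediately preceding each day being checked.
    """
    flagged = []
    for i in range(baseline_days, len(daily_counts)):
        window = daily_counts[i - baseline_days:i]
        threshold = mean(window) + k * stdev(window)
        if daily_counts[i] > threshold:
            flagged.append(i)
    return flagged


# Illustrative (made-up) daily counts of fever reports in one community.
fever_reports = [4, 5, 3, 4, 6, 5, 4, 5, 14, 4]
alerts = flag_unusual_days(fever_reports)
print(alerts)  # indices of days flagged for follow-up
```

A rule of this kind only marks days for closer inspection; deciding whether a flagged day reflects a real outbreak would still rest with the surveillance team.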

1.1.1 The FASSSTER Framework

The 2016 version of the FASSSTER framework consists of the following components: baseline datasets, online data sources, and modeling to display the spatio-temporal spread of diseases, determine the population at risk, and detect the early onset of an outbreak (Fig. 1.1). Baseline datasets are extracted from existing systems managed by the Department of Health, including PIDSR (Philippine Integrated Disease Surveillance and Response) and ESR (Event Surveillance Reporting), which are both used by hospitals to report cases of notifiable diseases, as well as SPEED (Surveillance in Post Extreme Emergencies and Disasters) and eFHSIS (Electronic Field Health Service Information System) for cases reported in non-hospital health facilities. Another source of health data is the electronic medical record systems used by primary care facilities. Online syndromic data sources include health-related reports coming from news portals and social media such as Twitter, Facebook, and other online posting platforms. The Spatio-Temporal Epidemiological Modeler or STEM (IBM 2022) is the modeling tool that was used in conjunction with the FASSSTER disease modeling platform. In fact, the FASSSTER acronym includes STEM. Because the proposal was a feasibility study on the development of an online tool that can run local STEM models, it gave birth to the project title: Feasibility Analysis on Syndromic Surveillance using Spatio-Temporal Epidemiological modeleR (FASSSTER). Figure 1.2 shows how the different components are connected at the back end. Data from third-party systems that collect health information, including electronic medical records, event-based surveillance systems, and online sources, are stored in a data warehouse. Scripts are developed to produce datasets that serve as input to the model. Inputs to the model may be in the form of initial values or parameters, or raw data such as location codes. The main modeling tool of FASSSTER is STEM.
STEM—“Spatio-Temporal Epidemiological Modeler” is a software tool that models complex epidemiological scenarios in multiple populations in distributed locations. The software is implemented under the Eclipse Foundation, an open-source software incubation program


M. R. J. E. Estuar and K. E. Espina

Fig. 1.1 FASSSTER Framework adopted from the PCHRD FASSSTER 2017 terminal report

1 Origins of FASSSTER


Fig. 1.2 FASSSTER backend framework

managed by IBM (2022). It provides a visualization of the possible spread of a disease in different scenarios. The merit of the software is that it is free and can be configured to model a variety of diseases and to interface with other systems for both input of data and output of information. Since STEM operates as a stand-alone platform, one of the main objectives of the project was to interface with STEM and develop an online environment for scenario-based disease modeling and surveillance. A STEM server is set up in the FASSSTER environment, where a compiled version of the local STEM model is uploaded. Users can upload the models, run the base scenarios, set up initial parameters, and run visualizations through the FASSSTER web interface.

1.1.2 Spatio-Temporal Epidemiological Modeler (STEM) Compartmental disease modeling begins with understanding the nature of the disease, specifically its transmission dynamics. The required compartments are then determined, along with the appropriate parameters that govern the movement of individuals across these compartments. In the FASSSTER framework, STEM is installed on the machine of the disease modeler, and disease modeling is performed outside the FASSSTER platform for its initial run. Within STEM, the modeler sets up the scenario or environment of the model. This includes selecting the solver that will run the simulation and selecting the geographic, population, and disease models, whose input values or parameters are obtained from local data. The last two steps define the spatial and temporal aspects of the model: an infector is created to initiate the infection from a specified location, and a sequencer is created to set the start date and end date of the simulation. The disease modeling interface in STEM provides an option to select from a standard list of compartmental disease models or to create a new compartmental model based on the mathematical equations that define it. Figure 1.3 shows an example of designing an SEIR compartmental model where S contains the initial susceptible


Fig. 1.3 Disease modeling interface in STEM

population, E is for the population who will be exposed to the disease, I is for those who will be infected, and R is for those who will recover. Additional compartments may be added depending on the transmission dynamics of the disease being studied.
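The compartment structure described above can be illustrated with a minimal sketch. This is not STEM code: the function names, the fixed one-day Euler step, and the parameter values (`beta`, `sigma`, `gamma`) are assumptions chosen only to show how individuals move from S to E to I to R.

```python
from dataclasses import dataclass


@dataclass
class SEIRState:
    s: float  # susceptible
    e: float  # exposed
    i: float  # infectious
    r: float  # recovered


def seir_step(state, beta, sigma, gamma, dt=1.0):
    """Advance the compartments by one Euler step of length dt (in days)."""
    n = state.s + state.e + state.i + state.r
    new_exposed = beta * state.s * state.i / n * dt   # S -> E
    new_infectious = sigma * state.e * dt             # E -> I
    new_recovered = gamma * state.i * dt              # I -> R
    return SEIRState(
        s=state.s - new_exposed,
        e=state.e + new_exposed - new_infectious,
        i=state.i + new_infectious - new_recovered,
        r=state.r + new_recovered,
    )


def simulate(initial, beta, sigma, gamma, days):
    """Return the list of daily states, starting with the initial state."""
    states = [initial]
    for _ in range(days):
        states.append(seir_step(states[-1], beta, sigma, gamma))
    return states


# Illustrative values only: 10 initial infections in a population of 10,000.
trajectory = simulate(
    SEIRState(s=9990.0, e=0.0, i=10.0, r=0.0),
    beta=0.5, sigma=0.2, gamma=0.1, days=60,
)
```

Because every outflow from one compartment is an inflow to the next, the total population stays constant in this sketch. Adding a compartment, such as deaths, means adding one more flow term in `seir_step`.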

1.1.3 Localizing Parameter Estimation Data used in disease modeling within the FASSSTER framework were extracted from the Philippine Integrated Disease Surveillance and Response (PIDSR) system and Philippine Health Statistics records. Values not found in local systems were taken from the literature and further validated by key public health consultants. Data from the DOH's PIDSR System were used as the gold standard in validating the results of the disease models. The validation of the projections was based on historical records from PIDSR. During its inception years, the FASSSTER platform extracted incidence data related to the three (3) diseases being modeled, namely Dengue, Measles, and Typhoid. For these data to be useful, further processing is needed, such as the transposition of the incidence data to STEM "format," which counts the total number of people infected in a day instead of the number of new cases for the day. From the


Fig. 1.4 FASSSTER disease modeling validation process

Fig. 1.5 FASSSTER interface for initializing parameters

transposed data, the initializer decorators for the STEM simulations could already be automatically generated using an R Script. A script was developed to invoke headless runs for the simulation automatically. The bash script gets the decorator files for all 48 months of simulation, which then produces the logged files once a simulation is done. The logged files, made monthly, are then subjected to statistical analysis, which includes finding their correlation with the generated cumulative PIDSR data (Fig. 1.4). The web interface of FASSSTER allows the user to set initial parameters that serve as input to the model. Figure 1.5 shows an example of how initial parameters are entered into the system. A user can edit the Parameters on the screen. The parameters are the transition rates that the disease model uses when running a simulation. Typical parameters for an SEIR(S) model include Disease Transmission Rate, Incubation Period, Recovery Rate, and Mortality Rate with death due to natural causes and death due to the disease.
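The transposition and correlation steps can be sketched as follows. The original scripts were written in R and bash and are not reproduced here. This Python sketch assumes that the STEM "format" is a running total of reported cases, and the sample numbers are made up for illustration.

```python
from itertools import accumulate
from math import sqrt


def to_cumulative(daily_new_cases):
    """Turn daily new-case counts into a running total (assumed STEM input form)."""
    return list(accumulate(daily_new_cases))


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length and at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: reported new cases and a simulated cumulative curve.
reported_new = [3, 5, 2, 0, 4]
reported_cumulative = to_cumulative(reported_new)
simulated_cumulative = [2.5, 7.0, 9.5, 11.0, 13.5]
r = pearson(reported_cumulative, simulated_cumulative)
```

A coefficient close to 1 indicates that the simulated curve follows the shape of the reported cumulative curve. It does not by itself show that the magnitudes agree.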


1.2 Use of Electronic Medical Records for Symptoms Monitoring and Reporting Primary care facilities play a vital role in syndromic surveillance because information on symptoms is collected when a patient consults with a clinic physician. Digitized health data, when collected in real time, can provide frequency counts of symptoms and their clustering by location, which could be an indicator of local transmission. However, most primary care facilities have limited access to electronic medical record systems, which poses a challenge to real-time monitoring of symptoms. In the Philippines, an initiative by the Department of Health (DOH) and the Philippine Health Insurance Corporation (PhilHealth) led several groups to develop electronic medical record systems to assist the DOH in digitizing health information for the entire country. Since 2012, several electronic medical record systems have been deployed in municipal and city health facilities. In 2017, the Department of Health, together with PhilHealth, issued a Joint Administrative Order (JAO 2017) that provided guidelines for the implementation and validation of eHealth systems. The process included the requirement to comply with data standards provided by the Department of Health and to pass the software validation process based on the successful transmission of patient registration and patient encounter data to the Philippine Health Information Exchange Lite (PHIE Lite). Under this JAO, health facilities are required to adopt an electronic medical record system from the list of validated and accredited EMR providers. With this policy, disease surveillance and monitoring systems can be enhanced by including symptoms and diagnoses collected from electronic medical records.
For its pilot study, the FASSSTER framework was designed to connect to two electronic medical record systems, namely, iClinicsys, the electronic medical record system designed and maintained by the Department of Health, and SHINEOS+, an electronic medical record system developed by the Ateneo Java Wireless Competency Center (AJWCC) for the public service program of Smart Communications, Inc. Two types of integration were tested in this framework. The first method is file upload, which was the case for iClinicsys: a facility uploads anonymized iClinicsys information in spreadsheet format to the FASSSTER platform. The second method is through application programming interfaces (APIs). Web services were developed in SHINEOS+ so that data is passed on to FASSSTER automatically. For example, an API call for a specified month and year will extract the following information: age, age consulted, blood type, height, weight, complaint, diagnosis list, Barangay (Barangay code), City (City code), Province (Province code), and Region (Region code). This information serves as input to the model and visualization layers. All data submitted to FASSSTER are anonymized and aggregated, which means that no personal details of patients are disclosed.
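A minimal sketch of the aggregation step is shown below. The record layout, the field names `barangay_code` and `diagnosis`, and the placeholder codes are all hypothetical. The actual iClinicsys spreadsheets and SHINEOS+ web services are not reproduced here.

```python
from collections import Counter

# Hypothetical anonymized encounter records: no names or identifiers,
# only a location code and the recorded diagnosis.
records = [
    {"barangay_code": "000000001", "diagnosis": "Dengue"},
    {"barangay_code": "000000001", "diagnosis": "dengue "},
    {"barangay_code": "000000002", "diagnosis": "Typhoid"},
    {"barangay_code": "000000001", "diagnosis": "Measles"},
]


def aggregate_by_location(rows):
    """Count encounters per (barangay code, normalized diagnosis) pair."""
    return Counter(
        (row["barangay_code"], row["diagnosis"].strip().lower())
        for row in rows
    )


counts = aggregate_by_location(records)
```

Only the aggregated counts would be passed on to the modeling and visualization layers in this sketch, which is one simple way to keep individual encounters out of the output.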


Fig. 1.6 FASSSTER tweet collection and processing framework

1.3 Infodemiology: Use of Crowdsourced Health-Related Information Aside from the data from existing surveillance systems and electronic medical record systems, there are other latent sources of reports regarding possible outbreaks. For example, Google Trends can provide spatial and temporal trends on searches about diseases such as flu or dengue. Information from various social network providers is another source of information on symptoms. Reports on absences in schools and workplaces within a cluster or a community may also be signs of the possible spread of a particular illness. This information may also prove to be relevant in monitoring outbreaks. These new kinds of surveillance techniques fall under what is now referred to as "infodemiology," a portmanteau of the words "information" and "epidemiology." Infodemiology, in relation to disease surveillance, may be characterized as the extraction and quantification of health-related information that a population produces in online environments, for purposes of public health and public policy (Eysenbach 2009). Infodemiology aims to address the deficiencies of the current surveillance process implemented by the public health sector by including readily available data in electronic form. Quantified online health information, such as health-seeking behavior, online health reports, and the movement and mobility of the population, is then included as multipliers to disease model parameters.

1.3.1 Extracting Data from Tweets A Twitter collection system was developed in the FASSSTER platform to extract tweets containing health-related keywords. Figure 1.6 shows the tweet collection and processing framework in the FASSSTER platform. Keywords used to extract tweets include mentions of the disease (e.g., dengue), its symptoms (e.g., high fever, body malaise), health-seeking behavior (e.g., is 38° considered high fever?), and other behaviors related to the disease (e.g., absent for 3 days because of high fever; what is the best medicine for high fever?). The tweets are pre-processed using standard text mining methodologies including tokenization, removal of stopwords, and obtaining frequency counts. The remaining words comprise the health corpus that is then used in developing machine learning-based models to distinguish health-related information from non-health-related information, detect the absence or presence of the disease, or predict the spread of the disease within a selected population.
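The pre-processing steps named above (tokenization, stopword removal, frequency counts) can be sketched with the Python standard library. The stopword list here is a small illustrative set, not the list used in FASSSTER, and the sample tweets are taken from the keyword examples in the text.

```python
import re
from collections import Counter

# Illustrative stopword list only; a real pipeline would use a fuller list
# that also covers Filipino words.
STOPWORDS = {"a", "and", "because", "for", "have", "i", "is", "my", "of",
             "the", "to", "what"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequencies(tweets):
    """Count the non-stopword tokens across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tok for tok in tokenize(tweet) if tok not in STOPWORDS)
    return counts


tweets = [
    "Absent for 3 days because of high fever",
    "What is the best medicine for high fever?",
]
freq = term_frequencies(tweets)
```

The resulting counts are the kind of feature that a downstream classifier could use to separate health-related tweets from unrelated ones.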

1.4 Spatio-Temporal Visualization of Health and Syndromic Surveillance Data Visualization of health-related information is an important resource to policymakers, especially when decisions on strategies and interventions need to be made in a short period of time. During emergencies, for example, when several reports are submitted to policymakers, there is still a need to summarize and make sense of the information. The goal of data visualization is to capture the meaning of the information that needs to be provided and present it in such a way that very little cognitive processing is required from the users. For example, if the objective is to show which areas are at high risk of local transmission, a map view that shows hot spots with a legend on risk levels is more appropriate than a table listing the locations with corresponding case counts. The FASSSTER platform has been designed such that projections produced from disease model runs, symptoms and initial diagnoses produced from EMR submissions, and validated health-related information extracted from social media are visualized on a map so that locations at risk can easily be determined. As shown in Fig. 1.7, hotspots mark areas with high numbers of reported typhoid cases. FASSSTER is also able to provide these visualizations over time. When the user drags the slider or hits the play button, the colors on the map change depending on the case counts reported for a particular day. The FASSSTER platform is also equipped with real-time visualization of reports coming from electronic medical records. This places importance on symptom reporting and tracking in preventing local transmission through early detection of possible outbreaks. As shown in Fig. 1.8, FASSSTER displays the frequency of diagnoses reported by electronic medical records. In the same manner, the tweet layer displays validated health-related tweets over time, as seen in Fig. 1.9.


Fig. 1.7 FASSSTER STEM disease model visualization layer

Fig. 1.8 FASSSTER electronic medical record visualization layer


Fig. 1.9 FASSSTER tweet visualization layer


1.5 Insights on the Feasibility Analysis of an Online Syndromic Surveillance Platform Using Spatio-Temporal Visualization The design and development of the FASSSTER disease surveillance and scenario-based disease modeling platform used an iterative approach with regular consultations with its target users, namely: the mathematical disease modeler, the simulations manager, and the decision-maker. The mathematical modeler derives the mathematical equations that define the characteristics and transmission dynamics of the disease. The output is usually in the form of ordinary differential equations or delay differential equations. The simulations manager is considered a user of the models. With knowledge of epidemiology and public health, the simulations manager produces projections by selecting a disease model and adjusting its initial parameters based on intervention scenarios. The FASSSTER platform allows models and scenarios to be shared with other users within the same environment. Once the scenarios are built, the decision-maker can then run the simulations and make decisions aided by the output of the models. In its first year, the focus was on the design and development of localized disease models, ensuring their validity by comparing their output with historical data. In the same year, the web features for uploading disease models to the server, as well as the databases for symptoms tracking and infodemiology tracking, were built. The web interface was also designed so that users can create scenarios by adjusting the values of initial parameters, run simulations online, and share scenario-based models with other users. In the first version of FASSSTER, STEM was used to generate the spatio-temporal projections produced after running the model. For its second year, the focus was on setting up the interoperability layer so that data coming from different sources can automatically be submitted to the FASSSTER back end.
Scripts were developed to transform the datasets into inputs to the model. The visualization component was also developed in the second year, and it was in this year that users were heavily engaged in the testing of the platform. Work for the final year included preparations for the evaluation and institutionalization of the FASSSTER Disease Surveillance and Scenario-based Modeling tool in the Department of Health. The result of the three-year implementation showed the relevance of setting up a health information exchange service that functions as a clearing house for the data that come from different data sources. Data standards are also important so that the different systems follow the same data format, which is necessary for interoperability. The data infrastructure should also be designed such that locally developed models can be hosted, executed, and visualized in an online environment. It was also noted that there is a need for high performance computing machines in running the models, especially in simulations that require the processing of large chunks of data. Finally, such platforms should be complemented by a team of experts including epidemiologists, public health specialists, mathematical modelers, and disease surveillance experts who will work hand-in-hand with computer scientists, system developers, and data scientists. The first three years of pilot implementation paved the way for the improvements in the design, development, and consequent implementation of the FASSSTER Platform for the COVID-19 pandemic.

References

(2017) http://ehealth.doh.gov.ph/index.php?option=com_content&view=category&layout=blog&id=87
(2022) https://www.cdc.gov/nssp/overview.html
(2022) http://ehealth.doh.gov.ph/index.php
(2022) Spatiotemporal epidemiological modeler project. http://www.eclipse.org/stem/
Eysenbach G (2009) Infodemiology and infoveillance: framework for an emerging set of public health informatics methods to analyze search, communication and publication behavior on the internet. J Med Internet Res 11(1):e11. https://doi.org/10.2196/jmir.1157
Henning KJ (2004) Overview of syndromic surveillance: what is syndromic surveillance. http://www.cdc.gov/MMWR/preview/mmwrhtml/su5301a3.htm

Chapter 2

Management of COVID-19 Data for the FASSSTER Platform Maria Regina Justina E. Estuar, Lenard Paulo V. Tamayo, Jay-Arr Buhain, Jillian Yasmin Chua, Daniel Joseph Benito, Lean Franzl Yao, and Raymond Francis Sarmiento

Abstract The transformation of data into meaningful information is essential for decision-makers when managing and mitigating the spread of a disease during an epidemic. However, real data is not without its uncertainties, primarily because of the manner in which data is collected, transformed, and interpreted. A crucial step, therefore, is to ensure that data quality is not compromised: data should be complete, valid, and accurate. This requires reviewing and verifying data sources and understanding the process of data collection, validation, and analysis. The selection of which variables to use in analysis depends on the availability of data and corresponding proxy or substitute data. At times, imputation methods are also employed to replace missing data. This chapter presents the components and processes that were developed and used in the management of Philippine COVID-19 data for the FASSSTER platform. The first part discusses the data extracted from data sources, namely: COVID KAYA, DOH Data Collect, Google mobility, and other publicly available datasets. The second part describes data cleaning and imputation methods performed on the datasets. The third part describes the transformed datasets used for analysis and modeling.

Keywords Data sources · Data management · Data quality

M. R. J. E. Estuar (B) · L. P. V. Tamayo · J. Y. Chua · D. J. Benito · L. F. Yao · R. F. Sarmiento
Ateneo de Manila University, Quezon City, Philippines

J.-A. Buhain
Cavite State University, Cavite City, Philippines

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
M. R. J. Estuar and E. De Lara-Tuprio (eds.), COVID-19 Experience in the Philippines, Disaster Risk Reduction, https://doi.org/10.1007/978-981-99-3153-8_2


M. R. J. E. Estuar et al.

2.1 Data Sources At the onset of the pandemic, the Department of Health (DOH) convened a group of modelers, data scientists, software engineers, and public health practitioners to assist the government in creating public awareness of the pandemic in the form of (1) publicly accessible websites that will provide information related to COVID-19 in the Philippines, (2) disease model projections that will provide insights on future cases based on feasible interventions, (3) data collection tools for contact tracing, and (4) digital applications for information dissemination. The group was eventually formalized as the sub-technical working group on data analytics (sTWG-DA) with the core task of producing data-driven analytics and modeling for the Inter-Agency Task Force (IATF), headed by the Secretary of Health and composed of representatives from all government agencies and sectors. The FASSSTER Team was included in the group that provided assistance in computations, analytics, and modeling. Based on previous experiences in the development of FASSSTER platform, the COVID-19 platform was set up to produce outputs that can be categorized into three areas. The platform provided descriptive statistics displaying summaries such as daily cases and active cases and their distribution to mild, moderate, severe, and critical, health care utilization on ICU beds and mechanical ventilators, economic indicators, and security indicators. The platform also produced inferential analysis used to obtain insights for nowcasting such as case doubling time, growth rate, positivity rate, and visualization of epidemic curves, to name a few. The core feature of the platform was the modeling tools, including intervention-based time series projections, time-varying reproduction numbers, spatial cluster analysis, and spatiotemporal case projections. 
Automation of these decision-making tools into a web-based platform required that the data sources be accessed on a daily basis, in a standard format and have already undergone validation and verification. The succeeding section describes the data sources, data cleaning process, data imputation, and detailed description of the data used for analysis and modeling.

2.1.1 The COVID-19 Line List The COVID-19 line list comprises fields found in the case investigation forms (CIF) used by the Department of Health (DOH) for collecting information about COVID-19 cases. Information extracted includes personal details of the reported case, date of onset, laboratory results, and patient status. The daily line list undergoes a data cleaning process to ensure that the data have been validated for accuracy and consistency. The data is extracted from a Google Cloud BigQuery (BQ) warehouse that stores all information related to COVID-19. Figure 2.1 provides a high-level diagram of the data pipeline from the data source, warehouse, and corresponding models and

2 Management of COVID-19 Data for the FASSSTER Platform


Fig. 2.1 Linelist data from data source to FASSSTER

analytics. COVID Kaya collects and stores information on confirmed cases. After extraction from COVID Kaya, the data is transferred into the BigQuery warehouse. FASSSTER extracts data from BigQuery for analytics, modeling, and visualizations. The minimum data fields for computations include data of confirmed cases, confirmed case details, reporting facility, symptoms, comorbidity, and location. The Philippine Standard Geographic Code (PSGC) is used as the standard coding for all locations in the Philippines. Table 2.1 provides the list of fields, data types, descriptions, and sample data from linelist dataset. The data dictionary serves as a guide to knowing the expected value of a variable.

2.1.2 DOH Data Collect The DOH Data Collect (DDC) application gathers data from hospitals and health service providers on a daily basis. Specifically, the DDC repository includes occupied and vacant hospital beds, isolation rooms, ICU beds, mechanical ventilators,


Table 2.1 Linelist data dictionary

| Field name | Type | Description | Sample data |
| Age_Group | STRING | 10-year age group | ≥80y/o |
| Sex | STRING | Sex | FEMALE |
| Repatriate | STRING | Repatriate/Returning Overseas Filipinos (ROF) | |
| Admitted | STRING | Binary variable indicating patient has been admitted to hospital | |
| Comorbidity | STRING | Yes/No if patient has comorbidity | DIABETES MELLITUS |
| Status | STRING | Known current health status of patient (asymptomatic, moderate, mild, severe, critical, died, recovered) | DIED |
| Date_Died | DATE | Date died | |
| Date_Recovered | DATE | Date recovered | |
| Date_Onset | DATE | Date of onset of symptoms | |
| Report_Date | DATE | Date when the case reported | |
| max_Report_Date | DATE | Latest report date | |
| regionPSGC | STRING | Philippine Standard Geographic Code of Region of residence | 140000000 |
| provincePSGC | STRING | Philippine Standard Geographic Code of Province of residence | 143200000 |
| cityPSGC | STRING | Philippine Standard Geographic Code of Municipality or City of residence | 143211000 |
| barangayPSGC | STRING | Philippine Standard Geographic Code of Barangay of residence | 143211001 |
| imputed_Date_Onset | STRING | Imputed onset of date (not being updated anymore) | |

Sample values for the date fields, in source order: 44296, 44285, 44292, 44480.

Table 2.2 Philippines NCR mobility data

| Date | Retail and recreation (%) | Grocery and pharmacy (%) | Parks (%) | Transit stations (%) | Workplaces (%) | Residential (%) |
| 10/11/22 | −5 | 41 | −12 | −13 | −1 | 12 |
| 10/12/22 | −6 | 41 | −9 | −15 | 0 | 13 |
| 10/13/22 | −7 | 37 | −11 | −13 | 0 | 12 |
| 10/14/22 | −1 | 44 | −8 | −16 | 2 | 12 |
| 10/15/22 | 9 | 55 | 6 | 3 | 12 | 8 |

and bed wards. It also provides data on human resource needs. The case linelist and testing aggregates are also included in the data sources. All three data sources are pre-processed for anonymization, field selection, and generation of data products for the COVID-19 Tracker and for public consumption. The Data Collect is publicly available via the DOH COVID-19 Tracker website. Based on the DDC documentation, data are encoded from a paper case investigation form (CIF) which increases the risk of data discrepancies and inaccuracies. In this case, data are constantly updated and validated from different data sources.

2.1.3 Mobility Data The Google mobility data is used as an additional input to the FASSSTER analytics and model. It helps in understanding the current mobility status in a particular area. The report provides information on daily trends in mobility behavior in a particular area, including visits to supermarkets, pharmacies, parks, transit stations, and workplaces, among others. For example, Table 2.2 shows the percentage increase or decrease in movement for select places from October 11, 2022, to October 15, 2022. Visits to retail and recreational places stayed below baseline for the first four days (between −7% and −1%) and then rose to 9% above baseline on the fifth day. The movement behavior can be used to determine where people go and stay most of the time. These datasets also show how visits and length of stay at different place categories change compared to baseline data such as pre-pandemic mobility (see Fig. 2.2).

2.2 Data Cleaning The data collection stage is prone to encountering data quality issues. Handwritten data may either be illegible or misspelled. Moreover, tools that aid data collection are often in free text form, where errors commonly occur. Data collected are also


Fig. 2.2 Sample mobility data

sometimes incomplete and are replaced with default values that are considered null in the data processing. Data inconsistency may be caused by data that is collected from multiple different sources, and without data standards, data may be represented differently. Therefore, pre-processing of data is necessary after data collection. Regardless of how data are collected, it can be assumed that errors may be present. The main goal of data cleaning is to minimize errors and data issues to improve overall data quality. It is a repetitive process of data analysis, transformation, verification, correction, and validation.

2.2.1 Data Cleaning Process for COVID-19 Line List The COVID-19 Line List goes through a data quality assurance process before it is released for data analysis and modeling. In this stage, the linelist is submitted to a data quality program that inspects for discrepancies such as redundancy and incompleteness. A manual process of data verification is performed by the data quality team and the sentinel sites. As data discrepancies increased, it was necessary to write automated scripts for validation to check the current data quality scores of the COVID-19 dataset before executing the data cleaning scripts. As an example for this section, the dataset used consisted of 2,476,430 records, covering the period of October 2020 to September 2021. It is split into four datasets to test the data cleaning scripts on each dataset to check the applicability and correctness of the data transformation workflows defined from each iteration of the data cleaning process framework.


Fig. 2.3 Data cleaning process workflow

In Fig. 2.3, the data cleaning process framework, which was designed for automating the data cleaning and validation scripts of the COVID-19 data, includes (1) analyzing data, (2) identifying data issues, (3) defining transformation workflows, (4) executing defined transformation workflows, and (5) executing data validations to check data quality. The first three parts of the framework are done manually, while the remaining parts are automated using Python scripts. Common data quality issues were observed in the COVID-19 confirmed cases dataset. Table 2.3 shows the breakdown of the current data quality of the COVID-19 data. Each data quality score is computed as the count of data points that comply with the criteria of a data quality type divided by the total count of data points. The overall data quality score of the dataset is the average of the scores for data completeness, data consistency, data uniqueness, and data validity. For the period of analysis, the overall data quality of the COVID-19 dataset is 90.91%, with validity at 95.94%, consistency at 99.59%, completeness at 71.13%, and uniqueness at 96.98%. Data validity issues are observed when certain data values are not found in their corresponding reference list or do not follow a predefined standard format. Data with expected values are referenced from a list as shown in Table 2.4. Validity issues in data with expected values are minimal at 0.01%, where errors were mostly caused by misspelled data or data that were not part of the reference list. Rule violations cause validity issues under contact details and date data. Of the contact details data, 46.87% are invalid because they either contain non-numerical data or do not meet the expected character count of telephone and cellphone numbers in the country. No invalid data are found under date data, however. Data inconsistency is a data quality issue where data are not uniform throughout the dataset. This issue is mostly observed in name, contact details, and date data.
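The score arithmetic can be checked with a short sketch. The function names are illustrative and not taken from the FASSSTER scripts. The per-field validity values are those listed in Table 2.3, and the sketch assumes that fields marked N/A are left out of the column average.

```python
def quality_score(compliant_count, total_count):
    """Percentage of data points that comply with a data quality criterion."""
    return 100.0 * compliant_count / total_count


def mean(values):
    return sum(values) / len(values)


# Overall score: average of the four data quality dimensions reported in the text.
dimension_scores = {
    "completeness": 71.13,
    "consistency": 99.59,
    "uniqueness": 96.98,
    "validity": 95.94,
}
overall = mean(list(dimension_scores.values()))

# Validity column of Table 2.3, skipping the fields marked N/A (assumption).
validity_by_field = [77.66, 75.31, 98.27, 100, 100, 100, 99.99,
                     100, 100, 100, 100, 100]
validity_average = mean(validity_by_field)
```

Rounded to two decimals, `overall` gives 90.91 and `validity_average` gives 95.94, which matches the figures in the text and in the Average row of Table 2.3.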


Table 2.3 Percentage of data quality scores of COVID-19 data

| Field name | Completeness (%) | Validity | Consistency |
| First name | 100 | N/A | 100% |
| Middle name | 96.80 | N/A | 97.16% |
| Last name | 100 | N/A | 100% |
| Telephone no. | 8.73 | 77.66% | N/A |
| Cellphone no. | 6.74 | 75.31% | N/A |
| Office no. | 99.91 | 98.27% | N/A |
| Region code | 99.80 | N/A | N/A |
| Province code | 99.80 | N/A | N/A |
| Municipality code | 96.93 | N/A | N/A |
| Barangay code | 95.52 | N/A | N/A |
| Suffix | 1.08 | 100% | N/A |
| Sex | 100 | 100% | N/A |
| Civil status | 6.60 | 100% | N/A |
| Health status | 37.38 | 99.99% | N/A |
| Case classification | 46.59 | 100% | N/A |
| Birth date | 99.76 | 100% | 100% |
| Interview date | 3.00 | 100% | 100% |
| Admitted date | 3.59 | 100% | 100% |
| Discharge date | 0.57 | 100% | 100% |
| Average | 71.13 | 95.94% | 99.59% |

Table 2.4 Data with expected values

| Column name | Expected values |
| Suffix | SR, JR, I, II, III, IV, V |
| Sex | FEMALE, MALE |
| Civil status | SINGLE, MARRIED, WIDOWED, SEPARATED, DIVORCED, OTHERS |
| Health status | MILD, RECOVERED, SEVERE, ASYMPTOMATIC, MODERATE, DIED, CRITICAL |
| Case Classification | CONFIRMED CASE, PUM: OTHERS, PUM: CASE LINKED CONTACTS, SUSPECT CASE |
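A validity check against the reference lists in Table 2.4 might look like the sketch below. The dictionary keys and function names are hypothetical. The sketch also assumes that values are compared after trimming and uppercasing, and that empty values are counted under completeness rather than validity.

```python
# Reference lists copied from Table 2.4; the dictionary keys are illustrative.
EXPECTED_VALUES = {
    "suffix": {"SR", "JR", "I", "II", "III", "IV", "V"},
    "sex": {"FEMALE", "MALE"},
    "civil_status": {"SINGLE", "MARRIED", "WIDOWED", "SEPARATED",
                     "DIVORCED", "OTHERS"},
    "health_status": {"MILD", "RECOVERED", "SEVERE", "ASYMPTOMATIC",
                      "MODERATE", "DIED", "CRITICAL"},
    "case_classification": {"CONFIRMED CASE", "PUM: OTHERS",
                            "PUM: CASE LINKED CONTACTS", "SUSPECT CASE"},
}


def is_valid(field, value):
    """True if the normalized value appears in the field's reference list."""
    return value.strip().upper() in EXPECTED_VALUES[field]


def validity_score(field, values):
    """Percentage of non-empty values found in the reference list."""
    present = [v for v in values if v and v.strip()]
    if not present:
        return None  # nothing to validate
    valid = sum(is_valid(field, v) for v in present)
    return 100.0 * valid / len(present)


# "RECOVERD" stands in for the misspellings mentioned in the text.
sample = ["MILD", "died", "RECOVERD", "", "SEVERE"]
score = validity_score("health_status", sample)
```

In this sample, three of the four non-empty values match the list, so the misspelled entry is the only one flagged for correction or nullification.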


Values under name data are expected to be full names; however, 0.95% are initials, observed most often in the middle name field (3.55%). Inconsistencies are frequent in contact details data but are ignored: on entry, the system automatically formats cellphone numbers to +(XX)XXX-XXX-XXXX and telephone numbers to +(XX)XXXX-XXXX, so the different prefixes could cause valid numbers to be tagged as inconsistent. No inconsistency issues are found in date data because the standard format is followed when data are encoded into the system.

Missing data and lack of data are data quality issues that fall under data completeness, and they are observed in almost all fields of the COVID-19 data. Completeness issues arise when data are not collected during the data collection stage. Removing invalid or incorrect data also affects completeness, because the affected data points are emptied or nullified. Validating the data before and after data cleaning shows that the data completeness score is reduced by 4%. Some uncollected data are not imputed because assigning false values to missing data can create data uncertainties. Only missing location codes are imputed, by deriving them from other fields with available data, with the PSGC dataset as a mapping reference. The columns used for the imputation of location codes are barangay code, region name, province name, municipality name, current address, and permanent address. Of the imputed location codes, 51% come from using the available province code to filter the PSGC dataset and impute the correct municipality code, and 46% come from filtering the PSGC dataset by the combination of available region, province, and municipality names. Missing region and province codes can then be imputed from the imputed municipality codes. Overall, data completeness of location codes improved by 2% after imputation.

Data uniqueness issues exist when records are exact duplicates, when duplicate records differ in selected fields, or when distinct records share a unique identifier. The de-duplication methods created for this study detected and removed 3.02% of the COVID-19 dataset as duplicate data. The breakdown of duplicates removed per iteration is shown in Table 2.5. Most duplicates (2.02%) are removed by the standard de-duplication process in the first iteration, where the key columns are name data, sex, and birth date. The third iteration follows with 0.70%; it uses the Soundex of names to catch possibly misspelled names.

No data quality issues are caused by the defined transformation workflows during the execution of data cleaning. The main effects of data cleaning are a decrease in the completeness of contact details data and an improvement in the completeness of location data. Most of the nullified contact details are due to the default contact number used when data is not provided. The improved completeness of location codes helps improve model results, making it possible to sort and identify localities with high numbers of confirmed cases so that COVID-19 responses can be implemented there immediately.
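The location-code imputation described above can be sketched in Python as follows. This is a minimal illustration, not the FASSSTER implementation: the lookup rows and place names are hypothetical, the record keys (`municipality_code`, `province_code`, and so on) are invented for the example, and the code layout ("PH" followed by two region digits, two province digits, two municipality digits, and three barangay digits) is an assumption inferred from the sample PSGC values shown in Table 2.8.

```python
# Hypothetical miniature PSGC lookup; names are placeholders, not real PSGC entries.
PSGC_ROWS = [
    {"code": "PH045813000", "region": "Region A", "province": "Province A",
     "municipality": "Municipality X"},
    {"code": "PH045805000", "region": "Region A", "province": "Province A",
     "municipality": "Municipality Y"},
]


def derive_parent_codes(municipality_code):
    """Return (region_code, province_code) by zeroing the trailing digits.

    Assumes the 'PH' + RR + PP + MM + BBB layout inferred from Table 2.8.
    """
    digits = municipality_code[2:]
    return "PH" + digits[:2] + "0000000", "PH" + digits[:4] + "00000"


def _norm(text):
    """Lower-case and collapse whitespace so name comparisons are tolerant."""
    return " ".join(str(text).lower().split()) if text else ""


def impute_municipality_code(record, psgc_rows):
    """Fill a missing municipality code from the other available fields."""
    if record.get("municipality_code"):
        return record["municipality_code"]
    name = _norm(record.get("municipality_name"))
    if not name:
        return None
    if record.get("province_code"):
        # Route 1: filter the lookup by the available province code.
        matches = [r["code"] for r in psgc_rows
                   if derive_parent_codes(r["code"])[1] == record["province_code"]
                   and _norm(r["municipality"]) == name]
    else:
        # Route 2: filter by the region, province, and municipality names.
        matches = [r["code"] for r in psgc_rows
                   if _norm(r["region"]) == _norm(record.get("region_name"))
                   and _norm(r["province"]) == _norm(record.get("province_name"))
                   and _norm(r["municipality"]) == name]
    # Impute only when the match is unambiguous.
    return matches[0] if len(matches) == 1 else None
```

Once a municipality code is imputed, `derive_parent_codes` supplies the missing region and province codes, which mirrors the last imputation step described in the text.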


M. R. J. E. Estuar et al.

Table 2.5 Duplicates removed per iteration
Iteration | Key | Duplicates detected and removed (%)
1 | Name Data, Sex, Birth Date | 2.02
2 | Name Data, Sex, Manipulated Birth Date | 0.29
3 | Name Soundex, Sex, Birth Date | 0.70
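The three de-duplication iterations in Table 2.5 can be sketched as below. This is an illustrative sketch under stated assumptions, not the study's code: the record layout and sample records are made up, the "manipulated birth date" is assumed to be a key that treats interchanged month and day as the same date, and Soundex is implemented in its common American form and applied to each word of the name.

```python
_CODES = {c: d for d, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                                  "4": "L", "5": "MN", "6": "R"}.items()
          for c in letters}


def soundex(word):
    """American Soundex: first letter plus three digits."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    encoded, prev = letters[0], _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":            # H and W do not separate equal codes
            continue
        code = _CODES.get(c, "")
        if code and code != prev:
            encoded += code
        prev = code              # vowels reset the previous code
    return (encoded + "000")[:4]


def key_exact(r):
    return (r["name"].lower(), r["sex"], r["birth_date"])


def key_swapped_date(r):
    # Assumed manipulation: month and day are order-insensitive in the key.
    year, month, day = r["birth_date"].split("-")
    return (r["name"].lower(), r["sex"], year, min(month, day), max(month, day))


def key_soundex(r):
    return (" ".join(soundex(w) for w in r["name"].split()), r["sex"], r["birth_date"])


def drop_duplicates(records, key_fn):
    """Keep the first record per key; return kept records and number removed."""
    seen, kept = set(), []
    for rec in records:
        key = key_fn(rec)
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept, len(records) - len(kept)


def run_iterations(records):
    removed = []
    for key_fn in (key_exact, key_swapped_date, key_soundex):
        records, n = drop_duplicates(records, key_fn)
        removed.append(n)
    return records, removed


# Made-up sample records for illustration only.
SAMPLE = [
    {"name": "Juan Dela Cruz", "sex": "M", "birth_date": "1980-03-04"},
    {"name": "Juan Dela Cruz", "sex": "M", "birth_date": "1980-03-04"},  # exact duplicate
    {"name": "Juan Dela Cruz", "sex": "M", "birth_date": "1980-04-03"},  # month/day swapped
    {"name": "Juan Dela Crus", "sex": "M", "birth_date": "1980-03-04"},  # misspelled name
    {"name": "Maria Santos", "sex": "F", "birth_date": "1975-12-25"},
]
```

Calling `run_iterations(SAMPLE)` removes one record in each iteration, which shows why the second and third keys catch duplicates that an exact match overlooks.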

The data cleaning process framework designed for the COVID-19 dataset helped identify the data quality issues that need to be addressed in the COVID-19 confirmed cases dataset. It also served as a guide in automating the data cleaning and validation process. The criteria considered for identifying data issues during data analysis were data completeness, data consistency, data validity, and data uniqueness. The primary source of data quality issues was errors made during the data collection stage, caused by human error, incomplete information, and different data representations. Duplicate data are expected, but some duplicates are overlooked because of misspelled names and interchanged months and days in birth dates. Patterns and causes of data issues found in the COVID-19 dataset were identified during the data analysis stage. Some invalid data can be addressed through correction, missing data through imputation, and inconsistent data through reformatting; the defined data transformations were automated.

2.2.2 Data Imputation

One obstacle in using the data for further analysis is missing data. This was much more apparent in the early months of the pandemic, since the data collection process for COVID-19 cases was not yet properly integrated in the health facilities. Imputing missing data originally consisted of two steps: (1) the specimen collection date was used as a proxy for missing symptom onset dates, and the resulting linelist was passed on to the models, where (2) missing dates for symptom onset (for those without a proxy), admission, recovery, and death were imputed based on the mean delay to the report date. This method places a significant bias on the delays, concentrating them around the mean, which prompted the study of alternative methods of imputation. The methods studied were k-nearest neighbors (kNN), fitting to a distribution, and predictive mean matching (PMM). The final imputation method used in FASSSTER is PMM, which uses actual observed values as replacements for missing data. This section also discusses the process of selecting the most appropriate imputation method amid the need for timely delivery of results while ensuring that imputed values do not introduce new biases. The criteria used to decide the method of imputation included code efficiency (since this will be implemented daily on a growing data set) and the ability to follow the distribution of raw data (Fig. 2.4).

Fig. 2.4 Linelist data

The sample linelist data show the values that need to be imputed: (1) Date of Admission: the date when the patient is admitted to the hospital or isolated; (2) Date of Recovery: the date when the patient recovers based on DOH guidelines; (3) Date of Result: the date when the results of the laboratory test were known; (4) Date of Specimen Collection: the date when the patient was swabbed; and (5) Date of Symptom Onset: the date when the patient started to experience symptoms of the disease. To fill the missing dates, delays are used, where a delay is the difference between the date under consideration and its reference date. For instance, onset delay refers to the number of days from Date of Symptom Onset to Date of Specimen Collection (i.e., Delay = Date of Specimen Collection − Date of Symptom Onset). Table 2.6 shows the sequence for filling the missing dates.

Table 2.6 Order of imputation
Date to impute | Reference date
Date of specimen collection | Report date
Date of symptom onset | Date of specimen collection
Date of admission | Date of specimen collection
Date of result | Date of specimen collection
Date of recovery | Date of symptom onset

    library(dplyr)
    library(readr)
    library(mice)
    library(lattice)
    library(ggplot2)
    library(abind)
    library(foreach)
    library(doParallel)
    library(lubridate)
    set.seed(1)

    # The listing is incomplete in the source at this point: only the name
    # GetRecentBackupFiles survives before the pipeline resumes at right_join().
    # GetRecentBackupFiles ...

      right_join(data2.df, by = "Case_Number") %>%  # combine with the data with donor indicators
      # Replace missing dates with imputed date, if previously imputed --
      dplyr::mutate(
        Date_Specimen = case_when(!is.na(Date_Specimen) ~ Date_Specimen,
                                  TRUE ~ imputed_Date_Specimen),
        Date_Onset = case_when(!is.na(Date_Onset) ~ Date_Onset,
                               TRUE ~ imputed_Date_Onset),
        Result_Date = case_when(!is.na(Result_Date) ~ Result_Date,
                                TRUE ~ imputed_Result_Date),
        Date_Admitted = case_when(!is.na(Date_Admitted) ~ Date_Admitted,
                                  TRUE ~ imputed_Date_Admitted)
      ) %>%
      dplyr::select(-imputed_Date_Specimen, -imputed_Date_Onset,
                    -imputed_Result_Date, -imputed_Date_Admitted) %>%
      # If the missing date was previously imputed, do not include in the imputation --
      dplyr::mutate(
        ImputeSpecimen_recipient = case_when(is.na(Date_Specimen) ~ 1,
                                             TRUE ~ 0),
        ImputeOnset_recipient = case_when(is.na(Date_Onset) &
                                            tolower(Status) != "asymptomatic" ~ 1,
                                          TRUE ~ 0),
        ImputeAdmitted_recipient = case_when(is.na(Date_Admitted) ~ 1,
                                             TRUE ~ 0),
        ImputeResult_recipient = case_when(is.na(Result_Date) ~ 1,
                                           TRUE ~ 0)
      ) %>%

    Listing 2.1 Imputation code snippet
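To make the idea behind predictive mean matching on delays more concrete, the following Python sketch imputes missing symptom onset dates from specimen collection dates. It is a simplified illustration and not the FASSSTER or `mice` implementation: the single predictor (days since the earliest specimen date), the record keys, and the donor pool size `k` are assumptions chosen for the example, and the sketch omits the random draw of regression parameters that a full PMM procedure performs.

```python
import random
from datetime import date, timedelta


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return slope, my - slope * mx


def pmm_impute_onset(cases, k=3, seed=1):
    """Fill missing 'onset' dates with specimen date minus a donor's observed delay."""
    donors = [c for c in cases if c["onset"] is not None]
    if not donors:
        raise ValueError("at least one case with an observed onset date is needed")
    rng = random.Random(seed)
    origin = min(c["specimen"] for c in cases)

    def predictor(case):
        return (case["specimen"] - origin).days

    # Onset delay as defined in the text: specimen collection minus onset.
    delays = [(c["specimen"] - c["onset"]).days for c in donors]
    slope, intercept = fit_line([predictor(c) for c in donors], delays)
    donor_pred = [slope * predictor(c) + intercept for c in donors]

    result = []
    for c in cases:
        if c["onset"] is not None:
            result.append(dict(c, imputed=False))
            continue
        pred = slope * predictor(c) + intercept
        # Donor pool: the k donors whose predicted delay is closest.
        pool = sorted(range(len(donors)), key=lambda i: abs(donor_pred[i] - pred))[:k]
        delay = delays[rng.choice(pool)]  # borrow an actually observed delay
        result.append(dict(c, onset=c["specimen"] - timedelta(days=delay), imputed=True))
    return result


# Made-up cases for illustration only.
CASES = [
    {"specimen": date(2020, 4, 10), "onset": date(2020, 4, 6)},
    {"specimen": date(2020, 4, 12), "onset": date(2020, 4, 7)},
    {"specimen": date(2020, 4, 15), "onset": date(2020, 4, 12)},
    {"specimen": date(2020, 4, 16), "onset": None},
]
```

Because every imputed delay is copied from a donor, the imputed values stay within the range of observed delays instead of collapsing onto the mean, which is the property the text gives as the motivation for choosing PMM.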

2.3 Utilization of Linelist Data for Analytics and Modeling

FASSSTER analytics that used the Case Linelist data are as follows: Case Statistics, Epidemic Curves, Growth Rate, Province/HUC/ICC Risk Classification, Barangay Hotspots, COVID-19 Deaths Over Time, and Case Doubling Time. Table 2.7 shows which linelist fields were used for the analytical tools in the FASSSTER platform.

2.3.1 Case Statistics

Generation of Case Statistics is a feature included in the Health Section of the FASSSTER platform. Case Statistics shows the distribution of the different types of confirmed COVID-19 cases including age, sex, comorbidity cases, and comorbidity deaths. This dashboard can be used to view the daily trend of cases, and daily distribution of cases for a select location by region, province, city, or municipality. The fields that are used for the computation of Case Statistics are the following:
• Age_Group
• Sex
• Repatriate
• Admitted
• Comorbidity
• Status
• max_Report_Date
• regionPSGC
• provincePSGC
• cityPSGC
• barangayPSGC

Table 2.7 Summary of linelist fields used by FASSSTER analytical tools
Columns: Field name | Case statistics | Epidemic curves | Growth rate | Province/HUC/ICC risk classification | Barangay hotspots | COVID-19 deaths over time | Case doubling time
Rows (field names): Age_Group; Sex; Repatriate (mainly used to check which records to exclude); Admitted; Comorbidity; Status; Date_Died; Date_Recovered; Date_Onset; Report_Date; max_Report_Date; regionPSGC; provincePSGC; cityPSGC; barangayPSGC; imputed_Date_Onset (DOH and FASSSTER imputed)
The fields used by each tool are listed in Sects. 2.3.1 to 2.3.6.

The Age_Group variable is used as a filter to show the number of cases per age category. The Sex variable shows the distribution of cases between Male and Female. Data on comorbidity is also included as an indicator of vulnerability and, therefore, of high-priority areas for pharmaceutical and non-pharmaceutical interventions. Returning overseas Filipinos, or repatriates, are also tagged for purposes of contact tracing. Status describes the severity of the patient's condition. Report Date is used as a basis for computing the date of onset of COVID-19 for each case as well as for estimating the date of recovery. PSGC codes per area, from the region down to the smallest geographical area, the barangay, are also collected to support area-based analysis.

2.3.2 Epidemic Curves

Epidemic curves by location, as shown in Fig. 2.5, show the number of new COVID-19 cases per day based on report date and date of symptom onset, together with the median, the 7-day moving average of new cases (7DMA), and the 75th percentile of daily new cases from the first case up to the latest date. Epidemic curves around the country show the daily numbers of new cases in NCR, Luzon except NCR, Visayas, and Mindanao, based on the report date.

Fig. 2.5 Epidemic curves

The fields that are used for Epidemic Curves are the following:
• Repatriate
• Date_Onset
• Report_Date
• max_Report_Date
• regionPSGC
• provincePSGC
• cityPSGC
• barangayPSGC
• imputed_Date_Onset
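As a rough illustration of how the series behind an epidemic curve can be derived, the Python sketch below counts new cases per day from a list of dates and summarizes them with the median and the 75th percentile. The function names and sample dates are hypothetical, and the choice of the "inclusive" quantile method is an assumption; the source does not state which percentile definition FASSSTER uses.

```python
from collections import Counter
from datetime import date, timedelta
from statistics import median, quantiles


def daily_new_cases(case_dates):
    """Count cases per calendar day, filling days without cases with zero."""
    counts = Counter(case_dates)
    first, last = min(counts), max(counts)
    days = [first + timedelta(days=i) for i in range((last - first).days + 1)]
    return days, [counts.get(d, 0) for d in days]


def curve_summary(counts):
    """Median and 75th percentile of daily new cases (needs at least two days)."""
    _, _, q3 = quantiles(counts, n=4, method="inclusive")
    return {"median": median(counts), "p75": q3}


# Made-up report dates for illustration only.
REPORT_DATES = (
    [date(2021, 1, 1)] * 2
    + [date(2021, 1, 3)]
    + [date(2021, 1, 4)] * 3
    + [date(2021, 1, 5)] * 4
)
```

The same two functions can be applied to Date_Onset (or imputed_Date_Onset) values instead of Report_Date values to obtain the onset-based curve.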

2.3.3 Growth Rate

Growth Rate, as shown in Fig. 2.6, generates two graphs: (1) the graph of the 7-day moving average (7DMA) of new confirmed cases over time, and (2) the graph of the growth factor of the 7DMA through time.

Fig. 2.6 Growth rate

Fig. 2.7 Risk classification

The fields that are used for Growth Rate are the following:
• Repatriate
• Report_Date
• regionPSGC
• provincePSGC
• cityPSGC
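The two Growth Rate series can be sketched in Python as below. The 7DMA is a trailing average over seven days. The text does not define the growth factor, so the ratio used here (the 7DMA on a given day divided by the 7DMA `lag` days earlier, with `lag=1` by default) is an assumption made for illustration only.

```python
def moving_average(values, window=7):
    """Trailing moving average; the first value covers the first full window."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


def growth_factor(averages, lag=1):
    """Ratio of each average to the one `lag` steps earlier (None if that is zero)."""
    return [averages[i] / averages[i - lag] if averages[i - lag] else None
            for i in range(lag, len(averages))]


# Made-up daily counts: one week at 7 cases per day, then one week at 14.
DAILY_CASES = [7] * 7 + [14] * 7
```

With these sample counts the 7DMA rises from 7 to 14 over eight points, and every growth factor is above 1, which is how an accelerating outbreak would appear in the second graph.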


2.3.4 Risk Classification

Risk Classification, as shown in Fig. 2.7, provides a summary of risk indicators of a locality (region, province, independent component city (ICC), or highly urbanized city (HUC)). It is a sub-feature of Health in the FASSSTER platform that shows the location (region, Province/HUC/ICC), cases (active and total), population, growth in cases in the recent 3–4 versus 5–6 weeks, growth in cases in the recent 1–2 versus 3–4 weeks, average daily attack rate (per 100,000 population), and the number and percentage of barangays (brgys) with new cases in the last 14 days (Pareto). Each column of the Province/HUC/ICC Risk Classification matrix is explained below.

1. Location: Name of the region and Province/HUC/ICC.
2. Active Cases: Number of confirmed positive cases who are alive and have not recovered from COVID-19 as of the specified date in the given province, city, or municipality.
3. Total Cases: Cumulative number of confirmed positive cases since the start of the outbreak in the given locality as of the specified date, including those who have died or have recovered from COVID-19.
4. Growth in Cases: The difference between two numbers (No. of Cases Recent 1–2 Weeks and No. of Cases Recent 3–4 Weeks).
5. No. of Cases Recent 1–2 Weeks: The total number of new cases in the locality over the last 14 days up to the specified date.
6. No. of Cases Recent 3–4 Weeks: The total number of new cases in the locality over a 2-week period starting from 28 days prior up to 14 days prior to the specified date.
7. Average Daily Attack Rate: The average daily attack rate over the last 14 days per 100,000 population in a given locality as of the specified date.
8. Number and % of Brgys with New Cases in Last 14 Days: Three numbers which indicate (a) the number of barangays in the locality with new COVID-19 cases in the last 14 days, (b) the corresponding percentage of all the barangays, and (c) the number of barangays in the locality that contribute to 80% of new cases in the last 14 days.

The fields that are used for the Risk Classification Table are the following:
• Repatriate
• Status
• Report_Date
• regionPSGC
• provincePSGC
• cityPSGC
• barangayPSGC
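The indicator columns above can be sketched in Python as follows. This is an illustrative sketch with made-up inputs and hypothetical function names. Growth in cases follows the difference stated in item 4; the average daily attack rate is assumed to be the mean daily number of new cases over the 14 days, divided by the population and scaled to 100,000; and the Pareto count is assumed to be the smallest number of barangays, taken from the highest count downward, whose new cases reach 80% of the total.

```python
def pareto_count(counts, share_num=4, share_den=5):
    """Smallest number of largest counts whose sum reaches share_num/share_den of the total."""
    total = sum(counts)
    running = taken = 0
    for n in sorted(counts, reverse=True):
        # Integer comparison avoids floating-point issues at exactly 80%.
        if running * share_den >= total * share_num:
            break
        running += n
        taken += 1
    return taken


def risk_indicators(cases_recent_1_2, cases_recent_3_4, population,
                    new_cases_by_barangay, total_barangays):
    """Compute the growth, attack-rate, and barangay columns for one locality."""
    affected = [n for n in new_cases_by_barangay.values() if n > 0]
    return {
        "growth_in_cases": cases_recent_1_2 - cases_recent_3_4,
        "adar_per_100k": cases_recent_1_2 / 14 / population * 100_000,
        "brgys_with_new_cases": len(affected),
        "pct_brgys_with_new_cases": 100 * len(affected) / total_barangays,
        "brgys_contributing_80pct": pareto_count(affected),
    }


# Made-up locality for illustration only.
EXAMPLE = risk_indicators(
    cases_recent_1_2=100,
    cases_recent_3_4=80,
    population=500_000,
    new_cases_by_barangay={"A": 50, "B": 30, "C": 15, "D": 5, "E": 0},
    total_barangays=5,
)
```

In this example four of five barangays report new cases, but only two of them account for 80% of the total, which is the kind of concentration the Pareto column is meant to expose.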


2.3.5 Barangay Hotspots

Barangay Hotspots shows the table of barangays with new cases for the last 7 or 14 days and can be filtered by region, province, and municipality/city. The fields that are used for the Barangay Hotspots are the following:
• Repatriate
• Status
• Report_Date
• regionPSGC
• provincePSGC
• cityPSGC
• barangayPSGC

2.3.6 Deaths over Time

COVID-19 Deaths Over Time, as shown in Fig. 2.8, shows the number of COVID-19-related deaths per day, the 7-day moving average (7DMA) of the number of deaths, and the 7DMA of deaths per day.

Fig. 2.8 Deaths over time

The fields that are used for Deaths over Time are the following:
• Repatriate
• Status
• Date_Died
• Report_Date
• regionPSGC
• provincePSGC
• cityPSGC

2.4 Daily Hospital Report

The daily hospital report contains data about hospital bed and ventilator vacancy counts; health care worker counts (quarantined and admitted); classification counts (suspect, probable, and confirmed cases) along with their respective health status (Asymptomatic, Mild, Severe, Critical); and the PSGC of the facility. FASSSTER analytics that use the daily report data are as follows: Health Facility Bed Capacity Utilization Rates (Regional, Provincial/HUC/ICC) and Province/HUC/ICC Risk Classification (Fig. 2.9). Table 2.8 provides the list of fields, data types, descriptions, and sample data from the daily hospital report dataset.

2.4.1 Health Facility Bed Capacity Utilization Rates (Regional, Provincial/HUC/ICC)

Regional CUR (Capacity Utilization Rates) and Bed Capacity are the sub-features of Health Capacity in the FASSSTER platform. Regional CUR shows a graph of the total bed and mechanical ventilator utilization rates for each region on a given date. The Bed Capacity sub-feature shows the national, regional, provincial/HUC/ICC, and hospital summaries of bed occupancy (public and private): the total count, and the change in occupancy rate versus last week (% of beds) for severe cases (ward beds and isolation beds) and critical care (ICU beds and mechanical ventilators). From the daily hospital report dataset, the fields used in Regional CUR and Bed Capacity are cfname, reportdate, icu_v, icu_o, isolbed_v, isolbed_o, beds_ward_v, beds_ward_o, mechvent_v, mechvent_o, region_psgc, province_psgc, and city_mun_psgc (see Fig. 2.10).

Fig. 2.9 Daily report data from data source to FASSSTER

Table 2.8 Daily report data dictionary
Field name | Type | Description | Sample data
reportdate | TIMESTAMP | Date of submission | 2021-10-10T00:00:00Z
icu_v | FLOAT | Total number of ICU beds with or without negative pressure in your facility that are currently vacant and available for use by suspect, probable, or confirmed COVID-19 patients | 0
icu_o | FLOAT | Total number of ICU beds with or without negative pressure in your facility that are currently occupied by suspect, probable, or confirmed COVID-19 patients | 0
isolbed_v | FLOAT | Total number of isolation beds and converted/makeshift isolation beds based on DOH standards in your facility that are currently vacant and available for use | 3
isolbed_o | FLOAT | Total number of isolation beds and converted/makeshift isolation beds based on DOH standards in your facility that are currently occupied by suspect, probable, or confirmed COVID-19 patients | 3
beds_ward_v | FLOAT | Total number of beds in converted wards/units dedicated for confirmed COVID-19 cases in your facility that are currently vacant and available for use. Wards are not allowed for Suspect or Probable cases (PUIs) and should not be included | 4
beds_ward_o | FLOAT | Total number of beds in converted wards/units dedicated for confirmed COVID-19 cases in your facility that are currently occupied. Wards are not allowed for Suspect or Probable cases (PUIs) and should not be included | 17
mechvent_v | FLOAT | Total number of mechanical ventilators that are vacant and available for use by PUIs (Suspect or Probable patients) or Confirmed COVID-19 patients | 4
mechvent_o | FLOAT | Total number of mechanical ventilators in your facility, whether outsourced or in-house, that are currently occupied by PUIs (Suspect or Probable patients) or Confirmed COVID-19 patients | 3
region_psgc | STRING | Philippine Standard Geographic Code of Region | PH040000000
province_psgc | STRING | Philippine Standard Geographic Code of Province | PH045800000
city_mun_psgc | STRING | Philippine Standard Geographic Code of Municipality or City | PH045813000

Fig. 2.10 Regional CUR

2.4.2 Province/HUC/ICC Risk Classification

Risk Classification of a locality (region, province, independent component city (ICC), or highly urbanized city (HUC)) is a sub-feature of Health in the FASSSTER platform that shows the location (region, Province/HUC/ICC) and the Health Capacity Utilization Rate (HCUR (Regional %), HCUR (Provincial %), Dedicated Beds, and Authorized Licensed Beds). This section explains each column of the Province/HUC/ICC Risk Classification matrix pertaining to HCUR (see Fig. 2.11).

Fig. 2.11 Bed capacity

1. Location: Name of the region and Province/HUC/ICC.
2. HCUR (Regional %): The higher of the percent utilization of beds and the percent utilization of mechanical ventilators in the entire region to which the LGU belongs.
3. HCUR (Provincial %): The same as the HCUR (Regional %), but the number of beds is the count for the province, city, or municipality only.
4. Dedicated Beds: The total number of hospital beds (ICU, Isolation Beds, Beds for Ward) dedicated specifically to servicing COVID-19 cases.
5. Authorized Licensed Beds: The number of hospital beds in the province, HUC, or ICC that are licensed to operate.

Table 2.9 shows which daily hospital report fields were used for the analytical tools in the FASSSTER platform.
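The HCUR definition above (the higher of bed utilization and mechanical ventilator utilization) can be sketched in Python using the field names of the daily hospital report. This is a hedged illustration: the text does not give the utilization formula, so occupied divided by occupied plus vacant is an assumption, as is pooling ICU, isolation, and ward beds into one bed total. The sample row reuses the sample values from Table 2.8.

```python
BED_PREFIXES = ("icu", "isolbed", "beds_ward")


def utilization(occupied, vacant):
    """Percent of capacity in use; None when no capacity is reported."""
    total = occupied + vacant
    return 100 * occupied / total if total else None


def hcur(rows, psgc_field, psgc_code):
    """Higher of bed and ventilator utilization for facilities in one locality.

    psgc_field is 'region_psgc' for the regional rate or 'province_psgc'
    (or 'city_mun_psgc') for the provincial/HUC/ICC rate.
    """
    selected = [r for r in rows if r[psgc_field] == psgc_code]
    beds_o = sum(r[p + "_o"] for r in selected for p in BED_PREFIXES)
    beds_v = sum(r[p + "_v"] for r in selected for p in BED_PREFIXES)
    vent_o = sum(r["mechvent_o"] for r in selected)
    vent_v = sum(r["mechvent_v"] for r in selected)
    rates = [rate for rate in (utilization(beds_o, beds_v),
                               utilization(vent_o, vent_v)) if rate is not None]
    return max(rates) if rates else None


# One facility row built from the sample data column of Table 2.8.
SAMPLE_ROWS = [{
    "icu_v": 0, "icu_o": 0, "isolbed_v": 3, "isolbed_o": 3,
    "beds_ward_v": 4, "beds_ward_o": 17, "mechvent_v": 4, "mechvent_o": 3,
    "region_psgc": "PH040000000", "province_psgc": "PH045800000",
    "city_mun_psgc": "PH045813000",
}]
```

For the sample row, beds are about 74% utilized (20 of 27) and ventilators about 43% (3 of 7), so the HCUR under these assumptions is the bed figure.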

2.4.3 Testing Aggregates

The testing aggregates table contains data on the testing results of the laboratories, aggregated per day. It indicates the volume of samples processed by each laboratory and provides an assessment of the capacity of each laboratory. The FASSSTER analytic that uses the testing aggregates data is the National and Regional Positivity Rate (Fig. 2.12). Table 2.10 provides the list of fields, data types, descriptions, and sample data from the testing aggregates dataset. Since the National and Regional Positivity Rate can be filtered by facility and viewed as daily or cumulative, all of the fields listed in the table except validation status are utilized in the analytics (see Table 2.11).

Table 2.9 Summary of daily hospital report fields used in FASSSTER analytics
Columns: Field name | Health facility bed capacity utilization rates (Regional, Provincial/HUC/ICC) | Province/HUC/ICC risk classification
Rows (field names): cfname; reportdate; icu_v; icu_o; isolbed_v; isolbed_o; beds_ward_v; beds_ward_o; mechvent_v; mechvent_o; region_psgc; province_psgc; city_mun_psgc

Fig. 2.12 Testing aggregates data from data source to FASSSTER

Table 2.10 Testing aggregates data dictionary
Field name | Type | Description | Sample data
facility_name | STRING | Name of the institution certified by the Department of Health to perform COVID-19 Testing | Manila Doctors Hospital
report_date | DATE | Date of submission | 44479
avg_turnaround_time | FLOAT | The time interval from the time of submission of a process to the time of the completion of the process | 24
daily_output_samples_tested | FLOAT | The total specimens processed with results (positive, negative, equivocal or invalid) released from 6 pm the previous day to 6 pm of the reporting date | 294
daily_output_unique_individuals | FLOAT | The sum of all unique individuals tested (positive+negative) with results that are released from 6 pm the previous day to 6 pm of the reporting date | 294
daily_output_positive_individuals | FLOAT | Refers to the actual number of all unique individuals with positive results that are released from 6 pm the previous day to 6 pm of the reporting date | 24
daily_output_negative_individuals | FLOAT | Refers to the actual number of all unique individuals with negative results that are released from 6 pm the previous day to 6 pm of the reporting date | 270
daily_output_equivocal | FLOAT | Number of all specimens with equivocal results that are released from 6 pm the previous day to 6 pm of the reporting date | 0
daily_output_invalid | FLOAT | Number of all specimens with invalid results that are released from 6 pm the previous day to 6 pm of the reporting date | 0
remaining_available_tests | FLOAT | Remaining COVID-19 tests that can be conducted by the health facility or laboratory based on the PCR testing kits they currently have. For GeneXpert labs, this refers to the remaining number of cartridges on hand | 8100
backlogs | FLOAT | Number of samples without validated results released within 48 h after receipt | 0
cumulative_samples_tested | INTEGER | Sum of all specimens tested with validated results from the start of laboratory operation up to the reporting date | 115257
cumulative_unique_individuals | INTEGER | The total number of unique individuals who underwent COVID-19 testing, regardless of result, accumulated since the start of operations in the laboratory. One individual with 2 or more specimen results will only be counted once | 115212
cumulative_positive_individuals | INTEGER | The total number of unique individuals with a positive result after COVID-19 testing using the appropriate confirmatory test (e.g., RT-PCR) | 10768
cumulative_negative_individuals | INTEGER | The total number of unique individuals with a negative result after COVID-19 testing | 104444
pct_positive_cumulative | FLOAT | Total number of cumulative positive individuals as percent of cumulative unique individuals | 0.09
pct_negative_cumulative | FLOAT | Total number of cumulative negative individuals as percent of cumulative unique individuals | 0.91
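A positivity rate can be sketched from the testing aggregates fields as below. The cumulative version follows the definition of pct_positive_cumulative in Table 2.10 (cumulative positive individuals as a share of cumulative unique individuals). Applying the same ratio to the daily fields, and pooling several facilities by summing numerators and denominators, are assumptions made for this illustration; the function names are hypothetical and the sample row reuses the sample values from Table 2.10.

```python
def positivity_rate(positive_individuals, unique_individuals):
    """Share of tested individuals with a positive result; None if nobody was tested."""
    if not unique_individuals:
        return None
    return positive_individuals / unique_individuals


def pooled_positivity(rows, positive_field, unique_field):
    """Positivity over several facility rows, pooled before dividing."""
    positives = sum(r[positive_field] for r in rows)
    uniques = sum(r[unique_field] for r in rows)
    return positivity_rate(positives, uniques)


# One facility row built from the sample data column of Table 2.10.
SAMPLE_ROWS = [{
    "daily_output_unique_individuals": 294,
    "daily_output_positive_individuals": 24,
    "cumulative_unique_individuals": 115212,
    "cumulative_positive_individuals": 10768,
}]

daily = pooled_positivity(SAMPLE_ROWS, "daily_output_positive_individuals",
                          "daily_output_unique_individuals")
cumulative = pooled_positivity(SAMPLE_ROWS, "cumulative_positive_individuals",
                               "cumulative_unique_individuals")
```

Rounded to two decimal places, the cumulative value reproduces the 0.09 shown for pct_positive_cumulative in the sample data. Pooling before dividing keeps large laboratories from being weighted the same as small ones, which averaging per-facility rates would do.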

Table 2.11 Summary of testing aggregates fields used in FASSSTER analytics
Field name | Positivity rate
facility_name | ✓
report_date | ✓
avg_turnaround_time | ✓
daily_output_samples_tested | ✓
daily_output_unique_individuals | ✓
daily_output_positive_individuals | ✓
daily_output_negative_individuals | ✓
daily_output_equivocal | ✓
daily_output_invalid | ✓
remaining_available_tests | ✓
backlogs | ✓
cumulative_samples_tested | ✓
cumulative_unique_individuals | ✓
cumulative_positive_individuals | ✓
cumulative_negative_individuals | ✓
pct_positive_cumulative | ✓
pct_negative_cumulative | ✓
validation_status |

The FASSSTER models that use the Line list data are as follows: Time-series Projections, Spatio-Temporal Visualizations, Moran’s I and LISA Statistics, and Time-Varying R Number. Use of the data and the models are explained in separate chapters.

Chapter 3

FASSSTER Data Pipeline and DevOps

Lenard Paulo Tamayo, Christian Pulmano, Romel John Santos, Jay-Arr Buhain, and Raven Ico

Abstract In data science, the data pipeline serves as a methodological and potentially architectural framework for setting up systems that require near real-time monitoring through dashboards and visualization. The collection, aggregation, and analysis of data related to COVID-19 cases proved to be important in providing the community with the right information at the right time. In the beginning of the pandemic, the data used for interpretation came from different data sources. Some datasets were made available to the public by the Department of Health (DOH) by publishing a Google Drive that contained the datasets in spreadsheet format (http://bit.ly/DataDropPH). Eventually, DOH provided access to a BigQuery database to select groups where data can be automatically extracted on a daily basis. These datasets are extracted and ingested to a data warehouse for further analysis. Various data analysis and modeling techniques are applied to the data. As such, data analysis scripts are written using two popular programming languages, R and Python, to facilitate the processing and transformation of data. The stakeholders then view model outputs in a web-based visualization platform. This chapter describes the FASSSTER data pipeline, from extraction, preprocessing, and processing to produce outputs generated by analytics and models and corresponding data visualization techniques.

Keywords Data pipeline · Data analytics · Data visualization

L. P. Tamayo (B) · C. Pulmano · R. J. Santos · J.-A. Buhain · R. Ico
Ateneo de Manila University, Quezon City, Philippines

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2023
M. R. J. Estuar and E. De Lara-Tuprio (eds.), COVID-19 Experience in the Philippines, Disaster Risk Reduction, https://doi.org/10.1007/978-981-99-3153-8_3


L. P. Tamayo et al.

3.1 Description of Data Sources for the COVID-19 FASSSTER Platform

The FASSSTER platform requires the extraction of data on a daily basis. However, the datasets required by the analytical tools and models come from different data sources. For example, COVID Kaya is the main source for all confirmed COVID-19 cases and close contacts, which, at the time of writing, provides more than a million records, while the DOH Data Drop provides daily reports and testing aggregates data. A data warehouse using BigQuery was set up so that the FASSSTER platform can extract the data from a single source (Fig. 3.1). Python is a popular programming language that fits the needs of extracting, transforming, and loading data in various formats, as well as moving data to another database for specific use cases. FASSSTER uses a set of Python scripts to extract data from BigQuery. For example, COVID-19 data analytics requires the daily linelist for computing the active and cumulative cases. The linelist is extracted daily from BigQuery, and the platform aggregates the numbers to show the case statistics for each region (Fig. 3.2). A script is executed to ensure that the data used in the FASSSTER platform is up to date. To do so, the Python BigQuery dependency must first be installed using the pip3 command for Python 3.

    pip3 install --upgrade google-cloud-bigquery
    pip3 install --upgrade 'google-cloud-bigquery[pandas]'

    Listing 3.1 Python dependency installation code snippet

Fig. 3.1 Data pipeline


Fig. 3.2 Case statistics dashboard

After the installation, the data is extracted from BigQuery using a Python script, and the linelist, daily report, and testing aggregates datasets are stored and warehoused for FASSSTER consumption. The script starts by importing the necessary modules and setting the environment variable that points to the access key.

    from google.cloud import bigquery
    import pandas as pd
    import os

    # Point the Google client libraries to the service account key file.
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "[path to .json service account key file]"

    Listing 3.2 Python import modules and Google service account code snippet

The next step is to construct a BigQuery client object, as shown in the first line of Listing 3.3, perform a query that retrieves data from a specified table, make an API request, and convert the results to a Python Pandas DataFrame object.

    client = bigquery.Client()
    query_job_cc = client.query("""SELECT * FROM `[project_name.dataset.table_name]`""")
    results_cc = query_job_cc.result()  # waits for the query job to finish
    df = results_cc.to_dataframe()

    Listing 3.3 Python query API request and dataframe code snippet

The last step after fetching the data from BigQuery is to store the Pandas DataFrame as a parquet file with gzip compression using pandas.DataFrame.to_parquet. FASSSTER reads these gzip files (linelist, daily report, and testing aggregates) to do modeling and analytics and to produce visualizations.

    df.to_parquet('[name of the file].gzip', compression='gzip')

    Listing 3.4 Python pandas dataframe to parquet code snippet

FASSSTER also extracts the linelist data daily in the form of a CSV file using the same process described above. This linelist CSV file is used for the Imputation and EpiNow2.

3.2 Preprocessing

Preprocessing techniques include automated and manual inspection of the data to ensure that all data is in the correct format before it is sent to the FASSSTER platform for processing. The dataset needs to be transformed into a suitable format for the models.

3.2.1 Producing the Compartmental Model (SEIR)

Data is aggregated and reported in two ways: as the cumulative number and as the active number of cases per day. Cumulative cases are the number of cases accumulated over time since the pandemic started. Active cases are the total number of confirmed cases who are still exhibiting symptoms for a predefined time period. These two datasets are stored separately. The first step is to access the data for both cumulative and active cases. For the cumulative method of aggregation, a read method is called so that the output of the imputation preprocessing step is stored in a data frame. The imputation script produces RDS files for its outputs because the RDS format stores data more efficiently than CSV files.

data . df