Artificial Intelligence in Earth Science: Best Practices and Fundamental Challenges 0323917372, 9780323917377

Artificial Intelligence in Earth Science: Best Practices and Fundamental Challenges provides a comprehensive, step-by-st

606 45 43MB

English Pages 428 [430] Year 2023

Report DMCA / Copyright

DOWNLOAD FILE

Polecaj historie

Artificial Intelligence in Earth Science: Best Practices and Fundamental Challenges
 0323917372, 9780323917377

Table of contents :
Front Cover
Artificial Intelligence in Earth Science: Best Practices and Fundamental Challenges
Copyright
Contents
Contributors
Chapter 1: Introduction of artificial intelligence in Earth sciences
1. Background and motivation
2. AI evolution in Earth sciences
3. Latest developments and challenges
4. Short-term and long-term expectations for AI
5. Future developments and how to adapt
6. Practical AI: From prototype to operation
7. Why do we write this book?
8. Learning goals and tasks
9. Assignments & open questions
References
Chapter 2: Machine learning for snow cover mapping
1. Introduction
2. Machine learning tools and model
2.1. What is ``scikit-learn´´
2.2. Why do we use random forest
2.3. Other supporting packages used in the chapter
3. Data preparation
4. Model parameter tuning
4.1. Number of samples
4.2. Number of features
4.3. Number of trees
4.4. Tree depth
5. Model training
5.1. Splitting data into training and testing subsets
5.2. Defining the random forest model
5.3. Feature importance
5.4. Save the model
6. Model performance evaluation
6.1. Testing subset model performance
6.2. Image-wide model performance
6.3. Model performance in open areas versus forested areas
7. Conclusion
8. Assignment
9. Open questions
References
Chapter 3: AI for sea ice forecasting
1. Introduction
1.1. Sea ice
1.2. Arctic Sea ice and global climate patterns
2. Sea ice seasonal forecast
3. Sea ice data exploration
3.1. Dataset description
4. ML approaches for sea ice forecasting
4.1. ML-based sea ice forecasting
4.1.1. Data preprocessing
4.1.2. Fitting the model
4.1.3. Model evaluation
4.2. Deep learning-based sea ice forecasting
4.2.1. Data preprocessing
4.2.2. Model training
4.2.3. Model evaluation
4.3. Ensemble learning-based sea ice forecasting
4.3.1. Data concatenation
4.3.2. Model evaluation
5. Results and analysis
6. Discussion
7. Open questions
8. Assignments
References
Chapter 4: Deep learning for ocean mesoscale eddy detection
1. Introduction
2. Chapter layout
3. Data preparation
3.1. AVISO-SSH data product
3.2. Training and testing sets
3.3. SSH map preprocessing
3.4. Generate ground truth eddy masks using the py-eddy-tracker algorithm
3.5. Use multiprocessing to generate segmentation masks in parallel
3.6. Take a subset of the masks and SSH map, and save to a compressed numpy (.npz) file
4. Training and evaluating an eddy detection model
4.1. Load data
4.1.1. Specify NPZ file paths
4.1.2. Load NPZ and convert into PyTorch DataLoader
4.1.3. Examine distribution of class frequencies to identify class imbalances
4.1.4. Example visualization
4.1.5. Example visualization (animate validation data)
4.2. Defining the training components
4.2.1. Segmentation model
4.2.2. Loss function, L(fθ(x),y)
4.2.3. Optimizer
4.2.4. One-cycle learning rate scheduler
4.3. Metrics
4.3.1. Precision and recall
4.3.2. Tensorboard logger (SummaryWriter)
4.4. Train the model
4.4.1. Define training loop
4.4.2. Analyze training curves in TensorBoard
4.4.3. Run the training loop for prescribed num_epochs
4.5. Evaluate model on training and validation sets
5. Discussion
6. Summary
7. Assignments
8. Open questions
Acknowledgments
References
Chapter 5: Artificial intelligence for plant disease recognition
1. Introduction
1.1. Plant disease challenge
1.2. Promising AI technique for plant disease detection and classification
2. Data retrieval and preparation
2.1. Data variability
2.2. Protocols for image capture
2.3. Image annotation
3. Step-by-step implementation
4. Experimental results and how to select a model
5. Discussion
6. Conclusion
7. Assignment
8. Open questions
References
Chapter 6: Spatiotemporal attention ConvLSTM networks for predicting and physically interpreting wildfire spread
1. Introduction
1.1. Technical contributions
2. Methodology
2.1. ConvLSTM network
2.2. Attention-based methods for ConvLSTM networks
3. Earth AI workflow
3.1. Dataset acquisition and preparation
3.1.1. Input-output sequence generation
3.1.2. Data normalization
3.2. Modeling workflow demonstration
3.2.1. Attention ConvLSTM networks architecture
Imports
Model configuration
Convolutional block attention module (CBAM)
Nonattention ConvLSTM block
CSA-ConvLSTM block
SCA-ConvLSTM block
Encoder-decoder block
Train and test network
3.2.2. Execute the model
3.3. Physical interpretability of the trained model: integrated gradients-based feature importance
4. Results
4.1. Prediction performance
4.2. Physical interpretation
5. Conclusions
6. Assignment
7. Open questions
References
Chapter 7: AI for physics-inspired hydrology modeling
1. Introduction and background
2. PyTorch and autodifferentiation
2.1. Getting started with PyTorch
2.2. Autodifferentiation theory
2.3. Practical use of autodifferentiation in PyTorch
3. Extremely brief background on numerical optimization
3.1. First-order methods: Gradient descent and other flavors for training neural networks
3.2. Second-order methods: Standards for numerical solutions to differential equations
3.3. Brief detour on numerically solving ODEs
The hydrologist's favorite: The linear reservoir model
4. Bringing things together: Solving ODEs inside of neural networks
The nonlinear reservoir model
Learning the reservoir conductivity function with neural networks
4.1. Split out the input/output data
4.1.1. Let us train!
4.2. What did the network actually learn though?
4.3. Introducing torchdiffeq
5. Scaling up to a conceptual hydrologic model
5.1. The system of equations
5.2. Data
5.3. The model training functions
5.4. Setting up our training/testing data
5.5. Defining the model setup
5.6. Training the model
5.7. Model analysis
6. Conclusions
6.1. Exercises
6.2. Open questions
References
Further reading
Chapter 8: Theory of spatiotemporal deep analogs and their application to solar forecasting
1. Introduction
1.1. A brief history of weather analogs
1.2. Machine learning and its integration with analog ensemble
1.3. What you will learn in this chapter
2. Research data
2.1. Surface radiation budget network
2.2. Numerical weather prediction models
3. Methodology
3.1. Analog forecasting
3.1.1. Quantification of similarity between weather patterns
3.1.2. Generation of future predictions
3.2. Analog ensemble and the spatial extension
3.3. Spatial-temporal similarity metric with machine learning
4. Results and discussion
4.1. Verification at a single location
4.2. Search space extension
4.3. Weather analog identification
4.4. Machine learning interpretability via attribution
5. Final remarks
6. Assignment
7. Open questions
Appendix A. Deep learning layers and operators
A.1. Convolution
A.2. Nonlinear activation
A.3. Pooling
A.4. Convolutional long short-term memory network
Appendix B. Verification of extended analog search with GFS
Appendix C. Weather analog identification under a high irradiance regime
Appendix D. Model attribution
References
Chapter 9: AI for improving ozone forecasting
1. Introduction
1.1. What you will learn in this chapter
1.2. Prerequisites
2. Background
3. Data collection
3.1. AirNow O3 concentration
3.1.1. TROPOMI O3
3.2. CMAQ simulation data
4. Dataset preparation
5. Machine learning
5.1. Extreme gradient boosting model
5.2. Accuracy assessment
5.3. Comparison with other ML models
6. ML workflow management
7. Discussion
7.1. Accuracy improvement
7.2. Stability and reliability
8. Conclusion
9. Assignment
10. Open questions
11. Lessons learned
References
Chapter 10: AI for monitoring power plant emissions from space
1. Introduction
1.1. What you will learn in this chapter
1.2. Credentials
1.3. Prerequisites
2. Background
3. Data collection
3.1. TROPOMI tropospheric NO2 data
3.2. MERRA-2 meteorology data
3.3. EPA eGRID data
3.4. MODIS MCD19A2 product
4. Preprocessing
4.1. TROPOMI NO2
4.2. MERRA-2
4.3. MCD19A2
4.4. Merging training data
5. Machine learning
5.1. Support vector regression (SVR)
5.2. Utility functions
6. Managing emission AI workflow in Geoweaver
7. Discussion
8. Summary
9. Assignment
10. Open questions
11. Lessons learned
References
Chapter 11: AI for shrubland identification and mapping
1. Introduction
2. What youll learn
3. Background
4. Prerequisites
5. Model building
5.1. Preprocessing
5.2. Model fitting
5.3. Model evaluation
6. Discussion
7. Summary
8. Assignment
9. Open questions
References
Chapter 12: Explainable AI for understanding ML-derived vegetation products
1. Introduction
2. Background
3. Prerequisites
4. Method & technique
4.1. Choosing a machine learning model
4.2. Explainable artificial intelligence (XAI)
4.3. Local and global interpretability
5. Experiment & results
5.1. ELI5
5.2. Implementation
5.2.1. Conclusion
5.3. SHAP
5.3.1. Implementation
5.3.2. Conclusion
5.4. Accumulated local effects (ALE)
5.4.1. Implementation
5.4.2. Conclusion
5.5. Anchor
5.5.1. Implementation
5.5.2. Conclusion
6. Summary
7. Assignment
8. Open questions
9. Lessons learned
Acknowledgments
References
Further reading
Chapter 13: Satellite image classification using quantum machine learning
1. Introduction
1.1. Machine learning
1.2. Quantum computer and informatics
1.3. Quantum machine learning
1.4. Remote sensing (RS) and land cover classification
1.5. Vegetation and nonvegetation cover
2. Data
2.1. Satellite data retrieval
2.2. Split images into batches for annotation
3. Applying QML on MODIS hyperspectral images
3.1. Quantum neural network
3.2. Land cover (binary) classification
3.3. Setup of TensorFlow, TensorFlow quantum, and Cirq
3.3.1. TensorFlow (TF)
3.3.2. TensorFlow quantum (TFQ)
3.3.3. Cirq
3.4. Setup
3.5. Loading and preprocessing data
3.6. Quantum circuit data encoding
3.7. Quantum neural network: Building and compiling the model
3.8. Training the QNN model
3.9. Classification performance
4. Conclusions
5. Assignments
6. Open questions
Acknowledgment
References
Chapter 14: Provenance in earth AI
1. Introduction
2. Overview of relevant concepts in provenance, XAI, and TAI
2.1. Guidelines for building trustworthy AI
2.2. Understanding explainable AI
2.3. Provenance and documentation
3. Need for provenance in earth AI
3.1. Use of AI in the earth science domain
3.2. Related work in provenance and earth science
4. Technical approaches
4.1. Metaclip (METAdata for CLImate products)
4.2. Kepler scientific workflow system
4.3. Geoweaver
5. Discussion
6. Conclusions
7. Assignment
8. Open questions
Acknowledgments
References
Chapter 15: AI ethics for earth sciences
1. Introduction
2. Prior work
3. Addressing ethical concerns during system design
4. Considerating algorithmic bias
5. Designing ethically driven automated systems
6. Assessing the impact of autonomous and intelligent systems on human well-being
7. Developing AI literacy, skills, and readiness
8. On documenting datasets for AI
9. On documenting AI models
10. Carbon emissions of earth AI models
11. Conclusions
12. Assignments
13. Open questions
References
Index
Back Cover

Citation preview

ARTIFICIAL INTELLIGENCE IN EARTH SCIENCE

This page intentionally left blank

ARTIFICIAL INTELLIGENCE IN EARTH SCIENCE Best Practices and Fundamental Challenges Edited by

ZIHENG SUN Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA, United States Center for Spatial Information Science and Systems, George Mason University, Fairfax, VA, United States

NICOLETA CRISTEA Department of Civil and Environmental Engineering, University of Washington, Seattle, WA, United States eScience Institute, University of Washington, Seattle, WA, United States

PABLO RIVAS Department of Computer Science, School of Engineering and Computer Science, Baylor University, Waco, TX, United States

Elsevier Radarweg 29, PO Box 211, 1000 AE Amsterdam, Netherlands The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom 50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States Copyright © 2023 Elsevier Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions. This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein). Notices Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein. ISBN: 978-0-323-91737-7 For information on all Elsevier publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Candice Janco Acquisitions Editor: Peter Llewellyn Editorial Project Manager: Joshua Mearns Production Project Manager: Bharatwaj Varatharajan Cover Designer: Christian J. Bilbow Typeset by STRAIVE, India

Contents

3 Sea ice data exploration 44 4 ML approaches for sea ice forecasting 45 5 Results and analysis 55 6 Discussion 56 7 Open questions 57 8 Assignments 57 References 57

Contributors ix 1. Introduction of artificial intelligence in Earth sciences 1 Ziheng Sun and Nicoleta Cristea

1 Background and motivation 1 2 AI evolution in Earth sciences 4 3 Latest developments and challenges 6 4 Short-term and long-term expectations for AI 8 5 Future developments and how to adapt 9 6 Practical AI: From prototype to operation 9 7 Why do we write this book? 11 8 Learning goals and tasks 12 9 Assignments & open questions 14 References 14

4. Deep learning for ocean mesoscale eddy detection 59 Edwin Goh, Annie Didier, and Jinbo Wang

1 2 3 4

Introduction 59 Chapter layout 60 Data preparation 61 Training and evaluating an eddy detection model 75 5 Discussion 94 6 Summary 97 7 Assignments 97 8 Open questions 98 Acknowledgments 99 References 99

2. Machine learning for snow cover mapping 17 Kehan Yang, Aji John, Ziheng Sun, and Nicoleta Cristea

1 Introduction 17 2 Machine learning tools and model 18 3 Data preparation 19 4 Model parameter tuning 21 5 Model training 28 6 Model performance evaluation 32 7 Conclusion 38 8 Assignment 39 9 Open questions 39 References 39

5. Artificial intelligence for plant disease recognition 101 Jayme Garcia Arnal Barbedo

1 Introduction 101 2 Data retrieval and preparation 103 3 Step-by-step implementation 105 4 Experimental results and how to select a model 111 5 Discussion 113 6 Conclusion 115 7 Assignment 115 8 Open questions 115 References 116

3. AI for sea ice forecasting 41 Sahara Ali, Yiyi Huang, and Jianwu Wang

1 Introduction 41 2 Sea ice seasonal forecast 42

v

vi

Contents

6. Spatiotemporal attention ConvLSTM networks for predicting and physically interpreting wildfire spread 119 Arif Masrur and Manzhu Yu

Appendix D Model attribution 242 References 244

9. AI for improving ozone forecasting 247 Ahmed Alnuaim (Alnaim), Ziheng Sun, and Didarul Islam

1 Introduction 119 2 Methodology 121 3 Earth AI workflow 123 4 Results 148 5 Conclusions 154 6 Assignment 155 7 Open questions 155 References 155

7. AI for physics-inspired hydrology modeling 157 Andrew Bennett

1 Introduction and background 157 2 PyTorch and autodifferentiation 160 3 Extremely brief background on numerical optimization 169 4 Bringing things together: Solving ODEs inside of neural networks 177 5 Scaling up to a conceptual hydrologic model 186 6 Conclusions 201 References 202 Further reading 203

8. Theory of spatiotemporal deep analogs and their application to solar forecasting 205 Weiming Hu, Guido Cervone, and George Young

1 Introduction 206 2 Research data 208 3 Methodology 211 4 Results and discussion 218 5 Final remarks 234 6 Assignment 235 7 Open questions 236 Appendix A Deep learning layers and operators 236 Appendix B Verification of extended analog search with GFS 238 Appendix C Weather analog identification under a high irradiance regime 240

1 Introduction 247 2 Background 249 3 Data collection 251 4 Dataset preparation 254 5 Machine learning 255 6 ML workflow management 264 7 Discussion 265 8 Conclusion 266 9 Assignment 267 10 Open questions 267 11 Lessons learned 267 References 268

10. AI for monitoring power plant emissions from space 271 Ahmed Alnuaim (Alnaim) and Ziheng Sun

1 2 3 4 5 6

Introduction 271 Background 274 Data collection 275 Preprocessing 281 Machine learning 285 Managing emission AI workflow in Geoweaver 290 7 Discussion 291 8 Summary 292 9 Assignment 292 10 Open questions 293 11 Lessons learned 293 References 294

11. AI for shrubland identification and mapping 295 Michael J. Mahoney, Lucas K. Johnson, and Colin M. Beier

1 2 3 4 5 6

Introduction 295 What you’ll learn 296 Background 296 Prerequisites 297 Model building 299 Discussion 312

vii

Contents

7 Summary 315 8 Assignment 315 9 Open questions 315 References 316

12. Explainable AI for understanding ML-derived vegetation products 317 Geetha Satya Mounika Ganji and Wai Hang Chow Lin

1 Introduction 317 2 Background 318 3 Prerequisites 320 4 Method & technique 320 5 Experiment & results 322 6 Summary 333 7 Assignment 334 8 Open questions 334 9 Lessons learned 334 Acknowledgments 335 References 335 Further reading 335

13. Satellite image classification using quantum machine learning 337 Olawale Ayoade, Pablo Rivas, Javier Orduz, and Nurul Rafi

1 Introduction 337 2 Data 340 3 Applying QML on MODIS hyperspectral images 342 4 Conclusions 353 5 Assignments 354 6 Open questions 354 Acknowledgment 354 References 354

14. Provenance in earth AI 357 Amruta Kale and Xiaogang Ma

1 Introduction 357 2 Overview of relevant concepts in provenance, XAI, and TAI 359 3 Need for provenance in earth AI 363 4 Technical approaches 365 5 Discussion 372 6 Conclusions 374 7 Assignment 374 8 Open questions 374 Acknowledgments 375 References 375

15. AI ethics for earth sciences 379 Pablo Rivas, Christopher Thompson, Brenda Tafur, Bikram Khanal, Olawale Ayoade, Tonni Das Jui, Korn Sooksatra, Javier Orduz, and Gissella Bejarano

1 Introduction 379 2 Prior work 380 3 Addressing ethical concerns during system design 380 4 Considerating algorithmic bias 382 5 Designing ethically driven automated systems 384 6 Assessing the impact of autonomous and intelligent systems on human well-being 386 7 Developing AI literacy, skills, and readiness 387 8 On documenting datasets for AI 388 9 On documenting AI models 390 10 Carbon emissions of earth AI models 391 11 Conclusions 393 12 Assignments 393 13 Open questions 394 References 394

Index 397

This page intentionally left blank

Contributors Sahara Ali Department of Information Systems, University of Maryland, Baltimore County, MD, United States

Edwin Goh Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, United States

Ahmed Alnuaim (Alnaim) Department of Geography and Geoinformation Science; Center for Spatial Information Science and Systems, George Mason University, Fairfax, VA, United States

Weiming Hu Department of Geography and Institute for Computational and Data Sciences, The Pennsylvania State University, University Park, State College, PA; Center for Western Weather and Water Extremes, Scripps Institution of Oceanography, University of California, San Diego, CA, United States

Olawale Ayoade Department of Physics, College of Arts and Sciences, Baylor University, Waco, TX, United States

Yiyi Huang Department of Information Systems, University of Maryland, Baltimore County, MD, United States

Jayme Garcia Arnal Barbedo Embrapa Digital Agriculture, Campinas, Brazil Colin M. Beier Department of Sustainable Resources Management, State University of New York College of Environmental Science and Forestry, Syracuse, NY, United States

Didarul Islam Department of Geography and Geoinformation Science; Center for Spatial Information Science and Systems, George Mason University, Fairfax, VA, United States

Gissella Bejarano Department of Computer Science, Marist College, Poughkeepsie, NY, United States

Aji John eScience Institute; Department of Biology, University of Washington, Seattle, WA, United States

Andrew Bennett University of Arizona, Tucson, AZ, United States

Lucas K. Johnson Graduate Program in Environmental Science, State University of New York College of Environmental Science and Forestry, Syracuse, NY, United States

Guido Cervone Department of Geography and Institute for Computational and Data Sciences; Department of Meteorology and Atmospheric Science, The Pennsylvania State University, University Park, State College, PA, United States

Tonni Das Jui Department of Computer Science, School of Engineering and Computer Science, Baylor University, Waco, TX, United States

Nicoleta Cristea Department of Civil and Environmental Engineering; eScience Institute, University of Washington, Seattle, WA, United States

Amruta Kale Department of Computer Science, University of Idaho, Moscow, ID, United States

Annie Didier Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, United States

Bikram Khanal Department of Computer Science, School of Engineering and Computer Science, Baylor University, Waco, TX, United States

Geetha Satya Mounika Ganji KBRWyle, Inc., Sioux Falls, SD, United States

Wai Hang Chow Lin KBRWyle, Inc., Sioux Falls, SD, United States

ix

x

Contributors

Xiaogang Ma Department of Computer Science, University of Idaho, Moscow, ID, United States Michael J. Mahoney Graduate Program in Environmental Science, State University of New York College of Environmental Science and Forestry, Syracuse, NY, United States Arif Masrur Pennsylvania State University, University Park, PA, United States Javier Orduz Department of Mathematics and Computer Science, Earlham College, Richmond, IN, United States Nurul Rafi Department of Computer Science, School of Engineering and Computer Science, Baylor University, Waco, TX, United States Pablo Rivas Center for Standards and Ethics in Artificial Intelligence; Department of Computer Science, School of Engineering and Computer Science, Baylor University, Waco, TX, United States

Information Science and Systems, George Mason University, Fairfax, VA, United States Brenda Tafur Ingenieria de Sistemas, Facultad de Ingenieria y Arquitectura, Universidad de Lima, Provincia y Departamento de Lima, Lima, Peru Christopher Thompson Department of Computer Science, School of Engineering and Computer Science, Baylor University, Waco, TX, United States Jianwu Wang Department of Information Systems, University of Maryland, Baltimore County, MD, United States Jinbo Wang Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, United States Kehan Yang Department of Civil and Environmental Engineering; eScience Institute, University of Washington, Seattle, WA, United States

Korn Sooksatra Department of Computer Science, School of Engineering and Computer Science, Baylor University, Waco, TX, United States

George Young Department of Meteorology and Atmospheric Science, The Pennsylvania State University, University Park, State College, PA, United States

Ziheng Sun Department of Geography and Geoinformation Science; Center for Spatial

Manzhu Yu Pennsylvania State University, University Park, PA, United States

C H A P T E R

1 Introduction of artificial intelligence in Earth sciences Ziheng Suna,b and Nicoleta Cristeac,d a

Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA, United States bCenter for Spatial Information Science and Systems, George Mason University, Fairfax, VA, United States cDepartment of Civil and Environmental Engineering, University of Washington, Seattle, WA, United States deScience Institute, University of Washington, Seattle, WA, United States

1 Background and motivation When it comes to the concept of the Earth system, the first picture on most people’s minds might be earthquakes, because they are the common and direct consequences of Earth system movements. About earthquakes, Dr. Sun has his own story to tell: When I was a kid (in the 1990s), an earthquake hit my hometown at the northeast edge of the Tibet plateau. I remember vividly the ceiling light hanging from a soft line was swinging wildly, and the earth jolted, shivered, and tried to get away from my feet. I attempted to stand up but the shake transmitted to my legs and dysfunctional them completely. I found myself crawling on the ground and couldn’t move a finger. My hands grasped the earth so tight that I could feel every stroke of the quake waves. The quake fades away after several minutes. As a middle schooler, I had no idea how close death can be and how things would go from there. I was fortunate as the ceiling stopped coming down and the brick walls still held up. It is one of the most memorable life-or-death experiences and the feeling of that kind of helplessness is still vivid after twenty years. After I left home for college, several other major earthquakes hit that region repeatedly, including some devastating ones: Wenchuan quake (magnitude-8.0, 2008, der Hilst, 2008) and Yushu quake (magnitude-7.1, 2010, Chen et al., 2010) (as shown in Fig. 1). Tens of thousands of people like you and me died.

Every day natural hazards like earthquakes test the resilience of humanity and cause the same kind of helplessness in many entities from individuals to the metropolitan (Hyndman and Hyndman, 2016). After surviving millions of incidents and having evolved with accumulated knowledge over thousands of years, humankind has become the dominant species on

Artificial Intelligence in Earth Science https://doi.org/10.1016/B978-0-323-91737-7.00003-7

1

Copyright # 2023 Elsevier Inc. All rights reserved.

2

1. Introduction of artificial intelligence in Earth sciences

FIG. 1

Global earthquakes with magnitudes from 1970 to 2014 (Dr. Sun’s hometown region is the purple box). Earthquake data is from ANSS-USGS, and the basemap is Google hybrid.

this planet. The power of the entirety of civilized society makes its own path to impact the Earth’s environment. We have built skyscrapers, towers, homes, roads, bridges, factories, ships, planes, railways, satellites, power grids, scales, to improve our lives and connect billions of people anywhere anytime. However, all the infrastructure and the supplies supporting our civilization totally depend on the stability and smooth cycles of the Earth systems. Any disturbance or disruption of the Earth cycles could have devastating effects. Recent trends in the climate and the increase of radical events have caused tremendous amounts of damage and raised concerns among Earth scientists (Houghton et al., 1990). Although the threats from nature have been greatly reduced after robust solid buildings were built to endure large earthquakes and weather forecasting systems were developed to prevent casualties from hurricanes, snow, floods, landslides, and tsunami, we need to remain vigilant as the dangers still persist and even increase in intensity and frequency. Individual natural disasters may be deadly but their impacts are relatively small scale or affect one or two regions which could recover. If we lay eyes on the long-term sustainability of the whole human society in the next hundred or even a thousand years, there are three grand challenges that are global and likely nonrecoverable: food security, climate change, and energy resources (Gregory et al., 2005). It is hard for most people today to believe but it is a fact that hunger is still happening around the world. Food shortage or insecurity has close relation to poverty but in a bigger picture, is the direct result of the changes in Earth systems. Enduring extreme drought, deforestation, emissions and chemical soil pollutants, are all contributing together, directly or indirectly impacting our food supply chain from upstream

1 Background and motivation

3

agriculture to food industries. As the world population foreseeably surpasses ten billion during the next decade—and some forecasts predicting twenty billion by 2050—the strain on our food systems will become a critical problem as today’s agriculture can’t deliver enough food to meet that need. Disruptive climate events have been more frequent in recent years, likely caused by the increasing emissions funneled by the globalization of industries and commercial activities. Worsening air quality could be even more harmful than food shortage. Many forms of atmospheric pollution are constantly emitted and affect public health and the environment. Global circulation patterns (Oort, 1983) can transport the pollution rapidly around the world. Humanity has fought numerous wars to acquire natural resources, from land to water, from minerals to oil, from forest to grass. We continuously exploited our planet for resources to support our development. However, the natural resources that can produce energy are limited and will eventually be used up due to the speed of consumption being much faster than its creation. Besides meeting the food problem of the exploding population, meeting energy needs is also a long-term challenge that we or future generations will have to face sooner or later. The ongoing energy crisis and rocketing high oil and gas prices could cause global conflicts over energy in the next few years. Science and technology play significant roles in tackling both the individual natural hazards and the long-term survival challenges of humankind. For instance, learning about atmospheric pollution would be valuable knowledge to prevent further deterioration of our air quality. Scientists made great breakthroughs on understanding the mechanisms of natural hazards, and built computer models to simulate and predict future events and their impacts. Based on aquired knowledge, new technologies have been invented and used to mitigate damages and sustain our society: felling of trees to control wildfire spread, use of antiquake designs to increase home resilience to earthquakes, and application of smart irrigation schedules to maximize crop yields while saving water, among others. In the past decades, science and technology have become the backbone of most solutions to our most urgent crises and will continue to be our most valuable tools in the next chapter of human development. However, the development of science and technology is not straightforward and is always accompanied by many myths, assumptions, theories, and technical limitations throughout its development. The journey of scientific research is not always successful and has gone through ups and downs. Take climate prediction models as an example. There are so many chemical reactions to consider and not all of them are revealed and studied in the existing research. Scientists have conducted countless experiments and improvements to push the models to be more accurate and closer to reality and have achieved incredible progress. However, due to the model complexity, the computing of each model run takes huge amounts of resources and in turn could negatively affect the environment resulting from the energy consumption. In addition, even with high performance computing resources, the reliability of the model results are still in doubt, especially those long-term prediction results. The short-term prediction, or forecasts within 3 months, can be constantly corrected and improved using the near-real-time collected observations from large-scale sensor networks. For long-term forecasts, especially for those catastrophic events in the coming years, the capability of the current models still falls short of our expectation. Scientists are actively seeking for new environment-friendly solutions to advance model simulation and prediction of atmospheres and all the other spheres in the Earth system (Alley et al., 2019). Artificial Intelligence (AI) technologies have been aggressively experimented with in Earth system sciences and attempted to solve these urgent problems and provide a solution for those ultimate challenges of humanity.

4

1. Introduction of artificial intelligence in Earth sciences

2 AI evolution in Earth sciences AI is not a new technology, and is historically closely connected to other domains such as mathematics, astronomy, and physics. The basic concepts and algorithms were invented a long time ago. Neural networks were first invented in 1944 by Warren McCullough and Walter Pitts (Hardesty, 2017). The optimization backbone algorithm backpropagation was first developed in the 1970s (the basics were formed in the 1960s or earlier and there is an argument about who is its real inventor—Synced, 2020). Stochastic gradient descent used in the back propagation was first described by Robbins and Monro in A Stochastic Approximation Method (Robbins and Monro, 1951), and later Kiefer and Wolfowitz introduced the machine learning variant, Stochastic Gradient Descent, in their paper Stochastic Estimation of the Maximum of a Regression Function (Kiefer and Wolfowitz, 1952). Convolutional neural networks were first introduced in the 1980s by Yann LeCun when he was a postdoctoral researcher. The recurrent neural network was designed in 1986. Long short term memory, one of the most successful industrial workhorses for sequence data processing like speech recognition and translation, was proposed in 1997 by Sepp Hochreiter and J€ urgen Schmidhuber. However, in 1980–1990s, most of the neural network models were limited by the computing power and memory volume back then, and it took a very long time to train the model and the cost was too high and almost intolerable for individual scientists. AI research in Earth sciences also faces the same challenge and it went through another AI winter. After entering the 2010s, with the rapid development of computer hardware and according to Moore’s Law (the number of transistors on a dense integrated circuit will double in about every two years), training AI models with more layers has become realistic. The famous ImageNet competition (Deng et al., 2009), an annual global AI competition event hosted by researchers from Stanford, has shown us that the scale of the model is one of the keys to unlock the real power of AI to pursue advanced intelligence. Deep learning has become one of the most popular research areas and could almost be seen in every branch of Earth sciences today. The idea behind deep learning is based on adding more hidden layers to the neural network, using a series of methods to control the model regularization and balance between overfitting and underfitting. In every iteration, the difference is calculated and the back propagated (or back optimized by another algorithm) to adjust the weights on the connections between the neurons which could be located on two sequential layers or two nonsequential layers (ResNet, U-Net). Numerous networks deal with specific types of datasets. Convolutional neural networks are one of the most widely used models to learn patterns from images or n-dimensional gridded data, a common task in Earth system sciences. Most powerful neural networks of today are larger than those of even just a few years ago, such as GPT-3 (Generative Pre-trained Transformer 3) (Brown et al., 2020) which has 96 layers with about 175 billion trainable parameters. Training the models takes a huge amount of computing hours and specialized hardware like GPU or TPU for greater benefits of parallelism to reduce overall time costs. Training big models on Earth scientific datasets could easily go up to hundreds of Terabytes or even Petabyte level could cause challenging problems such as running out of memory and or excessively long computation times. The development of AI in Earth sciences is still at an early stage. Despite some challenges, researchers have managed to peek at the potential of AI when applied to solve complex

2 AI evolution in Earth sciences

5

problems such as weather and climate prediction, or land cover land use mapping, wildfire detection, ocean eddy detection, earthquake signal capture, hurricane trajectory forecasting, and others (Sun et al., 2022). AI has shown its excellence in shedding light on use cases where human data analysts have trouble interpreting the information. At present scientists rely on numerical physics-based models which integrate sophisticated mathematical equations to calculate the desired results given certain conditions and assumptions. For example, given the temperature, humidity, precipitation, wind direction, of the whole region, the model will use them as initial conditions and simulate the situation’s evolution to deliver a prediction. It is more intuitive to tune the model parameters as we have some knowledge about their variations and impacts on internal mechanisms. The AI model provides a solution which appears to be simpler than investing too much time in tuning the traditional model to fit Earth observation and data collection. Building AI models doesn’t require the modelers to have superb mathematical background and deep understanding of those equations. Once the hardware and software environment is set up, the models will learn patterns from the data perspective. The drawbacks are also obvious: AI can be easily influenced by the quality of the input data, and more vulnerable to noise in the training data. Interpreting the results could become difficult, which is one of the main reasons the credibility and reliability of AI models are constantly questioned. There are few use cases of AI being operationally applied in Earth data product generation and application at present. Most existing use cases focus on land cover mapping. For example, Esri Inc. contracted Impact Observatory to generate global annual land cover maps from 2017 to 2021 at 10 m resolution using AI to train on ESA (European Space Agency) Sentinel-2’s six spectral bands. They used billions of human labeled image pixels to complete the training of an AI land classification model of 9 classes including vegetation types, bare surface, water, cropland and built areas. The efforts behind the scene are massive and take a lot of human hours to create those labels for training. The cost is highly expensive depending on the scope of the problem and target application extent. For example, creating a global agriculture map at 10 m resolution will probably take ten times or even more effort than the global land cover map with nine classes, because the crops have way more classes and need more expertise for the human labellers to distinguish the crops from satellite images. Some crop types are not recognizable from satellite images and need ground survey instead which will drive the overall cost even higher. The same problem is especially true in other domains in geosciences as well, such as geological mapping which requires the labelers to annotate accurate labels on rocks and soil to match with the continuous unmanned observations such as the satellite or drone survey images to create reliable geology maps. Many researchers are looking for alternative solutions such as using crowdsourcing or citizen scientists to help create those labels. However, the existing efforts and the datasets resulting from those activities only covers a tiny portion of all the training labels needed to address more critical problems in Earth sciences. The bottleneck is the availability of training label datasets that can be used to correspond with the huge datasets provided by the continuous observing sensors in places. Even when labeled datasets are available, there is still a huge amount of work to transform them into AI-ready data format, such as the decades of work on plants and animals whose records are written in thousands of books and reports. Digitalizing them and turning them into structured, unified, and analysis-ready form would also be very challenging.

6

1. Introduction of artificial intelligence in Earth sciences

3 Latest developments and challenges AI has been attracting attention for years and exciting breakthroughs happened across multiple Earth science domains. Geoscientists have made a lot of attempts to bring in AI technologies to address challenging problems in the past few decades. Much progress has been achieved across domains, while a series of questions and challenges still remain. In seismology, AI helped to pick up early earthquake and volcano seismic signals from the giant set of wave signals and tell the differences between valid signals and noises. The use of AI can be found in all kinds of research cases: earthquake detection and phase picking, earthquake early warning, ground motion prediction, seismic tomography, and earthquake geodesy. LANL held a high-profile competition on Kaggle for earthquake prediction and attracted many machine learning practitioners (Johnson et al., 2021). Many ML algorithms, such as LSTM, XGBoost, Random Forest, and genetic programming, have been widely experimented on many datasets. AI-enhanced approaches can detect earthquake-related signals and early signs from the noises and distinguish them by accurately and objectively determining the noise thresholds. The traditional way requires experts to visually filter and identify those earthquake signals using phase association methods, which undoubtedly delays the data processing and slows down the information extraction phases. These traditional methods have challenges in detecting small seismic events that might be important and decisive in predicting the future big events. AI methods are pursued mainly because (1) they are expected to extract more patterns and relationships between collected datasets and accumulated prior knowledge; (2) they can significantly improve computational efficiency of seismic model processing. However, there are two major challenges in processing seismic data to detect or predict hazard events such as earthquakes and volcanoes ( Jiao and Alavi, 2020): (1) there are massive amounts of noises in seismic data; (2) there are many undetected events (e.g., small earthquakes, or earthquakes in remote areas with less sensor coverage), which are likely to be excluded from training and result in biases and underfitting of AI models. In land cover land use mapping, AI has been extensively used to generate higher spatial– temporal resolution maps (Sun et al., 2019). AI can learn from the collected tremendous amount of human labels in the past decades and use the hidden patterns, some of which are yet to be discovered by scientists, to classify the newly captured observations (satellite images), or to map the land cover in a near-real-time manner. For instance, the living atlas project (Wright et al., 2022) generates annual 9-class global land cover maps using billions of human-labeled samples. There are several main driving factors behind the AI efficiency in this domain that uniquely stands out and is more generally welcomed and accepted by the community. Earth observing satellites have captured petabytes of datasets in the past five decades and will continue to increase at an unprecedented rate after commercial companies join in the market. Most imagery datasets collected by government agencies are openly available. The science outreach teams within NASA, NOAA, USGS, and ESA have devoted significant effort to make those datasets easily accessible and usable via a lot of websites, software tools, standardized web services, and programming libraries. Along with the increase in the number of satellites, the spatial–temporal-spectral resolution of the images has dramatically improved. The large objects on the Earth surface such as forests, lakes, mountains, rivers, icebergs, etc., can be clearly seen and analyzed visually by experts. The availability and high

3 Latest developments and challenges

7

resolution of satellite images makes it very easy to create labels about general land cover classes without having a field survey. That is why creating billions of human labeled samples is practical. More importantly, many research groups made their training labels open to avoid huge amounts of duplicated labeling efforts by the community, allowing researchers to worry less about time-consuming labeling work and focus more on AI model improvement. Also, satellite image classification is very similar to the classic AI/ML tasks, which were originally designed for image classification, like images of dogs and cats found in ImageNet competitions. It is now easy to adapt the experiment setup, with expanded classification goals. Relatively speaking, due to the availability of satellite images and training labels, most AI experiments in land cover classification can be redone by a new group in a short period. In other words, such research has a lower barrier of entry for beginners of either AI or geosciences. In hydrology, AI is starting to takle problems related to ambiguity and uncertainty in hydrologic predictions or scaling problems of regional models. Hydrology-related scientific challenges are further complicated by the noisy, complex, and dynamic nature of variables such as groundwater, evapotranspiration, discharge, sediment transport, and their interaction with soil and climate. Time series prediction models have been widely applied to address hydrology specific questions, e.g., long short term memory (LSTM) approaches. The models can find patterns in the existing datasets that scientists have collected for decades. Nonlinear models may perform better than linear models. ML has been successfully used in flood forecasting, precipitation, water quality, and groundwater estimation (Sun et al., 2022). However, there are a number of pitfalls with the current AI techniques. Not all the Earth scientific challenges that are in need for answers have rich data suitable to train an unbiased AI model, and the current AI techniques are not well equipped to deal with data-scarce scenarios. For problems with limited datasets, AI cannot infer the patterns but instead may learn the patterns and noises in the restricted samples, which may not align with the existing knowledge. At the same time, the massive amount of human-labeled samples required are another bottleneck problem preventing AI from reaching the practical level of application. For tasks like simple hierarchy and land cover mapping, it is possible to quickly create lots of labels just based on the satellite images. However, for complex classification tasks such as agricultural crop mapping or detailed vegetation classification, it may not be possible to create labels simply based on remote sensing data to avoid doing field surveys. AI needs the input and output data to refer to the same time period to create a good match. Using data from various sources may pose difficulty in creating valid matches for AI. For instance, the history of satellite observations can be traced back to the 1970s and the continuity of data coverage and quality varies. The ground surveys could be conducted during periods which might fall out of the observation window of the satellite images. For example, the ground occurs on April 10, but cloudless satellite images are only available on Mar 20 and May 1. Using the data on either dates will result in the models that may represent inadequate patterns, missing the growing stage. High quality training data pairs are critically important but not guaranteed by data providers due to the limitation of nature (weather, clouds) and observation capabilities (revisiting period, spatial resolution, swath width, etc.).

8

1. Introduction of artificial intelligence in Earth sciences

4 Short-term and long-term expectations for AI AI is not a magic tool. It has many limitations and as a result experiences many restrictions and other challenges during operational use. Classic AI includes tasks such as image object recognition, car plate recognition, human face matching, tabular data prediction, but we still have to treat AI as a tool instead of an intelligent entity. Most AI systems can be considered as an improved version of the current rule-based expert systems but still have a long way to reach complete self-learning intelligence, or artificial general intelligence (AGI) (Goertzel, 2007). This section will describe the general goals (or our expectations) of AI research progress in Earth system sciences from both short-term and long-term perspectives. The general short-term goals for AI practitioners in the next two decades will likely be: (1) improve models’ accuracy to the expected level; (2) stabilize models’ performance over space and time; (3) operationalize AI models in real-world practice (e.g., natural hazard responding). The first goal is obviously that AI first needs to be accurate enough before it can be considered usable. Current AI models are still struggling to reproduce the expected accuracy when confronted with new data, especially data representing patterns or distributions that are not directly included in the original training datasets. A common challenge for AI models is to address the overfitting and underfitting. The current proposed approaches such as cross validation (e.g., GridSearch and RandomizedSearch in scikit-learn) usually require multiple attempts to find the best hyperparameter configuration and are not very efficient. AutoML automates the model selection, parameter tuning, and model comparison, which significantly reduces the requirements for manual tuning. The long-term expectations of Earth AI in this century are quite exciting to imagine and vary by domain. The ideal breakthrough will likely happen on perfecting model reliability and consistency. A model could stably produce accurate predictions without any human intervention, or a model can identify the noise in the inputted signals and self-adjust to adapt to the abrupt changes in the real world. People will no longer have to write multiple scenario-dependent rules to restrict model behaviors. Models’ spatial–temporal capability will be greatly enhanced to enable humans to enter a previously untouched territory. Take hurricane prediction as an example. The current prediction of the trajectory may evolve to a new service with very high accuracy about its path, wind speed, and precipitation in each county, or even at neighborhood level. Homeowners could receive warning notices days ahead to take actions to protect their properties. First responders could quickly identify the most damaged areas and allocate resources accordingly in the most efficient manner. Some daring ideas could envision how to turn a disaster into a potentially constructive event, e.g., developing new airborne (floating) electricity generators and placing them on the path of hurricanes to collect the tremendous power of nature. We could imagine that in the next century, human beings will very likely turn the situation from passively reacting to natural disasters into actively predicting and eventually taking advantage of the extreme natural events. AI will definitely be one of the most fundamental tools to bring that beautiful vision into reality.

6 Practical AI: From prototype to operation

9

5 Future developments and how to adapt The recent developments in AI are rapid and dramatic and the pace of progress is still expected to accelerate further in the next decade. There are many development directions in the AI world, and all the users from Earth sciences need to prepare to adjust their strategy and adapt to the upcoming changes. Research groups or AI companies will upgrade their AI models to more powerful and cutting edge models. Models can be easily switched as most AI modules in operational systems are self-contained and treated as a black box. One of the challenging tasks during this transition is to evaluate the new changes in stability, reliability, accuracy, noise resilience, and time costs. We have witnessed that the benefits brought by new models are often overshadowed by the high cost, slow turnaround, high vulnerability to noise, and unexplainable results. To make the model change easier and smoother, the full stack AI workflow needs to stay consistent over time, with the data preparation and postprocessing steps standardized and the goals of AI models kept them unchanged. A comprehensive evaluation of the new model must be conducted to justify its replacement of the old model. The development of software and services will be limited without the development of the underlying hardware. Moore’s Law (the number of transistors on a square inch of integrated circuit will double every year) has been validated over the past fifty years but recently the chip progress has seen a slowdown. We may find a way to continue the speedy improvements in the world of computing. The increase in computing power we experienced in the first AI boom (1980–1987) has inspired and supported the ongoing AI evolution. Foreseeably, more powerful computers, or not even digital computers (e.g., analog chips, quantum computers) will be introduced and experimented to explore new breakthroughs. It will undoubtedly bring dramatic changes to the existing AI pipelines, and some profound hardware replacement will cause the infrastructure and technology stack to be rethought and probably redesigned. The trends of AI are not only toward more accurate and complicated, but will likely become more portable and lightweight. Edge AI is a central concept that shifts the AI model running from data centers to personal devices like smartphones, smart cars, and robots, and makes them act as decentralized agents without communicating with external web services. In contrast to cloud AI which has everything running inside cloud data centers and only has an API exposed for users, edge AI can facilitate individual devices to run AI models. The current strategy is to deploy a pretrained model to the edge devices to consume and produce results. There are also researchers trying to use edge devices to train AI models in place, but this activity requires the AI models to be more lightweight, with fewer trainable parameters, and with a quick turnaround.

6 Practical AI: From prototype to operation It is always easier said than done. There is a long way from prototype to production, and many AI researchers halt at the prototype stage and their achievements never reach production. It could be quite disappointed to learn about the real ratio of production conversion

10

1. Introduction of artificial intelligence in Earth sciences

across the entire AI research community. To increase the success rate, a series of guidelines and actions must be taken throughout the entire lifecycle of AI research to prepare them for real-world adoption. The real-world application scenario is very chaotic and disorganized, which results in the violent collision of “the world as imagined” with “the world as it is.” There are many roadblocks that need to be removed to land a new model in a production environment. Here is a list of some of the major obstacles and general guidelines for solving them. The first common issue is data bias. Most AI models are trained on a subset of the real dataset and the pattern representation always has biases toward the majority across the various classes. Take satellite based land cover classification as an example. If a region is composed of 90% corn fields and 10% soybean fields, the model will lean toward corn because classifying a pixel as corn will have less penalty to the overall accuracy. Many solutions have been proposed to tackle the bias problem like the dropout in neural network, the class weight in Scikit-learn, intentionally augmenting the minority classes in training data, etc. There is no standard answer to this challenge and ideally it would be considered together with the application scenarios and the pattern similarities between training data and the real data to judge if the model is suitable for the use case. The second challenge is explicitly specifying the spatiotemporal limitation of the trained AI models. Production software has to specify its application terms and scope ahead. The patterns learnt by AI have restrictions in its applicable spatial or temporal extent as its training dataset is usually regional and for a certain period of time. Using the example of the hurricane trajectory prediction, the AI model is could be initially trained using the dataset in one region (tropical cyclones), with a fixed spatial extent. Using this model in another region and another season will likely have uncertain results, and would be discouraged. These limitations must be explicitly labeled in the delivered package. Consuming and feeding real-world data into AI models is a “long tail” task, which is not the focus of this book, but essential to support the sustainability of AI projects. Many studies will give roughly a similar number about how long data scientists spent on data cleaning and preprocessing—60% to 70% (or more) of all the time spent on AI tasks. Only a fraction of data scientists’ time is spent on ML and science problem solving–related work. There are many AI models coming out every year in the Earth science community but most of them are not used in operation. One of the major reasons is that the connection between the real-world datasets and the model inputs is difficult to maintain. Subsequent users cannot easily reproduce the data preparation step because of the unmentioned preprocessing steps that will cause uncertainty in the models. For example, the filling values and the standardization of meteorological variables. If the missing data is filled with 0 while the new users use 9999 and feed them into the trained ML models, the accuracy will likely be affected (due to normalization or other operation). One solution is that the AI model developers provide the full stack pipeline instead of just a model file. The full-stack pipeline includes both preprocessing standard products (e.g., NASA/NOAA professionally controlled data products), and postprocessing AI model results into information deliverables. However, many existing tutorials directly use some ready-to-use datasets without providing explanations or specific steps on how to prepare those files. Making and sharing packages of a full-stack pipeline of AI workflows is an ideal solution but still a huge challenge to address (Sun et al., 2020). Problems like source code disparity, dependency incompatibility, environment requirements, workflow software varieties,

7 Why do we write this book?

11

FIG. 2 The concept of MLOps (machine learning operations).

provenance metadata, data format here can easily hold scientists back from creating operational AI workflows. Another major challenge is how to make operators better understand and maintain the deployed AI models. Unlike tree-based models (e.g., random forest, extreme gradient boost), most neural network based models, including most of the currently popular deep learning models, are frequently referred to as “black boxes” which means people don’t understand how it really works as all the weights and rules directly come from data. Explainable AI is still a long way to go and requires a whole set of tools and libraries to interpret the model predictions. Making AI explainable and comprehensible is an important step for human users to understand and trust the AI results. To operate an AI model in geoscientific scenarios, such as precipitation forecasting, it requires awareness of the connection between an unlikely prediction and its real causes as there are dozens of input variables and thousands of rules and weights built inside AI models. Overall, making an AI prototype model operational is a difficult task. Accuracy is important in the research phase, but not the only thing that operational AI must care about. Computing cost per prediction, turnaround time, simplicity, stability, reliability, trustworthiness, stress performance, explainability, and difficulty to maintain all play a role. This book will reveal and discuss some of the existing solutions to these problems scattered in the chapters. For example, based on the industrial experiences with ML in practice, MLOps is a new concept which reflects the general stages of developing and deploying ML models from prototype into production (Fig. 2). However, it is certain that there is no one-size-fits-all solution. We hope this book can be a positive move toward a better direction for AI for Earth research, focusing on the key issues mentioned above to make AI operational in Earth sciences sooner and better.

7 Why do we write this book? AI is gaining increased interest based on the popularity of the AI-relevant meetings, sessions, and webinars across geoscientific communities. However, after interacting with

12

1. Introduction of artificial intelligence in Earth sciences

colleagus at various conferences, we found that although a lot of scientists are interested in investing time in AI, there are no instantly useful or straightforward materials to quickly get them started. Most existing tutorials, papers, and books are assuming the readers already have some knowledge about AI and its python ecosystem, which is not always true for geoscientists. Geoscientists have various backgrounds including mathematical modeling, numerical modeling, atmospheric chemistry, geophysics, seismology, mineralogy, etc. One of the main reasons that most scientists feel AI is difficult is because it is challenging to place the AI selflearning idea into their existing physics-based knowledge framework that may conflict with AI. The connections between AI and the geosciences must be laid out and explained in detail to allow scientists to digest the technology from their perspective. Thanks to plenty of open tutorials online, scientists have many opportunities to access and learn about AI techniques by themselves. When people have a general understanding about how AI models work, they will have to convert their existing datasets to AI ready format. That is much more difficult than it appears to be, and takes a surprising amount of time. The merging, project transformation, batching, filling values, reshuffling, scaling, and cleaning takes a lot of effort and the procedures could be different for each dataset or use case which only serves to exacerbate the issue. The major differences between Earth AI tutorials and general AI tutorials exist in the data preparation and post processing steps. There are very few tutorials that can link all the steps and prepare a comprehensive guide on composing full-stack AI workflow. This book is written to fill in the missing piece and offer a detailed end-to-end guide for geoscientists to make AI “work” in their research. We hope this book can help researchers gain a clearer picture of the AI landscape and how to merge AI techniques into geoscientific problem solving. In addition to introducing the ML models, this book will dive into the details of the upstream and downstream steps to unfold the overlooked details which are essential to support real-world production applications. In contrast to the innovation-oriented research books which assume the readers are AI experts, this book tends to be prepared for geoscientists with little AI knowledge or programming techniques. The narrative is written from the perspective of geoscientists rather than computer scientists and emphasizes on how to smooth the data processing pipeline, achieve equivalent performance as the benchmark, and maintain the stable workflow for updating data sources. For example, research papers may focus on how to improve the accuracy of the current weather forecasting models, but give little details about how to process the weather station historical data. This book will aim at addressing those data retrieval, cleaning, and preprocessing steps in equal detail as the AI models so our readers can obtain knowledge on holistic workflow rather than just AI model embedding. Different from conventional modeling, the success run of an AI model is not the sign of success of the AI projects. The work that needs to be done after the first success run can easily overweigh the early investment. These hidden blocks will be mentioned and discussed throughout the book.

8 Learning goals and tasks The overall goal is to learn how to use AI techniques in solving geoscientific problems. Due to the wild variety of geoscientific problems, it is difficult to summarize a one-fits-all solution

13

8 Learning goals and tasks

TABLE 1

Learning objectives.

Objectives

Stage

Level

Processing imagery data to ML-ready format

Preprocessing

Easy

Processing time series data to ML-ready format

Preprocessing

Easy

ML model creation and setup

ML model

Easy

Balancing between overfitting and under-fitting

ML model

Easy

Cross validating the trained models

ML model

Easy

ML model comparison and selection

ML model

Difficult

ML model interpretation using explainable AI skills

ML model

Expert

ML hyperparameter tuning and optimization

ML model

Expert

Sensitivity analysis

ML model

Expert

Turn ML model results into geoscience data products

Postprocessing

Easy

Accuracy evaluation

Postprocessing

Easy

Making trained ML model run predictions

Production

Easy

Building automated ML workflows

Production

Easy

to address the entire field. The rest of the chapters will focus on various problems spread across different spheres of the Earth system. Each chapter is an independent tutorial containing the full workflow to address a specific problem. For each chapter, the general learning objectives are quite similar and listed in Table 1. We have categorized the objectives into four stages according to the AI pipelines: preprocessing, ML modeling, postprocessing, and production (e.g., deployment and maintenance). The preprocessing section differs mainly according to the data type: imagery, tabular, singular or time series. Each data type has a different strategy and libraries to tackle. We also label the difficulty level of each task for beginners. We will explain each task in a plainly understandable way. The modeling section includes selecting and creating ML models, splitting training and testing sub datasets, cross validating model performance to rule out coincidences, feature selection and engineering, hyperparameter tuning and optimization, sensitivity analysis against each input variable, trained model interpretation, etc. The post processing stage basically focuses on turning ML model results into data products or actionable information, evaluates the accuracy and sends the results back to ML model train-test-validate cycles for further improvement. The production stage means deploying the trained models into operation for real-world predictions. Most actions will be automated in routine operation and human supervisors will just need to monitor the pipelines and address the errors in a timely manner. Since most AI research is not in production yet, this book will collect as many experiences as possible that are practically tested and introduce the recommended approaches or tools. Readers of this book can learn from these experiences but also keep an open mind for upcoming techniques or new tools to replace them in the near future.

14

1. Introduction of artificial intelligence in Earth sciences

9 Assignments & open questions Every chapter will have a section of assignments which are achievable within a relatively short time, and a section of open questions which have no good answers at present. This serves to give students some space to exercise their newly learnt knowledge and skills. Assignments are mostly choice questions and programming tasks about the ML knowledge introduced in the chapter. They are designed for students to test how well they perceive the AI models and the entire workflow well. The answers will be posted at the end of the book. Besides assignments, as AI is an evolving research topic, there are many open questions that remain unaddressed. It would be very enlightening and inspiring for readers to think about them and devote research efforts to answer or address them. Some of the open questions are common for all Earth AI use cases and so challenging that even one full Ph.D. study probably still won’t address them. Each challenge needs to be studied along with specific datasets and use cases so the questions after each chapter will contain the scenario settings similar to the context of the chapter use case. It could be the source of ideas for future research that might bring the next big breakthrough to the Earth AI development, even to the broad scope of the entire AI kingdom. The questions could be about the problems that are extended to a larger scope than the topic of the chapter like: how to identify hidden earthquake signals? How to monitor agricultural drought using machine learning? How to detect wildfires in an early stage? How to derive high spatial–temporal resolution maps about water, forests, crops, and snow? Answering these questions is one of the big AI dreams that could greatly enhance our capability to protect us from natural hazards like earthquakes, landslides, hurricanes, avalanches, tornados, floods, or volcanoes. The interaction with our living environments will be more gentle and harmonized via the use of AI to make Earth a more sustainable and better planet.

References Alley, R., Kerry, E., Zhang, F., 2019. Advances in weather prediction. Science 363 (6425), 342–344. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., et al., 2020. Language models are few-shot learners. Adv. Neural Inf. Proces. Syst. 33, 1877–1901. Chen, L.C., Wang, H., Ran, Y.K., Sun, X.Z., GuiWu, S., Wang, J., Tan, X.B., Li, Z.M., Zhang, X.Q., 2010. The M S7.1 Yushu earthquake surface rupture and large historical earthquakes on the Garz^e-Yushu fault. Chin. Sci. Bull. 55 (31), 3504–3509. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L., 2009. Imagenet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp. 248–255. der Hilst, B.H., 2008. A geological and geophysical context for the Wenchuan earthquake of 12 May 2008, Sichuan, People’s Republic of China. GSA Today 18 (7), 5. Goertzel, B., 2007. In: Pennachin, C. (Ed.), Artificial general intelligence. Vol. 2. Springer, New York. Gregory, P.J., Ingram, J.S.I., Brklacich, M., 2005. Climate change and food security. Philos. Trans. R. Soc. B: Biol. Sci. 360 (1463), 2139–2148. Hardesty, L., 2017. Explained: Neural Networks. MIT News. https://news.mit.edu/2017/explained-neuralnetworks-deep-learning-0414. (Accessed 23 October 2022). Houghton, J.T., Jenkins, G.J., Ephraums, J.J., 1990. Climate change: the IPCC scientific assessment. Am. Sci. 80 (6). Hyndman, D., Hyndman, D., 2016. Natural Hazards and Disasters. Cengage Learning. Jiao, P., Alavi, A.H., 2020. Artificial intelligence in seismology: advent, performance and future trends. Geosci. Front. 11 (3), 739–744.

References

15

Johnson, P., Bertrand, R.-L., Laura, P.-N., Gregory, B., Chris, M., Claudia, H., Addison, H., 2021. Laboratory earthquake forecasting: a machine learning competition. Proc. Natl Acad. Sci. 118 (5). Kiefer, J., Wolfowitz, J., 1952. Stochastic estimation of the maximum of a regression function. Ann. Math. Stat., 462–466. Oort, A.H., 1983. Global atmospheric circulation statistics, 1958–1973. No. 14. US Department of Commerce, National Oceanic and Atmospheric Administration. Robbins, H., Monro, S., 1951. A stochastic approximation method. In: The Annals of Mathematical Statistics, pp. 400–407. Sun, Z., Di, L., Fang, H., 2019. Using long short-term memory recurrent neural network in land cover classification on Landsat and cropland data layer time series. Int. J. Remote Sens. 40 (2), 593–614. Sun, Z., Di, L., Burgess, A., Tullis, J.A., Magill, A.B., 2020. Geoweaver: advanced cyberinfrastructure for managing hybrid geoscientific AI workflows. ISPRS Int. J. Geo Inf. 9 (2), 119. Sun, Z., Sandoval, L., Robert Crystal-Ornelas, S., Mousavi, M., Wang, J., Lin, C., Cristea, N., et al., 2022. A review of earth artificial intelligence. Comput. Geosci., 105034. Synced, 2020. Who Invented Backpropagation? Hinton Says He Didn’t, but His Work Made It Popular. https:// medium.com/syncedreview/who-invented-backpropagation-hinton-says-he-didnt-but-his-work-made-it-popul ar-e0854504d6d1. (Accessed 23 October 2022). Wright, D., Brumby, S., Breyer, S., Fitzgibbon, A., Pisut, D., Statman-Weil, Z., Hannel, M., Mathis, M., Kontgis, C., 2022. Mapping the World at 10 m: A Novel Deep-Learning Land Use Land Cover Product and Beyond. No. EGU22-9012, Copernicus Meetings.

This page intentionally left blank

C H A P T E R

2 Machine learning for snow cover mapping Kehan Yanga,b, Aji Johnb,c, Ziheng Sund, and Nicoleta Cristeaa,b a

Department of Civil and Environmental Engineering, University of Washington, Seattle, WA, United States beScience Institute, University of Washington, Seattle, WA, United States c Department of Biology, University of Washington, Seattle, WA, United States dDepartment of Geography and Geoinformation Science, George Mason University, Fairfax, VA, United States

1 Introduction Snow is an important component in the Earth’s hydrological and energy cycles, and one of the most reflective land cover types on the Earth’s surface with good impacts on tackling climate warming. Snowpack functions as a natural reservoir that stores water in cold months, slowly releasing it into streams, rivers, and lakes, substantially mitigating water stress during the summer. Rising temperature accelerates snowmelt rate and reduces snow-covered area (SCA), leading to lower land surface reflectance. Thus, more solar radiation will likely be absorbed by the land surface, which in turn accelerates climate warming. SCA is also a critical variable for studying land processes, changes in hydrology, impacts on plant phenology, and other associated environmental processes. Therefore, to better understand water availability and the Earth’s energy balance, it is important to know the accurate spatial and temporal distribution of the snowpacks. While mapping SCA using optical satellite imagery is very well established, its implications have been restricted by the tradeoff between spatial and temporal resolution over the past decades. Because mountain snowpack has very high spatial heterogeneity and the SCA can change dramatically in the melting season and over complex terrain, it is necessary to have SCA observations at both high spatial and temporal resolution. In this context, the Planet Labs’ small satellites provide new opportunities for SCA mapping. Planet Labs is a commercial company that launched about 200 satellites, aiming to produce daily coverage of the entire Earth at about 3- to 5-m spatial resolution. All these small satellites can provide

Artificial Intelligence in Earth Science https://doi.org/10.1016/B978-0-323-91737-7.00004-9

17

Copyright # 2023 Elsevier Inc. All rights reserved.

18

2. Machine learning for snow cover mapping

observations of visible (i.e., red, green, and blue), and near-infrared bands, where snow surface usually has a relatively high reflectance compared with other land cover types. The first generation of Plant satellites was launched in 2016. Being relatively new to users, its application in SCA mapping has seen little research. This may be due in part to traditional approaches such as Normalized Difference Snow Index not being applicable, as Planet imagery does not supply a shortwave infrared band. Therefore, here we describe how to use machine learning to accomplish the task of mapping SCA from satellite optical imagery with very high resolution. Specifically, in this chapter, we will demonstrate how we prepare the model inputs, select the best model, tune model hyperparameters, and evaluate model performance. The objectives are to: (1) Demonstrate how to use a machine learning model to map SCA from high-resolution Planet imagery at meter scale. (2) Generate a workflow that can be used repeatedly in future applications. (3) Develop tutorial materials for educational purposes.

2 Machine learning tools and model 2.1 What is “scikit-learn” Scikit-learn (https://scikit-learn.org/stable/) is one of the most powerful and popular python packages designed to facilitate the use of machine learning techniques. It provides a whole set of assisting functions and a comprehensive package of algorithms for classification, regression, and clustering. In this section, we will use the version 0.23.2, and the random forest (RF) algorithm as our main ML model.

2.2 Why do we use random forest RF is a widely used machine learning algorithm originally developed by Leo Breiman (Breiman, 2001) and Adele Cutler (Cutler et al., 2012). It is an ensemble of multiple decision trees that are eventually aggregated to get the most likely result. A decision tree is a supervised machine learning model that is based on a series of questions to categorize samples or make predictions. Each question forms a tree node that splits the data into two different branches. If the answer to the question is “yes,” the decision follows the “yes” branch; otherwise, the decision follows the other path until it reaches a result (leaf node). The quality of the results is evaluated by metrics, such as the mean squared error (MSE), Gini Impurity, and information gain, measuring the differences between RF results and the target values. While the decision tree model is very easy to use, it can be prone to overfitting issues. Using an ensemble of decision trees can largely reduce the overfitting and prediction variance, providing more accurate results. Bagging, also known as bootstrap aggregation, is the most well-known ensemble learning technique, which trains multiple models independently with the randomly selected sample sets. The final prediction is determined by the average (for regression) or majority (for classification) of all the models. RF is an extension of the bagging

3 Data preparation

TABLE 1

Important python packages and their main functions used in the chapter.

Package

Version

Main functionality

numpy

>¼ 1.20.0

Provide mathematical operations for large, multidimensional arrays and matrices

pandas

>¼ 1.4.1

Offer data structures and operations for manipulating numerical tables and time series

rasterio

>¼ 1.2.10

Read and write raster data

matplotlib

>¼ 3.5.1

Data visualization

joblib

>¼ 1.0.1

Save and load machine learning models

fiona

>¼ 1.8.21

Read and write raster data

19

approach, which generates a random subset of both samples and features for each model training. While a decision tree is based on all features to make decisions, the RF algorithm only uses a subset of features, which can reduce the influence of highly correlated but biased features in model prediction. While many other complex machine learning models exist for land cover classification, for example, the convolutional neural network (CNN) (Cannistra et al., 2021), we choose to use RF for snow classification as it has been proven by many other applications as a very robust and versatile technique. Additionally, RF is much less computationally expensive and does not require a graphics processing unit (GPU) to conduct model training, which means we can easily set up the model using our laptop.

2.3 Other supporting packages used in the chapter The python packages, their version, and main functionality are presented in Table 1. The community behind these packages are very actively developing and improving them so the specific function definition and parameters could change over time. However, based on our experiences so far, the changes to scikit-learn function interfaces are minor and most source code can be used across versions.

3 Data preparation The satellite images used in this chapter are provided by the Planet Labs Education and Research Program (https://www.planet.com/markets/nasa/). The program provides limited, noncommercial access to PlanetScope imagery. The product used in this chapter is the Planet orthorectified product “PS2,” which includes four bands, blue (Band 1, 455–515 nm), green (Band 2, 500–590 nm), red (Band 3, 590–670 nm), and near-infrared (Band 4, 780–860 nm), with a spatial resolution of 3.7 m. The first step is to visualize the Planet data by the following code, which reads and displays the Planet image over a region in California.

20

2. Machine learning for snow cover mapping

# import functions and packages from functions_book_chapter_SCA import



dir_raster = './data/planet/20180528_181110_1025_3B_AnalyticMS_SR_clip.tif' planet = rasterio.open(dir_raster).read()/10000 planet = np.where(planet[0,:,:] == 0, np.nan, planet) # the default nan data value is 0, replace with np.nan fig, axs = plt.subplots(2,2,figsize=(15,7.5)) im1 = axs[0,0].imshow(planet[0,:,:],cmap='jet') axs[0,0].set_title("Surface reflectance of blue band", fontsize=16) im2 = axs[0,1].imshow(planet[1,:,:], cmap='jet') axs[0,1].set_title("Surface reflectance of green band", fontsize=16) im3 = axs[1,0].imshow(planet[2,:,:], cmap='jet') axs[1,0].set_title("Surface reflectance of red band", fontsize=16) im4 = axs[1,1].imshow(planet[3,:,:], cmap='jet') axs[1,1].set_title("Surface reflectance of NIR band", fontsize=16) cbar_ax = fig.add_axes([0.95, 0.15, 0.02, 0.7]) fig.colorbar(im1, cax=cbar_ax)

Fig. 1 shows the surface reflectance of the four bands of the PlanetScope image “20180528_181110_1025_3B_AnalyticMS_SR_clip.tif” within a region of the Tuolumne Basin, California. The red and orange colors represent high surface reflectance, while the cyan and dark blue colors represent low surface reflectance. Because snow has a very high reflectance in the visible bands, the red and orange regions are very likely to be covered by snow. Surface reflectance of blue band

0

Surface reflectance of green band

0

1000

1000

2000

2000

3000

3000

4000

4000

0.8

0.7

0.6

0.5 0

1000

2000

3000 4000

5000

6000 7000

0

8000

Surface reflectance of red band

0

1000

2000

2000

3000

3000

4000

4000

2000 3000

4000

5000

6000 7000

8000

Surface reflectance of NIR band

0

1000

1000

0.4

0.3

0.2

0.1

0

1000

2000

3000 4000

5000

6000 7000

8000

0

1000

2000 3000

4000

5000

6000 7000

FIG. 1 Spatial distribution of the surface reflectance of four bands of Planet image.

8000

4 Model parameter tuning

21

In the next step, we carefully draw a few ROIs (i.e., Region of Interest) using QGIS on the image and label each ROI as “1” or “0.” “1” represents “snow,” while “0” represents “no-snow.” The ROIs are labeled based on visual inspection. We only consider the binary classification because the mixing pixel issue is not significant for the Planet image at high spatial resolution (3.7 m), even though it is also not negligible, especially at the edge of snowpacks and no-snow land surface. For demonstration purposes, we only show the binary classification, “snow” and “no-snow” in this chapter. We extract the surface reflectance of all bands of each pixel inside the ROIs and generate an input feature table with 100,000 samples (i.e., “sample_100K.csv”). Here, a pixel is equivalent to a sample. Each sample has four feature columns (Blue, Green, Red, and NIR) and one label column (label). We discuss the influence of sample size on model performance in Section 4. # read model input features and labels data = pd.read_csv('./data/samples/sample_100K.csv', index_col = False) print("Sample dimensions:".format(), data.shape) print(data.head()) X = data[['blue','green','red','nir']] y = data['label']

Output:

Surface reflectance is the fraction of incoming solar radiation that is reflected from the Earth’s surface. It typically ranges from 0 to 1, and 0 means no reflection and 1 means total reflection. The original surface reflectance value extracted from the Planet “PS2” product is scaled by 10,000. Here, we have converted the original values to real surface reflectance as shown in the table. • • • • •

Blue: The surface reflectance of the blue band (455–515 nm). Green: The surface reflectance of the green band (500–590 nm). Red: The surface reflectance of the red band (590–670 nm). Nir: The surface reflectance of the near-infrared band (780–860 nm). Label: “0” is no-snow land pixel, “1” is snow pixel.

4 Model parameter tuning Parameter selection is a very critical step in the training process. In our experiment, there are several important parameters tuned in the process. To get an optimal set of

22

2. Machine learning for snow cover mapping

parameters as well as the optimal sample size, we conducted a series of sensitivity tests on the following four parameters: number of samples, number of features, number of trees, and tree depth. We use the “RandomForestClassifier” function in the “sklearn.ensemble” package (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble. RandomForestClassifier.html?highlight¼randomforestclassifier#sklearn.ensemble. RandomForestClassifier) to define the model. The main parameters to customize the model include “n_estimators,” “max_depth,” and “max_features.” • “n_estimators” is the number of trees in the forest. This parameter controls how many rules the RF model will learn from Planet images. Normally, a larger tree number would result in better model performance, but it also means longer model training time. Additionally, the model performance will stop getting significantly better beyond a certain number of trees. • “max_features” is the size of feature subsets used in splitting tree nodes. It controls how many features each individual rule will consider when looking for the best split. An empirical default value of max_features is “None” for regression problems, which considers all features instead of a random subset, and “sqrt” (representing square root) for classification tasks, which uses a random subset of size sqrt(n_features). • “max_depth” is the maximum depth of the tree. This parameter controls the maximum number of decisions or splits any path in the trees could have. A deeper tree has more splits and captures more information about the data, yet this will cause over-fitted decision trees. Similar to the number of trees, the model performance will stop getting significantly better once tree depth is deep enough. This section will examine the three parameters and their sensitivities to the final results with the goal of choosing the optimal parameter combination for model training.

4.1 Number of samples First, we read the sample data using the code below. To reduce the calculation time, we select 10,000 samples instead of the entire 100,000 samples as the model performance does not show significant improvement when the sample size exceeds 4000. We use k-fold cross-validation (k ¼ 10 meaning the experiments will be repeated 10 times with different splits of training samples) to evaluate model performance with 100 repeat times. # prepare data data = pd.read_csv('./data/samples/sample_10K.csv', index_col = False) print("Sample dimensions:".format(), data.shape) data.head() X = data[['blue','green','red','nir']] y = data['label']

4 Model parameter tuning

23

Output: Sample dimensions: (10000, 5) Theoretically, the performance of RF can improve as the sample size increases. However, a larger sample size would result in higher computational expense. In most times, the model accuracy does not change significantly after the samples reach a certain number. We generated a custom function “get_model_size()" to train RF models with different sample sizes. We proportionately selected samples by changing the “max_samples” parameter. “max_samples” can be set to a float between 0 and 1 to control the percentage of the training dataset to generate the bootstrap sample used in each decision tree training. All custom functions are organized in the “functions_SCA_mapping.py” and available to download in the GitHub repository: https://github.com/earth-artificial-intelligence/earth_ai_book_ materials. To reduce the computational burden, we only experiment with percentages ranging from 0.01 to 0.1 with a 0.01 interval, and the percentages ranging from 0.1 to 1.0 with 0.1 interval in the experiment below. The result shows that the overall model accuracy improves with the increasing sample size when the percentage increases from 0.01 to 0.08 (i.e., 100–800 samples), followed by a very slight improvement with the percentage increases from 0.08 to 0.4 (i.e., 800 to 4 k samples). The changes in model performance are negligible after the sample size exceeds 4000 (i.e., 0.4% of the total samples), meaning that the whole dataset can be very well represented by a subset with 4000 observations, and it is unnecessary to further increase the sample size for this application. Therefore, we will use 4000 as the optimal sample size to train the SCA model. # customize models with different sample sizes models = get_models_size() results, names = list(), list() for name, model in models.items(): # evaluate models using k-fold cross-validation scores = evaluate_model(model, X, y) results.append(scores) names.append(name) # print the mean and standard deviation of models print('>%s Mean Score: %.6f (Score SD: %.6f)' % ('Sample size: ' + str(int(float (name)



10000)), scores.mean(), scores.std()))

# display model performance plt.figure(figsize=(10,5)) plt.boxplot(results, labels=names, showmeans=True) plt.show()

24

2. Machine learning for snow cover mapping

Output:

1000

0.998

0.996

0.994

0.992

0.990 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90

4.2 Number of features The number of features for each split node is perhaps the most important parameter, and is set via “max_features” in scikit-learn. To explore its influence on model accuracy, we tested the integer values of the “max_features” from 1 to 4. The result shows that the median accuracy (green triangle) with max_features ¼ 4 is slightly higher than the other three max_features values, though no significant difference is observed among all four test sets.

25

4 Model parameter tuning

The default of max_features is the square root of the number of input features. The input features have 4 so the default value would be 2. However, the total feature size we have is already very small and the model shows slightly better performance when max_features ¼ 4. As we do not want to lose any information from these four bands, we decided to use 4 as max_features in the final SCA model. # customize models with different model feature sizes models = get_models_feature() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate models using k-fold cross-validation scores = evaluate_model(model, X, y) results.append(scores) names.append(name) # print the mean and standard deviation of models # print('>%s %.3f (%.3f)' % (name, scores.mean(), scores.std())) print('>%s Mean Score: %.6f (Score SD: %.6f)' % ('Features: ' + name, scores.mean(), scores.std())) # display model performance plt.boxplot(results, labels=names, showmeans=True) plt.show()

Output:

1.000 0.999 0.998 0.997 0.996 0.995 0.994 0.993

1

2

3

4

26

2. Machine learning for snow cover mapping

4.3 Number of trees The number of trees is another key parameter and can be set via the “n_estimators” option in scikit-learn, and the default value is 100. The code below explores the effect of the number of trees on the model performance. We set the “n_estimators” to the values between 1 and 1000, with only a few selected tree numbers displayed in the boxplot. Typically, when we increase this parameter, the model performance increases but the improvements will slow down after certain thresholds. In this case, the improvement in model performance is negligible after the number of trees exceeds 10, and thus n_estimators is set to be 10 in our SCA model. # customize models with different tree numbers models = get_models_tree() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate models using k-fold cross-validation scores = evaluate_model(model, X, y) results.append(scores) names.append(name) # print the mean and standard deviation of models # print('>%s %.3f (%.3f)' % (name, scores.mean(), scores.std())) print('>%s Mean Score: %.6f (Score SD: %.6f)' % ('Tree numbers: ' + name, scores.mean (), scores.std())) # display model performance plt.boxplot(results, labels=names, showmeans=True) plt.show()

Output:

27

4 Model parameter tuning

1.000 0.999 0.998 0.997 0.996 0.995 0.994 0.993

1

2

3

4

5

10

20

50

100 200 800 1000

4.4 Tree depth The last parameter is the maximum depth of decision trees and can be set via the “max_depth” argument in the scikit-learn RandomForestClassifier. Ideally, we would like as many trees as possible to improve model performance, so “max_depth” is set to “None” by default, meaning no depth limit. However, a large tree depth will lead to a longer computing time and over-fitting issues, and reducing tree depth could make the ensemble converge earlier. We need a tree depth that is enough to split each node for our samples within an acceptable time limit. The code below explores the effect of maximum tree depth on the model accuracy. The result shows that the model performance does not have a significant difference when the “max_depth” is greater than 8 and stabilized when the tree depth is 10. So, we set “max_depth” to 10 in our final model. # customize models with different tree depths models = get_models_depth() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate models using k-fold cross-validation scores = evaluate_model(model, X, y) results.append(scores) names.append(name) # print the mean and standard deviation of models # print('>%s %.3f (%.3f)' % (name, scores.mean(), scores.std())) print('>%s Mean Score: %.6f (Score SD: %.6f)' % ('Tree Depth: ' + name, scores.mean (), scores.std())) # display model performance plt.figure(figsize=(10,5)) plt.boxplot(results, labels=names, showmeans=True) plt.show()

28

2. Machine learning for snow cover mapping

Output

1.000 0.999 0.998 0.997

0.996 0.995 0.994

0.993 1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

None

5 Model training Based on the tests presented in the last section, we have identified the optimal values of the sample size and three main parameter configurations to set up our model. The next step is to train the RF model for SCA mapping.

5.1 Splitting data into training and testing subsets In this step, the observations are split into a training subset and a testing subset. Usually, we want to use 70%–80% of the data for training, and the remaining 20%–30% of the data for

5 Model training

29

testing. Since we find that the model accuracy is reaching a stable stage after the sample size reaches 4000 (Section 4), we only use 4000 samples, with the remaining 96,000 samples used as an evaluation testing subset to fully test the model accuracy and efficiency. To ensure the training set is unbiased, we use the “train_test_split” function from the “sklearn. model_selection” module to randomly split the sample dataset into training and test subsets. There are several arguments in the train_test_split function. The parameter “test_size” can be float or integer. If it is a float, the value should be between 0.0 and 1.0, representing the proportion of the dataset to include in the test split. If it is an integer, it represents the absolute number of the train samples. The “random_state” parameter is very useful to ensure that the model sampling is reproducible. Here, we assign “1” to “random_state.” It does not matter what the actual “random_state” number is. The important thing is that every time we use “1,” just the same as the first time we make the split, we will get the same splits, which is very useful for the demonstration. # read model input features and labels data = pd.read_csv('./data/samples/sample_100K.csv', index_col = False) print("Sample dimensions:".format(), data.shape) print(data.head()) X = data[['blue','green','red','nir']] y = data['label'] X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.96,random_state=1)

Output:

5.2 Defining the random forest model Now, as we have both the training subset and the optimal parameters, we can run the “RandomForestClassifier()" to build and train our model using the code below: # define the model model = RandomForestClassifier(n_estimators=10, max_depth=10, max_features=4)

To evaluate the model performance, we conduct K-fold cross-validation using “RepeatedStratifiedKFold” and “cross_val_score” from “sklearn.model_selection.” Here, the training subset is randomly split into 10 folds evenly, and each fold is literally used to test the model which is trained by the remaining 9 folds of data. This process is repeated until each of the 10 folds has been used as the testing set. The most common evaluation metric, the “accuracy,” is used to represent the model performance. This whole process is repeated 1000 times to get the final model performance reported as below:

30

2. Machine learning for snow cover mapping

# evaluate the model cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=1000) n_scores

=

cross_val_score(model,

X_train,

y_train,

scoring='accuracy',

cv=cv,

n_jobs=-1) # report model performance print('Mean Score: %.6f (SD: %.6f)' % (n_scores.mean(),n_scores.std())) Output: Mean Score: 0.998049 (SD: 0.002128)

The overall model training accuracy is 0.998 with 0.002 standard deviation over the 1000 repeated cross-validations, indicating that only 0.2% of samples or pixels on average are incorrectly classified. If we look at the distribution of the accuracy values as shown below, most accuracy values are clustered near 1.00 and all values are higher than 0.98, indicating the model training is very precise and robust. # the histogram of the scores n, bins, patches = plt.hist(n_scores, density=True, facecolor='blue', alpha=0.75) plt.text(0.91, 15, r'mean = ' + str(n_scores.mean().round(6)) + ' '+ 'SD = ' + str(n_scores.std().round(6))) plt.xlim(0.9, 1.01) plt.xlabel('Accuracy') plt.ylabel('Probability (%)') plt.grid(True) plt.show()

Output: 300

Probability (%)

250 200 150 100 50 mean = 0.998055 SD = 0.002137 0 0.90

0.92

0.94

0.96

0.98

1.00

Accuracy

5.3 Feature importance It is important to explore each feature’s importance, especially when the feature size is very large, increasing the chance of features being redundant. In our case, it is unnecessary to

5 Model training

31

reduce feature size as we only used four features, but we want to know which band provides the most significant information for the SCA mapping. We use “permutation_importance” in the “sklearn.inspection” module to estimate feature importance. The function of the permutation feature importance is described below: “The permutation feature importance is defined to be the decrease in a model score when a single feature value is randomly shuffled. This procedure breaks the relationship between the feature and the target, thus the drop in the model score is indicative of how much the model depends on the feature. This technique benefits from being model agnostic and can be calculated many times with different permutations of the feature.”

As the RF algorithm trains the model with randomly selected sample subsets and feature subsets, each model run would have a different estimate of feature importance. Thus, to get a robust estimate of feature importance, we repeated the process 1000 times as shown below. The result shows that the blue band provides the most important information, while all the other three bands are less important. model.fit(X_train,y_train) result = permutation_importance(model, X_train, y_train, n_repeats=1000, random_ state=42, n_jobs=2) print('Permutation importance - average:'.format(), X_train.columns) print([round(i, 6) for i in result.importances_mean]) # displace feature importance fig, ax = plt.subplots(figsize=(6,5)) ax.boxplot(result.importances.T) ax.set_title("Permutation Importances", fontsize = 16) ax.set_xticklabels(labels=X_train.columns, fontsize=14) plt.show()

Output:

32

2. Machine learning for snow cover mapping

5.4 Save the model We now have our model trained and evaluated. The model can be saved to a file using the “dump()" function from the “joblib” package. Next time we want to apply this model, we do not have to repeat the processing steps described in the previous section, but just read the model file and reuse it. Next, we will discuss how we load this model and apply it to a satellite image. # save model dir_model = "./models/random_forest_SCA_binary.joblib" joblib.dump(model, dir_model)

6 Model performance evaluation In the previous sections we described how to train a RF model that can accurately predict SCA from Planet imagery with an overall accuracy of 0.998 estimated through the k-fold cross-validation approach. However, we do not know how the model performs outside the training subset. To provide a more comprehensive evaluation, and especially to test the model generalization at other locations, we provide the following two levels of assessment: ● (1) As the model training only needs 4000 samples, we have 96,000 samples remaining for evaluation which are geographically located in the same region as the training samples. We can therefore assume that this testing subset has similar spectrum and physiographic features compared with the training subset, and thus the prediction is expected to be accurate. ● (2) To ensure that we have a robust model, we would also want to know how the model performs over the entire satellite image, and across various land cover types. One challenge we usually face with satellite data is that there are limited high-resolution ground-truth datasets we can use as validation datasets. The same issue exists in this case. Here, we use an airborne lidar-derived SCA dataset as “ground truth” to evaluate the model performance. Specifically, we will use the lidar-derived high-resolution snow depth data at 3-m spatial resolution provided by the Airborne Snow Observatory (ASO, Painter et al., 2016) to derive the “ground truth” SCA. The lidar snow depth dataset does not provide direct SCA information, so we will use a threshold value of 10 cm on the snow depth data to calculate SCA. The ASO regularly conducts airborne lidar surveys for several watersheds in California and Colorado. The aircraft carries a lidar sensor to map snow depth based on the elevation difference between snow-on and snow-off surfaces. The uncertainty of the final snow depth product at the 3-m resolution is unbiased with a root mean squared error (RMSE) of 8 cm (Painter et al., 2016). A previous study by Cannistra et al., 2021 also used 10 cm as a threshold to convert the ASO 3-m snow depth data into a binary SCA map. If the snow depth is deeper than 10 cm, the pixel is classified as a snow pixel; otherwise, the pixel is a no-snow pixel. While using different thresholds will result in slightly different SCA maps, we find that using a threshold between

6 Model performance evaluation

33

8 and 10 cm gives the best agreement between Planet SCA and ASO snow depth derived SCA ( John et al., 2022). For our experiment we apply 10 cm as the threshold to derive SCA from the ASO 3-m snow depth dataset to generate the “ground truth” dataset. The left figure below shows the spatial distribution of snow depth for the study domain, and the right figure shows the distribution of binary SCA. dir_aso = './data/ASO/ASO_3M_SD_USCATE_20180528_clip.tif' raso = rasterio.open(dir_aso,'r').read() raso = np.where(raso[0,:,:] < 0, np.nan, raso) th = 0.1 # using 10 cm threshold raso_binary = np.where(raso >= 0.1, 1, 0) # if the SD is higher than 10 cm, then snow; otherwise, no-snow fig, axs = plt.subplots(1,2,figsize=(16,6)) im1 = axs[0].imshow(raso[0,:,:],cmap = 'Blues',vmin = 0, vmax = 5) axs[0].set_title('ASO snow depth', fontsize=16) fig.colorbar(im1, ax = axs[0], label = 'snow depth (meter)', extend = 'max') im2 = axs[1].imshow(raso_binary[0,:,:], cmap = 'gray', interpolation = 'none') axs[1].set_title('ASO snow cover (TH = 10 cm)', fontsize=16)

Output:

The common five evaluation metrics used to evaluate model accuracy are defined as follows: Precision ¼ TP  ðTP + FPÞ Recall ¼ TP  ðTP + FN Þ F1 score ¼ 2∗Precision∗Recall  ðPrecision + RecallÞ Balanced accuracy ¼ ðTP=ðTP + FNÞ + TN=ðTN + FPÞÞ  2 Accuracy ¼ ðTP + TN Þ  ðTP + TN + FP + FN Þ where TP, TN, FP, and FN are true positive, true negative, false positive, and false negative, respectively.

34

2. Machine learning for snow cover mapping

6.1 Testing subset model performance We run the “model.predict()" to get the snow cover prediction, where the “model” is the SCA mapping model we trained above. The code below is to load the saved model from the directory: dir_model = "./models/random_forest_SCA_binary.joblib" # load model model = joblib.load(dir_model)

Then, we calculate the five evaluation metrics using “calculate_metrics()" which is a custom function we generated to evaluate model performance. df = pd.DataFrame() df['obs'] = y_test df['predict'] = model.predict(X_test) # Cross-tabulate predictions print("Confusion matrix:") print(pd.crosstab(df['obs'], df['predict'], margins=True)) print(calculate_metrics(df))

Output:

From the confusion matrix printed out above, false negatives account for 116 of 54614 no-snow pixels, and the false positives for 65 of 41386 snow pixels. All five evaluation metrics are very close to 1, indicating the model performs very well in the remaining 96,000 testing samples, which is not very surprising as the pixels within the testing subset are located close to the training samples and also have similar spectrum features.

6.2 Image-wide model performance Now, let’s look at how the model performs across the entire image. First, we collect the four bands’ surface reflectance over the entire image, and then we apply a custom python function “run_sca_prediction()" to predict SCA. Finally, we save the SCA image to “dir_our” directory. dir_raster = './data/planet/20180528_181110_1025_3B_AnalyticMS_SR_clip.tif' dir_out = './data/SCA/' nodata_flag = 9 run_sca_prediction(dir_raster, dir_out, nodata_flag, model)

6 Model performance evaluation

35

Output:

We display the original Planet false-color image (left) and the predicted SCA map (right) in the figure below. Based on visual examination, the model captures the spatial distribution of snow areas very well. Next, we further compare the Planet-derived SCA with the validation data set—ASO SCA, and calculate the performance metrics. dir_planet = './data/planet/20180528_181110_1025_3B_AnalyticMS_SR_clip.tif' r_na_flag = rasterio.open(dir_planet, 'r').read() r_planet = rasterio.open(dir_planet, 'r').read([4,3,2])/10000 dir_sca = './data/SCA/20180528_181110_1025_3B_AnalyticMS_SR_clip_SCA.tif' r_sca = rasterio.open(dir_sca, 'r') fig, (ax1, ax2) = plt.subplots(1,2, figsize = (16,4)) show(r_planet, ax= ax1, cmap='jet', interpolation = 'none',title = 'Planet false color Image') show(r_sca.read().squeeze(), ax= ax2, interpolation = 'none',title = 'Planet Snow Cover')

To compare the ASO SCA and Planet SCA using the same spatial extent, we first read the data within the same area as shown by the yellow color. Because the ASO snow depth product should exclude water bodies and glaciers, we apply the waterbody dataset [data downloaded from NHDPlus (EPA, 2022)] and glacier dataset [data downloaded from Randolph Glacier Inventory 6.0 (GLIMS, 2022)] to mask out those areas. dir_planet_ext = './data/GIS/extent/CATE_20180528_181110_img_ext.shp' with fiona.open(dir_planet_ext, "r") as shapefile: shapes = [feature["geometry"] for feature in shapefile] dir_aso = "./data/ASO/ASO_3M_SD_USCATE_20180528_binary_clip.tif" with rasterio.open(dir_aso,'r') as src: r_aso = rasterio.mask.mask(src, shapes, crop=True, filled = False) dir_pred = './data/SCA/20180528_181110_1025_3B_AnalyticMS_SR_clip_SCA.tif' with rasterio.open(dir_pred,'r') as src: r_predict = rasterio.mask.mask(src, shapes, crop=True, filled = False)

36

2. Machine learning for snow cover mapping

dir_watermask = './data/mask/waterbody_TB_UTM11_clip.tif' with rasterio.open(dir_watermask,'r') as src: r_watermask = rasterio.mask.mask(src, shapes, crop=True, filled = False) dir_glaciermask = './data/mask/02_rgi60_WesternCanadaUS_hypso_TB_clip.tif' with rasterio.open(dir_glaciermask,'r') as src: r_glaciermask = rasterio.mask.mask(src, shapes, crop=True, filled = False)

We then organize data and print the evaluation results: df = pd.DataFrame() df['predict'] = r_predict[0].ravel() df['obs_sd'] = r_aso[0].ravel() df['watermask'] = r_watermask[0].ravel() df['glaciermask'] = r_watermask[0].ravel() # remove NA data region, water bodies, and glaciers df_mask = df[(df.predict >= 0) & (df.watermask != 0) & (df.glaciermask != 0)] df_mask.loc[df['obs_sd'] >= th, 'obs'] = 1 df_mask.loc[df['obs_sd'] < th, 'obs'] = 0 # print(df) print("Confusion matrix:") print(pd.crosstab(df_mask['obs'], df_mask['predict'], margins=True)) print("Overall model performance:") print(calculate_metrics(df_mask))

Output:

The results show that, overall, close to 90% of the pixels are classified accurately with the F1 score value of 0.87. The model has fewer false-positive predictions than false-negative predictions as the precision value (0.89) is slightly higher than the recall value (0.85), indicating a small underestimation in SCA modeling.

6 Model performance evaluation

37

6.3 Model performance in open areas versus forested areas To explore the potential reasons that explain the mismatch between ASO SCA and Planet SCA, we divide the entire domain into two categories: open areas and forested areas. We use a 3-m canopy height model dataset provided by ASO, Inc to classify the domain into open and forested areas. If the pixel has a tree height value higher than 1 m, then this pixel is classified as forest; otherwise, the pixel is classified as open area. file_landcover = './data/ASO/ASO_3M_CHM_USCATB_20140827_binary_clip.tif' # 1 - forest, 0 open area with rasterio.open(file_landcover,'r') as src: r_landcover = rasterio.mask.mask(src, shapes, crop=True, filled = False) df['landcover'] = r_landcover[0].ravel() df_mask = df[(df.predict >= 0) & (df.watermask != 0) & (df.glaciermask != 0)] df_open = df_mask[df_mask.landcover == 0] print("Model performance in open areas:") print(calculate_metrics(df_open)) df_forest = df_mask[df_mask.landcover == 1] print("Model performance in forested areas:") print(calculate_metrics(df_forest))

Output:

The results show a difference in model accuracy between open and forested areas. For the open areas, the overall model accuracy is 90%, with very similar precision (0.89) and recall (0.90) values, indicating similar false positive and false negative predictions. However, the overall model accuracy for the forested area is only 85%, with relatively high precision (0.77) and an extremely low recall (0.19), indicating much higher false-negative predictions than false positive predictions. The main reason for the high false-negative in forest areas is that Planet uses optical sensors which cannot penetrate canopy cover to get underneath snow cover information while ASO uses a lidar sensor that can penetrate the canopy. To get a closer look at the difference between ASO SCA and Planet SCA, we select two example sites: A and B as shown in the Figure below. Site A is in open terrain with half of the area in a shaded valley and Site B is in a dense forest.

38

2. Machine learning for snow cover mapping

The spatial distribution of snow cover in two example sites: (1) A open and shaded terrain, and (2) B dense forest. The model accurately predicts SCA from Planet imagery for the open areas at Site A even across shaded terrain where snow and all other land surfaces have low reflectance. At Site B, ASO and Planet show significant differences over dense forests. Optical Planet sensors can observe SCA in forest gaps. However, Planet-derived SCA is uncertain along forest edges and under the canopy where there are mixed forest-snow pixels or forest canopy pixels that introduce errors in the model.

7 Conclusion In this chapter, we developed a RF model to map SCA from high-resolution Planet imagery. We investigate the influence of sample size and three main parameters, namely the number of features, the number of trees, and the tree depth, on model performance. The optimal parameters determined by these tests are used in the final SCA mapping model. The final model shows very good performance in predicting binary SCA, achieving an overall model accuracy of 89% at 3-m spatial resolution when evaluated using a “ground truth” SCA dataset derived from the ASO snow depth dataset. Furthermore, the model predicts Planet SCA with higher accuracy in open areas than in forested areas, with overall accuracies of 90% and 85%, respectively. While the uncertainties in both SCA datasets are not negligible, the main mismatch between Planet SCA and the ASO SCA is caused by the difference in observation sensors. ASO SCA relies on a lidar sensor that can penetrate the canopy and detect snow under the canopy, while Planet uses optical sensors that only receive surface reflectance above the canopy.

References

39

In summary, the SCA mapping model we demonstrated in this chapter is easy to set up and use. The AI methods and workflows presented here have great potential applications in hydrological process monitoring and prediction, particularly in a world with more dramatic changes in snow cover.

8 Assignment (1) Please change the values of parameters used in the RF model, such as the number of samples, the number of features, the number of trees, and tree depth. Do you observe any changes in the predicted SCA performance metrics? (2) Please apply the RF model to a new Planet satellite image in Colorado provided in the GitHub repository (i.e., ./test/planet_co.tif), evaluate the SCA accuracy using the ASO snow depth data (i.e., ./test/ASO_SD_CO.tif). Evaluate the model results and model transferability to different geographic locations (Link to the Github repository: https:// github.com/earth-artificial-intelligence/earth_ai_book_materials).

9 Open questions There are multiple ways to continue AI research in the field of providing reliable, highresolution snow maps to support ecology, hydrology, and other environmental research. Building on the example provided here, we suggest the following activities as starting points for future investigations: (1) Examine the SCA detection approach and modeled SCA results and explore the causes of model uncertainty. What other avenues can we explore to improve model accuracy? (2) Explore the feasibility of adapting the workflows, data processing pipelines, and the random forest model presented here to work with data from different satellite sensors, such as Landsat series and Sentinel-2.

References Breiman, L., 2001. Random forests. Mach. Learn. 45, 5–32. https://doi.org/10.1023/A:1010933404324. Cannistra, A.F., Shean, D.E., Cristea, N.C., 2021. High-resolution CubeSat imagery and machine learning for detailed snow-covered area. Remote Sens. Environ. 258, 112399. https://doi.org/10.1016/j.rse.2021.112399. Cutler, A., Cutler, D.R., Stevens, J.R., 2012. Random forests. In: Ensemble Machine Learning. Springer, pp. 157–175. EPA, 2022. NHDPlus (National Hydrography Dataset Plus), https://www.epa.gov/waterdata/nhdplus-nationalhydrography-dataset-plus#::text¼National%20Hydrography%20Dataset%20Plus%20. NHDPlus,with%20the %20U.S.%20%20Geological%20Survey (Accessed 10 April 2022). GLIMS, 2022. The Randolph Glacier Inventory 6.0. https://www.glims.org/RGI/rgi60_dl.html. (Accessed 10 April 2022). John, A., Cannistra, A., Yang, K., Tan, A., Shean, D., Hille Ris Lambers, J., Cristea, N., 2022. High-resolution snowcovered area mapping in forested mountain ecosystems using PlanetScope imagery. Remote Sens. 14 (14). https://doi.org/10.3390/rs14143409. Painter, T.H., Berisford, D.F., Boardman, J.W., Bormann, K.J., Deems, J.S., Gehrke, F., Hedrick, A., Joyce, M., Laidlaw, R., Marks, D., Mattmann, C., McGurk, B., Ramirez, P., Richardson, M., Skiles, S.M., Seidel, F.C., Winstral, A., 2016. The airborne snow observatory: fusion of scanning lidar, imaging spectrometer, and physically-based modeling for mapping snow water equivalent and snow albedo. Remote Sens. Environ. 184, 139–152. https://doi.org/10.1016/j.rse.2016.06.018.

This page intentionally left blank

C H A P T E R

3 AI for sea ice forecasting Sahara Ali, Yiyi Huang, and Jianwu Wang Department of Information Systems, University of Maryland, Baltimore County, MD, United States

1 Introduction In this chapter, we present an in-depth tutorial covering some of the applications of artificial intelligence (AI) for forecasting Arctic Sea ice. We start off by presenting some background knowledge on sea ice and how it differs from icebergs and glaciers. Here, we put some light on Arctic amplification, the complex Arctic ecosystem, and how it affects the global climate patterns. In Section 2, we review some of the recent data-driven methods proposed for forecasting Arctic Sea ice. In Section 3, we look at the datasets used for this study and perform time-series data analysis to understand the seasonality in sea ice. In Section 4, we present a step-by-step tutorial for forecasting sea ice using state-of-the-art machine learning (ML) and deep learning methods. These steps are going to walk you through data preparation, model training, and model evaluation using open-access tools and techniques. We further present a comparative analysis of the implemented methods in Section 5. Finally, we conclude the chapter in Section 6 by giving a quick recap of what we learnt, how this work can be extended, and what are some of the questions we recommend to think of after going through this chapter.

1.1 Sea ice Sea ice forms in the ocean by frozen ocean water contrary to icebergs, glaciers, and ice sheets that initially originate in the land and eventually end up floating on the ocean. Sea ice is observed to be covered in snow for a large percentage of the year (National Snow and Ice Data Center, 2022). Different states of water have different densities; water has lower density upon converting into a solid state of itself, ice, floating on top of the ocean water. In addition, different forms of water such as fresh water and salty water have slightly different densities due to the concentration of other particles. Density is observed to decrease when the

Artificial Intelligence in Earth Science https://doi.org/10.1016/B978-0-323-91737-7.00012-8

41

Copyright # 2023 Elsevier Inc. All rights reserved.

42

3. AI for sea ice forecasting

purity increases. As a percentage, sea ice covers 7% of the entire surface of the Earth and 12% of the world’s oceans which include the Arctic ice pack of Arctic Ocean and Antarctic ice pack of Southern Ocean. The area covered by sea ice grows and shrinks over the course of the year, exhibiting seasonal cycles. Sea ice forms in the fall, as less sunlight reaches the Arctic and air temperatures begin to drop. The total area covered by ice increases through the winter, usually reaching its maximum in early March. In spring and summer, the ice begins to melt with more sunlight and higher temperature, shrinking to its minimum extent each September.

1.2 Arctic Sea ice and global climate patterns The Arctic is a region with unique climate features. For instance, in the Arctic, the Sun never rises over the horizon because of which the seasonal variations in polar day and night are extreme. Sea ice directly impacts the Earth’s radiation budget by reflecting a large portion of solar radiation due to its bright surface, inducing a cooling effect on the Earth climate system, which will adjust the global and regional weather circulations and patterns. Over the past four decades, measurements and observations suggest that the thickness and the extent of the sea ice in the Arctic has been rapidly shrinking. More solar energy is absorbed at the surface and ocean temperatures rise, which accelerates global warming. Moreover, sea ice shrinkage can accelerate the progression of global warming trend and the climate change patterns. For example, sea ice has a profound impact on both global ocean temperatures and the global movement of ocean waters. Estimating the sea ice variability using accurate modeling approaches would lead to better understanding and forecasting the future climate changes globally. A reliable sea ice seasonal forecast would allow us to identify opportunities, risks, and threats in many other fields. Specifically, there will be changes in transportation routes due to sea ice shrinkage, which will likely reduce transportation expenses, as well as changes in the development of resources in the Arctic coastal communities, impacting the indigenous people and wildlife in the region.

2 Sea ice seasonal forecast Accurately predicting seasonal and interannual fluctuations of sea ice is essential to improve management of ocean and coastal resources in the Arctic and provide planning information for shipping, to better serve Northern communities. Current operational sea ice forecasting systems mainly rely on coupled Earth system models, which estimate the solution to differential equations of fluid motion and thermodynamics to obtain time and spacedependent values for various variables in the atmosphere, ocean, or sea ice. Some good examples are Geophysical Fluid Dynamics Laboratory’s Couple Physical Model (Delworth et al., 2006) and the Max Planck Institute’s Meteorology Earth System Model (Gutjahr et al., 2019). However, these physical-based models usually perform no better than simple statistical methods at lead times of 2 months and beyond. More importantly, building and running coupled Earth system models require advanced understanding of physics and dynamics, and are also extremely computational expensive.

2 Sea ice seasonal forecast

43

Recently, there is a growing number of studies targeting seasonal sea ice forecasting using data-driven AI approaches like ML and deep learning. Chi and Kim (2017) proposed a fully data-driven long short-term memory (LSTM) model-based approach for Arctic Sea ice forecasting and compared it with a traditional statistical model; they found that the LSTM showed good performance for 1-month sea ice concentration (SIC) prediction, with less than 9% average monthly errors and around 11% mean absolute error during the melting season. Kim et al. (2020) developed a two-dimensional convolutional neural network (2DCNN) model that takes as input eight atmospheric predictors to predict SIC with 1 month’s lead time. They compared the performance with random forest baseline model, achieving root mean squared error (RMSE) of 5.76%. Liu et al. (2021) worked on daily prediction of the Arctic Sea Ice Concentration using reanalysis data based on a Convolutional LSTM Network. They proposed a ConvLSTM model to predict SIC for T timestep given T  1 and T  2 25 km resolution observational data from NSIDC (2008–18). They compared their model with a 2DCNN model that takes in spatial map with pixel grids from T  1 timestep. Their model achieved an RMSE of 11.2% as compared to the 2DCNN with RMSE of 13.7%. Kim et al. (2021) proposed a multitask ConvLSTM model that learns from SIC maps at T timestep and jointly predicts SIC and SIE for T + 1 timestep, achieving an RMSE of 10.8% for SIC and 0.303 million km2 for SIE. Exploring the potential of ensemble methods, Kim et al. (2019) worked on an multiple linear regression (MLR) + deep neural network (DNN) ensemble model using Bayesian model averaging to predict SICs for next 10–20 years. They evaluated their model using correlation coefficient and achieved a normalized RMSE of 0.8. Ali et al. (2021) proposed an attention-based LSTM ensemble that takes in multitemporal, daily and monthly, data and predicts sea ice extent for T + 1 timestep, achieving an RMSE of 4.9%. Most recently, Andersson et al. (2021) proposed IceNet, a U-Net-based ensemble model for seasonal sea ice forecasting. Their model takes in images as input and forecasts as output SIC maps in the form of three classes (open-water region SIC < 15%, ice-edge region 15% < SIC < 80%, and confident ice region SIC > 80%) for next 6 months. Through probabilistic deep learning, they show their forecasted values to be competent with the physics-based ECMWF seasonal forecast system SEAS5 ( Johnson et al., 2019). IceNet is pretrained using Coupled Model Intercomparison Project (CMIP6) 2220 years (1800–2011) simulation data and is fine-tuned on NSIDC’s observational data from 1979 to 2011. They evaluate their model performance on observational data from 2012 to 2017 using integrated ice-edge error (IIEE) and binary accuracy. Due to the unique nature of this problem, there are several limitations to the existing solutions and multiple prevailing challenges, such as accurately capturing the sea ice minimum and maximum values in the nonstationary dataset. The gravity of this challenge can be realized from the fact that the Sea Ice Prediction Network—Phase 2 (SIPN2) holds an annual competition for September Sea Ice Outlooka (SIO), where researchers from across the globe compete to predict the most accurate values for September Sea Ice Extent.

a

See https://www.arcus.org/sipn/sea-ice-outlook.

44

3. AI for sea ice forecasting

3 Sea ice data exploration For this study, we use sea ice extent (SIE) values, derived from SICs, that we obtained from the Nimbus-7 SSMR and DMSP SSM/I-SSMIS passive microwave data version 1 provided by the National Snow and Ice Data Center ([Dataset]National Snow and Ice Data Center, 2021). The monthly SIE data can be downloaded directly from NSIDC’s Data Center.b We start off with plotting the monthly SIE data from January 1979 to August 2021 using Python’s matplotlib package. import matplotlib.pyplot as plt import pandas as pd import numpy as np #loading the data and extracting SIE values for September data = pd.read_csv( ’NSIDC_monthly_1979_2021.csv’) sie = np.array(data)

#csv data downloaded from NSIDC

sept_values = sie[8::12] #calculating annual means def annual_mean(t): arr = [] for i in range(0,42): arr = np.append(arr,np.mean(t[i*12:(i*12)+12])) return arr annual_mean(sie) mean_values = annual_mean(sie) #plotting the data time_range=pd.date_range(start= "1979-01-01",end= "2021-09-01",freq= ’m’) time_range_sept=time_range[8::12] time_range_year=pd.date_range(start= "1979-01-01",end= "2021-09-01",freq= ’y’) fig, ax = plt.subplots(figsize=(24, 8)) ax.plot(time_range_year, mean_values, color =

’red’, label = ’Annual Mean Sea Ice’)

ax.plot(time_range_sept, sept_values, color =

’green’,linestyle= ’dashed’, label =

’September Sea Ice’) ax.plot(time_range, sie, color =

’blue’, label =

’Observed Sea Ice’)

ax.set_xlabel( ’Month’,fontsize = 15) ax.set_ylabel( ’Sea ice extent ($10^6$ $Km^2$)’,fontsize = 15) time_idx=np.arange(8,510,12) ax.legend() ax.set_title( ’Sea Ice Extent 1979-2021’,fontsize = 15)

As seen in Fig. 1, we clearly observe seasonality or nonstationary behavior in the sea ice variations. Notice how the SIE reaches its maximum and minimum value every year, which b

See https://nsidc.org/data/G02135/versions/3.

4 ML approaches for sea ice forecasting

45

FIG. 1 Observed sea ice extent from 1979 to 2021.

usually occurs in March and September. In addition to the annual variation, we can also observe a trend in the annual mean (red) and the September SIE values (dotted green). The main goal of sea ice forecasting is to capture this year-to-year variation using models that can predict the September SIE ahead of time for stakeholders and policy makers to take timely action. However, due to the Spring predictability barrier, the September SIE cannot be accurately predicted at lead times greater than 4–6 months.

3.1 Dataset description For the predictive analytics, we look at both daily and monthly sea ice extent data accessed from NSIDC data archives from January 1979 to August 2021. In addition to the sea ice data, we included nine meteorological variables in our analysis that contributes to the Arctic Sea ice variations as identified by Huang et al. (2021) in a causality study. This meteorological data are from ERA-5 global reanalysis product (European Centre for Medium-Range Weather Forecasts, 2021). We created two time series combining both sea ice extent and atmospheric variables for a span of 42 years, from 1979 to 2021. In the first time series, monthly gridded data during 1980–2021 has been averaged over the Arctic north of 25°N using area-weighted method. In the second time series, daily gridded data have been averaged over the same spatial location. The dataset description for these combined 10 variables is given in Table 1, whereas the combined dataset can be downloaded from our open-access githubc repository.

4 ML approaches for sea ice forecasting To study sea ice variations, scientists rely greatly on dynamic forecasting systems ( Johnson et al., 2019) that are mainly based on coupled Earth system models. However, over the last few c

https://github.com/big-data-lab-umbc/sea-ice-prediction/tree/main/data.

46

3. AI for sea ice forecasting

TABLE 1 Variables included in the dataset. Variables

Source

Range

Unit

Surface pressure

ERA5

[400, 1100]

hPa

Wind velocity

ERA5

[0, 40]

m/s

Specific humidity

ERA5

[0, 0.1]

kg/kg

Air temperature

ERA5

[200, 350]

K

Shortwave radiation

ERA5

[0, 1500]

W/m2

Longwave radiation

ERA5

[0, 700]

W/m2

Rain rate

ERA5

[0, 800]

mm/day

Snowfall rate

ERA5

[0, 200]

mm/day

Sea surface temperature

ERA5

[200, 350]

NSIDC

[3  10 , 17  10 ]

Sea ice extent

6

K 6

km2

years, researchers have shifted their focus to data-driven AI approaches like ML and deep learning. Since the climate data present high spatiotemporal correlations, ML models have shown promising results in spatiotemporal data mining leading to short- and long-term weather forecasting. Further, ML provides a cost efficient yet competitive alternative to climate model simulators. ML can provide valuable tools to tackle climate change. For example, ML approaches can be used to forecast El-Nino events, hurricanes, and ocean eddies and understand the role of greenhouse gases and aerosols on climate trends and events.

4.1 ML-based sea ice forecasting Since we are looking at the variations in sea ice over a period of time, this analysis can be considered as time-series regression problem. For this task, we train a multivariate linear regression (MLR) model using Python’s scikit-learnd package. 4.1.1 Data preprocessing We first load our dataset into a dataframe and remove any unwanted variables. In our case, that is the “Date” column. We print out the first five rows in the dataset to verify the data are loaded correctly (Fig. 2). import numpy as np import pandas as pd df = pd.read_csv( "Arctic_domain_mean_monthly_1979_2021.csv") # remove date from the set df = df.drop( ’Date’, 1) # print first 5 rows of data df.head()

d

scikit-learn.org/stable/.

4 ML approaches for sea ice forecasting

47

FIG. 2 Dataset with all 10 meteorological variables.

Next, we duplicate the sea ice column to create our target column. We use a lag of 1 month in our predictors and target variables. We achieve this by removing the first entry of sea ice values and last record from monthly data as shown in the code below. With this arrangement, we get January 1979s data values against February 1979s sea ice extent value and so on. df.loc[:,

’target’] = df[’sea_ice_extent’]

df = df.assign(target = df.target.shift(-1)).drop(df.index[-1])

Next step is to split the data into training and testing sets. To retain the seasonality patterns in the data, we sequentially split it by reserving the last 30 months for testing and keeping everything else in the training set. import numpy as np data = np.array(df )

#converting data into numpy

target = data[:,-1]

#assigning last column to be target variable

data = data[:,:-1]

#dropping last column from features

LEN_DATA = len(data)

#total number of records

NUM_TRAIN = LEN_DATA - (24+6)

#reserve last 30 months for testing

x_train = data[0:NUM_TRAIN] y_train = target[0:NUM_TRAIN] x_test = data[NUM_TRAIN:] y_test=target[NUM_TRAIN:] #verify data shape after split print(x_train.shape) print(y_train.shape) print(x_test.shape) print(y_test.shape) >(481, 10) >(481,) >(30, 10) >(30,)

Since our dataset has variables with different physical interpretations and likewise different range of values, it is recommended to bring the entire dataset into a single range of values, for instance [0, 1]. This can be done using one of the normalization techniques. We normalize our datasets using the MinMaxe scaling approach. e

scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html.

48

3. AI for sea ice forecasting

from sklearn.preprocessing import MinMaxScaler scaler_x = MinMaxScaler() x_train = scaler_x.fit_transform(x_train) x_test = scaler_x.transform(x_test) scaler_y = MinMaxScaler() y_train = scaler_y.fit_transform(y_train.reshape(-1,1)) y_test = scaler_y.transform(y_test.reshape(-1,1))

4.1.2 Fitting the model Once we have the data ready, we train the MLR model on the train data and evaluate its performance on the test data. from sklearn.linear_model import LinearRegression model = LinearRegression() model.fit(x_train, y_train) y_train_pred = model.predict(x_train) y_test_pred = model.predict(x_test)

Now, we inverse transform the predictions to bring them back to their original scale from our normalized scale. inv_y_test = scaler_y.inverse_transform(y_test) inv_y_pred = scaler_y.inverse_transform(y_test_pred)

We plot the test data against the model predictions to see how well the model has learnt seasonality patterns in the data. This is shown in Fig. 3. import matplotlib.pyplot as plt fig, ax= plt.subplots(figsize=(15, 6))

FIG. 3 Observed versus MLR-predicted sea ice extent from January 2019 to August 2021.

4 ML approaches for sea ice forecasting

49

plt.plot(inv_y_test, color= ’red’) plt.plot(inv_y_pred) plt.legend([’y_test’, ’y_pred’]) plt.show()

4.1.3 Model evaluation We evaluate the train and test predictions using the RMSE score. Following the data scale, the RMSE value is in square kilometers so we divide it by 106. rmse = np.sqrt(np.mean((inv_y_pred - inv_y_test) ** 2)) print(’Test RMSE: %0.3f mil. km2’ %(rmse/1000000)) >Test RMSE: 0.433 mil. km^2

We further calculate the correlation coefficient, also called R2 score, to quantify how well the model is performing on unseen data. The R2 score value ranges between 0 and 1. The higher the value, the better the performance. from sklearn.metrics import r2_score r2 = r2_score(inv_y_test, inv_y_pred) print(’R2 Score: %0.2f’ %(r2)) >R2 Score: 0.980

4.2 Deep learning-based sea ice forecasting In this section, we implement a deep learning model suitable for time-series analysis to forecast Arctic Sea ice variations. LSTM model is a variant of recurrent neural networks that is capable of processing time-series data by retaining temporal patterns over long periods of time. We explore this temporal power of LSTM to predict sea ice variations for multiple lead times. Demonstrated below is an LSTM model for a lag of 1 month. The same experiment can be repeated for multiple lead times. 4.2.1 Data preprocessing LSTM network expects the input data to be provided with a specific array structure in the form of: samples  timestep  features. To feed our data to the LSTM model, we reshape our N  11 two-dimensional datasets to M  T  11 three-dimensional (3D) datasets in such a manner that each row of predictors corresponds to (M-lag)th month’s sea ice value. Here, N represents total number of data samples, M represents total number of months, T represents the timestep, and lag is the lead time. We achieve this by reshaping the data, using a custom reshape function, as shown in the code below. import numpy as np import pandas as pd lstm_df = pd.read_csv(’Arctic_domain_mean_monthly_1979_2021.csv’) #drop unwanted variables lstm_df = lstm_df.drop([’Date’],axis=1)

50

3. AI for sea ice forecasting

data = np.array(lstm_df ) target = data[:,-1] data = data[:,:-1]

#assigning last column to be target variable #dropping last column from features

Next, we add a lag to our target variable and shift the predictors data to align it with the lagged values. Remember, LSTM models learn from the past values of a variable to predict its future. So, we need to add a lag in the predictors and the target variables to incorporate the concept of past and future. We demonstrate this below for a lead time of 1 month. this means a January 1979 input will help the model in making prediction for February 1979. lag = 1 data = data[:-lag,:,:] target = target[lag:]

Similar to the MLR model, we sequentially split data into train and test sets, reserving the last 30 months for model evaluation. LEN_DATA = len(data)

#total number of samples

NUM_TRAIN = LEN_DATA - (12*2)

#reserve last 30 months for testing

x_train = data[0:NUM_TRAIN] x_test = data[NUM_TRAIN:] #split features and labels y_train=target[:NUM_TRAIN] y_test=target[NUM_TRAIN:]

We then normalize the data using another normalization technique, that is, the standard scaler technique to feed it to our neural network. from sklearn.preprocessing import StandardScaler scaler_x = StandardScaler() x_train = scaler_x.fit_transform(x_train) x_test = scaler_x.transform(x_test) scaler_y = StandardScaler() y_train = scaler_y.fit_transform(y_train.reshape(-1,1)) y_test = scaler_y.transform(y_test.reshape(-1,1))

Finally, we reshape the data back to 3D for our neural network. timesteps = 1 x_train = reshape_features(x_train, timesteps) x_test = reshape_features(x_test, timesteps)

# reshaping to 3d for model

# reshaping to 3d for model

4.2.2 Model training Let us first design a neural network with some LSTM and fully connected layers. We add an intermediary self-attention layer on top of the final LSTM layers, in order to identify which hidden states contribute more to the target prediction. The attention mechanism assigns

4 ML approaches for sea ice forecasting

51

importance scores to the different hidden states of the LSTM model enabling the model to focus on most relevant features within the input. Specifically, at each timestep t, the attention layer first takes as input the hidden states ht, at the top layer of a stacking LSTM, it then infers a context vector ct, that captures relevant hidden information based on current target state ht, and all source states hs, of the LSTM model. Attention mechanism helps improve the deep learning model by attending to relevant hidden states so that it can significantly reduces the error of the prediction. Finally, we add two dropout layers to avoid overfitting of the model on training data. from tensorflow.keras import Input from tensorflow.keras.layers import Dense, Dropout, LSTM from tensorflow.keras.models import load_model, Model from attention import Attention timestep = timesteps features = 10 model_input = Input(shape=(timestep,features)) x = LSTM(64, return_sequences=True)(model_input) x = Dropout(0.2)(x) x = LSTM(32, return_sequences=True)(x) x = LSTM(16, return_sequences=True)(x) x = LSTM(16, return_sequences=True)(x) x = Attention(trainable = True)(x) x = Dropout(0.2)(x) x = Dense(32)(x) x = Dense(16)(x) x = Dense(1)(x) model = Model(model_input, x)

To compile a deep learning model, we need to choose an appropriate loss function and optimizer. There are a variety of Keras optimizers available to choose from. Here, we compile our model using Adamf optimizer as it handles noisy data well. Since it is a regression problem, we will use mean squared error (MSE)g as the loss function. model.compile(loss= ’mean_squared_error’, optimizer= ’adam’)

Finally, it is time to fit the model and start the training. We train the neural network on 200 epochs. Alternatively, this can be done using the Early Stoppingh method to halt training when the model loss no longer optimizes. We keep the batch size as 12 representing a chunk of 12 months of annual data. Further, we specify a 30% validation split for the model to evaluate its performance on a different dataset at each epoch.

f

keras.io/api/optimizers/adam/.

g

keras.io/api/losses/regression_losses.

h

keras.io/api/callbacks/early_stopping/.

52

3. AI for sea ice forecasting

train valid

0.8

0.6

0.4

0.2

0.0 0

25

50

75

100

125

150

175

200

FIG. 4 Model’s training versus validation (test) loss.

history=model.fit(x_train, y_train, epochs=200, batch_size=12, verbose=2, validation_split =0.3, shuffle=True)

We can plot the training and validation loss to see how well the model has generalized on the data (Fig. 4). from matplotlib import pyplot fig, ax = pyplot.subplots(figsize=(8,6)) pyplot.plot(history.history[’loss’], label = ’train’) pyplot.plot(history.history[’val_loss’], label = ’valid’) pyplot.legend() pyplot.show()

4.2.3 Model evaluation It is time to make predictions and evaluate the model performance on test data. We inverse transform the predicted values back to their original scale to calculate the RMSE and R2 score. testPred = model.predict(x_test) inv_testPred = scaler_l.inverse_transform(testPred) inv_y_test = scaler_l.inverse_transform(y_test)

4 ML approaches for sea ice forecasting

53

# calculate RMSE from sklearn.metrics import mean_squared_error from math import sqrt rmse = sqrt(mean_squared_error(inv_y_test, inv_testPred)) print(’Test RMSE: %.3f mil km^2’ % rmse/1000000) >Test RMSE: 0.672 mil km^2 # calculate R-square from sklearn.metrics import r2_score from math import sqrt r_sq = r2_score(inv_y_test, inv_testPred) print(’Test R_Square: %.2f’ % r_sq) >Test R_Square: 0.96

Finally, let us plot the observed versus predicted sea ice extent. from matplotlib import pyplot fig, ax= plt.subplots(figsize=(24, 8)) pyplot.plot(inv_testPred) pyplot.plot(inv_y_test) plt.legend ([’y_pred’,’y_test’]) pyplot.show()

As seen in Fig. 5, the model captures the seasonality patterns in data but overpredicts the SIE values for summer season, that is, September.

4.3 Ensemble learning-based sea ice forecasting Ensembling is a hybrid modeling approach, where outputs from multiple models are combined to improve performance while reducing variance and generalization error. Instead

FIG. 5 Observed versus LSTM-predicted sea ice extent from January 2019 to August 2021.

54

3. AI for sea ice forecasting

FIG. 6 Architecture of MLR-LSTM ensemble model.

of rigorously training a single model to be the best predictor, ensemble methods combine a series of models and aggregate them to produce one final model. There are multiple ways of ensembling ML models including bagging, boosting, and stacking. Here, we follow a custom ensemble approach where we first train the MLR on the data and then combine its predictions with the LSTM model to get final predictions. The overall architecture of this approach is given in Fig. 6. 4.3.1 Data concatenation As shown in Fig. 6, we are combining MLR predictions Y0M with the original dataset and feeding the new dataset to the LSTM model to get final predictions Y00M . For this, we need to save the predictions from the MLR model in a vector array and concatenate them with the original data, as demonstrated in the code below. This is illustrated in Fig. 6 using the  symbol. The dataset is reshaped to 3D for feeding to LSTM model as it requires an additional dimension T, that is, the timestep. To utilize the same data for MLR and LSTM, we keep the timestep value as 1. Notice that we use the entire dataset in training the MLR so we have the same number of data samples to be concatenated as the original dataset. import numpy as np import pandas as pd from sklearn.linear_model import LinearRegression df = pd.read_csv("Arctic_domain_mean_monthly_1979_2021.csv") # remove date from the set df = df.drop(’Date’, 1) data = np.array(df ) target = data[:,-1]

#assigning last column to be target variable

5 Results and analysis

55

model = LinearRegression() model.fit(data, target) lr_data = model.predict(data)

Since the predictions are in one-dimensional, we need to reshape them to match the dimensions of our initial dataset. lr_data = lr_data.reshape(len(lr_data),1)

We are now ready to concatenate the MLR predictions to our original dataset. data = np.concatenate((data,lr_data),axis=1)

Once we have this new dataset ready, we can repeat the steps in Sections 4.2.1–4.2.2 for training our LSTM model. The only difference would be that the number of features in the LSTM model is now 11 instead of 10, where the predictions from MLR will serve as the 11th predictor for the LSTM model. 4.3.2 Model evaluation Let us evaluate the model performance by calculating the RMSE and R2 score. Do not forget to inverse transform your predictions before calculating the RMSE and R2 score as explained in Section 4.2.3. # calculate RMSE from sklearn.metrics import mean_squared_error from math import sqrt rmse = sqrt(mean_squared_error(inv_y_valid, inv_testPred)) print(’Test RMSE: %.3f’ % rmse/1000000) >Test RMSE: 0.637 # calculate R2 Score r2 = r2_score(inv_y_valid, inv_testPred) print(’Test R2 Score: %.3f’ % r2) >Test R2 Score: 0.975

Looking at the RMSE and R2 score values, we notice how LSTM model’s predictive performance has increased by ensembling it with MLR model.

5 Results and analysis In this section, we compare the performance of all three methods implemented in the previous sections for predictions at greater lead times. Owing to the spring season predictive barrier, we only compare the model predictions for 1–3 months of lead times. Looking at the results in Table 2, we notice how MLR performs better than deep learning and ensemble methods at lead times of 1–2 months, however, for greater lead times, the predictive performance of MLR starts deteriorating. We notice that the ensemble method

56

3. AI for sea ice forecasting

TABLE 2 RMSE and R2 score for all models for lead times of 1–3 months. R2 score

RMSE Models/lead time

1 month

2 months

3 months

1 month

2 months

3 months

MLR

0.433

0.892

1.429

0.980

0.937

0.839

LSTM

0.627

1.102

1.401

0.960

0.930

0.845

MLR-LSTM

0.637

1.042

1.400

0.975

0.932

0.846

improves the predictive performance of LSTM and finally provides promising results on lead time of 3 months by outperforming both MLR and LSTM.

6 Discussion In this chapter, we looked at multiple data-driven models to study the sea ice variations in the Arctic. This chapter is targeted to provide hands-on experience in forecasting Arctic Sea ice, where all demonstrated models focus on the temporal aspect of the data. However, there are several other spatiotemporal deep learning and probabilistic ML models that can be implemented in a similar manner to not only predict spatial maps of SICs but also capture the long-term trend and interannual variations. Note that sample analysis shown here mainly targets predicting pan-Arctic Sea ice extent at a lead time from 1 to 3 months. The study can be improved in several ways to further advance the modeling capabilities of sea ice seasonal forecasting. First, more climate variables can be included as predictors to improve the forecast, such as ocean density, ocean salinity, ocean waves, snow cover, etc. It is also worth including some key climate indices such as North Atlantic Oscillation and Arctic Oscillation to capture the teleconnection between Arctic and lower latitudes. Second, the sea ice forecasting can be improved by replacing reanalysis product with reliable satellite observations or surface in situ measurements. Third, the spatial information can be included in the model, which means the model can be trained in different regions of interest, such as Beaufort Sea, East Siberian Sea, which shows relatively larger interannual variations than other marginal seas of Arctic Ocean. The model can be further improved to predict gridded SIC to provide more detailed information. Note that the physical processes that drive sea ice variations may work differently in different time of year (e.g., with or without sunlight), it would be also worthwhile to develop sea ice forecast model for different seasons. On the current global warming trend, we may expect a sea ice-free summer by 2050 based on the most recent climate model projections (Notz and Community, 2020). This implies that the point at which sea ice cover falls to close to zero in the summer will have serious consequences. It therefore becomes immensely important to study how to build AI-based models to project long-term sea ice changes, for example, monthly sea ice variations in the next 30 or 50 years. Nevertheless, the sample analysis shown in this chapter is a good starting point to show how cutting-edge AI technologies can be used in the frontier area of Earth Science. It will motivate the development of a more sophisticated and reliable seasonal sea ice forecasting model, ultimately advancing our capabilities to better predict and prepare for global climate change.

References

57

7 Open questions 1. In this chapter, we used some of the atmospheric variables for our analysis of sea ice variations. What other climate variables (atmosphere, ocean, ecosystem, etc.) can be used as the predictors to further improve the models discussed? 2. We demonstrated annual predictive skill of different ML models over the entire Pan-Arctic region. Do you think the models would perform differently by seasons and subregions? Please explain the reasons and discuss how to further improve the models by including seasonality and spatial heterogeneity. 3. In addition to declined SICs/extent, reconstructions using numerous observational sources show that the average thickness has decreased more than 60% in the past six decades (Kwok, 2018). How can we leverage the cutting-edge Earth AI techniques to better predict short- and long-term changes of sea ice thickness? What data do you consider relevant to this kind of analysis?

8 Assignments 1. In this analysis, we chose the dataset variables based on their causal relationship with sea ice (Huang et al., 2021). Perform a statistical correlation analysis on this dataset to identify the variables that have a strong correlation with each other. Further, identify the variables that have a weak correlation with each other. You can use Pearson correlation test for this task. 2. We demonstrated how MLR can be performed on the annual dataset. Create a subset of the dataset with only September values. Train an MLR model on this updated dataset and verify its predictive performance for lead times of 1–3 months. Repeat the same for the subset of dataset with only March values. How different are these monthly forecasts from the predictions of models trained on the annual dataset, demonstrated in this chapter? Does the model performance improve or deteriorate? 3. Python sklearn package provides ML-based ensemble models like Random Forest and XGBoost. Implement one of these ensemble methods on the dataset and compare the performance of this model with the results of MLR-LSTM ensemble method. Explain your results by comparing the RMSE and R2 scores for both modeling techniques.

References Ali, S., Huang, Y., Huang, X., Wang, J., 2021. Sea ice forecasting using attention-based ensemble LSTM. Tackling climate change with machine learning workshop at ICML. arXiv preprint:2108.00853. Andersson, T.R., Hosking, J.S., Perez-Ortiz, M., Paige, B., Elliott, A., Russell, C., Law, S., Jones, D.C., Wilkinson, J., Phillips, T., et al., 2021. Seasonal Arctic Sea ice forecasting with probabilistic deep learning. Nat. Commun. 12 (1), 1–12. Chi, J., Kim, H.-C., 2017. Prediction of Arctic Sea ice concentration using a fully data driven deep neural network. Remote Sens. 9 (12). https://doi.org/10.3390/rs9121305. [Dataset]National Snow and Ice Data Center, 2021. Sea Ice Concentrations From Nimbus-7 SMMR and DMSP SSM/ISSMIS Passive Microwave Data, Version 1. (Accessed 5 September 2021) http://nsidc.org/data/NSIDC-0051.

58

3. AI for sea ice forecasting

Delworth, T.L., Broccoli, A.J., Rosati, A., Stouffer, R.J., Balaji, V., Beesley, J.A., Cooke, W.F., Dixon, K.W., Dunne, J., Dunne, K., et al., 2006. GFDL’s CM2 global coupled climate models. Part I: formulation and simulation characteristics. J. Clim. 19 (5), 643–674. European Centre for Medium-Range Weather Forecasts, 2021. ERA-5 Global Reanalysis Product. (Accessed 5 September 2021) https://cds.climate.copernicus.eu/cdsapp#!/home. Gutjahr, O., Putrasahan, D., Lohmann, K., Jungclaus, J.H., von Storch, J.-S., Br€ uggemann, N., Haak, H., St€ ossel, A., 2019. Max planck institute Earth system model (MPI-ESM1.2) for the high-resolution model intercomparison project (HighResMIP). Geosci. Model Dev. 12 (7), 3241–3281. Huang, Y., Kleindessner, M., Munishkin, A., Varshney, D., Guo, P., Wang, J., 2021. Benchmarking of data-driven causality discovery approaches in the interactions of Arctic Sea ice and atmosphere. Front. Big Data 4, 642182. https://doi.org/10.3389/fdata.2021.642182. Johnson, S.J., Stockdale, T.N., Ferranti, L., Balmaseda, M.A., Molteni, F., Magnusson, L., Tietsche, S., Decremer, D., Weisheimer, A., Balsamo, G., Keeley, S.P.E., Mogensen, K., Zuo, H., Monge-Sanz, B.M., 2019. SEAS5: the new ECMWF seasonal forecast system. Geosci. Model Dev. 12 (3), 1087–1117. Kim, J., Kim, K., Cho, J., Kang, Y.Q., Yoon, H.-J., Lee, Y.-W., 2019. Satellite-based prediction of Arctic Sea ice concentration using a deep neural network with multi-model ensemble. Remote Sens. 11 (1), 19. Kim, Y.J., Kim, H.-C., Han, D., Lee, S., Im, J., 2020. Prediction of monthly Arctic Sea ice concentrations using satellite and reanalysis data based on convolutional neural networks. Cryosphere 14 (3), 1083–1104. Kim, E., Kruse, P., Lama, S., Bourne, J., Hu, M., Ali, S., Huang, Y., Wang, J., 2021. Multi-task deep learning based spatiotemporal Arctic Sea ice forecasting. In: 2021 IEEE International Conference on Big Data (Big Data), IEEE, pp. 1847–1857. Kwok, R., 2018. Arctic Sea ice thickness, volume, and multiyear ice coverage: losses and coupled variability (1958–2018). Environ. Res. Lett. 13 (10), 105005. Liu, Q., Zhang, R., Wang, Y., Yan, H., Hong, M., 2021. Daily prediction of the Arctic Sea ice concentration using reanalysis data based on a convolutional LSTM network. J. Mar. Sci. Eng. 9 (3), 330. National Snow and Ice Data Center, 2022. Quick Facts on Arctic Sea Ice. National Snow and Ice Data Center. (Accessed 20 January 2022) https://nsidc.org/cryosphere/quickfacts/seaice.html. Notz, D., Community, S., 2020. Arctic Sea ice in CMIP6. Geophys. Res. Lett. 47 (10). e2019GL086749.

C H A P T E R

4 Deep learning for ocean mesoscale eddy detection Edwin Goh, Annie Didier, and Jinbo Wang Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA, United States

1 Introduction In this chapter, we will cover how to leverage convolutional neural networks to detect mesoscale (100–300 km diameter) ocean eddies. Atmospheric wind together with solar heat flux at the sea surface drive the ocean circulation. The ocean’s circulation is turbulent and full of small-scale vortices with closed circulation contours. These vortices in the ocean are often referred to as ocean eddies—oceanic analogues of atmospheric cyclones. Ocean eddies play an important role in the transport of heat, salt, carbon, and other tracers that are important for marine biology and climate regulation. The interested reader new to oceanography is referred to Chelton et al. (2011) for a review. Ocean eddies have particular thermohaline structures that generate “hills” and “valleys” in the sea surface height (SSH). The SSH associated with these eddy features are on the order of tens of centimeters (above/below mean sea level), and can be detected using satellite altimetry. A satellite altimeter measures the distance between the satellite and the ocean surface by measuring the round-trip time of radar pulses, which indirectly measures the SSH by referencing a stationary geoid. Quick and accurate eddy detection can enable the identification of eddies at a global scale (i.e., for the entire ocean), and can in turn lead to better quantification of their influence on the ocean’s role in Earth’s climate system. In this chapter, we leverage an existing open-source deep learning model and a gridded satellite altimetry product to illustrate how to use convolutional neural networks (CNN) to detect ocean eddies. CNNs mimic the brain’s visual cortex, specifically the idea that groups of neurons only react to stimuli in a limited region of the visual field (i.e., they have a localized receptive field) and the hierarchical structure between higher-level and lower-level visual

Artificial Intelligence in Earth Science https://doi.org/10.1016/B978-0-323-91737-7.00011-6

59

Copyright # 2023 Elsevier Inc. All rights reserved. Contribution prepared by the Contributor on behalf of JPL/Caltech.

60

4. Deep learning for ocean mesoscale eddy detection

neurons. In fact, CNNs such as Yann LeCun’s LeNet-5 architecture (LeCun et al., 1998) have been widely used since the 1990s for handwritten digit recognition and gained renewed attention when AlexNet (Krizhevsky et al., 2012) won the ILSVRC ImageNet challenge in 2012. As a demonstration in this chapter, we implement a version of EddyNet, a small and efficient CNN for pixel-wise classification (i.e., semantic segmentation) of ocean eddies (Lguensat et al., 2018). As a testament to the benefits of open-source research, several iterations/variants of the EddyNet architecture have been described in subsequent literature, including work by other researchers such as Santana et al. (2020). Here, we focus on the baseline/"vanilla" version of EddyNet, in order to emphasize a basic machine learning workflow as opposed to model architecture development and fine-tuning.

2 Chapter layout This chapter is organized as follows. We first discuss the eddy detection task and existing literature. We then discuss the data preparation process and provide details about the source dataset used throughout this chapter, the algorithm(s) used to obtain ground truth eddy masks from this dataset, and the utilities written to transform the data into a “machine learning-ready” dataset. Wherever possible, we discuss alternative data sources and algorithms that can be used in place of the ones used in this chapter. Next, we pose the eddy detection problem as a machine learning task (semantic segmentation, specifically) and describe a custom deep learning model by Lguensat et al. (2018) to solve this task. We provide an illustrative machine learning workflow and discuss relevant machine learning concepts such as training and testing sets, overfitting, transfer learning, and various metrics to evaluate model performance. Finally, we provide as homework and/or personal research assignments the future work identified in Lguensat et al. (2018) and Sun et al. (2022). We use Python as the main programming language, and PyTorch as the primary deep learning framework. In general, Python users are strongly encouraged to work in an isolated environment for each project in order to avoid conflicting library versions. This can be achieved using Python’s built-in venv module or Anaconda. For this chapter, we will use Anaconda to create a new environment called eddy_env and install the required libraries. We assume that the reader has access to a GPU-enabled machine with CUDA 11.2 or higher installed. Users with older CUDA versions (e.g., 10.2) can still run the code, but will need to install the appropriate version of PyTorch from https://pytorch.org/get-started/ previous-versions/. The following command will create a new environment called eddy_env: conda create --name eddy_env python=3.8 pytorch==1.10.2 torchvision==0.11.3=py38_cu113 torchaudio==0.10.2=py38_cu113 cudatoolkit=11.3 cudnn=8.2.1 tensorboard torchmetrics=0.9.0 setuptools=59.5.0 seaborn pandas tqdm opencv jupyter scikit-learn -c pytorch -c conda-forge

Readers will also need to install the pyEddyTracker package to generate the ground truth eddy masks. Since the pyEddyTracker package is only available on PyPI and not the conda repository, we will use Python’s built-in pip module to install it. Once the eddy_env environment is activated, run the following command to install pyEddyTracker: pip install pyEddyTracker

3 Data preparation

61

3 Data preparation 3.1 AVISO-SSH data product This chapter aims to detect eddies from sea level anomaly (SLA) using deep learning. Following Lguensat et al. (2018), we use the AVISO-SSH sea level product provided by the Copernicus Climate Change Service (C3S). This product provides daily global estimates of sea level anomaly at 0.25° resolution based on satellite altimetry measurements from 1993 to present, and can be downloaded through Copernicus Climate Change Service (https:// doi.org/10.24381/cds.4c328c78). Eddy detection, when posed as a semantic segmentation task, is a supervised learning problem. Given a dataset D that consists of a set of examples x  X and corresponding targets or labels y  Y supervised learning aims to obtain a model f parameterized by θ that generates a prediction b y , i.e., b y ¼ f θ ðxÞ. For the eddy detection task, X is taken to be sea level anomaly (SLA) or absolute dynamic topography (ADT) obtained from the AVISO-SSH product, whereas the ground truth labels Y are pixel-level classifications of whether each pixel is a cyclonic eddy, anticyclonic eddy or not an eddy. We obtain Y using the py-eddy-tracker algorithm (Mason et al., 2014), which is a geometric approach that takes SLA maps as input and outputs eddy properties. While the use of an automated algorithm as ground truth for another automated algorithm may seem circular at first glance, an ongoing area of research is to investigate whether machine learning models can be resolution and scale invariant. That is, we wish to develop machine learning approaches that can be applied to identify not only mesoscale eddies, but also smaller (e.g., submesoscale 0] = 1 contours, hierarchy = cv2.findContours(mask, cv2.RETR_TREE, cv2. CHAIN_APPROX_SIMPLE) return len(contours)

4.4.2 Analyze training curves in TensorBoard Run this cell to initialize a TensorBoard instance, which we will use to monitor the training progress. After the next cell (i.e., the one that runs run_epochs) is running and the model is training, click on the gear icon in the top right and make sure the “Reload data” option is checked. from IPython.display import display, HTML display(HTML("")) %load_ext tensorboard %tensorboard –bind_all –logdir $writer.log_dir –port=6008 # the default is 6006 but we set it to 6009 to avoid conflicts with other notebooks

4.4.3 Run the training loop for prescribed num_epochs Here, we implement a training loop that calls the run_epoch function for a specified number of epochs, updating the model’s parameters θ using our optimizer to drive down the specified loss function. We implement early stopping, which is a form of regularization to prevent the model from overfitting to the training data and losing its generalization ability. To do this, we track the model’s performance on the validation set (i.e., SSH + eddy maps from 2019) as it trains on data from 1998 to 2018 and stop the training once its performance on the 2019 dataset stops increasing (or even decreases) for a specified number of epochs (specified using the patience argument).

Try it out In a typical machine learning workflow, researchers/engineers save checkpoints of the model at regular intervals to evaluate the model’s training progress and safeguard against potential crashes during the training run, in which case the experiment can be restarted from the saved checkpoint. In the cell below, add a block of code to save a checkpoint every 10 epochs using the checkpoint_path variable.

90

4. Deep learning for ocean mesoscale eddy detection

from eddy_train_utils import add_hparams # create some aliases loss, opt, sched = loss_fn, optimizer, scheduler checkpoint_path = os.path.join(tensorboard_dir, "model_ckpt_{epoch}.pt") early_stopping = EarlyStopping( patience=10, path=checkpoint_path, min_epochs=30, ) progress_bar = tqdm(range(num_epochs), desc="Training: ", unit="epoch(s)") for N in progress_bar: train_loss, val_loss, train_m, val_m = run_epoch( N, model, loss, opt, sched, train_loader, val_loader, train_metrics, val_metrics, writer, ) # update progress bar train_m_copy = {f"train_{k}".lower(): v.cpu().numpy() for k, v in train_m.items ()} val_m_copy = {f"val_{k}".lower(): v.cpu().numpy() for k, v in val_m.items()} progress_bar.set_postfix(**train_m_copy, **val_m_copy) # early stopping when validation loss stops improving early_stopping.path = checkpoint_path.format(epoch=N) early_stopping(val_loss, model) if early_stopping.early_stop: print( f"Early stopping at epoch {N}" f" with validation loss {val_loss:.3f}" f" and training loss {train_loss:.3f}" ) break # TODO (homework): save checkpoint every 10 epochs # add hyperparameters and corresponding results to tensorboard HParams table hparam_dict = {

4 Training and evaluating an eddy detection model

91

"backbone”: model_name, "num_epochs”: num_epochs, "batch_size”: batch_size, "num_classes”: num_classes, "binary_mask”: binary, "optimizer”: optimizer.__class__.__name__, "max_lr”: max_lr, "loss_function”: loss_fn.__class__.__name__, } metrics_dict = { "train/end_epoch”: N, "train/loss”: train_loss, "train/Accuracy”: train_m["Accuracy"], "val/loss”: val_loss, "val/Accuracy”: val_m["Accuracy"], } add_hparams(writer, hparam_dict, metrics_dict, epoch_num=N) writer.close() # save model to tensorboard folder model_path = os.path.join(tensorboard_dir, f"model_ckpt_{N+1}.pt") torch.save(model.state_dict(), model_path) {"model_id":"1877ff7bfa454b7aab999cb6f2c6d8f6","version_major":2, "version_minor":0} Early stopping at epoch 193 with validation loss 0.000 and training loss 0.714

4.5 Evaluate model on training and validation sets Once the training loop is complete, we evaluate the trained model on our validation set (2019) one more time, and generate an animation like the one shown at the beginning of this notebook. Recall that the val_loader object returns SSH maps and eddy masks in minibatches, so an inner loop across the minibatch is used to generate the animation. from matplotlib.animation import ArtistAnimation device = torch.device("cuda" if torch.cuda.is_available() else "cpu") model.eval() with torch.no_grad(): fig, ax = plt.subplots(1, 3, figsize=(25, 10)) artists = [] # loop through all SSH maps and eddy masks in 2019 # and run the model to generate predicted eddy masks for n, (ssh_vars, seg_masks, date_indices) in enumerate(val_loader): ssh_vars = ssh_vars.to(device) seg_masks = seg_masks.to(device) # Run the model to generate predictions preds = model(ssh_vars)

92

4. Deep learning for ocean mesoscale eddy detection

# For each pixel, EddyNet outputs predictions in probabilities, # so choose the channels (0, 1, or 2) with the highest prob. preds = preds.argmax(dim=1)

# Loop through all SSH maps, eddy masks, and predicted masks # in this minibatch and generate a video preds = preds.cpu().numpy() seg_masks = seg_masks.cpu().numpy() ssh_vars = ssh_vars.cpu().numpy() date_indices = date_indices.cpu().numpy() for i in range(len(ssh_vars)): date, img, mask, pred = date_indices[i], ssh_vars[i], seg_masks[i], preds[i] img1, title1, img2, title2, img3, title3 = plot_eddies_on_axes( date, img, mask, pred, ax[0], ax[1], ax[2] ) artists.append([img1, title1, img2, title2, img3, title3]) fig.canvas.draw() fig.canvas.flush_events() animation = ArtistAnimation(fig, artists, interval=200, blit=True) plt.close()

animation.save(os.path.join(tensorboard_dir, "val_predictions.gif"), writer="pillow") HTML(animation.to_jshtml())

4 Training and evaluating an eddy detection model

93

Finally, we demonstrate the beginnings of how one can use classical computer vision techniques to recover eddy contours from the predicted segmentation masks. We use the first prediction in the list (i.e., Jan 1, 2019) and use the OpenCV library to identify contours (regardless of anticyclonic or cyclonic). We then show a visualization of the recovered contours, and use the contourArea() function to determine the average area (in square pixels) enclosed by each contour. This operation can be performed for each prediction to enable eddy tracking and trajectory determination. p = preds[0].astype(np.uint8) print(f"Number of anticyclonic eddies: {count_eddies(p, eddy_type='anticyclonic')}") print(f"Number of cyclonic eddies: {count_eddies(p, eddy_type='cyclonic')}") print(f"Number of both eddies: {count_eddies(p, eddy_type='both')}") # draw contours on the image thr = cv2.threshold(p, 0, 1, cv2.THRESH_BINARY)[1].astype(np.uint8) contours, hierarchy = cv2.findContours(thr, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE) img = np.zeros(p.shape, np.uint8) cv2.drawContours(img, contours, -1, (255, 255, 255), 1) plt.imshow(img, cmap="gray") plt.axis("off")

94

4. Deep learning for ocean mesoscale eddy detection

# get average contour area area = 0 for cnt in contours: area += cv2.contourArea(cnt) area /= len(contours) print(f"Average contour area: {area:.2f} sq. pixels") Number of anticyclonic eddies: 154 Number of cyclonic eddies: 162 Number of both eddies: 281 Average contour area: 14.81 sq. pixels

5 Discussion Figs. 1–3 show several results from the training process described in the foregoing sections. Fig. 1 shows that the EddyNet model proposed by Lguensat et al. (2018) achieves an accuracy of around 80% on the 2019 test set in the Northern Pacific. The precision and recall curves on the training set from 1998 to 2018 show that the model has a high recall on eddies, meaning that most of the eddies (both anticyclonic and cyclonic) in the training set are accurately predicted by the model. However, the model exhibits low eddy precision of around 0.55. In other words, only 55% of all pixels that the model classifies as eddies are actually eddies; the remaining 45% are false positives and should have been classified as “not eddies” instead. The reverse trend is true for the “not eddies” class—the model is only able to correctly identify 65% of all noneddy pixels in the 2019 test set, but almost all of the pixels classified as “not eddies” by the model are in fact not eddies (i.e., high precision on the “not eddies” class). While Fig. 1 shows the relevant ML metrics averaged across the total number of pixels in the training set, Figs. 2 and 3 further show the tendency of the model to overpredict the number of eddies on the test set. The distribution for the predicted number of eddies is significantly shifted to the right compared to that of the ground truth. A similar trend is shown in Fig. 3, where a time series plot illustrates the change in eddy counts across 2019. Although the model tends to predict around 50 more eddies per tile, it can capture seasonal trends across 2019. It correctly predicts that August of 2019 has the fewest anticyclonic eddies

FIG. 1

Training curves of pixel-level accuracy, precision, and recall that show the model’s progress throughout each training iteration.

96

4. Deep learning for ocean mesoscale eddy detection

FIG. 2

Comparison of anticyclonic and cyclonic eddy counts by the py-eddy-tracker ground truth and the model’s predictions on the test set (2019) in the Northern Pacific Ocean (see Section 3.6). The model consistently predicts more eddies compared to the ground truth due to low recall performance on the negative class.

FIG. 3 Time series comparison of mean eddy counts per month from ground truth and model predictions in the Northern Pacific Ocean (see Section 3.6). A finer-grained view shows that the model can reasonably track trends in the number of eddies over time.

and that July of 2019 has the fewest cyclonic eddies. The model also reasonably tracks the trend of a decreasing number of eddies in the Northern Pacific from March to July. It is important to note that this chapter serves as a didactic introduction, and various methods may be employed to improve the model’s performance and decrease the number of false positives. Chief among them is the expansion of the dataset to encompass additional regions beyond the Northern Pacific Ocean, as well as the modification of the EddyNet architecture to leverage temporal features in addition to spatial features.

7 Assignments

97

6 Summary In this chapter, we covered a deep learning-based approach to detect mesoscale ocean eddies, which influence ocean transport processes as well as the global energy budget. Compared to classical machine learning models, deep learning models offer the capacity or complexity to capture nonlinear interactions (when trained on high-quality annotated data) between different parts of the input features at various scales. This is especially relevant in physical oceanography, which is governed by nonlinear, multiscale physics. We posed the eddy detection problem as a semantic segmentation (i.e., pixel-wise classification) problem and generated a pixel-wise annotated satellite altimetry dataset using the py-eddy-tracker algorithm to train the deep learning model. In doing so, we demonstrated how to obtain data from large oceanography data repositories and leverage open-source research tools to produce ground truth eddy maps, which is arguably the most important (and time-consuming) step of any new machine learning project. We then trained our model on a subset (in the Northern Pacific) of the global eddy maps, covering several fundamental concepts in practical machine learning such as the relevant performance metrics to address the imbalanced class distribution (i.e., a much larger part of our subset region not covered by eddies compared to regions containing eddies), the need for training and testing splits to evaluate our model’s performance, and how to monitor the training process using TensorBoard. We hope that this chapter, though without a comprehensive literature review, will provide the reader not only with a strong sense of the overarching processes involved in an end-to-end machine learning research project, but also with inspiration to apply these same processes and fundamental principles to other tasks in Earth Science to help address some of the most pressing issues in Earth and climate sciences.

7 Assignments 1. Try to set average=micro for torchmetrics.F1Score and re-run the training loop. What is this micro-averaged F1 score equivalent to? 2. The Jaccard Index (also known as Intersection over Union or IoU) is quantity that is commonly in semantic segmentation to measure the similarity between the model’s predicted segmentation mask and the ground truth segmentation mask. Implement the Jaccard index (Intersection over Union) metric using the torchmetrics library and add it to the list of metrics returned by the get_metrics() function. What do you observe in the relationship between the macro-averaged F1 and Jaccard Index? 3. Data augmentations are commonly used to the model more robust to various changes (e.g., in brightness, orientation, scale, resolution, etc.) and prevent overfitting on the original (unaugmented) training examples. Use the torchvision.transforms module to incorporate horizontal flipping and random cropping followed by a resize to the original image size (RandomResizedCrop). What are the differences in performance resulting from the augmentations? Do augmentations have a larger effect on accuracy or F1? Hint: Be sure to apply the same transformations to the segmentation masks, and when cropping/resizing, use nearest-neighbor interpolation to avoid floating point labels. 4. We have looked at a subset of our eddy dataset in the Northern Pacific, giving us around 1000 training examples and 50 SSH patches for testing. We can easily double this number

98

4. Deep learning for ocean mesoscale eddy detection

by including an additional region in our dataset. Try using the subset_arrays() function in Section 3 (Data Preparation) to take a subset of the Northern Atlantic (lon_range ¼ (50, 18), lat_range ¼ (14, 46)) and save the train and test masks to npz files. Then pass a list of npz files (e.g., [train_file_pacific, train_file_atlantic]) into the get_eddy_dataloader() function in Section 4.1. How does this affect the model’s performance on: a. The original Pacific test set of 47 SSH patches? b. The combined Pacific + Atlantic test set of 94 SSH patches? 5. Try using training on only the Northern Atlantic dataset from Problem 4 and evaluating on our original Northern Pacific test set. Write down your hypothesis (and justification) of how you expect the model to perform when shown only Northern Atlantic SSH/eddy patches and on the Northern Pacific. Do your experimental results match your hypothesis? Why or why not? 6. In a typical machine learning workflow, researchers/engineers save checkpoints of the model at regular intervals to evaluate the model’s training progress and safeguard against potential interruptions during the training run, in which case the experiment can be restarted from the saved checkpoint. In the relevant cell with a #TODO comment, add a block of code to save a checkpoint every 10 epochs using the checkpoint_path variable. 7. In the final cell, write a function to convert square pixels to square km based on a given pixel resolution (assuming square pixels).

8 Open questions Machine Learning for oceanography and earth science is a nascent yet rapidly developing field. Challenges and opportunities exist on multiple fronts, such as model design, development, validation, and deployment, as summarized by Sun et al. (2022). Mesoscale eddy identification from sea surface height is a well-defined problem that is well suited to various machine learning approaches. The data volume of the available SSH/altimetry data is also relatively small, which removes the need for large computing resources; indeed, the EddyNet model by Lguensat et al. (2018) is very parameter-efficient, and can be trained on a desktop or even a laptop GPU. Challenges and opportunities remain in model architecture improvements, standardized performance validation metrics, and resolution-invariant models that can leverage finer resolution SSH data from the upcoming SWOT mission. Similar to how the open-source py-eddy-tracker and EddyNet models spurred further development in eddy tracking algorithms, a well-organized, open-source model comparison/performance validation platform will increase community engagement. It presents a good opportunity for a cloud-based framework that reduces the barrier in the compatibility of software and hardware in supporting model comparison. The cloud-based framework will also remove the hurdle in technology transfer from model development to model operations, for example, by NASA data centers such as Physical Oceanography Distributed, Active, Archive Center (PO.DAAC), where Petabytes of NASA remote sensing data were hosted on Earthdata cloud (AWS S3). It will become a component of an integrated geoscience system built around Earth observations hosted on the cloud using the power of cloud computing.

References

99

Detecting ocean mesoscale (100 km) eddies from low-resolution sea surface height gridded maps is a focused and rather simple task. Opportunities and challenges exist in extending the machine learning model to (1) extend the capability to smaller-scale ocean eddies and (2) infer the ocean interior state. Specifically, the following questions are at forefront of satellite altimetry, at the time of writing, and warrant future investigation. 1. How can we combine the future SWOT sea surface height observations with current nadir altimetry to retrieve small-scale ocean eddies? SWOT has much higher resolution down to tens of kilometers and with a 120 km swath but with sparse temporal sampling. What is the best machine learning model to achieve this unique spatial-temporal data fusion? 2. We have one ocean but numerous observables, such as the sea surface height discussed here, sea surface temperature and color etc. How can we combine multiple remote sensing variables in detecting ocean eddies? 3. Satellite measurements have the advantage of global coverage with limited temporal resolution. Can machine learning be used to infer the ocean interior structure at small (10 km) scales?

Acknowledgments The work was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration (80NM0018D0004).

References Chelton, D.B., Schlax, M.G., Samelson, R.M., 2011. Global observations of nonlinear mesoscale eddies. Prog. Oceanogr. 91, 167–216. https://doi.org/10.1016/J.POCEAN.2011.01.002. Kingma, D.P., Ba, J., 2014. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet classification with deep convolutional neural networks. In: Pereira, F., Burges, C.J., Bottou, L., Weinberger, K.Q. (Eds.), Advances in Neural Information Processing Systems. Curran Associates, Inc. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2323. https://doi.org/10.1109/5.726791. Lguensat, R., Sun, M., Fablet, R., Mason, E., Tandeo, P., Chen, G., 2018. EddyNet: A deep neural network for pixelwise classification of oceanic eddies. In: Int. Geosci. Remote Sens. Symp. 2018-July, pp. 1764–1767, https://doi. org/10.1109/IGARSS.2018.8518411. Mason, E., Pascual, A., McWilliams, J.C., 2014. A new sea surface height–based code for oceanic mesoscale eddy tracking. J. Atmos. Ocean. Technol. 31, 1181–1188. https://doi.org/10.1175/JTECH-D-14-00019.1. Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: convolutional networks for biomedical image segmentation. In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). vol. 9351. Springer International Publishing, pp. 234–241. Santana, O.J., Herna´ndez-Sosa, D., Martz, J., Smith, R.N., 2020. Neural network training for the detection and classification of oceanic mesoscale eddies. Remote Sens. 12, 2625. https://doi.org/10.3390/RS12162625. Smith, L.N., Topin, N., 2019. Super-convergence: very fast training of neural networks using large learning rates. In: SPIE, p. 36, https://doi.org/10.1117/12.2520589. Sun, Z., Sandoval, L., Crystal-Ornelas, R., Mousavi, S.M., Wang, J., Lin, C., Cristea, N., Tong, D., Carande, W.H., Ma, X., Rao, Y., Bednar, J.A., Tan, A., Wang, J., Purushotham, S., Gill, T.E., Chastang, J., Howard, D., Holt, B., Gangodagamage, C., Zhao, P., Rivas, P., Chester, Z., Orduz, J., John, A., 2022. A review of earth artificial intelligence. Comput. Geosci. 159, 105034. https://doi.org/10.1016/J.CAGEO.2022.105034.

This page intentionally left blank

C H A P T E R

5 Artificial intelligence for plant disease recognition Jayme Garcia Arnal Barbedo Embrapa Digital Agriculture, Campinas, Brazil

1 Introduction 1.1 Plant disease challenge Plant disease has been a serious problem that causes major agriculture losses every year. The crop loss worldwide is estimated annually to be $220 billion USD or 14.1% of crop loss due to plant disease (https://www.fao.org/news/story/en/item/1402920/icode/). Protecting plant health is one of the vital challenges of our society, and detecting plant disease early and quickly is the key first step to tackle the challenge. The detection of symptoms and determination of the associated disorder is still mostly being carried out visually. However, timely coverage of the vast areas is often unfeasible, especially considering workforce shortages observed in many areas (Charlton et al., 2019). In addition, visual assessment is prone to psychological and cognitive phenomena that may lead to bias, optical illusions and, ultimately, to error. Laboratorial analyses such as molecular, immunological or pathogen culturing-based approaches are often time consuming, failing to provide answers in a timely manner (Barbedo, 2016). In this context, the automation of the process using AI techniques becomes an appealing option, especially when combined with digital images. There are, however, a few factors that, if not properly addressed, can lead to technologies that lack the robustness to deal with the variety of conditions found in practice. Some of those factors are intrinsic to the problem of disease recognition, and their impact depends strongly on the crop and disease of interest (Barbedo, 2016). First, usually the symptoms produced by a given disease can have a range of visual characteristics, and those variations need to be represented in the dataset used for training the model. Second, different disorders may produce similar symptoms. This is a very challenging situation, because even if the variability associated to the diseases is properly

Artificial Intelligence in Earth Science https://doi.org/10.1016/B978-0-323-91737-7.00001-3

101

Copyright # 2023 Elsevier Inc. All rights reserved.

102

5. Artificial intelligence for plant disease recognition

represented in the dataset, some degree of confusion will likely occur unless other types of data (e.g., meteorological data, soil properties, etc.) are also employed by means of data fusion (Barbedo, 2022). Third, multiple disorders may be present simultaneously, producing symptoms that may be visually distinct from those produced by the disorders individually. Thus, including this type of situation in the training dataset is important, although this may not be an easy task due to the relative rarity of some disease interactions and the difficulty of identifying those cases in the field. There are also numerous extrinsic factors that need to be taken into account: image background, illumination conditions, intraclass variations, geographic differences between areas, sensor and camera configurations, and camera operation. These factors and their consequences are detailed in Section 2.1.

1.2 Promising AI technique for plant disease detection and classification Machine learning and artificial intelligence techniques have been around for decades. These have been employed in virtually all areas of knowledge, with difference degrees of success. Among those, neural networks have arguably experienced the most hype and also the most disillusionment. In the first half of the 1990 decade, neural networks were viewed as a potential solution to almost all data-related problems. Such high expectations were unrealistic, and although neural networks continued to be applied to different problems, many started to view this type of technique as an option with rather limited potential. However, the development of a new family of neural network architectures with a large number of layers dedicated to very specific tasks radically changed this view. The inception of the so-called deep learning techniques in the early years of the 2010 decade brought neural networks back to the spotlight, but this time the promise showed early has often been realized, as long as the data used for training was representative enough. This type of technique has been particularly successful for image analysis and classification, which is in part due to the remarkable evolution of imaging sensors and computational power. Considering that many of the problems found in agriculture are inherently visual, deep learning techniques quickly became a prime option for solving many agricultural problems. Artificial intelligence and deep learning, in particular, are being increasingly applied to agriculture issues due to their ability to deal with the nonstructured and dynamic conditions found in the field. Classification methods based on deep learning methods usually employ Convolutional Neural Networks (CNNs) (Krizhevsky et al., 2012; Szegedy et al., 2015; Chollet, 2017; Zoph et al., 2018). In the context of plant disease management, the automatic plant disease detection and recognition problems are among the most widely explored by the agricultural research community (Arg€ ueso et al., 2020; Barbedo, 2019; Brahimi et al., 2017; Chen et al., 2020; Darwish et al., 2020; Ferentinos, 2018; Jiang et al., 2019; Johannes et al., 2017; Lee et al., 2020; Liu et al., 2018; Lu et al., 2017; Mohanty et al., 2016; Rahman et al., 2020; Ramcharan et al., 2017; Too et al., 2019; Zhang et al., 2018; Zhong and Zhao, 2020). Detection problems are often treated as a binary classification problem, in which the first class is the object of interest (symptom or sign), and the second class is everything else (Barbedo, 2013, 2021). Although architectures dedicated exclusively to classification can be applied in this case, detection models such as YOLOv3 (Redmon and Farhadi, 2018) and Mask

103

2 Data retrieval and preparation

Image Capture

Image Annotation

Experiment Planning

Model Selection

FIG. 1 Workflow usually adopted for training deep learning models.

R-CNN (He et al., 2017) are often preferred, as these can not only detect the presence of the object of interest, but also indicate where symptoms are located. Segmentation problems usually employ semantic segmentation models such as Deeplab v3 (Chen et al., 2017), SegNet (Badrinarayanan et al., 2016), and U-Net (Ronneberger et al., 2015). All these models have implementations and pretrained versions available in different platforms, and the documentation associated is detailed and accessible (Barbedo, 2021). As a result of these available AI techniques, the number of research articles that use deep learning architectures is steadily growing (Barbedo, 2021). However, most of them share the same problems, one of which is lacking the necessary precautions to avoid unrealistic and biased results. Many AI research papers have serious methodological errors, which indicates that even reviewers have a limited knowledge about the subject. A detailed study about the main methodological problems found in the literature was recently published (Barbedo, 2021). The objective of this text is to provide a deeper analysis of the main aspects to be considered in each step of the workflow usually adopted for training deep learning models (Fig. 1), with special attention being dedicated to the many obstacles that can make the models generated inadequate for practical use.

2 Data retrieval and preparation 2.1 Data variability The input data for AI-powered plant disease detection are normally photos taken by cameras or smartphones. Preparing the input data is a huge task as the training dataset for AI should not just consist of arbitrary photos taken in random locations and time. Many AI experiments for plant disease detection fail because the underlying models are trained with partial data that represents only a small portion of the variety of situations found in practice. Thus, an important challenge to be addressed is building image datasets that comprehensively represent the variability associated with the target problem. This is particularly difficult to achieve in plant disease applications because it is not only difficult to collect images in the field, but there are several factors that introduce additional variation (Barbedo, 2018b). Some of these factors have significant impact and must be considered when collecting training sample photos: Image background: image background (those contextual pixels, excluding the target objects like disease in this case) matters a lot in AI, and most of the time AI is working on processing the background pixels rather than processing the relevant pixels. It is virtually impossible to include all possible types of backgrounds due to the variety of objects that can be present in the field of view (Mohanty et al., 2016; Barbedo, 2018b; Ferentinos, 2018). However, it is possible to reduce the impact of the background by ensuring that the objects of interest occupy most of the field of view when images are captured.

104

5. Artificial intelligence for plant disease recognition

Illumination conditions: The AI models trained with daylight pictures cannot correctly analyze pictures taken at night. We need pictures taken at all natural illumination conditions if planning to make the model operating day and night. Illumination conditions in the field can vary considerably due to factors such as angle of insolation, shadows and specular reflections. In practice, it is difficult to guarantee that users of a potential technology will follow the guidelines and avoid capturing images under extreme illumination conditions, so it is safer to ensure that the models are prepared to deal with suboptimal images. Intraclass variations: The taxonomy of plant diseases is very complicated and changing as scientists continue to discover new diseases, and full of similarities and inconsistent naming. Many diseases in the same group are very similar in color, texture, and symptoms. Certain classes, which in the context of plant pathology correspond to certain disorders, can present different visual characteristics depending on several factors. As a result, the gamut of visual characteristics associated with a given class can be broad, and the training set needs to reflect this in order to produce a robust model. Geographic differences: specific characteristics of a given area may also alter the visual appearance of a given plant disease. The most effective way to deal with this problem is by collecting images in as many different places as possible. Sensor and camera configurations: AI models are very sensitive to the minor differences in photos caused by camera settings which will impact the entire distribution and value range of spectral bands. Cameras have several settings that can be defined manually or automatically such that the images produced have the best possible quality. Although there are preprocessing techniques capable of handling, at least partially, the variations caused by different camera settings, usually the resulting variability is still high enough to cause model fitting problems. Hence, it is recommended that as many different cameras and sensors be employed in the image capture process as possible. Camera operation: AI can almost certainly outperform other models if the provided data has large scale and high quality. The human factor also has an important role in the image capture process. Even when the operators are trained to capture images according to certain protocols, there are differences in the way the camera is handled, in the way the region of interest is framed and in hand steadiness, among others (Barbedo, 2018a). It is almost impossible to consider all possible operation behaviors, but the larger the number of people taking part in the capture process, the more robust the models will tend to be.

2.2 Protocols for image capture A potential way to reduce the variability in captured photos and, consequently, to reduce the number of worrying factors when building the training set, is to impose some image capture protocols among the field surveyors. In particular, illumination conditions could be partially controlled by, for example, guaranteeing that images are captured only at certain times of the day, under specific meteorological conditions and avoiding the presence of shadows and specular reflections. Besides, the background can be significantly simplified by placing a screen behind the object of interest (normally a leaf or a plant). Although these two actions can remove a large portion of the variability found in images captured in the field, they have some disadvantages. Image capture in the field usually is a

3 Step-by-step implementation

105

matter of opportunity. The need to wait for the right conditions can lead to missed opportunities that may not repeat. These procedures can also cause the people responsible for the image collection to lose interest in the task. In turn, the use of a screen to remove the background can disturb and alter the visual characteristics of the object of interest, besides making the image capture process more complex and time consuming. Other actions to reduce variability normally cause similar problems. In general, the best course of action is to avoid protocols that increase the complexity of the image capture process, unless either the factor that introduces variability is too damaging, or if the trained models are intended to be used in specific situations (for example, as an aiding tool in research experiments).

2.3 Image annotation A fundamental part of the development and training of AI models is the reference data annotation, in which the correct classes and possibly other relevant pieces of information are associated with each sample. This labeling is manual work and requires a big amount of human labor and costs. There are many challenges involved in this task, but two are particularly difficult. First, even experienced plant pathologists have trouble identifying the correct classes, which may lead to label errors and, consequently, to flawed training (Barbedo, 2018a, 2021). Second, the annotation process is inherently subjective and, as a result, it is subject to cognitive and psychological phenomena that can lead to biases and optical illusions and, ultimately, to error (Bock et al., 2010, 2020). Although there are some tools to assist the annotation process (Verma et al., 2020), these can only partially mitigate those problems. The most appropriate solution in both cases is to employ several people in the process, and then apply the majority rule whenever there is disagreement. This strategy tends to reduce inconsistencies in the annotation process, improving the quality and reliability of both the dataset and trained models.

3 Step-by-step implementation In principle, deep learning models can be implemented in a number of different programming languages. However, Python has been the preferred language by the community. With some basic to intermediate knowledge on this language, implementing deep learning models is relatively simple due to the many excellent tutorials available online. A real code used to train the NasNet Large model to recognize soybean rust is presented below as an example, together with some brief explanation regarding each block of code. A more detailed explanation about some key aspects of the implementation is provided afterward. # Import basic packages (they may vary) from __future__ import absolute_import, division, print_function, unicode_literals import os import tensorflow as tf from tensorflow import keras import numpy as np

106

5. Artificial intelligence for plant disease recognition

import matplotlib.pyplot as plt import matplotlib.image as mpimg # split data into train, validation and test import math import os import shutil import pathlib import random data_path = '/tf/notebooks/data2' splitted_data_path = os.path.join(data_path, 'splitted_data') train_path = os.path.join(splitted_data_path, 'train') validation_path = os.path.join(splitted_data_path, 'validation') test_path = os.path.join(splitted_data_path, 'test') train_size = 0.7 validation_size = 0.1 test_size = 0.2 shutil.rmtree(splitted_data_path, ignore_errors=True) data_root = pathlib.Path(data_path) classes = [] for class_subdir in data_root.iterdir(): class_image_paths = [str(path) for path in class_subdir.iterdir()] random.shuffle(class_image_paths) class_name = os.path.basename(str(class_subdir)) classes.append([class_name, class_image_paths]) os.mkdir(splitted_data_path) os.mkdir(train_path) os.mkdir(validation_path) os.mkdir(test_path) classes_len = [len(el[1]) for el in classes] print(classes_len) min_class_len = min(classes_len) for elem in classes: label = elem[0] print(label) files = elem[1]

3 Step-by-step implementation

107

class_len = len(files) os.mkdir(train_path_wlabel) i = 0 j = math.floor(class_len*train_size) k = math.floor(class_len*validation_size) l = math.floor(class_len*test_size) print(str(i)+' - '+str(i+j)) for file in files[i:i+j]: dst = os.path.join(train_path_wlabel, os.path.basename(file)) shutil.copyfile(file.dst) os.mkdir(validation_path_wlabel) print(str(i+j+1)+' - '+str(i+j+k)) for file in files[i+j+1:j+k]: dst = os.path.join(validation_path_wlabel, os.path.basename(file)) shutil.copyfile(file, dst) os.mkdir(test_path_wlabel) print(str(i+j+k+1)+' - '+str(i+j+k+l)) for file in files[i+j+k+1:i+j+k+l]: dst = os.path.join(test_path_wlabel, os.path.basename(file)) shutil.copyfile(file, dst) # Resizing the images to the correct dimensions import PIL from PIL import Image def myFunc(image): new_im = Image.fromarray(image.astype('uint8')) res_im = new_im.resize((56,56), resample=PIL.Image.BILINEAR) res2_im = res_im.resize((331,331), resample=PIL.Image.BILINEAR) converted_img = np.array(res2_im) return converted_img.astype('float32') image_size = 331 # ALl images will be resized to 331x331 batch_size = 32 # Rescale all images by 1./255 and apply image augmentation train_datagen = keras.preprocessing.image.ImageDataGenerator(rotation_range=45, width_shift_range=0.2, height_shift_range=0.2, brightness_range=(0.8, 1.2), shear_range=0.2, zoom_range=0.2, preprocessing_function=myFunc)

108

5. Artificial intelligence for plant disease recognition

validation_datagen = keras.preprocessing.image.ImageDataGenerator(rescale=1./255, preprocessing_function=myFunc) # Flow training images in batches of 32 using train_datagen generator train_generator = train_datagen.flow_from_directory( train_path, #source directory for the training images batch_size = batch_size, class_mode = 'categorical') IMG_SHAPE = (image_size, image_size, 3) # Create the base model from the pre-trained model NasNet Large base_model = keras.applications.nasnet.NASNetLarge(input_shape=IMG_SHAPE, include_top=False, weights='imagenet') base_model.trainable = False model = tf.keras.Sequential([ base_model, keras.layers.GlobalAveragePooling2D(), keras.layers.Dense(2, activation = 'sigmoid')]) model.compile(optimizer=tf.keras.optimizers.RMSprop(lr=0.0001), loss='categorical_crossentropy', metrics=['accuracy']) epochs=10 steps_per_epoch = train_generator.n // batch_size validation_steps = validation_generator.n // batch_size history = model.fit_generator(train_generator, steps_per_epoch = steps_per_epoch, epochs=epochs, workers = 4, validation_data = validation_generator, validation_steps = validation_steps) # Visualize training results acc = history.history['acc'] val_acc = history.history['val_acc'] loss = history.history['loss'] val_loss = history.history['val_loss'] plt.figure(figsize=(8,8)) plt.subplot(2,1,1)

3 Step-by-step implementation

109

plt.plot(acc, label='Training Accuracy') plt.plot(val_acc, label='Validation Accuracy') plt.legend(loc='lower right') plt.ylabel('Accuracy') plt.ylim([min(plt.ylim()), 1]) plt.title('Training and Validation Accuracy') Y_pred = model.predict_generator(validation_generator, validation_generator.m // batch_size + 1, workers = 0) y_pred = np.around(-(Y_pred[:, 0]-1))

Experiments involving machine learning, and deep learning in particular, need to be carefully planned in order to generate the best possible model. Among the aspects that need to be considered, some are particularly important: Architectures: as artificial intelligence algorithms evolve, new architectures arise and new structures are incorporated into the models. Besides the standard architectures, it is possible to combine different elements such that new architectures specifically devoted to the problem at hand are created. The decision about which strategy to adopt depends on some factors. Customized architectures have the potential to yield better results, but finding the ideal structure may require a lengthy experimental process, without any guarantee that the performance of the resulting architecture will be better than that achieved by standard architectures. Automatized machine learning (AutoML) like AutoSklearn and Pycaret can help select the best performing models. If existing architectures are employed, it is important to consider that, in general, there is a tradeoff between performance and complexity and size of the network. Selecting the most suitable architecture must rely on a careful analysis of these factors. Transfer learning: To avoid duplicated training and wasting computing resources on re-training large models, most standard architectures are made available already pretrained using databases like ImageNet (Deng et al., 2009). Transfer learning normally works well, being effective in accelerating considerably the network convergence. Although there are specific cases in which training from scratch is a better option, most studies indicate transfer learning as the best option in most cases (Boulent et al., 2019). Parameters to be optimized: deep neural networks have a number of parameters that can be optimized, such as minibatch size, learning rates and input size, among others. Since training deep networks can be very time consuming, testing all different parameter combinations is usually unfeasible. In general, the best strategy is to separately test some carefully selected parameters until the most suitable values are found. Class balance: if the number of samples varies too much between classes, training may favor the most numerous ones, leading to severe bias. It is good practice to even out the number of samples, either by limiting the maximum number of samples in each class, or by augmenting smaller classes using the techniques described in the next item. Image augmentation: due to the rareness of the training data, researchers come up with some workaround methods to fix those pressing issues like biases and missing to improve the quality. Data augmentation is one of those techniques which are frequently applied to image sets in order to artificially increase the size of the dataset and the variability of the data used for training. Augmentation can also be used to partially compensate for class

110

5. Artificial intelligence for plant disease recognition

imbalances. The most common augmentation operations include image rotation, inversion and shifting, histogram equalization, contrast enhancement, and adaptive equalization (Barbedo, 2018b). The code at the beginning of this section shows how to apply many of those augmentation operations. Augmentation can be a valuable tool, especially when the image set is relatively small. However, this type of technique should not be used indiscriminately. Augmentation should be applied only to the training set. It is possible to apply augmentation to the test set after the division into subsets, but in most cases this action is superfluous. If augmentation is applied before the separation into the training and test subsets, the subsequent random subset division may cause the same images, with only very minor variations, to be present in both subsets, causing a strong bias on the observed results. Also, if too many image augmentation operations are applied, this may cause the training set to be very redundant, which may lead to unforeseen effects on the way the network is trained. Unfortunately, there are many peer-reviewed published articles that use this improper strategy, a fact that has frequently been used as a justification for the adoption of this strategy by others (Sladojevic et al., 2016; Liu et al., 2018; Zhang et al., 2018; Jiang et al., 2019). Training/test procedure: the most commonly used ratios between the number of samples in the training and test subsets are 70%/30% and 80%/20%. Some studies also employ a validation subset, usually containing 10% of all samples (as in the code above), in order to assess the accuracy of the model as the training process progresses. These proportions are well established and, in general, do not need to be altered. The ideal number of epochs to be adopted for training depends on the convergence speed, which is highly dependent on the characteristics of the data and number of classes. Once accuracy begins to stabilize, training should be interrupted in order to avoid waste a time and the possibility of overfitting. In the example shown in Fig. 2, accuracy began to stabilize around 0.95, so the training process was interrupted after the ninth epoch. Other aspects of the training process, such as learning rate, type of optimizer, size of mini-batch, etc., normally are determined case by case, but using values found in codes available online for similar applications is usually a good practice, and few researchers actually spend significant time optimizing those values. In the code above, learning rate was 0.0001, the optimizer was the cross-entropy, and the mini-batch size was 32. Cross-validation: cross-validation is of fundamental importance for producing meaningful and reliable results. Especially in cases in which the dataset does not contain much variation, if a single data partition is used the distributions of the training and test sets can be biased, inadvertently favoring or penalizing the models being compared. In other words, bias caused by unfavorable data distributions in the training and test sets can lead to strongly biased and unrealistic results. The most effective way to avoid this is by applying cross-validation with at least 5 folds (Brahimi et al., 2017; Lu et al., 2017; Picon et al., 2019; Sladojevic et al., 2016), but many studies do not adopt this strategy (Arg€ ueso et al., 2020; Chen et al., 2020; Darwish et al., 2020; Esgario et al., 2020; Ferentinos, 2018; Jiang et al., 2019; Li et al., 2020; Liu et al., 2018; Ramcharan et al., 2017; Too et al., 2019; Zhong and Zhao, 2020). Cross validation is very useful to avoid the trained models leaning toward certain major diseases too much while ignoring or misclassifying minor disease classes. Independent datasets: covariate shift is the phenomenon in which differences in the distributions of the data used for training the model and the data used for testing and validation result in low accuracy (Barbedo, 2017, 2018a,b, 2021). In many cases, although test, validation and training samples are distinct, normally they come from the same dataset, naturally

111

4 Experimental results and how to select a model

Training and Validation Accuracy

1.00 0.95

Accuracy

0.90 0.85 0.80 0.75

Training Accuracy Validation Accuracy

0.70 0

2

4

6

8

Training and Validation Loss 0.6

Training Loss Validation Loss

Cross Entropy

0.5 0.4 0.3 0.2 0.1 0.0 0

2

4

6

8

FIG. 2 Example of convergence curves for a deep neural network. The numbers in the x axis are the number of epochs.

possessing some degree of correlation among them. Indeed, some authors have observed sharp decreases in accuracy when the trained model is applied to different image datasets (Ferentinos, 2018; Mohanty et al., 2016). There are some ways to mitigate this type of problem, especially through domain adaptation techniques, but for more realistic results, it is always recommended to employ a separate dataset to assess the models (Chen et al., 2020; Lee et al., 2020).

4 Experimental results and how to select a model Normally, several distinct architectures are tested for the selection of the best model. Among the factors that should be considered to guide the final model selection, three are particularly important: accuracy, model complexity and generalization capability. There are several possible measurements to assess the accuracy of a model, depending on the type of the problem at hand, such as precision and recall in the case of classification

112

5. Artificial intelligence for plant disease recognition

problems and Intersection over Union (IoU) in the case of segmentation problems. Accuracy is the most obvious criterion for choosing a model, but in isolation it may not be enough to point out the best option. There are a few ways to organize and present the accuracies obtained by the trained models. In the case of plant disease recognition, overall accuracies (Table 1) and confusion matrices (Fig. 3) are arguably the most common. Accuracy tables can also contain a detailed statistical analysis, including different correlation coefficients and bias measurements. The confusion matrix reveals both the accuracy of the algorithm (main diagonal) and how the errors are distributed (other cells). Confusion matrices can be presented in absolute terms or in percentages, as is the case in Fig. 3. Rows and columns represent actual and predicted classes, respectively. In the example shown in Fig. 3, the model correctly identified the class “Leprosis” in 50% of the cases, misclassifying “Leprosis” samples as “Canker” and “Greasy Spot” in 19% and 31% of the cases, respectively. Model complexity is another important factor. The training process of deep learning networks is almost always computationally intensive, sometimes requiring days or even weeks TABLE 1 Overall accuracies obtained in Barbedo (2019). # Classes

# Images

Accuracy Expanded

Crop

Original

Expanded

Original

Expanded

Original images

Background removed

C

R

Common Bean

5

10

64

3079

83%

95%

94%

91%

Cassava

3

3

37

895

92%

83%

100%

100%

Citrus

7

9

87

1868

79%

62%

96%

93%

Coconut Tree

4

5

77

1504

97%

97%

98%

97%

Corn

7

11

165

10,480

60%

66%

75%

74%

Coffee

6

6

142

1899

76%

77%

89%

86%

Cotton

3

3

95

2023

100%

100%

99%

99%

Cashew Tree

3

6

78

4509

88%

83%

98%

96%

Grapevines

4

6

72

2330

75%

81%

96%

91%

Kale

0

2

0

196





100%



Passion Fruit

2

3

40

280

50%

90%

80%

80%

Soybean

8

9

377

13,733

82%

76%

87%

86%

Sugarcane

3

3

110

2773

93%

100%

99%

97%

Wheat

3

4

73

840

92%

61%

99%

98%

Total

56

81

1383

46,135

82%

82%

94%

91%

5 Discussion

113

FIG. 3 Example of confusion matrix given in terms of percentages.

to be completed, depending on the equipment and amount of data. Although most models can be executed even in equipment with limited computational power, real time applications sometimes require lighter models to be viable. In some circumstances, models may achieve very high accuracies during their development, but fail when used under real conditions, revealing poor generalization capabilities. Since data used for training almost always have some representativeness gaps (Barbedo, 2021), it is important that the adopted architecture have good generalization capability to deal with those omitted cases. This factor is not easy to evaluate, but it is fundamentally important to assess the potential of the model. This once again highlights the importance of using independent datasets to evaluate the models, as previously discussed.

5 Discussion In an academic research context, the work frequently stops once the model is trained and properly evaluated and the results are published. However, if the model is expected to be used in practice, there are some other challenges and steps that need to be addressed. As discussed earlier, it is very important to know the exact generalization capabilities of the model, that is, how prepared is the model to deal with images coming from different sensors, regions and seasons. This is directly related to the data issues mentioned in Section 2.1. In plant pathology, it is very difficult to build datasets that cover the entire range of variations. As a result, in practice there will be situations for which the model was not properly trained. In turn, this means that reliable results are expected only in situations that are similar to those present in the original training dataset. This is almost certainly that main reason for the lack of dependable applications for plant pathology. The most obvious solution is to add more

114

5. Artificial intelligence for plant disease recognition

variability to the training dataset, but this takes time and it is not always feasible. This being said, depending on the characteristics of the problem and the model being adopted, there may be some robustness to situations not present in the training set. Determining the exact range of situations for which the model can be applied is not simple and likely involves more research. One way to do this is have some partners (preferably producers) willing to use the technology, generate data and report possible problems. This type of strategy has been adopted before by companies like PEAT (https://plantix.net), with some success. This not only allows for a finer delineation of the technology’s applicability, but also generates new samples that can be used to retrain the model. In any case, this involves actions that go far beyond the technical issues involved in the problem. An alternative that has been producing promising results is the employment of domain adaptation techniques. Domain Adaptation is a particular case of transfer learning that leverages labeled data in one or more related source domains, to learn a classifier for unseen or unlabeled data in a target domain (Csurka, 2017). In other words, the characteristics of new data to be processed by the model are artificially altered to resemble that characteristics of the data used for training. For example, if the model was trained only with images taken under direct sunlight, but it has now to deal with images taken under overcast conditions, the latter can in theory be altered to more closely resemble the former ones, potentially reducing error rates. This type of strategy is being actively studied and should be increasingly employed in the near future. Another important aspect that has a great impact on the usability of the technology is the user interface. User profiles can be very heterogeneous in terms of preferences, age, level of formal education, technology proficiency, etc. If the technology is expected to be used by a large number of people, the interface needs to meet, as much as possible, all their expectations. This is far from trivial, and may involve several trials with different groups of people until a suitable interface is achieved. Past experiences indicate that a good interface for use in agricultural fields should be simple and fast to use, provide fast and easy to read information, but also offer more sophisticated options for a more in-depth analysis of the situation. Another feature that can be very useful is the detection of images of poor quality. Although the technology can be distributed with basic instructions on how to properly capture the images, in practice it is very difficult to guarantee that those will be followed. Fortunately, there are some well-established methods for detecting unsuitable images (Zhai and Min, 2020), in which case the user is asked to take another picture. Finally, it is important to consider that no AI-based technology is perfect. One interesting feature of deep learning models is that they can provide not only a classification, but also the degree of confidence of that answer. It is a good idea to provide this information to users in order to make it clear that mistakes are possible. This highlights the fact that the answers provided by AI should not be taken blindly. In fact, the main purpose of AI technologies applied to agriculture is to quickly provide reliable information to support the decision-making process. The idea is not to eliminate human involvement, but rather to close eventual information gaps that might lead to poor decisions. This is very important to have in mind, because if those information gaps do not exist, employing those technologies can be redundant and even add noise to the decision-making process.

8 Open questions

115

6 Conclusion The use of artificial intelligence algorithms, and deep learning neural networks in particular, has been constantly growing in a myriad of applications, including those relevant to the agricultural sector. Although the steps to generate trained models are relatively straightforward, there are many methodological subtleties that are often ignored, leading to unreliable results. This is particularly true in applications aimed at plant pathology monitoring and management. The objective of this article was to discuss some guidelines aimed at avoiding some of the problems observed when artificial-intelligence-based applications are used in practice, highlighting some of the main precautions that can be adopted in each step of the process.

7 Assignment 1. Choose a plant disease classification problem. For a first contact with AI models, a binary problem would be ideal. In this case, only two classes will exist, one with the disease of interest present and the other with the disease of interest absent. 2. Build the dataset to be used for training and evaluating the models. Building a new dataset from scratch is ideal, but since this might not be feasible, there are some good plant pathology datasets that are freely available for noncommercial purposes, such as Digipathos (https://www.digipathosrep.cnptia.embrapa.br) and PlantVillage (https:// plantvillage.psu.edu/). Samples representing each class should be stored in separate folders. The sample set representing the disease of interest should contain images of symptoms with different degrees of severity, while the other set should contain samples of both healthy leaves and leaves with symptoms produced by other disorders. 3. Train at least three models, using either the code provided in this chapter or the example codes found on the web as reference. Use the proportions 70/20/10 for the training, test and validation sets, and the training parameters adopted in the example code of Section 3. This task can be carried out more easily by using a platform such as Jupyter Notebook. It is worth noting that, although knowledge on Python programming language is recommended, it is possible to learn the basics “on-the-fly” by implementing the models and solving eventual problems with the aid of the many tutorials found online. 4. Train each model multiple times using different parameter values and carefully observe the differences in accuracy and convergence speed. 5. Compile the results and write a report outlining the performance yielded by each model with each combination of parameters. The results should include some basic statistical analysis (correlation coefficients, bias measurements).

8 Open questions Artificial intelligence has evolved to a point in which models trained with good data will almost certainly yield good results. Because of the variability present in agricultural fields, it

116

5. Artificial intelligence for plant disease recognition

is very difficult to build datasets that are representative enough. As data sharing practices become more widespread and citizen science initiatives become more effective, the availability of plant pathology images and data should increase. As discussed before, domain adaptation seems to be a good alternative to make models more robust to different conditions. While the techniques that can be applied are well established and understood, how to apply them effectively in the plant pathology context is still an open question that needs to be further investigated. Another interesting open question is the possibility of using samples from different crops for training the same model. Many plant diseases are common to many crops, so it would be very convenient if it would be possible to train models capable of recognizing diseases across different plant species. Studies on this matter are incipient and more research is needed for stronger conclusions. Finally, it is worth pointing out that in some cases the images do not carry enough information to resolve the classification problem. In cases like this, assertive answers may only be possible with the aid of ancillary data like soil properties and meteorological data. Data fusion, which seeks the effective combination of different data sources, is another very active line of research that could be further explored in the context of plant pathology.

References Arg€ ueso, D., Picon, A., Irusta, U., Medela, A., San-Emeterio, M.G., Bereciartua, A., Alvarez-Gila, A., 2020. Few-shot learning approach for plant disease classification using images taken in the field. Comput. Electron. Agric. 175, 105542. https://doi.org/10.1016/j.compag.2020.105542. Badrinarayanan, V., Kendall, A., Cipolla, R., 2016. Segnet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. arXiv:1511.00561. Barbedo, J.G.A., 2013. Digital image processing techniques for detecting, quantifying and classifying plant diseases. Springerplus 2, 660. https://doi.org/10.1186/2193-1801-2-660. Barbedo, J.G.A., 2016. A review on the main challenges in automatic plant disease identification based on visible range images. Biosyst. Eng. 144, 52–60. https://doi.org/10.1016/j.biosystemseng.2016.01.017. Barbedo, J.G.A., 2017. A new automatic method for disease symptom segmentation in digital photographs of plant leaves. Eur. J. Plant Pathol. 147, 349–364. https://doi.org/10.1007/s10658-016-1007-6. Barbedo, J.G., 2018a. Factors influencing the use of deep learning for plant disease recognition. Biosyst. Eng. 172, 84–91. https://doi.org/10.1016/j.biosystemseng.2018.05.013. Barbedo, J.G.A., 2018b. Impact of dataset size and variety on the effectiveness of deep learning and transfer learning for plant disease classification. Comput. Electron. Agric. 153, 46–53. https://doi.org/10.1016/j. compag.2018.08.013. Barbedo, J.G.A., 2019. Plant disease identification from individual lesions and spots using deep learning. Biosyst. Eng. 180, 96–107. https://doi.org/10.1016/j.biosystemseng.2019.02.002. Barbedo, J.G.A., 2021. Deep learning applied to plant pathology: the problem of data representativeness. Trop. Plant Pathol. 2021. https://doi.org/10.1007/s40858-021-00459-9. Barbedo, J.G.A., 2022. Data fusion in agriculture: resolving ambiguities and closing data gaps. Sensors 22, 2285. https://doi.org/10.3390/s22062285. Bock, C.H., Poole, G.H., Parker, P.E., Gottwald, T.R., 2010. Plant disease severity estimated visually, by digital photography and image analysis, and by hyperspectral imaging. Crit. Rev. Plant Sci. 29, 59–107. https://doi.org/ 10.1080/07352681003617285. Bock, C.H., Barbedo, J.G.A., Ponte, E.M.D., Bohnenkamp, D., Mahlein, A.K., 2020. From visual estimates to fully automated sensor-based measurements of plant disease severity: status and challenges for improving accuracy. Phytopathol. Res. 2. https://doi.org/10.1186/s42483-020-00049-8. Boulent, J., Foucher, S., Theau, J., St-Charles, P.L., 2019. Convolutional neural networks for the automatic identification of plant diseases. Front. Plant Sci. 10. https://doi.org/10.3389/fpls.2019.00941.

References

117

Brahimi, M., Boukhalfa, K., Moussaoui, A., 2017. Deep learning for tomato diseases: classification and symptoms visualization. Appl. Artif. Intell. 31, 299–315. https://doi.org/10.1080/08839514.2017.1315516. Charlton, D., Taylor, J.E., Vougioukas, S., Rutledge, Z., 2019. Innovations for a shrinking agricultural workforce. Choices 34 (2), 1–8. Chen, L.C., Papandreou, G., Schroff, F., Adam, H., 2017. Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv:1706.05587. Chen, J., Chen, J., Zhang, D., Sun, Y., Nanehkaran, Y., 2020. Using deep transfer learning for image-based plant disease identification. Comput. Electron. Agric. 173, 105393. https://doi.org/10.1016/j.compag.2020.105393. Chollet, F., 2017. Xception: deep learning with depthwise separable convolutions. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1800–1807, https://doi.org/10.1109/CVPR.2017.195. Csurka, G., 2017. Domain Adaptation for Visual Applications: A Comprehensive Survey. arXiv: 1702.05374. Darwish, A., Ezzat, D., Hassanien, A.E., 2020. An optimized model based on convolutional neural networks and orthogonal learning particle swarm optimization algorithm for plant diseases diagnosis. Swarm Evol. Comput. 52, 100616. https://doi.org/10.1016/j.swevo.2019.100616. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. Imagenet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, pp. 248–255. Esgario, J.G., Krohling, R.A., Ventura, J.A., 2020. Deep learning for classification and severity estimation of coffee leaf biotic stress. Comput. Electron. Agric. 169, 105162. https://doi.org/10.1016/j.compag.2019.105162. Ferentinos, K.P., 2018. Deep learning models for plant disease detection and diagnosis. Comput. Electron. Agric. 145, 311–318. https://doi.org/10.1016/j.compag.2018.01.009. He, K., Gkioxari, G., Dolla´r, P., Girshick, R., 2017. Mask r-cnn. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, https://doi.org/10.1109/ICCV.2017.322. Jiang, P., Chen, Y., Liu, B., He, D., Liang, C., 2019. Real-time detection of apple 305 leaf diseases using deep learning approach based on improved convolutional neural networks. IEEE Access 7, 59069–59080. https://doi.org/ 10.1109/ACCESS.2019.2914929. Johannes, A., Picon, A., Alvarez-Gila, A., Echazarra, J., Rodriguez-Vaamonde, S., Navajas, A.D., Ortiz-Barredo, A., 2017. Automatic plant disease diagnosis using mobile capture devices, applied on a wheat use case. Comput. Electron. Agric. 138, 200–209. https://doi.org/10.1016/j.compag.2017.04.013. Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. In: Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 1106–1114. Lee, S.H., Goe¨au, H., Bonnet, P., Joly, A., 2020. New perspectives on plant disease characterization based on deep learning. Comput. Electron. Agric. 170, 105220. https://doi.org/10.1016/j.compag.2020.105220. Li, D., Wang, R., Xie, C., Liu, L., Zhang, J., Li, R., Wang, F., Zhou, M., Liu, W., 2020. A recognition method for rice plant diseases and pests video detection based on deep convolutional neural network. Sensors 20. https://doi.org/ 10.3390/s20030578. Liu, B., Zhang, Y., He, D., Li, Y., 2018. Identification of apple leaf diseases based on deep convolutional neural networks. Symmetry 10. https://www.mdpi.com/2073-8994/10/1/11. Lu, Y., Yi, S., Zeng, N., Liu, Y., Zhang, Y., 2017. Identification of rice diseases using deep convolutional neural networks. Neurocomputing 267, 378–384. https://doi.org/10.1016/j.neucom.2017.06.023. Mohanty, S.P., Hughes, D.P., Salathe, M., 2016. Using deep learning for image based plant disease detection. Front. Plant Sci. 7, 1419. https://doi.org/10.3389/fpls.2016.01419. Picon, A., Alvarez-Gila, A., Seitz, M., Ortiz-Barredo, A., Echazarra, J., Johannes, A., 2019. Deep convolutional neural networks for mobile capture 335 device-based crop disease classification in the wild. Comput. Electron. Agric. 161, 280–290. https://doi.org/10.1016/j.compag.2018.04.002. Rahman, C.R., Arko, P.S., Ali, M.E., Iqbal Khan, M.A., Apon, S.H., Nowrin, F., Wasif, A., 2020. Identification and recogni 340 tion of rice diseases and pests using convolutional neural networks. Biosyst. Eng. 194, 112–120. https://doi.org/10.1016/j.biosystemseng.2020.03.020. Ramcharan, A., Baranowski, K., McCloskey, P., Ahmed, B., Legg, J., Hughes, D.P., 2017. Deep learning for imagebased cassava disease detection. Front. Plant Sci. 8, 1852. https://doi.org/10.3389/fpls.2017.01852. Redmon, J., Farhadi, A., 2018. Yolov 3: An Incremental Improvement. arXiv:1804.02767. Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: convolutional networks for 350 biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F. (Eds.), Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Springer International Publishing, Cham, pp. 234–241.

118

5. Artificial intelligence for plant disease recognition

Sladojevic, S., Arsenovic, M., Anderla, A., Culibrk, D., Stefanovic, D., 2016. 355 Deep neural networks based recognition of plant diseases by leaf image classification. Comput. Intell. Neurosci. 2016. https://doi.org/ 10.1155/2016/3289801. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9, https://doi.org/10.1109/CVPR.2015.7298594. Too, E.C., Yujian, L., Njuki, S., Yingchun, L., 2019. A com parative study of fine-tuning deep learning models for plant dis ease identification. Comput. Electron. Agric. 161, 272–279. https://doi.org/10.1016/j.compag.2018.03.032. Verma, S., Bhatia, A., Chug, A., Singh, A.P., 2020. Recent Advancements in Multimedia Big Data Computing for IoT Applications in Precision Agricul 370 ture: Opportunities, Issues, and Challenges. Springer Singapore, Singapore, pp. 391–416, https://doi.org/10.1007/978-981-13-8759-3_15. Zhai, G., Min, X., 2020. Perceptual image quality assessment: a survey. SCIENCE CHINA Inf. Sci. 63, 211301. https:// doi.org/10.1007/s11432-019-2757-1. Zhang, S., Wang, H., Huang, W., You, Z., 2018. Plant diseased leaf segmentation and recognition by fusion of superpixel, k-means and phog. Optik 157, 866–872. https://doi.org/10.1016/j.ijleo.2017.11.190. Zhong, Y., Zhao, M., 2020. Research on deep learning in apple leaf disease recognition. Comput. Electron. Agric. 168, 105146. https://doi.org/10.1016/j.compag.2019.105146. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V., 2018. Learning Transferable Architectures for Scalable Image Recognition. arXiv:1707.07012.

C H A P T E R

6 Spatiotemporal attention ConvLSTM networks for predicting and physically interpreting wildfire spread Arif Masrur and Manzhu Yu Pennsylvania State University, University Park, PA, United States

1 Introduction Wildfires cause billions of dollars in damage each year in the United States, and their frequency increased over the past decade due to climate change. As of September 2, 2022, nearly 48,500 wildfires have burned about 6.2 million acres in the United States (https://crsreports. congress.gov/). While wildfires can benefit ecological processes, communities can be negatively impacted. For instance, wildfires burned nearly 6000 structures in 2021, the majority of which occurred in California. This state continues to experience longer wildfire seasons in recent years due to increased drought and heat-induced by climate change (https://www. fire.ca.gov/incidents/). Accurate prediction of wildfire occurrence and spread is thus consequential to safeguarding human life, the economy, and the environment. However, accurately predicting a spatiotemporal event like wildfire is a challenging task due to the changing dynamics and morphology of the fire-front in space and over a relatively short period of time. Furthermore, existing physical models are typically difficult to develop as they are data-intensive, whereas deep learning (DL) approaches are not readily scalable spatiotemporally and lack physical interpretability. Here, we present a novel Geographic Artificial Intelligence (GEO-AI) approach to address the challenge of accurate prediction and physical interpretation in a common Earth AI application: dynamic/stochastic spatiotemporal progression of wildfire spread. The proposed approach and a comparison with the benchmark are presented within a spatial data science life cycle (Fig. 1), where three steps include the generic Earth AI workflow (Sun et al., 2022) such as problem statement, data preparation, model building—training & validation, and posthoc analysis (sensitivity tests and physical interpretations of trained models). Artificial Intelligence in Earth Science https://doi.org/10.1016/B978-0-323-91737-7.00009-8

119

Copyright # 2023 Elsevier Inc. All rights reserved.

120

6. Spatiotemporal attention ConvLSTM networks

FIG. 1 A typical spatial data science life cycle for Earth AI applications that supports iterative human-in-the-loop interactions in the model training, prediction, and model interpretation stages.

1.1 Technical contributions Existing wildfire spread models can be categorized into three classes: physical, empirical, and semiempirical (Burge et al., 2020; Finney, 1994; Radke et al., 2019; Rothermel, 1972). Physical models based on the laws of physics and chemistry are too mechanistic and dataintensive [i.e., requiring physical documentation of fire ecology and atmospheric processes; Weather Research Forecast (WRF)-FIRE model (Coen et al., 2013; Mandel et al., 2011)], whereas purely empirical models may suffer from poor generalizability across environmental gradients. In recent years, machine learning (ML) models have gained popularity in modeling wildfire dynamics (Burge et al., 2020; Green et al., 2020; Jain et al., 2020; Kozik et al., 2014; Radke et al., 2019; Subramanian and Crowley, 2018). These ML-based approaches can improve the accuracy of physical methods while remaining flexible toward learning environmentally heterogeneous processes in space and time. Among the ML approaches, the Convolutional LSTM (ConvLSTM) offers an ideal framework for spatiotemporal prediction (Shi et al., 2015), as it is capable of handling both spatial and temporal correlations. The generic convolutional operation extracts informative features by blending cross-channel and spatial information together. However, the spatial propagation for wildfire is probabilistic in nature and thus can manifest variable local and long-range spatiotemporal dependencies (Cai et al., 2020; Quinn et al., 2019). As suggested by the buoyant flame dynamics in wildfire spread (Finney et al., 2015), biophysical factors, such as topography, heterogeneous fuel patterns, moisture of dead and live vegetation, and atmospheric dynamics influence the spatiotemporal propagation process of fire-fronts at scale—e.g., depending on fuel size and distributions. The change in biophysical dynamics over time can potentially control wildfire spread. Thus, capturing important event-driver relationships within the local neighbors (i.e., smaller receptive field) and across time could be useful for accurate prediction and interpretation. In this chapter, we address this modeling problem by introducing a space-time attention mechanism within the sequence-to-sequence modeling framework, specifically in the ConvLSTM, that can increase representation power to focus on important predictive features and suppress unnecessary ones in space and time. Our contribution is twofold: • Development of attention-based sequence-sequence modeling frameworks using Convolutional Long Short-Term Memory (ConvLSTM) networks to accurately predict wildfire spread dynamics. • Development of a physical interpretability module with variable-wise space-time ConvLSTM networks to capture the feature importance dynamics in space and time.

2 Methodology

121

2 Methodology The attention technique in DL allows the modeling of dependencies without regard to their distance in the input or output sequences (Vaswani et al., 2017). In particular, a self-attention mechanism allows the inputs to interact with each other (hence “self”) to identify who and where they should pay more attention (hence “attention”). The outputs of self-attention are the aggregates of those interactions and resulting attention scores. For the wildfire spread prediction and interpretation, we integrate two different variants of attention mechanisms to the ConvLSTM: spatial and channel (Woo et al., 2018). As mentioned above, wildfire spread is probabilistic in nature, depending on interactions among multiple interrelated biophysical and human factors. Hence, we focus both on the prediction and physical interpretability of our proposed approaches. Our methodological contribution is the spatiotemporal attention-based sequence forecasting frameworks using the ConvLSTM network that predicts the spatial propagation of fire-fronts at multitime steps based on sequentially learning the dynamic wildfire-environmental driver relationships or dependencies (Fig. 2).

2.1 ConvLSTM network The typical ConvLSTM network (Fig. 3) combines a convolutional neural network and an LSTM network. This combination enables the network to capture local spatiotemporal relationships in the sequential 4D (i.e., time  height  width  predictors) image data, which are used to predict fixed length, i.e., next time-steps of fire spread occurrences in the 2D lattice.

2.2 Attention-based methods for ConvLSTM networks Attention mechanisms were introduced in encoder-decoder models to facilitate the flexible utilization of the most relevant information from the input sequence. This relevancy is captured from the encoded input vectors by weights, with the most relevant information being attributed as the highest weights. Various attention mechanisms have been proposed in DL, particularly since Vaswani et al. (2017) introduced a multihead self-attention mechanism that facilitates sequence-to-sequence modeling. However, existing attention mechanisms focus on a spatial or variable dimension separately; and it is challenging to balance the model’s interpretability and the model capability of capturing the inter-related spatial, temporal, and variable-wise dependencies. In order to evaluate the effectiveness of the attention mechanism in wildfire spread prediction and physical interpretation, as an example of the new model, we propose to include state-of-the-art attention modules that complement the convolutional operation in the generic ConvLSTM model. As our focus is to showcase the space-time aspects of model interpretability, here we focus on only two variants of attention mechanism within the ConvLSTM network based on the convolutional block attention module (CBAM) originally introduced in (Woo et al., 2018). We investigate tradeoffs among these different attention mechanisms and their nonattention benchmark network for analyzing their prediction accuracy and physical interpretability of stochastic wildfire spread in space-time.

122

6. Spatiotemporal attention ConvLSTM networks

FIG. 2 Conceptual framework of wildfire and environmental driver interactions influencing fire dynamics in space and time. Each row of the graph represents spatial and temporal “statuses” of individual biophysical factors that are related to wildfire spread. The top row shows the fuel patterns in the landscape and how the fire-front (red areas) progressed through successive steps and leaving ashes (gray areas) in space-time from t ¼ 0 to t + N. Other factors at each time-step can manifest different spatial patterns suggesting change. At each step in time, all these factors interact to determine the next location of the fire-front.

Spatial and channel-wise attention convolutional LSTM (SCA-ConvLSTM): We propose to replace the convolutional operation of ConvLSTM with a CBAM, where attention is employed as a complementary to convolution (Fig. 4). Our CBAM-based ConvLSTM is an advancement in that it sequentially applies two separate attentions along the image bands (hereafter “channels”) and spatial dimensions in the convolution operation (as opposed to direct convolution in the original ConvLSTM that blends cross-channel and spatial information). The channel attention focuses on identifying “what,” and spatial attention focuses on “where” are the most informative features in the input lattice, thus attempting to capture meaningful features from specific regions instead of all features from all regions.

3 Earth AI workflow

123

FIG. 3 Structure of a ConvLSTM cell. Red, gray, and blue cells in the lattices represent input burning and ash cells at t and t  1, and learned feature maps at t, respectively.

3 Earth AI workflow 3.1 Dataset acquisition and preparation For spatiotemporal attention-based predictions and interpretations, we conducted experiments on the wildfire spread data generated by a semiempirical model simulation, known as the percolation model used in Burge et al. (2020). We acquired the dataset from https://www. kaggle.com/johnburge (accessed on Nov 30, 2021). The wildfire data is represented on a 2D lattice where each grid corresponds to the fraction of the vegetation that has burned. They have different datasets, and we used the realistic train and test samples contained in realistic_training.tar.gz and realistic_testing.tar.gz files, respectively. Each data sample within the extracted training and testing files is a 4D “image” that corresponds to a single wildfire propagating through a field of vegetation. Each image has the following dimensions: time  height  width  channels ðpredictorsÞ Time ranges between 50 and 200 steps, and the length of a single time-step approximately corresponds to the fire spread dynamics that occur on a scale of 5–10 min. The height and

124

6. Spatiotemporal attention ConvLSTM networks

c(t-1)

tanh

ft

Ot

Ĉt

it

σ

σ

c(t)

tanh

σ

ĥ(t-1)

h(t)

. X(t)

CBAM

Spatial attention Channel attention ĥ(t-1) . X(t)

Input lattice

h(t-1) X(t)

Conv 2D

Spatial Attention (Ms)

Spatially-attended features

Pooling Layers

Pooling Layers

Shared MLP Channel Attention (Mc)

FIG. 4 Top: Proposed CBAM-based attention ConvLSTM network. Bottom: Illustration of the SCA module that can form the CBAM block in the ConvLSTM network, named SCA-ConvLSTM. To construct CSA-ConvLSTM, the relative order of spatial and channel attention modules are swapped within the CBAM block.

width of all images are 110 by 110, which, as per the percolation model described by Burge et al. (2020), corresponds to roughly the breadth of an average tree, e.g., 10–20 m. The channels represent the fire-ecological attributes (Fig. 5), such as: • • • • •

location of the fire-front patchy and heterogeneous vegetation heterogeneous moisture content realistic terrain height (i.e., elevation) wind speed (vertical and horizontal)

Each data sample file exists in Python-specific pickle format that contains a tuple of rank 2, where each value is an n-dimensional array. The first value is a NumPy ndarray of shape (t, h,

125

3 Earth AI workflow

Vegetation (t)

Fire front (t-1)

Ash (t) 0.14 0.7 0.12

0.6

0.10

0.5

0.08

0.4

0.06

0.3

0.04

0.2

0.02

0.1

0.00

Horizontal wind (t)

Vertical wind (t)

0.0

Moisture (t)

Elevation (t) 0.5 0.4

0.7 0.6 0.5

0.3 0.2

0.4 0.3 0.2

0.1 0.1 0.0

FIG. 5 Seven input channels of a 110  110 lattice for a single training instance.

w, c) representing the training/test data, while the second value is a NumPy array of shape (t, h, w, 1) representing the label—i.e., location of the fire-front within the 110  110 image. 3.1.1 Input-output sequence generation Wildfire spread prediction at multitime steps is a spatiotemporal sequence prediction problem that utilizes previously observed spatial time-step maps to forecast a fixed length of the future maps. Therefore, we need to organize the 4D images as input-output sequences. We make the sequence 20 steps or frames long (i.e., 10 frames for the input and 10 for the output/prediction). In the realistic dataset used here, the burnout time for a single fire is generally between 75 and 200 time-steps. Thus, for the input-output sequence generation, we loop through each fire sample and split it into subsets. Each subset becomes an individual fire sample with 20 sequential steps of fire progression through the landscape. Fig. 6 shows an input-output data sample used for model training.

FIG. 6 An input-output data sample for model training. Top: Input sequence of 10 steps; Bottom: Output/prediction sequence of 10 steps.

126

6. Spatiotemporal attention ConvLSTM networks

3.1.2 Data normalization In DL, it is important to transform your data so that all channel features are on a similar scale for improved training performance and accuracy. Therefore, we ensure that all of our seven input channel values are normalized before model training. The channels are normalized between 0.0 and 1.0 based on minimum-maximum normalization, except the wind channels (horizontal and vertical), which are normalized to have unit variance and zero mean because they initially had both positive and negative values. The output sequence frame(s) has the same shape as the inputs except for the channel dimension with only one channel, i.e., our prediction target is the location/pixels within the 110  110 frame that the fire-front will transition to from the prior time-step. This target channel value can be continuous or binary depending on the prediction goal—i.e., fire intensity or fire presence/absence. In our experiments, we only focus on predicting fire intensity, which can also indicate locations of fire presence and absence. For instance, if the channel value of a cell is >0.0, it contains fire with some intensity; otherwise, it is a fire-absent cell.

3.2 Modeling workflow demonstration 3.2.1 Attention ConvLSTM networks architecture The base structure of our nonattention benchmark ConvLSTM model (hereafter NA-ConvLSTM) and the attention-based ConvLSTM networks is shown in Fig. 7. We used two ConvLSTM layers followed by a Conv2D layer with kernel size (1,1). Each ConvLSTM layer internally processes 10 sequential 2D lattice frames (i.e., t1 to t30) to predict 2D outputs at the 10th time-step. Fire-environment input data

Input (S x H x W x C)

10 time steps (t1 to t10)

X-ConvLSTM (S x H x W x C)

10 time steps (t1 to t10)

X -ConvLSTM (S x H x W x C)

Predict Fire location with severity at t (11th)

2D Conv (H x W x C)

2D Conv (H x W x 1)

Output (H x W x 1)

FIG. 7 Structure of the proposed attention-based ConvLSTM model. “X” in X-ConvLSTM stands for different attention modules. Input data has S, H, W, and C dimensions that indicate the number of temporal sequences, spatial height, spatial width, and the number of channels or predictor variables, respectively. The convolution block in each ConvLSTM cell can be adjusted by its kernel size. Kernel means the small window (e.g., 3  3 pixels) that the convolution function will use to move across the image for its calculation. Different kernel sizes are used in X-ConvLSTM and the first Conv2D layers, whereas the final Conv2D layer has a fixed (1,1) kernel size.

3 Earth AI workflow

127

Imports import os import torch import torch.nn as nn import numpy as np import torchvision import torchvision.transforms as transforms import torchvision.transforms.functional as TF from torchvision import models

Model configuration

Codes below specify model configurations such as graphics processing unit (GPU) vs. central processing unit (CPU), epoch size, followed by network type, activation function type, number of input and output channels, kernel size, padding, and stride for each layer in the encoder and decoder. class Config: gpus = [0, ] device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') if torch.cuda.is_available(): num_workers = 4 * len(gpus) train_batch_size = 2 valid_batch_size = 2 * train_batch_size test_batch_size = 2 * train_batch_size else: num_workers = 0 train_batch_size = 2 valid_batch_size = 2 * train_batch_size test_batch_size = 2 * train_batch_size image_size = (110, 110) display = 10 draw = 10 epochs = 15 encoder = [('convlstm', '', 7, 16, 3, 1, 1), ('convlstm', '', 16, 32, 3, 1, 1), decoder = [('convlstm', '', 32, 16, 3, 1, 1), ('convlstm', '', 32, 16, 3, 1, 1), ('conv', 'sigmoid', 16, 1, 1, 0, 1)] config = Config()

128

6. Spatiotemporal attention ConvLSTM networks

Convolutional block attention module (CBAM)

Based on the arrangement of attention modules in the CBAM block (Fig. 4), we construct two different ConvLSTM networks. First, we put spatial attention before channel attention, which constructs the spatial-channel attention-supported network termed SCA-ConvLSTM. By swapping their sequential order, as shown below with codes, we can construct a CSAConvLSTM network. Given an input lattice or an intermediate feature map F  RCHw, the CBAM infers a 2D spatial attention map and a 1D channel attention map. 1. Channel-spatial attention module The channel-spatial attention module in the CSA-ConvLSTM first learns what features are important via the channel attention module, whose output feature maps are then forwarded to the spatial attention module that learns where important features are located. The resulting channel-spatial attention outputs become input to the LSTM module in Fig. 4(top). In the ChannelGate class, features are attended channel-wise to learn their inter-channel relationships. For that, the spatial dimension is squeezed by multiple pooling layers, including average-pooling, max-pooling, power-average pooling, and logarithmic summed exponential pooling, whose outputs are then individually forwarded to a shared network (i.e., a multilayer perceptron) that generates a channel attention map MC  RC11. The MLP has one c hidden layer just to generate a channel attention map, and whose activation size is Rr11 where r denotes the reduction ratio that helps reduce parameter overhead. The outputs from the shared MLP corresponding to all four feature descriptors are then merged using elementwise summation, followed by the ReLU activation function. Note that the four pooling operations (prior to the MLP) yielded superior performance compared to a subset of them. These channel-attended feature maps are then forwarded to the SpatialGate class, which learns where important features are located. We apply a 2D convolution layer to the features, and the output is transformed by a sigmoid activation. Finally, the sequential channel-spatial attention maps become input to the CSA-ConvLSTM block for sequential learning and multistep predictions. class BasicConv(nn.Module): def __init__(self, in_channels, num_feature, kernel_size, padding, stride, bn=True, bias=False): super().__init__() self.conv = nn.Conv2d(in_channels, num_feature, kernel_size=kernel_size, padding=padding, stride=stride, bias=False) self.bn = nn.BatchNorm2d(num_feature) def forward(self, x): x = self.conv(x) return x class ChannelGate(nn.Module): def __init__(self, gate_channels, reduction_ratio=4, pool_types=['avg', 'max', 'lp', 'lse']):

129

3 Earth AI workflow

super(ChannelGate, self).__init__() self.gate_channels = gate_channels self.fc1 = nn.Linear(gate_channels, gate_channels // reduction_ratio) self.relu1 = nn.ReLU() self.fc2 = nn.Linear(gate_channels // reduction_ratio, gate_channels) self.pool_types = pool_types def forward(self, x): channel_att_sum = None for pool_type in self.pool_types: if pool_type=='avg': avg_pool = F.avg_pool2d(x, (x.size(2), x.size(3)), stride=(x.size(2), x.size(3))) avg_pool = avg_pool.view(avg_pool.size(0), -1) channel_att_raw = self.relu1(self.fc1(avg_pool)) channel_att_raw = self.fc2(channel_att_raw) elif pool_type=='max': max_pool = F.max_pool2d(x, (x.size(2), x.size(3)), stride=(x.size(2), x.size(3))) max_pool = max_pool.view(max_pool.size(0), -1) channel_att_raw = self.relu1(self.fc1(max_pool)) channel_att_raw = self.fc2(channel_att_raw) elif pool_type=='lp': lp_pool = F.lp_pool2d(x, 2, (x.size(2), x.size(3)), stride=(x.size(2), x.size(3))) lp_pool = lp_pool.view(lp_pool.size(0), -1) channel_att_raw = self.relu1(self.fc1(lp_pool)) channel_att_raw = self.fc2(channel_att_raw) elif pool_type=='lse': lse_pool = x.view(x.size(0), x.size(1), -1) s, _ = torch.max(lse_pool, dim=2, keepdim=True) outputs = s + (lse_pool - s).exp().sum(dim=2, keepdim=True).log() lse_pool = outputs.view(outputs.size(0), -1) channel_att_raw = self.relu1(self.fc1(lse_pool)) channel_att_raw = self.fc2(channel_att_raw) if channel_att_sum is None: channel_att_sum = channel_att_raw else: channel_att_sum = channel_att_sum + channel_att_raw scale = torch.sigmoid(channel_att_sum ).unsqueeze(2) .unsqueeze(3).expand_as(x)

130

6. Spatiotemporal attention ConvLSTM networks

out = x * scale return x * scale class SpatialGate(nn.Module): def __init__(self, in_channels, num_features, kernel_size, padding, stride, bn=True, bias=False): super().__init__() self.spatial = BasicConv(in_channels, num_features, kernel_size, stride=1, padding=(kernel_size-1) // 2) def forward(self, x): x_out = self.spatial(x) scale = torch.sigmoid(x_out) out = scale return scale

2. Spatial-channel attention module To generate spatial attention maps in SCA-ConvLSTM, we first perform pooling operations (i.e., max-pooling, min-pooling, and average-pooling) along the input’s channel dimension to get “the best” feature descriptors for maximum, minimum, average feature values. These three spatially attended features are then concatenated with the input features along the channel axis. After that, we apply a 2D convolution layer to the concatenated features, and the output is transformed based on sigmoid activation. Unlike Woo et al. (2018), prior to this convolution, we mix spatially attended features with the original input features as we assume that spatial events such as wildfires are typically driven by the interactions among multiple features (e.g., biophysical factors) with varying threshold requirements. The resulting spatial attention map Ms  RCHW then becomes an input to the channel attention module. As described in the previous section, the features are then attended channel-wise to learn their inter-channel relationships. Finally, the sequential spatial-channel attention maps become input to the SCA-ConvLSTM block for sequential learning and multistep predictions. class BasicConv(nn.Module): def __init__(self, in_channels, num_feature, kernel_size, padding, stride, bn=True, bias=False): super().__init__() self.conv = nn.Conv2d(in_channels, num_feature, kernel_size=kernel_size, padding=padding, stride=stride, bias=False) def forward(self, x): x = self.conv(x) return x class ChannelGate(nn.Module): def __init__(self, gate_channels, reduction_ratio=8,

131

3 Earth AI workflow

pool_types=['avg', 'max', 'lp', 'lse']): super(ChannelGate, self).__init__() self.gate_channels = gate_channels self.fc1 = nn.Linear(gate_channels, gate_channels // reduction_ratio) self.relu1 = nn.ReLU() self.fc2 = nn.Linear(gate_channels // reduction_ratio, gate_channels) self.pool_types = pool_types def forward(self, x): channel_att_sum = None for pool_type in self.pool_types: if pool_type=='avg': avg_pool = F.avg_pool2d(x, (x.size(2), x.size(3)), stride=(x.size(2), x.size(3))) avg_pool = avg_pool.view(avg_pool.size(0), -1) channel_att_raw = self.relu1(self.fc1(avg_pool)) channel_att_raw = self.fc2(channel_att_raw) elif pool_type=='max': max_pool = F.max_pool2d(x, (x.size(2), x.size(3)), stride=(x.size(2), x.size(3))) max_pool = max_pool.view(max_pool.size(0), -1) channel_att_raw = self.relu1(self.fc1(max_pool)) channel_att_raw = self.fc2(channel_att_raw) elif pool_type=='lp': lp_pool = F.lp_pool2d(x, 2, (x.size(2), x.size(3)), stride=(x.size(2), x.size(3))) lp_pool = lp_pool.view(lp_pool.size(0), -1) channel_att_raw = self.relu1(self.fc1(lp_pool)) channel_att_raw = self.fc2(channel_att_raw) elif pool_type=='lse': tensor_flatten = x.view(x.size(0), x.size(1), -1) s, _ = torch.max(tensor_flatten, dim=2, keepdim=True) lse_pool = s+(tensor_flatten-s).exp().sum(dim=2, keepdim=True).log() lse_pool = lse_pool.view(lse_pool.size(0), -1) channel_att_raw = self.relu1(self.fc1(lse_pool)) channel_att_raw = self.fc2(channel_att_raw) if channel_att_sum is None: channel_att_sum = channel_att_raw else: channel_att_sum = channel_att_sum + channel_att_raw

132

6. Spatiotemporal attention ConvLSTM networks

scale = torch.sigmoid(channel_att_sum ).unsqueeze(2) .unsqueeze(3).expand_as(x) return x * scale class SpatialGate(nn.Module): def __init__(self, in_channels, num_features, kernel_size, padding, stride, bn=True, bias=False): super().__init__() self.spatial = BasicConv(in_channels, num_features, kernel_size, stride=1, padding=(kernel_size-1) // 2) def forward(self, x): pooled = torch.cat((torch.max(x,1)[0].unsqueeze(1), torch.mean(x,1) .unsqueeze(1), torch.min(x,1)[0].unsqueeze(1)), dim=1) x_pooled = torch.cat((pooled, x), dim=1) x_out = self.spatial(x_pooled) scale = torch.sigmoid(x_out) return scale

Nonattention ConvLSTM block

As mentioned above, a typical ConvLSTM network (Fig. 3) blends convolutional layers and LSTM layers. The input data to the network defined below is a 4D tensor having the following size: (Batch, Sequence, Channel, Height, Width). The prediction target has the same dimensionality as the input. However, the temporal length of the target sequence can either be a single step or multistep (e.g., 10 steps). As shown in Figs. 3 and 4 (top), every time a new input lattice comes, its information is accumulated in the memory cell (Ct) if the input gate (it) is open. Also, if the forget gate (ft) is open, then the past cell status Ct1 is forgotten; this gate helps forget irrelevant information in the sequence, thus resetting the memory of the recurrent network. The output gate (Ot) controls whether the current Ct will be propagated to the final state ht, which acts as the final memory of the network at the current time-step t. In essence, the ht is learned by the network-based input-to-hidden and hidden-to-hidden nonlinear transformations of lattice values at each step before it is fed into a 1  1 convolutional layer to generate the final prediction. In a ConvLSTM network, a future state of a particular cell in the lattice is determined by the current inputs and past states of its local neighbors. This is accomplished by the convolutional operation (i.e., self.conv) that performs input-to-state and state-to-state transitions over the lattice. Specifically, like in a traditional CNN, the convolutional operator performs two tasks: feature aggregation and transformation. Feature aggregation means using a fixed kernel with pretrained and fixed weights to linearly combine feature values from the local neighborhood (pixels inside the sliding window). Feature transformation means using linear mappings and nonlinear scalar functions such as sigmoid, tanh, etc., to calculate a new feature value.

133

3 Earth AI workflow

Following the encoder-decoder layers, as described before, we increment the number of input and output channels by in_channels+num_features and num_features4* in the self. _make_layer method. In the case of the general ConvLSTM, the nn.Sequential construction runs basic Conv2d operation on the input lattice. The outputs from the convolution are then passed through the gating mechanism for each time-step within the LSTM workflow. class ConvLSTMBlock(nn.Module): def __init__(self, in_channels, num_features, kernel_size=3, padding=1, stride=1): super().__init__() self.num_features = num_features self.conv = self._make_layer(in_channels+num_features, num_features*4, kernel_size, padding, stride) def _make_layer(self, in_channels, out_channels, kernel_size, padding, stride): return nn.Sequential( nn.Conv2d(in_channels, out_channels, kernel_size=kernel_size, padding=padding, stride=stride, bias=False)) def forward(self, inputs): ''' :param inputs: (B, S, C, H, W) :param hidden_state: (hx: (B, S, C, H, W), cx: (B, S, C, H, W)) :return: ''' outputs = [] B, S, C, H, W = inputs.shape hx = torch.zeros(B, self.num_features, H, W).to(inputs.device) cx = torch.zeros(B, self.num_features, H, W).to(inputs.device) for t in range(S): combined = torch.cat([inputs[:, t],

# (B, C, H, W)

hx], dim=1) gates = self.conv(combined) ingate, forgetgate, cellgate, outgate = torch.split(gates, self.num_features, dim=1) ingate = torch.sigmoid(ingate) forgetgate = torch.sigmoid(forgetgate) outgate = torch.sigmoid(outgate) cy = (forgetgate * cx) + (ingate * cellgate) hy = outgate * torch.tanh(cy) outputs.append(hy)

134

6. Spatiotemporal attention ConvLSTM networks

hx = hy cx = cy # (S, B, C, H, W) -> (B, S, C, H, W) finalOutputs = torch.stack(outputs) .permute(1, 0, 2, 3, 4).contiguous() return finalOutputs

CSA-ConvLSTM block

In case of the CSA-ConvLSTM, the nn.Sequential construction runs CBAM class on the input lattice instead of the basic Conv2d operation shown for the general ConvLSTM. For CSA, as shown in the forward function of CBAM class, ChannelGate class is executed first, whose outputs become input to the SpatialGate class. The outputs from the CBAM-based convolution are then passed through the gating mechanism for each time-step within the LSTM workflow. class CBAM(nn.Module): def __init__(self, in_channels, num_features, kernel_size, padding, stride, reduction_ratio=4, pool_types=['avg', 'max', 'lp', 'lse'], no_spatial=False): super(CBAM, self).__init__() self.ChannelGate = ChannelGate(in_channels, reduction_ratio, pool_types) self.no_spatial=no_spatial if not no_spatial: self.SpatialGate = SpatialGate(in_channels, num_features, kernel_size, padding, stride) def forward(self, x): x_out = self.ChannelGate(x) if not self.no_spatial: x_out = self.SpatialGate(x_out) return x_out class ConvLSTMBlock(nn.Module): def __init__(self, in_channels, num_features, kernel_size=3, padding=1, stride=1): super().__init__() self.num_features = num_features self.conv = self._make_layer(in_channels+num_features, num_features*4, kernel_size, padding, stride)

3 Earth AI workflow

def _make_layer(self, in_channels, out_channels, kernel_size, padding, stride): self.in_channels = in_channels self.out_channels = out_channels return nn.Sequential(CBAM(in_channels, out_channels, kernel_size=kernel_size, padding=padding, stride=stride, reduction_ratio=8, pool_types=['avg', 'max', 'lp', 'lse'], no_spatial=False)) def forward(self, inputs): ''' :param inputs: (B, S, C, H, W) :param hidden_state: (hx: (B, S, C, H, W), cx: (B, S, C, H, W)) :return: ''' outputs = [] B, S, C, H, W = inputs.shape hx = torch.zeros(B, self.num_features, H, W).to(inputs.device) cx = torch.zeros(B, self.num_features, H, W).to(inputs.device) for t in range(S): combined = torch.cat([inputs[:, t], # (B, C, H, W) hx], dim=1) gates = self.conv(combined) ingate, forgetgate, cellgate, outgate = torch.split(gates, self.num_features, dim=1) ingate = torch.sigmoid(ingate) forgetgate = torch.sigmoid(forgetgate) outgate = torch.sigmoid(outgate) cy = (forgetgate * cx) + (ingate * cellgate) hy = outgate * torch.tanh(cy) outputs.append(hy) hx = hy cx = cy finalOutputs = torch.stack(outputs) .permute(1, 0, 2, 3, 4).contiguous() return finalOutputs

135

136

6. Spatiotemporal attention ConvLSTM networks

SCA-ConvLSTM block

In case of the SCA-ConvLSTM, the nn.Sequential construction runs CBAM class on the input lattice instead of the basic Conv2d operation shown for the general ConvLSTM. For SCA, as shown in the forward function of CBAM class, SpatialGate class is executed first, whose outputs will be used as input to the ChannelGate class. The outputs from the CBAM-based convolution are then passed through the gating mechanism for each time-step within the LSTM workflow. class CBAM(nn.Module): def __init__(self, in_channels, num_features, kernel_size, padding, stride, reduction_ratio=8, pool_types=['avg', 'max', 'lp', 'lse'], no_spatial=False): super(CBAM, self ).__init__() self.ChannelGate = ChannelGate(num_features, reduction_ratio, pool_types) self.no_spatial=no_spatial if not no_spatial: self.SpatialGate = SpatialGate(in_channels, num_features, kernel_size, padding, stride) def forward(self, x): if not self.no_spatial: x_out = self.SpatialGate(x) x_out = self.ChannelGate(x_out) return x_out class ConvLSTMBlock(nn.Module): def __init__(self, in_channels, num_features, kernel_size=3, padding=1, stride=1): super().__init__() self.num_features = num_features self.pooled_features = 3 self.conv = self._make_layer(in_channels+num_features+3, num_features*4, kernel_size, padding, stride) def _make_layer(self, in_channels, out_channels, kernel_size, padding, stride): self.in_channels = in_channels self.out_channels = out_channels return nn.Sequential(CBAM(in_channels, out_channels, kernel_size=kernel_size, padding=padding, stride=stride, reduction_ratio=8, pool_types=['avg', 'max', 'lp', 'lse'], no_spatial=False))

3 Earth AI workflow

137

def forward(self, inputs): ''' :param inputs: (B, S, C, H, W) :param hidden_state: (hx: (B, S, C, H, W), cx: (B, S, C, H, W)) :return: ''' outputs = [] B, S, C, H, W = inputs.shape hx = torch.zeros(B, self.num_features, H, W).to(inputs.device) cx = torch.zeros(B, self.num_features, H, W).to(inputs.device) for t in range(S): combined = torch.cat([inputs[:, t], # (B, C, H, W) hx], dim=1) gates = self.conv(combined) ingate, forgetgate, cellgate, outgate = torch.split(gates, self.num_features, dim=1) ingate = torch.sigmoid(ingate) forgetgate = torch.sigmoid(forgetgate) outgate = torch.sigmoid(outgate) cy = (forgetgate * cx) + (ingate * cellgate) hy = outgate * torch.tanh(cy) outputs.append(hy) hx = hy cx = cy finalOutputs = torch.stack(outputs) .permute(1, 0, 2, 3, 4).contiguous() return finalOutputs

Encoder-decoder block

The encode and decoder block is a typical fundamental setting for image analysis. The encoder is responsible for extracting more abstract features via multiple layers whose neurons gradually shrink. At the same time, the decoder is the translation module responsible for understanding and translating the extracted information from the encoder into information for specific problems. We define and use the block here to highlight specific tasks that the encoding and decoding network performs. In our encoder-decoder framework, the encoding ConvLSTM compresses the input sequence into hidden state tensors, whereas the decoding or forecasting ConvLSTM unfolds the hidden state to give the final prediction. The initial states and cell outputs of the decoder network are copied from the last state of the encoding network. We formed both networks by stacking two ConvLSTM layers. Because our prediction target has the same dimensionality as the input, we concatenate all the states in the decoder network before feeding them into a 1  1

138

6. Spatiotemporal attention ConvLSTM networks

convolutional layer that generates the final prediction. A similar encoder-decoder approach has been implemented in (Shi et al., 2015) for predicting rainfall intensity over short temporal intervals. class Encoder(nn.Module): def __init__(self, config): super().__init__() self.layers = [] for idx, params in enumerate(config.encoder): setattr(self, params[0]+'_'+str(idx), self._make_layer(*params)) self.layers.append(params[0]+'_'+str(idx)) def _make_layer(self, type, activation, in_ch, out_ch, kernel_size, padding, stride): layers = [] if type == 'conv': layers.append(nn.Conv2d(in_ch, out_ch, kernel_size=kernel_size, padding=padding, stride=stride, bias=False)) if activation == 'leaky': layers.append(nn.LeakyReLU(inplace=True)) elif activation == 'relu': layers.append(nn.ReLU(inplace=True)) elif type == 'convlstm': layers.append(ConvLSTMBlock(in_ch, out_ch, kernel_size=kernel_size, padding=padding, stride=stride)) return nn.Sequential(*layers) def forward(self, x): ''' :param x: (B, S, C, H, W) :return: ''' outputs = [x] for layer in self.layers: if 'conv_' in layer: B, S, C, H, W = x.shape x = x.view(B*S, C, H, W) if 'convlstm' in layer: x = getattr(self, layer)(x) else: x = getattr(self, layer)(x)

3 Earth AI workflow

if 'conv_' in layer: x = x.view(B, S, x.shape[1], x.shape[2], x.shape[3]) outputs.append(x) if 'convlstm' in layer: outputs.append(x) return outputs class Decoder(nn.Module): def __init__(self, config): super().__init__() self.layers = [] for idx, params in enumerate(config.decoder): setattr(self, params[0]+'_'+str(idx), self._make_layer(*params)) self.layers.append(params[0]+'_'+str(idx)) def _make_layer(self, type, activation, in_ch, out_ch, kernel_size, padding, stride): layers = [] if type == 'conv': layers.append(nn.Conv2d(in_ch, out_ch, kernel_size=kernel_size, padding=padding, stride=stride, bias=False)) if activation == 'leaky': layers.append(nn.LeakyReLU(inplace=True)) elif activation == 'relu': layers.append(nn.ReLU(inplace=True)) elif activation == 'sigmoid': layers.append(nn.Sigmoid()) elif type == 'convlstm': layers.append(ConvLSTMBlock(in_ch, out_ch, kernel_size=kernel_size, padding=padding, stride=stride)) elif type == 'deconv': layers.append(nn.ConvTranspose2d(in_ch, out_ch, kernel_size=kernel_size, padding=padding, stride=stride, bias=False)) if activation == 'leaky': layers.append(nn.LeakyReLU(inplace=True)) elif activation == 'relu': layers.append(nn.ReLU(inplace=True)) return nn.Sequential(*layers) def forward(self, encoder_outputs): ''' :param x: (B, S, C, H, W) :return: ''' idx = len(encoder_outputs)-1

139

140

6. Spatiotemporal attention ConvLSTM networks

for layer in self.layers: if 'conv_' in layer or 'deconv_' in layer: x = encoder_outputs[idx] B, S, C, H, W = x.shape x = x.view(B*S, C, H, W) x = getattr(self, layer)(x) x = x.view(B, S, x.shape[1], x.shape[2], x.shape[3]) elif 'convlstm' in layer: idx -= 1 if idx == 2: x = encoder_outputs[idx] x = getattr(self, layer)(x) elif idx == 1: x = torch.cat([encoder_outputs[idx], x], dim=2) x = getattr(self, layer)(x) else: break return x class ConvLSTM(nn.Module): def __init__(self, config): super().__init__() self.encoder = Encoder(config) self.decoder = Decoder(config) def forward(self, x): x = self.encoder(x) x = self.decoder(x) return x

Train and test network

Below, we showcase train and test functions used for model training and for testing its prediction performance. Both functions have the following essential inputs: epoch number, the model, data loader, loss criterion, and optimizer. Two other custom functions are also used in the test function, such as subArray_fire and subArray_Nofire, which help spatially filter the input and target data to calculate prediction errors over the fire and no-fire regions, respectively. def subArray_fire(x, y): ## ————————————— Target ————————————— x = x.permute(1, 2, 0, 3, 4).contiguous() y = y.permute(1, 2, 0, 3, 4).contiguous() target_10s = [] output_10s = []

3 Earth AI workflow

for t in range(10): x1 = x[t] x2 = torch.flatten(x1) mask = x2 > 0 indices = torch.nonzero(mask) ## ————————————— Subset target cells based on indices xx = x2[indices].squeeze() ## ————————————— Subset output cells based on fire cell index y1 = y[t] y2 = torch.flatten(y1) yy = y2[indices].squeeze() xx_yy = torch.stack([xx, yy], dim=0) target_10s.append(xx_yy[0].cpu().detach().numpy()) output_10s.append(xx_yy[1].cpu().detach().numpy()) target_10s_mean = torch.from_numpy(np.asarray(np.mean(np.hstack(target_10s)))) output_10s_mean = torch.from_numpy(np.asarray(np.mean(np.hstack(output_10s)))) target_output_10s = [target_10s, output_10s] return target_10s_mean, output_10s_mean, target_output_10s def subArray_Nofire(x, y): ## --------------- Target -----------x = x.permute(1, 2, 0, 3, 4).contiguous() y = y.permute(1, 2, 0, 3, 4).contiguous() target_10s = [] output_10s = [] for t in range(10): #print(x.shape) x1 = x[t] x2 = torch.flatten(x1) ## Get index of Target values > 0 [[i.e. Fire cells only]] mask = x2 == 0 indices = torch.nonzero(mask)

141

142

6. Spatiotemporal attention ConvLSTM networks

## ————————————— Subset target cells based on indices xx = x2[indices].squeeze() ## ————————————— Subset output cells based on fire cell index y1 = y[t] y2 = torch.flatten(y1) yy = y2[indices].squeeze() xx_yy = torch.stack([xx, yy], dim=0) target_10s.append(xx_yy[0].cpu().detach().numpy()) output_10s.append(xx_yy[1].cpu().detach().numpy()) target_10s_mean = torch.from_numpy(np.asarray(np.mean(np.hstack(target_10s)))) output_10s_mean = torch.from_numpy(np.asarray(np.mean(np.hstack(output_10s)))) target_output_10s = [target_10s, output_10s] return target_10s_mean, output_10s_mean, target_output_10s def train(config, logger, epoch, model, train_loader, criterion, optimizer): model.train() epoch_records = {'loss': []} num_batchs = len(train_loader) for batch_idx, (inputs, targets) in enumerate(train_loader): ## Make sure inputs and targets are calculated ## over 110 x 100 regions. inputs = inputs.float().to(config.device) targets = targets.float().to(config.device) outputs = model(inputs) losses = criterion(outputs, targets) optimizer.zero_grad() losses.backward() optimizer.step() epoch_records['loss'].append(losses.item()) return epoch_records def test(config, logger, epoch, model, test_loader, criterion): model.eval() epoch_records = {'loss': []} test_loss_Nofire = {'loss': []} test_loss_fire = {'loss': []} accuracy = {'class_accuracy': []}

3 Earth AI workflow

## Losses at 10 time-steps epoch_losses10T = {'loss10T': []} fire_losses10T = {'loss10T': []} nofire_losses10T = {'loss10T': []} num_batchs = len(test_loader) for batch_idx, (inputs, targets) in enumerate(test_loader): with torch.no_grad(): ## Make sure inputs and targets are calculated ## over 110 x 100 regions. inputs = inputs.float().to(config.device) targets = targets.float().to(config.device) outputs = model(inputs) ### Fire and NoFire-only losses losses = criterion(outputs, targets).to(config.device) targets_reshaped = targets.permute(1, 2, 0, 3, 4).contiguous() outputs_reshaped = outputs.permute(1, 2, 0, 3, 4).contiguous() losses_10T = [] for ts in range(10): target_t = targets_reshaped[ts] output_t = outputs_reshaped[ts] losses_t = criterion(output_t, target_t).to(config.device) losses_10T.append(losses_t) epoch_losses10T['loss10T'].append(losses_10T) ### Fire-only losses target_fire, output_fire, fire_target_output_10t = subArray_fire(targets, outputs) targets_fire = target_fire.to(config.device) outputs_fire = output_fire.to(config.device) losses_fire = criterion(outputs_fire, targets_fire) .to(config.device) ##-- Losses at 10 time steps fire_losses_10T = [] for ts in range(10): target_t1 = torch.from_numpy(fire_target_output_10t[0][ts]) output_t1 = torch.from_numpy(fire_target_output_10t[1][ts]) losses_t1 = criterion(output_t1, target_t1).to(config.device)

143

144

6. Spatiotemporal attention ConvLSTM networks

fire_losses_10T.append(losses_t1) fire_losses10T['loss10T'].append(fire_losses_10T) ### NoFire-only losses target_Nofire, output_Nofire, noFire_target_output_10t = subArray_Nofire(targets, outputs) targets_Nofire = target_Nofire.to(config.device) outputs_Nofire = output_Nofire.to(config.device) losses_Nonfire = criterion(outputs_Nofire, targets_Nofire) .to(config.device) #-- Losses at 10 time steps nofire_losses_10T = [] for ts in range(10): target_t2 = torch.from_numpy(noFire_target_output_10t[0][ts]) output_t2 = torch.from_numpy(noFire_target_output_10t[1][ts]) losses_t2 = criterion(output_t2, target_t2).to(config.device) nofire_losses_10T.append(losses_t2) nofire_losses10T['loss10T'].append(nofire_losses_10T) epoch_records['loss'].append(losses.item()) test_loss_Nofire['loss'].append(losses_Nonfire.item()) test_loss_fire['loss'].append(losses_fire.item()) return epoch_records, test_loss_fire, test_loss_Nofire, epoch_losses10T, fire_losses10T, nofire_losses10T

3.2.2 Execute the model Below we specify train and test data paths, define the model, loss function, optimizer function, data loaders, and parameter thresholds for saving the best model from the number of epochs we chose before. Generally, an optimal number of epochs should be used to mitigate overfitting and increase the generalization capacity of the trained neural network. We have seen that in all three ConvLSTM networks, the model performance did not improve after 15 epochs. We list the overall model errors (based on MSE loss) for each epoch for the train and test data. We also list losses for each step of the 10-step predictions.

3 Earth AI workflow

trainDataSetDir = '/home/ConvLSTM.pytorch/Data_ConvLSTM/Train' testDataSetDir = '/home/ConvLSTM.pytorch/Data_ConvLSTM/Test' model = ConvLSTM(config).to(config.device) summary(model) criterion = torch.nn.MSELoss().to(config.device) optimizer = torch.optim.Adam(model.parameters(), lr=1e-4) train_dataset = WildFireDataset(config, logger, trainDataSetDir, split='train') train_loader = DataLoader(train_dataset, batch_size=config.train_batch_size, num_workers=config.num_workers, shuffle=True, pin_memory=True) test_dataset = WildFireDataset_Test(config, logger, testDataSetDir, split='test') test_loader = DataLoader(test_dataset, batch_size=config.test_batch_size, num_workers=config.num_workers, shuffle=False, pin_memory=True) patience = 10 min_val_loss = 9999 epoch_interval = 1 counter = 0 train_records, tst_records = [], [], [] test_loss_fire, test_loss_Nofire, pred_acc = [], [], [] test_losses10T, test_fire_losses10T, test_nofire_losses10T = [], [], [] losses_all_fire_nofire_10T = {} for epoch in range(config.epochs): epoch_records = train(config, logger, epoch, model, train_loader, criterion, optimizer) epoch_records = np.mean(epoch_records['loss']) train_records.append(epoch_records) test_records, test_records_fire, test_records_Nofire, accuracy_class, test_losses10, test_fire_losses10, test_nofire_losses10 = test(config, logger, epoch, model, test_loader, criterion) test_records = np.mean(test_records['loss']) tst_records.append(test_records) test_records_fire = np.mean(test_records_fire['loss'])

145

146

6. Spatiotemporal attention ConvLSTM networks

test_loss_fire.append(test_records_fire) test_records_Nofire = np.mean(test_records_Nofire['loss']) test_loss_Nofire.append(test_records_Nofire) test_losses10T.append(test_losses10) test_fire_losses10T.append(test_fire_losses10) test_nofire_losses10T.append(test_nofire_losses10) if min_val_loss > test_records**0.5: min_val_loss = test_records**0.5 print("Saving...") torch.save(model.state_dict(), "/home/ConvLSTM.pytorch/Output_ConvLSTM/model_Convlstm.pth") counter = 0 else: counter += 1 if counter == patience: break print("Losses: ", train_records, tst_records, test_loss_fire, test_loss_Nofire) ## Save losses for 10-step outputs/predictions losses_all_fire_nofire_10T["all_losses10T"] = test_losses10T losses_all_fire_nofire_10T["fire_losses10T"] = test_fire_losses10T losses_all_fire_nofire_10T["nofire_losses10T"] = test_nofire_losses10T out = open(os.path.join(r"/home/arifm/ConvLSTM.pytorch/Output_ConvLSTM", "ConvLSTM_losses_10T.pkl"),'wb') pickle.dump(losses_10T, out) out.close()

3.3 Physical interpretability of the trained model: integrated gradients-based feature importance Integrated gradients can be defined as the path integral of the gradients along the straightline path from the baseline x0 to the input x. Formally, suppose F : Rn ! [0, 1] F represents one of our ConvLSTM networks for a classification problem. Let x  Rn be the input lattice at hand and x0  Rn be the baseline input which can be a black image, for example. The integrated gradient along the ith dimension for an input x and baseline x0 can be defined as follows: ð1   ∂Fðx0 + α  ðx  x0 ÞÞ 0 dα Integrated Gradientsi ðxÞ :: ¼ xi  xi  ∂xi α¼0

3 Earth AI workflow

147

where ⅈ ¼ feature, x ¼ input, x0 ¼ baseline, α ¼ interpolation constant to perturb features by, and ∂F(x)/(∂xi) is the gradient of F(x) along the ith dimensions. We use integrated gradients to generate feature importance from the trained model. These feature importances are then used for generating temporal heatmaps (see Results). ## ====Calculate Feature Importance by Integrated Gradients====## ig = IntegratedGradients(model) ig_list_dict = {} ig_list = [] ig_list_all = [] for i in range(0, len(label_grater_than_0), 1): ig_attr_test = ig.attribute(input.float().to(config.device), baselines=input.float().to(config.device)*0, target=(9, 0, label_grater_than_0[i][0], label_grater_than_0[i][1]), n_steps=1) img_ig_attr_test = ig_attr_test.cpu().detach().numpy() ig_list.append(img_ig_attr_test[0][9]) ig_list_all.append(img_ig_attr_test[0]) key = str(i) ig_list_dict[key] = img_ig_attr_test[0] ig_array = np.stack((ig_list), axis=0) ig_mean = np.mean(ig_array, axis=0) ig_mean = (2 * (ig_mean-ig_mean.min())/(ig_mean.max()-ig_mean.min())) - 1 ig_array_10T = np.stack((ig_list_all), axis=0) ig_mean_10T = np.mean(ig_array_10T, axis=0) workspace = '/content/drive/MyDrive/interpretation_outputs/' #### Save temporal importance values for fire/no-fire cells pickle.dump(ig_array_10T, open(os.path.join(workspace, "ig_Grids_10T_FIRE_NAConvLSTM_NEW.pkl"), 'wb')) pickle.dump(ig_mean_10T, open(os.path.join(workspace, "ig_GridsMean_10T_FIRE_NAConvLSTM_NEW.pkl") , 'wb')) #### Temporal Change in feature importance #### importances_dict = {}

148

6. Spatiotemporal attention ConvLSTM networks

ig_array_10T_TC = ((ig_mean_10T - ig_mean_10T.min()) * (1/(ig_mean_10T.max() - ig_mean_10T.min()) * 255)) ig_mean_10T_TC = ig_mean_10T print(ig_mean_10T_TC.shape) for i in range(0, 10, 1): importances = np.mean(ig_mean_10T_TC[i], axis=(1,2)) timestep = "T" + str(i) importances_dict[timestep] = importances print(len(importances_dict)) df = pd.DataFrame.from_dict(importances_dict, orient = 'index') df.to_csv("/content/drive/MyDrive/interpretation_outputs/ temporal_importances_fire_NAConvLSTM_NEW.csv")

4 Results Our experiments focused on evaluating prediction accuracies of the proposed attentionbased ConvLSTM models against the basic nonattention model. All ConvLSTM models predict the progression of wildfire spread—location of fire-front—over a sequence of 20 timesteps, thus capturing fire spread dynamics for approximately 1.5–2.5 hours. The first 10 time-steps were used as input sequences, and the rest were labels for the next 10-step predictions.

4.1 Prediction performance We listed two questions concerning predictive accuracy and conducted corresponding tests. The questions, the corresponding answers, and a step-by-step guide to the tests are listed below. Q1: Do attention mechanisms within ConvLSTM improve event progression prediction? Specifically, which model(s) performs better for predicting the overall, fire-only, and no-fire cells? Q2: Which models performed better for predicting consecutive predictions? To answer these questions, we copied the epoch-wise error lists and temporal losses into temporary CSV files, which are then used for generating error graphs. import pandas as pd import matplotlib.pyplot as plt from matplotlib.pyplot import figure import seaborn as sns sns.set_theme(style="whitegrid")

4 Results

149

df_fire_nofire = pd.read_csv("Accuracy_overall.csv") df_fire = pd.read_csv("Accuracy_fire.csv") df_nofire = pd.read_csv("Accuracy_nofire.csv") Models = np.array(['NA-ConvLSTM', 'Patch-ConvLSTM', 'Pair-ConvLSTM', 'CSA-ConvLSTM', 'SCA-ConvLSTM']) error_data = {‘Models’: Models, 'Loss (Fire & No-fire)': loss_all, 'Loss (fire)': loss_fire, 'Loss (No-fire)': loss_nofire} _, axarr = plt.subplots(1, 3, figsize=(10*3, 10)) sns.set(font_scale = 2) ax = sns.barplot(x="Models", y='Loss (Fire & No-fire)', data=error_data, ci="sd", ax=axarr[0]) ax.set_xticklabels(ax.get_xticklabels(), rotation=45) ax.set_ylabel("RMSE") ax = sns.barplot(x="Models", y='Loss (fire)', data=error_data, ci="sd", ax=axarr[1]) ax.set_xticklabels(ax.get_xticklabels(), rotation=45) ax = sns.barplot(x="Models", y='Loss (No-fire)', data=error_data, ci="sd", ax=axarr[2]) ax.set_xticklabels(ax.get_xticklabels(), rotation=45)

To answer Q1, the average RMSE of the predictions for fire progression at 10 sequential time-steps (lowest among 15 epochs) from different models are presented in Fig. 8. To answer Q2, we generate Fig. 9 based on temporal loss data which was saved as pickle files.

FIG. 8

Performance of the nonattention and attention-based ConvLSTM models for predictions on the overall (left), fire-only (middle), and no-fire (right) lattices.

150

6. Spatiotemporal attention ConvLSTM networks Fire and no-fire

Fire

NA-ConvLSTM CSA-ConvLSTM SCA-ConvLSTM

0.12

No-fire 0.12

0.12 0.10

RMSE

0.10

0.08

0.10

0.08

0.06

0.06

0.08 0.04

0.04

0.02

0.06 0.02

0.00 2

4

6

8

10

Time

2

4

6

8

10

Time

2

4

6

8

10

Time

FIG. 9 Performance of the nonattention and attention-based ConvLSTM models for 10 sequential predictions on all (fire and no-fire), fire-only, and no-fire lattices. CSA-ConvLSTM predicted fire-front cells more accurately than the other two methods, but the nonattention ConvLSTM was superior at predicting no-fire areas. Note that the reported SCA-ConvLSTM used channel attention followed by only a Conv2D operation in the spatial attention module. import pickle infile = open("NA_ConvLSTM_losses_10T.pkl",'rb') NA_Pkl = pickle.load(infile) infile.close() infile = open("CSA_ConvLSTM_losses_10T.pkl",'rb') CSA_Pkl = pickle.load(infile) infile.close() infile = open("SCA_ConvLSTM_losses_10T.pkl",'rb') SCA_Pkl = pickle.load(infile) infile.close()

Here, we retrieve the overall, fire-only, no fire-only temporal losses as listed from the picked dictionary file that contains epoch-wise temporal losses. We showcase this retrieval workflow for the nonattention ConvLSTM model as an example. for key, val in NA_Pkl.items(): print (key, len(val)) if key == 'all_losses10T': n = 0 for d in val: n+=1 if n == 15: print("Epoch____________________: ", n)

4 Results

for k, v in d.items(): value = v print(len(value)) tot = 0 NAM_batch_losses_all = [] for v in value: tot+=1 value_list = [] for val in v: valu = val.detach().cpu().numpy() value_list.append(valu.tolist()) NAM_batch_losses_all.append(value_list) elif key == 'fire_losses10T': n = 0 for d in val: n+=1 if n == 15: print("Epoch____________________: ", n) for k, v in d.items(): value = v print(len(value)) tot = 0 NAM_batch_losses_fire = [] for v in value: tot+=1 value_list = [] for val in v: valu = val.detach().cpu().numpy() value_list.append(valu.tolist()) NAM_batch_losses_fire.append(value_list) elif key == 'nofire_losses10T': n = 0 for d in val: n+=1 if n == 15: print("Epoch____________________: ", n)

151

152

6. Spatiotemporal attention ConvLSTM networks

for k, v in d.items(): value = v print(len(value)) tot = 0 NAM_batch_losses_nofire = [] for v in value: tot+=1 value_list = [] for val in v: valu = val.detach().cpu().numpy() value_list.append(valu.tolist()) NAM_batch_losses_nofire.append(value_list)

Next, we derive RMSE from the mean MSE of all epochs. We showcase this step for the overall errors as an example. NAM_batch_losses_all = np.array(NAM_batch_losses_all) CSA_batch_losses_all = np.array(CSA_batch_losses_all) SCA_batch_losses_all = np.array(SCA_batch_losses_all) NAM_batch_losses_all_10T = np.sqrt(np.mean(NAM_batch_losses_all, axis = 0)) CSA_batch_losses_all_10T = np.sqrt(np.mean(CSA_batch_losses_all, axis = 0)) SCA_batch_losses_all_10T = np.sqrt(np.mean(SCA_batch_losses_all, axis = 0))

Next, we create dataframes of all three types of errors for visualization. Time = np.array([1,2,3,4,5,6,7,8,9,10]) all_data = {‘Time’: Time, 'NA-ConvLSTM': NAM_batch_losses_all_10T, 'CSA-ConvLSTM': CSA_batch_losses_all_10T, 'SCA-ConvLSTM': SCA_batch_losses_all_10T} fire_data = {‘Time’: Time, 'NA-ConvLSTM': NAM_batch_losses_fire_10T, 'CSA-ConvLSTM': CSA_batch_losses_fire_10T, 'SCA-ConvLSTM': SCA_batch_losses_fire_10T} nofire_data = {‘Time’: Time, 'NA-ConvLSTM': NAM_batch_losses_nofire_10T, 'CSA-ConvLSTM': CSA_batch_losses_nofire_10T, 'SCA-ConvLSTM': SCA_batch_losses_nofire_10T} df_all = pd.DataFrame(all_data) df_fire = pd.DataFrame(fire_data) df_nofire = pd.DataFrame(nofire_data)

4 Results

153

_, axarr = plt.subplots(1, 3, figsize=(5*6, 10), dpi=100) font = {'family' : 'normal', 'weight' : 'bold', 'size' : 22} matplotlib.rc('font', **font) axarr[0].plot('Time', 'NA-ConvLSTM', data=all_data, marker='o', markerfacecolor='red', markersize=3, color='black', linewidth=1) axarr[0].plot('Time', 'CSA-ConvLSTM', data=all_data, marker='', color='green', linewidth=1, linestyle='dashed') axarr[0].plot('Time', 'SCA-ConvLSTM', data=all_data, marker='+', color='gray', linewidth=1.5, linestyle='dotted') axarr[0].set_xlabel("Time") axarr[0].set_ylabel("RMSE") axarr[0].set_title("Fire and no-fire") axarr[0].legend() axarr[1].plot('Time', 'NA-ConvLSTM', data=fire_data, marker='o', markerfacecolor='red', markersize=3, color='black', linewidth=1) axarr[1].plot('Time', 'CSA-ConvLSTM', data=fire_data, marker='', color='green', linewidth=1, linestyle='dashed') axarr[1].plot('Time', 'SCA-ConvLSTM', data=fire_data, marker='+', color='gray', linewidth=1.5, linestyle='dotted') axarr[1].set_xlabel("Time") axarr[1].set_title("Fire") axarr[2].plot('Time', 'NA-ConvLSTM', data=nofire_data, marker='o', markerfacecolor='red', markersize=3, color='black', linewidth=1) axarr[2].plot('Time', 'CSA-ConvLSTM', data=nofire_data, marker='', color='green', linewidth=1, linestyle='dashed') axarr[2].plot('Time', 'SCA-ConvLSTM', data=nofire_data, marker='+', color='gray', linewidth=1.5, linestyle='dotted') axarr[2].set_xlabel("Time") axarr[2].set_title("No-fire")

4.2 Physical interpretation The temporal feature importance graphs based on integrated gradients demonstrate how the importance varies in the nonattention and attention-based models. Interestingly, nonattention ConvLSTM finds all features except wind (h) as the most important features since the beginning of the input sequence, and wind becomes the most important variable

154

6. Spatiotemporal attention ConvLSTM networks

FIG. 10

Temporal patterns of the input sequence showing average feature values (not weights) and gradient-based average feature importance from nonattention, CSA, and SCA ConvLSTM models.

to predict fire at t + 11 time-step. However, attention-based models generally showed varying patterns for different features, although the pattern with wind (h) at T10 matches across all models (Fig. 10).

5 Conclusions Global wildfires are rising and causing billions of dollars in damage annually. Accurate prediction of wildfires and a better understanding of their spreading dynamics at successive time-steps are indispensable for effective fire-fighting and longer-term fire management efforts. However, wildfire spread predictions with higher accuracy remain a challenging problem primarily due to their stochastic and dynamic nature. Toward addressing this prediction problem, physical models based on the laws of physics and chemistry are too mechanistic and data-intensive, i.e., they require physical documentation of fire ecology and atmospheric processes. On the other hand, completely empirical models may suffer from poor generalizability across environmental gradients. In the past decade, wildfire spread modeling has grown from fully physical models to data-driven models (Green et al., 2020), particularly DL models such as convolutional neural networks. However, relatively less focus has been placed on capturing spatiotemporally changing relationships among biophysical factors that drive fire-front movement across a heterogeneous landscape with varying fuel density, temperature, moisture condition, wind direction, and topography. The models demonstrated in this chapter attempted to address the challenge of accurate spatiotemporal prediction and interpretation of wildfire spread dynamics in such conditions. Specifically, we contribute to the growing

References

155

literature in sequence-to-sequence modeling based on attention mechanisms that predict complex spatiotemporal events such as wildfires. The attention-based Convolutional Long Short-Term Memory (ConvLSTM) deep neural networks presented here showed that the convolutional block-based attention mechanisms (e.g., spatial-channel and channel-spatial) in ConvLSTM can offer lower errors for predicting fire-front progression at sequential time-steps compared to its nonattention benchmark. However, we also found that prediction performance can suffer from overestimation in nonfire locations. Nevertheless, such issues are not entirely surprising as the fire-spread rate tends to be variable and selective of the vegetation areas it burns. Future studies are warranted to address the overestimation problem and investigate the time-dependent process of wildfire spread using empirical datasets. The attention-based ConvLSTM approaches demonstrated here should generally apply to predicting and interpreting spatiotemporal dynamics of other spatial events, such as traffic flow. With traffic flow prediction, the goal is to analyze historical traffic conditions (e.g., vehicle density and speed) and their spatial patterns to estimate future traffic conditions (Cai et al., 2020; Xu et al., 2020). Capturing spatiotemporal dependency of traffic conditions through attention mechanisms will be key to effectively predicting where and when traffic congestions will occur. The application of different self-attention mechanisms (Lin et al., 2020) to capture long-range spatial dependencies is another research avenue that can be explored in future research.

6 Assignment • Investigate the effects of different kernel sizes in the CBAM module on the overall, fireonly, and no-fire predictions. • Examine prediction accuracies by replacing the least squares loss function with a more sophisticated probabilistic model. • Investigate model performance by increasing ConvLSTM and Conv layers in the encoder and/or decoder. • Examine how well both the nonattention ConvLSTMs and attention-based ConvLSTMs predict and interpret real-world wildfire spread dynamics at varied spatiotemporal scales.

7 Open questions ● What other attention mechanisms, such as self-attention (Vaswani et al., 2017), could be useful for more accurately modeling wildfire spread dynamics? ● Whether and why self-attentions may contribute to overestimation or underestimation of fire-front location and fire intensity?

References Burge, J., Bonanni, M., Ihme, M., Hu, L., 2020. Convolutional LSTM neural networks for modeling wildland fire dynamics. Retrieved from http://arxiv.org/abs/2012.06679.

156

6. Spatiotemporal attention ConvLSTM networks

Cai, L., Janowicz, K., Mai, G., Yan, B., Zhu, R., 2020. Traffic transformer: capturing the continuity and periodicity of time series for traffic forecasting. Trans. GIS 24 (3), 736–755. https://doi.org/10.1111/tgis.12644. Coen, J.L., Cameron, M., Michalakes, J., Patton, E.G., Riggan, P.J., Yedinak, K.M., 2013. WRF-FIRE: coupled weatherwildland fire modeling with the weather research and forecasting model. J. Appl. Meteorol. Climatol. 52 (1), 16–38. https://doi.org/10.1175/JAMC-D-12-023.1. Finney, M.A., 1994. FARSITE: a fire area simulator for fire managers. In: The Proceedings of The Biswell Symposium, pp. 55–56. Retrieved from http://www.firemodels.org/downloads/farsite/publications/Finney_1995_PSWGTR-158_pp55-56.pdf. Finney, M.A., Cohen, J.D., Forthofer, J.M., McAllister, S.S., Gollner, M.J., Gorham, D.J., et al., 2015. Role of buoyant flame dynamics in wildfire spread. Proc. Natl. Acad. Sci. U. S. A. 112 (32), 9833–9838. https://doi.org/10.1073/ PNAS.1504498112. Green, M.E., Kaiser, K., Shenton, N., 2020. Modeling wildfire perimeter evolution using deep neural networks. Retrieved from http://arxiv.org/abs/2009.03977. Jain, P., Coogan, S.C.P., Subramanian, S.G., Crowley, M., Taylor, S., Flannigan, M.D., 2020. A review of machine learning applications in wildfire science and management. Environ. Rev. 28 (4), 478–505. https://doi.org/10.1139/er2020-0019. Kozik, V.I., Nezhevenko, E.S., Feoktistov, A.S., 2014. Adaptive prediction of forest fire behavior on the basis of recurrent neural networks. Optoelectron. Instrument. Data Process. 50 (4), 395–401. https://doi.org/10.3103/ S8756699014040116. Lin, Z., Li, M., Zheng, Z., Cheng, Y., Yuan, C., 2020. Self-attention ConvLSTM for spatiotemporal prediction. Proc. AAAI Conf. Artif. Intell. 34 (07), 11531–11538. https://doi.org/10.1609/aaai.v34i07.6819. Mandel, J., Beezley, J.D., Kochanski, A.K., 2011. Coupled atmosphere-wildland fire modeling with WRF 3.3 and SFIRE 2011. Geosci. Model Dev. 4 (3), 591–610. https://doi.org/10.5194/gmd-4-591-2011. Quinn, N., Bates, P.D., Neal, J., Smith, A., Wing, O., Sampson, C., et al., 2019. The spatial dependence of flood hazard and risk in the United States. Water Resour. Res. 55 (3), 1890–1911. https://doi.org/10.1029/2018WR024205. Radke, D., Hessler, A., Ellsworth, D., 2019. Firecast: leveraging deep learning to predict wildfire spread. In: IJCAI International Joint Conference on Artificial Intelligence, 2019-August, pp. 4575–4581, https://doi.org/ 10.24963/ijcai.2019/636. Rothermel, R.C., 1972. A Mathematical Model for Predicting Fire Spread. Research Paper INT-115, US Department of Agriculture, Intermountain Forest and Range Experiment Station, Ogden, UT 84401. Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.C., 2015. Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems, 2015Janua (June), pp. 802–810. Subramanian, S.G., Crowley, M., 2018. Using spatial reinforcement learning to build forest wildfire dynamics models from satellite images. Frontiers in ICT 5 (APR), 1–12. https://doi.org/10.3389/fict.2018.00006. Sun, Z., Sandoval, L., Crystal-Ornelas, R., Mousavi, S.M., Wang, J., Lin, C., John, A., 2022. A review of earth artificial intelligence. Comput. Geosci., 105034. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., et al., 2017. Attention is all you need. In: Advances in Neural Information Processing Systems, 2017-Decem(Nips), pp. 5999–6009. Woo, S., Park, J., Lee, J.Y., Kweon, I.S., 2018. CBAM: convolutional block attention module. In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 11211 LNCS, pp. 3–19, https://doi.org/10.1007/978-3-030-01234-2_1. Xu, M., Dai, W., Liu, C., Gao, X., Lin, W., Qi, G.J., Xiong, H., 2020. Spatial-temporal transformer networks for traffic flow forecasting. ArXiv, pp. 1–14.

C H A P T E R

7 AI for physics-inspired hydrology modeling Andrew Bennett University of Arizona, Tucson, AZ, United States

1 Introduction and background There is little doubt that machine learning-based methods are a valuable framework for hydrologic modeling (Nearing et al., 2021; Shen, 2018). Before diving into recent applications and their implications, we would like to summarize at a very high level some of the aims and applications of modeling the hydrologic cycle. To do so, it is first worthwhile to discuss the hydrologic cycle as a whole, not only for the general reader but also to put the work we will undertake in this chapter into a proper frame of view. The goal of hydrologic science, broadly stated, is to better understand both the fate and role of water on land surface processes. These processes comprise a diverse set including streamflow in rivers, soil moisture stored in soil, snowpack stored on the land, groundwater storage in the deep subsurface, transport of both nutrients and contaminants in the subsurface mediated by subsurface water availability, biogeochemical cycling of plants and bacteria which are dependent on water availability, landscape morphology including erosion and weathering, and more. While there are clearly a large number of scientifically interesting open problems from a pure-research point of view in hydrology, it is also clear that there is an absolutely critical role that water plays in human life and our societies broadly. As such, it is necessary that we have adequate ways to model and forecast the aforementioned quantities to ensure both quality and quantity of water supplies for human consumption and infrastructure, not to mention to maintain ecosystem health (or rather, to avoid damaging them via human activities). In this chapter, we want to demystify and provide a basic set of tools for developing “physics inspired” hydrologic models. Indeed, the terminology “physics inspired” is overloaded, especially in the context of AI and machine learning. While there are enormous opportunities for future research projects to meld traditional hydrology knowledge into datadriven approaches via machine learning, it can be difficult to get up to speed in

Artificial Intelligence in Earth Science https://doi.org/10.1016/B978-0-323-91737-7.00006-2

157

Copyright # 2023 Elsevier Inc. All rights reserved.

158

7. AI for physics-inspired hydrology modeling

understanding how these approaches relate to each other. In this chapter, we will build up from basic principles how to encode hydrologic theory into machine-learning frameworks, with the PyTorch ecosystem as an example. To this end, we want to be clear in our goals: this chapter is mainly focused on explaining theoretic constructs and methods as well as providing a set of working code developed from the ground up rather than focusing on obtaining state-of-the-art performance metrics in our finally constructed model. Concretely, our goal in this chapter is to lay out some ways in which technologies from machine learning can be merged with traditional hydrologic modeling. As this chapter is mainly aimed at Earth scientists looking to find concrete workflows of machine-learning technologies, we must constrain the hydrologic point of view. In doing so, we have chosen to focus on catchment hydrology and the prediction of streamflow. This is a popular, and highly successful, application of machine learning in hydrology due to a number of well-curated datasets, a clear problem statement, and the clear ability for data-driven models to outperform simple conceptual formulations of the physical systems. The recent rise in such applications can be seen in a large number of studies (Kratzert et al., 2019; Gauch et al., 2021; Thapa et al., 2020; Mai et al., 2022). One of the main criticisms of these purely data-driven approaches is that they may not adhere to physical principles such as mass or energy balance (although this relaxation may be one of the reasons for their high performance), and do not offer any guarantee that the resulting trained model will have any obvious interpretability. While there have been a number of explorations in using “explainable artificial intelligence” methods (XAI) in hydrology to validate trained machine-learning models ( Jiang et al., 2022; Schaeffer, 2017), these models still are almost always designed for a particular purpose such as simulating streamflow time series for a single catchment. At the time of writing this, it is very rare for machine-learning models in hydrology to be trained to output multiple variables, and even when they do they are often not explicitly linked in how they are calculated as is done in both conceptual and physics-based hydrologic models. The explicit nature of non-ML hydrologic models is also appealing because they can (1) be used for multiple purposes and (2) allow us to see how resulting changes in one quantity affect changes in another. For instance, you can directly explore how a decrease in simulated snowpack affects the simulated streamflow in most commonly used hydrologic models. However, these criticisms must be kept in check with the overwhelming reality that in only a short number of years the ML-based modeling approaches have easily and significantly improved our ability to model quantities of hydrologic interest including streamflow, soil moisture, evapotranspiration, snowpack, and groundwater levels. Because of the consistent and significant advantages in the predictive capabilities of ML-based models as criticisms around their physical interpretability or ability to directly model intermediate processes, there is a growing interest in finding hybrid approaches which blend ideas from traditional hydrologic modeling and machine learning. The interest in hybrid approaches is often appealing due to numerous innovations of such approaches in other fields such as atmospheric science, chemistry, and physics. Because the field of hybrid-machine-learning methods is both new and rapidly evolving, there are numerous terms being used for similar things. For the sake of clarity, we will provide a set of definitions for these different approaches. While we hope that laying out these definitions up front clarifies the landscape, we also must admit that our view has a certain amount of bias, and may not be agreed upon by everybody but at least provides a consistent viewpoint.

1 Introduction and background

159

The terms which we wish to disentangle are Knowledge Guided Machine Learning (KGML), Neural Ordinary Differential Equation (NeuralODE), Physics Inspired Neural Networks (PINN), and Hybrid Models. KGML largely is an informal set of approaches, which seems to span many areas of the other terms as well as being more general. KGML-based methods can incorporate methods of PINNS, NeuralODEs, and use Hybrid Models, as well as refer to activities such as feature selection and engineering based on domain-specific knowledge. The NeuralODE method was popularized by Chen et al. (2018) and refers to a specific neural network architecture where, rather than specifying discrete hidden layers, the hidden state of the network is specified via an ordinary differential equation (ODE) where the form is not explicitly specified or designed for each particular application as in hydrologic or other more traditional modeling efforts. NeuralODE-based methods have been successful in generic machine-learning settings, particularly for Continuous Normalizing Flows (CNFs, which are used in generative modeling to sample from very complex probability distributions) and time series modeling with irregular observations/inputs. On the other hand, in PINNs, the architecture of the network is open to definition, but the loss function is set up to satisfy a particular equation form. Most successfully, this method has been used to learn to simulate complex partial differential equations (PDEs, Raissi et al., 2019). However, application of PINNs in the broader Earth sciences has not seen large practical uptakes due to their difficulty in training them (Krishnapriyan et al., 2021). Finally, hybrid models are those, which contain both a data driven component as well as a traditional ODE or PDE-based component. Largely, these approaches to hybrid models can be summed up into three camps. First is putting a neural network into a larger computational model. This is popular in the atmospheric science community to try to resolve processes, which happen at scales smaller than the computational element (or grid cell) to improve predictions without the need for higher resolutions, which is very computationally costly. Examples of this type of approach include Brenowitz and Bretherton (2018), Rasp et al. (2018), Beucler et al. (2020), Zhao et al. (2019), and Bennett and Nijssen (2021). On the other hand, you also might switch the ordering and put a particular parametric form of an ODE or PDE into a broader neural network framework. This has become more popular in the hydrology community, particularly because it is easier to make models of particular locations rather than needing to resolve the entire planet with each prediction step. Examples of this include Jiang et al. (2020) and Kraft et al. (2021). This is the approach that we take in this chapter. Finally, there is a third general approach to hybrid modeling, which chains together model types, such as using a data-driven approach to either pre/post-process data to/from a more traditionally based model. Examples of this include Frame et al. (2021), Feigl et al. (2020), and Tian et al. (2018). Of course, these are broad categorizations, and there are many overlaps and fuzzy boundaries, which make exact definitions elusive. With a broad overview of the different approaches to physics inspired machine learning, we now turn to the outline of what we will describe in this chapter. The end goal of this chapter is to demonstrate how to build a conceptual hydrologic model parameterized by PyTorch constructs that interoperate with the broader machine-learning infrastructure of optimizers and automatic differentiation. This chapter will begin with a brief introduction to the PyTorch ecosystem and is followed by a primer of automatic differentiation, which is one of the modern foundations of machine learning. It then conducts another brief overview on numerical

160

7. AI for physics-inspired hydrology modeling

optimization. The combination of automatic differentiation and optimization provides the basic underlying tools to start building models, which blends solving domain-specific ODEs with simple neural networks. We first demonstrate the principle with a simple nonlinear reservoir model and synthetic data to help understand how all of the parts operate together. Finally, we construct a variant of a conceptual hydrologic model, which is parameterized by PyTorch parameters and then train the model in the same fashion as a neural network. We examine this model’s performance and show how this approach is easily interpretible and explore the intermediate processes such as evapotranspiration and soil storages to better understand what our trained model is doing. Finally, we end with concluding remarks and offer some modifications and exercises you might complete to solidify your understanding and build on these principles. A note on software versions This tutorial was designed to be able to run as an interactive Jupyter notebook. We provide a conda environment and all of the necessary data that you need to run this end to end. Along the way we recommend that you play with specific numbers and try modifications of each section to see their effects. In a standard fashion, our first code cell is handling some imports and code setup cells. However, if you are simply reading the text of this chapter without the full computational environment the library versions for the code that we use are as follows: • • • • • •

matplotlib: 3.2.2 numpy: 1.21.5 xarray: 2022.3.0 tqdm: 4.62.3 torch: 1.10.1 torchdiffeq: 0.2.2

2 PyTorch and autodifferentiation 2.1 Getting started with PyTorch We will use the PyTorch (https://pytorch.org/) as our machine-learning framework. PyTorch is one of the most popular machine-learning frameworks and has been used in numerous applications (Paszke et al., 2019). It is most commonly used to build neural networks from existing building blocks, and also provides a great deal of support for developing novel architectures and components. We will use both aspects of PyTorch in this chapter; so, it is good to start with some of the basics. Like many other Python packages PyTorch has a

2 PyTorch and autodifferentiation

161

number of useful modules. Here, we will quickly go over the ones that we need to build, train, and evaluate our models. • torch: This is the base PyTorch module, and provides access to fundamental components like the array interface torch.tensor. Note that you can consider a tensor to simply be an array for the remainder of this chapter. The base torch module also includes functions, which operate on the tensors (such as torch.sum and torch.abs). Among other things, it also has functionality to build arrays with many similarities to the numpy library (e.g., torch.arange and torch.ones). The need for a separate implementation of array operations will become clear when we discuss automatic differentiation in the next sections. • torch.nn: This module provides base implementations of many standard neural network architecture components, or layers. • torch.nn.functional (imported as F): This implements a number of functions, which are “functional,” meaning they do not retain any state. These functions are often used outside of the neural network definitions. • torch.autograd: This implements automatic differentiation routines. The base module provides two main functions torch.autograd.grad, which returns the sum of gradients of the output of a function with respect to the inputs, and torch.autograd.backward, which computes the sum of the gradients of the input with respect to the outputs of a function. We will see the uses of both of these functions in the next section, but will not be using this module explicitly. • torch.autograd.functional: This implements some building blocks for writing numerical optimization routines (among other uses). We will make heavy use of the jacobian function here, but the other functions can be used to implement more advanced algorithms. • torch.utils.data: This implements the Dataset and DataLoader classes, which will be used later in this chapter to load real-life hydrologic data to train our models. To complete the setup of our environment we will import these packages as well. Then, we will set some basic configuration options. First, the device will be set, which specifies the location where computation will happen on the computer. For the sake of accessibility in this chapter, it will be run as a cpu computation, but for more advanced uses of ML in the real world using a GPU will be much more performant. If you are running this chapter interactively and have an environment which can use a GPU you may see it used here. Next, we set the dtype, which is the data type that we will use throughout. Here, we set torch. float32, which is a single precision floating point number. Most standard numerical implementations default to double precision (or float64), which require twice as much memory per value and are also slower to compute. Because machine learning is often computationally expensive, we default to using float32, which is fairly common. Recent advances in processing capabilities on certain devices like GPUs and TPUs have also driven adoption of the float16 or short to further improve the throughput of models. To get started, we create an example tensor vector with the torch.arange function. This will be used as our example domain in the remainder of this section. As you can see, this looks very similar to a numpy array, but it has some special properties like recording its derivative with respect to mathematical operations.

162

7. AI for physics-inspired hydrology modeling

2 PyTorch and autodifferentiation

163

2.2 Autodifferentiation theory As alluded to in the previous section, one of the foundations of machine-learning frameworks is automatic differentiation (which we will refer to simply as autodifferentiation or autodiff ). This is the underlying numerical technique that makes the backpropagation algorithm work, which allows for neural networks to be trained. This technique can also be used in other numerical optimization problems such as solving differential equations. We will see how this works out in subsequent sections, but first we will provide a very brief overview of how autodiff works for simple scalar valued functions. For a more complete treatment, see Nocedal and Wright (2006). As an example, consider the simple function: f ðxÞ ¼ sinðx2 Þ+ cosðx2 Þ We can easily compute the derivative of this function using our toolbox of derivative tricks from calculus 1 to find  df d ðxÞ ¼ sin ðx2 Þ + cos ðx2 Þ dx dx d d ¼ sin ðx2 Þ + cos ðx2 Þ dx dx d sin ðx2 Þ dðx2 Þ dcos ðx2 Þ dðx2 Þ + ¼ dðx2 Þ dx dðx2 Þ dx ¼ 2xð cos ðx2 ÞÞ + 2xð sin ðx2 ÞÞ ¼ 2xð cos ðx2 Þ  sin ðx2 ÞÞ Notice here what we did intuitively was break the equation down until each “elementary” part had a single derivative to calculate with respect to its input. Then, we simply use the rules for differentiation of these elementary parts to complete the equation, and finally simplify the algebra. Autodiff works by doing something similar, breaking apart the equation to be differentiated into the elementary parts in what’s known as a “computational graph.” This graph can then be operated on atomistically, solving each part by the known rules. The actual way to compute this can be done in one of two ways, namely “forward mode autodiff” and “reverse mode autodiff.” The graph of our example function looks like Fig. 1.

FIG. 1 The computational graph for sinðx2 Þ+ cosðx2 Þ.

164

7. AI for physics-inspired hydrology modeling

For the forward mode autodifferentiation, the program starts at the beginning (left) of the graph and makes use of a construct called “dual numbers” to accumulate derivatives in the forward pass. The implementation of “dual numbers” is beyond the scope of this chapter; for a reference implementation see Kochenderfer and Wheeler (2019). For practical purposes, you can think of them as a data type that records its value and its derivative with respect to the previously applied operation. To calculate the derivative, we take the value of the function and its derivative is recorded. So, for example at x ¼ π the input node we have dx dx ¼ 1. Then, da at the output of the next layer of computational nodes we have a ¼ x2 ¼ π 2, we have dx ¼ 2 2 db 2x ¼ 2π and, likewise, b ¼ x ¼ π ; so, dx ¼ 2x ¼ 2π. Similarly, after the trigonometric function dc nodes we have c ¼ cosðaÞ ¼ cosðπ 2 Þ and da ¼ sinðaÞ ¼ 2π  cosðπ 2 Þ and d ¼ sinðbÞ ¼ sinðπ 2 Þ dd db 2 so db ¼  dx cosðbÞ ¼ 2π  sinðπ Þ. Finally, the sum gives us the output of the derivative; so, df ðx ¼ πÞ ¼ 2π  cosðπ 2 Þ+2π  sinðπ 2 Þ ¼ 2πðsinðπ 2 Þ  cosðπ 2 ÞÞ  2:97 dx For the reverse mode autodifferentiation, we begin at the end of the computational graph, tallying up each of the derivatives needed along the way. Mathematically, this comes out to be df df dc5 ¼ dx dc5 dx   df dc5 dc4 dc5 dc3 + ¼ dc5 dc4 dx dc3 dx      df dc5 dc4 dc2 dc5 dc3 dc3 ¼   + dc5 dc4 dc2 dx dc3 dc1 dx Chances are this individual example will not be entirely clear if you have not seen these concepts before, and if you want to dive in deeper we recommend both Kochenderfer and Wheeler (2019) as well as Nocedal and Wright (2006), as previously referenced. In the mean time, it is entirely possible to sidestep the theory of automatic differentiation as we will build up confidence for the method in the following sections.

2.3 Practical use of autodifferentiation in PyTorch Without diving too much deeper into the theory that drives autodiff or the details of its implementation we can hand-wave some of this away by using the built-in implementation in PyTorch. For the sake of this chapter, we will be focusing on scalar valued functions, which means that the Jacobian of a function is simply its derivative. So, making use of the jacobian function that we imported earlier is simply a means of getting the derivative of a scalar valued function at some point, which we want to evaluate. Just to get a feel for how the jacobian function works, let us look at some examples where we have analytic solutions. Here, I show that the autodiff calculation of the derivatives of both ReLU and hyperbolic tangent are equivalent to their analytic counterparts. First, let us look at the ReLU (or Rectified Linear Unit) function, a common nonlinearity introduced into neural networks. Its functional form and derivative are quite easy to write down: ReLUðxÞ ¼ maxð0, xÞ

2 PyTorch and autodifferentiation

165

and d ReLUðxÞ ¼ 1 ifx > 0, else 0 dx The ReLU function is available in the pytorch.functional (or F) module. We can implement the derivative by hand quite easily as drelu and finally compare our hand-done derivative to one computed by the jacobian function on our set of test inputs, v. Note that you have to iterate over individual values of v in a loop to make the computation on the scalar valued function. Providing the full vector, v, as the second argument to jacobian will result in the full Jacobian matrix because PyTorch is treating the ReLU function as a vector function rather than a scalar function. If you were to simply run the jacobian function over v then what you would find is that the diagonal of the computed matrix would contain the derivative values that we are after. The call to np.allclose simply makes sure that both sets of derivatives are numerically equal. A note on calculating derivatives of functions in PyTorch It is also worth noting here that the autograd functionality in PyTorch is still being developed and there are several other ways to automatically compute the derivative of a function with PyTorch. The syntax that we use here is not the most performant way to do this computation, but should be easy to understand compared to some of the other, faster methods.

Next up, we will try another common activation function, the hyperbolic tangent. The code is almost exactly the same as before, but now we were going to evaluate against the analytic derivative: d 1 tanh ¼ dx cosh 2 ðxÞ Again, we evaluate to make sure that the autodiff version is numerically equivalent to the hand-calculated derivative.

166

7. AI for physics-inspired hydrology modeling

Great! These both work out of the box. Now, can we take derivatives of simple neural networks? Let us find out. Here, we will define a basic feedforward type network. Rather than initializing weights randomly from a distribution, we will specify the values so that we know what the derivative should be. Also note that our activation functions will be the ReLU and hyperbolic tangents that we know the jacobian function works on, but you could expand this as hinted at in the exercises for this section. For the first test, we will just do the derivative of a neural network that consists of a single neuron with the hyperbolic tangent activation. This is equivalent to the previous test we did just to make sure that we could use the jacobian function as intended. Now you will note there are some subtle things happening in this code. First, we now use the nn.Tanh class, rather than the F.tanh function. This is generally considered a better practice when putting activation functions into a neural network, rather than using the functional form. This is because it is a subclass of the nn.Module class, which is a basic building block of PyTorch neural networks. Second, we have to initialize parameters for the nn.Linear layer, which has a weight and bias. For our network where width is set to 1, this corresponds to a linear equation where weight is the slope and bias is the intercept. We default the weight to be 1 and the bias to be 0, which encodes the function f(x) ¼ x. When the forward method is used on the Neuron, this invokes the forward computation, which in our first example boils down to NeuronðxÞ ¼ tanhðf ðxÞÞ ¼ d tanhðxÞ. And thus dx NeuronðxÞ ¼ cosh1 2 ðxÞ as before. We verify this via the plot. Note here that while we could derive what we wanted to happen with the derivative, it is important to make sure that the introduction of the nn.Module machinery and creating a new instance class could still preserve the numerical relations that we wanted to encode. This quick test of something we previously knew to be true still holds, and this makes us much more confident that our code is correct.

2 PyTorch and autodifferentiation

167

Now that we are getting more confident, let us do a two layer (each with single neurons) network of a ReLU activation followed by a hyperbolic tangent activation. Each of the linear layers still have the weight and bias set to 1 and 0, respectively. Thus, we can write the mathematical function that the network performs as f ðxÞ ¼ tanhðReLUðxÞÞ So, calculating the derivative via the chain rule we have df d dReLU d tanh dReLU d tanh ðxÞ ¼ ðReLU∘ tanh ÞðxÞ ¼ ðxÞ ¼ ðxÞ  ðxÞ dx dx d tanh dx dx dx As we see from the code below, this is reproduced by running the jacobian function on our neural network. This shows that the autodiff implementation can work through deeper networks besides a single layer, as it should.

168

7. AI for physics-inspired hydrology modeling

This is not a surprising result, as deep neural networks are able to be trained via the backpropagation algorithm—but again, it instills confidence that we can use the jacobian function for autodifferentiating complex mathematical constructs that can be assembled with PyTorch. If you are still feeling unsure how these principles work you should try writing out the derivatives for networks where the weight and bias are set to values other than 1 and 0. If you want to get further in the weeds, you could try writing a perceptron class which implements multiple neurons in a single layer, and see if you can work out the derivative of such a network with three neurons with hyperbolic tangent activations, weights of 2, and biases of 1, whose output dimension is 1. For reference, this network implements the mathematical function: f ðxÞ ¼ 3  tanhð2x + 1Þ

3 Extremely brief background on numerical optimization

169

and the neural network implementing this is

3 Extremely brief background on numerical optimization To progress beyond just implementing existing machine learning-based models that can regress on quantities of interest for hydrology, we need to take yet another brief detour. This detour is a bit longer than the previous, but consider it essential to understanding the methods. We want to emphasize that the field of numerical optimization is evolving nearly as quickly as the rest of machine learning. This section will begin with a quick overview of first-order gradient-based optimization, which provides the basis for the most commonly used optimization strategies for training neural networks (examples include stochastic gradient descent [SGD], Adam, etc.). By including them we hope to emphasize that optimization is not unique to machine learning- and data-driven methods, but a core and essential component to performing numerical computation of complex systems. From here, we progress to second-order methods, which are the bread-and-butter of introductory numerical analysis and computational modeling courses. These optimization strategies form the backbone of many common numerical ODE and PDE solvers. These ODE and PDE solvers, in turn, are almost exclusively what we refer to as “physics-based” models. At this point, you might be asking, why are first-order methods implied for neural networks while second-order methods implied for differential equations? This is a good question to ask indeed, but once again it is beyond the scope of this chapter. We refer interested readers to Nocedal and Wright (2006), Kochenderfer and Wheeler (2019), Goodfellow et al. (2016), and Isaacson and Keller (1994).

170

7. AI for physics-inspired hydrology modeling

3.1 First-order methods: Gradient descent and other flavors for training neural networks The language of optimization is often rooted in topography—if you want to get to the bottom of a valley you must head downhill after all. Simply put, gradient-descent methods are often called first-order methods because they make use of the first derivative (a.k.a. the slope of the valley) to find the direction to head toward the optimal point (the bottom of the valley). Considering such an approach from an intuitive standpoint brings up several questions— Should I always go in the direction of the steepest slope? How far should I continue before considering changing to another direction? When is it okay to go up instead of down? These are predominantly the questions that the different popular optimization strategies for training neural networks are concerned with, albeit in a more formalized mathematical sense than we have let on here. To get away from the intuitive and into the formalism, if we are trying to minimize a function f we ought to head in the opposite direction of the gradient rf. To do so we might choose a direction, d, that maximizes the gradient via dðxÞ ¼ 

rf ðxÞ k rf ðxÞ k

where kk implies the Euclidean norm (also known as the squared error). Using the previous implementation of the autodiff derivative/gradient, you might see how we can implement this in code. However, our implementations have been for scalar (or one-dimensional) functions, and this becomes a very difficult problem when going to higher dimensions. In practical machine-learning optimization strategies, a number of things have become standard to make optimizing large neural networks possible. First, the actual gradient calculation is often approximated to reduce the computational burden of computing highdimensional gradients (computational cost is also a main reason for reaching for first, rather than second-order methods for training neural networks). Additionally, the learning rate or roughly amount of time that you walk in the downhill direction before changing course is often not fixed. This is implemented in different ways for different methods, but popular choices include Adam and RMSProp (Kingma and Ba, 2017; Ruder, 2017). Another view on augmenting the speed at which the optimizer heads downhill is the concept of momentum, which operates in line with the physics basis of the terminology. Optimization steps with high momentum tend to be difficult to change direction rapidly, which is favored when the topography of the optimization landscape is relatively smooth. On the other hand, low-momentum steps should be adopted when the optimization landscape is very jagged, meaning each step is important to consider to make sure the optimization does not get trapped in a local minimum or overshoot out of the valley and into the mountains.

3.2 Second-order methods: Standards for numerical solutions to differential equations Many questions of first-order optimization can be answered by second-order optimization, but with a very large number of caveats. As first-order optimization uses the first derivative to

3 Extremely brief background on numerical optimization

171

find the direction of search, second-order methods use the second derivative to determine the search strategy. Recall how the first-order optimization strategies implied questions of how far and which direction to actually step—second-order strategies alleviate some of these concerns by providing information about not only the slope of the optimization landscape but also the curvature. The basis for second-order methods is called Newton’s method and can be implemented for a scalar value function f first by taking the second-order expansion about a test-point xi i 2

ðx  x Þ  f 00 ðxi Þ f^ðxÞ  f ðxi Þ+ðx  xi Þf 0 ðxi Þ+ 2 Using this equation, we can iterate an update equation until some tolerance has converged or a maximum number of iterations has occurred. For the (i + 1)th step of the iteration process, the iteration becomes xði+1Þ ¼ xðiÞ 

f 0 ðxðiÞ Þ f 00 ðxðiÞ Þ

Newton’s method will converge quadratically, much faster than the linear convergence of first-order methods, provided some very specific criteria are met. These criteria often revolve around choosing the initial test point x(0) as well as the curvature of the function being optimized. Newton’s method is notoriously simple to derive but difficult to make work in practice, meaning it is only used in its simplest form in simple applications. Many extensions of Newton’s method exist but are beyond the scope of this chapter. Generally, we urge scientists not to implement their own numerical solvers, and to rely on the numerous and well-tested packages for solving such problems. However, in our case, we must be able to implement numerical solvers in a way that interoperates with autodiff and backpropagation via the PyTorch package. We also want to provide reference implementations that make it clear how the principles operate in code. That said, the torch.optim default package implements a host of standard (generally first-order) solvers, pytorch-optimizers (pip installable as torch_optimizer) implements a wide range of other optimizers, and torchdiffeq offers some other optimizers for numerically solving differential equations with autodiff capabilities within Python, which we will rely on in later portions of this chapter. Below we provide a bare-bones implementation of a Newton’s method iterative solver for second-order optimization. You will note it takes a function (f), its derivative (fprime), a test point (x), and some tolerance/iteration criteria. As mentioned before, choice of the test point, x, is critical for convergence in complex landscapes. For this reason, as well as computational complexity, we will continue to focus on scalar valued functions. A view of other autodiff packages We also would remiss if we did not mention the rapidly developing capabilities of the JAX Python framework that provides high-performance low-level primitives for autodiff of arbitrary Python code and associated frameworks such as haiku, equinox, and diffrax. Similarly, we must acknowledge the broader Julia programming language ecosystem, which has the excellent Flux.jl and DifferentialEquations.jl packages. Both sets of ecosystems/packages implement performant and state-of-the-art packages for machine-learning, numerical analysis, and solving differential equations and arbitrary differential optimization at large.

172

7. AI for physics-inspired hydrology modeling

For confidence, let us do an easy one, solving for the minimum of a parabola. We do not have a need for the jacobian function, neural networks, or anything fancy; we will write it out from scratch. To do so we will define the function f as well as the hand-done derivative fprime explicitly. We will then apply it to a random points on the domain x  [1, 1]. As you can see from the resulting plot, our initial guess (poorly chosen by design) is very far off of the minimum while the minimum is, to our eye, very close to the actual minimum of the parabolic function.

3 Extremely brief background on numerical optimization

173

Now let us test the solvers on a simple “neural net” defined by a new nn.Module. Keep in mind here the question is—can we use traditional numerical optimization techniques on top of the broader autodiff frameworks that are used to train neural networks. To make this work, we will need to define a neural network, which is likely not useful in any real setting. Let us call it TroughLayer because it essentially represents a “trough” where there is a low point in the middle of the domain surrounded by high walls. This is a carefully designed network with two sigmoid neurons whose weights and biases were chosen so that the output is what we want. The definition of the TroughLayer follows below.

Despite being a contrived setup, we can take the jacobian function and apply it to the network to get fprime so that we can put it into the Newton solver. As you will see, we find a

174

7. AI for physics-inspired hydrology modeling

generally good minimum. There was some tuning of the tolerance here; if you set it too low you will diverge. This “network” encodes a function that is quite hard for the standard Newton iteration to converge on the global minimum. As an exercise, you should try different values of x_init as well as determining when it converges with respect to the actual derivative (and second derivative if you were ambitious). This function should give you a good overview of some of the difficulties of numerical optimization in a 1-d setting. In the meantime, we can simply bypass these concerns because most of the time it works.

3.3 Brief detour on numerically solving ODEs At this point, you might be wondering what all of this curve traversing is about. Finally time to bring it back—we were trying to merge some machine-learning methods with hydrologic modeling methods. As always, we should begin by starting with the simplest approach. To do so, we will describe the linear reservoir model, implement it with the tools we have built up, and show how we can easily solve it. However, it is worth pointing out that the model we will develop here still does not quite get us to a useful hydrologic model at a catchment scale. This is because a single linear reservoir cannot capture the richness of topography and vegetation that we see in the real world. But, it does provide a useful stepping stone toward more complex models. Once we have seen we can numerically solve the linear reservoir model with the tools we have built up we will relax the constraint

3 Extremely brief background on numerical optimization

175

of linearity. We will show that we can learn a nonlinear conductivity curve for a reservoir operation release in this idealized model using a hybrid physics/data-driven approach. Let us get started! The hydrologist’s favorite: The linear reservoir model It is the hydrologist’s favorite model! We can pretty easily solve this one analytically; so, let us use it to make sure that our solvers are capable of producing good solutions. The linear reservoir model (with no inflows) is given by the equation: dS ¼ k  SðtÞ dt where S(t) is the storage of the reservoir at time t and k is the reservoir conductivity constant with units 1/t. For fixed values of k we can solve this equation analytically, as it is simply an exponential: SðtÞ ¼ Sð0Þ  ek  t where S(0) is the initial storage. You should verify that this equation satisfies the original differential equation. You might note that if k is positive, we get exponential growth and if k is negative, we get exponential decay. In line with some physical intuition, the k value in a “hydrologically flavored” reservoir should be negative, as the water will drain out. To see an example of this, along with comparison of our implementation of a numerical solution using Newton’s method, we will select k ¼ 0.1 and S(0) ¼ 1.0. To solve the equation we will use the implicit, or backward, Euler method, which is one of the simplest methods for numerically solving differential equations. We can estimate the storage at some time ti+1 given the current storage at time ti as Sðti+1 Þ ¼ Sðti Þ+Δt  ðk  Sðti+1 ÞÞ Given this equation, you might see the problem: S(ti+1) is on both sides of the equation, meaning we must make an optimization step to iteratively solve for Si+1. This is where Newton’s method comes in. We can frame this problem as a Newton iteration by rearranging the equation to: 0 ¼ Sðti+1 Þ  Sðti Þ  Δt  ðk  Sðti+1 ÞÞ This equation will be supplied to the Newton iteration (defined as f below). This function is also easy enough to take the derivative of with respect to S(ti+1), which we define as fprime for the sake of example. We also compute an autodiff variant of this derivative to show how we do not need to always hand-calculate the derivatives for every problem we want to solve. As you can see, we can solve this equation pretty well numerically. Sure, there is some discrepancy, which is due to the relative simplicity of our numerical solver. More complex solvers, using higher convergence tolerances and smaller step sizes can more faithfully reproduce the analytical solution at the cost of computational complexity. However, we mainly want to show how to get an end-to-end solution, rather than fine tuning each piece. Let us call this good enough and move on because there is no need to do any machine learning here.

176

7. AI for physics-inspired hydrology modeling

4 Bringing things together: Solving ODEs inside of neural networks

177

4 Bringing things together: Solving ODEs inside of neural networks The nonlinear reservoir model As we said, the linear reservoir is too easy and does not involve much uncertainty that would require any extra fancy machinery. So, to get there, let us change the ODE so that the conductivity term, K, is now dependent on the current storage. We will use the conductivity term as KðSÞ ¼ 0:1  tanhð10  ðS  0:5ÞÞ This represents a conductivity term where, basically, if you have low storage, you start filling up and if you have high storage, you start draining. Steady state is at a nice even value of 0.5. We encode this into our function kx below, and then run it for a bunch of values ranging from 0 to 1 to see the actual curve. The plot is output from the next code cell. Again, values above 0 correspond to the reservoir draining and values below 0 correspond to the reservoir filling.

178

7. AI for physics-inspired hydrology modeling

Given all of our machinery developed so far, we can now solve the nonlinear version for all sorts of initial conditions. Note that we are using autodiff’s jacobian to determine the derivative of the conductivity function K(S), but there is also a commented out version of the hand calculated derivative that you might also use to verify the automated solution.

4 Bringing things together: Solving ODEs inside of neural networks

179

Learning the reservoir conductivity function with neural networks Now, imagine you are actually a hydrologist—you cannot directly measure the conductivity of the “reservoir” but you can measure storage levels. If you are interested in determining K(S) from data you have all sorts of avenues, but the one we were interested in is a neural network. In this case, imagine we “know” what the dynamics look like (a.k.a. the general structure of the ODE defining the system), but we do not know the functional form for K(S). Our neural network, then, will solve the dynamics, and update the weights of a network that represents the conductivity during training. Once trained, we can pull out the network and look at what K(S) was determined to be from the data. Below is the network in question. Note, it is just a simple densely connected network with width and depth hyperparameters. You might modify this to be a more complex structure, but for simplicity let us give this a go.

180

7. AI for physics-inspired hydrology modeling

4 Bringing things together: Solving ODEs inside of neural networks

181

Before we can train the network we need some data—this is where we use our previous solution to generate some synthetic data and we will see how well the network can reconstruct the known conductivity function. I will just run a few timesteps for a bunch of different initial conditions.

We will also define a standard epoch function, which simply iterates over the training data and runs the optimization routine (here, a variant of gradient descent). If you are not familiar with this sort of construct we recommend you go back and work through some basic PyTorch tutorials because designing the training loop is one of the most crucial and common workflows in deep learning.

4.1 Split out the input/output data Our network’s training task is to figure out the storage one timestep later, given its initial storage. We have 10 samples for training, but you could add more by adding more S0s under the Training Data heading or by increasing the number of timesteps in the my_odeint function. You might be thinking 10 samples is way too small to do any machine learning on, but keep in mind we have a very strong inductive bias for how our model operates because we have directly encoded the differential equation and are only attempting to figure out a parameterization of it. If you bump this number up to 100 or 1000, you will see we can pretty much perfectly match the conductivity curve. But, using only 10 samples is conceptually more interesting as it shows how viable this type of approach can be.

182

7. AI for physics-inspired hydrology modeling

4.1.1 Let us train! The NeuralReservoir is set up with some user-defined width and depth settings. We additionally set a learning rate for our optimizer (chosen to be Adam) and finally train a userdefined number of epochs. Following training, we can look at the loss curve to see if the model was able to converge.

4 Bringing things together: Solving ODEs inside of neural networks

183

4.2 What did the network actually learn though? Looking at the loss curve, we can see that the model was able to reduce the loss over a number of epochs, and then started to oscillate in its performance. These oscillations are common during the training process as the optimizer gets stuck in regions of the overall parameter space (like circling a valley). It is possible that the model could be trained to a better overall loss value, but in the interest of conciseness we leave these as possible exercises. Getting back to the question at hand, our network was evaluated on how it was able to predict the next timestep’s storage, but we were interested in getting the reservoir conductivity function, K(S), out. Lucky for us, we can just pull it out with model.K and start inputting storage values. Let us see how we did!

184

4.3 Introducing

7. AI for physics-inspired hydrology modeling

torchdiffeq

Now that you see how we can solve differential equations inside of neural networks and still train them it is time to move beyond home-grown solutions. As solving differential equations is quite difficult to get right and the basic methods tend to break down on harder problems’ mathematicians, scientists, and software engineers have spent a great deal of time developing packages and methods that can be taken “off the shelf.” The torchdiffeq package (https://github.com/rtqichen/torchdiffeq) developed in recent years allows for many of these numerical solvers to be integrated seamlessly with PyTorch (Chen et al., 2018). We can reframe the previous problem of estimating the reservoir constant using this new package quite easily. To do so, we first package up the ODE in a new nn.Module subclass, whose forward method simply implements the right-hand side of the reservoir equation: dS ¼ KðSÞ  S dt The constructor on the ReservoirEquation class takes in a K parameter, which we will represent with a multilayer perceptron as before. Then, we create a new class, TorchDiffEqNeuralReservoir, which solves the ReservoirEquation via the new odeint function, which we just imported. There are some extra details here, such as the solver_method and integration_time, which we will not cover directly here, but are explained in the torchdiffeq documentation at the link given above. Anyhow, as you can see the training procedure is nearly identical to before, as is the training time. Similarly, the extracted conductivity curve still looks reasonably close to the target,

4 Bringing things together: Solving ODEs inside of neural networks

185

though it differs qualitatively from our previous model simply due to the randomness in the training process.

186

7. AI for physics-inspired hydrology modeling

5 Scaling up to a conceptual hydrologic model 5.1 The system of equations Now, we have a working way to train differential equations parameterized by neural networks. It is time to move to something a bit more useful than the nonlinear reservoir example. We will develop a relatively simple conceptual hydrologic model with two storage buckets. We will allow drainage from the “surface” bucket to the “subsurface” bucket as well as “evapotranspiration” to come from the “surface” bucket. Streamflow will be considered the sum of the outflow of the two buckets. In mathematical terms, we can write this as a system of equations:     d S0 P  ET  D  Q0 ¼ D  Q1 dt S1 where S0 is the “surface” bucket, S1 is the “subsurface” bucket, P is precipitation, ET is evapotranspiration, D is drainage from the surface to the subsurface, and Q0/Q1 are the surface/ subsurface discharge. Before defining how each of these calculations is done explicitly, we will also define the following terminology for how certain parameters are handled: σjab ðxÞ ¼ ða  bÞ  σðxÞ+b where x is the name of the parameter, σ is the sigmoid function, and a and b are the bounds of the parameter with a > b. This operation allows us to train the parameters, but not have to strictly constrain them to be roughly on the 1 range. Essentially, this is the sigmoid function scaled between a and b and is employed so that all underlying trainable parameters fall on the same range but can be translated to hydrologically relevant values. We implement this as the HydroParam module:

187

5 Scaling up to a conceptual hydrologic model

To parameterize each of these terms, we will follow some conventional approaches, namely a set of relations described in the Framework for Understanding Structural Errors (FUSE) model (Clark et al., 2008). Namely, we will calculate ET as

ET ¼

σj10 ðpÞ

 PET 

S0

!

σj100 1 ðS0, max Þ

where p is the tunable parameter, and PET is the reference potential evapotranspiration. We implement this equation via the ETTerm:

188

7. AI for physics-inspired hydrology modeling

Drainage is calculated as D¼

σj100 0:01 ðku Þ



S0

!σj10 0:01 ðcÞ

σj100 1 ðS0, max Þ

where ku and c are tunable parameters. We implement this as the DrainageTerm:

Surface flow Q0 is calculated by estimating the saturated fraction (Asat) as !σj3:0 0:001 ðbÞ S0 Asat ¼ 1  1  100 σj1 ðS0, max Þ And the implementation via the SaturatedAreaTerm class:

5 Scaling up to a conceptual hydrologic model

189

The surface flow is simply calculated by multiplying the surface saturation by the incoming precipitation: Q0 ¼ Asat  P And the implementation via the SurfaceFlowTerm class:

Finally, we define the subsurface flow as Q1 ¼

σj10 0:001 ðks Þ



S1

!σj10 0:01 ðnÞ

σj100 1 ðS1, max Þ

And the implementation via the SubsurfaceFlowTerm class:

This defines all of the necessary fluxes for our conceptual hydrologic model! The model structure as defined here was taken from Clark et al. (2008) and was designed to be somewhat analogous to a simplified variable infiltration capacity (VIC, Liang et al., 1994)-type model, though lacking clear energy and water balance interaction terms as well as the obvious omission of snow and vegetation processes beyond a simple “lumped” ET quantity. The implementation of this is all wrapped into our HydroEquation class, as follows below. You will

190

7. AI for physics-inspired hydrology modeling

note here that there are not any user-definable parameters; everything is learned! You might consider the ranges allowed on the parameter values as hyperparameter, but we will not consider trying to tune them specifically here.

5 Scaling up to a conceptual hydrologic model

191

As in the nonlinear reservoir network here we will also wrap up the HydroEquation class so that it is easier to extract what we want out of the solution to the ODE system. We will call this the HydroSimulator since it solves the HydroEquation but allows us to see the time evolution of the system by recording all of the necessary fluxes and states along the way.

192

7. AI for physics-inspired hydrology modeling

5.2 Data With the overall model structure out of the way, we come to the real fork in the road—data! So far we have been working with synthetic data, or just idealized situations. But, as hydrologists and Earth systems’ modelers we cannot live in fantasy land, and so must eventually be confronted by the real world. This means we need some basic infrastructure for training (a.k. a. calibrating) the model structure we have designed against something closer to reality. This requires real data—for this, we will use a subset of the CAMELS dataset (Newman et al., 2015; Addor et al., 2017). For this chapter, we will only train for a single basin; but, we include data for multiple basins for you to explore. The CAMELS dataset contains a wide range of hydroclimatic conditions in basins, which are minimally impacted by human infrastructure and have long records of streamflow observations. As our model contains no explicit store for snow, we will take some time to filter out basins where snowpack is a dominant factor. However, since we do not have actual observed snow data for these basins to filter with we will use daily minimum temperature as a proxy. To do so, we simply filter out any basins where the 10th percentile of daily minimum temperature during the winter months is below 0°C. The code below opens up the provided NetCDF dataset via xarray and converts all of the data to float32, which is done to improve memory usage and computational speed. We then calculate the winter low temperatures using a groupby approach, which allows us to concisely find which basins meet the criteria specified earlier. We then do so using the .where method, and finally select the basins we want using the .sel method. As you can see, the final dataset printed out still has 131 basins, each with 10 water years of data. This dataset contains a number of prerun results from other hydrologic models, forcing variables such as daylength (dayl), precipitation (prcp), daily minimum and maximum temperatures (Tmin and Tmax), and potential evapotranspiration (PET, pet), among others. Additionally, there are a number of basin-specific attributes such as basin area (area), basin average elevation (elevation), and aridity (aridity). The temperature, PET, and a selection of attributes will form the basis of our model inputs for training.

5 Scaling up to a conceptual hydrologic model

193

Now that we have got a dataset to use we need a way of ingesting it so that we can actually train the model. To do so, we will employ a slightly nonstandard technique called “multiple trajectory,” or “multiple shooting,” optimization. The problem is essentially that the model we defined above is somewhat analagous to a recurrent neural network (RNN), where to train we iterate over sequences with some length. In this case, the sequences are multiple timesteps. Like RNNs, you do not want to update the model parameters after every step because that makes the training process almost impossible. Instead, we run the model forward in time for some period of time, then update the model parameters according to the accumulated gradients. In principle, we could run the model over the entire training period; but, this means fewer data points to update the model parameters, leading to slower convergence. Instead, we break the full dataset into a series of “trajectories,” which contain a specified time period. Then, during the training process we iterate over each of these trajectories, updating model parameters along the way. A full pass over all of the trajectories, then, is a single training epoch. We implement this below. This sort of data wrangling is core to doing machine learning; so, thinking about how and why to arrange data any particular way is very important. As such, we build on the PyTorch Dataset class, which simplifies the interface to referencing data for both training and inference (commonly referred to as prediction). Without getting too far into the weeds of data loading, we just set up the way to index the dataset via the __getitem__ method and ways to determine how big the dataset is through the __len__ method. Finally, let us talk about what actually comes out of the dataset. When you index on the dataset (as x[i]), you will get two tensor arrays back. The first is the input to the model,

194

7. AI for physics-inspired hydrology modeling

and the second is the target data that we want the model to produce when given the input. In our case, we will record the inputs and outputs via the in_vars and out_vars variables. This will make it easier to explore, which variables have an impact during training. On __dunder__ methods The methods that we implemented in the MultipleTrajectoryDataset are referred to as “dunder” (a.k.a. double underscore) methods, which Python uses to call the base indexing calls. Simply stated, if you call x[i] you are really calling x.__getitem__(i) and if you call len(x) you are really calling x.__len__(). Of course, these are simplifications; but, the shorthand is useful to understand.

5.3 The model training functions With a model and dataset in hand, we will define some functions to simplify the core training loop. First, the update_model_step function is used to actually perform the optimization step. This is pretty much a standard update function where we run the model on some training data, compute the loss, and use the gradients of the loss with respect to the parameters to update the parameters via the optimizer’s step method. The other function we define is the

5 Scaling up to a conceptual hydrologic model

195

update_ic_step,

which is where we update the initial conditions (i.e., the initial storages) at the end of an individual training trajectory. This is calculated after update_model_step and used to transition between training trajectories.

5.4 Setting up our training/testing data Now that we have got everything we need to do the training let us go ahead and set up our process. First, we need to select a basin that we want to train at, select out the train and test timeframes, and then create the dataset with some trajectory lengths (in units of days). Finally, we create a MultipleTrajectoryDataset for our training and testing periods.

196

7. AI for physics-inspired hydrology modeling

5.5 Defining the model setup Now, we set some hyperparameters, which are our neural network width and depth for each parameterization defined previously, initial storage values, learning rate, and bounds of the physical constants of each parameter for the HydroSimulator. The initial storages warrant some discussion. In our testing, this can actually have a decent impact on the overall performance; so, if you try different basins you might fiddle with this to see if you can get better performance. We have set up some infrastructure in our training loop that minimizes this effect; but, it is still present to some extent. The way that we mitigate some of the impact of the choice of the initial_storage values is after a full epoch (i.e., a pass over each of the training trajectories) we take the average of the ending storage values and use that as the initial storage for the next epoch. Following that we set the hyperparameters for the HydroParam objects, which represent parameter values for the model. We have simply set all of them to be single-layer MLPs with six nodes. You can try to adjust this, but since we are training on a single basin with only static attributes the model complexity does not actually have a large impact on model performance. The actual bounds on each of the HydroParam instances was taken more or less from the recommendations from the original implementation of this conceptual model by Clark et al. (2008).

5 Scaling up to a conceptual hydrologic model

197

With the hyperparameters set up, we will create our model instance and set the optimizer. Here, we will just use the Adam optimizer and mean squared error losses as they are a good all around picks; but, you can explore different options here as well. The learning rate was chosen by trial and error, and seems to work well enough.

5.6 Training the model Now, finally, we get to train the model. The “training loop” here consists of three actual Python loops. The first is for overall epochs, which is the number of times that we pass over the full dataset. The second is to pass over each of our training trajectories. And, finally, the third is the number of times that we look at an individual trajectory before moving to the next. For each innermost loop (that is passes over the same trajectory), we compute the training loss. Once we have moved onto a new trajectory, we record the ending storages so that they can be supplied to the next trajectory as an initial condition. This is how we use the multiple trajectory training to work around the fact that we do not know the actual values for the storages in the upper and lower layers at any given time, and particularly at the initial times of each trajectory, and is done by calling update_ic_step after the innermost training loop has completed. In our case, training may take a couple of minutes.

198

7. AI for physics-inspired hydrology modeling

As you can see from the training curves, we were able to reduce the loss relatively quickly. Additionally, you see that each of the trajectories (which are different periods of time) end up with different loss curves. You can see then that some years are harder for the model than others, although all show improvement over their starting point. You might look at these trajectories individually to see if you can diagnose why some trajectories are easier to optimize than others. You may also notice that this training process did not converge to a minimum. We cut the training time off here because we wanted to provide a simple and efficient example for you to be able to tinker with.

5 Scaling up to a conceptual hydrologic model

199

5.7 Model analysis With the model trained, we can now run it on the test data and see what we have produced. Before analyzing the model, we will define the nse function, which calculates the NashSutcliffe efficiency, which is a measure of performance (Nash and Sutcliffe, 1970). Values above 0 indicate that the model does better than simply using the observed mean while a value of 1 is a perfect match.

To actually run the model, we pull our forcings (i.e., daily precipitation and potentialevapotranspiration) and observed streamflow from the test_data. The forcings can then be put into the model, with the average storage from training as our starting place. We first run the trained model to get the predicted streamflow as well as pull out the relevant storage terms and fluxes.

Next, we can plot all of the data to see what is going on in the model. There are a number of interesting things to discuss in the model outputs, but obviously the first thing you will look at is how well the predicted streamflow matches the observed streamflow. If you are using the basin we selected as default, you will find we get a value of about 0.71, which is a quite reasonably performing model. Not bad for such a simple setup! You can also look at the storage timeseries, which shows the internal dynamics of the system. Overall, we see slower dynamics in the subsurface, which is reasonable. Further, we see that the subsurface in this model configuration retains quite a bit of water as a “steady state.” The surface bucket, on the other hand, reacts much more quickly to precipitation inputs and seems to help produce the “flashy” streamflow events, as you would suspect. Finally, looking at the ET/PET ratios we see that generally the ratio is maximized when moisture is high in the system, while the ratio becomes lower in the dry periods, despite a high demand. Again, all of this is reasonable for a hydrologic model. But, the key difference here is that our parameter values were represented by neural networks rather than single numbers.

200

7. AI for physics-inspired hydrology modeling

6 Conclusions

201

6 Conclusions In this chapter, we hope you have learned that the numerical modeling approaches in traditional hydrologic models are not so different from than those of machine learning (particularly with respect to ODE-based models). The conceptual model that we formulated within the PyTorch framework was able to be trained with standard optimizers via backpropagation. While the final model we trained is not state of the art in performance, we hope that seeing it built up from base principles demystifies many aspects of merging physics with machine learning.

6.1 Exercises To this end, we offer some possible modifications and extensions of the work that we have described earlier that we hope sparks your own work. 1. Rather than working the nonlinear reservoir, can you replace the ODE with other classical examples? Perhaps try things like a forced/damped oscillator or projectile motion. How might you handle something like a reservoir with hysteresis where filling has different trajectories than draining? 2. Consider how you might adapt storage-discharge relations to a more readily observed area-elevation relation from a satellite imagery of reservoirs. Do you think you could reconstruct operation curves from such observations? 3. How would you extend the HydroEquation module to include more storage buckets? What if this was a configurable option? What about including a store for snow processes? Or a specific vegetation store to account for canopy storage? 4. Could you extend the HydroEquation to take in time-dependent quantities as inputs? Perhaps you could start with including the precipitation and potential evapotranspiration as input variables. 5. Consider spatially explicit subsurface representations. Could such methods possibly be used to learn subsurface properties?

6.2 Open questions We hope that our simple worked examples are enough for you to be able to modify and expand into code that is useful for your own research. In doing so, there are clearly many open questions. For instance, a natural extension of the final model would be to replace the parameters with more complex neural networks such as LSTM networks. Would it be possible to feed such networks with data from multiple basins to produce a global model as has been possible with LSTM models? There is also the question of adding on pre/postprocessing networks, which could enhance predictive capabilities, yet retain the ability to model internal states in a physically satisfying way that moves the needle closer to “best of both worlds” with respect to the interpretability-predictive spectrum.

202

7. AI for physics-inspired hydrology modeling

References Addor, N., Newman, A.J., Mizukami, N., Clark, M.P., 2017. The CAMELS data set: catchment attributes and meteorology for large-sample studies. Hydrol. Earth Syst. Sci. 21, 5293–5313. Bennett, A., Nijssen, B., 2021. Deep learned process parameterizations provide better representations of turbulent heat fluxes in hydrologic models. Water Resour. Res. 57 (5). https://doi.org/10.1029/2020WR029328. e2020WR029328. Beucler, T., Pritchard, M., Rasp, S., Ott, J., Baldi, P., Gentine, P., 2020. Enforcing analytic constraints in neuralnetworks emulating physical systems. ArXiv:1909.00912 [Physics]. Retrieved from: Retrieved from: http:// arxiv.org/abs/1909.00912. Brenowitz, N.D., Bretherton, C.S., 2018. Prognostic validation of a neural network unified physics parameterization. Geophys. Res. Lett. 45 (12), 6289–6298. https://doi.org/10.1029/2018GL078510. Clark, M.P., Slater, A.G., Rupp, D.E., Woods, R.A., Vrugt, J.A., Gupta, H.V., et al., 2008. Framework for understanding structural errors (FUSE): a modular framework to diagnose differences between hydrological models. Water Resour. Res. 44 (12). https://doi.org/10.1029/2007WR006735. Chen, R.T.Q., Rubanova, Y., Bettencourt, J., Duvenaud, D., 2018. Neural Ordinary Differential Equations. ArXiv:1806.07366 [Cs, Stat]. Retrieved from: http://arxiv.org/abs/1806.07366. Feigl, M., Herrnegger, M., Klotz, D., Schulz, K., 2020. Function space optimization: a symbolic regression method for estimating parameter transfer functions for hydrological models. Water Resour. Res. 56 (10). https://doi.org/ 10.1029/2020WR027385. e2020WR027385. Frame, J.M., Kratzert, F., Raney II, A., Rahman, M., Salas, F.R., Nearing, G.S., 2021. Post-processing the national water model with long short-term memory networks for streamflow predictions and model diagnostics. J. Am. Water Resour. Assoc. 57 (6), 885–905. https://doi.org/10.1111/1752-1688.12964. Gauch, M., Kratzert, F., Klotz, D., Nearing, G., Lin, J., Hochreiter, S., 2021. Rainfall-runoff prediction at multiple timescales with a single long short-term memory network. Hydrol. Earth Syst. Sci. 25 (4), 2045–2062. https://doi.org/ 10.5194/hess-25-2045-2021. Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. MIT Press. Retrieved from: https://mitpress.mit.edu/ books/deep-learning. Isaacson, E., Keller, H.B., 1994. Analysis of Numerical Methods. Dover Publications, New York. Jiang, S., Zheng, Y., Solomatine, D., 2020. Improving AI system awareness of geoscience knowledge: symbiotic integration of physical approaches and deep learning. Geophys. Res. Lett. 47 (13). https://doi.org/ 10.1029/2020GL088229. e2020GL088229. Jiang, S., Zheng, Y., Wang, C., Babovic, V., 2022. Uncovering flooding mechanisms across the contiguous United States through interpretive deep learning on representative catchments. Water Resour. Res. 58 (1). https://doi. org/10.1029/2021WR030185. e2021WR030185. Kingma, D.P., Ba, J., 2017. Adam: a method for stochastic optimization. ArXiv:1412.6980 [Cs]. Retrieved from: http:// arxiv.org/abs/1412.6980. Kochenderfer, M., Wheeler, T., 2019. Algorithms for Optimization. MIT Press. https://mitpress.mit.edu/books/ algorithms-optimization. (Retrieved 30 April 2022). Kraft, B., Jung, M., K€ orner, M., Koirala, S., Reichstein, M., 2021. Towards hybrid modeling of the global hydrological cycle (Preprint)., https://doi.org/10.5194/hess-2021-211. Kratzert, F., Klotz, D., Herrnegger, M., Sampson, A.K., Hochreiter, S., Nearing, G.S., 2019. Toward improved predictions in ungauged basins: exploiting the power of machine learning. Water Resour. Res. 55 (12), 11344–11354. https://doi.org/10.1029/2019WR026065. Krishnapriyan, A.S., Gholami, A., Zhe, S., Kirby, R.M., Mahoney, M.W., 2021. Characterizing possible failure modes in physics-informed neural networks. ArXiv:2109.01050 [Physics]. Retrieved from: http://arxiv.org/abs/2109. 01050. Liang, X., Lettenmaier, D.P., Wood, E.F., Burges, S.J., 1994. A simple hydrologically based model of land surface water and energy fluxes for general circulation models. J. Geophys. Res. Atmos. 99 (D7), 14415–14428. https://doi.org/ 10.1029/94JD00483. Mai, J., Shen, H., Tolson, B.A., Gaborit, E., Arsenault, R., Craig, J.R., et al., 2022. The great lakes runoff intercomparison project phase 4: the great lakes (GRIP-GL). Hydrol. Earth Syst. Sci. 26 (13), 3537–3572. https://doi.org/10.5194/ hess-26-3537-2022. Nash, J.E., Sutcliffe, J.V., 1970. River flow forecasting through conceptual models part I—a discussion of principles. J. Hydrol. 10 (3), 282–290. https://doi.org/10.1016/0022-1694(70)902556.

Further reading

203

Nearing, G.S., Kratzert, F., Sampson, A.K., Pelissier, C.S., Klotz, D., Frame, J.M., et al., 2021. What role does hydrological science play in the age of machine learning? Water Resour. Res. 57. https://doi.org/ 10.1029/2020WR028091. e2020WR028091. Newman, A.J., Clark, M.P., Sampson, K., Wood, A., Hay, L.E., Bock, A., et al., 2015. Development of a large-sample watershed-scale hydrometeorological data set for the contiguous USA: data set characteristics and assessment of regional variability in hydrologic model performance. Hydrol. Earth Syst. Sci. 19 (1), 209–223. https://doi.org/ 10.5194/hess-19-209-2015. Nocedal, J., Wright, S.J., 2006. Numerical Optimization. Springer, New York. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., et al., 2019. An imperative style, high-performance deep learning library. arXiv. PyTorch. Retrieved from: http://arxiv.org/abs/1912.01703. Raissi, M., Perdikaris, P., Karniadakis, G.E., 2019. Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707. https://doi.org/10.1016/j.jcp.2018.10.045. Rasp, S., Pritchard, M.S., Gentine, P., 2018. Deep learning to represent subgrid processes in climate models. Proc. Natl. Acad. Sci. 115 (39), 9684–9689. https://doi.org/10.1073/pnas.1810286115. Ruder, S., 2017. An overview of gradient descent optimization algorithms. arXiv. Retrieved from: http://arxiv.org/ abs/1609.04747. (June 15). Schaeffer, H., 2017. Learning partial differential equations via data discovery and sparse optimization. Proc. R. Soc. A Math. Phys. Eng. Sci. 473 (2197), 20160446. https://doi.org/10.1098/rspa.2016.0446. Shen, C., 2018. A transdisciplinary review of deep learning research and its relevance for water resources scientists. Water Resour. Res. 54 (11), 8558–8593. https://doi.org/10.1029/2018WR022643. Thapa, S., Zhao, Z., Li, B., Lu, L., Fu, D., Shi, X., et al., 2020. Snowmelt-driven streamflow prediction using machine learning techniques (LSTM, NARX, GPR, and SVR). Water 12 (6), 1734. https://doi.org/10.3390/w12061734. Tian, Y., Xu, Y.P., Yang, Z., Wang, G., Zhu, Q., 2018. Integration of a parsimonious hydrological model with recurrent neural networks for improved streamflow forecasting. Water 10 (11), 1655. https://doi.org/10.3390/w10111655. Zhao, W.L., Gentine, P., Reichstein, M., Zhang, Y., Zhou, S., Wen, Y., et al., 2019. Physics-constrained machine learning of evapotranspiration. Geophys. Res. Lett. 46 (24), 14496–14507. https://doi.org/10.1029/2019GL085291.

Further reading Chen, X., Jiang, P., Missik, J.E.C., Gao, Z., Verbeke, B., Liu, H., 2020. Opening the Black Box of LSTM Models Using XAI. Presented at the American Geophysical Union Fall Meeting. American Geophysical Union. Feigl, M., Roesky, B., Herrnegger, M., Schulz, K., Hayashi, M., 2022. Learning from mistakes—assessing the performance and uncertainty in process-based models. Hydrol. Process. 36 (2). https://doi.org/10.1002/hyp.14515. Konapala, G., Kao, S.C., Painter, S.L., Lu, D., 2020. Machine learning assisted hybrid models can improve streamflow simulation in diverse catchments across the conterminous US. Environ. Res. Lett. 15 (10), 104022. https://doi.org/ 10.1088/1748-9326/aba927. Kraft, B., Jung, M., K€ orner, M., Koirala, S., Reichstein, M., 2022. Towards hybrid modeling of the global hydrological cycle. Hydrol. Earth Syst. Sci. 26 (6), 1579–1614. https://doi.org/10.5194/hess-26-1579-2022. Yuan-Heng, W., Gupta, H.V., Zeng, X., Niu, G., 2021. Exploring the potential of long short-term memory networks for improving understanding of continental- and regional-scale snowpack dynamics. Earth Space Sci. Open Arch. https://doi.org/10.1002/essoar.10507610.1.

This page intentionally left blank

C H A P T E R

8 Theory of spatiotemporal deep analogs and their application to solar forecasting Weiming Hua,b, Guido Cervonea,c, and George Youngc a

Department of Geography and Institute for Computational and Data Sciences, The Pennsylvania State University, University Park, State College, PA, United States bCenter for Western Weather and Water Extremes, Scripps Institution of Oceanography, University of California, San Diego, CA, United States cDepartment of Meteorology and Atmospheric Science, The Pennsylvania State University, University Park, State College, PA, United States

Acronyms AI

artificial intelligence

AnEn CNN CRPS CV DA DL FLT GFS IG IS LSTM MAE ML NAM NCEP NMM NNet

analog ensemble convolutional neural network continuous rank probability score computer vision deep analog deep learning forecast lead time global forecast system integrated gradient independent search long short-term memory mean absolute error machine learning North American Mesoscale Model National Centers for Environmental Prediction Nonhydrostatic Mesoscale Model neural network

Artificial Intelligence in Earth Science https://doi.org/10.1016/B978-0-323-91737-7.00005-0

205

Copyright # 2023 Elsevier Inc. All rights reserved.

206 NWP PAnEn RA RNN SSE SURFRAD WRF

8. Theory of spatiotemporal deep analogs and their application to solar forecasting

numerical weather prediction parallel analog ensemble reverse analog recurrent neural network search space extension surface radiation budget Weather Research and Forecasting

1 Introduction Solar irradiance is one of the most important variables for solar power generation. An accurate prediction of solar irradiance plays a critical role in the planning, scheduling, and management of photovoltaic power plants and their grid-combined generating systems (Gao et al., 2019). Meanwhile, solar irradiance is also heavily affected by other weather variables such as total cloud cover, temperature, wind speed, and relative humidity. In this chapter, we approach the problem of solar forecasting through the lens of weather forecasts and discuss how weather analogs and machine learning (ML) can be used for solar irradiance forecasting.

1.1 A brief history of weather analogs Analog forecasting originates from a simple yet powerful idea: if similar weather situations from the past can be found, information can then be gleaned from these past weather situations to help better inform how current weather will evolve and better predict the future. The most notable early success of analog forecasting, to our knowledge, is probably the weather forecast during the D-Day invasion of 1944 (Fleming, 2004): US Army Air Corps team, using Krick’s analog techniques, was able to predict short fair weather during early June 1944, which became the key date for the operation. However, analog forecasting soon encountered a formidable foe in 1963 when Lorenz discovered Chaos Theory (Lorenz, 1963) and showed that weather is a chaotic system. It means that a slight change in the initial conditions could lead to a drastically different output from a dynamic weather system (Gleick, 2011). This finding posed a great challenge to analog forecasting because the chaotic nature of the weather system significantly limits the predictability of future states of the atmosphere using weather analogs. In 1994, Van den Dool’s astronomical estimation of the number of years (1030 years) one should wait before finding good weather analogs (Van den Dool, 1994) again discouraged effort into analog forecasting. Not being able to find good weather analogs was one of the most important factors impeding the application of analog forecasting (Lorenz, 1969). It was later recognized as a viable empirical forecast method for medium- to long-range forecasts and climate downscaling (McDermott and Wikle, 2016), but less useful for short-term forecasting compared to numerical weather prediction (NWP) models. A more recent technique challenged this convention. The Analog Ensemble (AnEn) technique (Monache et al., 2011; Delle Monache et al., 2013) was proposed as an ensemble forecasting technique that achieved better prediction accuracy than contemporary NWP models. One of the major differences between the AnEn and its predecessors is that weather analogs are identified on a highly constrained location, effectively independent at each NWP model

1 Introduction

207

grid. Meanwhile, the temporal search window is also shortened to only several hours. AnEn relies on the highly constrained local search and thus it reduces the degree of statistical freedom when searching for similar weather patterns. Applications of the AnEn ( Junk et al., 2015; Alessandrini et al., 2015) later showed that good weather analogs for this purpose could be effectively found from typically 2 years of history. The AnEn has successfully set the standard for analog forecasts on the short-term temporal scale. It has since been applied to various predictions, including surface variables such as temperature (Frediani et al., 2017; Hu and Cervone, 2019), air quality (Delle Monache et al., 2018), and renewable energy sources (Cervone et al., 2017; Alessandrini et al., 2019; Hu et al., 2022).

1.2 Machine learning and its integration with analog ensemble ML represents a wide group of modeling techniques that can learn from existing data and then apply the learned features to future behavior forecasting. ML is data-driven and inductive meaning that it requires training with a large amount of data. It provides a domainagnostic modeling framework, because the model parameters can be statistically learned given sufficient training data. This modeling practice is different from dynamic models, where equations are explicitly designed and a proper parameterization scheme needs to be used for the model to generate meaningful results. However, ML can be data-thirsty, especially in the case of deep learning (DL) because of the high model complexity and a large number of model parameters. When not enough data are available, overfitting is a common yet challenging problem haunting the ML community. Overfitting means that the model does not learn a generalizable relationship between the predictors and the predictand, but instead it simply “memorizes” the training dataset. Overfitting can be typically diagnosed when the model has a low error on the training data but a much higher error on the testing data. Usually, collecting more training data or using a smaller model (e.g., dropout) would help to prevent overfitting, but the problem can be much more subtle depending on individual applications. Although ML provides a general framework to model different physical processes, a trained model is usually treated as a “black box” because the learned weights are deprived of any interpretation following conventional techniques. ML is usually able to produce more accurate predictions, but it is challenging to “pry open the box” and discover physical relationships learned by the ML models. The limitation in interpretability has led to heated debates on model reliability and ethics. The conception of ML being “a black box” has started to shift recently with emerging methods in interpretable artificial intelligence (AI). For example, gradient-based methods can be used to examine the attribute of input features to a specific prediction output. These methods treat gradients calculated from backward propagation as a proxy for feature importance. They help answer questions like how important an input feature is and where the model finds most important to a certain prediction in the input image. Aside from gradient-based approaches, other methods “learn” an explanation metric per sample for a model (Ribeiro et al., 2016; Zintgraf et al., 2017; Kindermans et al., 2017). These approaches are typically intrusive because they need to change the internal architecture to learn an additional metric. As a result, the model needs to be retrained due to the change in the architecture. Instead, gradient-based methods generally require only “black-box” access to the trained model.

208

8. Theory of spatiotemporal deep analogs and their application to solar forecasting

Inspired by the adaptive ability of ML to learn a large parameter space, we investigated the integration of ML into the AnEn so that weather analogs can be sought with a metric that considers both the spatial and temporal features. Although the AnEn technique significantly benefits from a highly constrained local search in both space and time, it is also limited for the same reason. Currently, the similarity metric of the AnEn only works with forecasts on a single-grid point by design which ignores spatial features from weather forecasts. However, spatial information can further help to identify weather analogs using shapes and locations of weather patterns, rather than only the magnitude of certain weather fields on one grid point. For example, clouds are usually resolved at the grid level and any subgrid level clouds are often parameterized and poorly modeled. This could make the predicted total cloud cover less reliable for identifying weather patterns, despite it being one of the most relevant predictors of solar irradiance. If the model is subject to a displacement error of cloud, the weather analogs based on a single-grid point could be subject to prediction bias. On the other hand, by considering spatial features and determining the cloud regime from a spatial context, this bias can be better corrected. This chapter describes how a spatial metric can be integrated into the AnEn by using convolution operators. We argue that spatial patterns are critical for identifying weather analogs, and thus spatial information should be encoded into the similarity metric by design. We hereby propose deep analog (DA) that relies on a convolutional long short-term memory (LSTM) to identify weather analogs due to its capability to encode spatial information and the preference of using an analytical backpropagation method for weight optimization rather than an extensive grid search.

1.3 What you will learn in this chapter The rest of the chapter is organized as follows: 1. Section 2 summarizes the NWP models and the ground-based solar irradiance measurements used in this study. You will learn how data have been collected and organized. 2. Section 3 first introduces how analog-based forecasting techniques evolve over the years and how the proposed technique fits into the larger context, and then it briefly introduces the AnEn and a simple variant with a spatial metric. Finally, you will learn about the DA with a spatiotemporal weather similarity metric enhanced by a neural network (NNet). 3. Section 4 presents the experiment design and research results. You will learn how performance from different methods is evaluated. 4. Section 5 summarizes the chapter with discussions and conclusions.

2 Research data The generation of weather analogs relies on weather forecasts generated by NWP models and a historical observation archive of the variable of interest. This section introduces data collected for subsequent experiments.

2 Research data

209

2.1 Surface radiation budget network Surface radiation budget (SURFRAD) is a multidecadal project that provides continuous, accurate, and high-quality surface radiation budget measurements for six locations across diverse climate conditions (Augustine et al., 2000, 2005). The mission of the project is to support satellite retrieval validation, modeling, and other research in climate science, hydrology, and atmospheric science. Its main appeal to the research community includes the downwelling and upwelling components of broadband solar and thermal infrared irradiance. The downwelling global solar irradiance is of particular interest to this research. It is measured on the main platform by an upward-facing pyranometer. In this work, the downward solar radiation, tagged with “dw_solar” in the dataset (publically accessible at https://gml.noaa.gov/aftp/data/radiation/surfrad), has been collected from January 1, 2015, to December 31, 2019, at the station, Penn State (40.72°N, 77.93°W) with an elevation of 376 m. Radiation measurements are recorded every minute. To be consistent with the temporal consideration of a day-ahead energy market forecast, an hourly average has been calculated to downsample the original temporal resolution and to smooth ultrashort measurement noise.

2.2 Numerical weather prediction models Two NWP models have been analyzed for this study: The North American Mesoscale Forecast System (NAM) is an operational weather model produced and maintained by the National Centers for Environmental Prediction (NCEP) (publically accessible at https://www.ncei. noaa.gov/products/weather-climate-models/north-american-mesoscale). It was developed to provide mesoscale forecasts to the public and private sector meteorologists. NAM has constantly been updated. The most significant change took place in 2006 when NCEP replaced the simulation core of NAM from Eta to the Nonhydrostatic Mesoscale Model (NMM) of the WRF framework (Rogers et al., 2005, 2009). Another major upgrade was carried out in March 2017 when a high-resolution regional dynamic core was used to facilitate future high-resolution convective-allowing ensemble systems. With its current implementation, NAM is initialized 4 times a day at 00, 06, 12, and 18 UTC. The parent domain covers North America with a regular mesh grid of approximately 12 km as the horizontal resolution. The output at the parent domain is also used as boundary conditions to initialize other runs on nested domains on higher resolutions. NAM provides hourly forecasts up until 36 hours into the future and then forecasts are shifted to every 3 hours until 84 hours into the future, equivalent to three and a half days. The second weather model analyzed during experiments is the Global Forecast System (GFS) (publically accessible at https://rda.ucar.edu). Also produced and maintained by NCEP, GFS is a deterministic weather forecast model but with global coverage (Mathiesen and Kleissl, 2011). It is initialized 4 times a day at 00, 06, 12, and 18 UTC. Each model initialization outputs predictions up to 180 hours, equivalent to seven and a half days into the future. Similar to NAM, GFS also underwent relatively frequent model changes, including several improvements to the horizontal and vertical resolutions.a GFS with different a

A complete changelog can be found at https://www.emc.ncep.noaa.gov/gmb/STATS/html/model_changes.html.

210

8. Theory of spatiotemporal deep analogs and their application to solar forecasting

horizontal resolutions can be a good testbed to analyze the impact of spatial granularity when defining weather analogs. Therefore, different versions of GFS with resolutions of 0.25, 0.50, and 1.00 degrees have been collected in subsequent experiments. Due to the difference in periods of various GFS versions, the overlapped period needs to be identified among GFS versions and NAM availability. As a result, predictions from GFS and NAM are collected from January 1, 2015 to October 31, 2019. A spatial mask of approximately 500 km by 500 km has been applied to the domain of the NWP model, centered at the grid point closest to the Penn State SURFRAD station. The spatial mask extends from 74.93°W to 80.93°W and 37.72°N to 43.72°N. This spatial mask extracted 2460 grid points for NAM, 576 for GFS 0.25 degrees, 144 for GFS 0.50 degrees, and 36 for GFS 1.00 degrees. Because solar radiation has a strong diurnal cycle and typically peaks around the local solar noon, NWP forecasts at 1 p.m. ET are collected to focus on the most important time of day. Although a continuous series of hourly forecasts is theoretically possible with NAM, GFS only provides forecasts every 3 hours. Therefore, we focus on evaluating forecasts only at the overlapped irradiance peak time, 1800 UTC. Fig. 1A compares the predicted solar irradiance time series from GFS and NAM to the observed time series. Different versions of GFS have similar behaviors in terms of irradiance peaks and valleys, probably because these versions only differ in the output resolutions rather than the physical processes. It can be confirmed by the change history of GFS that the radiation estimation scheme did not change between January 14, 2015, and March 22, 2021, and that the improvement of resolution is probably a result of the updated data assimilation scheme. The core of the model is similar in these versions. NAM, on the other hand, shows a difference in its forecasts before and after 2017. The annual maximum of predicted downwelling solar radiation flux drops from around 800 W/m2b before 2017 to around 700 W/m2 after. A decrease in the simulated annual variation of solar irradiance can also be observed. This change in the model behavior can be confirmed with the NAM update on March 15, 2017,c effective on February 1, 2017. In this update, NAM has been upgraded to version 4 where the radiation and microphysics schemes have been changed to reduce incoming surface shortwave radiation and the warm-season 2-m temperature bias has also been reduced. Fig. 1B shows the prediction error of GFS and NAM. At this particular location, variants of GFS consistently show a low bias (under-predicting) while NAM has a high bias (overpredicting). The prediction accuracy does not monotonically increase with the improved horizontal resolution for GFS, with GFS 0.50 degrees having the highest bias (best) out of the three. It is worthwhile to note the slightly decreasing trend of the bias associated with NAM. The nonstationary model performance (prediction error) could potentially pose a challenge when used with analog forecasting techniques.

b

This unit measures the rate at which solar energy falls onto a surface. The unit of power is Watt, and in the case of solar irradiance, it is measured by the power per unit area.

c

The changelog can be accessed at https://www.nco.ncep.noaa.gov/pmb/changes/.

3 Methodology

211

FIG. 1 (A) The downward shortwave solar radiation time series from GFS, NAM, and SURFRAD. Data points are shown at 1 p.m. every day typically corresponding to the daily maximum irradiance. Predicted time series are shown in solid lines and the observed time series is shown in the dashed line. (B) The prediction bias of GFS and NAM.

3 Methodology Analog forecasting is a type of weather forecast technique that relies on similar weather patterns identified from the past to issue a future prediction. It has undergone several decades of development, and currently it sees applications in synoptic and mesoscale weather forecasting, ecology, air quality, and the energy sector. This section aims to first summarize the three types of similarity for weather patterns. Variations of analog forecasting techniques have been proposed and tested throughout the years, and the motivation behind this introduction is to identify commonality and novelty between the variations, and also to spot potential new ground for future research. Then the AnEn technique is introduced given the critical contribution to boosting its efficacy. Finally, we extend and propose DA driven by a spatiotemporal ML technique. We aim to show that DA is an even more powerful renovation of the AnEn with a better balance between confining the

212

8. Theory of spatiotemporal deep analogs and their application to solar forecasting

degrees of freedom during the weather analog search while still considering spatiotemporal weather patterns.

3.1 Analog forecasting Analog forecasting generates predictions based on similar weather patterns. The key concept is that if we can identify times in the past in which the pattern was similar to the current pattern, there is a high chance current weather conditions will follow a similar predictable evolution with that historical information. Analog forecasting is typically composed of two steps, quantification of similarity between weather patterns and generation of future predictions. Researchers proposed different approaches to implement these two steps. Before detailed discussions, it is worthwhile to point out that analog forecasting techniques can generate ensembles, from which probability can then be estimated based on the generated ensemble in a postprocessing fashion (Monache et al., 2011; Zhao and Giannakis, 2016). However, this chapter focuses on the generation of good-quality forecast ensembles, excluding the discussion about probability estimation. 3.1.1 Quantification of similarity between weather patterns Quantification of the similarity is usually done by first defining a similarity metric on a set of weather variables, and then evaluating this similarity metric between two weather patterns at different times. It is hard to find exact weather analogs, but this does not limit analogs from being effective for forecasting (van den Dool, 1989). Although the atmosphere never repeats itself exactly, similar weather conditions tend to share local-scale processes, for example, advection and radiation (van den Dool, 1989). Weather pattern similarity quantification techniques fall into three general types: • Type I: using multivariate forecasts • Type II: using hand-engineered weather features • Type III: using a trained NNet Type I is the most popular way to define similar weather patterns. NWP models are usually helpful in describing the state of the atmosphere with various variables on different vertical layers. For example, the 500 mb geopotential height is an important predictor for continental scale and long-range forecasts of temperature and precipitation (van den Dool, 1989). However, weather analogs based on a single predictor might not offer sufficient information about the state of the atmosphere to yield a useful analog. Multivariate forecasts, including wind speed, temperature, and geopotential heights at various vertical layers (Hamill and Whitaker, 2006), therefore can be used to further pin down historical dates that are closely similar to the current state of the atmosphere. Type II has occasionally emerged in the literature but it did not gain much attention. One attempt of this type was carried out by Toth (1989). The 500 mb geopotential height data were first collected and then used to detect well-developed ridgelines in the mid-latitude middle troposphere. Weather analogs were then identified using the number of ridgelines in each of the 15 predefined longitudinal sections across the Northern Hemisphere. Another more recent attempt was reported by Zhao and Giannakis (2016). Kernel functions were used to

3 Methodology

213

map pairs of points in the observation space onto a nonnegative real number, which is treated as a similarity measure. Different kernels were assessed based on their efficacy. Type II did not receive much attention due to the difficulty in designing good features by applying transformation to weather forecasts. Type III emerged recently because of the growing interest in applying ML to forecasting problems. An ML algorithm can learn patterns from historical cases so that hand-engineered features are unnecessary. Moreover, it is capable of learning many features given a large network and sufficient training cases. The AnEn uses the Type I similarity because it relies directly on the variables from NWP to calculate the similarity metric. DA uses Type III similarity because it trains an NNet that learns the relationship between historical weather patterns and the associated observations. Details on both methods are covered in Sections 3.2 and 3.3. 3.1.2 Generation of future predictions After the most similar historical patterns have been identified, future predictions can be generated using a variety of approaches. The generation of future predictions is usually less discussed because it is usually dictated by observation availability and the predictand of interest. However, choosing a proper approach is critical to increasing prediction accuracy, as in the case of the AnEn. Since the evolution of weather systems is nonlinear with high degrees of freedom, initially similar weather patterns will evolve with time to become increasingly different (Lorenz, 1969). Therefore, one cannot solely rely on historical analogs to the current condition and simply assume that they have the same subsequent evolution as the current weather pattern. The AnEn proposed a different way to generate future predictions using observations associated with the most similar historical forecasts. The difference is that AnEn searches for similar weather patterns only for a short time window at each forecast lead time. Thus, the task of modeling the chaotic nature of the weather system is left to the underlying dynamic model. As a result, the AnEn can focus on correcting forecast biases and quantifying uncertainty. Another application of the AnEn is model downscaling. For example, weather analogs can be identified using a lower resolution NWP model, and then if observations are dense (e.g., temperature measurements from a dense Internet of Things network; Calovi et al., 2021), the most similar weather patterns can be applied to each of the observations from the dense network. Predictions can be generated at each geographic location where observations are available. Previous studies have looked at downscaling variables including precipitation (Hamill et al., 2006; Hamill and Whitaker, 2006; Charles et al., 2013; Shao and Li, 2013), evapotranspiration (Tian and Martinez, 2012), and temperature (Timbal and McAvaney, 2001; Calovi et al., 2021) using analog forecasting techniques.

3.2 Analog ensemble and the spatial extension The AnEn (Delle Monache et al., 2013) has become an important branch as an analog-based forecasting technique due to its improved accuracy compared with alternatives. The identification of weather analogs falls in the group of Type I, using a multivariate distance function

214

8. Theory of spatiotemporal deep analogs and their application to solar forecasting

based on NWP forecasts. The metric that measures the similarity between weather patterns is proposed as follows (Delle Monache et al., 2013): vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi u ~ Nv t X 2 ωi u uX (1) Fi,t+j  Ai,t0 +j , k Ft , A t 0 k ¼ t σ ~ i¼1 f i j¼t

where Ft is the NWP model prediction valid at the model initialization time stamp t at a specific location and forecast lead time (FLT); At0 is the historical repository of NWP deterministic forecasts from the search space at the same location and FLT, but with a different model initialization time t0 from within the historical repository of deterministic multivariate predictions; Nv is the number of physical variables used during forecast similarity calculation; ωi is the weight for each physical variable which suggests the relevant importance of the physical variable with respect to the others; σ f i is the standard deviation for the physical variable i calculated from the historical forecasts at the same location and FLT; ~t is equal to half of the time window size of the FLTs to be compared so that weather analogs are identified within a very small time window, usually equivalent to the length of three FLTs; Fi,t+j is the value of the current forecast for the physical variable i at the valid time t + j; and Ai,t0 +j is the value of the historical forecast for the physical variable i at the valid time t0 + j. A typical workflow of generating the AnEn is summarized below: 1. A multivariate target forecast at a single location within a short time window is given for a particular forecast time centered at the desired FLT. 2. The similarity metric is evaluated between each pair of the target forecast and the historical forecasts in the search repository at the same location and lead time but on a different day. 3. The historical forecasts with the highest similarity measure are selected, deemed as weather analogs. 4. Observations associated with the weather analogs are used as AnEn predictions. 5. Repeat Steps 1–4 for target forecasts at all locations, forecast times, and FLTs. The key differences between the AnEn and previous alternatives can be analyzed from two aspects: how similar weather patterns are identified and how AnEn forecasts are generated. AnEn uses a highly confined domain in both space and time when identifying weather analogs. In space, the AnEn identifies analogs on a single-grid point, as opposed to other largescale weather analogs; in time, the AnEn only has a temporal window as short as 3 hours, as opposed to other long-term comparisons for weeks. Due to the shortened time window, the AnEn does not confine the search of previous years during similar times of the year. Rather, it searches the year-round repository, although the identified weather analogs typically still come from a similar time of the year which is due to the multivariate similarity metric. Another difference is that weather analogs are directly sought at the forecast lead time. AnEn forecasts are generated from historical observations that correspond to the target lead time so that the chaotic behaviors of the weather system are still dealt with by the underlying model and AnEn can be focused on addressing the more relevant problem of the bias and uncertainty of forecasts at a specific lead time. However, similar to any scientific toolbox in the literature, AnEn has its shortcomings. Confining the search within a single-grid point not only helps to find reliable weather analogs

3 Methodology

215

but also limits its ability to detect spatial weather patterns and larger-scale convergence and divergence. A variant of the AnEn is termed the search space extension (SSE) (ClementeHarding, 2019). It compares forecasts on a single-grid point, but historical forecasts can come from a nearby grid point, not necessarily fixed at the same grid point as the target forecast. SSE expands the pool of potential analogs and effectively increases the number of historical forecasts that can be searched through, as opposed to searching at only one grid point (hereafter independent search [IS]). However, the similarity metric is not changed, indicating that when comparing forecasts from different locations, these forecasts are still single-grid based. So a different modification is needed directly on the similarity metric to incorporate spatial information. A simplistic form of the spatial similarity metric can be achieved by treating forecasts from nearby grid points as additional physical variables, later referred to as the AnEn Spatial. Recall in the original similarity metric, that Nv stands for the number of predictors. If a spatial mask of 3  3 is used, during the evaluation of the similarity metric, the number of predictors is increased to 9  Nv because of the additional forecasts nearby. We report our results using this simplistic version of the AnEn Spatial, as a benchmark to the spatiotemporal DA. In this chapter, we use the C++ implementation of AnEn (Hu et al., 2021a), publicly available at https://weiming-hu.github.io/AnalogsEnsemble/. An R package is also available from the same repository. We recommend reading the reference code and using the tutorials at https://weiming-hu.github.io/AnalogsEnsemble/tags.html#tutorial to get familiar with AnEn.

3.3 Spatial-temporal similarity metric with machine learning The similarity metric adopted by the AnEn, although found effective, has its limitations. 1. Predictor weights: A set of weights needs to be determined a priori with an extensive grid search. This process is usually computationally expensive. More importantly, it limits the AnEn to only work with a few variables due to the prohibitive computational cost of a highdimensional grid search algorithm. 2. Dependence on forecasts: The similarity metric relies exclusively on NWP forecasts, leaving out the potentially helpful information from the associated observations. 3. Spatial extension: The similarity metric relies on single-grid forecasts which can limit its ability to exploit spatial weather patterns. The simplistic approach, the AnEn Spatial, while considering spatial forecasts, can cause other problems, for example, with spatial weight optimization. It is, at best, only an expedient choice. To address the above challenges, we propose DA that relies on a spatial-temporal similarity metric during analog generation. This metric is driven by a trained NNet, whose structure is inspired by the recent progress and success in DL and computer vision (CV), especially the FaceNet (Schroff et al., 2015) from Google Inc. A properly trained NNet can learn the definition of “good” analogs given robust historical weather records, thereby avoiding the necessity to define a similarity metric beforehand. Fig. 2 shows the embedding network architecture of DA designed for NAM. The purpose of this embedding network is to transform high-dimensional NWP forecasts into a latent

216

8. Theory of spatiotemporal deep analogs and their application to solar forecasting t1

t1 t2 t3

t2

t3

120

t2

t3

t3



10 10 10 Input

120 σ(Conv-LSTM) + Max-Pool

1 39

39 × 46 0 39 G × 46 rid 0k s m2

60 60 60 σ(Conv) + Max-Pool

7

3

t1

Embeddings

30 30 30 σ(Conv) + Max-Pool

FIG. 2 Architecture of the proposed DA spatiotemporal embedding network with a spatial mask of 39  39 for NAM. Timestamps are labeled at the top of cubes. The number of channels (variables) is labeled at the bottom of the cubes. The size of the cube (spatial domain) is labeled to the lower right. For the embedding shape, it is a vector of 120 values without a spatial domain (hence 1).

representation vector, also referred to as the latent features. After the transformation, analog identification can be carried out in the transformed latent space. With different NWP models like GFS and NAM, the size of the input can change based on the horizontal resolution of the model. 39  39 indicates the input size of the spatial forecasts from NAM. The input to the network is a four-dimensional data structure with height, width, number of variables, and time window. It is shown as three horizontally aligned cubes in Fig. 2 with the fourth dimension being the time (top of the cubes). This dimension is consistent with the parameter ~t in Eq. (2) of Delle Monache et al. (2013). In practice, this parameter is usually set to one, so that a short temporal trend (1 and +1 hour) centered at the desired FLT is compared. A similar design is adopted in DA by inputting nearby FLTs to the embedding network. The embedding network has three hidden blocks, with each block composed of a convolutional layer (beige), a nonlinear activation (orange), and a max-pooling layer (red). Convolution layers are helpful to extract high-level spatial features and max-pooling layers are useful to preserve the strongest signal within the perceptive field (Lawrence et al., 1997; O’Shea and Nash, 2015; Albawi et al., 2017). The combination of these two types of NNets is powerful in performing tasks like image classification and pattern recognition in CV (Alom et al., 2018). To encode time information, the last block uses a convolutional LSTM layer. This network was originally proposed for precipitation forecasting (Shi et al., 2015) and has been studied with success in various applications (Wang et al., 2018; Kim et al., 2020) where a time sequence of spatial data is analyzed. Here, we apply the layer to encode the length-of-three time sequence of spatial forecasts and compress the information to the last time step. Therefore, 0 after the operation, only the last time step is kept, but it is tagged with a different label, t3 rather than t3, because this time step now holds information from all previous time steps. See Appendix A for a mathematical description of the components in hidden blocks.

3 Methodology

217

The embedding network can encode spatial information for the identification of weather analogs, solving the problem of the spatial extension of AnEn. A PyTorch implementation of the proposed embedding network can be found in the public repository, Deep Analogs.d The class EmbeddingConvLSTM derives from a PyTorch nn.Module, and therefore, it can be used the same way as any other standard neural network module. Although PyTorch natively provides the LSTM module, it does not include its variant with convolution. Therefore, the ConvLSTM implementation is also available from the DA repository. NNet is a supervised technique meaning labeled examples are needed during training. However, providing the latent features for the NNet to learn is similar to generating handcrafted features (Type II), which goes against the original purpose of using a data-driven approach. We seek to train the NNet in a way that develops its version of the latent features. Two additional techniques are needed to achieve effective training, triplet network training (Hoffer and Ailon, 2015) and the reverse analog (RA) (Hu et al., 2023). A triplet network is a training technique designed for image similarity problems. Instead of giving the actual latent feature vectors that the model should learn, it encourages the model to develop a set of latent features that best distinguish input images. During each forward propagation, three samples are provided, termed as the anchor image, the positive image, and the negative image. The anchor image is more similar to the positive image than it is to the negative image. Three images are fed into the same embedding network separately and transformed into their respective latent feature vectors. The triplet loss function is calculated using these latent feature vectors. Minimizing the loss function updates the model weights so that the distance between the anchor and positive images becomes smaller and the distance between the anchor and negative images becomes larger. Details can be found in Hu et al. (2023). The last piece of the puzzle lies in how to determine which two out of the three samples are weather analogs. The RA approach is used to distill information from observations and to guide the model training (Hu et al., 2023). RA stipulates that two forecasts are weather analogs if the associated observations are similar. Opportunities exist where two NWP forecasts are similar but they are associated with vastly different observations due to unresolved weather phenomena and model biases. The RA would determine that these two weather forecasts should not be treated as analogs because of the relatively high model error. This additional information regarding the forecast error associated with certain weather patterns is a result of using observations to guide the model training and the identification of weather analogs (Type III). The network weights are optimized using Adagrad (Lydia and Francis, 2019) but there are many other optimization algorithms available in the training script at “DeepAnalogs/ DeepAnalogs/train.py.”e The training script implements the triplet network training and the RA sampling technique. It allows building the embedding network with different sizes

d

See https://github.com/Weiming-Hu/DeepAnalogs/blob/25870bbca137a5fd6927f2cc47ea70d3381046ac/ DeepAnalogs/Embeddings.py.

e

See https://github.com/Weiming-Hu/DeepAnalogs/blob/25870bbca137a5fd6927f2cc47ea70d3381046ac/ DeepAnalogs/train.py.

218

8. Theory of spatiotemporal deep analogs and their application to solar forecasting

and it also carries out the separation of datasets into train, validation, and testing. See footnotef for a complete list of supported arguments by the training script. Theoretically, the embedding network can adopt different network architectures. Previous versions of DA did not exploit the convolutional layers hence leaving out the spatial information. We show, in this work, that spatial information is crucial to improving the prediction quality of weather analogs. In the rest of the material, the proposed technique is referred to as DA Spatial.

4 Results and discussion Experiments have been carried out for solar irradiance forecasting. Four NWP model forecast datasets have been studied with various versions of AnEn, including GFS (1.00, 0.50, and 0.25 degrees) and NAM (12 km/0.125 degrees). Table 1 summarizes the experiment setup for a particular NWP model: • The baseline comparison is the uncalibrated shortwave downward radiation flux from the NWP model at the grid point closest to the Penn State station. TABLE 1 Experiment setup for six prediction techniques. Experiment

Predictor

Time window

Search

NWP

1 (dswrf )

NA

NA

AnEn

5 (dswrf, u, v, t, r)

[e t  1, e t + 1]

Fixed

AnEn Spatial

10 (dswrf, t, tcc, q, r@6) within 500  500km2 [e t  1, e t + 1]

Fixed

[e t  1, e t + 1]

Fixed

[e t  1, e t + 1]

Extended

[e t  1, e t + 1]

Fixed

DA IS

DA SSE

DA Spatial

48 (dswrf, dlwrf, uswrf, ulwrft, tcc, q) u@7, v@7, w@7, t@7, r@7, gh@7)

48 (dswrf, dlwrf, uswrf, ulwrft, tcc, q) u@7, v@7, w@7, t@7, r@7, gh@7)

10 (dswrf, t, tcc, q, r@6) within 500  500km2

dswrf, dlwrf, uswrf, ulwrf, u, v, t, r, tcc, and q stand for downward shortwave radiation flux, downward longwave radiation flux, upward shortwave radiation flux, upward longwave radiation flux, u-/v-component of wind, temperature, relative humidity, total cloud cover, and specific humidity. r@6 stands for relative humidity from six vertical layers, ranging from 1000 to 500 isobar. All other predictors are either for the surface layer or the entire atmosphere as a single layer. e t is the FLT.

f

See https://github.com/Weiming-Hu/DeepAnalogs/blob/25870bbca137a5fd6927f2cc47ea70d3381046ac/ Examples/example.yaml.

4 Results and discussion

219

• AnEn has been applied to NWP forecasts to generate ensemble forecasts for solar irradiance with five predictors. Predictor weights are optimized with an extensive grid search using training data only (2015–17). • AnEn Spatial treats forecasts from nearby locations as additional predictors. Due to difficulties in weight optimization, equal weight is applied. Please note that the analog search is still fixed at a single location but the similarity metric uses a spatial mask while comparing forecasts. • DA IS uses LSTM layers as embedding network (Hu et al., 2023). It does not utilize spatial information. Therefore, the identification of analogs is carried out using forecasts from a single grid. The inclusion of this method shows its capability to perform similarity searches on many more weather variables. • DA SSE directly follows the approach of AnEn SSE in adopting spatial information. The only difference between DA IS and SSE is that during model training, nearby forecasts are fed into the embedding network so that the model is exposed to forecasts on slightly different locations. It is hoped that the trained model can identify better analogs when using a forecast from nearby locations. • DA Spatial represents the proposed method in this work. As discussed in Section 3.3, fourdimensional weather forecasts are used to determine weather similarity. The network is only trained at a fixed location, the Penn State station, but with the spatial and temporal components. The test period is from January 1, 2018 to October 31, 2019 and the training period is from January 1, 2015 to December 31, 2017, as this is the time when available data for NAM and all versions of GFS overlap. Forecasts of solar irradiance at the local solar noon (1 p.m.) are evaluated and compared. AnEn has 21 ensemble members which is a typical choice with a 2–3 years of search repository. GFS and NAM originally have hundreds of weather variables. To ensure a fair comparison, 48 common atmospheric variables have been identified from various vertical layers, as listed in Table 1, for DA IS and SSE. DA Spatial only uses 10 predictors because of the intensive memory consumption of training the convolutional LSTM model and the limited memory available in our equipped GPU. Embedding models are trained on a Dell Precision 7920 desktop tower with 64 GB of RAM, 16 Intel(R) Xeon(R) Gold 6130 CPU @ 2.10 GHz, and an NVIDIA Quadro P4000 GPU with 8 GB of memory. CUDA version is 10.2 and PyTorch version is 1.8.1+cu102.

4.1 Verification at a single location Fig. 3 evaluates forecasts generated from AnEn and DA with different NWP models using, in turn, the mean absolute error (MAE), bias, and the continuous rank probability score (CRPS). NWP model forecasts typically have the MAE around 150 W/m2. The AnEn and the AnEn Spatial are both able to reduce prediction errors compared to the baseline NWP forecasts. However, AnEn Spatial outperforms the AnEn when coupled with NAM but not with GFS. The mixed performance of AnEn Spatial indicates that simply treating forecasts on nearby grid points is merely a convenient yet suboptimal solution for exploiting spatial correlation.

220

8. Theory of spatiotemporal deep analogs and their application to solar forecasting

FIG. 3 Comparisons of MAE (A), bias (B), and CRPS (C) with GFS and NAM. NAM is shown in dashed lines and versions of GFS are shown in solid lines. Statistics averaged across the complete test period.

Comparing the MAE of the AnEn and DA, in Fig. 3A, DA outperforms AnEn except DA SSE. This result is expected because the embedding network of DA SSE is trained with singlegrid forecasts but from nearby locations. Topography in central Pennsylvania can vary drastically featuring mountains and valleys. Nearby forecasts, even being associated with similar observations at the fixed forecast location, may be too dissimilar for the network to learn with only single-grid forecasts. However, the model is still forced to construct a latent vector that could potentially prevent the model from learning anything generalizable. This effect is most salient for NAM because of the highest spatial resolution (most grids). For lower-resolution models, like GFS, training DA with SSE does not seem to have a significant side effect, but it certainly does not improve the prediction accuracy compared to DA IS. Similar results can be observed in Fig. 3C. It is worthwhile to mention the extra computational cost of training DA SSE due to more training samples.

4 Results and discussion

221

Finally, DA Spatial outperforms all its counterparts with MAE and CRPS. This suggests that convolutional layers are critical in encoding spatial information. They are more effective in introducing spatial context into the weather similarity metric than AnEn Spatial. Although DA Spatial has the lowest (worst) bias compared with AnEn alternatives, all AnEn techniques yield a low bias of around 25 W/m2, which suggests that none of the variants is efficient in removing model bias. Thus, removing bias in NAM could be a challenging task given a noticeable change in the radiation scheme and model prediction behavior, as shown in Fig. 1. Fig. 4 shows bivariate scatter plots with observed solar irradiance on the horizontal axis and the forecasted on the vertical axis. The diagonal line shows the perfect correlation between the forecasted and observed solar irradiance. Each panel shows data from a particular forecasting technique, with NWP forecasts in the first column, AnEn in the second, and DA Spatial in the third column. From the right tail of the distribution in Fig. 4 (1, 4, 7), GFS suffers from an underprediction of around 200 W/m2 at the extreme level (upper right). This underprediction could be false alarms of the cloud cover prediction at the grid level, but more importantly, it could be caused by the model resolving parameterized subgrid cloud, for example, cumulus cloud. GFS adopts a Monte-Carlo Independent Column Approximation method during radiation transfer computations to address the unresolved subgrid cloud variability (Iacono et al., 2000; Clough et al., 2005). Both AnEn and DA Spatial are capable of correcting errors due to the imprecise parameterization process, while DA Spatial slightly outperforms the AnEn in this respect. On the other hand, NAM suffers from a low correlation between the observed and forecasted solar irradiance. Both the AnEn and DA Spatial can cluster data points closer to the diagonal line with DA Spatial achieving a higher correlation. However, the AnEn shows underprediction for high solar irradiance regimes. This could be related to the schematic changes that happened in early 2017 reducing the simulated incoming solar radiation, as shown in Fig. 1. The AnEn does not account for changes in the model archive. In contrast, results from DA Spatial show improvements because of the training process and the use of RA. If there is a change in the forecast patterns throughout the search repository, the model is designed to account for it and to build latent features that would ultimately lead to more similar observations.

4.2 Search space extension SSE, as mentioned in Section 3.2, is an extension of the AnEn that allows the target forecast at a fixed location to be compared with nearby forecasts. With the enlarged search repository, it is more likely to find better weather analogs and therefore leads to a higher prediction accuracy at the fixed location. SSE is also useful in assessing how weather analogs can be identified using forecasts at different locations, essentially allowing a certain level of fuzziness during the weather analog search. Fig. 5 shows the MAE of solar irradiance predictions calculated at the fixed location (shown in star) when forecasts from a distant grid point are used to compare with the target forecast at the fixed location. There are two main concerns Fig. 5 aims to address:

222

8. Theory of spatiotemporal deep analogs and their application to solar forecasting

FIG. 4 Bivariate scatter plots with observed solar irradiation on the horizontal axis and the forecasted on the vertical axis. Perfect correlation is shown by the diagonal referential line in red. Each panel represents a forecasting technique. Ensemble mean is shown for ensemble forecasts. Pairwise correlation is calculated without binning.

1. Can weather analogs be better identified with forecasts from nearby locations for spatial considerations? 2. How sensitive is the prediction error to the distance between two nearby forecasts? Generally, Fig. 5A–C features rough error terrains while Fig. 5D features a smooth error surface. The largely blue-to-brown color in Fig. 5D indicates a prevalent outperformance

4 Results and discussion

223

FIG. 5

Solar irradiance MAE calculated at the fixed location (star) when forecasts from a distant grid point is used to compare with the forecast at the fixed location. The best search grid, based on the lowest prediction error, is labeled as a red cross. Contour lines show interpolated error surface from grid points. The NWP model is NAM.

of DA Spatial over the competitors. It indicates that DA Spatial, trained using forecasts with a spatial mask, has a higher tolerance of the distance between nearby forecasts as long as forecasts represent roughly the same spatial region. This tolerance is limited when forecasts from a distant location are used (lower right of Fig. 5D). The roughness in the error surface associated with Fig. 5A–C is caused by ignoring spatial information during the weather analog search. Among these techniques, selecting the proper forecast grid point to use for analog identification has a bigger impact (shown by the fast increase of errors when deviating from the predicted location), but the impact is smaller with DA Spatial when spatial information is already encoded internally. Although DA SSE has been exposed to nearby forecasts during the training period, it is not as effective as the DA Spatial where convolutional layers are used together with LSTM layers to improve the spatial awareness of the embedding model. Lastly, all techniques show a slight shift regarding the forecasted location and the best search grid. The best search grid point is determined based on the lowest forecast error. This could be attributed to the incorrectness in resolution-limited model topography. The geographic shift demonstrates the necessity of adopting the SSE on top of AnEn and DA for including potentially better locations for the identification of weather analogs. Similar results are also observed for GFS. Verification can be found in Appendix B.

224

8. Theory of spatiotemporal deep analogs and their application to solar forecasting

FIG. 6 Prediction MAE of solar irradiance as a function of the distance between the forecast location and a nearby search location in SSE. Forecasts fixed at the predicted location is compared to forecasts from a nearby location during the identification of weather analogs. Distances are binned every 0.5 degrees, roughly every 50 km.

Fig. 6 shows the prediction MAE of solar irradiance as a function of the distance between compared forecasts. With SSE, the target forecast is compared with forecasts from nearby locations to increase the number of search samples. Fig. 6 further investigates the sensitivity of prediction error with respect to the search distance. Generally, it is well expected that as the search distance increases, the prediction error will also increase. However, the rate of the increase sheds light on how different techniques behave with spatially variant forecasts. Since DA IS is trained with data only from the predicted location, its prediction error appears lower than that of the AnEn within the first bin, (0.0, 0.5]. However, its error quickly grows as nearby forecasts, that have never been seen by the network, are added. In comparison, DA SSE, trained with single-grid forecasts but from nearby locations, shows a

4 Results and discussion

225

constrained growth of error as a function of distances for GFS. Typically, the error growth of DA SSE is slower than that of the DA IS, meaning that the model indeed recognizes similar weather patterns even with nearby forecasts. The AnEn shows a similar error signature with DA SSE, likely because optimal predictor weights are also applicable to nearby regions governed by similar weather regimes. The relationship between DA IS and SSE is reversed with NAM in Fig. 6A. This is because excessive model grids from NAM are used during training. In the training domain, there are 2460 grid points from NAM, however, only 576 from GFS 0.25 degrees, 144 from GFS 0.50 degrees, and 36 from GFS 1.00 degrees. Too many nearby grid points prevent the model from learning patterns that are transferable, resulting in a worse performance. Results from DA Spatial are missing in the last bin, (3.5, 4.0], in Fig. 6A because, different from GFS, NAM model grids do not lie on a regular latitude/longitude mesh. Therefore, grid points close to the edges of the domain do not have enough nearby grid points to construct a meaningful spatial mask for analog forecasts.

4.3 Weather analog identification One of the advantages of using an analog-based technique for weather forecasting lies in the identified weather analogs. Visualizing these weather analogs helps to understand how a certain prediction is made. Weather analogs can be treated as a local cluster of weather events centered at the predicted weather event through the lens of the adopted similarity metric. Analyzing these clusters can help better understand the advantages and limitations of the similarity metric. Fig. 7 examines forecasts generated for the particular day on May 5, 2019. Two types of ensembles are shown, the AnEn in blue and DA Spatial in black. The deterministic forecasts from NAM and SURFRAD measurements are shown in horizontal lines. As a recapitulation, weather analogs are identified by comparing target and historical NAM forecasts. However, the generated ensembles are historical observations associated with weather analogs (Type A). The difference between the AnEn and the DA Spatial is that the AnEn has a complete reliance on weather forecasts during the identification of weather analogs while DA Spatial also uses observations to determine whether the two forecasts should be deemed analog or not, given a potentially imperfect and biased weather forecast. The difference is illustrated in Fig. 7. Given an imperfect weather forecast of solar irradiance (63.51 W/m2 forecasted while 103.19 W/m2 observed), the goal is to generate an ensemble that centers at the observation (solid red line) with a reasonable spread. Both the AnEn (blue solid line) and DA Spatial (solid black line) ensemble members start at a solar irradiance lower than the observation, but the remaining AnEn members significantly overpredict the target which will eventually skew the distribution toward its right tail, leading to an overall overprediction. DA Spatial, on the other hand, performs better in terms of centering the ensemble around the observation. This difference is, indeed, caused by the guided weather analog identification process with observations and the training process with RA. The most similar NAM forecasts from the AnEn (dashed blue line) have a similar magnitude of solar irradiance to the given target (dashed red line). This is because solar irradiance is weighted highest as the most important predictor in finding weather analogs. However, similar forecasts from DA Spatial (dashed

226

8. Theory of spatiotemporal deep analogs and their application to solar forecasting

FIG. 7 Two analog ensembles on May 5, 2019 generated by the AnEn and DA Spatial with NAM. Ensemble members are ranked based on increasing member values. The NAM forecast and SURFRAD measurement are shown in horizontal lines.

black line) do not follow the target forecast (dashed red line), rather they are picked based on a set of latent features generated by the embedding model. These latent features are nonlinear combinations of weather variables that aim to relate weather analogs directly to the underlying observations. Another way to visualize the difference is to locate forecasts within the latent space. Fig. 8 shows a scatter plot with the distance from the target forecast to the search forecast on the horizontal axis and the prediction error (bias) on the vertical axis. Historical forecasts available for search are shown in gray; analog forecasts from DA Spatial are shown in red; analog forecasts from AnEn are shown in green; analog forecasts from RA are shown in blue. The positive correlation between the distance and the prediction bias is an indicator of effective learning that the latent space is constructed based on clustering forecasts that ultimately contribute to better prediction accuracy. Analogs from DA Spatial lie to the left of the figure because they are selected based on minimizing distance in the latent space. Analogs from RA lie at the bottom of the figure because they are selected based on minimizing prediction error (difference in observation), which is ideal but only possible during training. Three samples from the RA group have been selected as well by the DA Spatial group. Although the rest of the samples are not selected, DA Spatial is still able to pick good-quality analogs using the latent features. In contrast, analogs from AnEn are shown to be associated with higher prediction error and a larger distance in the latent space. Figs. 9 and 10 compare the spatial features of weather analogs that were previously missing in Figs. 7 and 8. Both figures have the same setup with the target NAM forecast shown in the slightly enlarged panel and the weather analogs shown in the remaining panels. Text labels to the upper left of each panel show the date and the associated observation of the weather analog.

4 Results and discussion

227

FIG. 8 Bivariate scatter plot with distance in the latent space on the horizontal axis and difference between the forecasted and the observed solar irradiance on the vertical axis. The NWP model is NAM and the date of the prediction is May 5, 2019. In all, 21 members are shown for DA, AnEn, and RA.

In Fig. 9, the AnEn only compares forecasts at the predicted location (star). The limitation can be quickly spotted when weather analogs are presented with a spatial domain. On May 5, 2019, the target forecast predicts high solar irradiance to the northwest of the domain and largely low solar irradiance in the rest of the area. The AnEn fails to pick up this large-scale feature. It is prohibited from detecting this signal with only single-grid forecasts. Relying on a single-grid point also leads to unrealistic weather analogs like the members no. 14, 18, and 20 in Fig. 9. These weather patterns can be relatively easy to distinguish from the target forecast by treating nearby forecasts as additional predictors, like the AnEn Spatial. However, it would be extremely hard for the AnEn to rule out these analog members. Fig. 10 shows the same target NAM forecast and the weather analogs but from DA Spatial. The most notable difference, from Fig. 9, is the largely consistent magnitude of solar irradiance within the vicinity of the predicted location (star). However, it is important to note that this similarity in the forecast space is not necessary for a historical forecast to be deemed as an analog, for example, the members no. 8 and 19 in Fig. 10. DA Spatial relies on the latent features to find weather analogs. The transformation from the original forecast space to the latent space is a result of the model learning via RA. An example similar to Figs. 7–10, but under a high irradiance regime, can be found in Appendix C. Similar remarks can be made where the AnEn finds similar forecasts that potentially lead to under-prediction while DA Spatial is able to rely on the built relationship between forecasts and observations to find better weather analogs. It has been shown that weather analogs defined with the latent features, generated by DA Spatial embedding network, yield improved prediction accuracy and have better spatial

228

8. Theory of spatiotemporal deep analogs and their application to solar forecasting

FIG. 9 Weather analogs identified by the AnEn using NAM on May 5, 2019. Data points are color coded with the forecasted solar irradiance. The target forecast is shown in the slightly larger panel. The 21 members are shown in the remaining panels. Labels at the upper right of each panel show the date of the forecast and the associated SURFRAD measurement at the predicted location (star).

awareness. It is preferable to define weather analogs with these generated latent features than with the original predictors. But the issue is since latent features are nonlinear combinations of the multivariate forecasts, it is challenging to pin down physical relationships from original predictors to latent features. In this work, we attempt to decipher the correlation between a single latent variable and the spatial features of solar irradiance. Fig. 11 shows the bivariate scatter plot in the larger panel, with a particular latent feature, no. 115, on the horizontal axis, and the SURFRAD measurements of solar irradiance on the vertical axis. We manually identify three clusters based on the scatter plot and backtrace

4 Results and discussion

229

FIG. 10

Weather analogs identified by the DA Spatial using NAM on May 5, 2019. Data points are color coded with the forecasted solar irradiance. The target forecast is shown in the slightly enlarged panel. The 21 members are shown in the remaining panels. Labels at the upper left of each panel show the date of the forecast and the associated SURFRAD measurement at the predicted location (star).

the forecasts within these clusters. Forecasts are shown in the rest of the panels. These forecasts are not identified by either AnEn or DA, rather they are selected based on a particular latent feature and the associated observations. The key concern is to see what spatial patterns and observational patterns this particular latent feature relate to. Clearly, Fig. 11 (1) shows a nonlinear relationship between solar irradiance and latent features. Three clusters are selected based on low/mid/high values of the latent feature. Fig. 11 (2–31) shows that the first cluster corresponds to cloud-free regimes in the close vicinity of the predicted location (star), although dissimilarity can be observed to the northeast in

230

8. Theory of spatiotemporal deep analogs and their application to solar forecasting

FIG. 11

(1) The bivariate scatter plot with the latent feature no. 115 on the horizontal axis and SURFRAD observations on the vertical axis. Three manually chosen clusters are also shown. (2–11) Weather analogs from cluster 1; (12–21) weather analogs from cluster 2; and (22–31) weather analogs from cluster 3.

Fig. 11 (8–10). It is important to point out that DA Spatial has not been informed of the predicted location during training since longitude and longitude information is not part of the input. During training, it is simply provided with three forecasts of which two are more similar. The fact that the model has a preference for certain regions is already a strong indication of effective learning. The third cluster mostly features heavy cloud cover probably with moving cloud patterns in the diagonal direction (Fig. 11 (26–29)). Finally, the second cluster features some amount of cloud cover but it does not present an obvious pattern. This could be because the second cluster, being at the upper right corner in Fig. 11 (1), has a slightly larger within-cluster variance compared to the other two clusters. However, it is important to note that, even the observations associated with the first and the second clusters are very similar, DA Spatial can distinguish them due to vastly different spatial patterns and governing weather regimes.

4.4 Machine learning interpretability via attribution DA Spatial has been shown to outperform other variants of the AnEn. However, the explanation of the prediction generated by DA Spatial and the associated reasoning is still not clear.

4 Results and discussion

231

It is important to make sure that the trained NNet learns the spatial features helpful for predictions, rather than developing unexpected relationships from hidden biases in the dataset. This section attempts to interpret the relationship between model predictions and model input. The already trained embedding network from NAM is put to analysis. A key difference from model training is that model interpretation does not alter model weights as they have already been optimized. Rather, model interpretation aims to quantify the contribution of each input feature to the final prediction by analyzing intermediate values of the model, like gradients. The integrated gradient (IG) is chosen as the main model interpreting technique considering its computational benefits, axiomatic justification, and application robustness. Please see Appendix D for detailed discussions on the IG. A baseline image, indicating noninformation, is needed by the IG to quantify feature attribution. Noninformation indicates that the image contains no useful features for the network to make a decision. In CV, it can be represented by a blurred version of the input image or a black image (all zeros). Using a black image is not applicable in this case because it introduces bias into the attribution process. A black image depicts a valid physical state in weather forecasting, for example, zero can represent a valid temperature or wind speed. This does not fulfill the requirement of noninformation. We, therefore, used a Gaussian filter with a standard deviation of 10  σ of the input to smooth all channels independently, and the blurred images are used as baseline images. An example is provided in Fig. D.1. The severity of the smoothing (10  σ) is arbitrary but we postulate, by assessing visually, that any details have already been effectively removed from the original image. Fig. 12 shows the estimated feature importance for the trained DA Spatial embedding model for NAM. First, the IG is applied to a series of input samples that include the first day of each month throughout the test period. The IG can only be applied to attribute a single

FIG. 12

Feature importance calculated using the trained DA Spatial embedding model for NAM. dswrf, tcc, r, t, and q stand for downward shortwave radiation flux, total cloud cover, relative humidity, temperature, and specific humidity, respectively. Feature importance is estimated using the IG for all 120 latent features and accumulates across input samples. Input samples include the first day of each month from January 2018 to October 2019. IG values are normalized between [0, 1]. Error bars (red) show the standard deviation of importance over input samples.

232

8. Theory of spatiotemporal deep analogs and their application to solar forecasting

latent feature at a time, so this operation needs to be repeated 120 times to cover all latent features. After these two steps, an attribution value is calculated for each grid, between an input feature and a latent feature, for all the input samples. To estimate the per-variable feature importance, the maximum IG value is preserved within the spatial domain because it is a measure of the potential maximum importance associated with a certain input to a certain latent feature. Then the IG values are averaged across 120 latent features and all input samples to estimate the per-variable feature importance. The error bars (red) show the standard deviation of feature importance calculated over input samples. Solar irradiance and total cloud cover appear to be among the most important features for DA Spatial predictions. This is consistent with the fact that weather analogs for solar irradiance forecast should exhibit similarity in these two fields that have a strong correlation. DA Spatial is an empirical model that has never been exposed to any prior physical relationship between the forecasted solar radiation/total cloud cover and the observed irradiance. The reasonable rank of importance shows the result of model learning and the efficacy of the RA training. Relative humidity has been ranked as the next important feature, although relative humidity at various vertical heights exhibits different importance. On average, the 500 hPa level has the highest importance. The 500 hPa level is, in fact, important in weather forecasting because it is very close to the level of nondivergence. It allows for an efficient analysis of vorticity, and in this case, the potential for cloud formation. Forecasters also use the 500 hPa level to locate troughs and ridges, which are the upper-air counterparts of surface cyclones and anticyclones. Previous literature has also favored variables at this level for identifying weather analogs (van den Dool, 1989). Finally, surface variables including temperature and specific humidity have lower attribution to the final prediction. This suggests that the model relies less on these variables when identifying weather analogs for solar irradiance predictions. Another interpretation is that surface variables are less important in learning the relationship between model forecasts and solar irradiance observations. Fig. 13 provides visual evidence of spatial attribution of a single input feature, solar irradiance, because of its outstanding feature importance. DA Spatial is merely an embedding network that performs a particular transformation on the input forecasts and outputs a set of latent features that describes the input forecasts. Weather analogs are, then, identified using the latent features. Some latent features are, to be expected, more effective under certain scenarios. For example, to make an analogy with the original weather variables, the total cloud cover would be a less important variable on a clear day, and therefore, other variables might be more important in detecting weather similarity. However, this does not impair the performance of DA Spatial overall because weather analogs are identified with all 120 latent features at the same time. To focus the analysis on the most important feature, Fig. 13 identifies two triplet forecasts (on each of the two rows). Anchors are target forecasts from the test period; positives are the analog forecasts identified by DA Spatial for the target; negatives are the forecasts that DA Spatial leaves out as dissimilar weather forecasts. The most important latent feature is determined as the latent feature that leads to the largest marginal difference between the anchor-positive and the anchor-negative pairs. Formally, the most important latent feature is defined as

4 Results and discussion

233

FIG. 13 Spatial attribution of solar irradiance for DA Spatial network with NAM input. Two example triplets from the training period are shown in rows for two latent features, no. 81 (1, 2, 3) and no. 98 (4, 5, 6). The backdrop is the solar irradiance from NAM and the highlighted region is masked based on the IG values higher than the 85th percentile within the domain. The highlighted region shows which part of the domain most affects the final prediction. The SURFRAD station is shown in red star.

p

arg max ðjlai  li j  jlai  lni jÞ,

(2)

1i120

where lai is the ith latent feature associated with the anchor forecasts, and p and n for positive and negative forecasts, respectively. This latent feature alone contributes the most to the final distance that separates the positive and the negative input. Finally, the IG values are calculated for this latent feature with respect to the solar irradiance of the input. For the first triplet (first row of Fig. 13), Fig. 13 (2) is a weather analog of Fig. 13 (1) because the difference in observation is smaller compared to Fig. 13 (3) (155.94  103.19 < 223.93  103.19 W/m2). DA Spatial is supposed to focus on the proper region to make a decision. It is obvious that, if only looking at the predicted location (red star), the anchor forecast turns out to be more similar to the negative forecast (88.45  63.51 < 480.01  63.51 W/m2). This is caused due to a significant overprediction in the positive forecast and an underprediction in the negative forecast. Instead, DA Spatial focuses on the high irradiance region in Fig. 13 (3) considering that the observed irradiance (223.93 W/m2) is actually much higher than the forecasted irradiance (88.45 W/m2). The spatial feature, for example, boundaries of the high-to-

234

8. Theory of spatiotemporal deep analogs and their application to solar forecasting

low solar irradiance region, is also highlighted indicating that DA Spatial has developed an effective spatial awareness. In contrast, Fig. 13 (2) does not display clear spatial features in the forecasted solar irradiance. DA Spatial, therefore, focuses on the nonregular region around the predicted location with lower solar irradiance to account for the overprediction of NAM. Finally, by focusing on the highlighted regions in Fig. 13 (1–3), DA Spatial is able to tell the correct positive forecast. This positive forecast will then be included in the final analog ensemble forecast. Similar remarks can be made from the second triplet example.

5 Final remarks This chapter proposes a spatiotemporal weather similarity metric based on a trained NNet. Convolutional NNets are used to encode the spatial information into the similarity metric. The results show that a slight relaxation in the spatial component of analog search does not prevent it from finding good analogs. Instead, DA outperforms both the AnEn and the AnEn Spatial. The former suggests that the inclusion of spatial features can help find better weather analogs and eventually generate better predictions. The latter comparison reveals the necessity of convolutional layers. Convolutional layers can extract and exploit abstract spatial features where a spatial similarity metric operating on a pixel level (AnEn Spatial) falls short. Experiments have been carried out on four different spatial resolutions, ranging from 0.125 to 1 degree with NAM and GFS. The prediction accuracy is not a monotonous function of the resolution. In general, both AnEn and DA show better predictions when applied to GFS compared to NAM. This difference is model-specific. As shown in Fig. 1, models have been constantly updated and the behavior is unstable. Analog forecasting techniques could potentially suffer from model nonstationarity. However, DA shows a better tolerance to model updates compared to AnEn due to the more complex similarity metric. This finding is consistent with a previous study (Hu et al., 2023). Weather analogs identified by spatial forecasts are found to be more robust considering nearby forecasts. This is helpful in terms of the SSE where nearby forecasts can be used as potential forecasts to search from. But if the similarity metric is not robust within its immediate neighbor, applying SSE could lead to a counter effect. Since spatial information exhibits natural autocorrelation, using spatial forecasts can effectively alleviate the location-based sensitivity of the similarity metric. As a result, nearby forecasts can be safely searched when DA Spatial is applied. Interpretation of the proposed model has also been studied with the IG as a sanity check on whether the model has learned the physical relationship between input features and the predicted variable. Results are supportive, displaying evidence that the NNet has effectively learned different feature importance and learned to relate certain spatial features to constructing effective latent features. This is contributed to the training techniques including triplet network training and the RA technique. In terms of computational cost, DA can be slower than the AnEn. This might appear counterintuitive to some at first sight, but it is well expected. DA is not an end-to-end prediction model, rather it is a transformation model that converts the high-dimensional weather forecast to its latent representation. The runtime of AnEn depends on the length of the search and the number of predictors. In this case, the number of predictors in DA (120) is more than that in the AnEn (5). Given the same search period, DA is expected to be slower than AnEn.

6 Assignment

235

However, if a fair comparison is truly desired, the same input predictors of DA (39  39  10) should be used by the AnEn. But it is shown (Hu et al., 2023) that the AnEn does not perform well with a large number of predictors and the computational cost of weight optimization could be prohibitive. On the other hand, parallel analog ensemble (PAnEn) (Hu et al., 2021a) provides an efficient and scalable implementation for the AnEn and DA. It achieves about 95% parallelism (Adebayo et al., 2020) on supercomputers in previous benchmark tests. The computational requirement of DA can be fully met with supercomputers and tools like PAnEn. DA provides a powerful framework to identify weather analogs with both space and time. While AnEn succeeds in showing that localization is the key to effective weather analogs, DA takes the discussion a step further, showing that localized spatial patterns can also be used to identify even better weather analogs. Future contributions could focus on applying this framework to other variables of interest, for example, wind speed, or on a gridded observational product. Model interpretation of the embedding network is also another promising direction that sheds light on model reliability.

6 Assignment Now that you have read about how AnEn and DA work, it is time for some hands-on exercises. To help you better understand the mechanics and guide you through using these techniques in your own projects, we designed the following assignments. 1. Hands-on tutorials on AnEn. (a) Explore the landing page and the documentation of AnEn at https://weiming-hu. github.io/AnalogsEnsemble/. (b) There are multiple ways to run AnEn, for example, with R, C++, or in a browser. Decide the most applicable environment for you and finish the installation as instructed. (c) Follow the tutorials at https://weiming-hu.github.io/AnalogsEnsemble/tags. html#tutorial. It is recommended to start with the tutorials named Basics of RAnEn and Search Space Extension with RAnEn. (d) Do you have your own dataset that you hope to run AnEn with? Reformat your dataset according to the format guide at https://weiming-hu.github.io/AnalogsEnsemble/ 2019/01/16/NetCDF-File-Types.html and run AnEn with your own dataset. 2. Using DA for prediction (a) Install Python 3.x on your machine. (b) Install the DA module from https://github.com/Weiming-Hu/DeepAnalogs. (c) Get familiar with the arguments of deep_analogs_train. It allows to build networks with or without the convolution architecture. A full list of supported arguments can be found at https://github.com/Weiming-Hu/DeepAnalogs/blob/ 25870bbca137a5fd6927f2cc47ea70d3381046ac/Examples/example.yaml. (d) Train a DA without the convolution architecture, since we have only a limited domain, using data from the previous exercise. (e) Generate ensemble forecasts with the trained DA network. This might require setting up AnEn with C++.

236

8. Theory of spatiotemporal deep analogs and their application to solar forecasting

7 Open questions The following open-ended questions are provided for you to consider: 1. AnEn and DA both rely on finding weather analogs to generate ensemble forecasts. What are the key differences between AnEn and DA? 2. Both AnEn and DA belong to analog-based forecasting techniques. How does analogbased techniques perform for forecasting extreme events? And how would the deep learning architecture in DA impact its performance for forecasting extreme events? 3. What are the pros and cons of having a forecast ensemble, as opposed to a deterministic prediction. Try to approach this question from the point of view of a modeler (how forecasts are generated) or a user (how predictions are interpreted).

Appendix A Deep learning layers and operators A.1 Convolution Convolutional layers (Xu et al., 2015; Bertinetto et al., 2016; Liu et al., 2019) are composed of convolutional kernels and an additional bias term. Each kernel has a set of weights that ought to be optimized during backward propagation. Given a multichannel image, usually represented as a three-dimensional data structure, Ichw, and a convolutional layer with ~c kernels, each kernel, ~cj , carry out the following transformation on the input: ~I~c ¼ Conv~c ðIÞ ¼ B~c + j j j

c X

W~cj ,k ?I k ,

(A.1)

k¼1

where ~I~cj is the transformed image by the kernel ~cj , B~cj is the bias term associated with the kernel, W~cj ,k is the kernel weights for the input channel k, Ik is the input on the channel k, and ? is the two-dimensional cross-correlation, also known as the sliding dot product operation. Each kernel operates on the full channels of the input image and generates a single-channel output transformed image. It serves as an image feature extraction operator. To extract more features from the image, multiple kernels are usually stacked and used together. The number of kernels used to process the input image would be the number of output channels, in this case, ~c. Due to the cross-correlation, the image size, originally h  w, shrinks after each convolution depending on the size of the kernel. This sometimes creates problems, especially on input images with already low resolutions. A common solution is to apply padding (Nam and Hung, 2019; Dwarampudi and Reddy, 2019; Hashemi, 2019) on I to increase the image size before the convolution. Some popular choices are zero-value padding or the same padding where values on the edge of the input are copied and used to increase the image size.

A.2 Nonlinear activation Despite its efficacy, Eq. (A.1) is essentially a linear transformation of the input image. It certainly would fail at capturing more abstract image features, or the nonlinear relationship between

237

Appendix A Deep learning layers and operators

the input and the target. Therefore, the output of convolutional layers is usually fed into a nonlinear activation. The nonlinear activation function is typically written as σ. Two nonlinear activation functions are commonly used in the literature (Bircano glu and Arıca, 2018). σ 1 ðxÞ ¼ SigmoidðxÞ ¼ σ 2 ðxÞ ¼ LeakyReLUðxÞ ¼

1 , 1 + ex  px, if x < 0 x,

otherwise

(A.2) ,

(A.3)

where p is a hyperparameter that needs to be determined a priori, usually a small positive value like 0.01. This parameter was introduced to mitigate the “dying ReLU” problem (Pedamonti, 2018) where, during training, the network stops learning because of negative values, hence, zero gradients and updates. This could be caused by improper initialization or the improper normalization of the data.

A.3 Pooling Pooling is another operation used by convolutional neural networks (CNNs). Pooling layers are used to reduce the dimensionality of the input images. As a result, it reduces the number of parameters the model has to learn and also helps to constrain the computation performed by the network. Another consideration for pooling layers is that it effectively summarizes signals within the region that it operates on so that later operations only perform on the higher-level feature, rather than on exact locations. Pooling layers encourage the model to focus on abstract features from the input, rather than pixel-level signals so that the trained models can be robust to regional variations like rotations and scaling. The pooling operator used in this work is the max-pooling: MaxPoolðIÞ ¼

max

max

m¼0, …, kH1 n¼0, …, kW1

I m,n ,

(A.4)

where kH and kW are the kernel size of the operator. The max-pooling operator is also applied to the input image in a sliding fashion so it would inevitably cause shrinking of the image size. Similarly, it is not uncommon to use padding with max-pooling to preserve the original data dimension. It is also important to note that max-pooling operators do not have trainable parameters.

A.4 Convolutional long short-term memory network The last piece to the hidden blocks is the convolutional LSTM model (Shi et al., 2015). Building on top of the LSTM network, the convolutional variant was originally proposed for precipitation nowcasting, for its capability in dealing with spatiotemporal sequence data. Here, the convolutional LSTM is used in DA Spatial indeed because of the spatiotemporal nature of the weather analog identification. Similar to the architecture of the LSTM embedding network (Hu et al., 2023), the convolutional variant is also composed of the following gates. 1. An update gate: Γ u ¼ σ ðW xu ⁎ X t + W hu ⁎Ht1 + bu Þ,

(A.5)

238

8. Theory of spatiotemporal deep analogs and their application to solar forecasting

2. A forget gate:

  Γ f ¼ σ W xf ⁎ X t + W hf ⁎ Ht1 + bf ,

(A.6)

Γ o ¼ σ ðW xo ⁎ X t + W ho ⁎ Ht1 + bo Þ,

(A.7)

~chti ¼ tanhðW xc ⁎ X t + W hc ⁎ Ht1 + bc Þ,

(A.8)

chti ¼ Γ u ⁎ ~chti + Γ f ⁎ cht1i ,

(A.9)

ahti ¼ Γ o ⁎ tanh chti ,

(A.10)

3. An output gate:

4. Input transformation:

5. Cell state update:

6. Activation output:

where all W and b are model weights and biases to be learned and * indicates the convolutional operation. For example, Wxu is the weights connecting input and the update gate, X t is the input at time t, and Ht is the hidden state at time t. Note that the only difference between Γ u, Γ f, and Γ o is the model weights to be learned. The LSTM component of this model, manifested by the construction of a new cell state in Eq. (5), effectively avoids the gradient vanishing problem that previously haunted recurrent neural networks (RNNs) (DiPietro and Hager, 2020). Notice that there is a direct path from cht1i to chti controlled solely by the forget gate, Γ f. This means, despite the complex connection between two timestamps at t and t + δ, there is also one of these paths that is only a simple product between the forget gate and the previous cell state. It effectively is an additive gradient component. The gradient contributions can still decay exponentially with the time difference δ, but as long as the forget dates have elements that are close to 1, the base of the exponential decay is also close to 1, significantly slowing down the decaying of the gradients. The convolutional component of this model is represented by the convolutional operation in each of the gates, different from the fully connected product in the original LSTM network. To put this into context, in the convolutional LSTM, W are extended to be three-dimensional tensors with the last two dimensions being the height and the width. The input, X t can be imaged as vectors standing on a spatial grid. The original LSTM is a specific case of the convolutional LSTM where all features stand on a single grid.

Appendix B Verification of extended analog search with GFS See Figs. B.1–B.3.

FIG. B.1 Similar to Fig. 5 but for GFS with 0.25 degrees horizontal resolution.

FIG. B.2 Similar to Fig. 5 but for GFS with 0.50 degrees horizontal resolution.

FIG. B.3 Similar to Fig. 5 but for GFS with 1.00 degrees horizontal resolution.

Appendix C Weather analog identification under a high irradiance regime See Figs. C.1–C.3.

FIG. C.1

Two analog ensembles on March 4, 2018 generated by the AnEn and DA Spatial with NAM. Ensemble members are ranked based on the increasing member values. The NAM forecast and SURFRAD measurement are shown in horizontal lines.

Appendix C Weather analog identification under a high irradiance regime

241

FIG. C.2 Weather analogs identified by the AnEn using NAM on March 4, 2018. Data points are color coded with the forecasted solar irradiance. The target forecast is shown in the slightly larger panel. The 21 members are shown in the remaining panels. Labels at the upper right of each panel show the date of the forecast and the associated SURFRAD measurement at the predicted location (star).

242

8. Theory of spatiotemporal deep analogs and their application to solar forecasting

FIG. C.3 Weather analogs identified by the DA Spatial using NAM on March 4, 2018. Data points are color coded with the forecasted solar irradiance. The target forecast is shown in the slightly enlarged panel. The 21 members are shown in the remaining panels. Labels at the upper left of each panel show the date of the forecast and the associated SURFRAD measurement at the predicted location (star).

Appendix D Model attribution Understanding how a trained model produces the particular output, or “makes a decision,” is of paramount importance. It is also a central challenge currently in the ML and nearly all the applied science to quantify the contribution of responsible factors. Although a global interpretation of ML models are currently hard to achieve, there are various

Appendix D Model attribution

243

local explanation methods that derive understanding from a trained model and evaluate what features of a single input are the most important to the final prediction. To draw an analogy to a linear model, y ¼ k1  x1 + k2  x2 + ⋯ , the contribution of a particular feature, say x1, to the final prediction, y, can be exactly calculated as the product of the input value and its coefficient term, denoted as x1  k1. This product quantifies the amount of contribution from a particular feature of a single input and the coefficient term is usually used as the feature importance in global interpretation. Although ML models, being nonlinear, do not have the equivalent coefficient terms, the gradient is a natural analog of the linear model coefficient for a deep network (Sundararajan et al., 2017), and therefore, the product of the gradient and the feature value is a reasonable starting point for an attribution method. The gradient we are referring to here is the gradient of the output with respect to the input, formally Egrad ðxÞ ¼ ∂S ∂x where S is the model prediction with input x. And the product-based attribution calculated as x⊙ ∂S ∂x, where ⊙ is the element-wise product. The problem with gradients is that it usually produces visually noisy attribution maps that yield little to no assistance to model interpretation. This is caused by “gradient saturation” (Adebayo et al., 2020) that a zero gradient does not necessarily mean zero contribution. The IG is an attribution method derived from the product of gradients and input values (Sundararajan et al., 2017). Given a single input, rather than calculating gradient directly from this one input, it addresses the gradient saturation by summing over scaled versions of the input from a constructed (or predefined) baseline. This baseline represents a state of noninformation. Typically, in an image classification task, a baseline is a black image. Given a category and an input image whose attribution is desired, the IG linearly interpolate images from the baseline to the input image, resulting in a series of images; and then the gradients are accumulated along with this “straight” line from the baseline to the given input. Formally, the IG calculates the following statistics for a given input image x: Z 1   ∂Fðx0 + α  ðx  x0 ÞÞ IntegratedGradsi ðxÞ ::¼ xi  x0i  dα, (D.1) ∂xi α¼0 where xi and x0i are the ith dimension of the input x and the baseline x0 , respectively; F is a trained ML model that is kept static of the weights; and ∂FðxÞ ∂xi is the gradient of F(x) along the ith dimension. The IG has gained significant popularity in recent years. It is relatively easy to implement and only requires black-box access to the ML model, meaning that it is a nonintrusive attribution technique. On the other hand, recent sanity checks (Adebayo et al., 2020) have favored gradient-based attribution over generative techniques such as Guided GradCAM and Guided BackProp due to their better performance on model parameter randomization tests and data randomization tests. The baseline images used in this study are created by applying a Gaussian filter, independently for all input and all channels, with a standard deviation of 10  σ. Fig. D.1 compares solar irradiance before and after the smoothing from two different dates. The smoothing has effectively removed spatial features and therefore the constructed baselines are treated as noninformation.

244

8. Theory of spatiotemporal deep analogs and their application to solar forecasting

FIG. D.1 Comparison between the original solar irradiance input and the baseline from January 1, 2018 (A, B) and September 28, 2018 (C, D). Baselines are created by applying a Gaussian filter with a standard deviation of 10  σ of the input channel.

References Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., Kim, B., 2020. Sanity checks for saliency maps. arXiv:1810.03292 [cs, stat]. http://arxiv.org/abs/1810.03292. Albawi, S., Mohammed, T.A., Al-Zawi, S., 2017. Understanding of a convolutional neural network. In: 2017 International Conference on Engineering and Technology (ICET), pp. 1–6. Alessandrini, S., Delle Monache, L., Sperati, S., Cervone, G., 2015. An analog ensemble for short-term probabilistic solar power forecast. Appl. Energy 157, 95–110. https://doi.org/10.1016/j.apenergy.2015.08.011. Alessandrini, S., Sperati, S., Delle Monache, L., 2019. Improving the analog ensemble wind speed forecasts for rare events. Mon. Weather Rev. 147 (7), 2677–2692. https://doi.org/10.1175/MWR-D-19-0006.1. Alom, M.Z., Taha, T.M., Yakopcic, C., Westberg, S., Sidike, P., Nasrin, M.S., Van Esesn, B.C., Awwal, A.A.S., Asari, V.K., 2018. The history began from AlexNet: a comprehensive survey on deep learning approaches. arXiv:1803.01164 [cs]. http://arxiv.org/abs/1803.01164. Augustine, J.A., DeLuisi, J.J., Long, C.N., 2000. SURFRAD—a national surface radiation budget network for atmospheric research. Bull. Am. Meteorol. Soc. 81 (10), 2341–2358. Augustine, J.A., Hodges, G.B., Cornwall, C.R., Michalsky, J.J., Medina, C.I., 2005. An update on SURFRAD—the GCOS surface radiation budget network for the continental United States. J. Atmos. Ocean. Technol. 22 (10), 1460–1472. Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.S., 2016. Fully-convolutional Siamese networks for object tracking. arXiv:1606.09549 [cs]. http://arxiv.org/abs/1606.09549. Bircano glu, C., Arıca, N., 2018. A comparison of activation functions in artificial neural networks. In: 2018 26th Signal Processing and Communications Applications Conference (SIU), pp. 1–4. Calovi, M., Hu, W., Cervone, G., Delle Monache, L., 2021. NAM-NMM temperature downscaling using personal weather stations to study urban heat hazards. GeoHazards 2 (3), 257–276. Cervone, G., Clemente-Harding, L., Alessandrini, S., Delle Monache, L., 2017. Short-term photovoltaic power forecasting using artificial neural networks and an analog ensemble. Renew. Energy 108, 274–286. https://doi. org/10.1016/j.renene.2017.02.052. Charles, A., Timbal, B., Fernandez, E., Hendon, H., 2013. Analog downscaling of seasonal rainfall forecasts in the Murray darling basin. Mon. Weather Rev. 141 (3), 1099–1117. https://doi.org/10.1175/MWR-D-12-00098.1. Clemente-Harding, L., 2019. Extension of the Analog Ensemble Technique to the Spatial Domain (Ph.D. thesis), Pennsylvania State University. Clough, S.A., Shephard, M.W., Mlawer, E.J., Delamere, J.S., Iacono, M.J., Cady-Pereira, K., Boukabara, S., Brown, P.D., 2005. Atmospheric radiative transfer modeling: a summary of the AER codes. J. Quant. Spectrosc. Radiat. Transf. 91 (2), 233–244. https://doi.org/10.1016/j.jqsrt.2004.05.058. Delle Monache, L., Eckel, F.A., Rife, D.L., Nagarajan, B., Searight, K., 2013. Probabilistic weather prediction with an analog ensemble. Mon. Weather Rev. 141 (10), 3498–3516. https://doi.org/10.1175/MWR-D-12-00281.1. Delle Monache, L., Alessandrini, S., Djalalova, I., Wilczak, J., Knievel, J.C., 2018. Air quality predictions with an analog ensemble. Atmos. Chem. Phys. Discuss., 1–36. https://doi.org/10.5194/acp-2017-1214.

References

245

DiPietro, R., Hager, G.D., 2020. Chapter 21—Deep learning: RNNs and LSTM. In: Zhou, S.K., Rueckert, D., Fichtinger, G. (Eds.), Handbook of Medical Image Computing and Computer Assisted Intervention (January). The Elsevier and MICCAI Society Book Series, Academic Press, pp. 503–519. Dwarampudi, M., Reddy, N.V.S., 2019. Effects of padding on LSTMs and CNNs. arXiv:1903.07288 [cs, stat]. http:// arxiv.org/abs/1903.07288. Fleming, J.R., 2004. Sverre Petterssen and the contentious (and momentous) weather forecasts for D-Day. Endeavour 28 (2), 59–63. https://doi.org/10.1016/j.endeavour.2004.04.007. Frediani, M.E.B., Hopson, T.M., Hacker, J.P., Anagnostou, E.N., Delle Monache, L., Vandenberghe, F., 2017. Objectbased analog forecasts for surface wind speed. Mon. Weather Rev. 145 (12), 5083–5102. https://doi.org/10.1175/ MWR-D-17-0012.1. Gao, B., Huang, X., Shi, J., Tai, Y., Xiao, R., 2019. Predicting day-ahead solar irradiance through gated recurrent unit using weather forecasting data. J. Renew. Sustain. Energy 11 (4), 043705. Gleick, J., 2011. Chaos: Making a New Science. Open Road Media (March). Hamill, T.M., Whitaker, J.S., 2006. Probabilistic quantitative precipitation forecasts based on reforecast analogs: theory and application. Mon. Weather Rev. 134 (11), 3209–3229. https://doi.org/10.1175/MWR3237.1. Hamill, T.M., Whitaker, J.S., Mullen, S.L., 2006. Reforecasts: an important dataset for improving weather predictions. Bull. Am. Meteorol. Soc. 87 (1), 33–46. https://doi.org/10.1175/BAMS-87-1-33. Hashemi, M., 2019. Enlarging smaller images before inputting into convolutional neural network: zero-padding vs. interpolation. J. Big Data 6 (1), 98. https://doi.org/10.1186/s40537-019-0263-7. Hoffer, E., Ailon, N., 2015. Deep metric learning using triplet network. In: Feragen, A., Pelillo, M., Loog, M. (Eds.), Lecture Notes in Computer Science. Similarity-Based Pattern Recognition, Springer International Publishing, Cham, pp. 84–92. Hu, W., Cervone, G., 2019. Dynamically optimized unstructured grid (DOUG) for analog ensemble of numerical weather predictions using evolutionary algorithms. Comput. Geosci. 133, 104299. https://doi.org/10.1016/j. cageo.2019.07.003. Hu, W., Cervone, G., Clemente-Harding, L., Calovi, M., 2021. Parallel analog ensemble—the power of weather analogs. NCAR Technical Notes NCAR/TN-564+ PROC, 1. Hu, W., Cervone, G., Merzky, A., Turilli, M., Jha, S., 2022. A new hourly dataset for photovoltaic energy production for the continental USA. Data Brief 40, 107824. Hu, W., Cervone, G., Young, G., Delle Monache, L., 2023. Machine learning weather analogs for near-surface variables. Bound. Layer Meteorol., 1–25. Iacono, M.J., Mlawer, E.J., Clough, S.A., Morcrette, J.-J., 2000. Impact of an improved longwave radiation model, RRTM, on the energy budget and thermodynamic properties of the NCAR community climate model, CCM3. J. Geophys. Res. Atmos. 105 (D11), 14873–14890. https://doi.org/10.1029/2000JD900091. Junk, C., Delle Monache, L., Alessandrini, S., 2015. Analog-based ensemble model output statistics. Mon. Weather Rev. 143 (7), 2909–2917. https://doi.org/10.1175/MWR-D-15-0095.1. Kim, K.-S., Lee, J.-B., Roh, M.-I., Han, K.-M., Lee, G.-H., 2020. Prediction of ocean weather based on denoising AutoEncoder and convolutional LSTM. J. Mar. Sci. Eng. 8 (10), 805. https://doi.org/10.3390/jmse8100805. Kindermans, P.-J., Sch€ utt, K.T., Alber, M., M€ uller, K.-R., Erhan, D., Kim, B., D€ahne, S., 2017. Learning how to explain neural networks: PatternNet and PatternAttribution. arXiv:1705.05598 [cs, stat]. http://arxiv.org/abs/1705. 05598. Lawrence, S., Giles, C.L., Tsoi, A.C., Back, A.D., 1997. Face recognition: a convolutional neural-network approach. IEEE Trans. Neural Netw. 8 (1), 98–113. https://doi.org/10.1109/72.554195. Liu, X., Zhou, Y., Zhao, J., Yao, R., Liu, B., Zheng, Y., 2019. Siamese convolutional neural networks for remote sensing scene classification. IEEE Geosci. Remote Sens. Lett. 16 (8), 1200–1204. https://doi.org/10.1109/ LGRS.2019.2894399. Lorenz, E.N., 1963. Deterministic nonperiodic flow. J. Atmos. Sci. 20 (2), 130–141. Lorenz, E.N., 1969. Atmospheric predictability as revealed by naturally occurring analogues. J. Atmos. Sci. 26 (4), 636–646. https://doi.org/10.1175/1520-0469(1969)262.0.CO;2. Lydia, A., Francis, S., 2019. Adagrad—an optimizer for stochastic gradient descent. Int. J. Inf. Comput. Sci. 6 (5), 566–568. Mathiesen, P., Kleissl, J., 2011. Evaluation of numerical weather prediction for intra-day solar forecasting in the continental United States. Sol. Energy 85 (5), 967–977. https://doi.org/10.1016/j.solener.2011.02.013.

246

8. Theory of spatiotemporal deep analogs and their application to solar forecasting

McDermott, P.L., Wikle, C.K., 2016. A model-based approach for analog spatio-temporal dynamic forecasting. Environmetrics 27 (2), 70–82. https://doi.org/10.1002/env.2374. Monache, L.D., Nipen, T., Liu, Y., Roux, G., Stull, R., 2011. Kalman filter and analog schemes to postprocess numerical weather predictions. Mon. Weather Rev. 139 (11), 3554–3570. https://doi.org/10.1175/2011MWR3653.1. Nam, N.T., Hung, P.D., 2019. Padding methods in convolutional sequence model: an application in Japanese handwriting recognition. In: ICMLSC 2019. Proceedings of the 3rd International Conference on Machine Learning and Soft Computing, January, Association for Computing Machinery, New York, NY, pp. 138–142. O’Shea, K., Nash, R., 2015. An introduction to convolutional neural networks. arXiv:1511.08458 [cs]. http://arxiv. org/abs/1511.08458. Pedamonti, D., 2018. Comparison of non-linear activation functions for deep neural networks on MNIST classification task. arXiv:1804.02763 [cs, stat]. http://arxiv.org/abs/1804.02763. Ribeiro, M.T., Singh, S., Guestrin, C., 2016. “Why should I trust you?”: explaining the predictions of any classifier. In: KDD ’16. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August, Association for Computing Machinery, New York, NY, pp. 1135–1144. Rogers, E., Lin, Y., Mitchell, K., Wu, S., Ferrier, B., Gayno, G., Pondeca, M., Pyle, M., Wong, V., Ek, M., 2005. The NCEP North American Mesoscale Modeling System: final Eta model/analysis changes and preliminary experiments using the WRF-NMM. In: Preprints, 21st Conf. on Wea. Analysis and Forecasting/17th Conf. on Numerical Wea. Prediction, Washington, DC. vol. 4. American Meteorological Society, CD-ROM B. Rogers, E., DiMego, D., Black, T., Ek, M., Ferrier, B., Gayno, G., Janjic, Z., Lin, Y., Pyle, M., Wong, V., Wu, W.S., Carley, J., 2009. The NCEP North American mesoscale modeling system: recent changes and future plans. https://ams.confex.com/ams/23WAF19NWP/techprogram/paper_154114.htm. Schroff, F., Kalenichenko, D., Philbin, J., 2015. FaceNet: a unified embedding for face recognition and clustering. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June, IEEE, Boston, MA, pp. 815–823. Shao, Q., Li, M., 2013. An improved statistical analogue downscaling procedure for seasonal precipitation forecast. Stochastic Environ. Res. Risk Assess. 27 (4), 819–830. https://doi.org/10.1007/s00477-012-0610-0. Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-K., Woo, W.-C., 2015. Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems, vol. 28. https://proceedings.neurips.cc/paper/2015/hash/07563a3fe3bbe7e3ba84431ad9d055af-Abstract.html. Sundararajan, M., Taly, A., Yan, Q., 2017. Axiomatic attribution for deep networks. In: International Conference on Machine Learning, July, PMLR, pp. 3319–3328. http://proceedings.mlr.press/v70/sundararajan17a.html. Tian, D., Martinez, C.J., 2012. Comparison of two analog-based downscaling methods for regional reference evapotranspiration forecasts. J. Hydrol. 475, 350–364. https://doi.org/10.1016/j.jhydrol.2012.10.009. Timbal, B., McAvaney, B.J., 2001. An analogue-based method to downscale surface air temperature: application for Australia. Clim. Dynam. 17 (12), 947–963. https://doi.org/10.1007/s003820100156. Toth, Z., 1989. Long-range weather forecasting using an analog approach. J. Clim. 2 (6), 594–607. https://doi.org/ 10.1175/1520-0442(1989)0022.0.CO;2. van den Dool, H.M., 1989. A new look at weather forecasting through analogues. Mon. Weather Rev. 117 (10), 2230–2247. https://doi.org/10.1175/1520-0493(1989)1172.0.CO;2. Van den Dool, H.M., 1994. Searching for analogues, how long must we wait? Tellus A 46 (3), 314–324. Wang, F., Yu, Y., Zhang, Z., Li, J., Zhen, Z., Li, K., 2018. Wavelet decomposition and convolutional LSTM networks based improved deep learning model for solar irradiance forecasting. Appl. Sci. 8 (8), 1286. https://doi.org/ 10.3390/app8081286. Xu, K., Feng, Y., Huang, S., Zhao, D., 2015. Semantic relation classification via convolutional neural networks with simple negative sampling. arXiv preprint:1506.07650. Zhao, Z., Giannakis, D., 2016. Analog forecasting with dynamics-adapted kernels. Nonlinearity 29 (9), 2888–2939. https://doi.org/10.1088/0951-7715/29/9/2888. Zintgraf, L.M., Cohen, T.S., Adel, T., Welling, M., 2017. Visualizing deep neural network decisions: prediction difference analysis. arXiv:1702.04595 [cs]. http://arxiv.org/abs/1702.04595.

C H A P T E R

9 AI for improving ozone forecasting Ahmed Alnuaim (Alnaim)a,b, Ziheng Suna,b, and Didarul Islama,b a

Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA, United States bCenter for Spatial Information Science and Systems, George Mason University, Fairfax, VA, United States

1 Introduction Ground-level ozone is one of the six most common air pollutants, and considering its impact on health, the Environmental Protection Agency (EPA) of the United States has limited the maximum daily eight-hour average (MDA8) concentration of ozone to 70 ppb. Based on scientific evidence about the effects of ozone on public health and welfare, EPA tightened the ground-level ozone standard to 0.070 ppm, averaged over an 8-h period. This standard is met at an air quality monitor when the 3-year average of the annual fourth-highest daily maximum 8-h average ozone concentration is less than or equal to 0.070 ppm (R. 1 US EPA, 2022). CMAQ is a popular open-source numerical modeling system for air quality estimation and forecasting. It combines current knowledge in atmospheric chemistry, physics, and air quality modeling with parallel computing acceleration techniques to simultaneously model multiple air pollutants, including ozone, particulate matter, and a variety of air toxics that endanger lives on Earth. CMAQ can simulate atmospheric components and help atmospheric scientists evaluate the current situation and trends to find optimal air quality management strategies suitable at different scales for communities, regions, and states. It also provides users with data on air pollutant concentrations in the given area for specified emissions within various climate contexts like wind speed, temperature, pressure, precipitation, humidity, etc. CMAQ is a product of the Center for Environmental Measurement and Modeling (CEMM) under the U.S. Environmental Protection Agency (EPA) Office of Research and Development (ORD) , and is used by a wide range of scientists, consultants, and researchers spanning the globe. Several U.S. government agencies have already adopted it, including the National Weather Service, the Center for Disease Control, and the EPA Office of Air Quality Planning and Standards (O. US EPA, 2016). CMAQ has served as a primary dynamical model in regional air pollution studies. However, CMAQ does have several limitations which are shared by most numerical models, like the high dependency on the initial conditions, physical approximations, etc. These potential error sources could

Artificial Intelligence in Earth Science https://doi.org/10.1016/B978-0-323-91737-7.00002-5

247

Copyright # 2023 Elsevier Inc. All rights reserved.

248

9. AI for improving ozone forecasting

spread uncertainties that lead to significant biases from real-world ozone concentrations (Sayeed et al., 2021). The objective of this chapter is to use a machine learning (ML) model to calibrate the CMAQ model prediction capability and bring them closer to more accurate observations of ozone concentration, with observed data by EPA AirNow as the source of truth or target. We aim to use ML to overcome the systematic biases brought by the settings of the numerical models and address the complicated uncertainties by directly adjusting the result numbers rather than adjusting CMAQ itself. The XGBoost model discussed in this chapter is utilized to attempt and enhance the CMAQ modeling results by taking advantage of the XGBoost computationally efficient optimized distributed gradient boosting approach, along with its gradient descent algorithm that recognizes uncertainties resulting from simplified physics and chemistry (e.g., parameterizations) of the CMAQ model (Sayeed et al., 2021). This approach aims to use the best of both numerical modeling and ML to design a robust and stable algorithm that more accurately forecasts hourly ozone concentrations in longer durations that can cover a larger spatial area.

1.1 What you will learn in this chapter • • • • •

Collecting simulated O3 concentrations from CMAQ results Collecting NO2, O3 and meteorological data from satellite observations Merging multiple data sources into one file Preprocessing data into ML-ready format for Machine Learning Training and Predicting XGBoost to calibrate the CMAQ O3 predictions

1.2 Prerequisites This chapter requires a few python packages to collect data, preprocess into ML-ready format, train an XGBoost model, and visualize the results. The following table lists the required packages: Package

Version

pandas

> ¼ 1.1.3

numpy

> ¼ 1.19.5

matplotlib

> ¼ 3.4.3

seaborn

> ¼ 0.11.2

earthengine-api

> ¼ 0.1.272

xgboost

> ¼ 1.5.2

scikit-learn

> ¼ 1.0

scipy

> ¼ 1.7.1

pyairnow

> ¼ 1.1.0

requests

> ¼ 2.25.1

2 Background

249

Python 3.8 or newer versions are required to have the examples successfully execute. Among the packages, the installation of “earthengine-api” developed by Google requires additional steps to set up on a local system due to authentication requirements to communicate with the Google Earth Engine API. To ensure a proper installation, please follow the outlined steps here (https://developers.google.com/earth-engine/guides/python_install).

2 Background Surface ozone concentration poses serious risks to human and animal health as it irritates airways and causes breathing difficulties and other health issues (Sun et al., 2021; Onan et al., 2016). Studies also found a relationship between surface ozone concentration and vegetation damage as it is absorbed by leaves and hinders plant growth (Felzer et al., 2007; de Graaf et al., 2016). Forecasting surface ozone concentration is critical to prepare in avoiding or minimizing hazardous effects on human and plant health by alerting residents and decision-makers in advance to develop mitigation strategies for sustainable development. The formation of ozone in the atmosphere requires sunlight and some airborne chemicals, such as CO2, NO2, etc., which are massively generated through human activities on the earth’s surface. The correlation between the atmospheric chemicals is complex and nonlinear in nature as the formation process involves weather conditions, precursor emissions, and photochemical reactivities. Therefore, these precursor airborne chemicals can be used to predict surface ozone concentration in a certain location which could help to develop mitigating strategies and minimize the negative impact of the pollutant. Several studies have developed statistical models to predict ozone concentration (Sun et al., 2021; Lightstone et al., 2017; AgudeloCastaneda et al., 2014; Talebi et al., 2021); however, the results are not satisfactory, especially in predicting extreme air pollutant concentrations (Chaloulakou et al., 2003). The widely used models for ozone prediction are typically numeric models like Community Multiscale Air Quality Modeling System (CMAQ, http://air.csiss.gmu.edu), which USEPA uses for studying air pollution from local to hemispheric scales (Abdul-Wahab et al., 1996; Rasmussen et al., 2012). As the introduction mentions, CMAQ (as shown in Fig. 1) is an open community model representing the fundamental atmospheric processes influencing air components like oxygen, ozone, nitrogen oxide, carbon oxide, etc. CMAQ can model hundreds of pollutants and greenhouse gases but can be uncertain in its predictions due to model assumptions and scarcity of observations. Many emission sources are prominent and can be easily accounted for in the model, but many other sources are challenging to be directly observed, such as livestock, agriculture field burning, camping wood burn, barbeques, etc. Conventional strategies like adding new monitoring equipment for human-induced emissions or real-time calibration are still too expensive. Scientists are actively looking for alternative solutions to close the gap and improve accuracy. In recent years, ML models including Random Forest, Artificial Neural Network (Chaloulakou et al., 2003; Aljanabi et al., 2020; Spellman, 1999), Support Vector Machine (Aljanabi et al., 2020), and XGBoost (Aljanabi et al., 2020) have been often experimented with to predict ozone concentration. Several studies used Deep Learning methods such as LSTM

250

9. AI for improving ozone forecasting

FIG. 1 Ozone forecast with a 12  12 km resolution by GMU-CMAQ on Oct 7, 2021 (circles are AirNow stations).

(Yafouz et al., 2021), CNN-LSTM (Pak et al., 2018), and CMAQ-CNN (Sayeed et al., 2021) for prediction where they incorporated CMAQ parameters in the model, to improve the prediction accuracy. This chapter attempts to improve the results of predicting O3 concentration levels by using inputs from the simulated CMAQ data (NO2, CO, PM2.5, and others) alongside a voting ensemble approach for the XGBoost model. With the development of remote sensing, satelliteborne observations have been adopted to observe ozone levels from space. Earth observation projects such as OMI (Ozone Monitoring Instrument) and TROPOMI (TROPOspheric Monitoring Instrument) have been operating since (2004 for OMI and 2017 for TROPOMI) and have captured long-term ozone records from space. OMI products by the NASA LANCE project can provide OMI total column atmospheric variables at about 13 km  24 km spatial resolution at the pixels near track change and 13 km  126 km at the pixels near track edge with a revisiting period of 24 h (de Graaf et al., 2016). The AirNow ground dataset (source of truth) is used in this chapter to test the CMAQ and ML models’ performance. AirNow is a collaboration project of the U.S. Environmental Protection Agency, National Oceanic and Atmospheric Administration (NOAA), National Park Service (NPS), NASA, Centers for Disease Control, and many tribal, state, and local air quality agencies. AirNow reports air quality on ground level following the official U.S. Air Quality Index (AQI), a color-coded index designed to communicate whether air quality is healthy or unhealthy for the local society.

3 Data collection

251

3 Data collection 3.1 AirNow O3 concentration AirNow observation network is able to deliver hourly ground-level O3 concentration. Most data arrives by half-past the previous hour and are quality assured via a series of data quality checking processes that then is released by the end of the hour. The EPA provides easy-to-use tools to access historical and real-time data. For example, the daily O3 measurement data can be retrieved by querying the AirNow API via a python client (https://github.com/ asymworks/pyairnow). The AirNow API requires a key that can be set up through this link (https://docs.airnowapi.org/account/request/). The GitHub repo for “pyairnow” has concise examples for using the library. The only thing this chapter does differently from the examples shown on the GitHub repo is passing the AQI levels returned from the EPA AirNow API to a function (provided by the library) to calculate the desired concentrate given the AQI value in the same area. The code below will get us the O3, CO, and NO2 values for the duration specified: from pyairnow.conv import aqi_to_concentration import pandas import requests import json # Latitude and Longitude location in Pima, AZ. lat = '32.047501' lon = '-110.773903' # Add the API Key generated from airnowapi.org API_KEY = '' # distance is used when no reporting area is associated with the latitude/longitude provided. # distance looks for an observation from a nearby reporting area within this distance (in miles). distance = '50' # Construct a list of date strings, each corresponding to a single day. date_range_str = [date.strftime('%Y-%m-%dT00-0000') for date in pandas.date_range ('2017-01-01','2017-02-28')] # Create empty lists to save the data returned from the AirNow API. O3_data = [] CO_data = [] NO2_data = [] # Loop through each date to initiate an API call for the duration specified above. for dateStr in date_range_str: # Construct URL to contact airnowapi.org and get the AQI (Air Quality Index) for our desired location.

252

9. AI for improving ozone forecasting

url = f'https://www.airnowapi.org/aq/observation/latLong/historical/?format= application/json&latitude={lat}&longitude={lon}&date={dateStr}&distance= {distance}&API_KEY={API_KEY}’ # Initiate the request to the API. res = requests.get(url) # Extract the AQI level for our location. aqiData = json.loads(res.content)[0]['AQI'] # aqi_to_concentration function can now be passed the AQI observations # and returns the corresponding concentration level in that area for the day for each pollutant. O3_data.append(aqi_to_concentration(aqiData, 'O3')) CO_data.append(aqi_to_concentration(aqiData, 'CO')) NO2_data.append(aqi_to_concentration(aqiData, 'NO2')) pandas.DataFrame({"Date”: date_range_str, "Lat”: lat, "Lon”: lon, "AirNow_O3”: O3_data, "AirNow_CO”: CO_data, "AirNow_NO2”: NO2_data}).to_csv("airnow_data. csv", index=False)

With the code above, we retrieved the AirNow O3 levels for the station located in Pima, Arizona. Now that we have our source of truth data to use as the target input in our XGBoost model, we now can collect our inputs (predictors) for our model to train on. 3.1.1 TROPOMI O3 The TROPOspheric Monitoring Instrument (TROPOMI) provides a daily high resolution of 3.5  5.5 km for atmospheric pollutants, including an atmospheric vertical column of O3 per mol/m2 through its L3 data products. This chapter utilizes the remotely sensed atmospheric O3 column to train our model as an additional feature for prediction. To collect TROPOMI data, we need to use the Google Earth Engine Python library to access its API. A location is specified to get a daily TROPOMI image for a fixed duration, in this case, one month. The code below will import the necessary packages, collect the data, average the TRPOPOMI O3 data into daily mean values, and save the data into a CSV file named “tropomi_o3.csv”. import json import ee as googleEarth def intialize_google_earth(): try: googleEarth.Initialize() except Exception as e: googleEarth.Authenticate() gee.Initialize()

3 Data collection

253

intialize_google_earth() # Create a point representing our study site in Pima, AZ site_location = googleEarth.Geometry.Point(-110.773903, 32.047501,).buffer(500) # (Long, Lat) # Get TROPOMI O3 Image Collection from Google Earth Engine collection_o3_tropomi = googleEarth.ImageCollection("COPERNICUS/S5P/OFFL/L3_O3"). filterDate('2019-01-01','2019-02-28') def study_site_ozone_average(imageCollection): # This function will average all of the data from all of the points in the area we defined in "site location". average = imageCollection.reduceRegion(reducer=googleEarth.Reducer.mean(), geometry=site_location, scale=250).get('O3_column_number_density') return imageCollection.set('date', imageCollection.date().format()).set('average', average) # Map study_site_ozone_average to our ImageCollection site_location_reduced_imgs = collection_o3_tropomi.map(study_site_ozone_average) reduced_values = site_location_reduced_imgs.reduceColumns(googleEarth.Reducer. toList(2), ['date','average']).values().get(0) # The data is obtained from Google Earth Engine using the callback function "getInfo". tropomiData = pandas.DataFrame(reduced_values.getInfo(), columns= ['Date','tropomi_O3_mean']) # Make "Date" column a DateTime object tropomiData['Date'] = pandas.to_datetime(tropomiData['Date']) tropomiData = tropomiData.set_index('Date') # Convert tropomi_O3_mean to ppbV (parts-per-billion-volume) tropomiData['tropomi_O3.ppbV'] = (tropomiData['tropomi_O3_mean']/3)*224 tropomiData.drop('tropomi_O3_mean', inplace=True, axis=1) # Save data to a CSV file tropomiData.to_csv('tropomi_O3.csv')

The tropomiCollection variable contains all the images, with only the O3_column_number_density feature being selected within the specified date range. A file labeled “tropomi_O3.csv” will be saved locally to the machine and is ready to be merged later in the following steps.

254

9. AI for improving ozone forecasting

3.2 CMAQ simulation data This chapter acquired CMAQ model simulation data from the George Mason University (GMU) air quality group, which includes a 24-h forecast of O3 concentrations with an initial simulation time of 18 UTC the day before, and a forecast time starting from 12 UTC to the next day at 11 UTC. NO2 emissions, meteorological input data, including surface pressure, Planetary Boundary Layer (PBL) height, temperature, wind speed, wind direction, convective velocity scale, solar radiation reaching ground float, nonconvective precipitation, convective precipitation, and total cloud fraction are included as inputs. GMU-CMAQ simulated the 24-h forecast O3 concentrations on an operational basis (Li et al., 2021) for 2020 and 2021 using meteorology data from the Weather Research and Forecasting (WRF) model version 4.2 (Skamarock et al., 2019). CMAQ data can be collected from the official source (https://www.epa.gov/cmaq/ forms/cmaq-data). The current publicly accessible CMAQ data is hosted on a Google Drive folder but only goes up to 2017 as the latest available data. The specific folder on Google Drive with the CSV can be found here (https://drive.google.com/drive/folders/17deTtdTv– 61MgVQ5UE0BCKZW2Gy0Lur). The available files in the Google Drive folder are broken down by month and contain daily averages of O3, CO, NO, NO2, and others. For the purpose of this chapter, the files labeled “Daily_EQUATES_CMAQv532_cb6r3_ae7_aq_STAGE_12US1_201701.csv” and “Daily_EQUATES_CMAQv532_cb6r3_ae7_aq_STAGE_12US1_201702.csv” in the Google drive linked above were downloaded, cleaned, merged, and renamed to “cmaq_2017_Jan_Feb.csv” for easier access. The ready CSV file containing 2017 CMAQ simulated data can be found here: and https://media.githubusercontent.com/media/earth-artificial-intelligence/earth_ai_ book_materials/main/chapter_09/cmaq_2017_Jan_Feb.csv. To get the latest CMAQ data, you can use the source code (https://www.epa.gov/cmaq/ access-cmaq-source-code) and compile the model to generate new data for the desired duration. Most of the source code is written in Fortran, so users would need some level of understanding about the language and the Linux environment before starting.

4 Dataset preparation The “airnow_data.csv,” and “cmaq_2017_Jan_Feb.csv” retrieved from the previous steps are already preprocessed, and some minor additional preprocessing steps are done in this section. Now that we have all the CSV files for each data source downloaded to our local system, we prepare the data by merging the final datasets for model training. The first step is to read all files and merge them all together. Some additional steps, such as date formatting, rescaling, and dropping unnecessary columns, happen here too. This will help in having a cleaner dataset to use for model training. # Read Airnow data extracted from AirnowAPI and saved to local system. airnow = pd.read_csv('airnow_data.csv', parse_dates=["Date"]) # Scale data to match predictors scale. airnow.AirNow_O3 = airnow.AirNow_O3 * 1000

255

5 Machine learning

airnow.AirNow_CO = airnow.AirNow_CO * 10 # Reformat date string. airnow["Date"] = airnow["Date"].dt.strftime('%Y%m%d') # Read Tropomi data extracted from Google Earth Engine and saved to local system. tropomi = pd.read_csv('tropomi_O3.csv', parse_dates=["Date"]) tropomi["Date"] = tropomi["Date"].dt.strftime('%Y%m%d') tropomiDaily = tropomi.groupby("Date").max() # Make date match CMAQ and Airnow retrieved data. This is done only becuase TROPOMI doesn't have data before 2018, so making the date the same as the other sources is required for merging later. # If the data retrieved for CMAQ and Airnow is after 2018, then this below line can be deleted. tropomiDaily["Date"] = pd.date_range("20170101", "20170227").strftime('%Y%m%d') tropomiDaily.reset_index(drop=True, inplace=True) # Read downloaded CMAQ data. cmaq = pd.read_csv('cmaq_2017_Jan_Feb.csv', parse_dates=["date"]) # Drop unnecessary columns. cmaq.drop(["column",

"row",

"Lambert_X",

"LAMBERT_Y",

"Unnamed:

0"],

axis=1,

inplace=True) # Reformat date string. cmaq['Date'] = cmaq["date"].dt.strftime('%Y%m%d') # Merge all data frames together. final = airnow.merge(tropomiDaily, on="Date").merge(cmaq, on="Date") # Drop any duplicated rows. final = final.drop_duplicates("Date")

When the above code execution finishes successfully, it should generate a data frame with 24 features for every day between January 2017 to February 2017 containing AirNow, Tropomi, and CMAQ data. Next, we can start to prepare the dataset for training the XGBoost model.

5 Machine learning 5.1 Extreme gradient boosting model Extreme Gradient Boosting (XGBoost) is an ensemble of multiple tree–based regressors introduced by (Chen and Guestrin, 2016) to reduce computing time and scale-up tree boosting by creating a series of decision trees to learn from many sequentially connected weak learners, where new tree models are added to correct the errors made by the existing tree models to boost the overall performance.

256

9. AI for improving ozone forecasting

XGBoost learns from training data xij (i.e., input parameters from CMAQ) and predicts yᵢ (i.e., the concentration of surface ozone, AirNow O3 values), where the prediction score of each tree is lumped together to get the final score, which is evaluated through N additive functions for each tree structure to predict the output. We use XGBoost with a Voting Ensemble method that creates multiple models and combines them to produce an accurate estimation. The Voting ensemble method combines prediction results from multiple models based on the average of the predictions from each model. A significant limitation of the voting ensemble is that it treats all models equally. Therefore, if some models perform better than others, averaging them in the final model would reduce the individual effect. However, the accuracy of the voting ensemble-based ML models is still better than the standalone models. To start modeling the data using these two methods, we need to set up our data by splitting our data frame to get the first 31 days that correspond to the month of January as our training set: # Take the first 31 rows corresponding to the month of January as the training input. X = final[:31].drop(['Date', 'AirNow_O3', 'Lat', 'Lon', 'latitude', 'longitude', 'date', 'O3_MDA8', 'O3_AVG'],axis=1) # Take the first 31 rows corresponding to the month of January as the training target. y = final[:31]['AirNow_O3']

After setting up our data, we set up the multiple variations of an XGBoost model for the voting ensemble to test out with each variation having a different value for max_depth, which is the length of the longest path from a root to a leaf: # make a prediction with a voting ensemble. from xgboost.sklearn import XGBRegressor from sklearn.ensemble import VotingRegressor # define the base models. models = list() models.append(('XGB1', XGBRegressor(max_depth=1))) models.append(('XGB2', XGBRegressor(max_depth=2))) models.append(('XGB3', XGBRegressor(max_depth=3))) models.append(('XGB4', XGBRegressor(max_depth=4))) models.append(('XGB5', XGBRegressor(max_depth=5))) models.append(('XGB6', XGBRegressor(max_depth=6))) # define the voting ensemble. ensemble = VotingRegressor(estimators=models) # fit the model on all available data. ensemble.fit(X, y)

Each “XGB” corresponds to a successive use of the XGBRegressor with increments for the max depth hyperparameter to see the most optimal value for our data. The data is fitted to the VotingRegressor to test all the different XGBoost models with our data.

5 Machine learning

257

It is now ready to use the ensemble model, we trained to predict our testing set. A testing set is created by taking the remaining rows after the month of January in our data frame, which is indexed and ordered by the day’s number, in this case, all days after the 31st: import math import sklearn # Get the remaining rows after row 31 corresponding to the month of February as the testing set input. X = final[31:].drop(['Date', 'AirNow_O3', 'Lat', 'Lon', 'latitude', 'longitude', 'date', 'O3_MDA8', 'O3_AVG'],axis=1) # Get the remaining rows after row 31 corresponding to the month of February as the testing set target. y = final[31:]['AirNow_O3'] pred = ensemble.predict(X) mse = sklearn.metrics.mean_squared_error(pred, y) rmse = math.sqrt(mse) print(mse, rmse)

OUTPUT: 6.36975108482667 2.523836580451807

We now have all the predictions in “pred” and have an idea of the model’s accuracy. In this case, the model predicted well with an MSE of 6.3 and an RMSE of 2.5 (lower is better) due to the dataset’s small size and short duration which indicates overfitting. A longer duration of data from the data sources used in this chapter will have a higher likelihood of a more generalized model. After finding out our model’s performance, we use the predictions to visualize its performance. To plot the predictions against the observed AirNow O3 values. We use the code below: # Get the equivalent rows of our test set from the main dataframe for plotting. dataset = final.iloc[31:].copy() # Add the prediction result to the new dataframe. dataset['prediction'] = pred.tolist() import matplotlib.pyplot as plt plt.rc('font', size=12) fig, ax = plt.subplots(figsize=(20, 13)) # Specify how our lines should look. ax.plot(dataset.Date, pred.tolist(), color='tab:orange', label='Prediction') # Use linestyle keyword to style our plot. ax.plot(dataset.Date, dataset.AirNow_O3, color='green', linestyle='–', label='AirNow')

258

9. AI for improving ozone forecasting

ax.plot(dataset.Date, dataset.O3_AVG, color='blue', linestyle='–', label='CMAQ_O3') # Redisplay the legend to show our new wind gust line. ax.legend(loc='upper left') # Same as above. ax.set_xlabel('Time') plt.xticks(rotation=45) ax.set_ylabel('Values') ax.set_title('Compare Observed, Prediction, CMAQ Simulation') ax.grid(True) ax.legend(loc='upper left');

Fig. 2 depicts an overfitting model. The short duration of the AirNow observations and the CMAQ simulated data, as mentioned previously, collected in this chapter is most likely responsible for the overfitting problem. Both of these data sources were collected during the same season of the year when there is a slight variation and a similar pattern in values

FIG. 2 A plot of predicted AirNow O3 (orange) against observed AirNow O3 (green) along with CMAQ O3 (blue) for the month of February (test set).

5 Machine learning

259

FIG. 3 A plot of predicted AirNow O3 (orange) against observed AirNow O3 (green) along with CMAQ O3 (blue) for the month of February (test set).

between January and February, which could contribute to the model overfitting rather than generalizing. A more slimmed-down version of Fig. 2 can be plotted using the below code. The size of this plot can be useful if multiple months were to be predicted, and all plots must be stacked without taking up too much space (Fig. 3). # Get the equivalent rows of our test set from the main dataframe for plotting dataset = final.iloc[31:].copy() # Add the prediction result to the new dataframe dataset['prediction'] = pred.tolist() # time series plot import matplotlib.pyplot as plt plt.rc('font', size=15) fig, ax = plt.subplots(figsize=(25, 4)) # Specify how our lines should look ax.plot(dataset.Date, dataset.prediction, color='tab:red', label='Prediction') ax.plot(dataset.Date, dataset.O3_AVG, color='tab:blue', label='CMAQ') # Use linestyle keyword to style our plot ax.plot(dataset.Date, dataset.AirNow_O3, color='tab:green', label='AirNow',linestyle='–') ax.set_xlabel('Date',size=20) ax.set_ylabel('O3 (ppbv)',size=20) ax.set_title('February',fontsize=25) ax.grid(True) plt.xticks(rotation=45) ax.legend(loc='upper left');

260

9. AI for improving ozone forecasting

To get a better idea of each model’s RMSE distribution, a box plot is plotted for each XGBoost model in the voting ensemble. To begin plotting the boxplot, we must first define a cross-validation scoring function that will score each XGBoost model with different max_depth values in our previously defined voting ensemble. To begin, a function is defined called “evaluate_model,” which will take a tuple containing each model name (e.g., “XGB3”) and each model object to be passed to sklearn’s “cross_val_score” function with a total K-fold of 10, each repeated three times. This function is created as shown in the code below: from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedKFold from numpy import mean from numpy import std # evaluate a given model or set of models using cross-validation def evaluate_model(model, X, y): cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1) scores = cross_val_score(model, X, y, scoring='neg_root_mean_squared_error', cv=cv, n_jobs=-1, error_score='raise') return scores

Then use the following code with a test set (unseen data) to evaluate and store all the RMSE scores returned for each model: # Get the remaining rows after row 31 corresponding to the month of February as the testing set input. X = final[31:].drop(['Date', 'AirNow_O3', 'Lat', 'Lon', 'latitude', 'longitude', 'date', 'O3_MDA8', 'O3_AVG'],axis=1) # Get the remaining rows after row 31 corresponding to the month of February as the testing set target. y = final[31:]['AirNow_O3'] # evaluate the models and store results results, names = list(), list() for name, model in models: scores = evaluate_model(model, X, y) results.append(scores) names.append(name) print('>%s %.3f (%.3f )' % (name, mean(scores), std(scores)))

OUTPUT: >XGB1 -0.919 (0.385) >XGB2 -0.894 (0.414) >XGB3 -0.915 (0.398) >XGB4 -0.903 (0.391) >XGB5 -0.896 (0.399) >XGB6 -0.897 (0.406)

261

5 Machine learning

Voting–XGBoost –0.25 –0.50

neg_RMSE

–0.75 –1.00 –1.25 –1.50 –1.75

XGB1

XGB2

XGB3

XGB4

XGB5

XGB6

XGBoost models

FIG. 4 A Box plot of each XGBoost model negative root mean squared scores for the month of February (test set).

The output above displays the mean negated root mean squared error and mean standard deviation for each of the different max_depth XGBoost models. The higher the score, the better in the case of the negated root mean squared error, and since the score is shown as negative, we want to maximize the negated score to a negative one. Each model was very close, indicating good performance, as seen in the previous visualizations. A box plot is created to compare each model to visualize and understand the distribution of each fold for these averaged scores. The following code will generate the mentioned plot (Fig. 4): import matplotlib.pyplot as plt # plot model performance for comparison plt.boxplot(results, labels=names, showmeans=True) plt.suptitle("Voting-XGBoost", size=16) plt.xlabel("XGBoost models") plt.ylabel("neg_RMSE") plt.grid() plt.show()

The plot above shows that each model averaged around 0.90, with maximum and minimum values ranging from 0.25 to 1.90. This means that the different max_depth values had little effect on the majority of these models. A larger dataset with a longer duration will almost certainly yield more varied results for each XGBoost variant.

5.2 Accuracy assessment This chapter used traditional statistical accuracy assessment metrics such as the root mean squared error (RMSE) and mean squared error (MSE) for model evaluation. The MSE measures the average of the squared difference between predicted and actual values.

262

9. AI for improving ozone forecasting

MSE ¼

n  2 1X bi Yi  Y n i¼1

(1)

b i are the actual values and b Here, Y yi are the predicted values. The Root mean squared error is a measure of the error rate by the square root of MSE. RMSE is insensitive to the direction of errors, and lower values are better. ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi s  n   1 X b RMSE ¼ (2) Yi  Yi n i¼1 P b i is observed O3 values, b Here, b yi is the predicted O3 values, Y n is number of items, and c is the summation notation. RMSE is the square root of MSE and measures the standard deviation of the residuals. The reason for choosing RMSE and MSE over other metrics, such as mean average error (MAE) or mean percentage error (MPE), is because it is easier to observe the errors in a simple intuitive way and can be directly compared to the original ozone observation values from EPA AirNow. We used the same metrics during training and testing to keep the model consistent across all the datasets. RMSE is also the recommended accuracy metric to train an XGBoost model for calculating training and validation accuracy.

5.3 Comparison with other ML models While here in this chapter, we mainly focused on describing the XGBoost method, we also evaluated alternative ML models based on the 2020 training dataset available at GMU. We used tree-regressor models: Decision Tree (DT) and Random Forest (RF), in addition to XGBoost with a voting ensemble method as discussed in this chapter. A ten-fold crossvalidation method is used to ensure the accuracy of each model is noncoincidental during the training period. After training the model, we applied it to the test dataset for the year2021 for prediction. Fig. 5 includes Avg-Max-Min box charts showing the regression RMSE of each model on the testing data. The X-axis means the cloned model runs with different hyperparameter configurations, and the Y-axis represents regression RMSE. This experiment used six clone models as the ensemble candidates for the voting by averaging all predicted values. The unsuitable fitted (either overfitting or underfitting) isolated models will be suppressed, and better-fitted models that agree well with others are combined using a voting classifier. In Fig. 5, each column has one box and two lines indicating the mean value and the training and testing RMSE. The green dot in the box denotes the mean RMSE. The short top line is the testing RMSE, while the short bottom line means training RMSE. It can be seen that the voting ensemble-based DT finally achieved 9.707 RMSE on average during the training and 11.3 RMSE during the testing. The voting ensemble-based RF model achieved 9.55 RMSE during the training period and 11.01 RMSE during the testing period. The XGBoost model achieved 6.55 RMSE during the training period and 8.74 during the testing period. Fig. 5 also displays that voting ensemble-based ML models generally perform better than standalone ML models. Four models are less accurate, while only two models are more accurate than the voting

263

5 Machine learning

Voting-Random Forest

–9.0

–8.5

–9.5

–9.0 –9.5

–10.0

neg_RMSE

neg_RMSE

Voting-Decision Tree

–10.5 –11.0 –11.5

–10.0 –10.5 –11.0 –11.5

–12.0

–12.0 DT1

DT2

DT3

DT4 DT5 DT models

DT6

voting

RF1

RF2

RF3 RF4 RF5 Random Forest models

RF6

voting

Voting-XGBoost

–6.0

neg_RMSE

–6.5 –7.0 –7.5 –8.0 –8.5 XGB1

XGB2

XGB3 XGB4 XGB5 XGBoost models

XGB6

voting

FIG. 5 Voting regressor with three-member models: (A) Decision Tree; (B) Random Forest; (C) XGBoost.

ensemble models. Based on the RMSE assessment, we selected a voting ensemble–based XGBoost model as the best model, explored it in further analysis, and compared it with the cutting-edge time-series-prediction recurrent neural network models such as Long-term and Long-term Short-term memory (LSTM). Based on the preliminary results, we proved that regular ML models could regularly improve CMAQ ozone accuracy. We want to stretch the approach further to explore the ceiling capacity of the cutting-edge ML models in this case. Therefore, in the second-stage model comparison, we used the same dataset and calculated the accuracy of the voting ensemble–based XGBoost model, Tabnet, AutoML, and LSTM. In this phase, we first split the 2020–2021 dataset of all 28 sites into training (2020) and testing (2021) datasets. The training data is further split by 80–20 for deep learning to conduct validation in its learning iteration. All models were trained by the same training dataset and applied to the testing dataset for assessment. Before training ML models, we first calculated the RMSE of the original CMAQ against the AirNow samples as our benchmark. On average, the original CMAQ can achieve 12.78 RMSE and 158.2 MSE, indicating the CMAQ model’s limitation in predicting O3 concentration. As for the ML forecasting models, the accuracy is impressive,

264

9. AI for improving ozone forecasting

TABLE 1

Accuracy metrics.

Models

test-RMSE

test-MSE

CMAQ

12.78

158.2

CMAQ-tabnet

8.83

78.1

CMAQ-AutoML

8.92

79.6

CMAQ-Voting-XGBoost

8.74

76.5

CMAQ-LSTM

8.66

75.1

FIG. 6 Comparison of AirNow, CMAQ, and CMAQ-LSTM on the 2021 testing dataset.

and we look for other ML models to improve the results further. We have run the ML models mentioned above on three Linux servers with 48 cores Intel Xeon Silver 4116 CPU, 128 GB memory, and 4 NVIDIA K80 GPUs. All the results are listed in Table 1. It can be found that the voting ensemble-based XGBoost model can gain an 8.74 RMSE on the testing dataset, whereas the LSTM model can achieve a 8.66 RMSE. Both models perform almost similarly in prediction; however, LSTM performs slightly better than the voting XGBoost model. The following graph shows the fitting curve of the LSTM model with ground truth AirNow and CMAQ-O3 data. Fig. 6 shows the results of LSTM for September 2021. It strongly reflects that LSTM can accurately predict extreme AirNow values better than the original CMAQ model (Fig. 6). .

6 ML workflow management Geoweaver is an open-source workflow management software that aims to boost research productivity. It can let users compose data processing scripts and link them into integrated and managed workflows (Sun et al., 2020). Please check out the Geoweaver GitHub repository (https://github.com/ESIPFed/Geoweaver) for guidance on installation and usage. In this chapter, we created all the steps as processes and linked them into a workflow in Geoweaver, as shown in Fig. 7. All the source code are added inside the circle nodes. The connections

7 Discussion

265

FIG. 7 CMAQ workflow on Geoweaver.

among the nodes specify the order the processes are executed. Each process is isolated and executed separately, which will help users to focus and identify problems where it happens. Geoweaver can greatly help maintain code files throughout the changes applied and keeps a history of all the output generated from each process. Setting up a workflow in Geoweaver means that the complete code base can quickly be executed on any local or remote machine with a proper environment. It allows the team to collaborate seamlessly on a single project, prevents miscommunications and file losses, and automates the tracing and merging history for all processes on separate user’s machines.

7 Discussion This section examines the results in the previous section to discuss the feasibility and flexibility of ML in generally similar applications—using ML to improve numerical model results. We will focus on three significant aspects: accuracy improvement, the stability of ML models in operational scenarios, and the reusability of the ML models when transplanting in unknown or distant locations and for different time periods.

7.1 Accuracy improvement It is undoubtedly a complex process to figure out the real causes of the low accuracy of some numeric model predictions. There are many input variables and uncertain impact

266

9. AI for improving ozone forecasting

factors within the forcing files about the study region. Numeric models have accumulative error problems, meaning the errors will accumulate over time and become evident after a while. The model maintainers must constantly calibrate and adjust the models based on the newly observed data. A common practice is to use the new data to replace the erroneous initial condition file for each prediction cycle. However, due to the ground/aerial observation capacity limitation, some of the variables won’t always be accurate or frequently updated. The inaccuracy will be reflected in the results. Another conventional practice is to run multiple models with different configurations simultaneously and choose the best one or average all model results as consensus results. That would either involve spatiotemporal biases or diminish the performance of the best model. This chapter demonstrated that ML models are able to systematically and significantly improve the accuracy of ozone prediction. For example, it can be seen in Fig. 6 that CMAQ curves are higher at the valley than AirNow, and ML predictions are much closer to the AirNow data. In July and August, the CMAQ results generally underestimate the peak values, while the CMAQ-LSTM manages to increase the values to the middle points between CMAQ and AirNow. All three curves have very similar patterns proving that ML did produce better results than CMAQ. In the future, we will explore more models and methods to further advance the improvement of ozone predictions.

7.2 Stability and reliability CMAQ is a numeric model that can be operational in producing stable ozone predictions. The model runs hourly simulations every day and produces reliable forecasting on air quality dynamics. The ML model is expected to be run simultaneously at an hourly frequency to transform the CMAQ results into a product with higher accuracy in capturing the ozone peaks. Some ML models such as Voting Ensemble and TabNet, can apply to a large extent with minor changes in the performance. LSTM requires time-series input and more contextual information, so its performance might vary across sites and regions. The CMAQ-LSTM results are consistently better than CMAQ from January to September. The improvements are steadily occurring at the peaks and valleys and become most obvious in July and August when CMAQ underestimates a rather significant portion.

8 Conclusion This chapter demonstrates that the ML model, together with remote sensing and the CMAQ mechanistic model data, can improve the accuracy of O3 prediction. For our demonstration, we used the EPA AirNow data from a single station in Arizona as the training labels and extracted a time series from the CMAQ model run by the GMU team in the same region, along with data. This work has provided an essential reference for future mixed research of ML, numeric models, and remote sensing to further advance the capability of simulating ozone concentration while significantly reducing the cost of ground observation collection.

11 Lessons learned

267

9 Assignment After copying the source code and running it, try making experimental changes. For example • Increase the duration of data from a month to a year or multiple years. This would significantly increase the data collection step but let the model train on multiple patterns found through a few years, making a more robust prediction. • Collect data for multiple stations in other locations and merge all the data to expand the training dataset.  This can be useful for understanding other factors that could affect O3 levels in different regions. – Some regions might produce much more O3 levels. – Other regions might have the stations be much closer to urban areas, adding noise to the simulated CMAQ data we collect in this chapter. – Finding the best region or set of stations that would complement the goal of this chapter would require trial and error to see which region has the least amount of noise in its vicinity when collecting data.

10 Open questions The achieved accuracy is good but still far from being impressive. The benefit of AI/ML techniques is that there is a lot of room for future improvements. Here are some potential opportunities for future researchers to test: • Will traversing different combinations (rather than all at once) of input variables and different tuning of hyperparameters improve the accuracy? • Will more data sources, such as new satellite observations, allow AI/ML models to increase the spatiotemporal coverage to refine prediction? • Can additional secondary pollutants (PM2.5, NO2, CO) assist in increasing the model’s prediction accuracy due to correlations that exist among them? • How can the trained model in this chapter have better spatial generalization over unknown locations, and reused in other regions or countries? • How to deploy the trained model into cloud servers and provide operational services to stakeholders as an alternative information source to monitor air quality levels?

11 Lessons learned The code in this chapter has been modified in certain areas to enhance the overall model performance: • Collecting all the discussed data sources with a very long temporal range can result not just computationally demanding tasks but also incomplete data from a few of the sources.

268

9. AI for improving ozone forecasting

• TROPOMI data does not offer any data prior to 2018. This can limit a model that can be trained on a longer historical dataset. • Because weather conditions can have an impact on remote sensing data, the observation of some days with extreme weather will result in a very wrong prediction from the actual pollutant emission. • The tasks demonstrated in this chapter can be completed much more efficiently in a cloud environment by leveraging additional computing resources. This would also reduce the number of tasks that take a long time to complete, making variations of various models more attainable and less likely to fail in the middle of execution. • A more sophisticated machine learning or deep learning model suitable for temporal and spatial data might significantly improve and generate a more robust prediction with much better generalization for new places or time periods.

References Abdul-Wahab, S., Bouhamra, W., Ettouney, H., Sowerby, B., Crittenden, B., 1996. Predicting ozone levels. Environ. Sci. Pollut. Res. 3, 195–204. Agudelo-Castaneda, D., Calesso Teixeira, E., Norte Pereira, F., 2014. Time–series analysis of surface ozone and nitrogen oxides concentrations in an urban area at Brazil. Atmos. Pollut. Res. 5, 411–420. Aljanabi, M., Shkoukani, M., Hijjawi, M., 2020. Ground-level ozone prediction using machine learning techniques: a case study in Amman, Jordan. Int. J. Autom. Comput. 17, 667–677. Chaloulakou, A., Saisana, M., Spyrellis, N., 2003. Comparative assessment of neural networks and regression models for forecasting summertime ozone in Athens. Sci. Total Environ. 313, 1–13. Chen, T., Guestrin, C., 2016. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. de Graaf, M., Sihler, H., Tilstra, L., Stammes, P., 2016. How big is an OMI pixel? Atmos. Meas. Tech. 9, 3607–3618. Felzer, B., Cronin, T., Reilly, J., Melillo, J., Wang, X., 2007. Impacts of ozone on trees and crops. C. R. Geosci. 339, 784–798. Li, Y., Tong, D., Ma, S., Zhang, X., Kondragunta, S., Li, F., Saylor, R., 2021. Dominance of wildfires impact on air quality exceedances during the 2020 record-breaking wildfire season in the United States. Geophys. Res. Lett. 48. Lightstone, S., Moshary, F., Gross, B., 2017. Comparing CMAQ forecasts with a neural network forecast model for PM2.5 in New York. Atmos. 8, 161. O. US EPA, 2016. Frequent CMAQ Questions. https://www.epa.gov/cmaq/frequent-cmaq-questions. (Accessed 2 November 2022). Onan, A., Koruko glu, S., Bulut, H., 2016. A multiobjective weighted voting ensemble classifier based on differential evolution algorithm for text sentiment classification. Expert Syst. Appl. 62, 1–16. Pak, U., Kim, C., Ryu, U., Sok, K., Pak, S., 2018. A hybrid model based on convolutional neural networks and long short-term memory for ozone concentration prediction. Air Qual. Atmos. Health 11, 883–895. R. 1 US EPA, 2022. Eight-hour Average Ozone Concentrations j Ground-level Ozone j New England j US EPA. https://www3.epa.gov/region1/airquality/avg8hr.html. (Accessed 3 November 2022). Rasmussen, D., Fiore, A., Naik, V., Horowitz, L., McGinnis, S., Schultz, M., 2012. Surface ozone-temperature relationships in the eastern US: a monthly climatology for evaluating chemistry-climate models. Atmos. Environ. 47, 142–153. Sayeed, A., et al., 2021. A novel CMAQ-CNN hybrid model to forecast hourly surface-ozone concentrations 14 days in advance. Sci. Rep. 11 (1), 1. https://doi.org/10.1038/s41598-021-90446-6. Skamarock, W.C., Klemp, J.B., Dudhia, J., Gill, D.O., Liu, Z., Berner, J., Wang, W., Powers, J.G., Duda, M.G., Barker, D.M., Huang, X.-Y., 2019. A Description of the Advanced Research WRF Version 4. NCAR Tech. Note NCAR/TN556+STR., p. 145. Spellman, G., 1999. An application of artificial neural networks to the prediction of surface ozone concentrations in the United Kingdom. Appl. Geogr. 19, 123–136.

References

269

Sun, Z., Di, L., Burgess, A., Tullis, J.A., Magill, A.B., 2020. Geoweaver: advanced cyberinfrastructure for managing hybrid geoscientific AI workflows. ISPRS Int. J. Geo Inf. 9 (2), 119. https://doi.org/10.3390/ijgi9020119. Sun, X., Ivey, C., Baker, K., Nenes, A., Lareau, N., Holmes, H., 2021. Confronting uncertainties of simulated air pollution concentrations during persistent cold air pool events in the salt lake valley, Utah. Environ. Sci. Technol. 55, 15072–15081. Talebi, H., Peeters, L., Otto, A., Tolosana-Delgado, R., 2021. A truly spatial random forests algorithm for geoscience data analysis and modelling. Math. Geosci. 54, 1–22. Yafouz, A., Ahmed, A., Zaini, N., Sherif, M., Sefelnasr, A., El-Shafie, A., 2021. Hybrid deep learning model for ozone concentration prediction: comprehensive evaluation and comparison with various machine and deep learning algorithms. Eng. Appl. Comput. Fluid Mech. 15, 902–933.

This page intentionally left blank

C H A P T E R

10 AI for monitoring power plant emissions from space Ahmed Alnuaim (Alnaim)a,b and Ziheng Suna,b a

Department of Geography and Geoinformation Science, George Mason University, Fairfax, VA, United States bCenter for Spatial Information Science and Systems, George Mason University, Fairfax, VA, United States

1 Introduction This chapter employs machine learning (ML) to simulate nonlinear relationships and memorize hidden patterns between remote sensing data and EPA ground observations in order to estimate power plant emissions. Because of the coal/gas combustion, power plants are a major source of manufactured NO2. NO2 is a trace gas produced by both anthropogenic and natural sources. In the troposphere, sunlight can cause chain chemistry reactions that produce nitric oxide and ozone. As a result, NO2 is frequently used to reflect the concentration among the more prominent family of nitrogen oxides (NOx). NOx emissions contribute to poor air quality events like smog, haze, and even acid rain, as well as severe respiratory problems. Emissions can also cause algae overpopulation and contamination of water bodies and soil. Despite the fact that many power plants in the United States have implemented methods to extract and reduce NOx and CO emissions, they continue to be a major concern on a large spatial scale, accumulating significantly after long-term operation. Remote sensing technology has advanced rapidly in recent decades. It is broadly used in observing many different facets of our planet, from forest areas to urban expansion, and provides decision-makers with consistent and continuous information. Because ground observations are point-based and cannot provide full spatial coverage, satellite data can help. Many satellites have multispectral and hyperspectral sensors that can remotely measure common air pollutants and greenhouse gases. The spatiotemporal resolution, sun height angle, spectral range, and reflectance characteristics of NO2 heavily influence sensor selection and algorithm development. Many studies have highlighted the utility of remote sensing for NO2 in

Artificial Intelligence in Earth Science https://doi.org/10.1016/B978-0-323-91737-7.00014-1

271

Copyright # 2023 Elsevier Inc. All rights reserved.

272

10. AI for monitoring power plant emissions from space

accurately monitoring NOx emissions using TROPOMI (TROPOspheric Monitoring Instrument) on the ESA Sentinel-5 Precursor (S5P) satellite and OMI (Ozone Monitoring Instrument) on the NASA Aura satellite. The TROPOMI satellite instrument has yet to achieve a micro-spatial resolution capable of accurately distinguishing power plant pollution emissions from other sources. There are examples of TROPOMI accurately identifying relatively small-scale isolated sources under ideal weather conditions using spatial oversampling, temporal averaging (CEOS, 2019), or enhanced surface reflectance conditions (Beirle et al., 2021). ML has emerged as an exciting method for linking remotely sensed data to ground-based observations while avoiding the need for computationally expensive atmospheric chemistry and atmospheric dynamics calculations. ML algorithms can recognize patterns in multidimensional datasets. Geoscience is a rapidly applied field, with large amounts of data accessible to train these models and several highly complex interactions that are not made explicit in some instances. It has already been used to successfully perform tasks like land use/cover classification and automatic production of equivalent maps with the same accuracy as traditional mapping techniques (Sun et al., 2022). Recently, ML has been widely applied to the physics-based modeling community to aid in understanding model uncertainty and biases, as well as how to correct them (Van der A et al., 2020). Given the criteria of observing power plants from space, the accessibility of remote sensing data, and the widely available ML frameworks, this chapter showcases an approach to monitor NO2 emissions from coal-fired power plants using only remote sensing data. Once the model is well trained, it attempts to use the significant power of ML models for simulating nonlinear relationships and memorizing underlying patterns in the training datasets to approximate emissions from power plants using only remotely sensed data. ML has fewer constraints than traditional techniques, such as numeric modeling and rule-based workflow processing systems, excluding such requirements as initial site conditions, equation coefficient adjustments, costly computation costs, and unattainable assumptions. In this chapter, we gather data from various sources (Fig. 1), including TROPOMI NO2, EPA eGRID ground surveilling data, and weather data from NASA MERRA. The training data include TROPOMI and MERRA features as input data and EPA NO2 emission data as output target. The chapter focuses on one coal-fired power plant in Alabama, and the results show that the support vector machine regression (SVR) ML model could correctly detect changing trends in NO2 emissions by power plants. It confirms that ML has tremendous potential for identifying a consistent value range for specific ground emission sources using only remote sensing. To ensure the accuracy of the model presented in this chapter, we compared it to other traditional models; we discovered that the SVR model performs better than most other ML models with the same training data. The relevance of each input variable is evaluated and tuned in order to improve the SVR ML model. The day of the year and TROPOMI NO2 are observed to be the two most considerably correlated variables, implying that the combination of date and TROPOMI accurately represents the peaks, valleys, and patterns of power plant NO2 emissions in the data. The impact of the connection between MERRA weather variables and EPA-observed emissions was not noticeable in this case, and we suspect that the low temporal resolution of MERRA was to blame. In the future, we hope to look for higher-resolution weather products to replace MERRA, and we expect the correlation significance to increase.

1 Introduction

273

FIG. 1 EmissionAI overall execution flow.

1.1 What you will learn in this chapter • • • • •

Collecting NO2 data from satellite observations Re-projecting satellite images for a specific location on earth to get NO2 daily averages Merging Remote and Ground data sources into one file Preprocessing NO2 data for ML Training and Predicting NO2 emissions for power plants

274

10. AI for monitoring power plant emissions from space

1.2 Credentials For this chapter, two accounts need to authorize access to two data sources used in this chapter. The first is a NASA Earthdata account that can be set up at https://urs.earthdata. nasa.gov/. The second is a Google account to access the Google Earth Engine platform to access the datasets offered in there.

1.3 Prerequisites This chapter requires a few python packages to help us collect, preprocess, visualize, and train the model: Package

Version

pandas

> ¼ 1.1.3

numpy

> ¼ 1.19.5

matplotlib

> ¼ 3.4.3

seaborn

> ¼ 0.11.2

earthengine-api

> ¼ 0.1.272

xarray

> ¼ 0.20.2

netCDF4

> ¼ 1.5.8

dask

> ¼ 2021.12.0

scikit-learn

> ¼ 1.0

scipy

> ¼ 1.7.1

Python 3.8 or higher is recommended to ensure reproducibility of the code used in this chapter. Additionally, the dependency “earthengine-api” requires users to follow the installation guide for proper installation (https://developers.google.com/earth-engine/guides/ python_install).

2 Background Manufactured emissions are the primary contributor to global climate change. They have a wide range of effects on environmental ecosystems and human well-being, including steadily increasing frequency and severity of disruptive climate events such as extreme drought, storms, and fatal floods. It is necessary to act in the coming decades to halt the trend of rising global temperatures, which have already risen by 1.5°C on average above preindustrial levels, a level not seen in the last 2000 years. Many organizations and governments worldwide are taking preventative long-term and short-term goals to combat climate change.

3 Data collection

275

According to the United States Environmental Protection Agency (EPA), the energy sector accounts for more than a quarter of all emissions from both pollutants and greenhouse gases in the United States, with coal, fossil fuels, and natural gas accounting for the majority. The climate (especially extreme heat and cold) has a strong influence on power demand, which changes dramatically on a daily basis (US EPA, 2021). To prevent further severe emission releases into the atmosphere, it is critical to monitor emissions on a timely basis accurately. The EPA currently requires coal-fired power plant factories to implement real-time sensors on emission sources such as chimneys and to send back data on a regular basis. The data is accurate and continuous, and it is the only source of information for policymaking and emissions regulation. Ground observation, on the other hand, is costly to install and preserve, and so far, only power plants in the United States and a few western countries permit unrestricted access to their data. To control and improve global emissions overall, we should be able to monitor all power plants on a low-cost and on an environmentally friendly basis. A few scientific papers have aimed to evaluate the abilities of using ML with remotely sensed data to investigate alternatives to relying exclusively on ground observation stations for assessing the impact. According to these studies, ML models can identify emissions from data collected remotely up to 84% of the time (Hedley et al., 2016). The EPA presently helps to regulate nitrogen dioxide (NO2), the most significant NOx pollutant in the troposphere. To protect public health, the EPA established national ambient air quality standards (NAAQS) to monitor and control tropospheric NO2 levels. The safe annual average concentration threshold is 0.053 ppm (ppm) (100 μg per cubic meter) (Alnaim et al., 2022). Because of the coal/gas combustion, power plants are a major source of man-made NO2. Moreover, despite 30 years of laws and regulations aimed at improving air quality by reducing emissions, some US power plants still fail to control pollutant emissions, despite the fact that monitoring and control technology is readily available (EPA, 1999).

3 Data collection 3.1 TROPOMI tropospheric NO2 data The TROPOspheric Monitoring Instrument (TROPOMI) provides a daily high resolution of 3.5  5.5 km for atmospheric pollutants, including a tropospheric vertical column of NO2 per mol/m through its L2/L3 data products. This chapter will utilize its tropospheric NO2 column as input to train our model and predict the NO2 level emitted from a power plant. To collect TROPOMI data, we need to use the Google Earth Engine Python library to access its API. Through the API calls, we can specify a location to get a daily TROPOMI image for a certain duration, in this case, one month. The following code will import the necessary packages, collect the data, average the TROPOMI NO2 data into daily mean values, and save the data into a CSV file named “tropomi_no2.csv”.

276

10. AI for monitoring power plant emissions from space

import json import pandas as pd import ee try: ee.Initialize() except Exception as e: ee.Authenticate() ee.Initialize() # Define location and add a 500 meter buffer around our Point Of Interest (POI) poi = ee.Geometry.Point(-87.910747, 31.488019).buffer(500) # Get TROPOMI NRTI Image Collection from GoogleEarth Engine tropomiCollection = ee.ImageCollection("COPERNICUS/S5P /NRTI/L3_NO2").filterDate('2019-01-01','2019-01-31') def poi_mean(img): # This function will reduce all the points in the area we specified in "poi" and average all the data into a single daily value mean = img.reduceRegion(reducer=ee.Reducer.mean(), geometry=poi,scale=250).get('tropospheric_NO2_column_number_density') return img.set('date', img.date().format()).set('mean',mean) # Map function to our ImageCollection poi_reduced_imgs = tropomiCollection.map(poi_mean) nested_list = poi_reduced_imgs.reduceColumns(ee.Reducer.toList(2), ['date','mean']). values().get(0) # We need to call the callback method "getInfo" to retrieve the data to local df = pd.DataFrame(nested_list.getInfo(), columns=['date','tropomi_no2_mean']) df['date'] = pd.to_datetime(df['date']) df = df.set_index('date') # Scaling the data to later match our target feature scale df['tropomi_no2_mean'] = df['tropomi_no2_mean']*1000 # Save data to a CSV file df.to_csv('tropomi_no2.csv')

This code will collect an image every day for the chosen location throughout the specified duration. It will also select a specific band “NO2_column_number_density” available in Google Earth Engine for the selected data product “tropomiCollection.” Each image is passed to a reducer to get a single daily average value for all points in the area. After that, each average value coupled with a date is added to a pandas data frame to make dates as the index. We also need to scale the collected tropomi data to match our target feature, EPA NO2, so that the ML model does not assign higher weights to the higher range values, making the model perform much worse than if both features were in the same unit scale. The data frame is saved to a CSV as the last step to easily merge it later.

3 Data collection

277

3.2 MERRA-2 meteorology data The second Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2), offers meteorological data collections that include multiple features that are used in this chapter, such as surface temperature, bias-corrected precipitation, cloud fraction, and surface wind speed. OPeNDAP is used to automate the collection of these features over the course of a specified duration. Additional MERRA-2 data collection instructions can be accessed at NASA’s Distributed Active Archive Centers website (https://daac.gsfc.nasa.gov/information/howto). To use OPeNDAP, we must first create a NASA Earthdata account, as well as two additional files in our local machine’s home directory. The first file, “.netrc,” will contain the credentials required when the python code requests data from OPenDAP (where “uid” is your Earthdata user name and “password” is your Earthdata Login password). These lines below will add the necessary information to the ".netrc" file: # Mac/Linux users. $ touch $HOME/.netrc $ echo "machine urs.earthdata.nasa.gov login password " >> $HOME/. netrc # Windows users. Execute at $HOME directory NUL > .netrc machine urs.earthdata.nasa.gov login password

The next file required is “.urs_cookie”. Simply paste this into your command line to create the file: # Mac/Linux users. touch $HOME/.urs_cookies # Windows users. Execute at $HOME directory NUL > .urs_cookies

The final file required is “.dodsrc” to tell Data Access Protocol (DAP) clients to use the “. netrc” file for authentication. # Mac/Linux users. touch $HOME/.dodsrc echo "HTTP.COOKIEJAR=$HOME/.urs_cookies" >> $HOME/.dodsrc echo "HTTP.NETRC=$HOME/.netrc" >> $HOME/.dodsrc # Windows users. Execute at $HOME directory NUL > .dodsrc echo "HTTP.COOKIEJAR=$HOME/.urs_cookies" >> /.dodsrc echo "HTTP.NETRC=$HOME/.netrc" >> /.dodsrc

NOTE: Ensure “$HOME” is replaced with the full path of home directory of your local machine in all commands listed above.

278

10. AI for monitoring power plant emissions from space

NOTE: you may need to re-create “.urs_cookies” in case you have already executed wget without valid authentication. NOTE: If you get 'Access denied' error. Enter 'dir' to verify that '.urs_cookies' file is listed in your directory. (Windows Only)

After all the above commands have been executed, the following python code can now successfully contact the OPenDAP protocol to collect MERRA-2 data in batches. We start with specifying a time frame for the collected data by including these lines below in the python script/Jupyter notebook: # import required packages. import pandas as pd import dask import netCDF4 import xarray as xr # Time frame of MERRA-2 data to collect year = '2019' month_begin = '01' month_end = '01' day_begin = '01' day_end = '31'

We will collect data from three collections. The features we want are spread out in different collections, and in order to retrieve them, separate executions are needed. The first collection is The MERRA-2 “M2I1NXASM” to get Temp and Wind variables (T2M, V2M). Below lines are used to collect this collection: collection_shortname = 'M2I1NXASM' collection_longname = 'inst1_2d_asm_Nx' collection_number = 'MERRA2_400' MERRA2_version = '5.12.4' # OPeNDAP URL url = 'https://goldsmr4.gesdisc.eosdis.nasa.gov/opendap/MERRA2/{}.{}/{}'.format (collection_shortname, MERRA2_version, year) files_month = ['{}/{}/{}.{}.{}{}.nc4'.format(url, month_days[0:2], collection_number, collection_longname, year, month_days) for month_days in pd. date_range(year + '-' + month_begin + '-' + day_begin, year + '-' + month_end + '-' + day_end, freq='D').strftime("%m%d").tolist()] # Get the number of files len_files_month = len(files_month) print("{} files to be opened:".format(len_files_month)) print("files_month", files_month) # Read dataset URLs ds_temp_wind = xr.open_mfdataset(files_month)

3 Data collection

279

The second collection is the MERRA-2 “M2T1NXLND” to get the Total precipitation variable (PRECTOTLAND). Below lines are used to collect this data: collection_shortname = 'M2T1NXLND' collection_longname = 'tavg1_2d_lnd_Nx' collection_number = 'MERRA2_400' MERRA2_version = '5.12.4' # OPeNDAP URL url = 'https://goldsmr4.gesdisc.eosdis.nasa.gov/opendap/MERRA2/{}.{}/{}'.format (collection_shortname, MERRA2_version, year) files_month = ['{}/{}/{}.{}.{}{}.nc4'.format(url, month_days[0:2], collection_number, collection_longname, year, month_days) for month_days in pd. date_range(year + '-' + month_begin + '-' + day_begin, year + '-' + month_end + '-' + day_end, freq='D').strftime("%m%d").tolist()] # Get the number of files len_files_month = len(files_month) print("{} files to be opened:".format(len_files_month)) print("files_month", files_month) # Read dataset URLs ds_precip = xr.open_mfdataset(files_month)

The final collection is the MERRA-2 “M2T1NXRAD” to get the Cloud Fraction variable (CLDTOT). Below lines are used to collect this data: collection_shortname = 'M2T1NXRAD' collection_longname = 'tavg1_2d_rad_Nx' collection_number = 'MERRA2_400' MERRA2_version = '5.12.4' # OPeNDAP URL url = 'https://goldsmr4.gesdisc.eosdis.nasa.gov/opendap/MERRA2/{}.{}/{}'.format (collection_shortname, MERRA2_version, year) files_month = ['{}/{}/{}.{}.{}{}.nc4'.format(url, month_days[0:2], collection_number, collection_longname, year, month_days) for month_days in pd.date_range(year + '-' + month_begin + '-' + day_begin, year + '-' + month_end + '-' + day_end, freq='D').strftime("%m%d").tolist()] # Get the number of files len_files_month = len(files_month) print("{} files to be opened:".format(len_files_month)) print("files_month", files_month) # Read dataset URLs ds_cloud = xr.open_mfdataset(files_month)

280

10. AI for monitoring power plant emissions from space

3.3 EPA eGRID data The Environmental Protection Agency (EPA) offers a comprehensive source of air quality data through its AirData website. This chapter utilizes daily ground-based NO2 data for a whole year that can be downloaded at https://www.epa.gov/outdoor-air-quality-data/ download-daily-data. The CSV file downloaded from (Fig. 2) will include multiple features based on the location specified on the download page. The File will show all EPA Site ID’s near the location with daily MAX readings of NO2. The generated CSV will be used in the Preprocessing section to merge the collected EPA NO2 data into a final CSV file. The collected EPA NO2 feature will be used as the target variable to predict.

3.4 MODIS MCD19A2 product The MCD19A2 product provides the atmospheric properties and view geometry used to calculate the MAIAC Land Surface Bidirectional Reflectance Factor (BRF) or surface reflectance. The MCD19A2 AOD data product contains a blue band AOD at 0.47 μm as one of its layers produced daily at 1 (km) resolution. It is a measure of aerosols (e.g., urban haze, smoke, particles, desert dust, sea salt) distributed within a column of air from the instrument. (Earth’s surface) to the top of the atmosphere. This chapter will utilize MODIS Blue Band AOD at 0.47 μm as it contributes to the effects of pollutant dispersion, tropospheric NO2 retrieval, and correlation to meteorological conditions.

FIG. 2 Download the page with input to get the EPA NO2 data.

4 Preprocessing

281

Data extraction of MODIS MCD19A2 follows the same steps as TROPOMI NO2 data collection with the difference of the collection name,“poi_mean” function, and pandas dataframe lines: import json import pandas as pd import ee try: ee.Initialize() except Exception as e: ee.Authenticate() ee.Initialize() # Define location and add a 500 meter buffer around our Point Of Interest (POI) poi = ee.Geometry.Point(-87.910747, 31.488019).buffer(500) mcd19a2 = ee.ImageCollection("MODIS/006/MCD19A2_GRANULES").filterDate('2019-0101','2019-01-31') def poi_mean(img): mean = img.reduceRegion(reducer=ee.Reducer.mean(), geometry=poi, scale=250).get('Optical_Depth_047') return img.set('date', img.date().format()).set('mean',mean) # Map function to our ImageCollection poi_reduced_imgs = mcd19a2.map(poi_mean) nested_list = poi_reduced_imgs.reduceColumns(ee.Reducer.toList(2), ['date','mean']). values().get(0) # we need to call the callback method "getInfo" to retrieve the data df = pd.DataFrame(nested_list.getInfo(), columns=['date','Optical_Depth_047']) df['date'] = pd.to_datetime(df['date']) df = df.set_index('date') # Save data to CSV file df.to_csv('mcd19a2_Optical_Depth_047.csv')

4 Preprocessing 4.1 TROPOMI NO2 The CSV file “tropomi_no2.csv” saved from the python code we did earlier was already preprocessed when we applied the poi_mean function, and no additional steps are needed here. All we need to do for it is to merge the data into the final dataset later. This will be the final step in the Preprocessing section.

282

10. AI for monitoring power plant emissions from space

4.2 MERRA-2 Extracting the data of the MERRA-2 collections downloaded is the first preprocessing step. To get an accurate meteorological reading of the area, we need to provide a latitude and longitude of a specific location corresponding to the power plant (Alabama) we are utilizing in this chapter. The coordinates can be passed to the xarray datasets created previously. To start with a subset of the data based on location, we execute these below lines for each dataset: # extract values from all datasets based on location (Alabama plant) alabama_plant_temp_wind = ds_temp_wind.sel(lat=31.48, lon=-87.91, method='nearest') # Keep only the desired variables from the dataset. alabama_plant_temp_wind = alabama_plant_temp_wind[['T2M', 'V2M']] alabama_plant_precip = ds_precip.sel(lat=31.48, lon=-87.91, method='nearest') # Keep only the desired variables from the dataset. alabama_plant_precip = alabama_plant_precip[['PRECTOTLAND']] alabama_plant_cloud = ds_cloud.sel(lat=31.48, lon=-87.91, method='nearest') # Keep only the desired variables from the dataset. alabama_plant_cloud = alabama_plant_cloud[['CLDTOT']]

All datasets (ds_temp_wind, ds_precip, ds_cloud) are used to select the data based on the coordinates passed. The sel() method accepts latitude and longitude parameters while also specifying a method parameter of nearest to select the nearest available data in the datasets for the location specified. The following preprocessing step is to resample the datasets to be a daily average of the hourly values. This can quickly be done through xarray’s resample() method. To start resampling, the following lines need to be executed for all datasets (ds_temp_wind, ds_precip, ds_cloud): # Resample dataset into daily average alabama_plant_temp_wind_mean = alabama_plant_temp_wind.resample( time="1D").mean(dim='time', skipna=True) # Resample dataset into daily average alabama_plant_precip_mean = alabama_plant_precip.resample( time="1D").mean(dim='time', skipna=True) # Resample dataset into daily average alabama_plant_cloud_mean = alabama_plant_cloud.resample( time="1D").mean(dim='time', skipna=True)

The time parameter in the method accepts a string indicating the resampling needed, and the lines above use “1D” to indicate a daily resample of all hourly data to be averaged.

4 Preprocessing

283

After all the steps above are finished, the last step needed is to convert the MERRA-2 datasets into a pandas dataframe and merge them to output the data into a CSV file. That can be done by running the code below: # Convert datasets to pandas df and save to CSV file. alabama_plant_temp_wind_mean_df = alabama_plant_temp_wind_mean.to_dataframe() alabama_plant_precip_mean_df = alabama_plant_precip_mean.to_dataframe() alabama_plant_cloud_mean_df = alabama_plant_cloud_mean.to_dataframe() #Merge all three dataframes into one df. merged_dfs = alabama_plant_temp_wind_mean_df.merge( alabama_plant_precip_mean_df, on='time').merge(alabama_plant_cloud_mean_df, on='time') # Save dataframe into a CSV file merged_dfs.to_csv('merra2_daily_mean_2019.csv')

4.3 MCD19A2 The CSV file “mcd19a2_Optical_Depth_047.csv” saved from the python code we did earlier was already preprocessed when we applied the poi_mean function, and no additional steps are needed here. All we need to do for it is to merge the data into the final dataset. This will be the next step.

4.4 Merging training data Now that we have all the CSV files for each data source downloaded to our local system, we prepare the data by merging and then preprocess the final dataset for model training. The first step is to read all the files and merge them all together. Some additional steps, such as date formatting, and dropping columns, happen here too. This will help in having a cleaner dataset to use for model training. # Read EPA NO2 CSV file downloaded from https://www.epa.gov/outdoor-air-qualitydata/download-daily-data epa_no2 = pd.read_csv('ad_viz_plotval_data.csv', parse_dates=["Date"]) # Rename "date" column for consistency between other dataframes epa_no2.rename(columns={'Date': 'date'}, inplace=True) # Keep needed features only from CSV file epa_no2 = epa_no2[['date', 'Daily Max 1-hour NO2 Concentration']].copy() # Divide Daily EPA NO2 mean to scale data. epa_no2['EPA_NO2/100000'] = epa_no2['Daily Max 1-hour NO2 Concentration']/1000 # Drop unscaled column. epa_no2.drop('Daily Max 1-hour NO2 Concentration', axis=1, inplace=True)

284

10. AI for monitoring power plant emissions from space

# Read TROPOMI NO2 CSV file retrieved from Google Earth Engine tropomi_no2 = pd.read_csv('tropomi_no2.csv', parse_dates=["date"]) # Reformat date for consistency between other dataframes tropomi_no2['date'] = tropomi_no2['date'].dt.strftime('%m/%d/%Y') # Convert "date" column to datetime object tropomi_no2['date'] = pd.to_datetime(tropomi_no2['date']) # Read MERRA2 Metrological CSV file retrieved from OPeNDAP merra_2_daily = pd.read_csv('merra2_daily_mean_2019.csv', parse_dates=["time"]) # Rename features to identifiable names merra_2_daily.rename(columns={'time': 'date', 'T2M': 'Temp (Daily)', 'V2M': 'Wind (Daily)', 'PRECTOTLAND': 'Precip (Daily)', 'CLDTOT': 'Cloud Fraction (Daily)'}, inplace=True) # Reformat date for consistency between other dataframes merra_2_daily['date'] = merra_2_daily['date'].dt.strftime('%m/%d/%Y') # Convert "date" column to datetime object merra_2_daily['date'] = pd.to_datetime(merra_2_daily['date']) # Drop unnecessary features merra_2_daily.drop(['lat_x', 'lon_x', 'lon_y', 'lat_y'], axis=1, inplace=True) # Read MODIS MCD19A2 CSV file retrieved from Google Earth Engine mcd19a2 = pd.read_csv('mcd19a2_Optical_Depth_047.csv', parse_dates=["date"]) # Reformat date for consistency between other dataframes mcd19a2['date'] = mcd19a2['date'].dt.strftime('%m/%d/%Y') # Convert "date" column to datetime object mcd19a2['date'] = pd.to_datetime(mcd19a2['date'])

Now we can start merging epa_no2, tropomi_no2, merra_2_daily, and mcd19a2 data frames that we just created. To do that, we need to run the below line of code: alabama_plant_merged_data = epa_no2.merge(tropomi_no2,on='date') .merge(merra_2_daily, on='date').merge(mcd19a2, on='date')

At this point a dataframe with 10 features should be generated. We can now start preparing the merged dataframe for training the model we will create in the next section. First, we split the date column into a numerical representation so that our model can accept it as input: alabama_plant_merged_data['dayofyear'] = alabama_plant_merged_data['date'].dt. dayofyear alabama_plant_merged_data['dayofweek'] = alabama_plant_merged_data['date'].dt. dayofweek alabama_plant_merged_data['dayofmonth'] = alabama_plant_merged_data['date'].dt.day

5 Machine learning

285

Then create a list of our dependent and independent features (predictors, and target) target_column = ['EPA_NO2/100000'] predictors = ['tropomi_no2_mean', 'Wind (Daily)', 'Temp (Daily)', 'Precip (Daily)', 'Cloud Fraction (Daily)', 'dayofyear', 'dayofweek', 'dayofmonth']

We split the data into a training set and a testing set. This will help us later determine how accurate our model by feeding it the testing set that it was not trained on (unseen data). from sklearn.model_selection import train_test_split X = alabama_plant_merged_data[predictors] y = alabama_plant_merged_data[target_column] trainX, testX, trainY, testY = train_test_split(X,y,test_size=0.30, random_state=42)

We split the data into a 70–30 split. The training set will contain 70% of our data, and the testing set will have the remaining 30%. Now we are done with preprocessing and ready to start creating our model for training and testing.

5 Machine learning 5.1 Support vector regression (SVR) SVR is a reliable model with nonlinear data. As the chapter’s data is nonlinear, an SVR model can be used to evaluate the performance capabilities of predicting EPA NO2 emissions. A radial basis kernel function is selected for this model. The RBF kernel is typically used to find a nonlinear classifier or a regression line for two points, X1 and X2, that computes the similarity or how close they are to each other. A grid search is additionally used for a range of values to choose an optimal set of hyperparameters. To create an SVR model, we start by importing the necessary functions: from sklearn.svm import SVR from sklearn.model_selection import GridSearchCV import numpy as np

The next step would be to create the search parameter grid by providing multiple possible options of values for each hyperparameter expected by an SVR model: # Define possible parameters to hyper tune the SVR model param_grid={ 'kernel': ["rbf"], 'C': [0.1, 1, 100, 1000, 10000], 'epsilon': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05], 'gamma': [0.0001, 0.001, 0.01]}

286

10. AI for monitoring power plant emissions from space

#Initialize a GridSearchCV Object grid = GridSearchCV(SVR(), param_grid, refit = True, verbose = 3,n_jobs=-1) # fitting the data for grid search grid.fit(trainX, np.ravel(trainY)) # print best parameter found for data print(grid.best_params_) print(grid.best_score_)

The “param_grid” dictionary contains options for the grid search to find the most optimal hyperparameters for the model based on the data we provided. The grid variable created will be the actual grid search object to which we will fit our data in the following line. The last two print statements will show us which hyperparameters out of the options we provided were the most optimal. “grid.best_score_” will display the average of all CV (Cross Validation) folds for a single combination of the parameters we specified in “param_grid”. The expected output of running the grid search should be similar to (Fig. 3). We then initialize the model by using the imported SVR function from sklearn while passing the most optimal hyperparameters displayed in Fig. 3: #Initialize an SVR model with the most optimal hyper parameters found with GridSearchCV svr_rbf = SVR(kernel='rbf', C=1, gamma=0.01, epsilon=0.05) #Fit data to SVR model svr_rbf.fit(trainX, np.ravel(trainY))

When initializing the SVR model, we set the C hyperparameter before training the model to control error. Similarly, gamma is also a hyperparameter used to give the curvature weight of the decision boundary. It defines how far the influence of a single training example reaches, with low values meaning “far”, and high values meaning “close”. Finally, epsilon determines the width of the tube around the estimated function (hyperplane). Points that fall inside this tube are considered correct predictions and are not penalized by the algorithm. After the model is trained, now we can continue to predict the testing set (unseen data) we created earlier in Dataset preparation: #Predict on testing set yhat = svr_rbf.predict(testX) #Display performance metrics of model prediction showAccuracyMetrics("SVR [Alabama Plant] Model performance: ", svr_rbf, testY, yhat)

FIG. 3 Example output of the best parameters and best grid search score for an SVR model.

5 Machine learning

287

In “showAccuracyMetrics” (defined in Section 5.2), we are passing a string to print information related to the specific model trained. Multiple attempts at training should require separate runs of showAccuracyMetrics to compare which model did the best. The other parameters passed to this function are “svr_rbf” our trained model, “testY” the target EPA NO2 test set, and “yhat” which is the prediction we got from passing the training features to our model. After running the above two lines, an output with different values should be similar to (Fig. 4). We can also visually check how the model performed against the actual data (Blue line) displayed in (Fig. 5) using the second utility function “plotOnActual” (defined in Section 5.2). Finally, we visualize a correlation plot of actual vs. predicted values from our SVR model to get a better idea of the regression line fit and the correlation coefficient. The third utility function “plotCorrelation” output is displayed in (Fig. 6).

5.2 Utility functions We created a few utility functions to calculate the models’ accuracy metrics and visualize the results. To show the accuracy metrics of testing our models, “showAccuracyMetrics” is defined below: from sklearn.metrics import mean_squared_error, r2_score from sklearn import metrics import math def showAccuracyMetrics(mlmethod, model, y_test, y_pred): print("Model ", mlmethod, " Performance:") mae = metrics.mean_absolute_error(y_test, y_pred) mse = metrics.mean_squared_error(y_test, y_pred) rmse = metrics.mean_squared_error(y_test, y_pred, squared=False) r2 = metrics.r2_score(y_test, y_pred) rse = math.sqrt(mse/(len(y_test)-2)) print(" MAE: ", mae) print(" MSE: ", mse) print(" RMSE: ", rmse) print(" RSE: ", rse) print(" R2: ", r2) print(" Mean: ", np.mean(y_test)) print(" Error Rate: ", rmse/np.mean(y_test))

FIG. 4 SVR model accuracy metrics using the “showAccuracyMetrics” function.

288

10. AI for monitoring power plant emissions from space

FIG. 5 Example output of “plotOnActual” utility function.

FIG. 6 Correlation plot of actual vs. predicted values using the “plotCorrelation” utility function.

5 Machine learning

289

This function will show all the necessary metrics expected from training a model, especially a nonlinear model. The MAE, MSE, and RMSE will tell us the severity of the prediction error, while RSE will show the goodness-of-fit of our model to the regression line associated with nonlinear data. Another useful utility function is visualizing the training and testing prediction values over the actual target variable (EPA NO2) for each day in the month to better understand how our model fits the real data. The “plotOnActual” function below can help us do that: import numpy as np import matplotlib.pyplot as plt import seaborn as sns sns.set(rc={'figure.figsize': (20, 10)}) plt.style.use('fivethirtyeight') def plotOnActual(model): # make predictions trainPredict = model.predict(trainX) testPredict = model.predict(testX) emptyTestList = np.empty(27) emptyTestList[:] = np.nan testPredictPlot = np.concatenate((emptyTestList, testPredict)) # plot baseline and predictions plt.plot(alabama_plant_merged_data.values[:,1]) plt.plot(trainPredict) plt.plot(testPredictPlot) plt.ylabel('EPA NO2', fontsize=22) plt.yticks(fontsize=16) plt.xlabel('Days with EPA NO2 Value', fontsize=22) plt.xticks(fontsize=16) plt.legend(["Data", "Train", "Test"], fontsize=18, shadow=True) plt.show()

“plotOnActual” will take a trained model and predict training and testing set to visualize the results on a line chart overlay on top of the actual data. An emptyTestList variable is created to separate the test set prediction results on the graph from the training set resulting in a gap in the graph. This is done to overlay the test set predictions correctly over the proper area in the graph corresponding to the correct time the test set contains, rather than have it be visualized on top of the training set predictions. An example output of this function is displayed in Fig. 5. The final utility function “plotCorrelation” will help plot a correlation plot of our actual vs. predicted values of our model: import scipy from numpy.polynomial.polynomial import polyfit

290

10. AI for monitoring power plant emissions from space

def plotCorrelation(ypred, ylimitMin=None, ylimitMax=None): slope, intercept, r, p, stderr = scipy.stats.linregress(np.ravel(testY), ypred) line = f'Regression line: y={intercept:.2f}+{slope:.2f}%xdel' b, m = polyfit(np.ravel(testY), ypred, 1) plt.plot(np.ravel(testY), b + m * np.ravel(testY), '-', linewidth=2, alpha=0.7, color='lightcoral', label=line) g=plt.scatter(testY, ypred, s=65, label=f'Correlation Coefficient = {np.round(np.corrcoef(np.ravel(testY), ypred)[0,1], 2)}') g.axes.set_xlabel('Actual', fontsize=22) g.axes.set_ylabel('Predicted', fontsize=22) plt.yticks(fontsize=16) plt.xticks(fontsize=16) plt.ylim(ylimitMin, ylimitMax) plt.legend(facecolor='white', fontsize=16, shadow=True) g.axes.axis('equal') g.axes.axis('square');

The function “plotCorrelation” takes in the predicted values for either the training or testing set values and shows the linearity of the predictions against the actual target values (EPA NO2). Now we should already have a fully trained model on multiple remotely sensed features and other data sources that are used to predict our ground-based target feature EPA NO2. Many different approaches can be taken to improve the foundation we built here, either by using different ML models or different feature engineering techniques. The Open Questions section below discusses more ideas for improving our model, while the Assignment section has some ideas to try and implement to our current code base.

6 Managing emission AI workflow in Geoweaver During projects such as this chapter, productivity issues due to many possible variations and different experiment results throughout numerous testing iterations cause researchers and students to struggle. Software that can help us automatically manage every code file and every model history version so that nothing is lost, miscommunicated, or potentially causes severe issues like server collapse, disk failures, cloud credit running out, and bad team communication can be incredibly beneficial to a project. Geoweaver (Sun et al., 2020) is a solution to increase the productivity of experimental and operational projects substantially. It is an open-source workflow management software that can meet all previously listed requirements. Geoweaver allows users to set up scripts and link them into traceable, manageable pipelines. The workflow incorporates user-created processes, and the user can operate the workflow and any underlying processes locally or remotely on different hosts. It can simplify the reusability aspect of building

7 Discussion

291

FIG. 7 EmissionAI workflow on Geoweaver.

an environment-independent research experiment by allowing all project or workflow team members to export and share all of the necessary parts so that another user may quickly load it into their Geoweaver instance and replicate the results or edit the code. It currently can support Python code, shell scripts, and Jupyter notebooks. As shell and Python can call any other programs, Geoweaver theoretically can run any application available on computing platforms. We created a workflow in Geoweaver for this chapter’s experiment, as shown in Fig. 7. All steps introduced in this chapter have corresponding processes in the Geoweaver workflow. After creating the workflow, users will be able to execute it locally or remotely, and Geoweaver will automatically record all versions and execution output for every workflow execution. It can also allow users to export and share the entire experiment via a simple zip file. To start using Geoweaver, please visit either the webpage (https://esipfed.github.io/ Geoweaver/) or the public GitHub repo (https://github.com/ESIPFed/Geoweaver).

7 Discussion For various reasons, most existing research concentrates on the general large-scale spatial dispersion of NO2 concentration and does not investigate micro-resolution single-point emission (Sun et al., 2020). ML can address these concerns by effectively simulating the relationships among remotely sensed data and precise ground emission sources. As demonstrated in the result section, there was a strong correlation between TROPOMI and EPA ground observations for most of the specified duration, indicating that the relationship is significant enough to support the inferences from one to the other. Although the TROPOMI curve was changing less in frequency and range compared to the EPA observations, the changes and fluctuations for both time series data generally align with each other.

292

10. AI for monitoring power plant emissions from space

Since operational power plants typically run continuously with little break, the NO2 plume ought to be close to the power plants and can be managed to be captured by satellite imagery grids despite their coarse spatiotemporal resolution. The experiment results demonstrated that ML could alleviate a few of the obstacles and generate a realistic prediction on particular emission sources. However, certain conditions must be met in order for the ML models to function optimally. There should be no other significant second source of NO2 nearby, such as metropolitan cities or other large-emitter facilities. We have not tested the boundary of the noise sources in this chapter but recommend using 100 miles for a boundary to isolate power plant emissions and clean the training data for the time being. With a wider accessibility of higher-resolution satellite products, we will be able to use ML to predict all power plant emissions without regard to spatial constraints in the future.

8 Summary This chapter utilized ML to simulate the nonlinear relationships and hidden patterns between remote sensing data and EPA ground observation to estimate emissions from power plants. This chapter collected data from multiple data sources, including TROPOMI NRT NO2 products, EPA eGRID ground monitoring network data, and NASA MERRA weather data. The training data used TROPOMI, MERRA, and MODIS variables as inputs and EPA emission data as output. It pushes to bridge the gap between information and actionable insights by developing a low-cost operational model for estimating emissions. Deploying an ML model such as the one we built here can be consistently used to train new data and predict daily estimated NO2 levels of power plants. This chapter has shown that monitoring ground facility emissions from space using satellites imagery and ML is possible and practical.

9 Assignment After reproducing and running the source code in this chapter, try making experimental changes to improve the model or build new use cases. For example: 1. Change the time range of the test data collection to another month in a different season. Run the model with the new test data and see how well the predictions match with EPA observations in that month. 2. Replace the SVR model with another ML model suitable for nonlinear data, like Random Forest and XGBoost. Scikit-learn provides many different ML models, including, but not limited to, the above three. Please try other models, compare their results, and rank them based on accuracy and error rates. 3. Switch to other power plant locations and rerun the entire workflow. This can be useful for understanding other factors that could affect NO2 levels in different regions. ML could have various performances due to spatial locations and temporal range changes. Please consider changing to power plants in areas of the city, suburban, countryside, plain, mountains, river, sea, etc. Because the TROPOMI NO2 observation could include NO2 that is not from power plants, and cities have more multisource NO2 than other places. The ML

11 Lessons learned

293

model has to deal with the influences of other NO2 sources and will have more difficulty analyzing the correlation and predicting the portion of power plant NO2 emission than isolated locations.

10 Open questions The future improvements to this model might be done in the directions of: 1. TROPOMI is the best remotely sensed NO2 dataset we can get right now, and it has a 3–5 km spatial resolution. The direct calculation of NO2 at ground level cannot guarantee accuracy. Also, the revisit interval of TROPOMI is very scarce and unable to form a high temporal resolution observation of the atmosphere around power plants. That restricts the ML model from predicting each power plant’s consecutive emissions. When a new satellite with higher resolution is launched in the future, it should be able to improve models similar to ours. 2. The current ML model still relies on an empirical dataset, and generally speaking, this trained model cannot evolve or deduce. Other technologies like reinforcement learning might help this use case to learn the high-level patterns actively and also self-study new patterns from old datasets to make deductive predictions on future new data. Using dataand-rule-hybrid-driven AI technologies instead of data-driven methods might only bring us new capabilities for the generalization and usability of future trained models. Can you think about any AI methods which might achieve this idea?

11 Lessons learned We learned a few tips and tricks when we tried to improve the model performance to make it easier to manage and reuse: 1. The data collected can reach a big volume when the temporal range is long (e.g., twenty years) and is computationally expensive. Also, a lot of missing data points due to weather or sensor failure could hurt the model performance. (The model has to guess and predict for the blank gaps.) Here are something we learned: • The requirement of big data processing is normal, especially when remote sensed data is included. Some days, the quality of the captured images from the satellite instruments is not as high as the other days, resulting in either an empty value for the day or an incorrect data point. Google Earth Engine has provided us with a very useful API to quickly gather and extract the required TROPOMI values from the huge raw imagery. • The missing and noisy data could cause the models to skew to some arbitrary data points that do not represent the actual condition of the captured data, which would reduce the model’s accuracy overall. We must remove missing and noisy data from the training data in the preprocessing. 2. Executing the tasks shown in this chapter can be done much more efficiently in Geoweaver to leverage all the available computing resources in one place. This would also reduce the

294

10. AI for monitoring power plant emissions from space

chance of repeating the same experiments since all the history is recorded. The use of Geoweaver has saved a lot of computational intensive tasks that take a while to finish, made the performance of different models more accessible and less prone to fail midexecution.

References Alnaim, A., Sun, Z., Tong, D., 2022. Evaluating machine learning and remote sensing in monitoring NO2 emission of power plants. Remote Sens. (Basel) 14, 729. https://doi.org/10.3390/rs14030729. Beirle, S., Borger, C., D€ orner, S., Eskes, H., Kumar, V., de Laat, A., Wagner, T., 2021. Catalog of NOx emissions from point sources as derived from the divergence of the NO2 flux for TROPOMI. Earth Syst. Sci. Data 13, 2995–3012. CEOS, 2019. Geostationary Satellite Constellation for Observing Global Air Quality: Geophysical Validation Needs. Available online: https://ceos.org/document_management/Publications/Publications-and-Key-Documents/ Atmosphere/GEO_AQ_Constellation_Geophysical_Validation_Needs_1.1_2Oct2019.pdf. EPA, 1999. Nitrogen Oxides (NOx), Why and How They Are Controlled.57. Available online: https://www3.epa. gov/ttn/catc/dir1/fnoxdoc.pdf. (Accessed 15 December 2021). Hedley, J., Russell, B., Randolph, K., Dierssen, H., 2016. A physics-based method for the remote sensing of seagrasses. Remote Sens. Environ. 174, 134–147. Sun, Z., Di, L., Burgess, A., Tullis, J.A., Magill, A.B., 2020. Geoweaver: advanced cyberinfrastructure for managing hybrid geoscientific AI workflows. ISPRS Int. J. Geo Inf. 9 (2), 119. https://doi.org/10.3390/ijgi9020119. Sun, Z., Sandoval, L., Crystal-Ornelas, R., Mousavi, S.M., Wang, J., Lin, C., Cristea, N., Tong, D., Carande, W.H., Ma, X., et al., 2022. A review of earth artificial intelligence. Comput. Geosci. 159, 105034. US EPA, 2021. Cleaner power plants. Available from: https://www.epa.gov/mats/cleaner-power-plants. (Accessed 15 December 2021). Van der A, R.J., de Laat, A.T.J., Ding, J., Eskes, H.J., 2020. Connecting the dots: NOx emissions along a West Siberian natural gas pipeline. npj Clim. Atmos. Sci. 3, 1–7.

C H A P T E R

11 AI for shrubland identification and mapping Michael J. Mahoneya, Lucas K. Johnsona, and Colin M. Beierb a

Graduate Program in Environmental Science, State University of New York College of Environmental Science and Forestry, Syracuse, NY, United States bDepartment of Sustainable Resources Management, State University of New York College of Environmental Science and Forestry, Syracuse, NY, United States

1 Introduction This chapter walks through a procedure for predicting the prevalence of “shrubland” (defined here as low-statured vegetation between 1 and 5 m in height) across a diverse region in New York State, patterned off the process used in Mahoney et al. (2022a). Due to the impacts of climate change and human land use patterns, these shrublands are becoming an increasingly important land cover type in the region, often representing an entirely novel ecosystem type. As a result of this novelty, these shrublands and the roles they play in the larger landscape (for instance, as habitats and as components of biogeochemical cycles) are poorly understood. Even identifying shrublands using remote sensing data, a potential way to monitor their development over time, is difficult given the relative rarity of shrublands in this region and their similar appearance to forest lands and wetlands in satellite imagery. The chapter introduces step by step how to fit a feedforward neural network using the Keras module in the popular TensorFlow package and a subset of the data from Mahoney et al. (2022a). Due to the rarity of shrubland in this region of New York, the chapter focuses on the adjustments necessary when building models from data with imbalanced classes, and on how to interpret model performance metrics when fitting classification models for specific purposes. Chapter exercises prompt learners to investigate how different priorities for a model might result in notably different performance measures.

Artificial Intelligence in Earth Science https://doi.org/10.1016/B978-0-323-91737-7.00010-4

295

Copyright # 2023 Elsevier Inc. All rights reserved.

296

11. AI for shrubland identification and mapping

2 What you’ll learn • Ways to think about “model performance” in the context of machine-learning shrubland classification models • Creating a model for imbalanced classes • Fitting a simple feedforward neural network with Keras and TensorFlow • Rasterizing model predictions for visualization

3 Background Human land use has fundamentally reshaped the structure and composition of the surrounding environment, leaving lasting legacies including the emergence of novel communities and ecosystem types (Foster et al., 1998; Cramer et al., 2008). Among the outcomes of these changes, the emergence of low-statured vegetation or “shrublands” as a more common cover type in the US Northeast has been suggested by numerous field studies, but is poorly understood from a landscape perspective. Although long disregarded, these lands are rapidly gaining attention in today’s urgent push to implement “natural climate solutions” (Fargione et al., 2018) and identify “marginal” or “underutilized” lands for renewable energy generation. However, current limitations to the classification and mapping of these cover types pose obstacles to advancing both science and stewardship opportunities (Hobbs et al., 2009). Shrublands are a very challenging cover class to identify from imagery alone, given the breadth of community types included and the high variability in density and canopy cover that exists within and among those community types (King and Schlossberg, 2014). In practical terms this means that, when relying solely on imagery, shrublands encompass a full gradient from resembling herbaceous or barren land to resembling closed-canopy conditions (Brown et al., 2020). As a result, satellite or aerial imagery-based approaches tend to classify shrubland categories with substantially lower accuracy than other land use and land cover (LULC) classes (Wickham et al., 2021; Brown et al., 2020). A solution for this problem might be to incorporate additional, nonimagery sources of remote sensing data into LULC classification methodologies. LiDAR data collected through airborne laser scanning can provide essential information for identifying low-statured vegetation such as early-successional forests (Falkowski et al., 2009). In combination with imagery, LiDAR data can enable continuous, broad-scale estimation of canopy heights and other structural traits which greatly simplify the task of distinguishing between low-statured and taller closed-canopy cover types (Ruiz et al., 2018). Unfortunately, the cost and logistical challenges of airborne LiDAR collection have constrained its availability to smaller extents and with much longer return intervals than provided by satellite imagery. Yet if canopy structural estimates from airborne LiDAR could be used to label a training dataset in order to fit models using satellite imagery, it should be possible to produce models capable of identifying shrubland with greater accuracy than those trained on imagery alone, while being able to map/ model a larger and more contiguous spatial extent than models relying on airborne LiDAR data as predictors.

4 Prerequisites

297

As such, we undertook a project aiming to use an AI/ML based approach to identify probable “shrubland” areas across New York State, USA, using predictors derived from optical imagery classified as “shrubland” using available aerial LiDAR data. Full details on this project are available in Mahoney et al. (2022a). This chapter uses a subset of the data used in Mahoney et al. (2022a) to walk through our modeling approach and discuss the specific concerns associated with attempting to model a relatively rare land cover class across large regions.

4 Prerequisites This chapter was created using Python version 3.8.13 (Python Core Team, 2022), TensorFlow version 2.9.1 (Abadi et al., 2015; Chollet, 2015), scikit-learn version 1.1.2 (Pedregosa et al., 2011), pandas version 1.4.3 (The Pandas Development Team, 2020; McKinney, 2010), numpy version 1.23.1 (Harris et al., 2020), pydot 1.4.2 (Carrera et al., 2021), matplotlib 3.5.3 (Hunter, 2007), and rasterio 1.3.0 (Gillies et al., 2013). While the code in this chapter may work with other versions, it has not been tested with other configurations, and the code may produce different results. All of the required libraries can be installed using the command: pip install \ tensorflow==2.9.1 \ scikit-learn==1.1.2 \ numpy==1.23.1 \ pandas==1.4.3 \ pydot==1.4.2 \ matplotlib==3.5.3 \ rasterio==1.3.0

This command will install the main libraries we’ll be relying upon, alongside all of the other libraries these need in order to work properly. Because these “dependencies” are installed automatically, this command also installs all of the other libraries we’ll be using throughout this chapter. We’ll be working with a subset of the data used in the original study, published on Zenodo (Mahoney et al., 2022b). The following code can be used to download the data and unpack it in the current working directory. Note that the data is approximately 1.3 gigabytes, and as such can take a while to download over slow connections. import urllib.request from zipfile import ZipFile urllib.request.urlretrieve( "https://zenodo.org/record/6824173/files/data.zip?download=1", "data.zip", ) ZipFile("data.zip", "r").extractall(".")

298

11. AI for shrubland identification and mapping

FIG. 1 A map showing the location of the study area (filled blue polygon) within New York State.

This directory contains a file, 3_county_2014.csv, which contains all the data we’ll use to fit models in this chapter. Each row in this CSV represents a 30-m square “pixel” of land in New York’s lower Hudson River valley (Fig. 1). This study area includes Duchess, Orange and Ulster counties, and has a wide variety of land cover types ranging from highly urbanized areas along the Hudson River to the highly forested, largely protected Catskill Mountains in the western part of the study area. The data only reflects areas that are classified as vegetation based on the US Geological Survey’s Land Change Monitoring, Assessment, and Projection (LCMAP) data set’s primary land cover classification (Brown et al., 2020). As a result, most bodies of water and urban areas are excluded from the data. In order to focus primarily on the modeling process, we’ll be skipping most of the work involved in collecting and processing data. Instead, we’ll use a set of predictors precalculated from Landsat imagery collected between July 1st, 2014, and September 1st, 2014 (Table 1). These predictors were adjusted using the Landtrendr algorithm in order to fill in gaps from clouds and shadows and remove noise from each pixel (Kennedy et al., 2010; Kennedy et al., 2018). More detail on the data retrieval and preprocessing procedures can be found in Mahoney et al. (2022a). These predictors are also included as a TIFF file, 3_county_2014.tiff, projected using the PROJ string: +proj=aea +lat_0=23 +lon_0=-96 +lat_1=29.5 \ +lat_2=45.5 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs

5 Model building

TABLE 1

299

Definitions of predictors used for model fitting.

Raster band name

Definition

TCB, TCW, TCG

Tassled cap brightness (TCB), wetness (TCW), and greenness (TCG), with noise removed using Landtrendr

NBR

Normalized burn ratio (NBR) with noise removed using Landtrendr

MAG, YOD

Magnitude (MAG) and year of most recent disturbance (YOD), as identified using Landtrendr

PRECIP, TMAX, TMIN

30-year normals for precipitation (PRECIP), maximum temperature (TMAX), and minimum temperature (TMIN), derived from annual PRISM climate models

ASPECT, DEM, SLOPE, TWI

Aspect, elevation (DEM), slope, and topographic wetness index (TWI) derived from a 30-m digital elevation model

LCSEC

LCMAP secondary land cover classification

You can open this TIFF file in any GIS software in order to see the actual distribution of each predictor throughout the study area.

5 Model building 5.1 Preprocessing With our libraries installed and our data downloaded, we’re ready to begin! First things first, let’s load all the libraries we’ll be using: # We’ll be working with our data primarily as pandas DataFrames # and converting them to numpy arrays as necessary: import numpy as np import pandas as pd # We’ll set a random number seed # to ensure reproducibility across notebook runs. # # First, set the environment variable ’PYTHONHASHSEED’ to 0: import os os.environ[’PYTHONHASHSEED’]=str(0) # Now, set the random seeds from the ‘random‘ and ‘numpy‘ packages: import random random.seed(123) np.random.seed(123) # We’ll use scikit-learn for normalizing our data, # and for splitting our data into training-validation-testing sets: import sklearn from sklearn.preprocessing import StandardScaler from sklearn.model_selection import train_test_split

300

11. AI for shrubland identification and mapping

# We’ll use Keras, installed as part of Tensorflow, for model fitting: import tensorflow as tf from tensorflow import keras # Set one more random seed value, this time from Tensorflow itself: tf.random.set_seed(123) # Initialize our GPUs to use memory growth, in order to work better # when multiple jobs are using the same GPU gpu_devices = tf.config.experimental.list_physical_devices("GPU") for device in gpu_devices: tf.config.experimental.set_memory_growth(device, True)

With our environment all ready, we can go ahead and start preprocessing our data into an AI-ready format. Our data has been filtered to only include areas that LCMAP (Brown et al., 2020) has identified as being vegetated. As a result, all of our data represents areas that were classified by LCMAP as being either agricultural, forestland, herbaceous/grassland, or wetland areas. Let’s load that data into our session now, and then pop the label column (named shrub) out as its own object: full_data = pd.read_csv("data/3_county.csv") full_labels = full_data.pop("shrub")

If all has gone as planned, full_data should now be a data frame with roughly 7 million observations of 17 separate features. These features are primarily variables derived from Landsat imagery, but also include the X and Y coordinate positions of each pixel of the raster, and the LCMAP secondary land cover classification (stored as lcsec_2014; Table 1): print(full_data.iloc[:, 0:6].head(n=5)) x

y

lcsec_2014 tcb_2014

tcw_2014

tcg_2014

0 1

1788750 2339280 4 1788780 2339280 3

2995 2759

-638 -487

2642 2437

2

1788480 2339250 1

2655

-634

2259

3

1788510 2339250 4

2908

-720

2253

4

1788720 2339250 4

3153

-698

2726

In the 3_county file, that “LCSEC” feature is stored as a single categorical variable, meaning the variable’s value can only be chosen from a group of candidate values, with each different value indicating a different land cover class. However, ML models cannot directly digest categorical variables as ML models tend to output continuous values. In order for ML models to make use of this information, we need to encode this single categorical variable into a number of Boolean indicator variables, transposing our single categorical variable into a set of Booleans. We can do this using the get_dummies() function from pandas:

301

5 Model building

full_data = pd.concat( [ full_data, pd.get_dummies(full_data["lcsec_2014"], prefix="lcsec", drop_first=True), ], axis=1, ) print(full_data.iloc[:, 16:23].head(n=5)) lcsec_2

lcsec_3

lcsec_4

lcsec_5

lcsec_6

lcsec_8

0

0

0

1

0

0

0

1 2

0 0

1 0

0 0

0 0

0 0

0 0

3

0

0

1

0

0

0

4

0

0

1

0

0

0

With the “LCSEC” categorical variable encoded, the next step is to drop the variables we don’t intend to use in our model. Specifically, we’re going to drop the lcsec_2014 column, as it’s now encoded as a number of Boolean variables. We’ll also be dropping the x and y coordinate variables: full_features = full_data.drop(["lcsec_2014", "x", "y"], axis=1) print(full_features.iloc[:, 0:6].head(n=5))

0

tcb_2014 2995

tcw_2014 -638

tcg_2014 2642

nbr_2014 737

mag_2014 0

yod_2014 0

1

2759

-487

2437

750

0

0

2

2655

-634

2259

708

0

0

3

2908

-720

2253

678

0

0

4

3153

-698

2726

742

0

0

Now that we have the full set of variables we intend to use to fit our ML models, it’s time to split our data to create a “hold-out” test set that we’ll use to assess our final model. Normally 20% of the data will be allocated to the test set, to make sure that there are enough observations to assess our final models while also leaving enough data behind to train our neural net. We’ll use the train_test_split() function from scikit-learn to create these splits: [train_features, test_features, train_labels, test_labels] = train_test_split( full_features, full_labels, test_size=0.2, random_state=123, stratify=full_labels )

That test set will be used to calculate the our final performance metrics after our ML model training finished. We’ll still want to assess all the intermediate models produced during model training! To do so, we’ll need to split our data one more time to create a “validation set,” which will be used to evaluate models and get out-of-sample performance estimates before we’re ready to use our final test set. Just like before, we’ll use train_test_split() to take 20% of the remaining training set to produce our validation set:

302

11. AI for shrubland identification and mapping

[ train_features, validation_features, train_labels, validation_labels, ] = train_test_split( train_features, train_labels, test_size=0.2, random_state=123, stratify=train_labels )

Now that we’ve split our data into training, validation, and testing sets, it’s time for the last bit of preprocessing before we start fitting our models. First, we’ll need to convert our data into numpy arrays, a data format that Keras will automatically understand and work with: train_features = np.array(train_features) validation_features = np.array(validation_features) test_features = np.array(test_features)

Secondly, we’ll need to standardize our data so that all of the input variables have zero mean and unit variance. To do this, we’ll use the StandardScaler() function from scikit-learn. We’ll initialize the rescaler on our training data, to calculate the mean and standard deviation of each feature using the training data alone: scaler = StandardScaler() train_features = scaler.fit_transform(train_features)

And then we’ll use the same rescaler to transform our validation and test data. It’s very important to make sure you’re not including your evaluation data when fitting the rescaler, as this is a form of “data leakage” which makes your final model evaluation not truly independent from the model fitting process, meaning your reported accuracy might be too optimistic when compared to the model’s real-world performance. validation_features = scaler.transform(validation_features) test_features = scaler.transform(test_features)

We’re also going to go ahead and transform our numpy arrays into TensorFlow datasets, which will make fitting and evaluating our models more efficient. By batching our data and setting it to prefetch future batches, we’ll speed up the modeling process: train_ds = ( tf.data.Dataset.from_tensor_slices((train_features, train_labels)) .batch(64) .prefetch(2) ) val_ds = ( tf.data.Dataset.from_tensor_slices((validation_features, validation_labels)) .batch(64) .prefetch(2) )

5 Model building

303

test_ds = ( tf.data.Dataset.from_tensor_slices((test_features, test_labels)) .batch(64) .prefetch(2) )

The final preparation step is going to be determining class weights for tuning our model. Our data is extremely imbalanced, because most of New York State is not shrubland—our “positive” shrubland class is much, much smaller than the “negative” not-shrubland class: neg, pos = np.bincount(train_labels) total = neg + pos print( "Examples:\n Total: {}\n Positive: {} ({:.2f}% of total)\n".format( total, pos, 100 * pos / total ) ) Examples: Total: 4585730 Positive: 70213 (1.53% of total)

Because shrubland only constitutes about 1.5% of all observations, our model could achieve 98.5% accuracy by never predicting shrubland! As such, we need to do something to make our model “care” more about our positive shrubland class, in order to make sure the model is attempting to predict both classes. There are several different approaches we could take to balance our classes, including resampling or downsampling so that our training data had the same number of observations in each class. However, when possible it’s often more efficient to set the weights of each class in your model, to make the model “care” more about getting the right answer on your less prevalent class. Here, we’ll calculate how much we need to adjust the class weights so that the positive shrubland class is as important as the negative not-shrubland class in the ML model’s decision making: class_weight = { 0: (1 / neg) * (total / 2.0), 1: (1 / pos) * (total / 2.0) } class_weight {0: 0.5077746357726036, 1: 32.65584720778204}

Notice that we calculated these class weights using only our training data labels. Just as with rescaling your data, including your evaluation data when calculating class weights can be a form of data leakage which makes your evaluation data nonindependent and your reported accuracy metrics too optimistic.

304

11. AI for shrubland identification and mapping

5.2 Model fitting With the training data rescaled and class weights calculated, we’re ready to get into the modeling process. Simple models can give stable decent performance at lower cost, while the fancy more cutting-edge models can do better in accuracy but have a toll on costs and complexity. We’ll be fitting a relatively straightforward feedforward neural net, using the Keras module of the TensorFlow library (Chollet, 2015; Abadi et al., 2015; LeCun et al., 2015). In order to assess how well our models perform, we’ll need to calculate several different metrics. Classification models can be tricky to assess, however, because your priorities and targets for a model determine what makes a model “better.” Even the metrics you use to assess your model may vary as a result of your modeling goals. For instance, models of rare but highly-important events—such as models for detecting credit card fraud, or screening for diseases—might prioritize catching as many “positive” cases as possible, even if that means increasing the number of “false positive” classifications from the model. Other models, however, might have the exact opposite preference; for instance, a model predicting what stocks might be good investment decisions might prefer to only predict “sure bets,” and would be willing to produce more false negatives in order to avoid spending money on bad investments. In our situation, where we’re modeling a relatively rare land cover classification, we’re willing to accept some false positives to make sure that we’re capturing as much shrubland as possible. At the same time, we want to make sure that our shrubland predictions are as precise as possible, so that when our model classifies a pixel as “shrubland” we can be decently sure it is truly a shrubland pixel. As such, we’ll calculate a number of different model metrics, but are going to focus in particular on precision and PRC, the area under the precision-recall curve, as the way we interpret and understand our results. Let’s go ahead and define all the metrics we want to calculate when evaluating our models: metrics = [ keras.metrics.TruePositives(name="True_Positives"), keras.metrics.FalsePositives(name="False_Positives"), keras.metrics.TrueNegatives(name="True_Negatives"), keras.metrics.FalseNegatives(name="False_Negatives"), keras.metrics.BinaryAccuracy(name="Binary_Accuracy"), keras.metrics.Precision(name="Precision"), keras.metrics.Recall(name="Recall"), keras.metrics.AUC(name="AUC"), keras.metrics.AUC(name="PRC", curve="PR"), ]

With our metrics defined, our next step is to create the actual model structure. For the purposes of this chapter, we’ll use a straightforward feedforward neural network, with a total of six densely connected layers and a single dropout layer. We can define the model using Keras’ sequential API like so: def make_model(metrics=metrics): # Create a model object using the sequential API:

5 Model building

305

model = keras.Sequential( [ # Add a dense layer, using: # + 256 neurons # + The rectified linear unit ("ReLU") activation function # + An input layer with the same number of neurons # as predictors in our training data keras.layers.Dense( 256, activation="relu", input_shape=(train_features.shape[-1],) ), # Add another dense layer, this time with 128 neurons: keras.layers.Dense( 128, activation="relu"), # Another with 64: keras.layers.Dense( 64, activation="relu"), # Another with 32: keras.layers.Dense( 32, activation="relu"), # And another with 16: keras.layers.Dense( 16, activation="relu"), # Add a dropout layer. # This will randomly set 20% of the inputs -# -- that is, the outputs from the last dense layer -# to 0, which helps protect against overfitting keras.layers.Dropout(0.2), # Finally, add a dense layer with a single neuron, # using the "sigmoid" activation function # # This will produce our final probability predictions keras.layers.Dense(1, activation="sigmoid"), ] ) model.compile( # Use "Adaptive Moment Estimation" optimization to tune weights # # This algorithm will adjust the weights of each neuron in # the network every epoch, attempting to optimize the loss function # defined in the next argument # # While the details are somewhat complicated, for applied purposes # it’s often practical to just use the Adam optimizer with a rather # small "learning rate" to achieve a decent model

306

11. AI for shrubland identification and mapping

optimizer=keras.optimizers.Adam(learning_rate=1e-3), # Use the most standard loss for binary classification models, # binary cross-entropy, to judge how well our model is doing. # # Lower cross-entropy values are better. loss=keras.losses.BinaryCrossentropy(name="Binary_Cross_Entropy"), # In addition to calculating our cross-entropy loss at each step # and adjusting our model weights, we’ll also ask Keras to calculate # the metrics we defined earlier for every epoch metrics=metrics, ) return model

The bulk of this model is made up of “dense” layers, which are made up of some number of “neurons” (between 256 and 16). Each of those neurons takes the results from every neuron in the previous layer as input, and transforms them using a standard formula: input  kernel + bias

(1)

Where kernel is a weights matrix created by each layer, and bias is a bias vector created by each layer. The output of this formula is then run through an “activation function” in order to get the final output from each neuron. In this case, almost all of our layers are using the “relu” activation function, which stands for “rectified linear unit.” For a given value x, this function returns x if x is positive and 0 otherwise: max ð0, xÞ

(2)

The outputs from this activation function are then provided as inputs to every neuron in the next layer of the neural net. In addition to these densely connected layers, this neural net also has a dropout layer at the end of the net. This layer takes the inputs from all the neurons in the previous layer and randomly sets 20% of them to 0, reducing the model’s ability to overfit on the training data. Last but not least, the results from that dropout layer are passed as inputs to our final dense layer. Unlike the other dense layers, this layer—which we refer to as the “output” layer—is only going to generate a single result, which will be our predicted probability. Because we want to predict probability, which ranges from 0 to 1, we want to make sure our output layer will only predict probabilities between 0 and 1. In order to do so, we use the “sigmoid” activation function, which transforms a given input x via the formula: 1 1 + ex

(3)

This activation function will force our predictions to fall between 0 and 1 as desired. Because this layer only has a single neuron, it will generate a single output; this is how we’ll generate predictions for each observation in our data and eventually for every pixel in our map. All told, our model looks something like the schematic in Fig. 2, with a single input layer, a number of densely connected “hidden” layers, a dropout layer, and finally the single neuron output layer which will generate our final predictions.

5 Model building

307

FIG. 2 A diagram of a feedforward neural network using densely connected layers.

We can also visualize this specific model, using the plot_model() function from keras. utils. This function will give us the schematic in Fig. 3, showing the type of layers we’re using

(either “InputLayer,” “Dense,” or “Dropout”), the activation function in use (either “relu” or “sigmoid”), the number of inputs to each neuron in the layer, and the number of outputs generated by the layer. shrubland_model = make_model() keras.utils.plot_model( shrubland_model, show_shapes=True, show_layer_activations=True, dpi=1200 )

As we’re fitting a rather deep neural network against rather simple structured data, we need to be careful to avoid overfitting while we train the model. As a result, we should define a way to stop our training process early once we stop seeing improved accuracy against the validation data set. We can use the EarlyStopping() function to enforce this behavior, so that we’ll cut the training process short once we stop seeing improvements in PRC against the validation data: early_stopping = tf.keras.callbacks.EarlyStopping( monitor="val_PRC", verbose=1, patience=10, mode="max", restore_best_weights=True )

And now we’re ready to fit the model! Because we’re using early stopping, we can set the number of epochs to use extremely high, as we’ll automatically use the most successful iteration for our final model. We’ll also make sure to use the class weights we defined earlier:

308

11. AI for shrubland identification and mapping

FIG. 3 A schematic showing the structure of our neural network.

5 Model building

309

resampled_history = shrubland_model.fit( train_ds, steps_per_epoch=20, epochs=1000, callbacks=[early_stopping], validation_data=(val_ds), class_weight=class_weight, verbose=0, ) Restoring model weights from the end of the best epoch: 49. Epoch 59: early stopping

It appears that our model’s PRC score stops improving after 49 epochs, which due to our “patience” value of 10 causes our early stopping rules to kick in after epoch 59. We can visualize this process by plotting the PRC values from each epoch of model training, using the resampled_history object returned from the fitting process: import matplotlib.pyplot as plt plt.plot(resampled_history.history["PRC"], label="PRC (training data)") plt.plot(resampled_history.history["val_PRC"], label="PRC (validation data)") plt.ylabel("Metric value") plt.xlabel("Epoch number") plt.legend(loc="upper left") plt.show()

This graph (Fig. 4) provides more information about the model fitting process than simply knowing when our early stopping rules kicked in. It appears that, even though our highest PRC score was achieved after 49 epochs, we might have achieved even higher PRC values had early stopping not kicked in. For the purposes of this tutorial, we’re going to continue using the model produced after 49 epochs. Later, as part of the Assignment section of the chapter, you might want to try other parameters in the early stopping function to see if you can improve the performance of the model.

5.3 Model evaluation And just like that, we have a neural net trained to identify shrubland! Now that our model is fully trained (after 49 epochs), the next step is to evaluate it against our hold-out test data frame. We can use the evaluate() method of our model object to do so: results = shrubland_model.evaluate(test_ds, verbose=0) for name, value in zip(shrubland_model.metrics_names, results): print(name, ": ", value) print() loss : 0.4313044250011444 True_Positives : 19575.0 False_Positives : 424301.0

310

11. AI for shrubland identification and mapping

FIG. 4 Precision-recall curve (PRC) at each epoch of model training. Higher PRC values indicate a better classifier. True_Negatives : 986799.0 False_Negatives : 2366.0 Binary_Accuracy : 0.7022646069526672 Precision : 0.04410015419125557 Recall : 0.8921653628349304 AUC : 0.8874118328094482 PRC : 0.1810697615146637

We’ll discuss these results in more detail in the Discussion (Section 6). For now, though, make note of how high our model’s AUC (area under the ROC curve) is, compared to its PRC (area under the precision-recall curve) and precision. Last but not least, it’s time for us to visualize our predictions to get a sense of where our model believes we’re most likely to find shrublands. In order to map our results, we need to first generate a prediction for each cell in our raster. To do that, we need to first preprocess our full data set in the same way as our training and test data by rescaling it and transforming it to a TensorFlow dataset: full_array = np.array(full_features) full_array = scaler.transform(full_array) full_ds = tf.data.Dataset.from_tensor_slices((full_array, full_labels)).cache() full_ds = full_ds.batch(64).prefetch(2)

We can then generate a prediction for each cell of our raster using our model’s predict() method: predictions = pd.DataFrame(shrubland_model.predict(full_ds, verbose=0)) full_data = pd.concat( [full_data, predictions], axis=1, )

5 Model building

311

Now all that’s left is to save our predictions out as a raster file, so that we can visualize them in our favorite GIS tool. In order to save space, we’ll only save our X and Y coordinates and predictions in the output raster, producing a simple XYZ raster file: location_predictions = full_data[["x", "y", 0]] location_predictions.columns = ["x", "y", "z"]

We’ll use rasterio in order to save this out as a raster file that GIS tools will understand. Let’s import it (and its dependency, affine) now: import rasterio from affine import Affine

A raster file is effectively an array of values, with each cell’s X and Y position in the array corresponding to its X and Y position in space. As such, in order to create a raster file we must first transpose our one-dimensional column of predictions into a two-dimensional array. Our first step in this process is to find the corners of our data’s bounding box: xmin = location_predictions["x"].min() xmax = location_predictions["x"].max() ymin = location_predictions["y"].min() ymax = location_predictions["y"].max()

We then need to identify the cell positions of each pixel in our data set: # Resolution of our Landsat-derived predictors: # Each observation represents a 30-meter square "pixel" of the map pixel_size = 30 # Identify the X and Y values for each pixel in our output raster xv = pd.Series(np.arange(xmin, xmax + pixel_size, pixel_size)) yv = pd.Series(np.arange(ymin, ymax + pixel_size, pixel_size)[::-1]) # Get the X and Y cell indices for each of these pixels xi = pd.Series(xv.index.values, index=xv) yi = pd.Series(yv.index.values, index=yv)

And we’ll then use those positions to create an empty array, which we’ll then fill in with our predicted values: # Create an empty array of the proper size for our data: nodata = -9999.0 zv = np.ones((len(yi), len(xi)), np.float32) * nodata # Fill in the array with our predicted values, wherever they exist: zv[ yi[location_predictions["y"]].values, xi[location_predictions["x"]].values ] = location_predictions["z"]

312

11. AI for shrubland identification and mapping

And just like that, we’ve transformed our single-dimension prediction vector into a twodimensional array. All that remains is to translate that array from a numpy array into a raster file. We’ll first define a transformation, to give rasterio instructions on how much area each of our array cells should represent: transform = Affine(pixel_size, 0, xmin, 0, -pixel_size, ymax) * Affine.translation( -0.5, -0.5 )

And then lastly we’ll use rasterio and this transformation to actually write our values out to a GeoTIFF file: # This is the PROJ string for the raster data used in this study # It represents how to associate the X and Y coordinates with real world data projection = "+proj=aea +lat_0=23 +lon_0=-96 +lat_1=29.5 +lat_2=45.5" projection = projection + " +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs" with rasterio.open( # Our output file name "predictions.tiff", # What mode to open the file in – here, write mode "w", # What driver to use to write our file "GTiff", # Number of columns to write len(xi), # Number of rows to write len(yi), # How many "bands" to write 1, projection, # The transformation created above transform, # What data type to save as rasterio.float32, # What value indicates a missing value nodata, ) as ds: ds.write(zv.astype(np.float32), 1)

Once we’ve saved this GeoTIFF file, we can visualize it in any GIS program to see where our model predicts shrubland is located (Fig. 5).

6 Discussion So we’ve fit our models, predicted our data, and made a map of the results. But what do our results actually mean, both for our ability to identify shrubland and for how we understand model performance?

6 Discussion

313

FIG. 5 A map of predicted probability of shrubland occurrence across New York’s lower Hudson River valley, including Duchess, Orange and Ulster counties.

A lot of researchers are immediately drawn to the best performance metrics of the model— in this case, likely our fantastic AUC statistic. However, remember that our data is severely imbalanced, with only 1.5% of the training data representing shrublands. AUC is a measure of how well your model performs at the “pairing test”—that is, it represents how well our model would do at classifying two observations if one was guaranteed to represent shrubland and

314

11. AI for shrubland identification and mapping

one was guaranteed to not (Hand, 2009). In that situation, our model would give the right results 89% of the time, which makes it a highly effective classifier. However, given that only 1.5% of our region is shrubland, the scenario described by AUC isn’t a great representation of how our model actually performs on the ground. More interesting are our model’s recall and precision scores. Our recall score—that is, the proportion of actual “true” shrublands which the model calls shrubland—is extremely high for this model. Our model produces very few false negative predictions. However, our precision—the proportion of shrubland predictions which actually reflect “true” shrubland— is much lower, as we have a very high number of false positives. Depending on our goals for this model, this may be desirable; if our aim is to identify the majority of shrubland across the state, we may accept these false positives as a necessary drawback of that goal. However, if our goal is to produce the most accurate map of shrubland locations possible, for instance to try and choose sites for fieldwork in shrubland regions, we might want a higher precision in exchange for a lower recall value. Because our model predicts the probability that an observation represents shrubland, and not just the class, we can use different classification thresholds to balance recall and precision according to our tastes. For instance, we can see a large increase in model precision if we require a predicted probability of more than 90% before we classify an observation as shrubland: test_predictions = pd.DataFrame(shrubland_model.predict(test_ds, verbose=0)) import statistics statistics.mean(test_labels.loc[np.array(test_predictions[0] > 0.9)]) 0.2394189044714169

And an even bigger improvement if we require a probability of at least 95%: statistics.mean(test_labels.loc[np.array(test_predictions[0] > 0.95)]) 0.5160095263297169

Of course, this improved precision comes at the cost of a decrease in recall. This is a common trade-off with classification models: decreasing the number of false positives also decreases the number of true positives predicted by a given model. The appropriate threshold to use when making predictions for any classifier will be dependent upon the relative costs of false negative and false positive predictions for your use case. Perhaps more interesting than the specific probability predictions is the spatial arrangement of predictions across our study area (Fig. 5). Generally speaking, it appears like our model is expecting shrubland to be more dominant in areas along road networks and rivers (which are white in the map, as they were excluded from our input data set)—which makes a lot of sense, as these are the areas more likely to have been recently impacted by humans. In this way, mapping the results of a predictive model can help us to understand the patterns and processes happening across the landscape, even without the use of an inferential or causal framework. Being able to visualize what areas are more likely to be shrubland, in this scenario, can help us generate hypotheses for why shrubland occurs where it does and perhaps even suggest future areas for inferential investigation.

9 Open questions

315

7 Summary This chapter provided a step-by-step walk through of the process for producing models of a rare land-cover class, using a case study attempting to identify shrublands across a region in New York State. Due to the rarity of shrublands in this region, specific attention was paid to how to model imbalanced classes and how to measure model performance with specific objectives for the model. While our model was better at identifying shrubland than random chance alone (with a precision multiple times greater than the 1.5% “base rate” of all pixels being shrubland), the rarity of this land cover class means that the model’s precision is rather low in absolute terms. As higher predicted probabilities of shrubland are, as expected, more likely to represent actual shrubland areas, adjusting the classification threshold to require higher probabilities can help to improve model performance. More generally, this chapter focused on the difficulties of modeling rare events, and approaches that can be used in this common situation. It is frequently true that rare events and abnormalities are more scientifically interesting than the baseline case, and as such it is important to be able to model and predict these situations. By being able to assign class weights and thinking carefully about model performance metrics, we’re able to apply most modeling tools to this common type of problem.

8 Assignment • Try altering the architecture of the neural network—remove layers, change the number of nodes, alter the early stopping callback, and generally play with the form of the model. Can you out-perform the model from the chapter? • What happens if you use a different metric for early stopping? Can you optimize for a different performance metric? • What happens if you change the class weights to more strongly emphasize shrublands? To de-emphasize them? What metrics are impacted the most?

9 Open questions There remain some clear future directions for this model: • Could additional predictors (derived from Landsat imagery or other remote sensing data sources) improve predictive accuracy? • Could this model be used to track the development of shrubland areas over time, in order to monitor the abundance and distribution of this land cover type? • Will the reported performance statistics remain stable as the model is used to extrapolate into other regions of New York? Into other regions of the country? • Could a similar approach be used to track other novel land cover classes, or a finer gradation of land cover types than is usually modeled in LULC studies? • Could more complex models, such as convolutional neural networks, achieve higher accuracy against this data set?

316

11. AI for shrubland identification and mapping

References Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., et al., 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/. Brown, J.F., Tollerud, H.J., Barber, C.P., Zhou, Q., Dwyer, J.L., Vogelmann, J.E., Loveland, T.R., et al., 2020. Lessons learned implementing an operational continuous United States National Land Change Monitoring Capability: the land change monitoring, assessment, and projection (LCMAP) approach. Remote Sens. Environ. 238, 111356. https://doi.org/10.1016/j.rse.2019.111356. Carrera, E., Nowee, P., Kalinowski, S., 2021. Pydot. https://github.com/pydot/pydot. Chollet, F., 2015. Keras. https://keras.io. Cramer, V.A., Hobbs, R.J., Standish, R.J., 2008. What’s new about old fields? Land abandonment and ecosystem assembly. Trends Ecol. Evol. 23 (2), 104–112. https://doi.org/10.1016/j.tree.2007.10.005. Falkowski, M.J., Evans, J.S., Martinuzzi, S., Gessler, P.E., Hudak, A.T., 2009. Characterizing forest succession with Lidar data: an evaluation for the Inland Northwest, USA. Remote Sens. Environ. 113 (5), 946–956. https://doi. org/10.1016/j.rse.2009.01.003. Fargione, J.E., Bassett, S., Boucher, T., Bridgham, S.D., Conant, R.T., Cook-Patton, S.C., Ellis, P.W., et al., 2018. Natural climate solutions for the United States. Sci. Adv. 4 (11), eaat1869. https://doi.org/10.1126/sciadv.aat1869. Foster, D.R., Motzkin, G., Slater, B., 1998. Land-use history as long-term broad-scale disturbance: regional forest dynamics in Central New England. Ecosystems 1 (1), 96–119. https://doi.org/10.1007/s100219900008. Gillies, S., et al., 2013. Rasterio: Geospatial Raster I/O for Python Programmers. Mapbox. https://github.com/ rasterio/rasterio. Hand, D.J., 2009. Measuring classifier performance: a coherent alternative to the area under the ROC curve. Mach. Learn. 77, 103–123. Harris, C.R., Jarrod Millman, K., van der Walt, S.J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., et al., 2020. Array programming with NumPy. Nature 585 (7825), 357–362. https://doi.org/10.1038/s41586-020-2649-2. Hobbs, R.J., Higgs, E., Harris, J.A., 2009. Novel ecosystems: implications for conservation and restoration. Trends Ecol. Evol. 24 (11), 599–605. https://doi.org/10.1016/j.tree.2009.05.012. Hunter, J.D., 2007. Matplotlib: a 2d graphics environment. Comput. Sci. Eng. 9 (3), 90–95. https://doi.org/10.1109/ MCSE.2007.55. Kennedy, R.E., Yang, Z., Cohen, W.B., 2010. Detecting trends in forest disturbance and recovery using yearly Landsat time series. 1. LandTrendr temporal segmentation algorithms. Remote Sens. Environ. 114 (12), 2897–2910. https:// doi.org/10.1016/j.rse.2010.07.008. Kennedy, R.E., Yang, Z., Gorelick, N., Braaten, J., Cavalcante, L., Cohen, W.B., Healey, S., 2018. Implementation of the LandTrendr algorithm on Google earth engine. Remote Sens. 10 (5). https://doi.org/10.3390/rs10050691. King, D.I., Schlossberg, S., 2014. Synthesis of the conservation value of the early-successional stage in forests of Eastern North America. For. Ecol. Manag. 324, 186–195. https://doi.org/10.1016/j.foreco.2013.12.001. LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444. https://doi.org/10.1038/nature14539. Mahoney, M.J., Johnson, L.K., Guinan, A.Z., Beier, C.M., 2022a. Classification and mapping of low-statured shrubland cover types in post-agricultural landscapes of the US Northeast. Int. J. Remote Sens. 43 (19-24), 7117–7138. https:// doi.org/10.1080/01431161.2022.2155086. Mahoney, M.J., Johnson, L.K., Beier, C.M., 2022b. Data for: AI for Shrubland Identification and Mapping (in AI For Earth Science). Zenodo, https://doi.org/10.5281/zenodo.6824173. McKinney, W., 2010. Data structures for statistical computing in Python. In: van der Walt, S., Millman, J. (Eds.), Proceedings of the 9th Python in Science Conference, pp. 56–61, https://doi.org/10.25080/Majora-92bf1922-00a. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., et al., 2011. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830. Python Core Team, 2022. Python: A Dynamic, Open Source Programming Language. Python Software Foundation. https://www.python.org/. ´ ., Recio, J.A., Crespo-Peremarch, P., Sapena, M., 2018. An object-based approach for mapping forest strucRuiz, L.A tural types based on low-density LiDAR and multispectral imagery. Geocarto Int. 33 (5), 443–457. https://doi. org/10.1080/10106049.2016.1265595. The Pandas Development Team, 2020. Pandas-Dev/Pandas: Pandas (Version Latest). Zenodo, https://doi.org/ 10.5281/zenodo.3509134. Wickham, J., Stehman, S.V., Sorenson, D.G., Gass, L., Dewitz, J.A., 2021. Thematic accuracy assessment of the NLCD 2016 land cover for the conterminous United States. Remote Sens. Environ. 257, 112357. https://doi.org/10.1016/ j.rse.2021.112357.

C H A P T E R

12 Explainable AI for understanding ML-derived vegetation products Geetha Satya Mounika Ganji and Wai Hang Chow Lin KBRWyle, Inc., Sioux Falls, SD, United States

1 Introduction In this chapter, we will examine how using machine learning (ML) will improve our understanding about the vegetation cover types across the United States. We will first exemplify the use of ML techniques in traditional land monitoring projects, and then we will focus on discussing the concept and methods of explainable artificial intelligence, or short for XAI, and use it to explain ML models such as Random Forest Classifiers. These examples use the No Riparian shrub dataset from the LANDFIRE (Lundberg, 2021a). The project setting is operation oriented, as many users employ the LANDFIRE datasets as one of the essential data sources in their application. Artificial intelligence (AI)/ML is expected to improve existing datasets, or add data products with new insights or undiscovered information that will be helpful in wildfire management, or any ecological research and policy making. Here, we will use ML to generate a shrub map with available LANDFIRE data bands such as Landsat imagery, surface temperature, biophysical variables, etc., as inputs. Then, XAI will be used to explain the decision the ML model made on those pixels and why it classified them into certain categories. With the explanation, we could understand the reasoning within the ML model, and also spot potential bad quality data inside the used input datasets, or the source of uncertainties or bad predictions so we can further refine the ML model to be more reliable. The focus of this research is on the results of XAI, which is an important concern when researchers consider releasing the ML-derived products like shrub maps as a public data source which can be used by the entire community. We will first use ML to generate a prototype shrub map, and then experiment multiple XAI methods on it to evaluate and back-trace the reasoning path or the correlation between input features and the final decisions. Eventually, we will achieve understanding of the existing XAI methods, their

Artificial Intelligence in Earth Science https://doi.org/10.1016/B978-0-323-91737-7.00008-6

317

Copyright # 2023 Elsevier Inc. All rights reserved.

318

12. Explainable AI for understanding ML-derived vegetation products

advantages and disadvantages, and the actual insights they can deliver from the current ML results, and how we can leverage them to improve ML respectively.

2 Background If you are an ecologist or wildfire researcher, you might be familiar with the LANDFIRE datasets. The LANDFIRE Prototype project (Rollins and Frame, 2006) is conducted and sustained by various government agencies, universities, and private institutions, including U.S. Department of Agriculture Forest Service, Rocky Mountain Research Station, Missoula Fire Sciences Laboratory, U.S. Department of the Interior Geological Survey, USGS Center for Earth Resources Observation and Science (EROS). The data products from the LANDFIRE project have been considered as valuable and essential resources for scientists to study the wildland fuel and fire potentials, while giving accurate and in-advanced advice on planning, implementing, and monitoring of wildfire prevention and containment. The LANDFIRE national products contain a lot of useful datasets, ranging from vegetation, fuel, fire regime, topography, transportation, and seasonal changes. To keep its promise of producing “scientifically credible, consistent, and standardized spatial data” for fire and land managers (Rollins and Frame, 2006), the project spent tremendous amounts of efforts in collecting ground data, most of which are point-based, manually labeled or captured by deployed calibrated sensors maintained by government agencies’ observation networks. However, the point-based data is not very useful as it is usually missing information for the interest areas at most of the time, so many third-party datasets and methods are involved like satellite imagery, interpolation or ML classification to produce seamless continuous thematic maps of the lands and those variables relevant to hazardous fuel build-up or extreme outliers based on historical monitoring records to capture fire risks in early stage. The historical wildfire incidents are of course on their radar and have been included and fused into the project’s data products. All the data products are openly available for the public to download. Their website also provides an easy-to-use online GIS portal for users to specify area of interests and retrieve the datasets on demand. Fig. 1 shows the interface of the data viewer system. Users can click on the map to check the properties of any place in the United States interactively. LANDFIRE’s vegetation products (released every 2 years since 2012) usually include Vegetation Cover (EVC), Height (EVH), and Type (EVT) data (Picotte et al., 2019). The EVT products can reflect the current distribution of the terrestrial ecological systems classification. A terrestrial ecological system is defined as a group of plant types that tend to cooccur within landscapes with similar ecological processes, substrates, and/or environmental gradients. This classification looks uncertain and ambiguous, but it is necessary to make the vegetation mappable, because the vegetation is never pure in geospatial dimension. The forest area is commonly mixed with other things like shrubs and grass. We can only tell the differences by grouping them into a certain vegetation type, and specify the majority component to make it possible to produce a categorical vegetation map. The current EVT products are mapped using Decision Tree models, together with field-collected data, Landsat satellite imagery, elevation, and biophysical gradient data to represent each of the three existing life forms—tree, shrub, and herbaceous. The Decision Tree model (classification and regression tree, short for CART) achieved a very good accuracy and the products are validated against ground referenced data and proved to have a high classification accuracy.

319

2 Background

FIG. 1 LANDFIRE data viewer portal (https://www.landfire.gov/viewer/).

There is a remap effort in the LANDFIRE project to determine the new baseline for national vegetation conditions. The remap includes a new 81-class map with the National Vegetation Classification, and the overall accuracy for EVT is about 52%, which is a little bit low for certain regions. The dictionary used by the remap results can be found here (https://landfire. gov/DataDictionary/LF2016_LFRDB_DataDictionary.pdf). The dataset used in this research contains 38 features, and 37 of them are used as predictors (inputs for the ML). Most of them are field measurements of vegetation, Landsat imagery, elevation, biophysical gradient, and temperature data. The ML model will only have one output, which is a target/dependent variable (information the machine learns to predict) specifying one value for each pixel from the vegetation type dictionary as the ML output. The following image shows part of the data printed out by Python. The acronyms are explained in the LANDFIRE dictionary document. Table 1 listed some acronyms for your reference. TABLE 1 LANDFIRE remap feature names and explanation (further details can be found in https://landfire. gov/datadictionary/LF2016_LFRDB_DataDictionary.pdf). Code

Name

Explanation

Asp

Aspect

Topographic property, land aspect

Elev

Elevation

Altitude of the place

FalB1

Landsat Fall Band 1

Landsat Band 1 value captured in Fall

SprB1

Landsat Spring Band 1

Landsat Band 1 value captured in Spring

SumB3

Landsat Summer Band 3

Landsat Band 3 value captured in Summer

FalTcb

Landsat Fall Tassel Cap—Brightness

SumTCw

Landsat Fall Tassel Cap—Wetness

320

12. Explainable AI for understanding ML-derived vegetation products

Target variable is “dep,” meaning dependent target variable and its code value signifies a kind of shrub as shown in the following figure.

3 Prerequisites The programming language is Python as well, and this research mostly used the go-to packages which are well known and validated by many other ML applications. The version is just for reference, as there might be minor changes to the function interfaces. Most of the code included in this chapter should work for a broad set of versions and a long period of time, as most of the usages are quite routine and standard (Table 2).

4 Method & technique 4.1 Choosing a machine learning model There are many ML models to select for this task. Every model has their advantages and disadvantages. Some models perform better for image data, some for categorical data,

321

4 Method & technique

TABLE 2 Used tools and version. Python

>3.4

Scikit-learn

>0.18

ELI5

0.10.1

SHAP

0.30.2

Alibi

0.3.2

Anchor

0.0.0.6

continuous data, and some models for sequential data (music, text). In this chapter, to make it easy to apply and evaluate the explainable AI techniques, Random Forest Classifier model, which is a benchmark and standard baseline model, was used. Random Forest’s mechanism is to create a set of decision trees from randomly selected subset of training data and aggregate the votes from different decision trees to decide the final class of the test object. It was chosen as it has been proven good at solving categorical classification problems and is generally more accurate than the traditional single decision tree models.

4.2 Explainable artificial intelligence (XAI) Despite the popularity of AI/ML, it is still an immature field in many areas (Cui et al., 2018). In many cases, AI/ML technique is complex, opaque, nonintuitive, and difficult for people to understand. The nature of some black-box model systems has caused serious doubt among users and the model trustworthiness is falling due to the lack of interpretability and transparency. Interpretability in the ML world is highly desired because it can help humans understand the decisions made by AI/ML and identify actions on the path to inference so human users can determine whether they are worth to be trusted in real world scenarios. For example, in the use case of healthcare, XAI can assist doctors in explaining the diagnosis and how a treatment plan would benefit the patients, which could help patients and doctors build a stronger bond. XAI is a term that is commonly used to refer to the methodology and techniques that are developed to make AI understandable by its human stakeholders, either model developers or end users. It introduces a concept and a suite of ML methods that enable humans to understand, trust, and produce more explainable models while maintaining a comparable level of prediction accuracy. In some cases, XAI is also labeled as FAT ML (fairness, accountability, and transparency in ML). There are generally the two types of XAI techniques: Model-Specific techniques are only suitable for a single model and rely on a model’s operation and capabilities. Model-Agnostic techniques generally involve examining the input/output data distribution of the algorithm used and is utilized for many different types of models. This chapter will use model-agnostic methods, including ELI5 (Explain Like I’m Five), SHAP, ALE, and Anchor for local and global interpretability with the help of LANDFIRE Existing Vegetation Type (EVT) dataset.

322

12. Explainable AI for understanding ML-derived vegetation products

4.3 Local and global interpretability Local interpretability is a method in XAI that allows users to understand and interpret the prediction of a specific datapoint. It describes why the model made specific decisions for a single instance and for a group of instances. Local interpretation chooses an individual datapoint in the dataset and explains the model’s prediction for that datapoint. The example methods for providing local interpretability models examples have ELI5 and Anchor. Global interpretability is a method that allows the user to comprehend and explain the entire model at once. It allows users to build a better picture of the entire model including visualization of the weight distributions, feature importance, and how much each predictor contributes, either positively or negatively, to the target variable. Global interpretation focuses on explaining all features from the dataset and predictions. The example methods for global interpretability models have SHAP and ELI5. It should be noticed that a method is not necessarily only focusing on local or global. It can deliver explanations on both. For instance, SHAP and ELI5 libraries can be used for both local and global interpretability. But in this chapter, SHAP will be used to study the global interpretability and ELI5 will be used to study the local interpretability.

5 Experiment & results 5.1 ELI5 ELI5 is a python library that allows users to visualize and debug various ML models (Mishra, 2022). It provides some easy-to-use function interfaces for visualization and debugging of various ML models. It fully supports several popular ML frameworks including Keras, XGBoost, and scikit-learn that we used in this study. Its advantages include its intuitive workflow on feature selection with permutation importance, extraction and visualization of feature weights and their contribution in the sense of global explanation. ELI5 XAI toolkit can examine a classification model in two ways: 1. How does the model work globally? 2. Why does the model make the decision?

5.2 Implementation ELI5 is mainly for explaining the weights and predictions of the random model for a single datapoint in the results. In this study, we use ELI5 to take in the random forest-derived shrub prediction for all the pixels in the United States, and make it explain why those predictions are made. The following tutorial details the steps we took to feed in the data (mostly csv files), do the calculation, and visualize the ELI5 findings in the output. The source code snippets are associated. You could find more detailed steps in the book’s Github repository: https:// github.com/earth-artificial-intelligence/earth_ai_book_materials. 1. Install package eli5 (https://pypi.org/project/eli5/). pip install eli5

5 Experiment & results

323

2. Import necessary libraries. import pandas as pd import numpy as np from sklearn.model_selection import train_test_split from sklearn.ensemble import RandomForestClassifier

3. Retrieve the Shrub dataset (Lundberg, 2021a) from the LANDFIRE Reference Database LF 2016. data = pd.read_csv('Ecoregion_18_Nvc_cleaned_no_riparian_shrub.csv') data.head()

4. Data preprocessing: a. Replace the predictor variable codes with relevant shrub names. data['dep'] = data[''dep''].replace([6290,6282,6152,6287,6286,6291,6285,6153], ['Intermountain Dwarf Saltbush – Sagebrush Scrub', 'Intermountain Semi-Desert Shrubland & Steppe', 'Southern Rocky Mountain Gambel Oak – Mixed Montane Shrubland', 'Intermountain Mesic Tall Sagebrush Shrubland & Steppe', 'Intermountain Dry Tall Sagebrush Shrubland', 'Intermountain Low & Black Sagebrush Shrubland & Steppe', 'Southern Rocky Mountain-mahogany-Mixed Foothill Shrubland'])

b. Split the dataset into train and test sets. X = data.drop(''dep'', axis = 1) Y = data[''dep''] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

5. Model training: choose a model to train. This is the model for which decisions are explained by ELI5. In our case, let it be Random Forest Classifier as we are dealing with a classification problem. model = RandomForestClassifier(n_estimators = 200, random_state = 101) model.fit(X_train, y_train)

6. Model interpretation: ELI5 uses a technique called Permutation Importance (Altmann et al., 2010) to compute feature importance by measuring how the probability score decreases when a feature is not available or removed from the data. This explains the idea behind the concept of Global Interpretability. In other words, it explains how the model performance changes when a feature from the data is removed. This explains the importance of a particular feature in the data. It takes the trained model and a scoring method which in our case is “accuracy” to plot the feature importance. from eli5 import show_weights from eli5.sklearn import PermutationImportance

324

12. Explainable AI for understanding ML-derived vegetation products

perm = PermutationImportance(model, scoring = 'accuracy', random_state = 101) perm.fit(X_train, y_train) show_weights(perm, feature_names = X_test.columns.tolist())

Next, let us see how ELI5 gives the weight for features of data by considering individual data points from the test set which is the idea behind the concept of Local Interpretability (Fig. 2). The method explain_prediction takes a trained model, a datapoint from the test data and explains the result of that one datapoint showing top n features which drove the model to decide. from eli5 import explain_prediction explain_prediction(model, np.array(X_test[21], feature_names = X.columns.tolist(), top = 7, top_targets = 3))

ELI5 shows the decision of the model by adding up all the positive features along with “BIAS,” and the sum is shown as the probability score (Fig. 3). 5.2.1 Conclusion ELI5 makes it apparent which inputs contributed and their weights in the decisions respectively. Correlated data should not be used as inputs, because those correlated data will have no impacts on the results. In other words, dropping correlated features can still achieve the results without sacrificing accuracy.

Feature Weight tavei 0.1205 ± 0.0119 SprB4 0.0060 ± 0.0022 ppti 0.0049 ± 0.0022 LAE 0.0030 ± 0.0036 Elev 0.0030 ± 0.0032 NDVIMedian 0.0016 ± 0.0020 NDVIMin 0.0014 ± 0.0000 Slpp 0.0014 ± 0.0017 NDVIDiff 0.0005 ± 0.0013 FalB1 0 ± 0.0000 SprB1 0 ± 0.0000 FalB3 0 ± 0.0000 FalTCw 0 ± 0.0000 FalB4 0 ± 0.0000 FalB6 0 ± 0.0000 FalB2 0 ± 0.0000 FalB5 0 ± 0.0000 FalTCb 0 ± 0.0000 FalTCg 0 ± 0.0000 NDVIMax 0 ± 0.0000 ... 17 more ...

FIG. 2 The feature “tavei” has the highest score of 0.1205, which means removing it would have the most significant, detrimental effect on the prediction accuracy of the model, as large as about 0.1. Because the permutation importance method is a random process, we provide the uncertainty value, which can be seen after the plus-minus sign.

5 Experiment & results

325

FIG. 3 For datapoint 21, ELI5 shows how the model predicted the outcome based on given test data as “Intermountain Low & Black Sagebrush Shrubland Steppe,” and to support the decision it showed the probability as 94.5% along with other important features in descending order. It also shows the probability for each class and feature contribution for the probability score. A new feature “BIAS” can be seen; BIAS is the expected average score output given by the model, and this is given based on the distribution of the training data.

5.3 SHAP SHAP (SHapley Additive exPlanations) is a python library (https://github.com/slundberg/ shap; Lundberg, 2021b) that explains the output of any ML model. SHAP is based on the theoretically game optimal Shapley values that were given by Lloyd Shapley for game theory (Meng et al., 2021). This post hoc model explanation technique uses a surrogate model to find Shapley values of all the features. In a multiplayer game setting, multiple players work together to complete a task. Not all players contribute equally to the task. To reward each player, it is essential to compute the contribution of each player. The contribution of each player according to Lloyd Shapley is equivalent to finding Shapley value for that player. With respect to SHAP, it explains the prediction of an instance by calculating the contribution of each existing feature to the prediction. To provide a clear explanation on global interpretation with SHAP, we use a built-in tree explainer method because we use Random Forest Classifier to train the model to visualize feature importance and negative/positive correlation against the LANDFIRE example dataset. 5.3.1 Implementation 1. Install SHAP. pip install shap

2. Import necessary packages. import pandas as pd import numpy as np from sklearn.model_selection import train_test_split from sklearn.ensemble import RandomForestClassifier import shap

3. Load the data, replace predictor variable codes with appropriate shrub names, remove unwanted columns, and split the data into train and test sets in 70:30 ratio using train_test_split method from scikit-learn (Pedregosa et al., 2011). X = data.drop(''dep'', axis = 1) Y = data[''dep''] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

326

12. Explainable AI for understanding ML-derived vegetation products

4. Model training: Choose a model to train. This is the model for which decisions are explained by SHAP. In our case, let it be Random Forest Classifier as we are dealing with classification problems. model = RandomForestClassifier(n_estimators = 200, random_state = 101) model.fit(X_train, y_train).fit(X_train

5. Run the SHAP model by loading the trained model and training dataset. shap.initjs() shap_values = shap.TreeExplainer(model).shap_values(X_train)

6. Visualize the decisions/predictions using different plots of SHAP. a. Summary plot: Provide the trained classification model, training data, and plot type (in this case, bar type) and assign only top n important features where n is an integer that is less than or equal to total number of features. shap.summary_plot(shap_values[2], X_test, plot_type = ''bar'', max_display = 5, color = ''red'')

For example, plot SHAP values for prediction “Intermountain Semi-Desert Shrubland & Steppe” (shap_values[2] in code) (Fig. 4). b. Simplified plot (Kuo, 2019) (Fig. 5): shap_df = pd.DataFrame(shap_values[2]) X_train_1 = df.copy().reset_index().drop('index', axis = 1) Shap_df.columns = X_train.columns #Correlation Determination correlation_list = list() for i in X_train.columns(): corr_def = np.corrcoef(shap_df[i], X_train_1[i])[1][0] correlation_list.append(corr_coef) correlation = pd.concat([pd.Series(feature_list), pd.Series(correlation_list)], axis = 1).fillna(0) correlation_columns = ['Feature_name', 'Correlation'] correlation ['Sign'] = np.where(correlation['Correlation']>0, 'red', 'green') #Correlation Plot shap_plot = np.abs(shap_df) df = pd.DataFrame(shap_plot.mean()).reset_index() df.columns = ['Feature_name', 'SHAP_absolute_value'] df1 = df.merge(corr_df, left_on = 'Feature_name', right_on = 'Feature_name', how = 'inner') df1 = df1.sort_values(by = 'SHAP_absolute_value' ascending = True) ax = df1.plot.barh(x = 'Feature_name', y = 'SHAP_absolute_value', color = df1['Sign'], legend = False) ax.set_xlabel(''SHAP Value by Correlation'')

327

5 Experiment & results

tavei NDVIMedian LAE NDVIMin Slpp

0.000

0.001

0.002

0.003

0.004

0.005

mean(|SHAP value|) (average impact on model output magnitude)

FIG. 4 Summary plot that labels the important features in descending order on the y-axis for the shrub type

Feature_name

“Intermountain Semi-Desert Shrubland & Steppe” and the SHAP value score of each individual feature on the x-axis.

tavei NDVIMedian LAE NDVIMin Slpp NDVIDiff SprB1 SumB3 ppti NDVIMax SumB2 Elev SumB1 SprB6 SumTcb FalB4 SumB4 SprB4 SumTcg SprB2 SumB6 SumB5 FalB1 SprB3 SprTCb SprB5 FalTCw SumTCw FalB3 SprTCg SprTCw FalB6 FalB2 Asp FalTCb FalB5 FalTCg

0.000

0.001

0.002

0.003

0.004

0.005

SHAP Value by Correlation

FIG. 5 Simplified plot shows all the features that have negative/positive correlation with vegetation type “Intermountain Semi-Desert Shrubland & Steppe“on the y-axis, and the x-axis provides the SHAP value score of each feature. Features with green bars are the ones that are negatively correlated, and the red bars represent positive correlation to the prediction of the vegetation type.

328

12. Explainable AI for understanding ML-derived vegetation products

5.3.2 Conclusion The benefits of the SHAP model are the easy-to-use interface, the capability to identify the significant input factors and be able to detect whether it improves or degrades the prediction accuracy and by how much.

5.4 Accumulated local effects (ALE) Accumulated local effects demonstrate the fluctuation of probability of the class with a feature or collection of characteristics in the case of classification (Molnar, 2022); these ML models offer probabilities of distinct classes. This method overcomes the drawback with partial dependence plots, another form of global interpreter, when the features are correlated (Gupta, 2020). Unlike partial dependence plots that marginalize other feature values and average their prediction, ALE considers conditional distribution of other feature values and averages the difference in predictions. 5.4.1 Implementation 1. Install alibi package, which has an explainer and plot for ALE (https://pypi.org/project/ alibi/). pip install alibi

2. Import necessary packages. import pandas as pd import numpy as np from sklearn.model_selection import train_test_split from sklearn.ensemble import RandomForestClassifier from alibi.explainers.ale import ALE, plot_ale

3. Load the data, replace predictor variable codes with appropriate shrub names, remove unwanted columns, and split the data into training and test sets in 70:30 ratio using train_test_split method from scikit-learn (Pedregosa et al., 2011). X = data.drop(''dep'', axis = 1) Y = data[''dep''] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

4. Model training: Choose a model to train. This is the model for which decisions are explained by ALE. In our case, let it be Random Forest Classifier as we are dealing with classification problems (Fig. 6). model = RandomForestClassifier(n_estimators = 200, random_state = 101) model.fit(X_train, y_train)

5. Calculate ALE. probability = model.predict_proba rf_ale = ALE(probability, feature_names = X_train.columns, target_names = ['Intermountain Dwarf Saltbush – Sagebrush Scrub', 'Intermountain Semi-Desert Shrubland & Steppe',

5 Experiment & results

329

FIG. 6

ALE plot shows that for the datapoints whose value for feature “tavei” is high, probability is high that the prediction is “Intermountain Dry Tall Sagebrush Shrubland.” Likewise, when the feature “tavei” value is low, the prediction for “Intermountain Mesic Tall Sagebrush Shrubland & Steppe” is a high probability. The y-axis shows the ALE of the feature “tavei,” and the x-axis shows the values of feature “tavei.”

'Southern Rocky Mountain Gambel Oak – Mixed Montane Shrubland', 'Intermountain Mesic Tall Sagebrush Shrubland & Steppe', 'Intermountain Dry Tall Sagebrush Shrubland', 'Intermountain Low & Black Sagebrush Shrubland & Steppe', 'Southern Rocky Mountain-mahogany-Mixed Foothill Shrubland']) rf_explain_ale = rf_ale(np.array(X_train.values)) plot_ale(rf_explain_ale, features = ['tavei'], fig_kw = {'figwidth':15, 'figheight':10})

5.4.2 Conclusion By partially separating the effects of other features of data, ALE exposed the influence of individual features on the predictions of an ML model. As a result, ALE works well on the data where features are more correlated, making ALE a reliable model where there is a chance that given data have bias. ALE explanation is concentrates on the effect of mean on a given feature, so that the main feature effect is compared to the data’s average prediction.

330

12. Explainable AI for understanding ML-derived vegetation products

5.5 Anchor Anchor is a useful local explanation technique that creates rules around feature values (ODSC—Open Data Science, 2019). These rules “anchor” the prediction; the model will give the same prediction if these rules are satisfied irrespective of changes to any feature values. Sample data are created around an instance of interest. However, instead of surrogate models the resulting explanations are expressed as easy to understand IF-THEN rules called anchors. Rules are conditions defined around the features of the dataset. The rules anchor a prediction—when the rules are satisfied the mode would always predict the same outcome. Anchors have a notion of coverage and precision. “Coverage” refers to the proportion of sample instances, predictions of which the anchor holds, and “precision” implies the extent to which the rules in the anchor are exclusively responsible for the predicted outcome. The anchor library uses reinforcement learning and beam search algorithm to find the best anchor (ODSC—Open Data Science, 2019). 5.5.1 Implementation 1. Install package anchor_exp. It is the full package library that includes all the modules of anchor. The option -user makes sure to install packages in the home directory. It can be excluded if not needed. The “python -m” can be removed if you are certain which pip command you are using. python -m pip install –user anchor_exp

2. Import necessary packages. The utils in the import statement will provide many independent functions that can be used depending on the task. Anchor has different explainers for different formats of data—text, tabular, and images. The anchor_tabular module is used in this chapter. from anchor import utils, anchor_tabular

3. Load the data, replace predictor variable codes with appropriate shrub names, remove unwanted columns, and split the data into train and test sets in 70:30 ratio using train_test_split method from scikit-learn (Pedregosa et al., 2011). X = data.drop(''dep'', axis = 1) Y = data[''dep''] X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

4. Model training: Choose a model to train. This is the model for which decisions are explained by Anchor. In our case, let it be Random Forest Classifier as we are dealing with classification problems. model = RandomForestClassifier(n_estimators = 200, random_state = 101) model.fit(X_train, y_train)

5. Anchor explanation (Figs. 7 and 8) explainer = anchor_tabular.AnchorTabularExplainer( class_names = ['Intermountain Dwarf Saltbush – Sagebrush Scrub',

5 Experiment & results

331

FIG. 7 Anchor explainer shows an example where the prediction of model can be “Intermountain Dry Tall Sagebrush Shrubland” and explanation for the desired datapoint in test data. The result shows that “the AI will predict Intermountain Dry Tall sagebrush Shrubland 96.5% of the time,” if all the conditions listed are true. The listed conditions basically specified the comfortable zone for the AI model to operate. If the input value ranges exceed the condition scope, AI performance will decrease. 'Intermountain Semi-Desert Shrubland & Steppe', 'Southern Rocky Mountain Gambel Oak – Mixed Montane Shrubland', 'Intermountain Mesic Tall Sagebrush Shrubland & Steppe', 'Intermountain Dry Tall Sagebrush Shrubland', 'Intermountain Low & Black Sagebrush Shrubland & Steppe',

332

12. Explainable AI for understanding ML-derived vegetation products

FIG. 8 Anchor explainer gives examples where the prediction of the model can be “Intermountain Dry Tall Sagebrush Shrubland” and cannot be “Intermountain Dry Tall Sagebrush Shrubland” with features and their values.

6 Summary

333

'Southern Rocky Mountain-mahogany-Mixed Foothill Shrubland'], feature_names = X_train.columns.to_list(), train_data = X_train.to_numpy(), categorical_names = {}) def anchor_explainer(x): y = model.predict(x)[0] return np.array([y]) explainer = explainer.explain_instance(X_test.iloc[1].to_numpy(), anchor_explainer, threshold = 0.95) explainer.show_in_notebook()

5.5.2 Conclusion The anchor technique provides a rule-based explanation. Rules are conditions defined around the features of the dataset. The rules anchor a prediction; when the rules are satisfied the mode would always predict the same outcome.

6 Summary There are many use cases of AI technology in the community dataset projects like LANDFIRE. The generated datasets have already been widely used by tens of thousands researchers and policy makers. Normally people will question the accuracy and reliability of AI models which even the model creators don’t have fully understanding due to the black box nature of AI models, e.g., the nonintuitive weights in the neural networks. Although scientists have done a lot of quality checks and validation with ground truth, it is impossible to cover all the edge cases and it is still not straightforward to let people understand the entire picture about how AI makes the prediction exactly. To make AI more explainable in humanunderstandable way, there are many techniques to gain insights on “why” AI behaves like that. This chapter examines various XAI models including ELI5, SHAP, Anchor, ALE, to understand “how model transparency can be maintained” and “how a feature affects the output of the model.” The results show that ELI5 can provide simple explanations to feature importance with quantitative scores for each input variable, but only provide “feature-level” understandings, not instance-level like for single pixel or single date. It is a good choice for simple models and people who are nonexperts to get a swift answer. SHAP can provide both featurelevel and instance-level explanations, with powerful support for highly complicated models like deep neural networks. However, SHAP is also costing more computing power, and could be very slow if the to-be-analyzed dataset is big in spatial temporal scale. Also, the results might not be that obvious for people with limited experiences. SHAP is appropriate choice for people who want deep insights on both feature and instance level. ALE also is capable of providing instance-level insights and show how each input variable could influence the result of a specific pixel, meanwhile also support complex models and relatively efficient and scalable. ALE’s downside includes that it doesn’t emphasize its ability on feature-level and might also be challenging for nonexperts to percept. ALE is very useful to identify local patterns if the dataset is small. The last method, Anchor, supports pixel-level explanations to

334

12. Explainable AI for understanding ML-derived vegetation products

identify the vital features for single cases, and is relatively more customizable. But Anchor is limited to classification problems and has higher requirements for operators to prepare anchors from labeled data. It is also not very comprehensive and might not be able to capture the full complexity of the AI models. Anchor is useful for users who want to customize and build their the AI explaining tools. Overall, each approach has its strengths and weaknesses. Users can choose the appropriate one based on their use cases and the scope of the problems to be addressed.

7 Assignment 1. Before trying Explainable AI tools, analyze the data and verify if it is imbalanced or balanced by utilizing libraries like scikit-learn, seaborn and/or pandas. Research how training data imbalance can affect the results of any ML model and thus the results of explainable AI algorithms 2. Implement other explainable AI models like LIME and cross compare the output results with the findings of this chapter. 3. Make emphasis on the least/most important features from SHAP and ELI5. Re-build the random forest classifier model according to the output and verify if the metric has a positive or negative impact on the metrics.

8 Open questions 1. Verify the ML model performance before and after the parameter tuning phase. 2. Data correlation affects the results of a model, so detecting any data correlation prior to model training phase will save time and ML resources. 3. Usage of hyperparameter tuning gives an optimized model which is important as it might improve the overall accuracy/metrics. 4. Use different ML classifier models and cross compare metrics. 5. All the models and code provided in this chapter were also executed in a Jupyter Notebook hosted in Amazon Web Services (AWS).

9 Lessons learned 1. Before moving on to the model training phase, one must verify whether the data is imbalanced or balanced. 2. Always check the other performance metrics such as recall, f1-score, kappa coeffients that are calculated based on confusion matrix. The overall accuracy could be misleading in some scenarios and higher OA doesn’t always mean the model is good.

Further reading

335

Acknowledgments Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government. The text and language in Sections 1, 2, and 6 are significantly populated and revised by the editors with the permission of the chapter authors.

References Altmann, A., Tolosi, L., Sander, O., Lenqauer, T., 2010. Permutation importance: a corrected feature importance measure. Bioinformatics 26 (10), 1340–1347. https://doi.org/10.1093/bioinformatics/btq134. Cui, L., Yang, S., Chen, F., et al., 2018. A survey on application of machine learning for internet of things. Int. J. Mach. Learn. Cybern. 9, 1399–1417. https://doi.org/10.1007/s13042-018-0834-5. Gupta, A., 2020. Accumulated Local Effects (ALE)—Feature Effects Global Interpretability. https://www. analyticsvidhya.com/blog/2020/10/accumulated-local-effects-ale-feature-effects-global-interpretability/. Kuo, C., 2019. Explain Your Model with the SHAP Values. https://towardsdatascience.com/explain-your-modelwith-the-shap-values-bc36aac4de3d. Lundberg, B., 2021a. LANDFIRE Reference Database—LF 2016 Remap (LF2.0.0) Public Data Dictionary. https:// landfire.gov/DataDictionary/LF2016_LFRDB_DataDictionary.pdf. Lundberg, S., 2021b. SHAP Python Package. https://github.com/slundberg/shap. Meng, Y., Yang, N., Qian, Z., Zhang, G., 2021. What makes an online review more helpful: an interpretation framework using XGBoost and SHAP values. J. Theor. Appl. Electron. Commer. Res. 16 (3), 466–490. https://doi.org/ 10.3390/jtaer16030029. Mishra, P., 2022. Explainability for linear models. In: Practical Explainable AI Using Python. Apress, Berkely, CA, https://doi.org/10.1007/978-1-4842-7158-2_3. Molnar, C., 2022. Interpretable Machine Learning—A Guide for Making Black Box Models Explainable, second ed. https://christophm.github.io/interpretable-ml-book/shap.html. https://christophm.github.io/interpretable-mlbook/ale.html. ODSC—Open Data Science, 2019. Cracking the Box: Interpreting Black Box Machine Learning Models. https://odsc. medium.com/cracking-the-box-interpreting-black-box-machine-learning-models-bc4bdb2b1ed2. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., et al., 2011. Scikit-learn: machine earning in Python. J. Mach. Learn. Res. 12 (85), 2825–2830. https://scikit-learn.org/stable/about.html#citing-scikit-learn. Picotte, J.J., Dockter, D., Long, J., Tolk, B., Davidson, A., Peterson, B., 2019. LANDFIRE remap prototype mapping effort: developing a new framework for mapping vegetation classification, change, and structure. Fire 2 (2), 35. Rollins, M.G., Frame, C.K., 2006. The LANDFIRE Prototype Project: Nationally Consistent and Locally Relevant Geospatial Data for Wildland Fire Management. Gen. Tech. Rep. RMRS-GTR-175, US Department of Agriculture, Forest Service, Rocky Mountain Research Station, Fort Collins, p. 416. 175.

Further reading Gavrilin, Y., 2019. Cracking the Box: Interpreting Black Box Machine Learning Models. https://opendatascience. com/cracking-the-box-interpreting-black-box-machine-learning-models/. Ribeiro, M.T., Singh, S., Guestrin, C., 2018. Anchors: High-Precision Model-Agnostic Explanations. https://homes.cs. washington.edu/marcotcr/aaai18.pdf. Singhal, S., 2021. Visualization on Heart Disease Dataset (XAI). https://www.kaggle.com/smitisinghal/ visualization-on-heart-disease-dataset-xai?scriptVersionId¼73215961.

This page intentionally left blank

C H A P T E R

13 Satellite image classification using quantum machine learning Olawale Ayoadea, Pablo Rivasb,c, Javier Orduzd, and Nurul Rafib a

Department of Physics, College of Arts and Sciences, Baylor University, Waco, TX, United States b Department of Computer Science, School of Engineering and Computer Science, Baylor University, Waco, TX, United States cCenter for Standards and Ethics in Artificial Intelligence, Baylor University, Waco, TX, United States dDepartment of Mathematics and Computer Science, Earlham College, Richmond, IN, United States

1 Introduction As introduced in the previous chapters, machine learning has been used in many domains of Earth sciences. Considering the recent promising progress and popularity of machine learning powered by the rapid developments in computing platforms, the question has occurred to many people that how we can further improve our computing power to make it even better suitable for machine learning models. One new and exciting way of changing the current paradigm of machine learning practice is through quantum information tools. In general, quantum information science investigates how information can be encoded in a quantum system, as well as the associated statistics, limitations, and unique affordances of quantum mechanics. This field lays the groundwork for quantum communications, computing, and sensing. Quantum computing (QC) makes use of the quantum mechanical properties of superposition, interference, and entanglement to perform computations that are comparable to those performed on a classical computer (National Academies of Sciences, Engineering, and Medicine et al., 2019). It has been shown that using quantum techniques in the area of conventional machine learning can produce equivalent outcomes. Such a coupling of QC power and machine learning concepts would significantly advance the field of quantum information science and could result in the evaluation of novel, real-world solutions to ongoing machine learning issues (Schuld et al., 2015).

Artificial Intelligence in Earth Science https://doi.org/10.1016/B978-0-323-91737-7.00013-X

337

Copyright # 2023 Elsevier Inc. All rights reserved.

338

13. Satellite image classification using quantum machine learning

In this section, we introduce the techniques we will use in our quantum machine learning (QML) experiment, and describe our motivation for using it in conducting land cover classification from satellite imagery data.

1.1 Machine learning The term “machine learning” (ML), now a buzzword, is thought to have been coined in 1959 by Arthur Samuel to refer to the field in which computers can learn independently (Alzubi et al., 2018). The general objective of ML is to develop algorithms that can independently learn from complicated and big amounts of training data (Mehta et al., 2019). ML can be either supervised or unsupervised. In supervised learning, the algorithm learns from labels as a supervisor (or teacher). Unsupervised learning, in contrast, uses no labels and is a method of learning from data. Deep learning, which has applications in speech recognition, autonomous driving, image classification, and other areas, is a recent development in ML that has advanced quickly to higher dimensions (Mahesh, 2020). The main goal of the current artificial intelligence research and development is to train a machine to automate a task based on data as quickly as possible. When working with classical data, CPUs and GPUs are used to reduce the time constraint in classical computing. When there is an enormous amount of data; however, a machine requires more time to train, necessitating the use of a more powerful system known as “quantum machine learning.” In this chapter, we want to use QML to recognize the land cover classes from satellite imagery so that we can know about the land cover use without personally going there to investigate on the field.

1.2 Quantum computer and informatics Quantum is a fancy word in the field of Earth system sciences, as most geochemistry and geophysics scientists are studying the mechanism of the atmosphere, plate movements, earthquakes, and volcanoes at molecular level rather than particle level. In this chapter, we won’t talk about quantum physics to solve geophysics problems. Instead, we will look into the current common understanding and practice of QML, and how it might be used in a land cover classification task. The modern practical application (direction) of quantum physics in informatics (storage and processing of data) generally has three areas: quantum communication, QC, and quantum sensing (and metrology) (Ayoade et al., 2022; Biamonte et al., 2017). Each of these groups has its subcategories and specific research endeavors; QML, for example, is a rapidly developing area of quantum computing (QC) (Schuld and Petruccione, 2018). The recent development in this area is basically driven by the increased availability of quantum computers. In this chapter, we are talking about the Google Quantum Computer that uses Sycamore processor, which has up to 54 superconducting qubits (Google AI Quantum and Collaborators et al., 2020). A quantum computer is a device created to use quantum mechanics, a fascinating area of physics, to perform tasks that are impossible for any machine to achieve using only the laws of classical physics (Montanaro, 2016). It is widely accepted that the use of quantum computers could accelerate machine learning to process digital data and enable machine

1 Introduction

339

learning to discover the previously hard-to-learn patterns among variables. It is reasonable to hypothesize that quantum computers may perform better on machine learning tasks than classical computers because quantum systems achieve atypical patterns that traditional systems are thought not to achieve efficiently (Biamonte et al., 2017).

1.3 Quantum machine learning People might have heard about the two terms separately: quantum and machine learning, but maybe never heard about them together in one word. QML means running machine learning models on quantum computers. QML is a field in which researchers intend to address improved performance and optimization issues. It should be able to minimize running time and memory space in terms of complexity theory. The quantum computer is an entirely new computing device that has been added to the machine learning hardware pool by this fast-evolving field. Quantum computers use a set of fundamentally different physical laws known as quantum theory to process data. Like a traditional computer, a quantum computer employs qubits instead of bits as its basic storage unit (National Academies of Sciences, Engineering, and Medicine et al., 2019). A qubit can exist in a state of superposition, allowing a quantum computer to access a hidden realm where exponential computations are possible. To put it another way, the fact that a quantum system can exist in a superposition or linear combination of states allows us to carry out concurrent parallel computations that are not possible on any classical computer, giving QML an advantage over traditional ML. This property enables a quantum computer to carry out parallel computations on a single circuit, which frequently leads to significant speedups (McMahon, 2007). QML can have direct applications in several problems that can be solved with classic ML, including most classification problems. The particular classification problem we focus on is related to detecting vegetation from satellite data, as we discuss next. The hope is that quantum computers can recognize correlation patterns among variables that are exponentially difficult to sample and learn in classical machine learning models.

1.4 Remote sensing (RS) and land cover classification “Remote sensing” stands for the group of techniques that scan the Earth from air or space to monitor the things going on in the atmosphere or on the Earth’s surface. It has been one of the main tools used by scientists to observe and study Earth systems at a large scale with relatively high temporal resolution. Baddock et al. (2011) suggests that RS is a highly valuable tool for detecting the environmental events such as hurricane, forest changes, agricultural crop growth (Sun et al., 2019), dust storms, etc. at both the local and regional scales. Actually, a lot of information shared among the public and government about the environment and climate is derived from satellite imagery. The captured imagery could be optical or hyperspectral (with spectral bands beyond the visible bands), which contains very rich information about the upper Earth surface and the ecosystem we live in. Scientists have widely used them to discover insights and trends on our planet and guide us to make Earth a better place in future. The elements that cover the Earth’s surface and constitute the land cover are both naturally occurring (such as vegetation, cloud, desert, and water) and artificially created

340

13. Satellite image classification using quantum machine learning

(such as industrial and mine sites) (D€ uzg€ un and Demirel, 2011). The researchers from various fields and nations are mapping and monitoring land cover and focusing on human impact on the environment and the rate of land transformation. This impact includes vegetation removal, urbanization, and groundwater extraction, among other things (Gill and Malamud, 2017; Williams, 2000). For example, Song et al. (2018) found a link between land cover changes and global environmental change, underscoring the importance of studying land cover. Today, satellite-based RS is a key component in our society functioning and often provide the key evidences to reflect the ongoing situation of the nature and its interaction with the humankind. The data collection by RS is in a massive scale and real-time manner, and the techniques required for data analysis are struggling to catch up. Due to the significant variation in the spatial data, RS has developed into the approach generally accepted for detecting the existence of land types and monitoring their changes over time. However, the recognition of land objects from the satellite needs a lot of ground sampling to create the training labels for the corresponding satellite imagery pixels. In contrast to this conventional tool, QML is promising to offer a faster method for classifying these satellite image features according to the published literature.

1.5 Vegetation and nonvegetation cover Land cover is what is visible from above the Earth. It is a way to describe landscape patterns and features that are critical for understanding environmental issues such as habitat availability and changes, the possibilities for the spread of chemicals and other pollutants, and potential contributors to climate change such as land reflectivity. Some examples of land cover include clouds, water bodies, vegetation, and deserts, among others. Natural resource management and environmental science research both depend on the classification and mapping of vegetation (Franklin, 2010). By quantifying the amount of vegetation present at different scales, from local to global, at a specific time or over an extended period of time, vegetation mapping provides important information for understanding natural and manmade environments. Nonvegetation elements are also a contributing factor, for example, dust, a common form of aerosol that affects the water cycle. While gathering land cover data can be time consuming and expensive, RS technology offers a practical and cost-effective solution (Xie et al., 2008). We divided land cover into two categories—vegetation and nonvegetation—in order to accomplish the goal of this chapter and reduce complexity. The experiment aims to build a tool that can quickly spot nonvegetation regions from the vast area of regions scanned by satellites every day.

2 Data Let us briefly discuss the data this chapter requires to make our experiment work. This section provides the details of where to gather the data and how the data can be preprocessed for straightforward use in our QML model.

2 Data

341

2.1 Satellite data retrieval We collected various vegetation examples from NASA MODIS (Moderate Resolution Imaging Spectroradiometer) 1 km data products and later calibrated them through MATLAB. We must calibrate the data that were originally in Hierarchical Data Format version 5 (HDF5) format after being downloaded from the website of NASA LAADS DAAC (Level-1 and Atmosphere Archive and Distribution System Distributed Active Archive Center). The calibration will result in geometrically corrected radiance and reflectance data. HDF5 format is a common format within NASA to store and organize very large multilayered gridded data like satellite imagery Radiance data are collected directly by satellite, whereas reflectance data are collected via a secondary medium that reflects light to the satellite. An image with a dimension of 1280 by 1920 is generated from every HDF file. In our experiment, 115 images are used for generating small blocks of different objects such as vegetation, cloud, desert, etc. As the spatial resolution of these images is 1 km, every large image covers an area of 2,457,600 km2. The following list contains relevant information for accessing data from NASA information systems. The MODIS HDF5 files are well formatted and contain comprehensive metadata about the product description and the details of all the stacked bands. We downloaded various vegetation examples in HDF5 format from the URLs below: 1. LAADS DAAC • https://ladsweb.modaps.eosdis.nasa.gov • MODIS is actually the sensor onboard two satellites. The details about MODIS can be found on NASA website. 2. Terra and Aqua • https://ladsweb.modaps.eosdis.nasa.gov/missions-and-measurements/modis/ • Terra and Aqua are two NASA satellites carrying MODIS sensors. Their technical requirements and channels containing band information are listed on NASA website. 3. MODIS Level 0–1 • https://ladsweb.modaps.eosdis.nasa.gov/missions-and-measurements/sciencedomain/modis-L0L1/

• Both levels of data are accessible in this link’s data repository. A total of eight Terra and Aqua products and MODIS Level 1B calibrated data with spatial resolutions of 250 m, 500 m, and 1 km are available here. We only retained the three optical bands (red, green, blue, or RGB) from MODIS and got 115 calibrated MODIS RGB images with a 1280  1920-pixel resolution, which we will use in the next step.

2.2 Split images into batches for annotation To train our models, the 115 RGB images are later divided into 128  128 pixel-sized small images. We first divided each image into 10 and 15 parts on a horizontal and vertical axis. Then convert each 1280  1920-pixel image into 150 128  128-pixel images. Also, the MODIS image batches will be used as input, and we are still missing the output, which are training labels. Because the images are unlabeled, we need a method for labeling to annotate each

342

13. Satellite image classification using quantum machine learning

batch image with a single land cover class name. As this experiment aims to separate vegetation from other classes, we will make a simple labeling system with only several basic classes such as vegetation and nonvegetation. Most time, the labeling is done manually and via human eye inspection. We split the large original images into small batches so that each set can be made only to contain one object and easy to create labels. Once the image batches were ready, we made some collaborative tagging to create ground truth labels and cross-verify them by multiple team members. Figs. 1 and 2 show some examples from our object-labeled dataset.

3 Applying QML on MODIS hyperspectral images Once the training labels are ready, we can build an image classifier using a quantum neural network (QNN) (Schuld et al., 2014). We will train the QNN to learn and classify labeled images into two classes: those images with majority of the area covered in vegetation, and those images with less or little vegetation coverage.

FIG. 1 Examples of nonvegetation Earth data: water (A), cloud (B), desert (C), and dust (D).

3 Applying QML on MODIS hyperspectral images

343

FIG. 2 Example of vegetation Earth data.

3.1 Quantum neural network A QNN is a parameterized quantum computational model that functions more effectively on a quantum computer. In theory, training QNNs is similar to train traditional neural networks. The communication that occurs between the layers of a neural network is a significant distinction. For a specific operation, the current perceptron copies its output to the perceptron(s) at the network’s subsequent layer(s). In a QNN, where each perceptron is a qubit, this would violate the no-cloning (quantum state cannot be copied) theorem (Nielsen and Chuang, 2002). Generally speaking, the data are processed in the following ways by a QNN. The input data are initially encoded into the correct number of qubit states. The qubit state is then transformed by parameterized rotation gates and entangling gates over a predetermined number of layers. The expected value of a Hamiltonian operator like Pauli gates is then used to measure the transformed qubit state. These measurements are translated into useful output data after being decoded. Following that, the parameters are updated by an optimizer like Adam optimizer (Kwak et al., 2021). After running a lot of iterations going over these steps, the model will be trained to produce the results that are close to the expected numbers/labels, and the error is reduced to a certain threshold. Then the training will stop and we consider the model is fit and ready for testing.

3.2 Land cover (binary) classification To make the tutorial simple for readers to understand, we will experiment a binary classification problem which is very simple and basic in RS domain. In this context, binary classification refers to values that can be represented as either 0 or 1. To accomplish this,

344

13. Satellite image classification using quantum machine learning

we will train the model to distinguish between vegetative and nonvegetative images. Images with major vegetation coverage will be labeled as “1,” while images of objects devoid of vegetation, such as desert, water, clouds, and desert, will be labeled as “0.” The model reuses the general TensorFlow-Quantum’s (Broughton et al., 2020) framework, a popular package that allows users to use quantum circuits in machine learning models as layers in TensorFlow’s neural networks, with the significant distinction of using a modified quantum circuit to execute the classification task due to the type of input data (Fig. 3).

3.3 Setup of TensorFlow, TensorFlow quantum, and Cirq 3.3.1 TensorFlow (TF) TF is a popular open-source machine learning framework developed by engineers at Google. It features a large, flexible ecosystem of tools, libraries, and community resources that allow researchers to push the boundaries of machine learning and developers to implement ML-powered applications effectively. It is built to be particularly good at vectorial computations in neural network modeling. One of these tools is Keras, a TensorFlow-based Python deep learning API designed to facilitate quick experimentation. 3.3.2 TensorFlow quantum (TFQ) TFQ is a QML framework written in Python. Researchers working on quantum algorithms and machine learning applications can use the TFQ framework to access Google’s QC frameworks directly from TensorFlow. The datatype primitives introduced by TFQ are: • Quantum circuit within TensorFlow, a Cirq-defined quantum circuit creates batches of circuits of variable sizes, analogous to batches of distinct real-valued data points. • Pauli sum represents linear combinations of Pauli operators’ tensor products described in Cirq. Create batches of operators of varied sizes, much like circuits. 3.3.3 Cirq Cirq is a Python package for creating, editing, optimizing, and operating quantum circuits on quantum computers and simulators. Quantum circuits are analogous to classical circuits in that they serve as models for quantum computation.

FIG. 3 Architecture of quantum neural network on the satellite image dataset.

3 Applying QML on MODIS hyperspectral images

345

3.4 Setup Let us get started with the guidelines for setting up the necessary packages step by step: To account for version changes, In [1]:

!pip install tensorflow==2.4.1

and In [2]:

!pip install tensorflow-quantum

To update the package resources. In [3]:

import importlib, pkg_resources importlib.reload(pkg_resources)

We import packages such as TensorFlow, TensorFlow Quantum, and Cirq. In [4]:

import tensorflow as tf from tensorflow import keras from tensorflow.keras import layers from tensorflow.keras.models import Sequential import tensorflow_quantum as tfq import cirq

In addition, we import other packages to help with the visualization and implement statistical analysis among others. In [5]:

import cv2 import PIL.Image as Image import ostex import matplotlib.pyplot as plt from sklearn.model_selection import train_test_split import pandas as pd import glob import csv import sympy import seaborn as sns import collections import numpy as np

We implement visualization tools in the notebook: In [6]:

%matplotlib inline

and In [7]:

from cirq.contrib.svg import SVGCircuit

The following codes will show how to download data from Google Drive.

346

13. Satellite image classification using quantum machine learning

3.5 Loading and preprocessing data Please go to the Earth AI book Github repository to download preprocessed data for this chapter. 1. Make sure you have “dataset.csv” and “dataset.zip.” The dataset contains images of Earth vegetation and nonvegetation: In [1]:

!ls

In [8]:

!gdown PATHTWO

where PATHONE or PATHTWO could be paths pointed to any server to download data file, for example, the two links: https://drive.google.com/uc?id=1dOyGLR3j8cFRVkFPA0SZ9D6vUiLQYUPa,

or https://

drive.google.com/uc?id=15-AbVJiSn8_aEUCX7dzZENVzdyHlzdlb.

(However, the links above could expire, please go to the Earth AI book Github repository to check the latest training dataset.) To unzip data: In [2]:

unzip data

or In [3]:

!unzip dataset.zip

2. We extract and load the data in Colab from Google Drive as follows: In [1]:

with open("dataset.csv", mode="r", newline="", encoding="utf-8") as f: reader = csv.reader(f ) gt = {rows[0]:rows[1] for rows in reader}

setting path and list variables In [2]:

path = "/content/dataset/" images = [] target = []

loading the dataset with: In [3]:

for root, dirs, files in os.walk(path): for file in files: with open(os.path.join(root, file), "r") as auto: im = cv2.imread(root+file, 0) images.append(im) if ".DS_Store" in files: files.remove(".DS_Store") target.append(float(gt[file])) df_images = np.array(images) X = df_images y = target

3 Applying QML on MODIS hyperspectral images

347

3. This is a relatively small dataset (about 640 photos), far too small to train a classical or quantum model effectively. The TensorFlow packages are used to load the dataset and perform common image operations, including resizing, centering, cropping, and normalizing, as seen below: We split the data for training and testing using the following command (we only assigned 0.01% of the entire data to be the test data due to the low data quantity): In [1]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01) y_train = np.array(y_train) y_test = np.array(y_test)

Rescale (also known as normalization in data preprocessing) the images from [0, 255] to the [0.0, 1.0] range: In [2]:

X_train, X_test = X_train[..., np.newaxis]/255.0, X_test[..., np.newaxis]/255.0

To print the number of train and test splits: In [3]:

print("Number of original training examples:", len(X_train)) print("Number of original test examples:", len(X_test)) plt.imshow(X_train[0, :, :, 0]) plt.colorbar()

3.6 Quantum circuit data encoding 1. Our dataset contains images of two types: vegetation (labeled as “1”) and nonvegetation (labeled as “0”) with a resolution of 128  128 pixels. As 128  128 is too large for noisy intermediate-scale quantum (NISQ) processors, the final preprocessing step is to “downscale” the images (Farhi and Neven, 2018). Larger pixel sizes, such as 6  6, break at the beginning of model training, so a 4  4-pixel is the only suitable size (Fig. 4).

FIG. 4 Left: Full-resolution 128  128 image size. Right: Downscaled 4  4 image size.

348

13. Satellite image classification using quantum machine learning

Downscale images: An image size of 128  128 is too large for current quantum computers. Resize the image down to 4  4 as follows: In [1]:

X_train_small = tf.image.resize(X_train, (4,4)).numpy() X_test_small = tf.image.resize(X_test, (4,4)).numpy() print(y_train[0]) plt.imshow(X_test_small[0, :, :, 0]) plt.colorbar()

The following is an example of a full-resolution and downscaled image: 2. Quantum information processing focuses on using quantum states to define and represent images and how to carry out operations using those states. Three main image formats— Qubit Lattice, Real Ket, and Flexible Representation of Quantum Images (FRQI)—have so far shown to perform better than classical systems when it comes to storing and retrieving images in QC (Latorre, 2005; Venegas-Andraca and Bose, 2003; Le et al., 2011). In particular, qubit lattice transforms pixels into a two-dimensional (2D) array of qubits. In this tutorial, we follow the proposal of Farhi and Neven (2018) that each pixel be represented by a qubit, the state of which is determined by the pixel’s value, in order to process images using a quantum computer. There are a variety of ways for doing so, and the encoding affects numerous parts of the learning process. The following method is used in this case: One qubit will correspond to each pixel in the image (floating point value in the range [0, 1]). We represent the pixel as the 0th state if it is less than 0.5; otherwise, it is the first state (think of it as black-and-white or on-and-off conversions) (Hidary, 2019). Set threshold: In [1]:

THRESHOLD = 0.5

We split the data in the train and test set. In [2]:

X_train_bin = np.array(X_train_small > THRESHOLD, dtype=np.float32) X_test_bin = np.array(X_test_small > THRESHOLD, dtype=np.float32)

We define a function with the batch image as argument: In [3]:

def convert_to_circuit(img): """Encode truncated classical image into quantum datapoint.""" values = np.ndarray.flatten(img) #print(values) qubits = cirq.GridQubit.rect(4, 4) #print(qubits) circuit = cirq.Circuit() for i, value in enumerate(values):

3 Applying QML on MODIS hyperspectral images

349

if value: circuit.append(cirq.X(qubits[i])) return circuit

Apply the defined function on the training datasets: In [4]:

X_train_circ = [convert_to_circuit(x) for x in X_train_bin] X_test_circ = [convert_to_circuit(x) for x in X_test_bin]

The circuit (shown in Fig. 5) for the first example is as follows: In [5]:

SVGCircuit(X_train_circ[0])

3. A quantum circuit has now been created for each image in the dataset. The Cirq circuits are converted to tensors for TFQ: In [1]:

X_train_tfcirc = tfq.convert_to_tensor(X_train_circ) X_test_tfcirc = tfq.convert_to_tensor(X_test_circ)

3.7 Quantum neural network: Building and compiling the model In this section, a QNN will be created to transport encoded data through a series of trainable quantum gates. 1. Model building. Following the construction of TFQ MNIST classification (TensorFlow Quantum, n.d.). We begin by creating a class that adds a layer of gates to a circuit (Fig. 6), with each layer utilizing “n” instances of the same gate and each data qubit operating on the readout qubit: In [1]:

class CircuitLayerBuilder(): def __init__(self, data_qubits, readout): self.data_qubits = data_qubits self.readout = readout def add_layer(self, circuit, gate, prefix):

FIG. 5 Encoded circuit.

350

13. Satellite image classification using quantum machine learning

FIG. 6 Gate layer as a circuit.

for i, qubit in enumerate(self.data_qubits): symbol = sympy.Symbol(prefix + "-" + str(i)) circuit.append(gate(qubit, self.readout)**symbol)

Sample circuit layer for visualization: In [2]:

demo_builder = CircuitLayerBuilder( data_qubits = cirq.GridQubit.rect(4,1), readout = cirq.GridQubit(-1,-1)) circuit = cirq.Circuit() demo_builder.add_layer(circuit, gate = cirq.XX, prefix="xx") SVGCircuit(circuit)

We now can create a full model that corresponds to the data-circuit size and includes the preparation and readout steps. Create a QNN model circuit and readout operation to go along with it. rect(a, b) is a function with two arguments, which generate a grid a  b. GridQubit(c, d) is a function that reads single qubits, where c and d are indexes to identify each wires in the grid: In [1]:

def create_quantum_model(): data_qubits = cirq.GridQubit.rect(4, 4) readout = cirq.GridQubit(-1, -1) circuit = cirq.Circuit()

Prepare the readout qubit: In [2]:

circuit.append(cirq.X(readout)) circuit.append(cirq.H(readout)) builder = CircuitLayerBuilder( data_qubits = data_qubits, readout=readout)

3 Applying QML on MODIS hyperspectral images

351

Then add layers (experiment by adding more): In [3]:

builder.add_layer(circuit, cirq.XX, "xx1") builder.add_layer(circuit, cirq.ZZ, "zz1")

Finally, prepare the readout qubit: In [4]:

circuit.append(cirq.H(readout)) return circuit, cirq.Z(readout) model_circuit, model_readout = create_quantum_model()

Build the Keras model. Inside the function, we have the first line, which contains the input, namely, the data circuit, encoded as a tf.string. Second line is the PQC layer that returns the expected value of the readout gate, range [-1,1]: In [1]:

model = tf.keras.Sequential([ tf.keras.layers.Input(shape=(), dtype=tf.string), tfq.layers.PQC(model_circuit, model_readout), ])

Prepare dataset for a loss function: In [2]:

y_train_hinge = np.array(y_train, dtype="float64") y_test_hinge = np.array(y_test, dtype="float64") y_train_hinge_df = pd.DataFrame(y_train_hinge) y_test_hinge_df = pd.DataFrame(y_test_hinge)

y_train_hinge_df.replace(to_replace=[0.0], value=-1.0, inplace=True) y_test_hinge_df.replace(to_replace=[0.0], value=-1.0, inplace=True)

y_train_hinge = np.array(y_train_hinge_df, dtype=float) y_test_hinge = np.array(y_test_hinge_df, dtype=float)

Define a hinge loss function: In [33]:

def hinge_accuracy(y_true, y_pred): y_true = tf.squeeze(y_true) > 0.0 y_pred = tf.squeeze(y_pred) > 0.0 result = tf.cast(y_true == y_pred, tf.float32)

return tf.reduce_mean(result)

352

13. Satellite image classification using quantum machine learning

2. Next, we use the hinge loss function, Adam optimizer, and hinge accuracy as a measure to compile the model: In [1]:

model.compile( loss=tf.keras.losses.Hinge(), optimizer=tf.keras.optimizers.Adam(), metrics=[hinge_accuracy])

print(model.summary())

3.8 Training the QNN model We are prepared to begin the training process (Fig. 7): In [1]:

EPOCHS = 5 BATCH_SIZE = 32 NUM_EXAMPLES = len(X_train_tfcirc)

X_train_tfcirc_sub = X_train_tfcirc[:NUM_EXAMPLES] y_train_hinge_sub = y_train_hinge[:NUM_EXAMPLES]

FIG. 7 QNN training performance across five epochs.

4 Conclusions

353

qnn_history = model.fit( X_train_tfcirc_sub, y_train_hinge_sub, batch_size=32, epochs=EPOCHS, verbose=1, validation_data=(X_test_tfcirc, y_test_hinge))

qnn_results = model.evaluate(X_test_tfcirc, y_test)

3.9 Classification performance We achieved 85% hinge accuracy in this tutorial. In [1]:

fig = plt.figure(figsize=(10,6)) plt.plot(qnn_history.history["loss"], color="#785ef0") plt.plot(qnn_history.history["val_loss"], color="#dc267f") plt.title("Model Loss Progress") plt.ylabel("Hinge Loss") plt.xlabel("Epoch") plt.legend(["Training Set", "Test Set"], loc="upper right") plt.show()

4 Conclusions In this tutorial, we applied a QNN using a NISQ processor to perform the binary classification task of Earth data. The QNN classified data as either vegetation or non-vegetation data. In the end, we achieved a classification accuracy of 85%. Although not as good as cutting-edge classical machine learning models, this is still quite good for a simple quantum model. Understanding each significant procedure that goes into creating a quantum model is critical, including preprocessing, data encoding, QNNs, readout, and training. The dataset we described can not only solve modeling vegetation from satellite data but also easily solve the classification of the other labels, such as vegetation. The superiority of QML over classic models is not based on a model’s generalization capabilities but on better training and execution times in the upcoming era when quantum hardware is available and accessible to many more people. In order to explore the potential and future of this still-emerging field, there is still a lot of research being done in QML; for instance, Verdon et al. (2018) and Havlı´cek et al. (2019).

354

13. Satellite image classification using quantum machine learning

5 Assignments • Please use Google Colab to change the configuration of the QML model with more layers and various optimizers and check if it impacts the model’s performance. • To see the effects of training a model with more data, retrain the model with higher test data like 10% or 20% (i.e., 0.1 or 0.2 test size assignment). • Does increasing the number of epochs impact the classification accuracy?

6 Open questions There are undoubtedly many unanswered questions regarding using QML in practical applications. Here are a few to motivate further research: • How can QML easily outperform the conventional ML model on digital computers? • Given that the dataset used in this tutorial was relatively small, will data augmentation be a workable solution when the data are represented in a quantum computer? • Can a QNN solve classification tasks that traditional machine learning cannot?

Acknowledgment Sections 1–3 are revised and populated by the editors with the permission of the authors.

References Alzubi, J., Nayyar, A., Kumar, A., 2018. Machine learning from theory to algorithms: an overview. J. Phys. Conf. Ser. 1142 (1), 012012. Ayoade, O., Rivas, P., Orduz, J., 2022. Artificial intelligence computing at the quantum level. Data 7 (3), 28. Baddock, M.C., Gill, T.E., Bullard, J.E., Acosta, M.D., Rivera Rivera, N.I., 2011. Geomorphology of the Chihuahuan desert based on potential dust emissions. J. Maps 7 (1), 249–259. Biamonte, J., Wittek, P., Pancotti, N., Rebentrost, P., Wiebe, N., Lloyd, S., 2017. Quantum machine learning. Nature 549 (7671), 195–202. Broughton, M., Verdon, G., McCourt, T., Martinez, A.J., Yoo, J.H., Isakov, S.V., Massey, P., et al., 2020. Tensorflow quantum: a software framework for quantum machine learning. arXiv preprint arXiv:2003.02989. D€ uzg€ un, H.Ş., Demirel, N., 2011. Remote Sensing of the Mine Environment. CRC Press, USA. Farhi, E., Neven, H., 2018. Classification with quantum neural networks on near term processors. arXiv preprint:1802.06002. Franklin, J., 2010. Ecological understanding of species distributions. In: Mapping Species Distributions (Spatial Inference and Prediction), Cambridge University Press, Cambridge, UK, pp. 34–52. Gill, J.C., Malamud, B.D., 2017. Anthropogenic processes, natural hazards, and interactions in a multi-hazard framework. Earth Sci. Rev. 166, 246–269. Google AI Quantum and Collaborators, Arute, F., Arya, K., Babbush, R., Bacon, D., Bardin, J.C., Barends, R., Boixo, S., Broughton, M., Buckley, B.B., et al., 2020. Hartree-Fock on a superconducting qubit quantum computer. Science 369 (6507), 1084–1089. Havlı´cek, V., Co´rcoles, A.D., Temme, K., Harrow, A.W., Kandala, A., Chow, J.M., Gambetta, J.M., 2019. Supervised learning with quantum-enhanced feature spaces. Nature 567 (7747), 209–212. Hidary, J.D., 2019. Quantum Computing: An Applied Approach. Springer.

References

355

Kwak, Y., Yun, W.J., Jung, S., Kim, J., 2021. Quantum neural networks: concepts, applications, and challenges. In: 2021 Twelfth International Conference on Ubiquitous and Future Networks (ICUFN), pp. 413–416. Latorre, J.I., 2005. Image compression and entanglement. arXiv preprint quant-ph/0510031. Le, P.Q., Dong, F., Hirota, K., 2011. A flexible representation of quantum images for polynomial preparation, image compression, and processing operations. Quantum Inf. Process. 10 (1), 63–84. Mahesh, B., 2020. Machine learning algorithms—a review. Int. J. Sci. Res. [Internet] 9, 381–386. McMahon, D., 2007. Quantum Computing Explained. John Wiley & Sons. Mehta, P., Bukov, M., Wang, C.-H., Day, A.G.R., Richardson, C., Fisher, C.K., Schwab, D.J., 2019. A high-bias, lowvariance introduction to machine learning for physicists. Phys. Rep. 810, 1–124. Montanaro, A., 2016. Quantum algorithms: an overview. npj Quantum Inf. 2 (1), 1–8. National Academies of Sciences, Engineering, and Medicine, et al., 2019. Quantum Computing: Progress and Prospects. National Academies Press. Nielsen, M.A., Chuang, I., 2002. Quantum Computation and Quantum Information. American Association of Physics Teachers. Schuld, M., Petruccione, F., 2018. Supervised Learning With Quantum Computers. vol. 17 Springer. Schuld, M., Sinayskiy, I., Petruccione, F., 2014. The quest for a quantum neural network. Quantum Inf. Process. 13 (11), 2567–2586. Schuld, M., Sinayskiy, I., Petruccione, F., 2015. An introduction to quantum machine learning. Contemp. Phys. 56 (2), 172–185. Song, X.-P., Hansen, M.C., Stehman, S.V., Potapov, P.V., Tyukavina, A., Vermote, E.F., Townshend, J.R., 2018. Global land change from 1982 to 2016. Nature 560 (7720), 639–643. Sun, Z., Di, L., Fang, H., 2019. Using long short-term memory recurrent neural network in land cover classification on landsat and cropland data layer time series. Int. J. Remote Sens. 40 (2), 593–614. Venegas-Andraca, S.E., Bose, S., 2003. Storing, processing, and retrieving an image using quantum mechanics. In: Quantum Information and Computation, vol. 5105, pp. 137–147. Verdon, G., Pye, J., Broughton, M., 2018. A universal training algorithm for quantum deep learning. arXiv preprint:1806.09729. Williams, M., 2000. Dark ages and dark areas: global deforestation in the deep past. J. Hist. Geogr. 26 (1), 28–46. Xie, Y., Sha, Z., Yu, M., 2008. Remote sensing imagery in vegetation mapping: a review. J. Plant Ecol. 1 (1), 9–23.

This page intentionally left blank

C H A P T E R

14 Provenance in earth AI Amruta Kale and Xiaogang Ma Department of Computer Science, University of Idaho, Moscow, ID, United States

1 Introduction The growing use of artificial intelligence (AI) and machine learning (ML) has become an indispensable part of our modern life. AI/ML has revolutionized everything in a relatively short period of time, reshaping workflows from research and manufacturing to upgrading the finance and healthcare streams. This is due to the advancement of technologies, e.g., deep learning (DL) (LeCun et al., 2015) that have largely contributed to the enormous success of AI/ML systems in terms of prediction and accuracy. DL plays a prime role in building advanced technology like virtual assistants, autonomous driverless cars, and facial recognition. AI/ML technologies have the power to design systems that can mimic human behavior, offer answers to challenging and complex problems, and further develop simulations to become human-level AI by utilizing computer-based training and advanced algorithms. According to a recent study by Markets and Markets, the AI market size will grow from USD 86.9 billion in 2022 to USD 407.0 billion by 2027, at a Compound Annual Growth Rate (CAGR) of 36.2% during the forecast period (Artificial Intelligence Market (Markets and Markets Analysis), 2022). Due to the vast availability of big data (volume, variety, and velocity), DL algorithms are now frequently used in most domains (Ma, 2021). The huge success of DL models like deep neural networks (DNN) and artificial neural networks (ANN) comprises a combination of multiple layers and millions of parameters that extract important features from raw data. Yet, this complex process also makes DNN applications into black-box models (Castelvecchi, 2016). Even though these models deliver high predictive accuracy they often lack transparency. As the black-box model is increasingly used, the demand for explanation is also increasing from various stakeholders in AI (Preece et al., 2018). Another risk lies in making and implementing judgments that are not reasonable, lawful, or simply do not allow an extensive explanation of their actions, especially in critical domains (Gunning and Aha, 2019). It is common to believe that focusing purely on performance will make AI/ML systems

Artificial Intelligence in Earth Science https://doi.org/10.1016/B978-0-323-91737-7.00015-3

357

Copyright # 2023 Elsevier Inc. All rights reserved.

358

14. Provenance in earth AI

opaque, unfair, and nonintuitive. As the demand and awareness for ethical AI are increasing, people are hesitant to apply AI/ML techniques that are not transparent, interpretable, reproducible, and traceable (Goodman and Flaxman, 2017; Zhu et al., 2018). It is a common understanding that there are trade-offs regarding the model’s accuracy and transparency. Still, as we are moving toward a more automated world, AI/ML models should also be humanunderstandable. To facilitate human understandability, users often require explanations from AI/ML models as to how these systems arrived at their conclusions; this, however, is often lacking in the existing system (Montavon et al., 2017; Adadi and Berrada, 2018; Miller, 2019). Recently researchers have acknowledged the increasing need for explainable artificial intelligence (XAI) and trustworthy artificial intelligence (TAI) in AI/ML systems (Wing, 2020). As a result, several survey papers have highlighted the significance of XAI and TAI (Adadi and Berrada, 2018; Arrieta et al., 2020; Belle and Papantonis, 2021: Wing, 2021). This research field holds substantial promise to address the challenges mentioned above. XAI refers to the methods and techniques in the applications such that the results generated by AI/ML models are easily explainable, and understandable to humans (Ribeiro et al., 2016; Gunning and Aha, 2019). It may include general information about how the system operates, why the system failed, what underlying features were considered, and information about training and test dataset (Guidotti et al., 2018; Lipton, 2018; Murdoch et al., 2019). However, we also believe that explanation is user-focused, and the type of explanation depends on the user’s role, former knowledge, and domain. Some safety-critical applications may require comprehensive knowledge to make judgments, while others may not require a detailed description of the systems and how they arrived at their conclusions. For example, a meteorologist anticipating weather forecasts and other weather occurrences, such as where a hurricane would impact, may require thorough information about the factors that influence atmospheric conditions and weather patterns over time. However, general users may just require information regarding the circumstances and safety precautions to be undertaken. To make AI/ML systems transparent, an explanation is key as it provides extensive knowledge about the system and builds user engagement in the AI/ML systems. There have been many post hoc explainability approaches (Guidotti et al., 2018; Lipton, 2018; Arrieta et al., 2020; Belle and Papantonis, 2021) designed to provide explanations for AI/ML models that are not transparent by design. More preciously, these post hoc methods are an interpretable model (e.g., linear model or decision tree) which is used to train on the black-box model to get a better understanding. These techniques contain explanations about the results in the form of natural language explanations (Krening et al., 2016), visualizations of learned models (Mahendran and Vedaldi, 2015), and explanations by example (Mikolov et al., 2013) to understand the underlying model. However, we believe XAI is a diverse topic, and a single disciplinary approach cannot solve it. Consequently, some academics stated that provenance is also an emerging field that can be used to explain AI/ML systems (Liu et al., 2017; Jentzsch and Hochgeschwender, 2019; Frost, 2020). The term “provenance” describes the informational sources, such as the people and organizations responsible for creating or delivering an artifact. The phrase was initially meant to describe works of art or antiques, but it is now used in many fields, including computing, paleontology, science, archaeology, manuscripts, archives, and printed books. One of the great examples of capturing provenance can be seen in artwork, like who created it, who owned it at the time, which collection it was part of, etc. Adding these details increases the value of artwork and determines its

2 Overview of relevant concepts in provenance, XAI, and TAI

359

ownership. In the same way, adding provenance to data and AI/ML processes can enhance transparency and explainability. We strongly believe provenance has the capability of explanation that has been neglected or has not received the attention it deserves. The inclusion of provenance can address the “what” and “why” aspects by documenting the entire process. Several researchers further discussed that enabling provenance is essential for determining authenticity, building trust, and ensuring reproducibility in AI/ML models ( Jaigirdar et al., 2019; Amalina et al., 2019; Jaigirdar et al., 2020). In our previous literature reviews, we found that including provenance in AI/ML models will strengthen explanation and improve transparency (Kale et al., 2022). We believe that adding provenance in AI/ML systems will help generate resourceful and sufficient explanations for users and reproducibility. In this chapter, we will first provide an overview of basic concepts in provenance, XAI, and TAI. Second, we will discuss the related work in the field of provenance and AI in earth science by mentioning the state-of-the-art progress. Third, we will present several tools from the communities designed for capturing provenance to support explainability and transparency. Finally, we will discuss the progress and trends, and wrap up the chapter.

2 Overview of relevant concepts in provenance, XAI, and TAI 2.1 Guidelines for building trustworthy AI AI has enormous potential to revolutionize everyone’s lives. It has spread across all facets of society, bringing profound changes individually, societally, and environmentally. However, even with such unprecedented advancement, they still face challenges in addressing trustworthiness, transparency, and intelligibility. In order to build a transparent and fair AI system, the High-level Expert Group on AI (HLEG) prepared a document on ethics and guidelines on TAI (AI HLEG (High-Level Expert Group on AI), 2019). This guideline listed seven fundamental criteria that AI systems must achieve to be considered trustworthy. TAI is built on three key components: an AI system should be (1) lawful, adhering to all applicable laws and regulations; (2) ethical, respecting the principal and ethical values; and (3) robust, both technically and socially (Fig. 1). As AI/ML systems are increasingly used, the necessity to explain the results has led to new discussions and actions in scientific communities. It is essential that these systems must be transparent, unbiased, and reliable, which is why these guidelines are so important. These guidelines will help newcomers to attain a basic understanding of what is TAI and how to realize it. Here are the seven European Union (EU) guidelines for defining TAI (Floridi, 2019; Thiebes et al., 2020; Jain et al., 2020): • Human agency and oversight: AI systems should support human agency and fundamental rights and not limit or mislead human freedom. • Technical robustness and safety: Trustworthy AI demands algorithms to be safe, consistent, and robust enough to deal with errors or irregularities throughout the AI system’s life cycle.

360

14. Provenance in earth AI

FIG. 1 Trustworthy AI with three key components.

• Privacy and data governance: Throughout the entire lifecycle, AI systems must maintain privacy and data protection where users should have complete control over their own data. • Transparency: AI systems should be traceable, explainable, and well-communicable even if the system has flaws or limitations in it. • Diversity, nondiscrimination, and fairness: AI systems should be fair to all stakeholders regardless of their age, gender, abilities, or characteristics. • Societal and environmental well-being: AI systems should promote social transformation as well as enhance environmental sustainability and accountability. • Accountability: Mechanisms should be put in place to ensure ownership, accountability, and potential compensation for AI systems and their outcomes.

2.2 Understanding explainable AI The inability of AI systems to provide comprehensive information has raised social, ethical, and legal pressure for the development of new AI techniques that are capable of making explainable and understandable decisions. TAI and XAI are often mentioned together. XAI suggests a transition toward more transparent AI. However, XAI is not a new field; the term was first coined by Van Lent et al. (2004) to highlight the ability of their system to explain the behavior of AI-controlled entities in simulation games. Recently, the topic has received great attention from both academia and industry. As a result, several survey papers have highlighted the noteworthy importance of XAI (Adadi and Berrada, 2018; Singh et al., 2018; Lecue, 2020; Arrieta et al., 2020; Belle and Papantonis, 2021; Das and Rad, 2020). This research field aims to develop a set of strategies that will make the result of AI/ML systems understandable to humans. XAI will be essential if the user needs to understand what, why,

2 Overview of relevant concepts in provenance, XAI, and TAI

361

and how aspects of the models. To address this, the Defense Advanced Research Projects Agency (DARPA) funded the “Explainable AI (XAI) Program” to improve explainability through local and post hoc interpretation methods (Gunning, 2019; Arrieta et al., 2020). The program focuses on building explainable models while maintaining high predictive accuracy (Gunning and Aha, 2019), to create new ML techniques that combine explanations and enable users to understand, manage, and effectively gain trust. These ML techniques will have the ability to identify flaws and predict how the machine will behave in the future. The key objective of XAI is to address trustworthiness and intelligibility in AI/ML models. Fig. 2 characterizes the visual representation of DARPA’s XAI program. However, for greater clarity, we have simplified the diagram based on the types of ML models. Traditionally there are two types of ML models, and the choice of these models depends upon the different application purposes. Transparent ML models (e.g., linear regression, k-nearest neighbors, Bayesian models, decision trees) have the ability to figure out what went wrong in the system or explain how they arrived at a particular decision (Holzinger et al., 2017; Murdoch et al., 2019). These models have a substantial percentage of training and test accuracy and work well with simple datasets. However, when dealing with complex applications, transparent models are insufficient for analysis which is why opaque models are required. Opaque ML models (e.g., DL and neural networks) are black boxes in nature, despite high predictive accuracy, these models cannot be easily examined or understood (Montavon et al., 2017; Adadi and Berrada, 2018). But with the new XAI approach, the system takes input from the current task and makes decisions, recommendations, and actions that allow users to understand and evaluate based on the system explanation. This technique will help users regulate their decisions by providing a reason or justification, particularly when unexpected assumptions are made. XAI will provide insights into the behavior of systems and their unknown flaws, improve models’ transparency, and verify predictions, which will lead toward TAI.

FIG. 2 Transparent vs. Opaque vs. Explainable model. Above image is adapted from DARPA’s XAI program (Gunning, D., Aha, D., 2019. DARPA’s explainable artificial intelligence (XAI) program. AI Mag. 40(2), 44–58).

362

14. Provenance in earth AI

2.3 Provenance and documentation XAI provides transparency and increases the intelligibility of a system using post hoc explanation methods (K€ ohl et al., 2019). In our opinion, gaining transparency using post hoc methods can be useful, but many metadata and context information about these systems are still widely neglected. To achieve an accurate explanation, provenance documentation should be an essential component of the XAI approaches (cf. Singh et al., 2018; Jentzsch and Hochgeschwender, 2019). Experts and researchers are interested in documenting provenance for several reasons. Most importantly, well-documented provenance confirms the credibility of the scientific results and enables reusability (Moreau et al., 2008; Zeng et al., 2019). Provenance determines ownership, as it provides the historical context of who has owned the work and when. The term provenance has an exceptionally long history, it is “the origin or source of something” (Cheney et al., 2009). In this way, it is like metadata, which is data about data (Ma, 2018). Metadata is a crucial component of data collection and distribution. It provides information like the author’s details, date created, date modified, and data file versions in a structured and standardized form in such a way that the dataset can be potentially reused. According to the definition of the World Wide Web Consortium (W3C), provenance is defined as “information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness” (Groth and Moreau, 2013; Missier et al., 2013). That definition means provenance can be used to document not only metadata but also other entities and steps in a workflow. The W3C PROV family consists of 12 documents (PROV-Overview, PROV-PRIMER, PROV-O, PROV-DM, PROV-N, PROV-CONSTRAINTS, PROV-XML, PROV-AQ, PROVDICTIONARY, PROV-DC, PROV-SEM, and PROV-LINKS) which give details to help understand and implement provenance documentation. Fig. 3 illustrates the basic elements of

FIG. 3 The three top classes of PROV-O model and properties. Above image adapted from W3C PROV family of documents (Groth, P., Moreau, L., 2013. An Overview of the PROV Family of Documents, W3C, https://www.w3.org/TR/provoverview/ (Accessed 27 January 2022)).

3 Need for provenance in earth AI

363

PROV-O (Prov Ontology). It is also known as the starting point term and it is built on three fundamental classes (1) prov:Entity, which is a physical, digital, conceptual, or other kinds of thing with some fixed aspects; entities may be real or imaginary; (2) prov:Activity, which is something that occurs over a period of time and acts upon or with entities; it may include consuming, processing, transforming, modifying, relocating, using, or generating entities; and (3) prov:Agent, which is something that bears some form of responsibility for an activity taking place, for the existence of an entity, or another agent’s activity. We believe adding provenance in AI/ML systems will help address the issues associated with reproducibility, transparency, explainability, accountability, and authenticity.

3 Need for provenance in earth AI 3.1 Use of AI in the earth science domain In recent years, AI has been widely used to improve or replace conventional tasks in earth science domains. These methods have proven effective in performing various tasks like climate models, anomaly detection, weather prediction, event classification, and space weather, raising the expectation that AI could address some of the major challenges in earth science (Rasp et al., 2018). According to Intel’s survey from 2018, 74% of the respondents indicated that AI would support solving long-term environmental concerns (Intel study, 2018). As a response, Intel is on board, pledging to restore 100% of its global water use by 2025. According to the article published by Columbia University’s Earth Institute (Cho, 2021), AI has assisted farmers in India in increasing groundnut yields by providing knowledge on how to prepare the field, apply fertilizer, and choose sowing dates, resulting in a 30% increase in yields per hectare. According to the same article, Norway uses AI to develop a flexible and autonomous power grid incorporating more renewable energy. Microsoft’s AI for Earth program, launched in 2017, seeks to provide 200 research grants totaling $50 million to projects that use AI to address environmental challenges (Microsoft’s Earth AI Program, 2017). IBM’s Green Horizon project in China utilized an AI system to predict pollution levels, track pollution sources, and generate potential solutions to drastically lower the pollutants (IBM Green Horizon Project, 2016). The increasing attention to AI/ML and earth science has also been reflected in the records of publications for the past 10 years. Fig. 4 illustrates the exponential growth of relevant publications found in Scopus.

3.2 Related work in provenance and earth science As the amount of data is increasing in the earth science domain, there are numerous initiatives taken to extend and improve practices in preserving provenance. For instance, Lanter (1991, 1993) developed a metadatabase system for tracking the process of workflow and a system (Geolineus) for recording geographic information system (GIS) operations. Governments and other funding organizations have expressed the need for provenance and have been developing policies related to documenting provenance. In 1998, NASA established the Federation of Earth Science Information Partners (ESIP) to involve a larger group of stakeholders in improving techniques for storing, searching, accessing, and using earth science data (Showstack, 1998). ESIP has also initiated standard practices for the reusability of data by

364

14. Provenance in earth AI

FIG. 4 Distribution of publications (01/2010–12/2020) whose title, abstract and keywords include “earth science” and “artificial intelligence” or “machine learning.” The query below was used to extract results from the Scopus database on Dec 29th, 2021 (TITLE-ABS-KEY (earth AND science) AND TITLE-ABS-KEY (machine AND learning) OR TITLE-ABS KEY (artificial AND intelligence)) AND PUBYEAR >2009 AND PUBYEAR 2) processes, but Geoweaver also supports isolated processes, that is, processes not connected to each other are also allowed. Complex scientific experiments can simply be broken down into a number of workflows, which can then be executed and managed here. The weaver workspace allows users to create new workflows or edit the existing ones. The “add to weaver” button in the process module allows users to add different processes to the weaver workspace. In Fig. 11 we have three processes created in Python. For better understandability we have

FIG. 10 A Geoweaver dashboard to browse the provenance recorded for each process. Here category is the type of process we are executing, the name is the given program name and ID is the unique identification of each program.

4 Technical approaches

371

FIG. 11 Demonstration of workflow created in Geoweaver using different process.

demonstrated simple Python programs “addition” (addition of two numbers), “for_loop” (running a for loop to print a series of numbers), and “if_else” (print the greater number). To create a workflow the processes can be linked with each other by pressing and holding the SHIFT key on the keyboard. Once the workflow is created you can click on the “plus” button in the top floating bar, then add details in the popup window “Input Workflow Name” (write the name of your workflow) and a simple description in the “Description,” once you add all the information click confirm to complete. To run the workflow, click on the “play” button in the top floating bar, in the pop window select the “one-host” option, then choose localhost and set the environment to default. Finally, click Run, enter the password for localhost and confirm. While the workflow is in the execution mode, you will notice different colors: blue means the process is waiting, yellow means the corresponding process is running, green means the process execution is finished, and red means the process execution is failed for some reason (Fig. 12). To export the workflow from Geoweaver, click on the downward icon button in the top floating bar. The workflow exportations will provide two options “Workflow with process code” and “Workflow with process code and history.” The first option will only download the source code and workflow json, but the latter will download source code, workflow json along with the detailed history of the previous execution. The second option is recommended as it is provenance enabled. Click “Confirm” and a ZIP file will be automatically downloaded to your machine. Another best feature of Geoweaver is the ability to reproduce and edit the existing workflows. To import the shared workflow, click on the upward icon in the top

372

14. Provenance in earth AI

FIG. 12 Web interface representing different colors in the execution mode. The image is captured from Geoweaver in-browser software.

floating bar and drag and drop the Geoweaver ZIP file and click “Start.” Once the uploading is finished and if the workflow file is valid, it will ask “The upload workflow is valid. Do you want to proceed to save it into the database?” “You can click OK, then the workflow will be automatically loaded in the workspace and ready to reuse. All three platforms introduced above have their websites where detailed tutorials and sample workflows can be accessed, including examples in Earth and environmental sciences. Interested readers are suggested to go to their websites (see links in captions of Figs. 5–7) to see and practice the different technical approaches for provenance or metadata documentation.

5 Discussion Provenance in Earth AI is closely relevant to reproducibility. In our opinion, one of the most important aspects of making AI/ML more reproducible is to record or document all the core primitives such as hyperparameters, model architecture, code commits, datasets, and all the metadata associated with the training process. We understand there are plenty of different factors such as data changes, different software environments or versions, and numerous other small variations that can result in a reproducibility crisis. As it is not necessary to document all the detailed information, AI practitioners must prioritize documenting the most important elements of a project from day one to enable other researchers to easily reproduce their work when necessary. Adding a standardized format of reproducibility also ensures efficiency and accuracy. This will not only help researchers to reproduce results but will also ensure transparency and trust. Beyond documenting the fundamental components of an AI/ML system, the concept of reproducibility can be viewed as a systematic way of working in data-intensive Earth AI.

5 Discussion

373

Another advantage enabled by provenance is interpretability. Although interpretability and explainability are often discussed interchangeably, interpretability is concerned with the factors that influence a model’s decisions, while explainability deals with the reasoning process that a model follows to arrive at a final decision. The need for interpretability has been highlighted in many studies, especially when the decisions made by AI/ML algorithms have generated unintended biased, discriminatory, and even harmful outcomes. This issue has raised concerns about transparency and ethics for AI practitioners, such as when algorithms are deployed in critical domains like healthcare. In our opinion, interpretability is a prerequisite for humans to trust AI/ML models. It allows users to understand the causes behind decisions of real-world AI/ML applications, and thereby improve the fairness of the models. Enabling interpretability in AL/ML models will improve confidence and trust in the model. It will help data scientists to draw explanations from the black-box model for why certain decisions or predictions have been made. Recent techniques such as LIME (Local Interpretable Model-Agnostic Explanation) and SHAP (SHapley Additive exPLanations) show great promise for model interpretability. However, there is still ample room for improvement from a data science and engineering perspective. In order to understand opaque models, we need new initiatives and techniques to design systems that are safe, robust, ethical, and most importantly, interpretable. As AI/ML systems expand to include more diverse applications, the need for capturing provenance is gaining traction in the research community. We believe that the inclusion of provenance will not only strengthen AI/ML systems but will also improve transparency and explainability. Adding domain-specific documentation standards can help the community to grasp and begin employing appropriate practices routinely. In our opinion, data only adds value when it is accompanied by provenance information (for example, Wikipedia is not considered a trustworthy source due to the fact that many of its sources cannot be verified). Relying purely on data without verifying its source could be an unhealthy practice. On the other hand, documenting the necessary detail of a workflow will help researchers with troubleshooting in the event of errors, shedding light on the behavior of the model. It is worth noting that, manual documentation can put model provenance at risk, particularly when working with large datasets. For this reason, we encourage automating the process of provenance tracking by using workflow platforms, tools, and packages to limit manual operation. Fortunately, more adequate tools for recording and sharing provenance information are already being developed, which will greatly facilitate the automation of provenance documentation. The future of provenance documentation looks promising because governments and funding organizations are already recognizing the need for data preservation and provenance and are increasingly providing guidelines and support for works in that direction. Looking into the future, we propose a few points for discussion. The first is that AI/ML systems need to be adaptive and interactive, providing explanations based on users’ needs, expertise, and requirements. The success of W3C PROV is a perfect example of making AI/ ML processes reproducible or repeatable. However, we believe that as the research progresses, we will need more adoption of provenance standards to enable open science in various disciplines, including those in earth science. Our second point is that more functions of data management and provenance documentation need to be enabled in workflow platforms. To speed up scientific research and comprehension, open data will allow researchers to share data, information, and expertise (with clear licensing) enabling transparency and

374

14. Provenance in earth AI

reproducibility. The increasing availability of open data comes with the need for data management. However, the humongous amount of data available today makes manual management an unrealistic approach. There is a need for more platforms like Geoweaver which automate the AI/ML workflow and enable users to perform all tasks in one place more efficiently. The third point is about leveraging cloud service in data-intensive Earth AI. Many of the existing automated data management and analysis platforms are cloud-based, and we will undoubtedly see a continuation of rapid adoption and growth of cloud platforms. This will likely result in a shift in focus from complex high-cost local computation to cloud computing. This fact will be the primary motivator for the researchers to migrate to cloud platforms.

6 Conclusions With the increasing adoption of AI/ML systems, there is an increasing need for the results to be interpretable, reproducible, traceable, and explainable. Although the post hoc explainability approaches could be one way to explain a black-box model, we believe they are still in their infancy stage and not completely reliable. We suggest adopting established methods from the field of data and software provenance will be an ideal solution for providing explanations to AI/ML systems. Provenance will not only help users to trace, evaluate, understand, and reproduce the AI/ML results but will also enhance users’ decisions about how much trust to place in data and results generated from the original sources. In this study first, we presented a summary highlighting the fundamental concepts of XAI, TAI, and provenance. Second, we discussed how AI/ML models have advanced in the earth science discipline and the related work in provenance. Third, we illustrated three different tools that support reproducible results and provenance tracking. Lastly, we presented a research outline in the discussions to analyze the challenges and suggest further research opportunities. We hope this chapter provides enough justice to the importance of provenance and gives insight into new tools and the progress that can be made in AI/ML systems. We believe that provenance remains an important topic and has much more to offer the earth science community.

7 Assignment Create a simple workflow in the Geoweaver system (https://geobrain.csiss.gmu.edu/ Geoweaver/) with your own dataset and observe how the provenance is documented and stored. Write a short report about the technical approach in Geoweaver for reproducible workflows and the role of provenance in it.

8 Open questions Share workflows, code, and dataset with other people, what will you do to record and demonstrate the provenance of the dataset and code?

References

375

Acknowledgments The work was supported by the National Science Foundation under Grants Nos. 2019609 and 2126315 and the National Aeronautics and Space Administration under Grant No. 80NSSC21M0028.

References Adadi, A., Berrada, M., 2018. Peeking inside the black box: a survey on explainable artificial intelligence (XAI). IEEE Access 6, 52138–52160. AI HLEG (High-Level Expert Group on AI), 2019. Ethics Guidelines for Trustworthy AI. European Commission, Brussels. 39 pp. Available: https://digital-strategy.ec.europa.eu/en/library/ethics-guidelines-trustworthy-ai. (Accessed 27 January 2022). Altintas, I., Berkley, C., Jaeger, E., Jones, M., Ludascher, B., Mock, S., 2004. Kepler: an extensible system for design and execution of scientific workflows. In: Proceedings of the 16th International Conference on Scientific and Statistical Database Management. Santorini, Greece, pp. 423–424. Amalina, F., Hashem, I.A.T., Azizul, Z.H., Fong, A.T., Firdaus, A., Imran, M., Anuar, N.B., 2019. Blending big data analytics: review on challenges and a recent study. IEEE Access 8, 3629–3645. Arrieta, A.B., Dı´az-Rodrı´guez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., Garcı´a, S., Gil-Lo´pez, S., Molina, D., Benjamins, R., Chatila, R., 2020. Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities, and challenges toward responsible AI. Inf. Fusion 58, 82–115. Artificial Intelligence Market (Markets and Markets Analysis), 2022. Available: https://www.marketsandmarkets. com/Market-Reports/artificial-intelligence-market-74851580.html. (Accessed 9 September 2022). Bedia, J., San-Martı´n, D., Iturbide, M., Herrera, S., Manzanas, R., Gutierrez, J.M., 2019. The METACLIP semantic provenance framework for climate products. Environ. Model Softw. 119, 445–457. Belle, V., Papantonis, I., 2021. Principles and practice of explainable machine learning. Front. Big Data 4, 25. https:// doi.org/10.3389/fdata.2021.688969. Castelvecchi, D., 2016. Can we open the black box of AI? Nat. News 538 (7623), 20–23. Cheney, J., Chiticariu, L., Tan, W.C., 2009. Provenance in Databases: Why, How, and Where. Now Publishers Inc, Hanover, MA, https://doi.org/10.1561/9781601982339. 100 pp. Cho, R., 2021. Artificial Intelligence a Game Changer for Climate Change and the Environment. https://news.climate. columbia.edu/2018/06/05/artificial-intelligence-climate-environment/. (Accessed 29 December 2021). Das, A., Rad, P., 2020. Opportunities and Challenges in Explainable Artificial Intelligence (XAI): A Survey. 24 pp. Available: https://arxiv.org/abs/2006.11371. Das, P., Ivkin, N., Bansal, T., Rouesnel, L., Gautier, P., Karnin, Z., Dirac, L., Ramakrishnan, L., Perunicic, A., Shcherbatyi, I., Wu, W., 2020. Amazon SageMaker Autopilot: a white box AutoML solution at scale. In: Proceedings of the Fourth International Workshop on Data Management for End-to-End Machine Learning, pp. 1–7. DataRobot, A.I., 2012. Cloud. https://www.datarobot.com/. (Accessed 18 January 2022). Datatron MLOps: Machine Learning Operations, 2016. https://datatron.com/. (Accessed 18 January 2022). Di, L., Yue, P., Ramapriyan, H.K., King, R.L., 2013. Geoscience data provenance: an overview. IEEE Trans. Geosci. Remote Sens. 51 (11), 5065–5072. Downs, R.R., Duerr, R.E., Hills, D.J., Ramapriyan, H.K., 2015. Data stewardship in the Earth sciences. D-Lib 21 (7/8). https://doi.org/10.1045/july2015-downs. Duerr, R.E., Downs, R.R., Tilmes, C., Barkstrom, B., Lenhardt, W.C., Glassy, J., Bermudez, L.E., Slaughter, P., 2011. On the utility of identification schemes for digital earth science data: an assessment and recommendation. Earth Sci. Inf., 139–160. https://doi.org/10.1007/s12145-0. Eisenman, I., Meier, W.N., Norris, J.R., 2014. A spurious jump in the satellite record: has Antarctic Sea ice expansion been overestimated? Cryosphere 8 (4), 1289–1296. Eker, J., Janneck, J.W., Lee, E.A., Liu, J., Liu, X., Ludvig, J., Neuendorffer, S., Sachs, S., Xiong, Y., 2003. Taming heterogeneity—the Ptolemy approach. Proc. IEEE 91 (1), 127–144. ESIP Data Preservation and Stewardship Committee, 2019. Data Citation Guidelines for Earth Science Data, Version 2. ESIP, https://doi.org/10.6084/m9.figshare.8441816.v1. Floridi, L., 2019. Establishing the rules for building trustworthy AI. Nat. Mach. Intell. 1 (6), 261–262.

376

14. Provenance in earth AI

Frost, L., 2020. Explainable AI and other Questions Where Provenance Matters. IEEE IoT Newsletter, pp. 03–04. https://iot.ieee.org/newsletter/january-2019/explainable-ai-and-other-questions-where-provenance-matters. Garfin, G., Jardine, A., Merideth, R., Black, M., LeRoy, S. (Eds.), 2013. Assessment of Climate Change in the Southwest United States: A Report Prepared for the National Climate Assessment. Island Press/Center for Resource Economics, pp. 1–533. Goodman, B., Flaxman, S., 2017. European Union regulations on algorithmic decision-making and a “right to explanation”. AI Mag. 38 (3), 50–57. Groth, P., Moreau, L., 2013. An Overview of the PROV Family of Documents, W3C. https://www.w3.org/TR/provoverview/. (Accessed 27 January 2022). Guidotti, R., Monreale, A., Ruggieri, S., Turini, F., Giannotti, F., Pedreschi, D., 2018. A survey of methods for explaining black box models. ACM Comput. Surv. 51 (5), 1–42. Gunning, D., 2019. DARPA’s explainable artificial intelligence (XAI) program. In: Proceedings of the 24th International Conference on Intelligent Users Interface. Marina del Ray, CA, USA, p. ii. Gunning, D., Aha, D., 2019. DARPA’s explainable artificial intelligence (XAI) program. AI Mag. 40 (2), 44–58. Gutierrez, J.M., Maraun, D., Widmann, M., Huth, R., Hertig, E., Benestad, R., R€ ossler, O., Wibig, J., Wilcke, R., Kotlarski, S., San Martin, D., 2019. An intercomparison of a large ensemble of statistical downscaling methods over Europe: results from the VALUE perfect predictor cross-validation experiment. Int. J. Climatol. 39 (9), 3750–3785. Hilger, J., Wahl, Z., 2022. Data catalogs and governance tools. In: Making Knowledge Management Clickable. Springer, Cham, pp. 187–192. Holzinger, A., Biemann, C., Pattichis, C.S., Kell, D.B., 2017. What do We Need to Build Explainable AI Systems for the Medical Domain?, p. 28. Available: https://arxiv.org/abs/1712.09923. IBM Green Horizon Project, China, 2016. https://www.ibm.com/blogs/internet-of-things/air-pollution-greeninitiatives/. (Accessed 29 December 2021). Intel study, 2018. Intel Newsroom: Applying Emerging Technology to solve Environmental Challenges. https:// newsroom.intel.com/editorials/intel-study-applying-emerging-technology-solve-environmental-challenges/ #gs.8c0dbs. (Accessed 29 December 2021). Iturbide, M., Bedia, J., Herrera, S., Ban˜o-Medina, J., Ferna´ndez, J., Frı´as, M.D., Manzanas, R., San-Martı´n, D., Cimadevilla, E., Cofin˜o, A.S., Gutierrez, J.M., 2019. The R-based climate4R open framework for reproducible climate data access and post-processing. Environ. Model Softw. 111, 42–54. Jaigirdar, F.T., Rudolph, C., Bain, C., 2019. Can I trust the data I see? A physician’s concern on medical data in IoT health architectures. In: Proceedings of the Australasian Computer Science Week Multiconference, Sydney, Australia, pp. 1–10. Jaigirdar, F.T., Rudolph, C., Oliver, G., Watts, D., Bain, C., 2020. What information is required for explainable AI?: a provenance-based research agenda and future challenges. In: Proceedings of the IEEE 6th International Conference on Collaboration and Internet Computing (CIC). Atlanta, GA, USA, pp. 177–183. Jain, S., Luthra, M., Sharma, S., Fatima, M., 2020. Trustworthiness of artificial intelligence. In: Proceedings of the 6th International Conference on Advanced Computing and Communication Systems (ICACCS). Coimbatore, India, pp. 907–912. Jentzsch, S.F., Hochgeschwender, N., 2019. Don’t forget your roots! using provenance data for transparent and explainable development of machine learning models. In: Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering Workshop (ASEW). San Diego, CA, USA, pp. 37–40. Kale, A., Nguyen, T., Harris Jr., F., Li, C., Zhang, J., Ma, X., 2022. TBD. Provenance Documentation to Enable Explainable and Trustworthy AI: A Literature Review. Data Intelligence. (Revision resubmitted). K€ ohl, M.A., Baum, K., Langer, M., Oster, D., Speith, T., Bohlender, D., 2019. Explainability as a non-functional requirement. In: 2019 IEEE 27th International Requirements Engineering Conference (RE), pp. 363–368. Krening, S., Harrison, B., Feigh, K.M., Isbell, C.L., Riedl, M., Thomaz, A., 2016. Learning from explanations using sentiment and advice in RL. IEEE Trans. Cogn. Dev. Syst. 9 (1), 44–55. Lanter, D.P., 1991. Design of a lineage-based meta-data base for GIS. Cartogr. Geogr. Inf. Syst. 18 (4), 255–261. Lanter, D.P., 1993. A lineage Meta-database approach toward spatial analytic database optimization. Cartogr. Geogr. Inf. Syst. 20 (2), 112–121. Lecue, F., 2020. On the role of knowledge graphs in explainable AI. Semantic Web 11 (1), 41–51. LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521 (7553), 436–444.

References

377

Lipton, Z.C., 2018. The mythos of model interpretability: in machine learning, the concept of interpretability is both important and slippery. Queue 16 (3), 31–57. Liu, S., Wang, X., Liu, M., Zhu, J., 2017. Towards better analysis of machine learning models: a visual analytics perspective. Vis. Inf. 1 (1), 48–56. Ma, X., 2018. Metadata. In: Schintler, L.A., McNeely, C.L. (Eds.), Encyclopedia of Big Data. Springer, Cham, Switzerland, p. 5, https://doi.org/10.1007/978-3-319-32001-4_135-1. Ma, X., 2021. Big data. In: Daya Sagar, B., Cheng, Q., McKinley, J., Agterberg, F. (Eds.), Encyclopedia of Mathematical Geosciences. Encyclopedia of Earth Sciences Series. Springer, Cham, https://doi.org/10.1007/978-3-030-260507_2-1. Ma, X., Fox, P., Tilmes, C., Jacobs, K., Waple, A., 2014a. Capturing provenance of global change information. Nat. Clim. Chang. 4 (6), 409–413. Ma, X., Zheng, J.G., Goldstein, J., Zednik, S., Fu, L., Duggan, B., Aulenbach, S., West, P., Tilmes, C., Fox, P., 2014b. Ontology engineering in provenance enablement for the National Climate Assessment. Environ. Model Softw. 61, 191–205. Mahendran, A., Vedaldi, A., 2015. Understanding deep image representations by inverting them. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Boston, MA, USA, pp. 5188–5196. Mayernik, M.S., Callaghan, S., Leigh, R., Tedds, J., Worley, S., 2015. Peer review of datasets: when, why, and how. Bull. Am. Meteorol. Soc. 96 (2), 191–201. Microsoft’s Earth AI Program, 2017. https://www.microsoft.com/en-us/ai/ai-for-earth-tech-resources# primaryR16. (Accessed 29 December 2021). Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J., 2013. Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119. Miller, T., 2019. Explanation in artificial intelligence: insights from the social sciences. Artif. Intell. 267, 1–38. Missier, P., Belhajjame, K., Cheney, J., 2013. The W3C PROV family of specifications for modelling provenance metadata. In: Proceedings of the 16th International Conference on Extending Database Technology, Genoa, Italy, pp. 773–776. Montavon, G., Lapuschkin, S., Binder, A., Samek, W., M€ uller, K.R., 2017. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recogn. 65, 211–222. Moreau, L., Groth, P., 2013. Provenance: An Introduction to PROV. Morgan & Claypool Publishers, San Rafael, CA, p. 113. Moreau, L., Groth, P., Miles, S., Vazquez-Salceda, J., Ibbotson, J., Jiang, S., Munroe, S., Rana, O., Schreiber, A., Tan, V., Varga, L., 2008. The provenance of electronic data. Commun. ACM 51 (4), 52–58. Murdoch, W.J., Singh, C., Kumbier, K., Abbasi-Asl, R., Yu, B., 2019. Interpretable Machine Learning: Definitions, Methods, and Applications. pp. 1–11. Available: https://arxiv.org/abs/1901.04592. Preece, A., Harborne, D., Braines, D., Tomsett, R., Chakraborty, S., 2018. Stakeholders in Explainable AI. Available: https://arxiv.org/pdf/1810.00184.pdf. Rasp, S., Pritchard, M.S., Gentine, P., 2018. Deep learning to represent subgrid processes in climate models. Proc. Natl. Acad. Sci. 115 (39), 9684–9689. Ribeiro, M.T., Singh, S., Guestrin, C., 2016. Why should i trust you? Explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, CA, USA, pp. 1135–1144. Showstack, R., 1998. NASA selects Earth science information partners. EOS Trans. Am. Geophys. Union 79 (5), 58. Singh, J., Cobbe, J., Norval, C., 2018. Decision provenance: harnessing data flow for accountable systems. IEEE Access 7, 6562–6574. Sun, Z., Di, L., Burgess, A., Tullis, J.A., Magill, A.B., 2020. Geoweaver: advanced cyberinfrastructure for managing hybrid geoscientific AI workflows. ISPRS Int. J. Geo Inf. 9 (2), 119. https://doi.org/10.3390/ijgi9020119. Sun, Z., Sandoval, L., Crystal-Ornelas, R., Mousavi, S.M., Wang, J., Lin, C., Cristea, N., Tong, D., Carande, W.H., Ma, X., Rao, Y., 2022. A review of earth artificial intelligence. Comput. Geosci. 159, 105034. https://doi.org/ 10.1016/j.cageo.2022.105034. Talia, D., 2013. Workflow systems for science: concepts and tools. ISRN Softw. Eng. https://doi.org/10.1155/2013/ 404525. Tenopir, C., Rice, N.M., Allard, S., Baird, L., Borycz, J., Christian, L., Grant, B., Olendorf, R., Sandusky, R.J., 2020. Data sharing, management, use, and reuse: practices and perceptions of scientists worldwide. PloS One 15 (3), e0229003. https://doi.org/10.1371/journal.pone.0229003.

378

14. Provenance in earth AI

Thiebes, S., Lins, S., Sunyaev, A., 2020. Trustworthy artificial intelligence. Electron. Mark., 447–464. Tilmes, C., Fox, P., Ma, X., McGuinness, D., Privette, A.P., Smith, A., Waple, A., Zednik, S., Zheng, J., 2013. Provenance representation for the National Climate Assessment in the global change information system. IEEE Trans. Geosci. Remote Sens. 51 (11), 5160–5168. Van Lent, M., Fisher, W., Mancuso, M., 2004. An explainable artificial intelligence system for small-unit tactical behavior. In: Proceedings of the National Conference on Artificial Intelligence. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press; 1999, pp. 900–907. Wing, J.M., 2020. Ten research challenge areas in data science. Harvard Data Sci. Rev. 2 (3). https://doi.org/ 10.1162/99608f92.c6577b1f. Wing, J.M., 2021. Trustworthy AI. Commun. ACM 64 (10), 1–12. Zeng, Y., Su, Z., Barmpadimos, I., Perrels, A., Poli, P., Boersma, K.F., Frey, A., Ma, X., de Bruin, K., Goosen, H., John, V.O., 2019. Towards a traceable climate service: assessment of quality and usability of essential climate variables. Remote Sens. 11 (10), 1186. https://doi.org/10.3390/rs11101186. Zhu, J., Liapis, A., Risi, S., Bidarra, R., Youngblood, G.M., 2018. Explainable AI for designers: a human-centered perspective on mixed-initiative co-creation. In: 2018 IEEE Conference on Computational Intelligence and Games (CIG). Maastricht, Netherlands, pp. 1–8.

C H A P T E R

15 AI ethics for earth sciences Pablo Rivasa,b, Christopher Thompsonb, Brenda Tafurc, Bikram Khanalb, Olawale Ayoaded, Tonni Das Juib, Korn Sooksatrab, Javier Orduze, and Gissella Bejaranof a

Center for Standards and Ethics in Artificial Intelligence, Baylor University, Waco, TX, United States bDepartment of Computer Science, School of Engineering and Computer Science, Baylor University, Waco, TX, United States cIngenieria de Sistemas, Facultad de Ingenieria y Arquitectura, Universidad de Lima, Provincia y Departamento de Lima, Lima, Peru dDepartment of Physics, College of Arts and Sciences, Baylor University, Waco, TX, United States eDepartment of Mathematics and Computer Science, Earlham College, Richmond, IN, United States f Department of Computer Science, Marist College, Poughkeepsie, NY, United States

1 Introduction The field of artificial intelligence (AI) continues to grow rapidly as we experience technological progress. Many systems considered to have some kind of intelligence are being assimilated and integrated into day-to-day operations. Drivers take their hands off the steering wheel and vehicles can find a way to navigate a road. Doctors and clinics can analyze their datasets more quickly to provide better healthcare. Today millions of consumers interact with computer programs designed to establish and sustain an information link with human beings with the purpose of achieving a specific goal, e.g., to know weather conditions in different locations, to receive air quality warnings, and even to quantify the risks associated with climate change. Some of these intelligent programs are recently fueled by AI. With the advent of AI and its ubiquitousness, there are questions in need of answers and growing ethical concerns that need to be addressed as consumers will no longer be able to distinguish whether they are interacting with human-curated or AI-based information. Some of these questions or concerns arise because people have specific feelings about technology given past experiences. Just to name a few recent examples, in 2015 a developer used an API that provided genetic information to deny access to an App, causing outrage from the

Artificial Intelligence in Earth Science https://doi.org/10.1016/B978-0-323-91737-7.00007-4

379

Copyright # 2023 Elsevier Inc. All rights reserved.

380

15. AI ethics for earth sciences

community, and such technology can be used by a bot to make similar decisions (Winter, 2015). In 2016, most people heard about Microsoft’s bot that posted messages on social media that very soon learned from people’s interactions and posted messages that were categorized as incredibly racist. And although today we see that with humor (Davis, 2016), we at the same time try to understand what happened, how to prevent that from happening in the future, and how we as human beings perceive such events passing judgment on such technologies’ morally and ethically (Rivas et al., 2018). This chapter will discuss certain considerations that need to be understood when implementing AI systems within the earth sciences. In this chapter, we outline several of the many standards, best practices, and paths toward more ethical AI in the earth and environmental sciences that include more open datasets and unbiased algorithms.

2 Prior work There is a wealth of research focused on the ethical problems caused by AI when it is in operation ( Jobin et al., 2019). Critics have examined the relationship between the role cultural bias plays in algorithmic automated decisions (Eubanks, 2018) and how AI systems can lead to discrimination of undersampled and underrepresented groups (Buolamwini and Gebru, 2018). We can foresee many forms of regulation and laws regarding Earth AI ethics in the near future. With the advent of large machine learning (ML) models for text, e.g., GPT3 (Brown et al., 2020) or T5 (Raffel et al., 2020), images, e.g., ViT (Dosovitskiy et al., 2020) or DDPM (Ho et al., 2020), or a combination of both, e.g., CLIP (Radford et al., 2021) or DALL-E (Ramesh et al., 2021), we have observed several ethical concerns from the community in relation to biases that arise from training from large datasets. These large models are extremely helpful in representing knowledge and inferring characteristics of the data in textual or visual form. For example consider Fig. 1, which shows original, unique images automatically imagined and produced by ML using DALL-E. The concerns expressed on these types of models have been presented recently in different forums, including the work by Bender et al. (2021) on large language models. However, these concerns translate into vision models as well. In order to mitigate and prevent these, we will be discussing the work of the community that might be helpful for earth scientists using or developing AI technology. In the next section, we will begin discussing the IEEE Standard 7000a “Model Process for Addressing Ethical Concerns During System Design.”

3 Addressing ethical concerns during system design AI-based earth science systems thrust new scientific perspectives across the globe. These systems are particularly important for climate risk assessment and critical decision-making,

a

Available at https://www.techstreet.com/ieee/standards/ieee-7000-2021?product_id¼2109271.

3 Addressing ethical concerns during system design

(a)

(b)

(c)

(d)

381

FIG. 1 Images (A)–(D) are results from asking DALL-E to generate images based on the following prompt: “A picture of artificial intelligence technology applied to the planet earth viewed from space.”

e.g., determining if a disaster may hit, which areas need precautionary planning, how long the weather effects may last, or the extent of damages a disaster may cause. These systems are essential to pursue the safety of human beings in sight of climate change-related events. However, privacy, fairness, and transparency are issues that stakeholders increasingly care about to ensure trustworthiness. These are fundamental reasons a standardized, ethical system design should consider societal values, efficiency, and effectiveness. The IEEE Standard 7000 is a standard that aims to guide organizations to build products that consider societal values (Systems and S. E. S. Committee, 2021). The standard helps identify stakeholders’ priorities and values; it establishes processes and projects that promote societal values during system design and development. According to the standard, developers should use processes to design new systems or software and improve existing systems attributing ethical issues. Engineers, technologists, and other stakeholders should follow methodologies to conduct the processes by identifying

382

15. AI ethics for earth sciences

and analyzing the ethical concerns of end-users. The sets of processes that a team called ethical value engineering project team generally involve leading other groups into presenting end-users expectations of the system from their perspective and obtaining and ranking values for approval by management and stakeholders as a basis for the system’s design. The standard also outlines a few important processes. The Ethical Requirements Definition process requires analyzing ethical value requirements and validating them by recording the system’s perspective of interests. The Ethical Risk-Based Design analyzes value-based system requirements and the appropriate risk treatment options. The Transparency Management process maintains the availability of appropriate information for internal and external, short-term, and long-term stakeholders. These processes ensure transparency between stakeholders and organizations, accountability of the product, the privacy of the stakeholders, and fairness of the system. The IEEE 7000 Standard is convenient while designing systems regarding weather, climate, or other critical earth science applications. The descriptions of the processes required for maintaining the standard ensure the societal values for system design. The inputs and outcomes of the processes can be organized for easy tracking and adequately informing the different stakeholders. This standard has been cited several times since its inception, giving it great credibility in the community. This also indicates that scholars are concerned regarding maintaining societal values for AI-based systems. A rising citation number can indicate growing concerns about the ethical values associated with AI-based systems. Moreover, maintaining stakeholders’ trust is a sensitive issue for earth science, AI-based systems that estimate climate changes and damages or advantages due to climate changes. This standard helps ensure ethical values during system design. AI-based earth system designers can follow this standard and be cautious about ethical values associated with the product while preserving the efficiency and quality of the product. The next section discusses the IEEE Standard P7003b entitled “Algorithmic Bias Considerations” and its relevance to earth systems.

4 Considerating algorithmic bias The IEEE P7003 Standard for Algorithmic Bias Consideration is an IEEE ethics-related standard that started to be developed in 2017 as a part of the IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems. The proposed standard is currently undergoing its last revision and aims to provide a development framework to individuals or organizations responsible for developing and deploying an automated algorithm to identify and mitigate unintended bias (Koene et al., 2018). Automated algorithms that support decision-making perform a function that could produce unwanted results for different users exhibiting a form of bias. The standard deals with a methodology for mitigating such forms of bias. Many available recommender algorithms implemented by the industry are user biased, which psychologically impacts the users’ decision-making instinct without their consent. In particular, social media and streaming industries are known to systematically manipulate

b

Currently under development, and a summary is available at https://fairware.cs.umass.edu/papers/Koene.pdf.

4 Considerating algorithmic bias

383

people’s surroundings and environment in the digital world to analyze users’ behavior (Grimmelmann, 2015). The P7003 standard will allow the creators of such algorithms to appropriately communicate and disclose to users any known and potential negative impact on users. According to the standard, during the early development of an algorithm, P7003 will guide the developer to correctly identify the targeted user groups and any subgroups that may require special consideration, e.g., visual impairments. The standard requires the developer to validate system performance through carefully chosen validation and test datasets, e.g., the one shown in Fig. 2; thus, the developer must verify that the designed system uniformly serves the targeted groups. P7003 challenges the system designer to document and socially justify the implemented criteria. The P7003 standard will aid in ensuring the nonexistence of unjustified and inappropriate bias via various methodologies to eliminate or mitigate such biases. According to (Koene et al., 2018), the P7003 standard will contain three important sections: • Foundations. The section includes a description of legal frameworks related to the psychology of bias and its cultural contexts. It aids the system designer in engaging and understanding common ethical issues. • System design and implementation. This section discusses data evaluation, algorithmic processing, resilience assessment, scope assessment, and transparency. It outlines how to use frameworks to identify unintended bias issues. • Use cases. The section forms an annex to P7003 listing different illustrative examples. These examples would should how algorithms can result in unintended bias and how the P7003 can be used to address the problem. The authors of (Koene et al., 2018) mention an essential fact about the team developing the standard, it includes participants from various backgrounds, such as civil society organizations, industry, and academia; all voices were treated equally even though participants come from different backgrounds. The essence of P7003 is to ensure that automated decision-making algorithms or systems prioritize ethical considerations concerning bias and fairness and disclose identified issues in terms of services. The proposed standard is creative in capturing unknown biases and measures to resolve them.

FIG. 2 A traditional hold out test set, and a fivefold cross-validation strategy.

384

15. AI ethics for earth sciences

If appropriately adopted, the P7003 standard can guide proper data collection to prevent bias and careful model evaluation to identify bias, thereby protecting end users and stakeholders. The AI ethics standard can guide the design of future decision-making algorithms, particularly in the earth sciences, fostering privacy and fair service to end-users on a global scale. In the next section, we discuss in more technical detail how ontologies can help the ethical design of autonomous systems by following the IEEE Standard 7007c “Ontological Standard for Ethically Driven Robotics and Automation Systems.”

5 Designing ethically driven automated systems As AI has developed, it has become the subject of much philosophical discussion, particularly in the ethics surrounding its creation and use. While the autonomy of AI is undeniably a boon to humans, its rational process must be informed by the morals and values on which we build our society. The question then requires understanding how AI should interact responsibly with humans. Within philosophical discourse, a baseline understanding must be agreed upon by all parties before they can make progress (it is for this reason that the Platonic dialogues are primarily composed of Socrates defining terms with his fellow interlocutors). This process generates an ontology: a formal definition of concepts in a knowledge domain, emphasizing those concepts’ properties and their relations. Thus, given the world-changing potential of AI, creating a modern ontology of ethical considerations for AI has become altogether necessary. The IEEE Ontological Standard for Ethically Driven Robotics and Automation (R&A) Systems (Standing Committee for Standards Activities, 2021) is such an exhaustive account. This standard—classified as IEEE Std 7007-2021—builds on previous work in the 2015 standard IEEE Std 1872-2015 (Schlenoff et al., 2015). The modern standard describes a generalized ethical paradigm with application to all kinds of robots and systems. IEEE Std 7007 seeks to address four subdomains within the broader domain of Ethically aligned Robotics and Automation Systems (ERAS): 1. Norms and Ethical Principles (NEP), engaging the norms of expected behaviors by autonomous systems and “norm-aware” agents. 2. Data Protection and Privacy (DPP), describing the obligations an autonomous system has to respect regarding the boundaries of protected and private data. 3. Transparency and Accountability (TA), concerning the importance of informative and unambiguous explanation for an autonomous system’s motivations and plans. 4. Ethical Violation Management (EVM), detailing the detection, assessment, and management of ethical violations in autonomous system behavior, including questions of governing accountability, responsibility, and legal personhood.

c

Available at https://www.techstreet.com/ieee/standards/ieee-7007-2021? gateway_code¼ieee&vendor_id¼7070&product_id¼2217375.

5 Designing ethically driven automated systems

385

Like many ethical treatises written in classical and contemporary philosophy, Std 7007 communicates its ideas in the language of formal logic. Logic forms the foundation of computer science, so this presentation mirrors the theoretical implementation of these ethical standards in R&A systems (notably, without respect to a particular programming language). Further, a logical basis lends itself to more trustworthy earth AI. There are two ways in which a developer can prove the trustworthiness of their creation: by induction, achieved through rigorous application of test cases, or deduction, achieved through formal proof. Inductive reasoning is unreliable for this purpose, and the fear of failure over an unaccounted-for test case is a potent factor fueling public suspicion of AI. On the other hand, the establishment of ethics by logical proof presents a sense of surety otherwise unachievable; one could prove that the system will behave as expected, that is, ethically (Bringsjord et al., 2006). Establishing a standard ontology ensures effective interdisciplinary communication—it is vital not only in the design and implementation of R&A systems but also in their field application and subsequent performance analyses. Those who utilize these systems are granted essential insight beyond the fold of technical abstraction: a glimpse into those technologies’ ethical intentions, limitations, and assurances. Likewise, the ethicists who evaluate the system’s public performance gain a framework that streamlines the process of philosophical discourse, as noted earlier. Effective implementation of R&A systems can vastly improve many facets of human life. Among these, it serves to augment humanity’s stewardship of the Earth and to gird us against the wilds of nature. Throughout history, society has shown itself to be poorly aware of environmental conservation, and it is only with the visible effects of ignorance—climate change and degradation of the environment—that we have learned to repent. While it may be difficult for humanity to comprehend the utter paradigm shift necessary, impartial automated systems have no such biases or preconceptions. Indeed, AI and R&A systems allow us to forecast our Earth’s future and determine what actions we must take toward conservation. However, we must also acknowledge that these systems remain lifeless and intrinsically devoid of morality. Ethics are but a construct to the machine, and we must be explicit in implementing that construct. Consider how an R&A system may perceive the issue of climate change, given no ethical guidance. The classic example involves seeing humankind as a blight upon the Earth and exterminating us to let nature heal. While this is perhaps an extreme example, it demonstrates the soullessness of our technological oracles. A solution is unacceptable if it harms human beings or inhibits the well-being of any group. On the other hand, consider how an R&A system may be used to mitigate the impact of an earth disaster. Without understanding the value of human life, the system may underestimate the risk posed to those nearby or weigh other factors as being more critical in its calculations (Sun et al., 2022). Establishing an ethical ontology amounts to quantizing the human moral experience and transplanting it into a computerized system. The development of Std 7007 involved a concentrated effort of both academic and industry professionals and represented the work of multiple underrepresented groups. Such diversity speaks well for the ultimate purpose of this standard: that it may be a unifying force across myriad disciplines toward more ethical systems. This standard is highly relevant to Earth AI and AI as a whole. It represents an effort of standardization that will continue to be necessary as long as humankind harnesses AI,

386

15. AI ethics for earth sciences

cutting-edge AI requires an advanced system of ethics. The growth from Std 1872 to Std 7007 exemplifies the value of keeping ontology current. Active updates ensure that the technology grows alongside our society, neither falling behind nor vaulting ahead. Ultimately, the goal is to align AI with our system of values and integrate it as an ethical agent in a society bound by human morality. Main takeaways: Regardless of the level of comfort of the reader with using ontologies, they are instrumental. Even if ontologies are lightly adopted across earth science and engineering; however, where they are adopted, e.g., biomedical, auto safety, or manufacturing, then the ontology allows for interoperability between the ethical ontology and connected aspects of the domain-specific ontology. Where no subdomain ontology exists, the value is less clear. Still, the commonality allows for safety and ethical frameworks to be interwoven in systems that, for instance, put out alerts or additional design-time requirements for earth sciences.

6 Assessing the impact of autonomous and intelligent systems on human well-being The promising concept of building a digital twin of the Earth (Bauer et al., 2021) exemplifies the potential benefits in the fields of novel Earth observations, analysis, and modeling, which are rapidly expanding our understanding of climate change and Earth Science Systems. Novel approaches to Earth Science are enabled by emerging breakthroughs in AI, such as spectroscopic tool design and data query performance, analyzing flood and landslide by combining remote sensing imagery and ML models, and so on. The potential of applying these autonomous and intelligent systems to Earth science problems has been established, and there are still many obstacles and opportunities for academics in the field to overcome. Human well-being is impacted directly and indirectly by scientific technology’s potential applications of autonomous and intelligent systems on earth. For this reason, careful and unbiased considerations must be made during these systems’ development, implementation, or deployment. At each stage of the process, various questions must be asked and addressed, the most important of which concerns the impact on human safety. For example, what are the possible effects on human well-being? the probability of their occurrence? and how are negative impacts on human well-being considered and mitigated? These are considerations within the IEEE Recommended Practice for Assessing the Impact of Autonomous and Intelligent Systems on Human Well-Being.d Within the broad discipline of earth science, AI has been substantially incorporated into several subfields, such as satellite meteorology and oceanography (e.g., pattern recognition and classification); hybrid climate and weather numerical models, and data assimilation systems (e.g., fast forward models for data assimilation and fast emulation of physical processes); geophysical data fusion and data mining; interpolation, nonlinear multivariate statistical analysis, downscaling; hydrologic applications (e.g., modeling rainfall–runoff

d

The IEEE Std. 7010 is available online: https://ieeexplore.ieee.org/document/9084219.

7 Developing AI literacy, skills, and readiness

387

relationships, flood forecasting, precipitation forecasting) (Cherkassky et al., 2006; Liu and Weisberg, 2011; Wang et al., 2008; Looms et al., 2008; ASCE Task Committee on Application of Artificial Neural Networks in Hydrology, 2000). As a result, AI applications may have an impact on a variety of these subfields. The use of AI can have a positive effect, such as in the work of Avand et al. on using ML models, remote sensing, and GIS to investigate the impact of changing climates and land uses on flood probability (Avand et al., 2021). The authors use the software TerrSet, and Lars-WG were used to survey and project Land use and land cover (LULC) and climate changes, respectively, and to map flood probability, ML models such as random forest (RF), and Bayesian generalized linear model (GLMbayes) were deployed. According to recent reports, mental health difficulties and mental health demands after natural disasters constitute a resource dilemma for primary care physicians. It is estimated that 4%–5% of survivors of a large-scale natural catastrophe would acquire posttraumatic stress disorder (PTSD) (Maghrabi, 2012). According to the World Health Organization’s (WHO) comprehensive definition of health, a complete condition of physical and mental well-being, mental health requires postdisaster support from the entire “disaster community.” The use of AI systems, such as the work in (Avand et al., 2021), in conjunction with preexisting data, can aid in the prediction of future flood occurrence/patterns, as well as the early detection of other humanitarian disasters such as storm surges, forest fires, landslides, and hurricanes, among other things, that potentially impact human existence. Main takeaways for Earth AI researchers: The IEEE Std 7010 (IEEE Standards Committee, 2020), enables researchers across multiple disciplines to have increased access to well-being data, thus improving the understanding impact of Earth AI systems on human well-being. The standard can help identify Earth AI systems’ impact on human well-being from a multidisciplinary perspective. For example, earth scientists may use wireless sensor network data; computer scientists will use ML to extract features from MODIS data; data scientists will mine social media, etc. The standard will benefit researchers sharing these different data modalities and their understanding of the human impact by gathering specifics, validating indicators, and measuring certain data qualifications. IEEE Std 7010 will also facilitate iterative work in data collection stages, favoring better collective data. These could yield new research for the benefit of the public and the societal good.

7 Developing AI literacy, skills, and readiness Technology permeates finance, health, sociology, economics, and science, such as chemistry, physics, and mathematics. This system, illustrated in Fig. 3, creates new topics to research, new materials, new paradigms, and even other new systems. The outputs typically become inputs, and this machinery moves forward. Society finds this system very productive, improving with every iteration since we discover new applications, technologies, or answers. This system contains a significant component: education. It requires continuing education, particularly in science, technology, engineering and mathematics areas that need outreach programs at middle school and beyond.

388

15. AI ethics for earth sciences

FIG. 3 The role of literacy as a societal system.

This IEEE Standard P7015e will focus on literacy in the engineering area of AI and will deal with concerns regarding competencies, technical backgrounds, researchers, and a dynamic educational system (Long and Magerko, 2020). This literacy should strengthen AI concepts, provide skills and abilities for the next generation of scientists, and give strengths to face new challenges. In our computational systems, we want that data to contain correct information. We want to reduce the loss of information in agile systems. We achieve good results when our systems move information faster without losing information. This standard promotes AI literacybuilding efforts and data globalization. It also allows designing policy interventions through operational frameworks; it further tracks progress and evaluates outcomes. Under this context, this standard defines a common set of definitions, languages, and understanding of data and AI literacy, skills, and readiness (IEEE-Computer-Society, 2022). This standard enables the respective skills and competencies, widely taught as a transdisciplinary area. This area contains different fields and perspectives, and any person can acquire those with an interest in data, AI, and ethics. This standard boosts AI and data education at different levels of education. Main takeaway: the standard will enable Earth AI researchers to have a trusted process for training their research associates, students, and other participants, ensuring the safety and trustworthiness of their AI-based research products.

8 On documenting datasets for AI Recently, researchers have set forth data documentation practices referred to as datasheets for datasets (Gebru et al., 2021). Writing datasheets for datasets is relevant in AI Ethics for Earth

e

The IEEE Standard P7015 is currently under production, and is currently entitled “Standard for Data and Artificial Intelligence (AI) Literacy, Skills, and Readiness”. For more information, see https://standards.ieee.org/ieee/7015/ 10688/.

8 On documenting datasets for AI

389

Sciences to avoid bias and correctly use datasets. According to this practice, every dataset should contain seven sections: motivation, composition, collection process, preprocessing/ cleaning/labeling, uses, distribution and maintenance. In the motivation stage, authors from the datasets should specify their purpose, who created it, and any funding received. Regarding composition, they should identify in more detail the instances, what they consist of, if there is confidential or sensitive data or data that could be offensive, and if there is any recommendation for splitting the data in training, validation, and testing. The collection process should be detailed. For example, whether the data were observed directly or an indirect collection mechanism was used, as well as the collection devices and whether it is a complete data set or a sample of the data set. In the latter case, it should be detailed what mechanism was used to ensure an adequate sampling strategy and how datasets’ owners validated this process. In addition, it is essential to mention the data collection timeframe, the participants’ informed consent, and whether there was an ethical review process. Likewise, in the preprocessing/cleaning/labeling stage, each of the processes should be mentioned in detail, as well as the software used to carry it out. The uses stage must detail which tasks can be performed with this dataset and which cannot. As well as any repository containing scientific articles or systems that use the dataset and any step in the preprocessing that may impact future uses of the dataset. On the distribution side, dataset authors should mention when and how the dataset is distributed and whether it has an identifier (i.e., DOI) or any copyright, license, or regulatory restrictions. Finally, maintenance details will be in charge of uploading updates to the dataset, the dataset owner’s contact details, and any mechanisms for third parties to contribute to the dataset (Gebru et al., 2021). There are also different motivations for documenting datasets. For example, knowledge preservation facilitates knowledge transfer from one person to another. Thus, when employees rotate positions, it saves time and effort in understanding the tasks of previous workers and projects. The second point is inter-organizational accountability, which refers to the traceability of information and those responsible for it. In addition, auditing becomes much simpler when documentation exists. Also, regulatory intervention could push more organizations to document. However, documentation should be considered a mere mandatory task and a fundamental purpose of transparency (Miceli et al., 2021). The success of this initiative can be seen in companies adopting this type of documentation for their datasets, such as Google, Microsoft, and IBM, who launched a pilot following an initial draft of (Gebru et al., 2021). In addition, they have brought new proposals to the table, such as model cards for tracking their ML models and data cards, which are lighter versions of a datasheet. For example, in model cards (Mitchell et al., 2019), the dataset’s authors should specify the intended use, factors, metrics, and ethical considerations to help increase the dataset’s transparency. It also addresses going further so auditors can perform adversarial testing and qualitative and quantitative analysis of the algorithms used. On the other hand, it is essential to standardize the AI Ethics for Earth Sciences to avoid misuse or biased uses of these datasets that the creators cannot prevent. Therefore, having a datasheet with all the precise information on the correct use of the dataset will reduce misunderstandings that may occur intentionally or unintentionally. One of the concerns specified in such an article is the motivations for collecting data. It is necessary to know the purpose, what it will be used for, to what extent, and in what form. Likewise, in some cases of data collection, bias may occur due to the costly nature of this task. In addition, the dataset’s

390

15. AI ethics for earth sciences

TABLE 1

Template sections of datasheets for datasets (Gebru et al., 2021).

Datasheet for XYZ dataset from MODIS Motivation: Detailed description… Composition: Detailed description… Collection process: Detailed description… Preprocessing/cleaning/labeling: Detailed description… Uses: Detailed description… Distribution and maintenance: Detailed description…

authors must ensure that the data collected do not favor a particular cause or are of specific interest (Record and Vera, 2021). Also, the World Meteorological Organization has developed several efforts to establish guidelines to guide data standardization for collecting and exchanging climate data. Similarly, the United Nations Committee of Experts on Global Geospatial Information Management (UN-GGIM) has provided recommendations on managing geospatial data in various countries. The objective is to make the data trustworthy, easy to use, and accessible (Peng et al., 2021). Furthermore, the software field already uses documentation; however, since dataset documentation is not widely extended, it has transformative potential as more people and organizations adopt it. In general, it is an initiative that the background of the correct use of datasets is quite valuable for the community. In this way, it would be beneficial to apply it in AI Ethics for Earth Sciences because it is a field that can be sensitive if the data provided as datasets are not interpreted correctly or for the purposes for which they were created. Main takeaways: The contribution of Gebru’s paper has been significant, and it could have an even more impact on correctly using datasets and avoiding misinterpretations. Although a slight modification of some questions proposed by Gebru, the extension of the main idea of documentation to other areas such as Machine Translation shows the impact of this paper on the community (Costa-jussa` et al., 2020). Finally, this dataset documentation initiative impacts the Earth AI community and has given rise to new documentation proposals in large technology companies. For a sample template, you may use Table 1 to get started.

9 On documenting AI models Model cards were proposed by Mitchell et al. (2019) to help stakeholders involved in ML models (e.g., ML practitioners and software developers). ML practitioners could note important information about their ML models for further use by other people. Also, they could compare their ML models to other ML models by using the cards. Further, software developers

10 Carbon emissions of earth AI models

391

that would like to use some ML models for predictions could know if the models fit their problem specifications. Generally, a proposed card contains nine sections: model details, intended use, factors, metrics, evaluation data, training data, quantitative analysis, ethical considerations, and caveats and recommendations. Each section can be briefly described as follows: • The model details section summarizes brief information about the model. • The intended use section explains the use cases of the model. • The factors section describes some factors affecting the performance or fairness of the model. • The metrics section specifies what metrics are used for evaluating the model. • The evaluation data section provides details about the data used for analyzing the model. • The training data section describes the data used to train the model. • The quantitative analysis section reports the results of the evaluation of the model. • The ethical considerations section demonstrates the model’s ethical challenges and possible solutions for those. • The caveats and recommendations section points out any additional concerns regarding the model. Essentially, AI-based earth sciences technology can benefit from the model cards. For example, using a cyclone detector trained from satellite data (Kim et al., 2019) may benefit from model cards. Suppose you expect to use an ML model for detecting a cyclone in a desert area. In that case, you can use a model card to check if any model is trained by training data obtained from a desert area in the training data section or having a good outcome with desert area data in the evaluation data and quantitative analyses sections. Moreover, you can check the ethical considerations section to find any cyclone detector that is fair for many environments. Other sections are also helpful in several circumstances, and different kinds of AI-based earth sciences technology can benefit from these model cards. Main takeaways: Model cards are important for ML development in the earth sciences since ML practitioners or anyone using ML models can find the cards helpful, as described earlier. An explanation of a model card in (Mitchell et al., 2019) is well-organized and easy to follow. Also, the authors are from Google and the University of Toronto. Therefore, they are well-qualified. In the paper, the authors created model cards as examples for a smiling classifier trained and evaluated by CelebA dataset (Liu et al., 2015) and a toxicity classifier evaluated on the test set in (Dixon et al., 2018). These sources are available online. You may use the template in Table 2 to get you started.

10 Carbon emissions of earth AI models Before we conclude this chapter we would like to take a moment and discuss a topic that is gaining traction in responsible AI practices, namely, the calculation of carbon (CO2) emissions of ML models (Kochanski et al., 2019; Schwartz et al., 2020). The main premise here is to make a reasonable estimate of the time a model takes to train and the hardware used.

392

15. AI ethics for earth sciences

TABLE 2 Template sections of model cards for model reporting (Mitchell et al., 2019). Model card for XYZ convolutional-LSTM Model details: Detailed description… Intended use: Detailed description… Factors: Detailed description… Metrics: Detailed description… Evaluation data: Detailed description… Training data: Detailed description… Quantitative analysis: Detailed description… Ethical considerations: Detailed description… Caveats and recommendations: Detailed description…

We will give only one example here about a database that is available through Hugging Face, a popular ML hub.f Given that a popular Earth AI type of model might be a computer vision model, we will demonstrate how to retrieve the CO2 emissions of Vision Transformer (ViT) models. pip install huggingface_hub - U from huggingface_hub import HfApi

Here, the HfApi enables us to do a number of things, including searching the model cards of models and retrieve their CO2 emissions information, as follows: api = Hf Api () models = api . list_models ( search =’ ViT ’, card Data = True ) for model in models : if hasattr (model , ’ cardData ’): if " co2 _eq_emissions " in model . card Data : print (’ Model is:’, model . model Id ) print (model. card Data [" co2_eq_emissions "], " grams of CO2 emitted during training .")

This code will search for models that have ”ViT” as part of the name, indicating they belong to the vision transformer family, and print their CO2 emissions. The output would look like this:

f

https://huggingface.co/blog/carbon-emissions-on-the-hub.

12 Assignments

393

Model is: abhishek/autotrain_cifar10_vit_base 32.869648157119876 grams of CO2 emitted during training. Model is: abhishek/autotrain_fashion_mnist_vit_base 0.2438639401641305 grams of CO2 emitted during training.

These ViT models may have low CO2 emissions in comparison to language models, like GPT2 or GPT3, which may have tons of CO2 emissions. If you are interested in tracking your own emissions, please look into the TrainingArguments class in the transformers library, which continues to provide alternatives for CO2 emissions logging. While this is not standard practice yet, we believe it will be very soon.

11 Conclusions Engineers should develop Earth AI ethics-related logic by partnering with social scientists, ethicists, and philosophers who have been studying AI’s social implications in policing, law, and finance. This includes developing a guideline for ML researchers to engage with ethics as not only a philosophical project but also a pragmatic one where the collection of data and the use of particular models over others have direct impacts on ecosystems and humans. Last but not least, we believe that communicating one’s use of any ML or AI to the broader community and its impacts will be necessary for achieving a fair and ethical movement in AI in geosciences. For example, an automated method for developing land cover maps may directly impact representations of Indigenous land; this should be disclosed to the public. The societal implications of careless ML models on data collected unprofessionally are dangerous; we all need to care and train our research partners moving forward in the world of Earth AI.

12 Assignments 1. Use the IEEE 7007-2021 Standard use case template for the following Earth AI-based autonomous & robotics system: A system that uses a multimodal ML model to deploy firefighting drones to human settlements in case of an emergency. The system uses NASA’s MODIS and GOES imagery along with real-time readings from sensors deployed by homeowners. Note: this use case is intentionally vague in some aspects, so make some assumptions about the data, the model, the actions, and the principles of the system owners. The exercise is intended for the reader to use the standard, not to wrestle with philosophical aspects of the system. 2. Suppose that you have received a grant to develop an Earth AI system for the quantification of risks of skin cancer due to environmental pollution in your country, to identify the geographical regions at more considerable risk and the best course of treatment for the types of pollution. The grant requires evaluation indicators for different stages of the project. Because the system can affect minors, your team knows this requires additional precautions and careful decision-making. This has severe implications in the health monitoring of adults and children for extended periods of time, as well as satellite readings from different hyperspectral sensors. Because of this, you and your team

394

15. AI ethics for earth sciences

decided to use IEEE Std 7010 to guide the development of the Earth AI system for the early identification of risk and course of action. • Determine how the standard will benefit you. • Make a plan to implement the standard in your research plan. 3. Using Table 1 as a template, create a datasheet for any of the datasets you have produced; alternatively, choose a dataset referred to in this book, and do your best to complete the datasheet. If there is not enough information, make your own assumptions in order to complete this exercise. 4. Using Table 2 as a template, create a model card for at least one of the models described in this book’s chapter entitled “Spatiotemporal attention ConvLSTM networks for predicting and physically interpreting geo-event dynamics.” If there is not enough information, make your own assumptions in order to complete this exercise.

13 Open questions 1. What is the main purpose of the IEEE Standard 7000, and why is it relevant today? 2. If appropriately adopted, how could the IEEE Standard P7003 benefit society? 3. Why are AI ethics standards important for Earth AI research and development?

References ASCE Task Committee on Application of Artificial Neural Networks in Hydrology, 2000. Artificial neural networks in hydrology. II. Hydrologic applications. Journal of Hydrologic Engineering 5 (2), 124–137. Avand, M., Moradi, H., et al., 2021. Using machine learning models, remote sensing, and gis to investigate the effects of changing climates and land uses on flood probability. J. Hydrol. 595, 125663. Bauer, P., Dueben, P.D., Hoefler, T., Quintino, T., Schulthess, T.C., Wedi, N.P., 2021. The digital revolution of earthsystem science. Nat. Comput. Sci. 1 (2), 104–113. Bender, E.M., Gebru, T., McMillan-Major, A., Shmitchell, S., 2021. On the dangers of stochastic parrots: can language models be too big? In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610–623. Bringsjord, S., Arkoudas, K., Bello, P., 2006. Toward a general logicist methodology for engineering ethically correct robots. IEEE Intell. Syst. 21 (4), 38–44. https://doi.org/10.1109/MIS.2006.82. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al., 2020. Language models are few-shot learners. Adv. Neural Inform. Process. Syst. 33, 1877–1901. Buolamwini, J., Gebru, T., 2018. Gender shades: intersectional accuracy disparities in commercial gender classification. In: Conference on Fairness, Accountability and Transparency, PMLR, pp. 77–91. Cherkassky, V., Krasnopolsky, V., Solomatine, D.P., Valdes, J., 2006. Computational intelligence in earth sciences and environmental applications: issues and challenges. Neural Netw. 19 (2), 113–121. Costa-jussa`, M.R., Creus, R., Domingo, O., Domı´nguez, A., Escobar, M., Lo`pez, C., Garcia, M., Geleta, M., 2020. Mt-Adapted Datasheets for Datasets: Template and Repository. arXiv preprint arXiv:2005.13156. Davis, E., 2016. AI amusements: the tragic tale of tay the chatbot. AI Matters 2 (4), 20–24. Dixon, L., Li, J., Sorensen, J., Thain, N., Vasserman, L., 2018. Measuring and mitigating unintended bias in text classification. In: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, pp. 67–73. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al., 2020. An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

References

395

Eubanks, V., 2018. Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor. St. Martin’s Press. Gebru, T., Morgenstern, J., Vecchione, B., Vaughan, J.W., Wallach, H., Iii, H.D., Crawford, K., 2021. Datasheets for datasets. Commun. ACM 64 (12), 86–92. Grimmelmann, J., 2015. The law and ethics of experiments on social media users. Colo. Tech. LJ 13, 219. Ho, J., Jain, A., Abbeel, P., 2020. Denoising diffusion probabilistic models. Adv. Neural Inform. Process. Syst. 33, 6840–6851. IEEE Standards Committee, 2020. IEEE Recommended Practice for Assessing the Impact of Autonomous and Intelligent Systems on Human Well-Being, IEEE Std 7010-2020. pp. 1–96, https://doi.org/10.1109/ IEEESTD.2020.9084219. IEEE-Computer-Society, 2022. Standard for Data and Artificial Intelligence (AI) Literacy, Skills, and Readiness, Standard for Data and Artificial Intelligence (AI) Literacy, Skills, and Readiness, Katharina Schueller is the Working Group Chair. Check websites https://standards.ieee.org/ieee/7015/10688/, and https://development. standards.ieee.org/myproject-web/public/view.html#pardetail/9447. Working group website https:// sagroups.ieee.org/7015/. Jobin, A., Ienca, M., Vayena, E., 2019. The global landscape of ai ethics guidelines. Nat. Mach. Intell. 1 (9), 389–399. Kim, M., Park, M.-S., Im, J., Park, S., Lee, M.-I., 2019. Machine learning approaches for detecting tropical cyclone formation using satellite data. Remote Sens. 11 (10), 1195. Kochanski, K., Rolnick, D., Donti, P., Kaack, L., 2019. Climate change AI: tackling climate change with machine learning. In: AGU Fall Meeting Abstracts, vol. 2019, p. GC33A-04. Koene, A., Dowthwaite, L., Seth, S., 2018. IEEE p7003tm standard for algorithmic bias considerations. In: 2018 IEEE/ ACM International Workshop on Software Fairness (FairWare), pp. 38–41, https://doi.org/10.23919/ FAIRWARE.2018.8452919. Liu, Y., Weisberg, R.H., 2011. A review of self-organizing map applications in meteorology and oceanography. In: Self-Organizing Maps: Applications and Novel Algorithm Design. vol. 1, pp. 253–272. Liu, Z., Luo, P., Wang, X., Tang, X., 2015. Deep learning face attributes in the wild. In: Proceedings of International Conference on Computer Vision (ICCV). Long, D., Magerko, B., 2020. What is ai literacy? competencies and design considerations. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pp. 1–16. Looms, M.C., Binley, A., Jensen, K.H., Nielsen, L., Hansen, T.M., 2008. Identifying unsaturated hydraulic parameters using an integrated data fusion approach on cross-borehole geophysical data. Vadose Zone J. 7 (1), 238–248. Maghrabi, K., 2012. Impact of flood disaster on the mental health of residents in the eastern region of jeddah governorate, 2010: a study in medical geography. Life Sci. J. 9 (1), 95–110. Miceli, M., Yang, T., Naudts, L., Schuessler, M., Serbanescu, D., Hanna, A., 2021. Documenting computer vision datasets: an invitation to reflexive data practices. In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 161–172. Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I.D., Gebru, T., 2019. Model cards for model reporting. In: Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 220–229. Peng, G., Downs, R.R., Lacagnina, C., Ramapriyan, H., Iva´nova´, I., Moroni, D., Wei, Y., Larnicol, G., Wyborn, L., Goldberg, M., et al., 2021. Call to action for global access to and harmonization of quality information of individual earth science datasets. Data Sci. J. 20 (1). Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al., 2021. Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning, PMLR, pp. 8748–8763. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J., et al., 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21 (140), 1–67. Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., Sutskever, I., 2021. Zero-shot text-to-image generation. In: International Conference on Machine Learning, PMLR, pp. 8821–8831. Record, N.R., Vera, L., 2021. Uncovering Big Data Bias in Sustainability Science. Rivas, P., Holzmayer, K., Hernandez, C., Grippaldi, C., 2018. Excitement and concerns about machine learning-based chatbots and talkbots: a survey. In: 2018 IEEE International Symposium on Technology and Society (ISTAS), IEEE, pp. 156–162.

396

15. AI ethics for earth sciences

Schlenoff, C., et al., 2015. IEEE Standard Ontologies for Robotics and Automation, IEEE Std 1872-2015. pp. 1–60, https://doi.org/10.1109/IEEESTD.2015.7084073. Schwartz, R., Dodge, J., Smith, N.A., Etzioni, O., 2020. Green AI. Commun. ACM 63 (12), 54–63. Standing Committee for Standards Activities, 2021. IEEE Ontological Standard for Ethically Driven Robotics and Automation Systems, IEEE Std 7007-2021. pp. 1–119, https://doi.org/10.1109/IEEESTD.2021.9611206. Sun, Z., Sandoval, L., Crystal-Ornelas, R., Mousavi, S.M., Wang, J., Lin, C., Cristea, N., Tong, D., Carande, W.H., Ma, X., Rao, Y., Bednar, J.A., Tan, A., Wang, J., Purushotham, S., Gill, T.E., Chastang, J., Howard, D., Holt, B., Gangodagamage, C., Zhao, P., Rivas, P., Chester, Z., Orduz, J., John, A., 2022. A review of earth artificial intelligence. Comput. Geosci. 159, 105034. https://doi.org/10.1016/j.cageo.2022.105034. https://www.sciencedirect. com/science/article/pii/S0098300422000036. Systems, S. E. S. Committee, et al., 2021. IEEE Standard Model Process for Addressing Ethical Concerns During System Design: IEEE Standard 7000-2021. Wang, X., Barker, D.M., Snyder, C., Hamill, T.M., 2008. A hybrid etkf-3dvar data assimilation scheme for the WRF model. Part I. Observing system simulation experiment. Mon. Weath. Rev. 136 (12), 5116–5131. Winter, J.S., et al., 2015. Big data analytics, the social graph, and unjust algorithmic discrimination: tensions between privacy and open data. 2015 Regional ITS Conference, Los Angeles. International Telecommunications Society (ITS). No. 146313.

Index Note: Page numbers followed by f indicate figures and t indicate tables.

A Accumulated local effects (ALE), 328–329 collection of characteristics, 328 data’s average prediction, 329 implementation, 328–329 calculation, 328 import packages, 328 Intermountain Dry Tall Sagebrush Shrubland, 329f model training, 328 replace predictor variable codes, 328 partial dependence plots, 328 reliable model, 329 Accuracy assessment, 261–262 Actor-oriented design methodology, 366 Adam optimizer, 343 AGI. See Artificial general intelligence (AGI) AirNow observation network application programming interface, 251–252 Community Multiscale Air Quality Modeling System (CMAQ) simulation data, 254 data, 254 Fortran, 254 George Mason University (GMU), 254 Google Drive folder, 254 meteorological input data, 254 concentration, 251–252 GitHub page, 251–252 historical and real-time data, 251–252 quality checking processes, 251–252 source of truth data, 252 TROPOspheric Monitoring Instrument (TROPOMI) O3, 252–253 Google Earth Engine Python library, 252–253 remotely sensed atmospheric, 252–253 tropomiCollection variable, 253 Weather Research and Forecasting (WRF) model, 254 ALE. See Accumulated local effects (ALE) Algorithmic bias, 382–384 automated decision-making algorithms, 383 fivefold cross-validation strategy, 383f future decision-making algorithms, 384 identification and mitigate unintended bias, 382

Institute of Electrical and Electronics Engineers (IEEE) P7003 Standard for, 382 P7003 challenges, 382–383 P7003 standard, 383 recommender algorithms, 382–383 social media and streaming industries, 382–383 Amazon Web Services (AWS), 334 Analog ensemble, 213–215 Analog forecasting methodology ensemble and spatial extension, 213–215 AnEn forecasts, 214 forecast lead time (FLT), 213–214 historical observations, 214 scientific toolbox, 214–215 search space extension (SSE), 214–215 spatial similarity metric, 215 weather patterns, 213–214 generation of future predictions AnEn searches, 213 dense Internet of Things network, 213 observation availability, 213 historical information, 212 postprocessing fashion, 212 quantification of similarity between weather patterns exact weather analogs, 212 Kernel functions, 212–213 long-range forecasts of temperature, 212 machine learning algorithm, 213 quantification techniques, 212 spatial-temporal similarity metric with machine learning, 215–218 computer vision (CV), 215 DA spatiotemporal embedding network, 216f embedding network, 216–217 encode time information, 216 four-dimensional data structure, 216 historical weather records, 215 max-pooling layers, 216 network weights, 217–218 PyTorch implementation, 217 RA approach, 217 triplet network training, 217–218

397

398 Analog forecasting methodology (Continued) spatiotemporal weather patterns, 211–212 synoptic and mesoscale weather, 211 techniques, 211–212 Anchors, 330 coverage and precision, 330 IF-THEN rules, 330 implementation, 330–333 explanation, 330 import packages, 330 “Intermountain Dry Tall Sagebrush Shrubland” with features and their values, 332f package installation, 330 prediction of model, 331f predictor variable codes, 330 Random Forest Classifier, 330 rule-based explanation, 333 rules creation, 330 Animate, 79 ANN. See Artificial neural networks (ANN) Applications of artificial intelligence (AI), 41 Arctic amplification, 41 Arctic coastal communities, 42 Arctic ecosystem, 41 Arctic Sea ice, 42 artificial intelligence (AI) for forecasting, 41 dataset description, 45 deep learning-based sea ice forecasting, 49–53 Earth’s radiation budget, 42 and global climate patterns, 42 long short-term memory (LSTM) model-based approach for, 43 seasonal variations, 42 solar energy, 42 targets predicting, 56 Artificial general intelligence (AGI), 8 Artificial intelligence algorithms, 109, 115 automation process, 101–102 based technology, 114 camera operation, 104 and deep learning, 102–103 ethics for earth sciences based earth science systems, 380–381 carbon emissions of earth models, 391–393 documenting datasets for, 388–390 documenting models, 390–391 earth researchers, 387 ethical problems, 380 ethical values, 382 ethics standard, 384 forms of regulation and laws, 380 genetic information, 379–380

Index

Institute of Electrical and Electronics Engineers (IEEE) Standard P7015e, 388 knowledge and inferring characteristics of data, 380 large machine learning (ML) models, 380 literacy development, 387–388 positive effect, 386–387 technological progress, 379 technology, 391 trustworthy earth, 385 evolution in Earth sciences concepts and algorithms, 4 convolutional neural networks, 4 deep learning, 4 development of, 4–5 domains, 4 Earth data product generation and application, 5 Earth observation and data collection, 4–5 hardware and software environment, 4–5 human labeled image pixels, 5 ImageNet competition, 4 networks deal, 4 neural networks, 4 physics-based models, 4–5 powerful neural networks, 4 training label datasets, 5 image annotation, 105 machine learning and, 102 plant disease detection and classification, 102–103 for powered plant disease detection, 103 research papers, 103 sensor and camera configurations, 104 technologies, 3 workflow in Geoweaver emission, 291f environment-independent research experiment, 290–291 EPA ground observation, 292 experimental changes, 292–293 historical versions and logging out, 291 managing emission, 290–291 multiple data sources, 292 open-source workflow management software, 290–291 operational power plants, 292 power plant emissions and training data, 292 productivity issues, 290–291 remote sensing data, 292 sensed data and precise ground emission sources, 291 Artificial neural networks (ANN), 249–250, 357–358 Atmospheric wind, 59 Augmentation, 109–110 AWS. See Amazon Web Services (AWS)

Index

B Backbone algorithm backpropagation, 4 Bagging. See Bootstrap aggregation Black-box models, 357–358 Bootstrap aggregation, 18–19 Brief detour, 174–177

C Carbon emissions of earth artificial intelligence models automated method, 393 calculation of, 391 Earth AI-based autonomous & robotics system, 393 Earth artificial intelligence ethics-related logic, 393 hardware, 391 Hugging Face, 392 model cards of models, 392 tacking emissions, 393 Vision Transformer (ViT) models, 392 CBAM. See Convolutional block attention module (CBAM) ChannelGate class, 128 Channel-spatial attention module, 128 Cirq, 344 Climate change forecasting, 42 global warming, 42 machine learning, 45–46 prediction, 56 CNN. See Convolutional neural network (CNN) Community Multiscale Air Quality Modeling System (CMAQ) accuracy of ozone prediction, 266 artificial intelligence/machine technique, 267 experimental changes, 267 by George Mason University (GMU) team, 266 ground/aerial observation capacity limitation, 265–266 and machine learning models’ performance, 250 model performance, 267–268 open-source numerical modeling system, 247–248 ozone accuracy, 263–264 physics and chemistry, 248 primary dynamical model, 247–248 remote sensing, 266 results, 266 simulation data, 248, 258–259 stability and reliability, 266 Voting Ensemble and TabNet, 266 workflow on Geoweaver, 265f Computational graph, 163 Concepts in provenance building trustworthy AI, 359–360 European Union (EU) guidelines, 359–360

399

High-level Expert Group on AI (HLEG), 359 scientific communities, 359–360 with three key components, 360f unprecedented advancement, 359 provenance and provenance documentation, 362–363 experts and researchers, 362 fundamental classes, 362–363 gaining transparency using post hoc methods, 362 metadata, 362 PROV-O model and properties, 362f well-documented provenance, 362 World Wide Web Consortium (W3C), 362–363 understanding explainable artificial intelligence, 360–361 approach, 361 comprehensive information, 360–361 noteworthy importance of, 360–361 opaque machine learning models, 361 techniques development, 360–361 transparent machine learning models, 361 transparent vs. Opaque vs. Explainable model, 361f Conceptual hydrologic model, 186–200 Conductivity function, 181 Convolutional block attention module (CBAM), 121 Convolutional Long Short-Term Memory attention-based methods networks, 121–122 attention networks architecture, 126–144 cell structure, 123f channel-spatial attention block, 134–135 channel-spatial attention module, 128 machine learning approaches, 120–121 network, 121, 237–238 nonattention block, 132–134 proposed attention-based model, 126f spatial and channel, 121 Convolutional neural networks, 4, 19, 59 custom deep learning model, 60 data preparation process, 60 eddy detection task, 60 existing literature, 60 higher-level and lower-level visual neurons, 59–60 machine earning-ready dataset, 60 ocean eddies, 59 open-source model, 59–60 pixel-wise classification, 59–60 Correlation coefficient, 49

D DARPA. See Defense Advanced Research Projects Agency (DARPA) Data augmentations, 97, 109–110 Data collection, 251–254, 275–281 Data concatenation, 54–55

400

Index

Data-driven methods, 41 DataLoader object, 85 Data normalization, 126 Data preparation, 19–21, 61–75 AVISO-SSH data product, 61 absolute dynamic topography (ADT), 61 automated algorithm, 61 conventional heuristic and physics-based models, 61 Eddy detection, 61 satellite altimetry measurements, 61 sea level anomaly (SLA), 61 SurfaceWater and Ocean Topography (SWOT) mission, 61 binary classification, 21 compressed numpy (.npz) file, 72–75 masks and sea surface height map, 72–75 lon_range and lat_range, 72 subset_arrays function, 72–74 three motivators, 74–75 Planet data, 19 Planet Labs Education, 19 PlanetScope image, 20 py-eddy-tracker algorithm, 65–68 EddiesObservations, 65–67 GridDataset object, 65–67 ground truth eddy masks, 65–68 plot segmentation mask, 68 segmentation mask, 65 Research Program, 19 sea surface height map preprocessing, 64–65 Absolute Dynamic Topography (ADT), 64–65 only mesoscale variability, 64–65 segmentation masks in parallel, 69–71 cheap inference, 69 py-eddy-tracker code, 69 Python’s built-in multiprocessing library, 69–72 PyTorch and TensorFlow, 69 surface reflectance of four bands of Planet image, 20f training and testing sets, 61–63 algorithm development processes, 62 eddy detection task, 62 leverage py-eddytracker’s GridDataset, 63–64 overfitting, 61–62 py-eddy-tracker, 63 Python’s matplotlib library, 63–64 significant performance degradation, 61–62 Data preprocessing, 46–48 Data retrieval image annotation, 105 annotation process, 105 development and training, 105 human labor and costs, 105

and preparation, 103–105 protocols for image capture, 104–105 meteorological conditions, 104–105 time consuming, 104–105 variability, 103–104 camera operation, 104 geographic differences, 104 human factor, 104 illumination conditions, 104 image background, 103 intraclass variations, 104 powered plant disease detection, 103 preprocessing techniques, 104 sensor and camera configurations, 104 Dataset acquisition and preparation, 123–126 Dataset description, 45 Dataset preparation, 254–255, 286–287 Data variability, 103–104 Data wrangling, 193 Decision Tree (DT) model, 18–19, 262, 318 Deep learning artificial intelligence (AI) approaches, 43 based sea ice forecasting, 49–53 Convolutional Neural Networks (CNNs), 102–103 forecasting sea ice, 41 models, 105–109 neural networks, 115 techniques, 102 training models, 103f training process of, 112–113 Deep learning-based approach global eddy maps, 97 mesoscale ocean eddies, 97 oceanography data, 97 ocean transport processes, 97 physical oceanography, 97 pixel-wise annotated satellite altimetry dataset, 97 py-eddy-tracker algorithm, 97 Deep learning models artificial intelligence algorithms, 109 augmentation operations, 109–110 automatized machine learning (AutoML), 109 class balance, 109 convergence curves for a deep neural network, 111f cross-validation, 110 data augmentation, 109–110 deep neural networks, 109 domain adaptation techniques, 110–111 image augmentation, 109–110 independent datasets, 110–111 machine learning and deep learning, 109 NasNet Large model, 105–109 peer-reviewed published articles, 109–110

Index

Python, 105–109 training process, 110 process progresses, 110 test procedure, 110 and test sets, 110 transfer learning, 109 Deep neural networks (DNN), 357–358 Defense Advanced Research Projects Agency (DARPA), 360–361 Developing AI literacy computational systems, 388 Earth AI researchers, 388 education, 387 efforts and data globalization, 388 Institute of Electrical and Electronics Engineers (IEEE) Standard P7015e, 388 literacy as a societal system, 388f DNN. See Deep neural networks (DNN) Documenting AI models artificial intelligence-based earth sciences technology, 391 CelebA dataset, 391 cyclone detector, 391 machine learning development in earth sciences, 391 machine learning models, 390–391 model cards for model, 392t Documenting datasets for artificial intelligence collection process, 389 composition, 389 data collection, 389–390 datasheets for datasets, 388–389, 390t distribution side, 389 documentation, 390 Gebru’s paper, 390 inter-organizational accountability, 389 model cards, 389 motivations for documenting datasets, 389 motivation stage, 389 preprocessing/cleaning/labeling stage, 389 sensitive data, 389 time and effort, 389 uses stage, 389 World Meteorological Organization, 390 writing datasheets, 388–389 Dual numbers, 164 Dynamic forecasting systems, 45–46

E Earth artificial intelligence approaches, 6 assignments, 14 cases of, 5

choice questions and programming tasks, 14 developments, 4–5, 9 evolution in Earth sciences, 4–5 experiments, 6–7 geoscientists complicated backgrounds, 12 physics laws or experiences, 12 popularity, 11–12 smooth data processing, 12 tutorials, 12 land cover land use mapping, 6–7 latest developments and challenges, 6–7 complex classification tasks, 7 earth observing satellites, 6–7 earthquake/volcano seismic signals, 6 geoscientists, 6 hydrology, 7 hydro scientific problems, 7 imagery datasets, 6–7 land use mapping, 6–7 large objects on the Earth surface, 6–7 limited datasets, 7 machine language algorithms, 6 phase association methods, 6 processing seismic data, 6 traditional methods, 6 learning goals and tasks, 12–13 general learning objectives, 12–13 independent tutorial, 12–13 learning objectives, 13t post processing stage, 12–13 train-test-validate cycles, 12–13 long-term expectations for model spatial–temporal capability, 8 reliability and consistency, 8 models, 4–5 pipelines, 12–13 practical, 9–11 researches, 9–10 self-learning idea, 12 short-term current rule-based expert systems, 8 overfitting and underfitting problems, 8 promising approaches, 8 restrictions and problems, 8 short-term goals, 8 spatial–temporal resolution maps, 14 techniques, 7 technologies, 3 use of, 6 Earth artificial intelligence workflow dataset acquisition and preparation, 123–126 4D images as input-output sequences, 125

401

402

Index

Earth artificial intelligence workflow (Continued) data normalization, 126 fire ecological attributes, 123–124 fire spread dynamics, 123–124 input-output sequence generation, 125, 125f lattice for a single training instance, 125f minimum-maximum normalization, 126 percolation model, 123 Python-specific pickle format, 124–125 spatiotemporal sequence prediction, 125 target channel value, 126 encoder-decoder approach, 137–140 block, 137–140 framework, 137–140 layers, 133–134 modeling workflow demonstration, 126–146 2D convolution layer, 130–132 attention ConvLSTM networks architecture, 126–144, 126f CBAM-based convolution, 136–137 central processing unit (CPU), 127–128 channel-attended feature maps, 128–130 ChannelGate class, 128 channel-spatial attention module, 128 configuration, 127 convolutional block attention module (CBAM), 128–132 CSA-ConvLSTM block, 128–130, 134–135 decoder network, 137–140 encode and decoder block, 137 encoding and decoding network, 137 essential inputs, 140–144 execution, 144–146 feature aggregation and transformation, 132 gating mechanism, 133–134 graphics processing unit (GPU), 127–128 initial states and cell outputs, 137–140 inter-channel relationships, 130–132 linear mappings and nonlinear scalar functions, 132 model errors, 144–146 multiple pooling layers, 128 nonattention ConvLSTM block, 132–134 optimal number of epochs, 144–146 spatial-channel attention -ConvLSTM block, 136–137 spatial-channel attention module, 130–132 spatial-channel attention-supported network, 128 SpatialGate class, 128, 134–136 test function, 140–144 train and test data paths, 144–146 train and test functions, 140–144 train and test network, 140–144

physical interpretability of trained model, 146–148 ConvLSTM networks, 146–147 integrated gradients, 146–147 temporal heatmaps, 147–148 results accuracy and corresponding testing, 147 attention-based models, 153–154 epoch-wise error lists and temporal losses, 148–149 fire-only and no fire-only temporal losses, 150–152 fire progression, 149 fire spread dynamics, 147 integrated gradients, 153–154 mean squared error of all epochs, 152 nonattention and attention-based ConvLSTM models, 149–150f nonattention and attention-based models, 153–154 physical interpretation, 153–154 prediction performance, 148–153 proposed attention-based ConvLSTM models, 147 spatiotemporally average feature and no-fire cells, 154f visualization errors, 152–153 Earth climate system, 42 Earth data product generation, 5 Earth observing satellites, 6–7 Earth science, 94 applications, 363–364 artificial intelligence and machine learning, 363 challenges in, 363 domain, 363 work in provenance and, 363–365 Earth’s environment, 1–2 Earth’s hydrological and energy cycles, 17 Earth’s surface, 17 Earth system in artificial intelligence, 4–5 chemical reactions, 3 computing resources, 3 concept of, 1 disruptive climate, 2–3 earthquakes and weather forecasting systems, 1–2 food shortage, 2–3 global earthquakes, 2f grand challenges, 2–3 individual natural disasters, 2–3 models, 45–46 movements, 1 natural hazards, 1–2 natural resources, 2–3 science and technology, 3 stability and smooth cycles of, 1–2 worsening air quality, 2–3 Earth system sciences

Index

artificial Intelligence (AI) technologies, 3 convolutional neural networks, 4 short-term and long-term expectations for artificial Intelligence (AI), 8 EddyDataset class, 77–79. See also Animate Eddy detection model evaluate model on training and validation sets, 91–94 classical computer vision techniques, 93–94 eddy tracking and trajectory determination, 93–94 GitHub, 91–93 OpenCV library, 93–94 Open image in new tab, 91–93 val_loader object, 91 Introduction notebook, 75–76 load data, 76–79 animate validation data, 79 binary keyword argument, 76–77 cyclonic and anticyclonic eddies, 76–77 DataLoaders, 77 distribution of class frequencies to identify class imbalances, 77 EddyDataset class, 77–78 eddy detection model, 78–79 example visualization, 77–79 Geophysical Variable (GV), 78–79 get_eddy_dataloader() function, 76 GV (1999-03-01), 78f GV (2007-08-01), 78f GV (2013-12-20), 78f GV 2019-D1-10, 79f load NPZ and convert into PyTorch DataLoader, 76–77 Mask (1999-03-01), 78f Mask (2007-08-01), 78f Mask (2013-12-20), 78f Mask 2019-01-10, 79f not eddies, 77 plot_sample(), 77–78 specify NPZ file paths, 76 validation dataset, 79 val_split keyword argument, 76 metrics, 81–84 arithmetic mean, 82–83 classification and segmentation models, 81–82 cyclonic eddy, 81–82 Eddy detection, 81 F-score, 82 harmonic mean of precision and recall, 82–83 imbalanced problems, 81 multiclass classification, 82–83 not eddies, 81 one-vs-rest, 81–82 precision and recall, 81–83

403

Tensorboard logger, 83–84 Tensorboard logs, 84f torchmetrics.F1Score, 83 weights and biases, 83 metrics and visualizations, 75–76 segmentation model, 79–80 Adam optimizer’s learning rate, 80–81 custom_losses.py source file, 80 EddyNet, 79–80 loss function, L(fƟ(x),y), 80 model based off U-Net, 79–80 object detection models, 79–80 one-cycle learning rate scheduler, 80–81 optimizer, 80 standard cross-entropy loss, 80 Tensorboard, 81 training curves, 80 training loop, 81 training and, 75–94 training components, 79–81 dataset loading, 79 loss function, 80 optimizer, 80 train the model, 85–91 analyze training curves in tensorboard, 89 checkpoint_path variable, 89 DataLoader object, 85 functions at epoch and batch level, 85 machine learning datasets, 85 machine learning problem, 85 model’s training progress, 89 pixel-level predictions, 85 reload data option, 89 run_batch() function, 85 run_epoch(), 85–89 run_epoch function, 85, 89–90 TensorBoard instance, 89 training loop, 85–89 training loop for prescribed num_epochs, 89–91 typical machine learning workflow, 89 validation set, 85–89 Eddy detection model, 75–94 Encoder-decoder block, 137–140 Ensemble learning-based sea ice forecasting, 53–55 Ensemble learning technique, 18–19 Environmental Protection Agency (EPA), 280 AirData website, 280 remotely sensed data, 280 EPA. See Environmental Protection Agency (EPA) Ethically driven automated systems, 384–386 classical and contemporary philosophy, 385 development of Std 7007, 385 Earth’s future, 385

404

Index

Ethically driven automated systems (Continued) effective implementation of R&A systems, 385 Ethically aligned Robotics and Automation Systems (ERAS), 384–385 ethical ontology, 386 ethical standards in R&A systems, 385 human moral experience, 385 Institute of Electrical and Electronics Engineers (IEEE) Std 1872-2015, 384–385 Institute of Electrical and Electronics Engineers (IEEE) Std 7007, 384–385 interdisciplinary communication, 385 ontology, 384 trustworthy earth AI, 385 Ethical value engineering project team, 381–382 Euclidean norm. See Squared error Evaluate_model, 260 Explainable artificial intelligence (XAI), 321, 358 Extreme Gradient Boosting (XGBoost), 249–250

F First-order methods, 170 Forecasting AnEn and GFS, 218–219 atmospheric variables, 219 Global Forecasting System and North American Mesoscale, 219 machine learning interpretability via attribution, 230–234 anchor-positive and anchor-negative pairs, 232–233 associated reasoning, 230–231 deep analog (DA) Spatial embedding model, 231–232, 231f Gaussian filter, 231 integrated gradient (IG), 231, 233 model predictions and model input, 231 noninformation, 231 positive forecast, 233–234 relative humidity, 232 solar irradiance and total cloud, 232 solar irradiance for deep analog Spatial network, 233f spatial attribution of a single input feature, 232 surface variables, 232 target forecasts, 232 search space extension, 221–225 analog identification, 223 forecasted location, 223 observed solar irradiation on the horizontal axis, 222f predicted location, 224–225 prediction mean absolute error of solar irradiance, 224, 224f

resolution-limited model topography, 223 search repository, 221 single-grid forecasting, 224–225 solar irradiance predictions, 221–222 target forecasting, 224 weather analogs, 221 six prediction techniques, 218t solar irradiance, 218–219 test period, 219 verification at a single location, 219–221 AnEn and deep analog Spatial, 221 AnEn Spatial outperforms, 219 bivariate scatter plots, 221 complete test period, 220f continuous rank probability score (CRPS), 219 lower-resolution models, 220 Monte-Carlo Independent Column Approximation method, 221 single-grid forecasting, 220 weather analog identification, 225–230 AnEn and deep analog Spatial with NAM, 226f AnEn using North American Mesoscale, 228f based technique for, 225 cloud-free regimes, 229–230 cluster features, 230 deep analog Spatial group, 226 deep analog Spatial using NAM, 229f distance and prediction bias, 226 high irradiance regime, 227 historical observations, 225 horizontal axis and surface radiation budget (SURFRAD) observations on vertical axis, 230f imperfect and biased weather forecast, 225 low solar irradiance, 227 multivariate forecasts, 227–228 North American Mesoscale forecasts, 225–226 observed solar irradiance on vertical axis, 227f single-grid forecasts, 227 solar irradiance and latent features, 229–230 Spatial embedding network, 227–228 surface radiation budget (SURFRAD) measurements of solar irradiance, 228–229 target North American Mesoscale forecast, 226 Forecasting Arctic Sea ice, 41 Future developments in artificial intelligence challenging tasks, 9 infrastructure and technology stack, 9 Moore’s Law, 9 operational systems, 9 software and services, 9 trends of, 9 workflow consistent over time, 9

Index

G Geophysical Fluid Dynamics Laboratory’s Couple Physical Model, 42 Geoscience, 272 Geoweaver, 264–265, 366–372 core design of, 366–369 database, 370 description, 370–371 Earth and environmental sciences, 372 earth scientists benefits, 366–369 execution mode, 371 Geoweaver ZIP file, 371–372 host connection, 369, 369f in-browser software, 372f Input Workflow Name, 370–371 one-host option, 370–371 process creation, 370 provenance record, 370f Python programs, 369f, 370–371 source file downloading, 371–372 user interface of, 368f weaver workspace, 370–371 web-based workflow system, 366–369 workflow, 370–372, 371f Geoweaver system, 374 GFS. See Global Forecasting System (GFS) Global climate patterns, 42 Global Forecasting System (GFS), 7 Global interpretability, 323 Google Quantum Computer, 338–339 GPU. See Graphics processing unit (GPU) Graphics processing unit (GPU), 19 Ground-level ozone air pollutants, 247 Center for Environmental Measurement and Modeling (CEMM), 247–248 Extreme Gradient Boosting (XGBoost) model, 248 open-source numerical modeling system, 247–248 prerequisites, 248–249 earthengine-api, 249 Python 3.8, 249 python packages, 248 XGBoost model, 248 on public health and welfare, 247

H Hamiltonian operator, 343 High-resolution Planet imagery, 38 Humanity, 2–3 Hurricane trajectory prediction models, 10 Hydrological process monitoring, 39 Hydrologic model complex neural networks, 201

405 data, 192–194 data wrangling, 193 Earth systems’ modelers, 192 full dataset, 193 in_vars and out_vars variables, 193–194 memory usage and computational speed, 192 model inputs for training, 192–193 multiple shooting, 193 prerun results, 192–193 PyTorch Dataset class, 193 recurrent neural network (RNN), 193 tensor arrays, 193–194 winter low temperatures, 192 interpretability-predictive spectrum, 201 model analysis, 199–200 internal dynamics of the system, 199 Nash-Sutcliffe efficiency, 199 nse function, 199 steady state, 199 test_data, 199 model setup, 196–197 Adam optimizer and mean squared error, 197 HydroParam objects, 196 HydroSimulator, 196 hyperparameters set up, 197 initial_storage values, 196 model training functions, 194–195 optimizer’s step method, 194–195 update_model_step function, 194–195 pre/postprocessing networks, 201 PyTorch framework, 201 scaling up to a conceptual, 186–200 system of equations, 186–191 system of equations, 186–191 conceptual hydrologic model, 186, 189–190 conventional approaches, 187 drainage calculation, 188 HydroEquation class, 189–190 HydroSimulator, 191 nonlinear reservoir network, 191 reference potential evapotranspiration, 187 saturated fraction, 188 sigmoid function, 186 subsurface flow, 189 surface bucket, 186 surface saturation, 189 train differential equations, 186 user-definable parameters, 189–190 training, 197–198 curves, 198 multiple trajectory training, 197–198 Python loops, 197–198 time off, 198

406

Index

Hydrologic model (Continued) update_ic_step, 197–198 training/testing data, 195–196 basin selection, 195–196 MultipleTrajectoryDataset, 195–196 Hydrologist’s favorite, 175–177 Hydrology modeling applications of modeling, 157 computational element, 159 conceptual and physics-based, 158 conceptual hydrologic model, 159–160 data-driven approaches, 158 data-driven approaches via machine learning, 157–158 directly model intermediate processes, 158 Earth scientists, 158 evapotranspiration, 159–160 “explainable artificial intelligence” methods (XAI), 158 goal of, 157 hybrid approaches, 158 hybrid-machine-learning methods, 158 machine learning-based, 157, 169 machine-learning models in, 158 ordinary differential equation (ODE), 159 physics inspired machine learning, 159–160 pure-research point, 157 PyTorch ecosystem, 157–158 PyTorch parameters, 159–160 third general approach, 159 traditional knowledge, 157–158 Hyperparameter tuning, 334

I Image annotation, 105 Image classification, 338 Image-wide model performance, 34–36 Input-output sequence generation, 125 Integrated gradient (IG), 146–147, 231 Intelligent systems on human well-being, 386–387 autonomous and intelligent systems on earth, 386 climate and weather numerical models, 386–387 disaster community, 387 Earth Science, 386 human well-being, 386 impact of autonomous and, 386–387 impact of Earth AI systems, 387 Institute of Electrical and Electronics Engineers (IEEE) Std 7010, 387 mental health demands, 387 mental health difficulties, 387 satellite meteorology and oceanography, 386–387 wireless sensor network data, 387 World Health Organization’s (WHO), 387 Intersection over Union, 97

J Jaccard Index, 97 Java-based collaborative platform, 366

K Kepler scientific workflow system, 366 Kepler’s software, 368f KGML. See Knowledge Guided Machine Learning (KGML) Knowledge Guided Machine Learning (KGML), 159

L Land cover (binary) classification, 339–340, 343–344 Landsat imagery, 298, 300 Linear reservoir model, 174–177 Lloyd Shapley for game theory, 325 Loading and preprocessing data, 346–347 Long short term memory (LSTM), 7, 43 Long-term survival challenges, 3 Loss function, 80 LSTM. See Long short term memory (LSTM)

M Machine language derived vegetation correlated data, 324 Decision Tree models, 318 dependent target variable, 320 derived products, 317–318 dropping correlated features, 324 Earth Resources Observation and Science (EROS), 318 ELI5, 322 machine learning frameworks, 322 python library, 322 explainable artificial intelligence (XAI), 321 artificial intelligence/machine learning technique, 321 black-box model systems, 321 methodology and techniques, 321 model-agnostic techniques, 321 model-specific techniques, 321 model transparency, 333–334 XAI techniques, 321 global interpretation, 322 historical wildfire incidents, 318 implementation, 322–324 data preprocessing, 323 explain_prediction, 324 global interpretability, 323 libraries, 323 local interpretability, 324 model interpretation, 323 model training, 323 predictor variable codes, 323

Index

random forest-derived shrub prediction, 322–324 Shrub dataset, 323 source code snippets, 322–324 train and test sets, 323 weights and predictions, 322–324 input datasets, 317–318 Intermountain Low & Black Sagebrush Shrubland Steppe, 325f LANDFIRE data bands, 317–318 datasets, 317 data viewer portal, 319f dictionary document, 319–320 national products, 318 Prototype project, 318 remap feature names and explanation, 319t vegetation products, 318 Lloyd Shapley for game theory, 325 local interpretation, 322 many third-party datasets and methods, 318 model, 320–321 advantages and disadvantages, 320–321 Random Forest Classifier model, 320–321 Random Forest’s mechanism, 320–321 multiple explainable artificial intelligence methods, 317–318 National Vegetation Classification, 319 point-based data, 318 prerequisites, 320 Python, 320 routine and standard, 320 SHapley Additive exPlanations (SHAP), 325 decisions/predictions, 326 easy-to-use interface, 328 implementation, 325–327 installation, 325 Intermountain Semi-Desert Shrubland & Steppe, 327f model training, 326 necessary packages, 325 negative/positive correlation with vegetation type, 327f predictor variable codes, 325 trained model and training dataset, 326 shrub, 320f tavei feature, 324f terrestrial ecological system, 318 traditional land monitoring projects, 317 vegetation cover types, 317 Machine learning (ML), 11f AirNow ground dataset, 250 algorithms, 18, 272 approaches for sea ice forecasting, 45–55

407

based sea ice forecasting, 46–49 challenges and opportunities live, 94 cloud-based framework, 94 Community Multiscale Air Quality Modeling System (CMAQ) mechanistic model data, 266 and deep learning, 43 earth science, 94 interpretability via attribution, 230–234 and its integration with analog ensemble, 207–208 learning-based sea ice forecasting, 53–55 to managing and reuse, 293–294 models, 249–250 oceanography, 94 physics-based modeling community, 272 remote sensing data, 271 scientific papers, 275 sea surface height data, 94 significant power of, 272 software and hardware, 94 spatial-temporal similarity metric with, 215–218 state-of-the-art, 41 worflow management, 264–265 Machine learning tools and model, 18 random forest, 18–19 convolutional neural network (CNN), 19 decision tree, 18 decision tree model, 18–19 machine learning algorithm, 18 machine learning models, 19 python packages, 19t supporting packages, 19 scikit-learn, 18 Max Planck Institute’s Meteorology Earth System Model, 42 Mean squared error (MSE), 18, 261–262 Merging training data, 283–285 Metaclip, 365–366 METAdata for CLImate products calibration, 365 datasource, 365 graphical_output, 366 interpreter design, 366 metaclip interpreter, 366, 367f metaclipR package, 365–366 provenance-embedded climate product, 366 PROV-O classes and properties, 365–366 Resource Description Framework (RDF), 365–366 verification, 366 Metrics, 81–84 Model analysis, 199–200 Model attribution, 242–243 Model building, 299–312

408

Index

Model evaluation, 52–53 Model fitting, 304–309 Modeling workflow demonstration, 126–146 Model parameter tuning, 21–28 main parameters, 22 number of features, 24–25 “max_features” in scikit-learn, 24 output, 25f total feature size, 25 number of samples, 22–24 computational burden, 23 custom function, 23 data using code, 22 model performance, 22 output, 24f snow-covered area model, 23 whole dataset, 23 number of trees, 26–27 model performance, 26 n_estimators, 26 output, 26f optimal sample size, 21–22 parameters and sensitivities, 22 “RandomForestClassifier” function, 22 tree depth, 27–28 large tree depth, 27 max_depth, 27 “max_depth” argument, 27 model performance, 27 output, 28f Model performance evaluation airborne lidar surveys, 32 five evaluation metrics, 33 image-wide model performance, 34–36 airborne snow observatory and snow-covered areas, 35 false-positive predictions, 36 four bands’ surface reflectance, 34 organizing data and print evaluation results, 36 output, 35f Planet false-color image, 35 python function “run_sca_prediction()”, 34 water bodies and glaciers, 35 k-fold cross-validation approach, 32 lidar sensor to map snow depth, 32 in open areas versus forested areas, 37–38 3-m canopy height model dataset, 37 dense forest, 37 high false-negative, 37 model accuracy between, 37 output, 37f Planet imagery for open areas, 38 shaded valley, 37

snow cover, 38 output, 33f root mean squared error (RMSE), 32 spatial distribution of snow depth, 33 testing subset model performance, 34 “calculate_metrics()”, 34 evaluation metrics, 34 “model.predict()”, 34 output, 34f snow-covered areas mapping model, 34 two levels of assessment, 32 Model selection, 111–112 accuracy tables, 111–112 classification problem, 111–112 computational power, 112–113 confusion matrix, 111–112, 113f and experimental results, 111–113 final model selection, 111 Intersection over Union (IoU), 111–112 model complexity, 112–113 plant disease recognition, 111–112 poor generalization capabilities, 113 real time applications, 112–113 trained model accuracy, 112t Model training, 50–52 feature importance, 30–31 blue band, 31 output, 31f permutation_importance, 31 random forest algorithm, 31 “sklearn.inspection” module, 31 functions, 194–195 main parameter configurations, 28 random forest model, 29–30 accuracy, 29 accuracy values, 30 “cross_val_score”, 29 final model performance, 29 K-fold cross-validation, 29 model training accuracy, 30 output, 30f “RandomForestClassifier()”, 29 “Repeated-StratifiedKFold”, 29 save the model, 32 “dump()” function, 32 “joblib” package, 32 splitting data into training and testing subsets, 28–29 model accuracy, 28–29 output, 29f “random_state” parameter, 29 “sklearn. model_selection”, 29 train_test_split function, 29 “train_test_split” function, 29

Index

Moderate Resolution Imaging Spectroradiometer (MODIS) quantum machine learning on, 342–353 satellite data retrieval, 341 Modern-Era Retrospective analysis for Research and Applications Version 2 meteorology data Cloud Fraction variable (CLDTOT), 279–280 Data Access Protocol (DAP), 277–278 data collection, 278–279 meteorological data collections, 277 NASA Earthdata account, 277 OPeNDAP, 277 python code, 278 Temp and Wind variables, 278–279 time frame, 278 total precipitation variable, 279 MODIS hyperspectral images Cirq, 344 Python package, 344 quantum circuits, 344 satellite image dataset, 344f classification performance, 353 land cover (binary) classification, 343–344 binary classification problem, 343–344 TensorFlow-Quantum’s framework, 344 vegetative and nonvegetative images, 343–344 loading and preprocessing data, 346–347 Colab from Google Drive, 346 dataset loading, 346 download data file, 346 Earth AI book Github repository, 346 image operations, 347 rescale, 347 setting path and list variables, 346 train and test splits, 347 training and testing, 347 quantum circuit data encoding, 347–349 44-pixel, 347f batch image as argument, 348 Cirq circuits, 349 current quantum computers, 348 defined function, 349 full-resolution and downscaled image, 348 noisy intermediate-scale quantum (NISQ) processors, 347 quantum information processing, 348 set threshold, 348 train and test set, 348 vegetation and nonvegetation, 347 quantum machine learning (QML) on, 342–353 practical applications, 354 quantum neural network (QNN) model, 342–343, 349–353

409

Adam optimizer, 343 circuit layer for visualization, 350 data-circuit size, 350 dataset for a loss function, 351 encoded circuit, 349f gate layer as a circuit, 350f Hamiltonian operator, 343 hinge loss function, 351 input data, 343 Keras model, 351 model building, 349 modeling vegetation, 353 no-cloning theorem, 343 noisy intermediate-scale quantum processor, 353 quantum hardware, 353 readout qubit, 350–351 training performance across five epochs, 352f training process, 352 transport encoded data, 349 setup, 345 download data from Google Drive, 345 import package, 345 package resources, 345 packages, 345 visualization and implement statistical analysis, 345 visualization tools, 345 TensorFlow quantum (TFQ), 344 datatype primitives, 344 machine learning applications, 344 quantum algorithms, 344 TensorFlow (TF), 344 vegetation coverage, 342 MODIS MCD19A2 product blue band AOD, 280 Environmental Protection Agency NO2 data, 280f MAIAC Land Surface Bidirectional Reflectance Factor (BRF), 280 poi_mean function, 281 utilizeMODIS Blue Band AOD, 280 MSE. See Mean squared error (MSE) Multiple shooting, 193 Multiple trajectory, 193

N NAAQS. See National ambient air quality standards (NAAQS) National ambient air quality standards (NAAQS), 275 National Snow and Ice Data Center, 44 Neural networks, 4 Newton’s method, 170–171 Nonlinear activation, 236–237 Nonlinear reservoir model, 160 Normalized Difference Snow Index, 18

410

Index

Numerical optimization bread-and-butter of introductory numerical analysis, 169 brief background on, 169–177 computational modeling courses, 169 data-driven methods, 169 first-order gradient-based optimization, 169 first-order methods, 170 Adam and RMSProp, 170 Euclidean norm, 170 gradient-descent methods, 170 gradient via, 170 machine-learning optimization strategies, 170 second-order methods, 170 topography, 170 hydrologist’s favorite, 175–177 Newton’s iteration, 175 Newton’s method, 175 physical intuition, 175 machine learning-based models, 169 ordinary differential equation (ODEs), 174–177 catchment scale, 174–175 linear reservoir model, 174–175 nonlinear conductivity curve, 174–175 single linear reservoir, 174–175 physics-based models, 169 rest of machine learning, 169 second-order methods, 169–174 bare-bones implementation, 171–172 first-order optimization strategies, 170–171 iteration process, 171 jacobian function, 172–173 network encoding, 173 neural net, 173 Newton’s method, 170–171 TroughLayer, 173 Numerical physics-based models, 4–5 Numerical weather prediction models, 209–210

O Ocean eddies, 59 Ocean mesoscale machine learning model, 98 remote sensing variables, 99 satellite altimetry, 98 satellite measurements, 99 sea surface height observations, 99 Oceanography, 94 Ocean’s circulation, 59 One-cycle learning rate scheduler, 80–81 One-vs-one fashion, 81–82 Open areas versus forested areas, 37–38 Optical satellite imagery, 17–18

Optimizer, 80 Ordinary differential equations of neural networks network, 183–184 neural networks, 184 reservoir equation, 184 ReservoirEquation class, 184 timestep’s storage, 183–184 torchdiffeq, 184–186 training procedure, 184–186 training process, 183–184 nonlinear reservoir model, 177–179 conductivity term, 177–178 linear reservoir, 177 reservoir draining and values, 177–178 reservoir conductivity function with neural networks, 179–181 complex structure, 179–180 neural network, 179–180 training data and runs, 181 split out the input/output data, 181–183 differential equation, 181–182 network’s training task, 181–182 NeuralReservoir, 182–183 Overfitting, 61–62 Ozone atmospheric processes, 249 forecasting, 250f formation of, 249 prediction, 266 on public health and welfare, 247 real-world concentration, 247–248 satellite-borne observations, 250 surface ozone concentration AirNow ground dataset, 250 artificial intelligence/machine learning technology, 250 conventional strategies, 249 Deep Learning methods, 249–250 emission sources, 249 forecasting surface ozone concentration, 249 formation of ozone, 249 machine learning (ML) models, 249–250 OMI (Ozone Monitoring Instrument), 250 open community model, 249 ozone forecast, 250f ozone predictions, 250 surface concentration, 249 TROPOMI (TROPOspheric Monitoring Instrument), 250 vegetation damage, 249 vegetation damage, 249 Ozone prediction accuracy assessment, 261–262

Index

mean average error (MAE), 262 mean percentage error (MPE), 262 Root mean squared error, 262 traditional statistical accuracy assessment metrics, 261–262 accuracy improvement, 265–266 accuracy metrics, 264t AirNow, Community Multiscale Air Quality Modeling System (CMAQ), and CMAQ-LSTM, 264f numeric models, 265–266 airnow_data.csv, 254 cmaq_2017_Jan_Feb.csv, 254 date formatting, 254–255 dropping, 254–255 Extreme Gradient Boosting (XGBoost) model, 255 machine learning, 255–264 Abox plot, 261 AirNow O3 with Community Multiscale Air Quality Modeling System (CMAQ) O3, 259f AirNow observations, 258–259, 258f cross-validation scoring function, 260 cross_val_score function, 260 dataset’s small size and short duration, 257–258 extreme gradient boosting model, 255–261 max_depth values, 261 multiple tree–based regressors, 255 slimmed-down version, 259–260 testing set, 257 time and scale-up tree boosting, 255 XGBoost model negative root mean squared scores, 261, 261f XGBoost models, 256 XGBRegressor, 256 machine learning models, 262–264 Avg-Max-Min box charts, 262 Community Multiscale Air Quality Modeling System (CMAQ) model’s limitation, 263–264 data processing scripts, 264–265 Geoweaver, 264–265 Geoweaver GitHub repository, 264–265 Long-term Short-term memory (LSTM), 262–263 ten-fold cross-validation, 262 tree-regressor models, 262 model training, 254–255 rescaling, 254–255 voting ensemble, 262–263 based random forest model, 262–263 based XGBoost model, 263–264 voting regressor with three-member models, 263f workflow management, 264–265 XGBoost method, 262

411

P Percolation model, 123 Permutation Importance, 323 Physical interpretation, 153–154 Planet Labs’ small satellites, 17–18 PlanetScope imagery, 19 Plant disease artificial intelligence-based technology, 114 automation process, 101–102 challenges, 101–102 classification problem, 115 crop loss, 101 data issues, 113–114 data variability, 103–104 decision-making process, 114 detection and classification, 102–103 detection and recognition problems, 102–103 disorders, 101–102 domain adaptation, 114, 116 extrinsic factors, 102 health protection, 101 multiple disorders, 101–102 under overcast conditions, 114 plant pathology, 113–114 psychological and cognitive phenomena, 101 soil properties and meteorological data, 116 symptoms, 101–102 taxonomy of, 104 test and validation sets, 115 training and evaluating models, 115 training dataset, 101–102 user profiles, 114 variability in agricultural fields, 115–116 Plant disease challenge, 101–102 Plant disease detection analysis and classification, 102 artificial intelligence techniques, 102 and classification, 102–103 Convolutional Neural Networks (CNNs), 102–103 and deep learning, 102–103 neural networks, 102 nonstructured and dynamic conditions, 102–103 research articles, 103 training deep learning models, 103f Plant disease detection and classification, 102–103 Plant satellites, 18 Plot_sample(), 77–78 Pooling, 237 Power plant emissions coal-fired, 275 coal-fired power plant, 272 coal/gas combustion, 271 controlling, 275

412

Index

Power plant emissions (Continued) estimation, 271 global temperatures, 274 ground observation, 275 higher-resolution weather products, 272 long-term operation, 271 machine learning algorithms, 272 manufactured emissions, 274 MCD19A2, 283 merging training data, 283–285 data frame, 284–285 date column into numerical representation, 284–285 dependent and independent features, 285 final dataset, 283 model training, 283–284 set and a testing set, 285 training set, 285 MERRA-2, 282–283 meteorological reading, 282 pandas dataframe, 283 sel() method, 282 time parameter, 282 xarray datasets, 282 xarray’s resample() method, 282 monitoring NO2, 272 monitoring timely basis, 275 multispectral and hyperspectral sensors, 271–272 necessary credentials, 274 pollutants and greenhouse gases, 275 pollution emissions, 272 prerequisites, 274 earthengine-api, 274 python packages, 274 remote sensing data, 272 remote sensing data and Environmental Protection Agency ground observations, 271 remote sensing technology, 271–272 from space, 272 support vector regression machine learning model, 272 traditional mapping techniques, 272 TROPOMI NO2, 281 future improvements, 293 poi_mean function, 281 Preprocessing section, 281 tropomi_no2.csv, 281 TROPOspheric Monitoring Instrument satellite instrument, 272 TROPOspheric Monitoring Instrument tropospheric NO2 data, 275–276 emissionAI overall execution flow, 273f Google Earth Engine Python, 275 tropomiCollection, 276

tropomi_no2.csv, 275–276 Practical artificial intelligence black boxes, 11 consuming and feeding real-world data, 10–11 full-stack pipeline, 10–11 geoscientific scenarios, 11 machine learning operations, 11f production software, 10 ready-for-use datasets, 10–11 real-world application scenario, 9–10 real-world datasets, 10–11 satellite based land cover classification, 10 scikit-learn, 10 spatiotemporal limitation, 10 tree-based models, 11 Prediction performance, 148–153 Prerequisites, 274 Prescribed num_epochs, 89–91 Provenance documentation, 362–363 Provenance in earth artificial intelligence advanced algorithms, 357 advancement of technologies, 357 automatic tracking, 373 black-box model, 373 black-box models, 357–358 challenging and complex problems, 357 cloud service in data-intensive, 373–374 computer-based training, 357 concepts in, 359–363 data management and, 373–374 data management and provenance documentation, 373–374 deep learning algorithms, 357–358 defined, 358–359 document and present climate product, 365–366 documented in Geoweaver, 370 document record, 372 domain-specific documentation standards, 373 in earth artificial intelligence, 363–365 black-box model, 357–358 building trustworthy, 359–360 computer-based training and advanced algorithms, 357 demand and awareness for, 357–358 Explainable Program, 360–361 Geoweaver, 366–372 need for provenance in earth, 363–365 science domain, 363 technology, 357 trustworthy, 360f understanding explainable, 360–361 and earth science, 363–365 data abundance, 364–365

Index

data science applications, 364–365 Earth Science Information Partners (ESIP), 363–364 geographic information system (GIS) operations, 363–364 metadatabase system, 363–364 research community, 364–365 reusability of data, 363–364 Scopus datatabase, 364f Semantic Web technologies, 363–364 transparency and credibility, 363–364 transparency and verifiability, 364–365 in earth science domain, 363 for Earth program, 363 IBM’s Green Horizon project, 363 Intel’s survey, 363 embedded climate product, 366 enhance users’ decisions, 374 explainability, 373 explainable artificial intelligence (XAI), 358 future documentation, 373 Geoweaver dashboard, 370f inclusion of, 358–359 informational sources, 358–359 interpretability, 373 Local Interpretable Model-Agnostic Explanation, 373 in machine language systems, 358–359 in machine learning, 358–359 manual documentation, 373 model’s performance, 357–358 natural language explanations, 358–359 network, 367f parse information, 366 post hoc explainability approaches, 358–359 and provenance documentation, 362–363 real-world artificial intelligence/machine learning applications, 373 recording and sharing, 373 reproducibility, 372 researches, 358 reshaping workflows, 357 reusability of data, 363–364 safety-critical applications, 358 scientific research and comprehension, 373–374 success ofW3C PROV, 373–374 technical approaches, 365–372 metaclip, 365–366 scientific research, 365 state-of-the-art technologies, 365 tracking model activity, 365 trustworthy artificial intelligence (TAI), 358 weather forecasting, 358 workflow documentation, 373

413

Provenance in earth artificial intelligence, machine learning systems, 373 Py-eddy-tracker algorithm, 65–68, 97 PyTorch autodiff and backpropagation via, 171 autodifferentiation, 160–169 getting started with, 160–162 hydrologic model, 159–160 hydrologic model parameterized by, 159–160 ReLU function, 165 PyTorch and autodifferentiation autodifferentiation in PyTorch, 164–169 built-in implementation, 164 calculation, 164 chain rule, 167 deep neural networks, 168 derivative of a neural network, 166 feedforward type network, 166 hand-calculated derivative, 165 hyperbolic tangent, 165 implementation, 167–168 Jacobian of a function, 164 mathematical function, 168–169 neural network implementation, 169f PyTorch neural networks, 166 ReLU activation, 167 ReLU function, 165 getting started with, 160–162 architectures and components, 160–161 configuration options, 161–162 machine-learning frameworks, 160–161 sake of accessibility, 161–162 standard numerical implementations, 161–162 theory, 163–164 backpropagation algorithm work, 163 computational graph, 163, 163f dual numbers, 164 forward mode autodiff, 163 machine-learning frameworks, 163 reverse mode autodifferentiation, 164 simple function, 163 theory of automatic differentiation, 164 PyTorch DataLoader, 76–77 PyTorch ecosystem, 157–158

Q QML. See Quantum machine learning (QML) QNN. See Quantum neural network (QNN) Quantum circuit data encoding, 347–349 Quantum computer and informatics, 338–339 Quantum computing (QC), 337 Quantum machine learning (QML), 338–339 area of conventional, 337

414 Quantum machine learning (QML) (Continued) conventional machine learning, 337 defined, 338 hardware pool, 339 information coding, 337 open-source framework, 344 popularity of, 337 powerful system, 338 quantum and, 339 quantum circuits in, 344 quantum computers, 338–339 techniques, 338 Quantum neural network (QNN), 342–343, 349–352

R R2 score, 49 Radiance data, 341 “RandomForestClassifier” function, 22 Random Forest (RF), 18–19, 29–30, 262 Recurrent neural network (RNN), 193 Remote sensing (RS), 250, 271–272, 339–340 accessibility of, 272 based daily data, 272 machine learning, 271 nonimagery sources of, 296 nonlinear relationships and hidden patterns, 292 shrublands using, 295 technology, 271–272 Rescale, 347 Research data, 208–210 data collection and experiments, 208 surface radiation budget network, 209 downward solar radiation, 209 radiation measurements, 209 research community, 209 satellite retrieval validation, 209 weather prediction models, 209–210 downward shortwave solar radiation time, 211f Global Forecasting System (GFS) versions and North American Mesoscale (NAM) availability, 210 high-resolution regional dynamic, 209 nonstationary model performance, 210 prediction accuracy, 210 second weather model, 209–210 strong diurnal cycle, 210 Weather Research and Forecasting (WRF), 209 Reservoir conductivity function, 179–181 Responsible artificial intelligence. practices, 391 RMSE. See Root mean squared error (RMSE) RNN. See Recurrent neural network (RNN) Root mean squared error (RMSE), 32, 261–262

Index

S Satellite altimeter, 59 Satellite data retrieval, 341 Satellite image classification data, 340–342 collaborative tagging, 342 collection, 340 Hierarchical Data Format version 5 (HDF5), 341 labeling system, 341–342 MODIS image batches, 341–342 MODIS RGB images, 341 NASA information systems, 341 nonvegetation Earth data, 342f Radiance data, 341 satellite data retrieval, 341 spliting images into batches for annotation, 341–342 vegetation Earth data, 343f land cover classification, 339–340 machine learning, 338 artificial intelligence research and development, 338 general objective of, 338 supervised learning, 338 unsupervised learning, 338 quantum computer and informatics, 338–339 digital data and enable machine learning, 338–339 Earth system sciences, 338 Google Quantum Computer, 338–339 physics in informatics, 338–339 quantum systems, 338–339 understanding and practice of, 338 quantum machine learning, 339 performance and optimization, 339 quantum computers, 339 vegetation from satellite data, 339 remote sensing (RS), 339–340 data collection, 340 Earth systems, 339–340 environmental events, 339–340 global environmental change, 339–340 mapping and monitoring land cover, 339–340 upper Earth surface and ecosystem, 339–340 vegetation and nonvegetation cover, 340 environmental issues, 340 natural resource management and environmental science research, 340 nonvegetation elements, 340 SCA. See Snow-covered area (SCA) SCA mapping model, 39 Scikit-learn, 18 Sea ice, 41–42 density, 41–42 different states of water, 41–42 forms of water, 41–42

Index

frozen ocean water, 41–42 sea ice grows and shrinks, 41–42 Sea ice data exploration, 44–45 dataset description, 45 daily and monthly sea ice extent data, 45 meteorological variables, 45 open-access github repository, 45 sea ice extent and atmospheric variables, 45 goal of, 44–45 National Snow and Ice Data Center, 44 observed, 45f Python’s matplotlib package, 44 sea ice extent (SIE) values, 44 seasonality or nonstationary behavior, 44–45 stakeholders and policy makers, 44–45 Sea ice forecasting, 45–55 data-driven AI approaches, 45–46 deep learning-based, 49–53 custom reshape function, 49–50 data preprocessing, 49–50 dropout layers, 50–51 Early Stoppingh method, 51–52 loss function and optimizer, 51 mean squared error (MSE), 51 model evaluation, 52–53 model performance on test data, 52–53 model’s training versus validation (test) loss, 52f model training, 50–52 normalization technique, 50 observed versus predicted sea ice extent, 53 processing time-series data, 49 recurrent neural networks, 49 reshape the data, 50 target variable, 50 three-dimensional (3D) datasets, 49–50 train and test sets, 50 training and validation loss, 52 dynamic forecasting systems, 45–46 ensemble learning-based, 53–55 data concatenation, 54–55 dataset is reshaped to 3D, 54–55 ensembling machine learning models, 53–54 hybrid modeling approach, 53–54 long short-term memory (LSTM) model’s predictive performance, 55 model evaluation, 55 multiple linear regression-long short-term memory (LSTM) ensemble model, 54f multiple linear regression predictions, 55 observed versus long short-term memory (LSTM)predicted, 53f original dataset, 54–55 variance and generalization error, 53–54

415

machine learning approaches for, 45–55 machine learning based, 46–49 data preprocessing, 46–48 duplicate the sea ice column, 47 fitting the model, 48–49 meteorological variables, 47f MinMaxe scaling approach, 47–48 model evaluation, 49 model predictions, 48–49 normalized scale, 48 observed versus multiple linear regressionpredicted, 48f physical interpretations, 47–48 time-series regression problem, 46 training and testing sets, 47 results and analysis, 55–56 deep learning, 55–56 global warming trend, 56 multiple data-driven models, 56 North Atlantic Oscillation and Arctic oscillation, 56 pan-Arctic Sea ice extent, 56 physical processes, 56 root mean squared error (RMSE) and R2 score, 56t spring season predictive barrier, 55 spatiotemporal correlations, 45–46 spatiotemporal data mining, 45–46 variables included in the dataset, 46t Sea ice seasonal forecast, 42–43 Arctic Sea ice forecasting, 43 Coupled Model Intercomparison Project (CMIP6), 43 current operational, 42 data-driven artificial intelligence, 43 deep neural network (DNN), 43 fluid motion and thermodynamics, 42 multiple linear regression (MLR), 43 physical-based models, 42 probabilistic deep learning, 43 random forest baseline model, 43 sea ice concentration (SIC) prediction, 43 Sea Ice Prediction Network—Phase 2 (SIPN2), 43 seasonal and interannual fluctuations of, 42 two-dimensional convolutional neural network (2DCNN), 43 Search space extension, 221–225 Sea surface height (SSH), 59 Second-order methods, 170–174 Segmentation model, 79–80 Self-attention mechanism, 121, 155 Shrubland identification ability to, 312 artificial intelligence/machine learning based approach, 297 assignment, 315

416 Shrubland identification (Continued) climate change and human land use patterns, 295 false negative predictions, 314 false positive predictions, 314 future directions, 315 land use and land cover (LULC) classificationmethodologies, 296 low-statured vegetation, 296 model evaluation, 309–312 data’s bounding box, 311 evaluate() method, 309–310 full data set, 310 GeoTIFF file, 312 precision-recall curve (PRC), 310f predicted values, 311–312 predict() method, 310–311 single-dimension prediction vector, 312 two-dimensional array, 311 XYZ raster file, 311 model fitting, 304–309 activation function, 306 classification models, 304 class weights, 307–309 cutting-edge models, 304 data rescaled and class weights, 304 dense layers, 306 densely connected layers, 306 EarlyStopping() function, 307 false positive classifications, 304 feedforward neural network, 307f hidden layers, 306 model fitting process, 309 models evaluation, 304 model’s PRC score, 309 neural network, 308f plot_model() function, 307 positive cases, 304 resampled_history object, 309 shrubland predictions, 304 sigmoid activation function, 306 straightforward feedforward neural network, 304–306 structured data, 307 model performance, 315 model’s precision, 315 natural climate solutions, 296 observation represents, 314 pairing test, 313–314 predicted probability of, 313f preprocessing, 299–303 approaches, 303 artificial intelligence-ready format, 300 Boolean indicator variables, 300–301

Index

data download and library installation, 299–300 data leakage, 302–303 hold-out test set, 301 negative not-shrubland class, 303 positive shrubland class, 303 StandardScaler() function, 302 TensorFlow datasets, 302–303 training data labels, 303 validation set, 301–302 prerequisites, 297–299 data collection, 298 data download, 297–298 GIS software, 299 Land Change Monitoring, Assessment, and Projection (LCMAP), 298 libraries installation, 297 predictors used for model fitting, 299t Python version3.8.13, 297 study area, 298, 298f TIFF file, 298–299 the prevalence of, 295 rare land-cover class, 315 resembling closed-canopy conditions, 296 road networks and rivers, 314 science and stewardship opportunities, 296 trade-off with classification models, 314 training data representing, 313–314 true, 314 using remote sensing data, 295 Snow-covered area (SCA), 17 Snowpack functions, 17 Solar irradiance critical role, 206 grid-combined generating systems, 206 machine learning and integration with analog ensemble, 207–208 AnEn technique, 208 black box, 207 data-driven and inductive, 207 deep analog (DA), 208 deep learning (DL), 207 gradient-based methods, 207 spatial metric, 208 photovoltaic power plants, 206 weather analogs and machine learning (ML), 206 weather analogs history, 206–207 Analog Ensemble (AnEn) technique, 206–207 AnEn applications, 206–207 chaotic nature of, 206 Krick’s analog techniques, 206 numerical weather prediction (NWP) models, 206 temporal search window, 206

Index

Van den Dool’s astronomical estimation, 206 Solar power generation, 206 Solar radiation, 17 Spatial-channel attention module, 130 SpatialGate class, 128–130 Spatiotemporal weather analog forecasting techniques, 234 AnEn and deep analog work, 235–236 computational cost, 234–235 convolutional layers, 234 convolutional NNets, 234 deep learning layers and operators, 236–238 convolutional layers, 236 convolutional long short-term memory network, 237–238 convolutional neural networks (CNNs), 237 cross-correlation, 236 dying ReLU problem, 236–237 gradient contributions, 238 kernel operation, 236 long short term memory (LSTM) embedding network, 237 model weights and biases, 238 nonlinear activation, 236–237 nonlinear activation function, 236–237 pooling, 237 spatiotemporal sequence data, 237 zero-value padding, 236 extended analog search with GFS, 238–239 input features and predicted variable, 234 numerical weather prediction (NWP) model being Global Forecasting System (GFS), 239–240f runtime of AnEn, 234–235 space and time, 235 spatial forecasts, 234 spatial information, 234 spatial resolutions, 234 Squared error, 170 SSH. See Sea surface height (SSH) State-of-the-art machine learning (ML), 41 Stochastic Gradient Descent, 80 Support Vector Machine, 249–250 Support vector regression (SVR), 285–287 actual vs. predicted values, 287 Dataset preparation, 286–287 grid search, 285 hyperparameter, 285–286 param_grid dictionary, 286 Radial Basis function, 286 radial basis kernel function, 285 second utility function, 287 showAccuracyMetrics, 287 Surface radiation budget network, 209 Surface reflectance, 21

417

SVR. See Support vector regression (SVR) Sycamore processor, 338–339 System design DALL-E, 381f ethical concerns during, 380–382 climate risk assessment, 380–381 critical decision-making, 380–381 ethical requirements definition process, 381–382 Institute of Electrical and Electronics Engineers (IEEE) 7000 Standard, 382 Institute of Electrical and Electronics Engineers (IEEE) Standard 7000, 381 Institute of Electrical and Electronics Engineers (IEEE) Standard P7003, 382 transparency management, 381–382

T TAI. See Trustworthy artificial intelligence (TAI) TensorBoard instance, 89 Tensorboard logger, 83–84 TensorFlow quantum (TFQ), 344 TensorFlow (TF), 344 neural networks, 344 package, 295 Testing sets, 61–63 Testing subsets, 28–29 Test network, 140–144 Time series prediction models, 7 Torchvision.transforms module, 97 Traditional mapping techniques, 272 Training components, 79–81 Training loop, 85–89 Training neural networks, 170 Tree depth, 27–28 TROPOspheric Monitoring Instrument, 252–253 Trustworthy artificial intelligence (TAI), 358 building, 359–360 with three key components, 360f Trustworthy earth artificial intelligence, 385 Typical machine learning workflow, 98

U Utility functions, 287–290 emptyTestList variable, 289 final utility function plotCorrelation, 289–290 models’ accuracy metrics, 287–289 plotCorrelation utility function, 288f plotOnActual function, 289 plotOnActual utility function, 288f showAccuracyMetrics, 287–289, 287f

V Vergil graphical user interface (GUI), 366 Voting Ensemble method, 256

418

Index

W Weather analog identification, 225–230 under a high irradiance regime, 240–241 AnEn and deep analog Spatial with North American Mesoscale, 240f AnEn using North American Mesoscale, 241f deep analog Spatial using North American Mesoscale, 242f North American Mesoscale forecast and SURFRAD measurement, 240f model attribution, 242–243 deep network, 243 Gaussian filter, 243 gradient-based attribution, 243 image classification task, 243 IntegratedGrads calculation, 243 interpretation of machine learning models, 242–243 original solar irradiance input and the baseline, 244f Weather analogs, 206–207 Weather Research and Forecasting (WRF) model, 254 Wildfires accurate prediction of, 119 accurate spatiotemporal prediction, 154–155 attention-based Convolutional Long Short-Term Memory (ConvLSTM), 154–155 causes, 119 ecological processes, 119 fire-fighting and longer-term fire management efforts, 154–155 fire-spread rate, 154–155 Geographic Artificial Intelligence (GEO-AI) approach, 119 global, 154–155 interpretation of, 154–155 predicting and interpreting spatiotemporal dynamics, 155 prediction of, 154–155

spatial data science life cycle for artificial intelligence, 119, 120f spreadmodeling, 154–155 technical contributions, 120 cross-channel and spatial information, 120–121 machine learning approaches, 120–121 physical models, 120 space-time attention mechanism, 120–121 traffic conditions, 155 Wildfire spread attention-based methods for ConvLSTM networks, 121–122 attention mechanisms, 121 convolutional block attention module (CBAM), 121, 124f convolutional operation of, 122 encoder-decoder models, 121 physical interpretation, 121 self-attention mechanism, 121 structure of, 123f wildfire and environmental driver interactions, 122f wildfire spread prediction, 121 buoyant flame dynamics in, 120–121 input-output sequence generation, 125 methodology ConvLSTM network, 121 prediction and interpretation, 121 self-attention mechanism, 121 spatiotemporal attention-based sequence forecasting frameworks, 121 prediction and physical interpretation, 121 technical contributions, 120 time-dependent process, 154–155 Workflow platforms, 373–374

X XAI. See Explainable artificial intelligence (XAI)