
PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

ADVANCED TOPICS IN ENVIRONMENTAL SCIENCE SERIES

SERIES EDITOR
Grady Hanrahan
John Stauffer Endowed Chair of Analytical Chemistry
California Lutheran University
Thousand Oaks, California, USA

This series of high-level reference works provides a comprehensive look at key subjects in the field of environmental science. The aim is to describe cutting-edge topics covering the full spectrum of physical, chemical, biological and sociological aspects of this important discipline. Each book is a vital technical resource for scientists and researchers in academia, industry and government-related bodies who have an interest in the environment and its future sustainability.

Published titles

Modelling of Pollutants in Complex Environmental Systems, Volume I
Edited by Grady Hanrahan

Modelling of Pollutants in Complex Environmental Systems, Volume II
Edited by Grady Hanrahan

Practical Environmental Statistics and Data Analysis
Edited by Yue Rong

Forthcoming titles

Comprehensive Environmental Mass Spectrometry
Edited by Albert Lebedev

Biofuels in Practice: Technological, Socio-economical and Sustainability Perspectives
Edited by Luc Van Ginneken and Luc Pelmans

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

Edited by

Yue Rong

To practitioners of statistics

Published in 2011 by
ILM Publications
Oak Court Business Centre, Sandridge Park, Porters Wood, St Albans, Hertfordshire AL3 6PH, UK

6635 West Happy Valley Road, Suite 104, #505, Glendale, AZ 85310, USA

www.ilmpublications.com / www.ilmbookstore.com

Copyright © 2011 ILM Publications
ILM Publications is a trading division of International Labmate Limited

All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except under the terms of the Copyright, Designs and Patents Act 1988 or under the terms of a licence issued by the Copyright Licensing Agency Ltd, 90 Tottenham Court Road, London W1T 4LP, UK, without the permission in writing of the publisher. Requests to the publisher should be addressed to ILM Publications, Oak Court Business Centre, Sandridge Park, Porters Wood, St Albans, Hertfordshire AL3 6PH, UK, or emailed to [email protected].

The views expressed in this book are those of the editor and the contributors and not the State of California.

Product or corporate names may be trademarks or registered trademarks but, for reasons of style and consistency, the TM and ® symbols have not been used. Product or corporate names are used only for identification and explanation without intent to infringe. The publisher is not associated with any product or vendor mentioned in this book.

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library.

Library of Congress Cataloging-in-Publication Data
Practical environmental statistics and data analysis / edited by Yue Rong.
p. cm. – (Advanced topics in environmental science series)
Includes bibliographical references and index.
Summary: "Describes the application of statistical methods in different environmental fields, with an emphasis on how to solve real-world problems in complex systems" – Provided by publisher.
ISBN 978-1-906799-04-5
1. Environmental sciences – Statistical methods. 2. Environmental sciences – Data processing. I. Rong, Yue, 1958–
GE45.S73P73 2011
577.07297 – dc22
2011004906

Commissioning Editor: Lindsey Langston
Cover Designer: Paul Russen
Typeset by Keytec Typesetting Ltd, Dorset, UK
Printed and bound in the UK by MPG Books Group, Bodmin and King's Lynn

TABLE OF CONTENTS

Figure and Table Captions for the Colour Insert
The Editor
The Contributors
Foreword
Acknowledgements
Preface

Chapter 1  Environmental Data, Information and Indicators for Natural Resources Management
Nilgun B. Harmancioglu, Cem Polat Cetinkaya and Filiz Barbaros
1.1 Introduction
1.2 Data versus information
1.3 Environmental data analysis
1.4 Decision making for environmental management
1.5 SMART and OPTIMA projects: Gediz case study
1.6 Concluding remarks
References

Chapter 2  Application of Statistics in Earthquake Hazard Prediction
Endi Zhai
2.1 Introduction
2.2 Mathematical formulation
2.3 Earthquake intensity attenuation relations
2.4 An example of earthquake hazard prediction using historical seismicity data
2.5 Summary
References

Chapter 3  Adaptive Sampling of Ecological Populations
Jennifer A. Brown
3.1 Introduction
3.2 Adaptive cluster sampling
3.3 Adaptive allocation for stratified and two-stage sampling
3.4 Discussion
Acknowledgements
References

Chapter 4  Statistics in Environmental Policy Making and Compliance in Surface Water Quality in California, USA
Jian Peng
4.1 Introduction
4.2 Clean Water Act and Porter–Cologne Water Quality Control Act
4.3 Statistics in environmental standards and water quality criteria
4.4 Statistics in environmental sampling design
4.5 California State 303(d) listing policy
4.6 Total maximum daily loads
4.7 Implementation of environmental regulations
Acknowledgements
References

Chapter 5  Solving Complex Environmental Problems Using Stochastic Data Analysis: Characterisation of a Hydrothermal Aquifer Influenced by a Karst, Example of Rennes les Bains, France
Alain Mangin and Farid Achour
5.1 Introduction
5.2 Presentation of the Rennes les Bains site and water geochemistry
5.3 Analysis of piezometric time series
5.4 Evidence of the presence of a thermal convection
5.5 Conclusion
References

Chapter 6  Application of Statistics in the Evaluation and Optimisation of Environmental Sampling Plans
Meng Ling and Jeff Kuo
6.1 Introduction
6.2 Approach
6.3 Site applications
6.4 Summary
References

Chapter 7  Statistical Accounting for Uncertainty in Modelling Transport in Environmental Systems
James Weaver, Jordan Ferguson, Matthew Small, Biplab Mukherjee and Fred Tillman
7.1 Introduction
7.2 Model background
7.3 Parameter data
7.4 Transport in uniform aquifers
7.5 Vapour intrusion of hazardous compounds into indoor air
7.6 Contamination of municipal well fields
7.7 One source simulation
7.8 Two, four and six source simulations
7.9 Conclusion
Acknowledgement
References

Chapter 8  Petroleum Hydrocarbon Forensic Data and Cluster Analysis
Jun Lu
8.1 Introduction
8.2 Cluster analysis
8.3 Types of petroleum hydrocarbons or related data for forensic analysis
8.4 Examples
8.5 Concluding remarks
Acknowledgements
References

Chapter 9  Anomaly Detection Methods for Hydrologists, Hydrogeologists and Environmental Engineers
Farid Achour, Jean-Pierre Laborde and Lynda Bouali
9.1 Introduction
9.2 Different types of errors
9.3 Anomaly detection methods
9.4 Construction of a virtual time series of reference
9.5 Case study
9.6 Conclusion
References

Chapter 10  Statistical Methods and Pitfalls in Environmental Data Analysis
Yue Rong
10.1 Introduction
10.2 Estimation of percentile and confidence interval
10.3 Correlation coefficient
10.4 Regression
10.5 Analysis of variance
10.6 Data trend analysis
10.7 Summary and conclusions
Acknowledgement
References

Index

FIGURE AND TABLE CAPTIONS FOR THE COLOUR INSERT

Figure 1.8: The location of the Gediz River Basin in Turkey.
Figure 1.10: Digital elevation model of the Gediz River Basin.
Figure 1.12: Landcover map for the Gediz River Basin.
Figure 1.13: Soil map for the Gediz River Basin.
Figure 1.14: River reaches in the Gediz Basin.
Figure 2.4: Hazard contribution in terms of distance and magnitude.
Figure 4.4: Southern California Bight Regional Monitoring Programme Bight'03 sampling locations based on a stratified sampling design (SCCWRP, 2007).
Figure 5.9: Piezometric level time series at Rennes les Bains (period from 1 to 29 August, with a time step of 5 min) and the corresponding scalogram (Morlet wavelet).
Figure 5.18: (a) Correlation integral and (b) reconstructed attractor using the Grassberger and Procaccia method on piezometric time series recorded at the Rennes les Bains well during April 1996.
Figure 6.3: The Delaunay triangulation of a monitoring network.
Figure 6.8: Site plan, monitoring locations, and COPC plumes (delineated to the respective action levels). (a) Plumes in mid-2003; (b) plumes in 2008.
Figure 7.5: Analytical model output showing extreme results compared to the averaged-parameter simulation (in black).
Figure 8.4: Clusters generated based on PIANO data, (a)–(i).
Figure 8.5: Clusters generated from carbon number data, (a)–(i).
Figure 8.6: Clusters generated from identified gasoline range compounds, (a)–(i).
Figure 8.7: Clusters generated from ratios of 19 selected pairs of gasoline range compounds, (a)–(i).
Table 9.3: Contaminated matrix with 'introduced' errors.
Table 9.4: Detected errors at 95% confidence level.
Figure 9.3: Regression residuals plot.
Figure 9.4: Detection of accidental errors.
Figure 9.9: Temporal evolution of cumulative regression residuals.
Figure 9.18: Site location with monitoring network.
Figure 9.19: Spatial projection of the factor loadings for: (a) C2; (b) C3.

THE EDITOR

Dr Yue Rong (aka YR) is currently the Environmental Program Manager at the Los Angeles Regional Water Quality Control Board of the California Environmental Protection Agency, USA. He has more than 20 years' experience with the Agency in dealing with groundwater contamination problems in the Los Angeles area of California. He is the recipient of the Board's Outstanding Achievement Award and Supervisory Performance Award. He also received the 2011 Association of Environmental Health Sciences (AEHS) Foundation Achievement Award. Dr Rong is an Associate Editor for the peer-reviewed journal Soil and Sediment Contamination and an Associate Editor for the Journal of Environmental Forensics. He was elected in 2006 and re-elected in 2008 as the President of the Board of Directors for the Southern California Chinese American Environmental Professional Association (SCCAEPA). He is also the Editor-in-Chief of the peer-reviewed SCCAEPA online journal. Dr Rong is the author or co-author of around 30 peer-reviewed publications. He obtained his PhD in Environmental Health Sciences from the University of California at Los Angeles (UCLA) and his MS in Environmental Sciences from the University of Wisconsin, both in the USA, and his BS in Earth Sciences from Beijing Normal University, China.

THE CONTRIBUTORS

Farid Achour ENVIRON International Corp. Irvine, California, USA

Filiz Barbaros Dokuz Eylul University Water Resources Management Research and Application Center (SUMER) Izmir, Turkey

Lynda Bouali Research and Development Saidal Pharmaceutical Group Algiers, Algeria

Jennifer A. Brown Department of Mathematics and Statistics University of Canterbury Christchurch, New Zealand

Cem Polat Cetinkaya Dokuz Eylul University Water Resources Management Research and Application Center (SUMER) Izmir, Turkey

Jordan Ferguson Independent Student Services Contractor to United States Environmental Protection Agency Athens, Georgia, USA

Nilgun B. Harmancioglu Dokuz Eylul University Water Resources Management Research and Application Center (SUMER) Izmir, Turkey

Jeff Kuo Department of Civil and Environmental Engineering California State University, Fullerton Fullerton, California, USA

Jean-Pierre Laborde Polytech’ Nice-Sophia Biot, France

Meng Ling Acton Mickelson Environmental, Inc. El Dorado Hills, California, USA

Jun Lu AECOM Environment Long Beach, California, USA

Alain Mangin Retired from CNRS Station d'Écologie Expérimentale du CNRS à Moulis (SEEM) Moulis, France

Biplab Mukherjee National Research Council National Academy of Sciences Washington D.C., USA

Jian Peng Orange County Watersheds Program Orange, California, USA

Matthew Small United States Environmental Protection Agency Office of Research and Development and Region IX San Francisco, California, USA


Fred Tillman National Research Council National Academy of Sciences Washington D.C., USA

James Weaver United States Environmental Protection Agency

Office of Research and Development National Exposure Research Laboratory Athens, Georgia, USA

Endi Zhai Kleinfelder, Inc. Irvine, California, USA

FOREWORD

As we enter the second decade of the 21st century we are confronted with a wide array of challenging environmental problems and issues. Some of the problems are global in scale – global climate change is the most notable of these – others are national or regional in scope. An unprecedented mobilisation of effort is required if we are to gain a foothold in confronting this broad ecological crisis and learn to live sustainably. Basic and applied research must be conducted, carefully formulated environmental policies must be identified and implemented, investments must be made in green businesses and infrastructure, and, perhaps, as some have suggested, a more fundamental change may be required: transformations in human consciousness and in societal politics.

Whatever views may guide us in setting priorities for addressing the ecological pressures on our planet, we can certainly agree that scientific and quantitative skills are of crucial importance in helping us to understand the nature of our ecological problems and the potential impacts of environmental decisions and policies. Researchers and practitioners in these fields develop and apply ecological principles to enhance our understanding of the complex physical and biological environment in which we live. Others construct mathematical models and apply statistical tools to aid and inform decision-making processes. Since a large degree of uncertainty is associated with most environmental problems and issues, statistical skills are particularly important. The field of statistics provides a theoretical grounding and a set of methods for analysing numerical data for the purpose of making inferences in the face of uncertainty.

The need for statistical methodology when decisions are to be made in the presence of uncertainty comes into clear focus when assessing future climate change impacts. Persons who are concerned about global climate change believe it is important for policy-makers to anticipate a range of possible climate conditions and that the uncertainty about the nature and magnitude of impacts is not a reason to wait to act. In my home state of Wisconsin a major effort is underway to find adaptation strategies to the potential impacts of climate change in the state. The effort is led by the Wisconsin Initiative on Climate Change Impacts (WICCI), a statewide collaboration of scientists and stakeholders. Working groups of scientists in WICCI are assessing potential impacts of climate change on a variety of natural and human systems across the state.

The starting point for performing these assessments is a consistent data set of future climate change projections. Wisconsin climate scientists have obtained such a data set by down-scaling daily maximum and minimum temperatures and daily precipitation amounts from global climate models on to a 0.1° latitude × 0.1° longitude grid that covers the state. The coarse climate change projections obtained as output from the global models were debiased against observed temperature and precipitation data obtained from National Weather Service stations. Rather than follow the typical procedure of relating the large-scale atmospheric state to one specific value of the temperature and precipitation at a point, the researchers related the large-scale atmospheric state to the probability density function of temperature and precipitation at a point. In this way they could simulate both the variability and extremes of temperature and precipitation to account properly for the effect of the large scale on the weather at a point. Interpolations, regressions and other statistical tools were needed to complete this down-scaling process. By comparing model results for the mid-21st century (2046–2065) and late 21st century (2081–2100) time periods with those for the 1961–2000 time period, projections could be made. In this example we see the crucial role statistical methods play in providing a basis for environmental decision making in the face of uncertainty.
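The debiasing step described here lends itself to a small numerical illustration. The sketch below is not the WICCI procedure; it is a minimal example of empirical quantile mapping on synthetic data, one common way of adjusting model output against an observed record (all arrays and parameters are invented for illustration).

import numpy as np

def quantile_map(model_hist, observed, model_future):
    # Empirical quantile mapping: locate each future value's quantile
    # within the historical model run, then read off the observed value
    # at that same quantile.
    q = np.searchsorted(np.sort(model_hist), model_future) / len(model_hist)
    return np.quantile(observed, np.clip(q, 0.0, 1.0))

rng = np.random.default_rng(0)
observed = rng.normal(10.0, 3.0, 1000)      # station record (synthetic)
model_hist = rng.normal(12.0, 3.5, 1000)    # model output, historical period
model_future = rng.normal(14.0, 3.5, 1000)  # model output, future period

corrected = quantile_map(model_hist, observed, model_future)
# The systematic warm bias is removed while the projected shift remains.
print(round(model_future.mean(), 2), round(corrected.mean(), 2))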

The book you hold in your hands provides a valuable contribution to our understanding of statistical methods that are of particular relevance to environmental problem solving and decision making. The book chapters, written by authors with a wide range of academic and professional backgrounds, provide basic information about appropriate statistical methodologies to be employed when studying environmental problems, as well as practical guidance for applying them to specific types of issues and cases. Topics covered include raw data analysis, evaluation of background data and standards, environmental sampling and interpretation, stochastic data analysis, statistical pitfalls in environmental data analysis, and spatial and spatial–temporal dependencies. Applications to environmental forensics, ecological populations, environmental policy making, groundwater monitoring networks, transport in environmental systems, and microbial recreational water quality monitoring and source tracking are discussed.

The basic objective of the book, which is to assist practitioners in the application of statistical methods in solving real-world problems in complex systems, is invigorated by both the scope of the academic backgrounds of the authors and their range of organisational and agency experience. Their academic backgrounds include environmental science, civil and environmental engineering, mathematics, statistics, hydrology and jurisprudence, among others. Some of the contributors have worked for public entities, such as the US Environmental Protection Agency, a water quality control board, and a water quality planning unit in a watershed programme, and others have experience in consulting companies.

Yet another virtue of the book is that it provides an international perspective. The authors reside in China, France, New Zealand, Turkey and the USA. The global context within which we live and work, not to mention the commonalities and the global expanse of our environmental and ecological problems, calls out for communication across national boundaries and among varied cultures.

We should be grateful to Editor Yue Rong and these writers for enhancing our knowledge of statistical techniques that are needed for environmental data analysis and problem solving.

Robert B. Wenger
Professor Emeritus
Natural and Applied Sciences (Mathematics)
University of Wisconsin–Green Bay
Green Bay, Wisconsin, USA

ACKNOWLEDGEMENTS

The editor of this book would like to express appreciation to Mrs Lindsey Langston and ILM Publications for accepting, editing and producing this book. The editor thanks Dr Grady Hanrahan for his support and his review of the manuscripts. Gratitude also goes to the contributors of each chapter, who also peer-reviewed all the chapters and made this book possible.

PREFACE

This book describes the application and practice of statistics in the field of environmental science. It is not a mathematical book, but a practical statistics book. The contributors to this book use statistics as a means to solve problems in various environmental fields. The statistics in this book have little meaning unless we interpret them in the context of real environmental problems. The beauty of the chapters is that they do not describe how to plug numbers into statistical equations; instead, they discuss how to solve problems with the use of statistics. This book gives the reader a perspective on how environmental professionals are actually practising sometimes 'mysterious' statistics.

Statistics has a long history of use in scientific fields. However, environmental science is a relatively new subject that stemmed from industrialisation in recent human history. Environmental science is evolving, from the early days of investigating fish kills in the Great Lakes in the USA and mercury pollution affecting humans in Japan, to today's research into global climate change and green technologies. Today environmental science has developed into a multi-disciplinary field, which includes environmental engineering (environmental, civil, chemical and engineering geology) and related sciences (chemistry, geology, hydrogeology, ecology, biology, toxicology, climatology, atmospheric science, earth science, soil science, air quality, water quality and hazardous waste), public health, environmental studies, environmental law and economics, urban planning and studies. It deals with environmental issues from the regional to the global scale. In this sense, environmental science is different. Statistics is a tool used in other scientific fields and it is rightfully applied in the field of environmental science.

As you can see, the contributors to this book come from around the world in different environmental fields, working in academic, governmental, regulatory, technological and consulting industries. What they have brought to the reader represents the state of the art and mind of environmental professionals who are striving to analyse and solve global environmental problems. Some statistical methods are very rudimentary and straightforward, and some of them are very experimental and observational. Nevertheless, the chapters present a live and vivid picture of the statistics practised by today's environmental professionals.

I am very impressed by the variety of statistical applications presented in this book. I will be very pleased if any of the information in this book helps readers, even in a small way. I certainly hope that readers will have the same feeling I do after reading this book, which is that we need more statistics in environmental science and practice.

Yue Rong
Los Angeles Regional Water Quality Control Board
California Environmental Protection Agency
Los Angeles, California, USA

CHAPTER 1

Environmental Data, Information and Indicators for Natural Resources Management

Nilgun B. Harmancioglu, Cem Polat Cetinkaya and Filiz Barbaros

Environmental management has to be based on 'informed' decision making, where information in three dimensions, namely economic, social and environmental, is required to identify 'indicators' for sustainability. Such identification is realised by the use of decision support systems (DSSs) comprising the integrated tools of databases, models, geographical information system (GIS) and expert systems. The success of DSS applications is closely related to the quantity and quality of available data and information on economic, social and environmental aspects of the management problem. In that regard, information may even be considered as the fourth pillar of sustainability.

This chapter focuses on the following issues in sequence: the role of data and information in environmental management; data versus information, properties of environmental data and transfer of data into information; data analysis; information required for environmental decision making; and identification of sustainability indicators. The authors have worked on these issues since 1992 in the form of mostly international projects, academic research and theses, organisation of conferences and similar activities. The writing of this chapter provided them with the opportunity to put various pieces of work into one complete body.

Regarding sustainability indicators, the problem is how to evaluate whether environmental management is sustainable or not and how to ensure sustainability in decision making for management. The chapter also focuses on this problem, attempts to define sustainability in water resources systems and introduces sustainability indicators. These issues are considered in the case of the SMART (Sustainable Management of Scarce Resources in the Coastal Zone) and OPTIMA (Optimisation for Sustainable Water Resources Management) projects, funded respectively by the 5th and 6th Framework Programmes of the European Union. The basic issues discussed are further demonstrated in the case of the Gediz River Basin in Turkey.

1.1 INTRODUCTION

1.1.1 The role of data and information in environmental management

We live in an age of environmental alertness. Almost all natural resources are attacked by pollution at varying degrees of intensity. The quality of surface and ground waters is continuously degrading. The situation is similar for land resources with problems of soil erosion, deforestation and desertification in many parts of the world. Air pollution has already reached life- and health-threatening levels in particular regions. These problems have eventually endangered physical habitat for biodiversity. Further difficulties are expected because of the possible effects of climate change on various components of the environment. All these adverse developments are induced by diverse human activities, as well as by natural occurrences. The result is that environmental degradation not only endangers nature, but it also has serious social and economic implications. Thus, we need urgent remedies, not short-term but long-term solutions, to preserve environmental quality for future generations as well as for the present. It was this consideration that led to the adoption of 'sustainable development' as the basic policy in environmental management.

The need for sustainability has put significant demands on the decision-making process for management. We now need more efficient, more effective and more reliable decisions with which to control and develop our environment. Decision makers and planners are unfortunate in the sense that current problems have become multifold, multidimensional and multifaceted. Similarly, there are numerous objectives, often of a conflicting nature, to be satisfied. Furthermore, technology has provided an abundant number of solutions that may be applied even though their consequences for a particular problem investigated are not known in advance. Thus, the result is that decision makers have to perform in the realm of complexity and uncertainty. This is why the situation may be described as being 'unfortunate' for them. On the other hand, in the present age in which we live, technology, although it has stimulated environmental pollution in a number of ways, has currently provided the most advanced and effective tools to facilitate decision making. Thus, decision makers can be considered 'fortunate' as they are now better equipped in identifying, analysing and solving environmental problems.

The essential basis for decision making is information on the environment. This information is to be provided by available data on various components of the environmental continuum, as well as social, economic and all types of demographic data. Furthermore, effective and efficient decisions require information that is sufficient and reliable. On the other hand, to support the decision-making process, information should not only be sufficient and reliable but must also satisfy three conditions.

1. It must be available when it is needed.
2. It must be easily accessed by the user.
3. It should be available in a form that is easy to understand and use for the decision maker or the planner.

The management of water resources, like that of the other components of the environment, has to be based on ‘informed’ decision making, where information in three dimensions, namely economic, social and environmental, is required to identify ‘indicators’ for sustainability. Such identification is realised by the use of a DSS comprising the integrated tools of databases, models, GIS and expert systems. The success of DSS applications is closely related to the quantity and quality of available information on economic, social and environmental aspects of water resources. In that regard, information may even be considered as the fourth pillar of sustainability (Harmancioglu, 2007).

1.1.2 Data and the decision-making process

Agenda 21 of UNCED (1992) (Rio World Summit on Environment and Development) has officially stated the new outlook towards environmental management, namely that the environment should be managed by an integrated approach in respect of sustainability. It was further emphasised in Agenda 21 that effective management relies essentially on reliable and adequate information on how the environment behaves under natural and man-made impacts. In particular, Chapter 40 of Agenda 21 on 'Information for decision making' emphasises the importance of improved availability of information on all aspects of environment and development. It specifically underlines the need for improved presentation of data and information in a format that will facilitate policy and decision making by governments. The chapter states: 'Special emphasis should be placed on the transformation of existing information into forms more useful for decision-making and on targeting information at different user groups. Mechanisms should be strengthened or established for transforming scientific and socio-economic assessments into information suitable for both planning and public information.'

Substantial amounts of data already exist on various processes occurring in the natural environment, including water resources. However, the mode of adoption of integrated approaches for sustainable development of water resources has certainly changed information expectations and, hence, the types and the amounts of data needed. Now, more and different types of data have to be collected to describe the status and trends of not only water resources, but also of the ecosystem, other natural resources, pollution and socioeconomic variables. As current environmental problems extend to freshwater (both surface and groundwater), land resources, coastal zones, urban air, desertification, soil degradation, biodiversity and other habitats, data are required on all these media so that such problems can be assessed and managed.

Considering freshwaters, conventional water resources information systems comprise hydrological and meteorological data on such processes as precipitation (rainfall, snow), river levels and flows, lake and reservoir levels, groundwater levels, sediment concentrations and loads in rivers, evapotranspiration, and water quality (physical, chemical and bacteriological variables) of surface and groundwater. On the other hand, freshwaters are now considered a part of the environmental continuum comprising air, soil and water components that are interactive in complex ways. Thus, there is now a need to collect data on the wider environment to include watershed characteristics such as vegetation patterns, soil moisture, topography, climate and aquifer characteristics. Environmental data should include a wide variety of variables to provide information on diffuse sources of pollutants, accidental spills, irrigation return flows, eutrophication of lakes, and the status of estuarine and coastal ecosystems. Such data essentially reflect human impact on the natural environment. In a similar vein, data are also needed to describe water use by man, that is the volumes of water required for domestic, industrial and agricultural use, and characteristics of rivers related to catchment area uses such as recreation, navigation and fishery habitats (Harmancioglu et al., 2003).

It is clear from the foregoing that the types of data required to produce information on the environment are highly varied. In addition, these data should reflect the true nature of the environment. Environmental processes are, by nature, heterogeneous, dynamic, non-linear and anisotropic. They are marked by spatial variability as well as temporal variability. Accordingly, collected data should reflect these characteristics of the environment along with the spatial and temporal variability of environmental processes to be representative of nature.

On the other hand, although Agenda 21 and several other international documents and reports have stressed the provision of adequate and reliable information for sound environmental management, they have also recognised that current systems of information production, that is data management systems, do not fulfil the requirements of environmental management and decision making. In view of the rapidly growing environmental problems, it is often found that our data management systems experience a declining trend at a time when informational support is needed the most. There is a significant gap between information needs on environment and information produced by current systems of data collection and management. The presence of this gap contradicts the nature of the Information Age in which we live (Harmancioglu, 2003).

Recognition of the gap between information provided by available data and that required for environmental management has brought focus to current monitoring systems, databases, data validation and data use. Accordingly, major efforts have been initiated at regional and international levels to improve the status of existing information systems. The purpose of these efforts is to ensure that the data made available to users are accurate and reliable. Data are transferred into information via a data management system that involves a number of steps comprising data acquisition, processing and the eventual data analyses for preparation of operational and design data. Each of these steps contributes to the retrieval of the required information and has an impact on the quality of data collected and processed. Thus, all of these steps must be efficient to maximise data utility and reliability, meaning that quality controls should be realised at each step.
In particular, it is necessary that collected data are validated before they are disseminated to users. The users themselves can apply a number of checks to test whether the data are representative of the environment before they use them as a basis for their operational and design decisions.

Despite the above requirements, each step of a data management system is subject to numerous uncertainties and difficulties so that shortcomings are often encountered in available data. These shortcomings relate to the reliability, accuracy, completeness (missing values), homogeneity, length of record and spatial extent of data. There are often no measurements of sampling error indicated along with available data. In particular, data validation is often poorly achieved. The result is that the eventual information produced is of poor quality, imprecise and unreliable. Decisions based on such information are prone to significant errors, such that management of the environment cannot be realised in an efficient and cost-effective manner.

The major problems associated with available environmental data are their incompleteness (missing values), inadequacy and non-homogeneity. Further shortcomings may also be noted. In most cases, available data do not reflect a sufficient spatial coverage. A general deficiency is the lack of measurement of sampling errors, and data validation is overlooked. There are further problems in data presentation. Data may be available in incompatible formats; often, different disciplines involved in data collection and processing use different jargons. In general, reporting of data is poorly realised with no reference given to the specifications of particular variables measured. Similarly, methodologies used in laboratory measurements are not indicated. These shortcomings may be summarised as follows (NATO LG, 1997).

• There is a significant lack of integration among different procedures applied in data collection and in transfer of data into information.
• In general, current monitoring networks appear to be purposeless as no specific and clear objective is stated.
• The quality of available environmental data varies significantly from one region to another and from one country to another. Such variations may be attributed to the presence of different sources of pollutant loads and different geological (or geochemistry) conditions.
• Shortcomings often encountered in available data relate to their reliability, accuracy, completeness (missing values), homogeneity, length of record and spatial extent. There are often no measurements of sampling error indicated along with available data.
• There are significant problems associated with data presentation and reporting, as follows.
  ◦ Data from different sources are not compatible and comparable owing to the use of different formats and units used in data presentation.
  ◦ There are incompatibilities between different data acquisition and retrieval systems.
  ◦ Accessibility of data is often a problem in most countries.
  ◦ Different disciplines use different nomenclature or jargons in data presentation.

  ◦ Reporting of data is often poorly achieved as specifications of particular variables (e.g. NH3-N, NO3, PO4, and so on) regarding their laboratory analyses are not disclosed.
  ◦ An explanation of laboratory analysis methods is not provided along with presented data; the users therefore cannot assess the compatibility of the methods.
  ◦ Data validation is poorly achieved; current networks collect a lot of data but these data are not validated.

It follows from the above that the initial and possibly the most crucial step of environmental or water resources management is the establishment of a sound information system for the case studied.
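As one concrete illustration of such checks, the sketch below screens a short concentration series for three of the shortcomings listed above: missing values, physically impossible values and gross outliers. It is a hypothetical example built on the pandas library; the series, physical limits and threshold are invented.

import pandas as pd

def screen_series(s, lower, upper, z_max=4.0):
    # Flag, rather than silently delete, suspect observations.
    flags = pd.DataFrame(index=s.index)
    flags["missing"] = s.isna()
    flags["out_of_range"] = (s < lower) | (s > upper)  # physical limits
    z = (s - s.mean()) / s.std()  # crude screen; median/MAD is more robust
    flags["outlier"] = z.abs() > z_max
    return flags

# Hypothetical daily nitrate concentrations (mg/L)
nitrate = pd.Series([2.1, 2.3, None, 2.0, 55.0, 2.2, -1.0, 2.4])
flags = screen_series(nitrate, lower=0.0, upper=50.0)
print(nitrate[flags.any(axis=1)])  # values needing inspection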

1.2 DATA VERSUS INFORMATION

1.2.1 Definition

The purpose of environmental data collection is to produce information on environmental processes. Often in the past, the terms 'data' and 'information' were used interchangeably so that the general proposition was that the more data are collected, the more information is obtained. Later, however, a distinction has been made between 'data' and 'information'. The term 'data' means a series of numerical figures which constitute our means of communication with nature. On the other hand, what these data tell us or what they communicate to us is 'information' (Harmancıoglu et al., 1992). Thus, it is possible that data tell us all that we need to know about what occurs in nature (full information), or they may tell us some but not all about nature (partial information), or they may tell us nothing at all (no information). This means that availability of data is not a sufficient condition unless the data have utility, and the term 'information' describes this utility or usefulness of data (Harmancıoglu et al., 1992).

It follows then that little data may not be sufficient to convey the required information; however, it is quite possible that excess data also produce little or no information. Essentially, a major problem of present times is having so many data that one does not know what to do with them. Ward et al. (1986) express this situation as the 'data-rich but information-poor syndrome'. It is basically this syndrome that leads us to concentrate our efforts on data management.

The difference between data and information can also be explained as follows: data only need a collector, whereas information is defined by the content of data, which is meaningful to the user. Information exists if it is useful to some audience or to decision makers in the most general and inclusive sense. As noted in Section 1.1, this information must have a number of properties. First, it has to be timely; that is, information must be there when it is needed. Further, it has to be accurate and precise; otherwise, it is not useful information. It has to be easy to understand and must come in a format which meets the expectations and the capability of the specific audience who uses it. Context, or context-rich information to allow or to ease interpretation, is another important aspect. Finally, information has to be easily accessible by the users.

1.2.2 Transfer of data into information

As pointed out in the previous section, data availability is not a sufficient condition to produce the required information about the environment. It is the utility or usefulness of data that contributes to production of information. In the past, the primary concern was to conceive what available data showed about prevailing conditions of the environment. The question nowadays is whether the available data convey the expected information. Data collection systems have indeed become sophisticated with new methods and technologies. However, when it comes to utilising collected data, no matter how numerous they may be, one often finds that available samples fail to meet specific data requirements foreseen for the solution of a certain problem. In this case, the data lack utility and cannot be transferred into the required information. This is one of the reasons why we need to manage our data systems; data management is required to produce an efficient information system where data utility is maximised.

Another aspect of the problem lies in the cost considerations. Data collection and dissemination are costly procedures; they require significant investments which have to be amortised by versatile uses of data. Even in the developed countries, a data collection system has to be realised under the constraints of limited financial sources, sampling and analysis facilities, and manpower. If the outputs of this system, or the data, do not fulfil information expectations, the investment made in the system cannot be amortised so that the result will inevitably be economic loss. Cost considerations do not only relate to costs of monitoring; they are also reflected in the eventual decision-making process. If available data produce the required information, the more accurately decisions are made and the smaller the chances are of underdesign and overdesign. Proper decisions minimise economic losses and lead to an overall increase in the benefit/cost ratio. Thus, a data collection system has to be cost-effective and efficient to avoid economic losses both in the monitoring system itself and in the eventual design based on the information produced by this system.

The transfer of data into information involves several activities in sequence to constitute an environmental data management system, as summarised in Figure 1.1. Each of these activities contributes to retrieval of the required information. Thus, all of these steps must be efficient to maximise data utility. To respect the condition of cost-effectiveness, again each step has to be economically optimised. Thus, these activities have to be managed to ensure the efficiency and cost-effectiveness of the whole information system. The ultimate goal of an environmental data management system is decision making for environmental management. The key to proper management decisions is information on environmental processes, and retrieval of this information relies on data to be collected, analysed and evaluated.

Figure 1.1: Basic steps in environmental data management. [Flowchart: objectives and constraints → network design → sample collection → laboratory analysis → data handling → storage and retrieval → data distribution → data analysis → modelling → information utilisation → decision making.]

Figure 1.1 shows that the two basic tools of integrated environmental management, that is modelling and data, can be integrated in the data management system. In essence, modelling is the stage where data are transferred into information for the eventual decision-making process. Thus, it constitutes a significant component of the environmental data management system. On the other hand, production of the desired information from available data is a difficult task; it is subject to numerous uncertainties and problems in the collection, processing, handling, analysis and interpretation of data. Thus, management of the system of activities shown in Figure 1.1 has become an end in itself, apart from the management of the environment.

The major difficulty associated with current data management systems relates to deficiencies in defining specific objectives for monitoring. Constraints in the form of social, legal, economic and administrative factors complicate this step further. Essentially, lack of clearly stated objectives implies failure to define information expectations so that, eventually, the data management system cannot produce the information required for decision making. In this case, one may consider the option not to collect any data for which the objective is not specified.

With respect to the design of data collection programmes, there are as yet no standard guidelines to be followed in the design of monitoring programmes. Basic problems relate to the selection of sampling sites, frequencies, variables and sampling duration. When these network features are not properly selected, the efficiency of the monitoring network is significantly reduced (Harmancioglu et al., 1999; 2004a).

The major difficulty in physical sampling relates to realisation of representative sampling. Furthermore, the selection of proper tools and equipment for sampling may complicate the problem particularly in the case of equipment failures. Sampling has to be followed by proper preservation of samples, and timely and safe transport to the laboratories. These activities, if not appropriately realised, may lead to poor samples.

Laboratory analyses result in significant uncertainties due to lack of standardisation among laboratories with respect to analysis methods and units used. There is a significant need for reference laboratories. Furthermore, laboratory analyses must include quality control/quality assurance of available samples, which are not properly realised in most laboratories. This issue significantly hinders exchange of data on local, regional and global levels.

With respect to storage of data, most developed countries have well-established databases which can be accessed easily by users. The main problem here is that data banks have been filled up with huge amounts of data, and there is the question of what should be done with too many data. Developing countries either have no data banks or have poor databases that are hardly accessible by users. The main problem related to data banks is the appropriateness of formats with which the data are stored. Again, there is a need for harmonisation or standardisation in development of databases so that data exchange can be facilitated on regional and global levels.

Data analysis is the initial step of transferring data into information. There are numerous analysis methods proposed by different researchers. The problem is to select the best one among them. Modelling, as a means of data analysis, has its own uncertainties and complexities. Models often prove to be unsatisfactory when the underlying mechanisms of environmental processes are not fully and reliably perceived.
Another difficulty related to data analyses is that the messy character of environmental data requires special treatment via modified or new techniques. These methods have been developed, but they have not yet been validated to the fullest extent.

It follows from the above that each step of the data management system has its own difficulties and uncertainties such that the resulting data are often of a messy character, with deficiencies in both quantity and quality. Actually, each task in the system contributes to data utility and accuracy; problems in any one step reduce the reliability of the output information. Thus, first, to improve the status of existing data management systems, these problems should be solved, or at least minimised. Second, the system should be viewed as a cohesive whole, since the output of one step constitutes the input to the next step. Coordination of data flow among these steps is often difficult because each task is performed by a different discipline. Thus, agreement should be established between multidisciplinary approaches if current data management systems are to be improved.
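The view of the system as a cohesive whole, in which the output of one step constitutes the input to the next, can be made concrete by composing the steps of Figure 1.1 as functions. The sketch below is purely schematic: the three step functions and their data are invented placeholders, not methods from this book.

def sample_collection():
    # Placeholder: would pull field measurements in practice.
    return [2.1, 2.3, None, 2.0]

def laboratory_analysis(samples):
    # Placeholder quality control: drop failed analyses.
    return [x for x in samples if x is not None]

def data_analysis(values):
    # Placeholder summary statistic.
    return sum(values) / len(values)

def run_pipeline(steps):
    # The output of each step is the input to the next (Figure 1.1);
    # a problem at any one step degrades everything downstream.
    data = steps[0]()
    for step in steps[1:]:
        data = step(data)
    return data

print(run_pipeline([sample_collection, laboratory_analysis, data_analysis]))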

1.2.3 Integrated approach to environmental data management systems

It follows from Section 1.2.2 that the prevailing universal problem in environmental data management systems is the significant incoherence between data collection procedures and the retrieval of information required by the users. In this regard, an integrated approach to data management has become a necessity in recent years. Two main reasons can be specified to explain the need for integration. First, the multidisciplinary, global or regional character of various programmes requires strengthening of collaboration between data management activities of different organisations in order to ensure proper coordination of environmental data flow, collection and archiving and to avoid duplication of efforts both on national and international levels. Second, the requirement for a significant leap forward in the capacity to handle environmental data is occurring at a time when computer and communication technology has made significant advances in terms of technical capability and connectivity.

As was stated in Agenda 21 of the UNCED in Rio de Janeiro in 1992, the priority activities for environmental management should include: establishment and integration of existing data on physical, biological, demographic and user conditions into a database; maintenance of these databases as part of the assessment and management databases; and promotion of exchange of data and information with a view to the development of standard intercalibrated procedures, measuring techniques, data storage and management capabilities. The problems that must be addressed today require interdisciplinary approaches and much more sharing of data and information than in the past (Harmancioglu et al., 1997a; 1997b; 1998).

Integrated environmental data management is concerned with providing an opportunity to draw together relevant data on a transient or permanent basis, both within the same or across disciplinary boundaries, so as to address, through analyses, modelling or other means, environmental issues of local, regional, national or international interest or concern (Harmancioglu et al., 1997a). There are at least three levels of data integration: data of the same type (e.g. water quality data collected by different methods) into an integrated data set; data of different types of one discipline (e.g. marine physical, chemical, biological and other oceanographic data types) into a comprehensive data bank; and data of different disciplines (e.g. oceanographic, meteorological, geophysical or demographic data) for modelling and decision-making purposes (Harmancioglu et al., 2004b).

In essence, advances in global environmental and water resources management are not primarily limited by a lack of data and information, but by a lack of proper data and information management. At present, there exist huge amounts of different types of environmental data, which are not merged on a routine basis for the effective production of information. Modern data management offers various ways and tools, such as GISs, to reduce, condense, integrate and analyse such data. Furthermore, modern data types are not limited to routine ground-based observations; they include new data types such as remote sensing data from satellites and airborne platforms and data from real-time sensors and systems, producing high volumes of data. Moreover, numerical models provide another powerful source of data, especially for forecasts and simulation. The availability of such data and the advances in data collection technologies have increased the need for 'integration' in environmental data management systems.

It follows from the above that the essence of the problem lies basically in inadequate data and information management rather than in a lack of data. There are further impacts of poor data management, including:

• ineffective exchange of knowledge;
• potential loss of valuable historical data;
• significant amounts of redundant work involved in information production;
• lack of efficiency in assembling the relevant information required for the solution of a given environmental problem;
• increased budget required for data organisation in particular projects or programmes.
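To illustrate the first of the three integration levels noted above (combining data of the same type collected by different agencies), the hypothetical pandas sketch below reconciles station identifiers, date formats and units before pooling two records. All agency layouts, column names and values are invented for illustration.

import pandas as pd

# Two agencies reporting the same determinand in different layouts and units
agency_a = pd.DataFrame({"station": ["G1", "G2"],
                         "date": ["2003-05-01", "2003-05-01"],
                         "no3_mg_l": [2.1, 3.4]})
agency_b = pd.DataFrame({"site_id": ["G1", "G3"],
                         "sampled": ["01/05/2003", "01/05/2003"],
                         "no3_ug_l": [2500.0, 1900.0]})

# Harmonise column names, date formats and units (ug/L to mg/L)
a = agency_a.rename(columns={"no3_mg_l": "no3"})
a["date"] = pd.to_datetime(a["date"])
b = agency_b.rename(columns={"site_id": "station", "sampled": "date"})
b["date"] = pd.to_datetime(b["date"], dayfirst=True)
b["no3"] = b.pop("no3_ug_l") / 1000.0

integrated = pd.concat([a, b], ignore_index=True)
print(integrated.sort_values(["station", "date"]))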

It is worthwhile here to stress a few points regarding integrated data management systems, as follows.

1. Integrated data management is not solely required for scientific and technical purposes. It is the basis for environmental decision making, where community participation has become a significant component. Thus, production of sound information on environmental problems should also serve to inform the public in order to broaden the basis for the decision-making process.

2. While environmental management traditionally included only considerations related to the natural environment, increasingly, the influences of economic and sociological developments need to be taken into account. This allows for a quantification in economic terms of the effects of these factors, which is an essential piece of data for environmental management. To facilitate the processing of socioeconomic data in relation to data from the natural environment, it is suggested to broaden the scope of the term 'environment' to include socioeconomic variables and parameters.

3. Integration of data management is not a static procedure; it has an adaptive nature because new environmental problems are emerging, which require new types of data to be collected.

4. It is often stated that objectives of a data collection system have to be specified, and then the design and/or operation of the system should be optimised in view of the objectives. This approach may be questioned. Since a data collection system designed today has to function for several decades and because the principle of sustainable development requires consideration of the needs of future generations, the objectives that the current systems have to meet are almost impossible to specify when the objectives of future generations are not known. The only possibility seems to be to define today's objectives, to try to anticipate potential objectives of generations to come, and to define objectives of a data collection system on the basis of both. The problem remains, however, that we cannot clearly anticipate those needs of future generations, just as our predecessors did not anticipate the high relevance of water quality monitoring assumed today.

5. Although data collection networks are presently implemented in many countries of the world, we are faced with the constraint that there will always remain remote areas, such as Siberia, the Sahara, Central Australia, and so on, where data are not currently collected and will not be collected in the future. This is a problem, since there is a growing interest in global data sets in order to understand and quantify global changes. Some approaches, for example the one adopted by the Global Energy and Water Cycle Experiment (GEWEX), may be applicable for generation of environmental data sets in remote regions (Harmancioglu et al., 2003).

1.2.4 Shortcomings of available environmental data

Shortcomings of available environmental data may first be attributed to the deficiencies of existing monitoring networks summarised below:

• lack of coordination between various agencies running different networks;
• lack of agreement between collected data and environmental management objectives, resulting in data-rich, information-poor monitoring practices;
• problems related to:
  ◦ selection of variables to be observed;
  ◦ selection of sampling techniques;
  ◦ selection of sampling sites;
  ◦ how long monitoring of certain variables at certain sites should be continued;
• lack of reliable and accurate data (messy data);
• deficiencies in data presentation, interpretation and reporting.

Other difficulties may also be cited for the design and operation of current environmental monitoring programmes, as follows.

• Objectives of environmental assessment and management are not properly defined.
• The monitoring system is established with inadequate knowledge of the natural system (conceptual difficulties).
• There is insufficient planning of sample collection, handling, storage and analysis.
• Data are poorly archived.
• A precise definition of the information contained in the data, and of how it is measured, is not given.
• The value of data is not precisely defined and, consequently, existing networks are not optimal either in terms of the information contained in these data or in terms of the cost of obtaining the data.
• The method of information transfer in space and time is restrictive.
• Cost-effectiveness is not emphasised in certain aspects of monitoring.
• The flexibility of the monitoring network in responding to new monitoring objectives and conditions is not measured and not generally considered in the evaluation of existing or proposed networks.

Shortcomings of existing networks eventually lead to the collection of data that are deficient in reliability and accuracy. Regarding the quality of available data, one first has to note that environmental processes are strongly subject to non-homogeneities created by man, although similar effects also occur naturally. Thus, there exists the problem of non-homogeneities in observed data series. Furthermore, some environmental variables can be easily monitored, yet others require complex laboratory analyses. Errors in laboratory analyses, together with changes in monitoring or laboratory practices, often lead to inconsistencies (systematic errors). Another problem is censored data, which occur when some concentrations are below detection limits and therefore cannot be quantified numerically by the laboratory. All these limiting factors eventually make the utilisation of environmental data difficult. Consequently, the reliability of the output information is poor (NATO LG, 1997).

1.2.5 Noise

Data are collected to obtain information about the ecosystem and the way it functions under basic forces and their interactions. Data are essentially signals from the ecosystem; however, they do not represent perfect information about the natural system because of various sources of noise. Essentially, there is uncertainty between the real world (a particular process in the environment) and the information we have about it (understanding of environmental conditions) (Figure 1.2). Part of this uncertainty cannot be identified or quantified. The part that can be identified or quantified is noise (NATO LG, 1997; Harmancioglu and Singh, 2002). When assessing the information content of data, sources of noise must be accounted for as they lead to blurring of information. Noise refers to a number of uncertainties that stem from monitoring practices. Such uncertainties may be due to:

• lack of a clearly stated specific objective for monitoring;
• mistaken assumptions and bias in the conceptual description of the ecological system, as well as in the evaluation of data;
• insufficient design of the monitoring system (stations and samples not being representative of the true conditions of the environment in spatial/temporal dimensions);
• errors in field measurements (uncalibrated operations, lack of proper hydrological surveys prior to sampling);
• failure to select the proper methods for measurement;
• various interferences that occur during sampling (sample contamination);
• failure to look at the right place for the right material (e.g. water, air, biota, bottom sediments, etc.);
• errors in sample conservation and identification during the transport of the sample to the laboratory;
• various interferences that occur during laboratory analyses (sample contamination, lack of sensitivity, lack of calibration, errors in data reporting);
• failure to detect true signals (detection limits);
• errors in data handling (errors in entry and retrieval of data at computer facilities);
• lack of quality assurance at various stages of monitoring;
• lack of consistency with respect to sampling methods and sampling sites;
• changes in sampling programmes with respect to changing objectives or funding;
• errors in sampling;
• changes in sampling and analytical techniques (e.g. changes in methods, equipment, or detectability);
• lack of completeness in information production due to missing data.

[Figure 1.2: Noise as uncertainty between the real world and our understanding of it. Reality, filtered through noise in concept, data and statistics, yields our view of the world.]

If noise is defined as blurring of information, then all steps in data management shown in Figure 1.1 (i.e. the steps from data collection through the transfer of data into information) have noise components, because each has its own uncertainties. Thus, all problems relevant to each step constitute a source of noise. Each step imposes conditions on the type and quality of information flowing from the previous element. This implies that, in each element (step), criteria for accepting the results of the previous element have to be established. Also, each step is subject to changes and enhancements over time, reflecting changes in knowledge or goals, or improvements in methods and instrumentation. Thus, each step must have defined quality assurance activities to monitor these changes. The above sources of noise should be assessed when trying to extract the information contained in available data. Basically, these sources indicate three major areas where uncertainties may prevail:

• conceptual understanding of basic processes;
• available data;
• statistical noise.

When dealing with noise in any of the areas above, it must be recognised that noise cannot be totally eliminated, but can be minimised. The important thing is to be aware of the sources of noise and to be able to assess them (Harmancioglu and Singh, 2002).

1.3 ENVIRONMENTAL DATA ANALYSIS

1.3.1 Selection of the appropriate data analysis methodology

There are several methodologies, basically statistical in nature, that are used to analyse the properties of observed environmental data. The principles underlying these methodologies are available in the general statistical literature and in publications devoted particularly to the environmental process analysed. Among numerous studies, too many to cite exhaustively here, one may refer to Chapman (1992) for a general summary of methods used for water quality data analyses and to Hipel and McLeod (1994) for an extensive and highly detailed review of environmental data analysis techniques. It is not intended in this chapter to restate the mathematical background of such techniques; rather, a general critical overview of the data analysis procedures is presented. The selection of a particular data analysis methodology for investigating environmental data depends basically on two factors:

1. the type of information sought;
2. the nature of the available data.

There are essentially three types of information to be derived by data analyses on environmental variables:

1. information on mean values;
2. information on extreme values;
3. information on trends (spatial or temporal).

Each of these properties requires different data analysis techniques if it is to be reliably described. Such techniques are further classified according to their suitability to the nature of the available data. Some methodologies require regularly collected data, whereas others can better accommodate the sporadic nature of environmental observations.

1.3.2 The nature of environmental data

Environmental processes, such as water quality, streamflow, precipitation and other similar hydrologic variables, may be analysed as univariate series in the form of either time series (as a function of time) or line series (as a function of distance) when one of the dimensions, time or space, is kept constant. However, information is often needed on both the temporal and the spatial distribution of such processes, so that one has to turn to multivariate analysis techniques for a full understanding of how they evolve over time and space. In this sense, runoff data are probably the least problematic, as they are regularly observed within a systematically operated network. In contrast, some environmental data, such as those of water quality, pose significant difficulties in multivariate analyses, owing to the monitoring practice applied. Three basic features of sampling affect the information obtained about water quality:

1. the variables sampled;
2. the frequency of sampling;
3. the sampling sites.

With respect to the first feature, the difficulty is that the quality of water, even at a single site, has to be described by a large number of variables, in contrast to streamflow, which is represented by a single variable at a point in space. Accordingly, the analysis of water quality for a single site becomes a multivariate one, where the relationships between several variables have to be investigated. There is no problem when all variables are monitored regularly at the same time points. However, if different frequencies are applied for each variable, such relationships may be quite difficult to describe reliably. The second feature of environmental data monitoring, that is the temporal frequency of sampling, is the most problematic aspect with respect to data analysis. For example, water quality variables are often sporadically observed at irregular time intervals. Furthermore, their data series have several gaps and missing values, as there may be long intervals where observations are not made. Another problem is that the periods of most environmental observations are often quite short. With these characteristics, the nature of environmental data is often described as ‘messy’ (Hipel and McLeod, 1994). Consequently, the application of classical techniques of time series analysis is often made difficult by this messy character of observed environmental data. The third feature of environmental monitoring relates to the adequate spatial representation of the natural process. Even if there exist sufficient numbers and locations of sampling sites, information transfer between the observed variables in the space domain is often poor. This is because temporal sampling frequencies for a single variable at different sites do not match, or because different variables are monitored at different sites. It follows from the above that the multivariable, multisite and messy character of environmental data complicates their analysis, so that researchers are in continuous search of appropriate techniques to identify the space/time distributions of environmental variables. To summarise, in selecting the appropriate data analysis methodology, it is often necessary to gain a clear understanding of the dynamic behaviour of the natural processes involved. In terms of data analysis, statistics are useful for expressing the data in summary form. When the information is summarised in the form of plots or tabulated data, and so on, it is said to be of non-parametric form. When it is summarised in the form of an empirical (black box) model, it is of parametric form. In what follows, the parametric statistics will be discussed (Harmancioglu et al., 1998).

1.3.3 Estimation of mean values

In environmental management, it is often required to identify the mean value of an observed variable at a particular site. Such information is sought for management purposes, such as general surveillance or particular treatment needs in the case of river water quality. For example, the design of a treatment plant to regulate instream quality is based on the knowledge of the mean values of particular variables monitored at a site. The design criteria are based on the true means to be estimated from observed data. Obviously, one or two random observations are not sufficient to decide upon the true mean value. A series of data should be available in adequate amounts so that the mean water quality concentrations can be reliably estimated. Then the question is how many samples should be taken to determine the true mean with a certain level of confidence. Sanders and Adrian (1978) have proposed a method for estimating the mean value of a water quality variable from a series of monitored data. Essentially, they have developed this methodology to determine the required sampling frequencies in time if the information sought is the true mean value of a water quality variable at a specified level of statistical confidence. The method depends on the assumption that the primary objectives of future water quality monitoring networks are the determination of ambient water quality conditions and an assessment of yearly trends. The purpose of the method is to derive a sampling frequency criterion from standard statistical procedures that are used to determine the relationship between sampling frequency and the expected half-width of the confidence interval of the random component of an annual mean variable concentration (Sanders and Adrian, 1978; Sanders et al., 1983; Sanders, 1988). It must be noted here that, in the absence of sufficient water quality data, the method was demonstrated by Sanders and Adrian (1978) for the case of river flows, so that the annual statistic used was the mean log river flow. For a series of random events, the confidence interval of the mean decreases as the number of samples increases. Thus, the accuracy of the estimate of the mean is a function of the number of sample observations. Therefore, a sampling frequency, as the number of samples per year, can be determined for a specified confidence interval of the mean. Unfortunately, most hydrological time series are not random but significantly correlated and non-stationary, which makes standard statistical analyses difficult. Thus, the method can be applied only after removing the serial correlation and non-stationarity from the series. The Student t-statistic is selected to estimate the relationship between sampling frequency and the confidence interval of the mean of the random component. If the observations x_i (i = 1, ..., n) are stationary, independent and identically distributed, the variable t of Equation 1.1 follows a Student t-distribution

\[ t = \frac{\bar{x} - \mu}{S/\sqrt{n}} \tag{1.1} \]

where x̄ = calculated mean of the independent residuals, μ = theoretical population mean, S² = sample variance of the x_i, and n = number of independent observations (Sanders and Adrian, 1978). For a specified level of significance, the variable t will lie in a confidence interval defined by known constants. This means that the probability that the random variable t is contained within the interval is equal to the level of significance (1 − α), and the probability that the variable t is not contained within the interval is equal to α. This situation can be written by using the common statistical notation

\[ \Pr\left[ t_{\alpha/2} < \frac{\bar{x} - \mu}{S/\sqrt{n}} < t_{1-\alpha/2} \right] = 1 - \alpha \tag{1.2} \]

where t_{1−α/2} and t_{α/2} are constants defined from the Student t-distribution for a specified level of significance and the number of samples. By using the equality t_{1−α/2} = −t_{α/2}, the confidence interval of the theoretical residual mean can be written as

\[ \bar{x} - \frac{t_{\alpha/2}\, S}{\sqrt{n}} < \mu < \bar{x} + \frac{t_{\alpha/2}\, S}{\sqrt{n}} \tag{1.3} \]

and the width of the confidence interval of this mean of the random sequence (x_i) is

\[ 2R = \frac{2\, t_{\alpha/2}\, S}{\sqrt{n}} \tag{1.4} \]

where R represents half the expected confidence interval of the mean (Sanders and Adrian, 1978); 2R is the confidence interval between the limits defined. Thus, R is a function of the standard deviation of the observed residuals, the square root of the number of data and the constant from the Student t-distribution. Consequently, to determine the temporal sampling criterion, a plot of half of the expected confidence interval of the residual mean versus the sampling frequency is sufficient, since the confidence interval is symmetric about the mean. Sanders and Adrian (1978) showed the application of the method for the case of streamflows, given the lack of sufficient water quality data for statistical analysis. In their procedure, they first removed all series components that cause non-stationarity (trends, periodicity and serial correlations). Next, the sample variance of the residuals, S_a², is computed and plotted against the sampling interval. The S_a² values stabilise after a certain sampling interval and approach a limiting value. Beyond the sampling interval for which S_a² stabilises, the variance becomes almost constant and is independent of the sampling interval. Sanders and Adrian (1978) stated that this is a necessary condition for the analysis of the relationship between R and n to become theoretically valid. Next, for the streamflow series used, they derived the plots of R versus n (number of samples per year) for specified levels of significance (1 − α). Sanders and Adrian (1978) used daily streamflows in their analysis, so that the required sampling frequency is found by dividing the number of days in a year by the number of samples per year:

\[ \text{Sampling frequency} = \frac{365}{n} \tag{1.5} \]

To determine the sampling frequency by this method, one has to specify the level of significance first. Then, using the plots of R against n (number of samples per year), the number of samples per year (n) can be determined for a particular value of R. The methodology described above for estimation of mean values is perfectly valid in the statistical sense. However, its application to short-duration irregularly observed environmental data does not always produce reliable results, since the underlying assumptions of the method are often not met by environmental data series. This may not be the case for a number of developed countries where data banks are already filled up with regularly observed data. However, in a great majority of countries, including the developing ones, reliability of such statistical approaches may be fairly low.
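The frequency selection step lends itself to a simple computation. The following is a minimal sketch, assuming the series has already been reduced to independent, stationary residuals with a stabilised variance; the function names and the example numbers are illustrative assumptions, not values from Sanders and Adrian (1978):

```python
import numpy as np
from scipy import stats

def half_width(S, n, alpha=0.05):
    """Half-width R of the confidence interval of the mean (Equation 1.4):
    R = t_{alpha/2} * S / sqrt(n), for n independent residuals with
    sample standard deviation S."""
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    return t_crit * S / np.sqrt(n)

def samples_per_year(S, target_R, alpha=0.05, n_max=365):
    """Smallest number of samples per year whose confidence half-width
    does not exceed target_R, with the interval from Equation 1.5."""
    for n in range(2, n_max + 1):
        if half_width(S, n, alpha) <= target_R:
            return n, 365 / n  # samples per year, sampling interval in days
    return None

# Example: residual standard deviation 4.0 (concentration units),
# target half-width 1.5 at the 95% confidence level.
n, interval = samples_per_year(S=4.0, target_R=1.5)
print(f"{n} samples per year, i.e. one sample every {interval:.0f} days")
```

In practice the residual standard deviation itself depends on the sampling interval, which is why the method first requires the plot of S_a² against the interval to confirm that the variance has stabilised.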

1.3.4 Estimation of extreme values

It is not only the means but also the extreme values of environmental conditions that are of interest to managers. Knowledge of extremes is required for regulatory purposes, such as detecting standard violations. A major difficulty associated with the assessment of standard compliance is that it is based on sampling. In this context, it is highly affected by sampling errors and the resulting uncertainties. For example, the actual quality of water at a time and space point may exceed a critical value, but this may not be noticed if it is not sampled at that time. Or, if a sample taken at a certain time shows that the quality is good, it is assumed until the next sampling that it will remain good. In each case, our decisions carry a risk of failing to observe the actual quality of water (Alpaslan et al., 1993). The major problem in the assessment of compliance stems from the selected monitoring frequencies. Essentially, continuous monitoring is required in order to detect extreme values which may lead to standard violations. Recent advances in measurement capabilities permit us to observe much better the variability and the uncertainty in the behaviour of natural processes. By means of continuous monitoring, we can now identify not only the average concentrations of pollutants, but also the occurrence of extreme events in the form of shock loadings, which are similar to flood events in the case of water quantity (Beck and Finney, 1987). On the other hand, continuous monitoring is often costly in time, labour and money, in addition to being highly sensitive to system failures, for example equipment failures (although internationally the number of stations with continuous monitoring is increasing (Mulder, 1994)). Then the question is how frequently a variable should be sampled, or how many samples should be taken, so that extremes do not go unnoticed. The answer to this question is basically treated by probabilistic approaches, which are valid for random variables such as water quality. An extreme condition regarding a random variable can be described as the probability of exceedance P(X > x_cr,h), or the probability of non-exceedance P(X < x_cr,l), with X representing the random variable, and x_cr,h and x_cr,l the critical high or low values of the variable. The exceedance (or non-exceedance) probabilities may be determined by either a parametric or a non-parametric approach. The former requires the fitting of probability distribution functions to describe the random variable, and is therefore subject to modelling uncertainties (errors). The non-parametric approach to quantification of natural and/or impact risks covers the use of the well-known plotting position formulas. The relationship between parametric and non-parametric probabilities results in the following formula, which is commonly used in risk analyses to describe the probability of at least one failure within a record of n samples (Harmancioglu and Alpaslan, 1992; Harmancioglu et al., 1993):

\[ R(1) = 1 - \left[ 1 - P(X > x_{cr}) \right]^n \tag{1.6} \]

where R(1) represents the risk that the critical value is equalled or exceeded once or more in a record of n samples. For example, if P(X > x_cr) = 0.05, a record of n = 12 samples carries a risk R(1) = 1 − 0.95¹² ≈ 0.46 of at least one exceedance. The condition of compliance or violation entails a risk factor in the form of a random variable exceeding (or not exceeding) a critical value, x_cr,h or x_cr,l, set as a standard. In this case, exceedance or non-exceedance probabilities represent the risk of violation of a standard. Sometimes the frequency (f) of occurrences of undesirable outcomes (compliance failures) may be considered more significant, such that it should not exceed a critical value f_cr. This condition may be expressed as the probability (Harmancioglu and Alpaslan, 1992)

\[ P[f \geq f_{cr}] \leq \alpha_f \tag{1.7} \]

which is required to remain below an acceptable or specified level of risk α_f. If, in addition to the frequency of failures, the degree of failures is also of concern (Dendrou and Delleur, 1979), a similar requirement can be expressed using quantities. For example, the concentration C_l of a pollutant load may be desired not to exceed a particular critical level c_l,cr. Then, the probability of such exceedance can be assessed to remain below a specified risk level as

\[ P[C_l \geq c_{l,cr}] \leq \alpha_{l,cr} \tag{1.8} \]

In this case, the standard is set as the critical level c_l,cr, and the risk factor here refers to that of samples failing to meet the standard. Defining α_l,cr means that an acceptable level of risk or compliance failure is determined a priori, so that a particular percentile of the samples is expected to satisfy the condition in Equation 1.8. For example, if α_l,cr is selected as 5%, it indicates an acceptable failure rate of 5%, so that 95% of the samples are required to meet the standard. Such an evaluation foresees the determination of percentile values from an observed statistical sample using parametric or non-parametric approaches. In this respect, Crabtree et al. (1987) claim that the parametric approach serves better to obtain the maximum amount of information from available data. However, this procedure is subject to modelling uncertainties, so that tests for goodness of fit need to be applied. Nevertheless, Crabtree et al. (1987) have found that the ‘best’ estimate of the 95th percentile from a data set must be based upon a fitted probability distribution. Among the 334 water quality data sets they analysed, approximately half could be fitted by a normal, lognormal or Pearson Type 3 distribution, the latter providing the most flexibility in the case of water quality. Crabtree et al. (1987) have suggested that non-parametric techniques can be used if fitting a parametric distribution fails. Interpretation of standards as percentile values avoids assessment of compliance by using a standard that is unnecessarily rigid in some circumstances and not rigid enough in others. Similarly, it avoids the use of a standard that is inconsistent with the capability of existing treatment facilities or with the cost of possible improvements (Crabtree et al., 1987). Thus, standards are now described on a probabilistic basis to take into consideration the local circumstances and river quality objectives. Class limits at the 95th percentile are adopted to describe water quality standards in most countries. This implies an acceptable 5% risk of compliance failure (Crabtree et al., 1987; Warn, 1988).

If the pessimistic confidence limit is less than the 95th percentile standard, then one can be at least 95% confident that the standard is not exceeded. If the optimistic confidence limit exceeds the 95th percentile standard, one can be at least 95% confident that the standard is not met.

22 3.

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

If the 95th percentile standard lies between the optimistic and pessimistic confidence limits, it is not possible to assess compliance or failure with at least 95% confidence.

Warn (1988) further discusses that such an assessment may be worked out for different levels of confidence limits (50%, 80%, 99%, 99.99%, etc.) for interpreting compliance at different rates of assessment risks. The selection of the risk level depends on the planner’s evaluation of the significance pertaining to the problem considered. To obtain a complete picture of effects of sampling errors, more than one risk level may be investigated by describing confidence limits at different levels of probability. Warn (1988) states that such an approach provides a sound basis for interpreting the performance of treatment plants where sampling is realised infrequently.

1.3.5

Estimation of trends

General Within the framework of environmental management, monitoring practices have been initiated to evaluate the consequences of pollution control efforts, to determine the current status of environmental variables, and to detect possible changes, if any, in the natural processes with respect to time and space. Essentially, in respect of these objectives, monitoring efforts have been intensified in almost all developed countries where a significant amount of research is devoted to investigation and evaluation of trends in environmental variables such as surface water quality on a regional or national basis. On the other hand, environmental observations often constitute intermittent series of messy data so that the application of classical methods for trend analysis has generally produced unsatisfactory results, particularly in the case of water quality. Thus, efforts have intensified to develop, apply and test more effective techniques for trend analyses. In these studies, non-parametric methods are proposed since, by means of such techniques, problems related to probability distributions of water quality variables, short observation periods and the sporadic character of quality data are effectively handled. Since the 1970s, several non-parametric techniques based on order statistics of observed data have been developed particularly for the analysis of water quality. The goodness of fit of these techniques depends on the series structural properties and the type of temporal trend investigated. Parametric methods of trend detection Parametric statistical methods for trend detection are techniques that use the numerical values of observed data directly and which, therefore, require that the probability distribution of the process is known (Lettenmaier et al., 1991). Parametric methods basically cover the well-known parametric t-tests; other parametric techniques that have been developed for serially correlated data include time series analysis and intervention analysis (Van Belle and Hughes, 1984; Hipel and McLeod, 1994; Icaga, 1994).

23

ENVIRONMENTAL DATA FOR NATURAL RESOURCES MANAGEMENT

The classical statistical tests require an assumption that an observed data vector X is independently distributed as fx1 x2    x n (x1 , x2 , . . ., x n ) ¼ fx1 (x1 ) : fx2 (x2 )    fx n (x n )

(1:9)

In addition, parametric tests (those which assume a knowledge of the form of the probability density function) usually require that the elements of the data vector must have identical probability density functions. The best known parametric tests are the ttests, which assume that the probability density function f is normal (Lettenmaier, 1976). Two types of trends may be considered by parametric methods: step trends which consist of a step change in the mean level of a process at the mid-point of the data series; and linear trends which consist of a process with mean level that varies linearly throughout the data record. These two types of trends refer to sudden changes (such as improvement in stream quality due to establishment of a new treatment plant) and gradual changes (increases in non-point polluting contributions to a river due to urbanisation or land use) (Lettenmaier, 1976). The trend detection problem by parametric methods is basically a statistical hypothesis testing problem as shown in Table 1.1. The null hypothesis H0 is that an event A has not occurred, and the alternative hypothesis H1 is that A has occurred. A test statistic T is used to test H0 and H1 . The probability of choosing H0 when H1 is true is the confidence level of the test or (1  Æ). The probability of choosing H1 when H1 is true is the power of the test denoted by (1  ). The probability of a Type 2 error or  is a function of sample size, the population, the confidence level Æ and the alternative hypothesis H1 . In the trend detection problem, H0 is the hypothesis that there is no trend in the underlying population. H1 states either that there is a trend in the data (two-sided test) or that a positive or negative trend exists in the data (onesided test) (Lettenmaier, 1976). Testing for trends is basically the determination of the sample size n ¼ n1 ¼ n2 needed for a particular power of the test, where two population means are compared (Walpole and Myers, 1990). The hypotheses may be H 0 : 1 ¼ 2 H 1 : 1 ¼ 6 2

(1:10)

where population standard deviations 1 and 2 are known. For a particular alternative such as 1  2 ¼ , the power of the test is given by Table 1.1: Hypothesis testing and error probabilities (Lettenmaier, 1976). State of nature

Test indication H0

H0 H1

No error (P ¼ 1  Æ) confidence Type 2 error (P ¼ )

H1 Type 1 error (P ¼ ÆP ¼ Æ) No error (P ¼ 1  ) power

24

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

2

3   1   ¼ 1  P4zÆ=2  qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi  2  , z , zÆ=2  qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi  ffi5  1 þ  22 =n  21 þ  22 =n

(1:11)

where the statistic x1  x2   qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi  2  ffi  1 þ  22 =n

(1:12)

is a standard normal variable with x1 and x2 being the two sample means. Using Equation 1.11:  z ffi zÆ=2  qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi  2   1 þ  22 =n is obtained so that the required sample size becomes   ð zÆ=2 þ z Þ2  21 þ  22 nffi 2

(1:13)

(1:14)

When the population variances are not known, the statistic of Equation 1.12 is assumed to follow the Student t-distribution (Walpole and Myers, 1990). The power of the test is important in the trend detection problem because it gives, at a fixed confidence level, the probability of trend detection. The power of the test depends upon the sample size, trend magnitude and the marginal probability distribution of the dependent data series, which is assumed to be normal. For dependent series, it varies also with the form of the dependence of data. To avoid an assumption of the distribution type, Lettenmaier (1976) proposes the use of non-parametric tests such as Spearman’s rho test for step trends and the Mann– Whitney test for linear trends, which require the use of Monte Carlo simulated time series. Otherwise, the Student t-test is used, assuming that the independent data series are normally distributed. The parametric test described above can be applied to detect step trends in the following way. The data set xi of size n, divided into two parts of equal size, has the means 1 and 2 so that the hypotheses H 0 : 1 ¼  2 H 1 : 1 ¼ 6 2 are tested as given in Equation 1.10. For a confidence level (1  Æ), the test statistic is defined as   t ¼ jx1  x2 j n1=2 =2S  t1Æ=2,

(1:15)

where t1Æ=2, is the quantile of the Student t-distribution at probability level 1  Æ/2, the degrees of freedom  are n  2, and S is the sample standard deviation of the data set

ENVIRONMENTAL DATA FOR NATURAL RESOURCES MANAGEMENT

2 3 n=2 n X X 1 2 2 4 ð xi  x1 Þ þ S2 ¼ ð xi  x2 Þ 5 n  2 i¼1 i¼ n=2þ1

25

(1:16)

Hypothesis H0 is accepted when t < 0 or rejected when t . 0. In parametric methods, a test criterion NT is defined as a population statistic assuming that the population trend and standard deviation are known. With Tr ¼ j1  2 j representing the absolute value of the true difference between the two means, NT is defined as pffiffiffi (1:17) N T ¼ Tr n=2 The power of the test is then   1   ¼ F NT  t1Æ=2, 

(1:18)

where F is the cumulative distribution function of the standard Student t-distribution with  ¼ n  2 degrees of freedom (Lettenmaier, 1976). In case of a linear trend, a similar approach is used to obtain the test criterion N T9 as N T9 ¼

½ n(n þ 1)(n  1)1=2 T r9 n(12)1=2  

(1:19)

for the well-known regression model yi ¼ xi þ ª þ  i

(1:20)

where  i is normally distributed with zero mean and variance  2 .  is the trend magnitude and ª is the base level constant. In Equation 1.19,   represents the standard deviation of  i and T r9 ¼ n . The test criterion N T9 for linear trends is similar to NT of Equation 1.17 for step trends except for the constants. The power of the test is again computed by Equation 1.18. Lettenmaier (1976) developed curves to describe the relation between the power of the test (1  ) and NT or N T9 . He described that the power curves are essentially functions only of NT or N T9 for sample sizes greater than 20. He further developed his curves into the relationship between the detection power and the sampling interval ˜ for different specified values of Tr . Here, the number of samples n is replaced by T/˜, where T is the total observation period and ˜ is the sampling frequency. Thus, by substituting n in Equations 1.17 and 1.19 by T/˜, a direct relation between the detection power and sampling frequency is obtained. This relation is valid for both step and linear trends (Tr or T r9). For the case of dependent series, Lettenmaier (1976) replaces n by the effective sample size ne , which is a function of the dependence type. For an AR(1) (first-order auto-regressive) process, ne becomes ne ¼ n

1  1 1 þ 1

(1:21)

26

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

where  1 is the sample first-order autocorrelation coefficient. Then, the test criterion N T9 for dependent series can be computed as in Equation 1.17 or Equation 1.19 by using ne instead of n. With the above method, Schilperoort et al. (1982) used the data from an existing network to investigate two issues. 1. 2.

Which trend over a certain observation period (T) can be detected with the present sampling interval? Which sampling interval is necessary to detect a specified trend Tr (as a certain percentage of the mean concentration) over the period T ?

Lettenmaier’s technique on trend detection has the advantage that it is an objective-based approach to selection of sampling frequencies. Furthermore, it can be used for small sample sizes to determine what information the available data bring at particular levels of detection power. This technique is demonstrated on actual water quality data by Lettenmaier (1976) and Schilperoort et al. (1982). Their results show that the method works quite well under the given assumptions. Apart from the standard technique described, other parametric methods for trend assessment cover time series analysis and intervention analysis (Hipel et al., 1975; Hipel and McLeod, 1994; Lettenmaier, 1988). Non-parametric methods of trend detection Many existing environmental databases have been found unsuitable for analysis by standard parametric methods as available data series do not fulfil the requirements of such methods. Basically, the application of traditional statistical techniques to spatially and temporally correlated, non-normal environmental data is problematic. Other techniques developed for serially correlated data, such as time series and intervention analyses, are not suitable for some environmental data, for example water quality data, because of missing data, censored data and changing laboratory techniques (Van Belle and Hughes, 1984). Recently, several non-parametric tests for trends in environmental variables such as water quality have been proposed. The non-parametric methods are more flexible and can handle the above problems more easily. A non-parametric test is a method for testing a hypothesis where the test does not depend on the form of the underlying distibution of the null hypothesis. Therefore, non-parametric methods are sometimes referred to as distribution-free methods. In response to the need for non-parametric procedures, authors like Lettenmaier (1976), Hirsch et al. (1982), Hirsch and Slack (1984), Van Belle and Hughes (1984) and a number of other researchers have made significant contributions to the development and application of non-parametric techniques in water resources. Research on this topic is still being continued owing to the wide range and great number of water quality problems encountered. Lettenmaier (1976) claimed that the Mann–Whitney test for step trends and Spearman’s rho test for linear trends perform very well in comparison to parametric ttests. On the other hand, since these two tests also require independent data, Lettenmaier focused on detection of trends in water quality from data records with dependent observations. He considered that, for dependent time series, the power of the trend test varies with the form of the dependence of the observations. Accordingly,

ENVIRONMENTAL DATA FOR NATURAL RESOURCES MANAGEMENT

27

he developed a method of trend detection in the case of dependent data by establishing an equivalence between power curves for dependent and independent observations. Hirsch et al. (1982) presented techniques that are suitable in the presence of complications related to water quality data and proposed them for the exploratory analysis of monthly water quality data for monotonic trends. The first procedure they described is a non-parametric test for trend detection applicable to data sets with seasonality, missing, or censored values: the seasonal Kendall test. For stochastic processes with seasonality, skewness and serial correlation, the seasonal Kendall test performs better than its parametric alternatives, although it cannot be considered an exact test in the presence of serial dependence. The second procedure proposed by Hirsch et al. (1982) is an estimator of trend magnitude. It is an unbiased estimator of the slope of a linear trend and has a higher precision than a regression estimator where data are highly skewed. It gives lower precision in case of normally distributed series. The third procedure described by Hirsch et al. (1982) provides a means for testing for change over time in the relationship between water quality concentrations and flow, thus avoiding the problem of identifying trends in water quality that result from particular discharge series observed. In this method, a flow-adjusted concentration is defined as the residual based on a regression of concentration on some function of discharge. These flow-adjusted concentrations, which may also be seasonal and nonnormal, are then tested for trend by using the seasonal Kendall test (Hirsch et al., 1982). Van Belle and Hughes (1984) have analysed the relative power of various nonparametric procedures. They considered two classes of techniques: 1. 2.

intrablock methods which compute a statistic, such as Kendall tau, for each block or season and then sum these to produce a single overall statistic; aligned rank methods which remove the block effect from each observed value, sum the data over blocks and then produce a statistic from these sums.

Van Belle and Hughes (1984) discussed that aligned rank methods are asymptotically more powerful than intrablock methods; yet intrablock methods are more adaptable and may be generalised to deal with a broad range of models. Hirsch and Slack (1984) analysed application of non-parametric trend tests for seasonal data with serial dependence and proposed an extension of the Mann–Kendall (seasonal Kendall) test for trend. They claimed that, since the test is based entirely on ranks, it performs well in case of non-normal and censored data. Seasonality and missing values present no theoretical or computational problems in its application. Hirsch and Slack (1984) have shown that this modified test is valid in the case of serial dependence except when the data have a strong long-term persistence or when sample sizes are small (e.g. five years’ worth of monthly data). McLeod et al. (1983) discuss that there are two major steps in statistical analysis of trends. The first step is called ‘exploratory data analysis’ where important properties of the data are delineated by simple graphical and numerical studies. These studies include graphs of data against time, box-and-whisker plots, Tukey smoothing, and the autocorrelation function. At this stage, McLeod et al. (1983) use a data filling procedure to produce evenly spaced data series from data observed at unequal time

28

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

intervals. The next step of the analysis is called ‘confirmatory data analysis’ where the purpose is to confirm statistically the presence or absence of trends. For this step, McLeod et al. (1983) propose the use of the intervention analysis method to determine if there has been a significant change in the mean of the series. Montgomery and Reckhow (1984) also discuss that exploratory and confirmatory data analysis procedures should be applied to detect trends in water quality. Their trend detection methodology involves the following in a step-wise manner: 1. 2. 3. 4.

hypothesis formulation (statement of the problem to be tested); data preparation (selection of water quality variables and data); data analysis by exploratory techniques; statistical tests (tests for detecting trends).

Lettenmaier et al. (1991) used the non-parametric seasonal Kendall test and its multivariate extension to analyse 403 water quality monitoring stations in the USA for possible trends for the period 1978–1987. The results of their study showed that, for all groups and individual constitutents, trends were present only for a minority of stations at 10% significance level. Furthermore, analysis of possible relationships between trends and land use and population did not give strong evidence of possible causes. Hirsch et al. (1991) reviewed in detail methods for the detection and estimation of trends in water quality. They considered that the steps involved in the selection of a trend detection method include: 1. 2. 3. 4.

5.

determination of the type of trend hypothesis to analyse (step versus monotonic trend); selection of the general category of statistical methods to use (parametric versus non-parametric); selection of water quality data to analyse (concentration versus flux); selection among various data manipulation alternatives related to the use of mathematical tranformations and the removal of natural sources of variability (discharge, seasonality) in water quality; the choice of trend detection technique for water quality records with censored data.

With respect to list point 2 above, Hirsch et al. (1991) discuss that parametric procedures for trend testing are regression in the case of a monotonic trend and the two-sample t-test for step trends. In these methods, estimators of trend magnitude are the regression slope and the difference in the means. Non-parametric alternatives for these methods are the Mann–Kendall test and the rank sum test, respectively. Hirsch et al. (1991) further indicate that the decision as to which procedure should be used is based on considerations of power and efficiency of the test required by the available data. Power is the probability of selecting the null hypothesis (of no trend) given a particular type and magnitude of actual trend, and efficiency is a measure of estimation error. As indicated by Hirsch et al. (1991), a procedure’s relative efficiency can be measured by the ratio of the mean square error of an alternative procedure to the mean square error of the particular procedure considered. Hirsch et al. (1991)

ENVIRONMENTAL DATA FOR NATURAL RESOURCES MANAGEMENT

29

discuss also that, for any significance level, the most powerful test is the parametric procedure if residuals are normally distributed. Similarly, the relative efficiency of these procedures is higher when residuals are normally distributed. In the case of nonnormal water quality variables, Hirsch et al. (1991) propose the use of the seasonal Kendall test for monotonic trends and seasonal rank sum test for step trends. Other studies on non-parametric methods include those by Hipel et al. (1988), who used the seasonal Mann–Kendall test to analyse trends in lake water quality; Hughes and Millard (1988) who suggested a tau-like test for trend in the presence of multiple censoring points; Lettenmaier (1988) who extended the use of nonparametric trend tests to the multivariate case; and Hirsch (1988) who investigated the magnitude of step trends in water quality by non-parametric tests. Berryman et al. (1988) presented an extensive review of non-parametric tests as applied to water quality data and evaluated current studies on the subject. They further proposed a methodology on how to select the most appropriate test for a given time series. They indicated that time series analysis is difficult to use when observations are taken at irregular intervals. It is possible to model trends by time series analysis, but such a procedure does not, by itself, detect trends that are considered significant. Only graphical and statistical tests can be used to detect such trends. On the other hand, statistical tests can be used together with time series analysis. For example, tests that cannot be used on periodic series can be applied after seasonality is removed from the series by means of time series analysis models. However, Berryman et al. (1988) also note that recent developments in tests for water quality data allow trend detection in a great variety of water quality time series without having to decompose the series into its components before testing it for trend. According to Berryman et al. (1988), only graphical methods and statistical tests can be used to detect significant trends. The rule here is to consider a trend as significant when its magnitude is large, compared to the variance of the process, so that the probability of its occurrence by chance only is minimal. Usually, a trend is considered significant when its probability of occurrence only by chance is below 5%. Berryman et al. (1988) listed 12 tests for monotonic trends (e.g. Kendall, Spearman, intrablock tests, aligned ranks tests), seven tests for step trends (e.g. median, Mann–Whitney, Kolmogorov–Smirnov), and three tests for multistep trends (e.g. Kruskall–Wallis). Among these, Spearman and Kendall tests are the most powerful when time series do not contain seasonal variations. Intrablock and aligned ranks tests can be used when data are affected by cycles. In intrablock tests, the data are blocked into seasons or months where the seasonality effect is homogeneous. Then, all blocks are subjected to ‘treatments’ that are given values of the independent variable, which is time (Berryman et al., 1988). Intrablock and aligned rank tests measure the relationship between time and the variable analysed. Berryman et al. (1988) have concluded that Mann–Whitney, Spearman and Kendall tests are the best methods for trend detection in water quality time series. Among these, Mann–Whitney is the most widely used two-sample test when the assumptions of its parametric equivalent, the t-test, are not met.

30

1.3.6

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

Measuring the information content of observed data

An important conclusion to be drawn from Sections 1.3.1–1.3.5 is that difficulties in the application of most statistical methods stem from deficiencies in the monitoring practices applied, that is selection of sampling frequencies, sites, sampling duration, and so on. Thus, it may be advisable at this point to reassess the current status of monitoring programmes. In particular, it is essential to define a priori the specific information expected from monitoring and further to express this expectation in statistical terms. This appears to be the only way of producing data that comply with the assumptions and requirements of available data analysis methods. In general, two statistical measures are used to define the information conveyed by and expected from observed data. These are summarised in the following sections (NATO LG, 1997). The Fisher information measure Fisher (1921) proposed a measure for estimation of information associated with a statistical parameter. The information is measured through an estimate of that parameter and is the inverse of the variance of the sampling distribution of that parameter I p ¼ 1= 2p

(1:22)

where Ip is the information content of the estimate of the parameter and  2p is the variance of the sampling distribution of the parameter estimate. If the information content of the population mean of a random variable is desired then it can be shown that I  ¼ N = 2

(1:23)

where the random variable has a mean , standard deviation , and I  is the information content contained in a series of N independent observations of the random variable. In general, Ip is a linear function of the sample size, N, assuming the statistical parameter estimate is unbiased. In practice, however, there might be biases in parameter estimate which would force the sampling variance to underestimate the uncertainty in the biased estimator. Without correction for biases, the original Fisher measure of information could be misinformation, for it actually decreases as the estimate of the parameter improves. For example, as the number of samples increases, the variance of the sampling distribution tends to increase (Matalas et al., 1975). Matalas and Langbein (1962) were probably the first to introduce the Fisher measure in hydrology. Moss (1970) used it to determine an optimum operating procedure for a river-gauging station. Moss et al. (1985) used it for evaluation of hydrological data networks, and Moss and Gilroy (1980) for determining cost-effective streamgauging strategies for the Lower Colorado River basin. Entropy measures Shannon (1948) developed the entropy theory which provides a measure of information contained in a set of data or the distribution of a random variable. The entropy

ENVIRONMENTAL DATA FOR NATURAL RESOURCES MANAGEMENT

31

theory has been applied to a broad range of scientific areas. In environmental and water science, it has been applied to a wide spectrum of problems (Harmancioglu and Ozkul, 2006; Harmancioglu and Singh, 1998; Ozkul et al., 2000). For a discrete random variable x, the Shannon entropy can be defined as I ð xÞ ¼ 

N X

P(xi )ln P(xi )

(1:24)

i¼1

where P(xi ) is the probability that x ¼ xi , N is the number of observations, and I(x) is the entropy value of x or P(x). If P(x1 ) ¼ P(x2 ) ¼ P(x3 ) ¼ . . . ¼ P(x N ) ¼ 1/N , then I ¼ ln N

(1:25)

where ln is to the base 2 but can be converted to any other base using an appropriate multiplier. Equation 1.25 specifies the upper bound for I. If x is a continuous variable, the Shannon entropy is defined as ð

I ð xÞ ¼  f (x) ln f (x)dx  ln ˜x

(1:26)

In practice, a truncated form of Equation 1.26 is used: ð I ð xÞ ¼  f (x) ln f (x)dx

(1:27)

The truncation process may have a significant effect on the value of entropy (NATO LG, 1997). From the perspective of hydrological or water quality monitoring design, three types of entropy are useful: conditional entropy, joint entropy and transinformation. Consider two random variables x and y. The conditional entropy I(x|y) is defined as I(xj y) 

XX i

Pð xi , y j Þ ln P(xi j y j )

(1:28)

j

where P(xi ,y j ) is the joint probability of xi and y j, and P(xi |y j ) is the conditional probability of xi conditioned on y j. The joint entropy of x and y is defined as I ð x, yÞ ¼ 

XX i

The transformation T(x,y) is defined as

j

Pð xi , y j Þ ln Pð xi , y j Þ

(1:29)

32

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

T ð x, yÞ ¼ I ð xÞ  I(xj y) ¼ I ð yÞ  I( yjx)

(1:30)

¼ T ð y, xÞ

(1:31)

¼ I ð xÞ þ I ð yÞ  I ð x, yÞ

(1:32)

¼

XX i

j

Pð xi , y j Þ ln

P(xi j y j ) P(xi )

(1:33)

T(x,y) can be referred to as the (average) information transmission and is also called the mutual information. The quantity I(x,y) measures the uncertainty associated with the outcome pair x,y, and is dependent on the association between two variables as well as their dispersion. I(x|y) measures the uncertainty in x given the knowledge of y. The criterion of maximum information transmission can be used for water quality monitoring network design.

1.3.7

The use of statistics in water management studies

Current studies for water management, water resources being the principal component of the environment, are often carried out on regional or watershed scales. These studies cover the delineation of the status of the watershed and possible driving forces, analysis of available input data and finally assessment of the response of the watershed system to inputs and identified natural or anthropogenic interventions. Various statistics are employed at different phases of such studies as summarised in the following (NATO LG, 1997) sections.

Driving forces
These provide the input to the watershed. The input can be natural, such as acid precipitation, or man-made, from both point sources and non-point sources, such as waste discharge from industry, city sewage, agricultural pollution due to chemical fertilisation, and so on. The data expressing the driving forces must be checked for quality, trend, completeness, and homogeneity or consistency. Frequently there are gaps in the data, and these must be filled in. Mass curves are used for checking the homogeneity or consistency of data. The normal ratio method, the inverse distance squared method and the correlation method are among the methods for filling in missing values (a sketch of the normal ratio method is given below); the entropy method is also used for this purpose (Singh and Harmancioglu, 1997). The data must also be checked for errors, representativeness and sampling strategy. Not all data are collected at the same temporal frequency: some are collected more frequently than others. The question then arises of how they should be transformed to the same base frequency, as may be needed by the design methodology, without undue loss of information. Statistics help to accomplish this objective. Statistical methods for trend detection are employed where the data exhibit any persistence.
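As an illustration of the infilling step, the sketch below applies the normal ratio method, in which a missing value at a target station is estimated from surrounding stations, each observation scaled by the ratio of long-term means; the station means and observations are hypothetical:

```python
# Normal ratio method: estimate a missing precipitation value at a target
# station from m surrounding stations, weighting each observation by the
# ratio of the target's long-term mean to that station's long-term mean.
def normal_ratio(target_mean, neighbour_means, neighbour_obs):
    m = len(neighbour_obs)
    return sum((target_mean / nm) * obs
               for nm, obs in zip(neighbour_means, neighbour_obs)) / m

# Hypothetical example: long-term annual means (mm) and one storm's totals
estimate = normal_ratio(
    target_mean=700.0,
    neighbour_means=[650.0, 740.0, 810.0],
    neighbour_obs=[32.0, 41.0, 44.0],
)
print(round(estimate, 1))  # estimated storm total at the target station
```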


Watershed system
When dealing with a watershed, the soil, vegetation, land use, morphology and geology must be known or specified. These characteristics influence pollutant transport and storage within the watershed. Complicating their specification is their spatial variability, and statistics aid in characterising this variability. Furthermore, statistics help to classify basins on the basis of similarity and homogeneity measures; these measures are useful when transposing results from one watershed to another. Here, correlation methods and kriging are helpful.

Observed data
Once a water quality monitoring network has produced some data, these data are analysed for information extraction. Using these data, the network design is checked for optimality. In other words, if the sampling frequencies in space and time are acceptable and in accord with the design objectives, and the cost of data collection is not prohibitively large, then the designed network is satisfactory. To that end, entropy and correlation methods are employed. Other statistical measures, such as spectral methods and information content, can also be used.

1.3.8 The use of models in water management studies

Environmental data are essential for environmental management as well as for model building, calibration, verification and real-time application. The data requirements of different models differ, however, since the models are intended for different purposes. Conversely, depending on the availability, type and quality of data, different types of models are developed. Thus, data and models are interdependent. In practice, two criteria can be distinguished by which models and their data requirements are identified: (i) spatial and temporal resolution, and (ii) level of analysis. The two broad categories of temporal resolution are continuous and discrete, and those of spatial resolution are lumped and distributed (or spatially continuous). For example, water quality variables are sampled mostly at discrete time intervals, meaning that water quality models can, strictly speaking, only be discrete-time models. Continuous-time water quality models have to be based on temporal interpolation between sampled points, as illustrated in the sketch below. Environmental models are either lumped or distributed. Again, the availability of data was, until recently, the primary limitation on the development of distributed models. Strictly speaking, most environmental models are either lumped or quasi-distributed, as environmental variables such as water quality are measured at only a limited number of points and not continuously in space. The level of analysis is determined by the amount and resolution of the data available (both quantity and quality) on the one hand, and by the purpose of the assessment and the availability of resources on the other. Thus, models can be classified according to the level of analysis to be achieved by their use: screening, primary and secondary models.
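As a sketch of that interpolation step, the pandas example below (the station values and dates are hypothetical) brings an irregularly sampled water quality series onto a regular monthly grid:

```python
import pandas as pd

# Hypothetical irregular water quality samples (mg/L) at one station
obs = pd.Series(
    [4.2, 3.8, 5.1, 4.6],
    index=pd.to_datetime(["1991-01-05", "1991-02-20", "1991-05-11", "1991-06-02"]),
)

# Resample onto a month-start grid, then interpolate linearly in time:
# the usual discrete-to-continuous approximation mentioned in the text.
monthly = (
    obs.resample("MS").mean()     # collapse multiple samples within a month
       .interpolate(method="time")
)
print(monthly)
```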


Screening models provide a quick examination of the environmental fate of a water quality variable and are therefore used to provide a qualitative assessment of the behaviour of the variable at a specific site. Example applications of such models for water quality analysis are: (i) evaluation of the relative erosion potential of different soil types in a watershed; (ii) evaluation of the relative effect of rainfall characteristics on chemical washoff; (iii) comparison of relative chemical migration in variable climatic regimes; and (iv) evaluation of the health hazards of different chemicals. The data requirements of such models are rather limited.

The objective of primary models is to provide a more detailed representation of the predominant environmental fate processes at an intermediate level. In other words, their results are somewhat more meaningful and quantitative than those of screening models. Example applications include: (i) evaluation of the risk of exposure to contaminants above a certain threshold; (ii) evaluation of the surface and subsurface transport of chemicals applied during cropping; and (iii) estimation of chemical loading downstream due to human activities upstream. Primary models can help to identify dominant processes and variables, and thus constitute a good management tool. Their data requirements are intermediate.

Secondary models are used for performing comprehensive analyses of water quality processes. Example applications include: (i) evaluation of the fate and transport of agricultural chemicals; (ii) quantification of the effect of agricultural practices on non-point source pollution; and (iii) estimation of the impact of human activity upstream on chemical loading downstream. The data requirements of these models are high. In practice, most assessments are realised in an iterative manner, with screening done first and secondary modelling applied last. These models have different configurations and require different database structures. Modern tools, including models, GIS, database management, graphics and expert systems, can be used to analyse the monitored data.

1.4 DECISION MAKING FOR ENVIRONMENTAL MANAGEMENT

1.4.1 A new approach

Recent advances in information and communication technologies, coupled with the development of new tools such as remote sensing and satellite technology (RSST), GIS and expert systems, have changed the framework of the decision-making process. In the past, planners did take advantage of models and data. A typical approach would be to develop a management model, often in the form of an optimisation model, and to incorporate into it a technological model (e.g. water quality, urban infrastructure, erosion and similar) as part of the objective function or the constraints. The solution to such an optimisation problem would indicate the 'best' decisions to be made. Nowadays, the increased complexity of the management problem on the one hand, and the availability of advanced tools for management on the other, have changed the way the decision-making process is accomplished. The current approach involves two basic steps: (i) development of an integrated information system to support the decision making; and (ii) building and testing alternative management scenarios by running each through the information system. These two steps imply that the decision-making process has become an interactive one; that is, the planner communicates with the information system and tests the consequences of decisions instead of arriving at a single fixed decision. Figure 1.3 summarises the basic steps in the decision-making process for integrated water management. The major procedures in this process cover the development of a basin information system, followed by the development of a decision support system comprising databases, models, GIS and expert systems. As noted in the previous sections, the most important issue here is that water resources management has to be based on 'informed' decision making, where information in three dimensions, namely economic, social and environmental, is required to identify 'indicators' for sustainability. Such identification is realised by the use of decision support systems (DSS) comprising the integrated tools of databases, models, GIS and expert systems. The success of DSS applications is closely related to the quantity and quality of the available data and information on the economic, social and environmental aspects of water resources.

1.4.2 Development of an information system for management

Development of an information system for decision making involves three basic steps (Fedra, 1997):

1. preparation of data for analysis;
2. analysis of data and modelling to transfer data into information;
3. development of methods of communication for the user (planner) to interpret the results and interactively participate in the planning process.

Preparation of data by itself requires significant effort to sort the available data into a usable format. The data describe different processes and come from numerous sources and different monitoring organisations, so they are often available in varying formats; incompatible data formats are thus a problem to be resolved. Furthermore, if data are collected via automatic measuring equipment or are remotely sensed satellite data, they come in large volumes. If they are monitored by traditional sampling methods, problems of incompleteness, inadequacy and non-homogeneity of data, together with high sampling errors, must be handled. Accordingly, the data preparation step involves testing the available data for reliability and errors, and integrating large volumes of multi-format, multi-source and multimedia data into a database. Clearly, this step is essentially devoted to data management.

To manage the available data and databases effectively, GIS may be used as an essential element of the information system. Fedra (1997) states that GIS is 'the backbone of data management'. When GIS is integrated with the database system, the two elements together actually constitute the 'backbone' of the whole management procedure. This is why current management practices include GIS as an essential tool.

When data are prepared via database–GIS integration, models can be run to interpret the data for the production of information. This is another stage of integration, where data + GIS + model, as primary tools of the information system, are integrated. GIS plays an active role here by preparing the input data for the model and by displaying the model results within a geographical reference frame. A final stage of integration is realised between the model and an expert system. Expert systems function like a model in that they define, for a given set of input variables, an output value for a variable. Unlike numerical models, expert systems accomplish this function in a rule-based, qualitative framework, where not only physical but also socio-economic factors can be accounted for. A numerical model can be integrated with the expert system by substituting a set of rules (Fedra, 1997). Scenario development and testing, as the second step of the decision-making process, can then be accomplished through the integrated model + expert system component of the information system.

It follows from the above that the information system for environmental management is established by the integration of basic management tools: databases, GIS, models and expert systems. This integration constitutes a multimedia framework as a basis for interactive software systems, which are nowadays designed with an object-oriented approach. Such systems can be made accessible to local users or even to global communities via networks such as the World Wide Web (Fedra, 1997). Thus, users can easily access and interact with the system, test and interpret the results of various scenarios, and make judgements. This implies that the decision-making process has evolved into a dynamic procedure in which the decision maker is actively involved in all aspects of the management problem, that is, databases, modelling, GIS and expert systems.

Figure 1.3: The decision-making process for water resources management. (Flowchart: identification of management objectives and preferences; identification of the basin system; development of the basin information system through monitoring systems, data validation and databases; data analysis and transfer of data into information; basin modelling; development of GIS and expert systems; integration of databases, models, GIS and expert systems into a decision support system (DSS); development and comparison of alternative management scenarios via the DSS; and decision making with analyses of risks, reliability and uncertainties.)

1.4.3 Sustainability issues and environmental indicators

For the past decade, sustainability has been a highly popular concept in environmental and water resources management. It entails a long-term, rather than a short-term, perspective in resource assessment and management, and is often attached to the concepts of renewability, resilience and recoverability. Clark and Gardiner (1994) describe these concepts as follows.

• Renewability refers to the 'rate at which a resource can be replaced, so that sustainability is achieved by restricting the level of use to something at or below the rate of replacement'.
• Resilience describes 'the ability to withstand stress without long-term or irreversible damage', and 'sustainability is achieved by restraining use to a level at or below that which exceeds the system's resilience'.
• Recoverability is 'a concept which accepts that detrimental impact may take place, but concentrates on the rate or frequency of impact in relation to the inherent rate of recovery. A sustainable system is one in which impact intensity and frequency are small in relation to recovery rate'.

All of the above concepts are significant in catchment planning. Furthermore, a holistic approach is emphasised in assessing the catchment system, so that the system is considered to be interconnected and short-term decisions are viewed as having long-term direct and indirect consequences (Clark and Gardiner, 1994). Chapter 8 of Agenda 21 emphasises that sustainability must be achieved in making decisions for overall development and resource management at regional and national scales. In that regard, decisions must seek sustainability in three dimensions: economic (efficiency), social (equity) and environmental (compatibility) (Figure 1.4). If significant, a fourth dimension can be added to cover institutional issues, as in Figure 1.5. Furthermore, Chapter 40 of Agenda 21 on 'Informed decision making' (statements 6 and 7) foresees the identification of sustainability indicators along the above three dimensions and requires that these indicators be updated regularly and reported in regional or national studies. Accordingly, various international organisations have prescribed lists of indicators to be used in local, regional, national and global development and management plans. For example, the indicators developed by the World Bank for various projects cover the following:

• population-related indicators: demographic indicators, rate of population increase, living standards, health aspects and similar;
• land use: rate of urbanisation, properties of vegetation cover, land use types, and so on;
• economic activities: economic indicators; indicators on agriculture, energy, fisheries, mining and industry;
• environmental indicators: water use, water demand, water quality and quantity, water supply, wastewater discharges, air quality, hazardous wastes and similar;
• actors and policies: number of actors involved in water management, number of NGOs, presence of regulations on the environment, and so on.

Figure 1.4: Dimensions of water resources management studies. (Diagram: overlapping economic, social and environmental dimensions; their intersection defines the sustainable region for development and management.)

Figure 1.5: Pillars of sustainability and examples of indicators. (Diagram: economic (benefit/cost, system efficiency), social (satisfaction of water demand, gender issues, equity between generations, level of social conflicts), environmental (coastal water quality, environmental health) and institutional (NGOs, legal/political issues) pillars.)

The above indicators are also classified to represent system (e.g. basin) properties such as P (stress on the system), S (state of the system) and R (system response). Similar indicators are identified by the European Environment Agency (EEA) and by the Mediterranean Action Plan (MAP) for Mediterranean countries. There are 130 MAP indicators, 45 of which are P type; 45 are S type; and 40 are R type (UNEP, 2000). In essence, the system properties of P, S and R are major components of the DPSIR (driving forces–pressures–states–impacts–responses) approach proposed by EEA in resource development and management studies as shown in Figure 1.6 (Harmancioglu, 2004). It follows from the above that sustainability has to be defined on the basis of indicators which, in turn, require large numbers and types of data to be realistically specified. It is also clear that we need data not only in the environmental dimension but also in social and economic dimensions. These two issues are illustrated in the case of two EU projects, SMART and OPTIMA, which were carried out on five case studies in Turkey, Lebanon, Cyprus, Morocco and Jordan. Figure 1.7 summarises the basic approach applied in the projects where the final stage comprised a comparative analysis across the five case studies. The following section displays only the Turkish case to summarise data requirements and assessment of indicators.


Figure 1.6: Basics of the DPSIR approach and relevant indicators in resource management studies. (Diagram: driving forces (D) — indicators measuring the forces that drive the actual water demand — cause pressures (P) — indicators collating information on the mechanisms and rates of use/exploitation of water resources that may affect their state (abstraction, pollution, etc.) — which influence the state (S) of the water resources and supply system; impact (I) indicators evaluate the consequences of changes in state; responses (R) are societal courses of action, resulting from policy and decision making, aimed at solving the problems analysed in the D–P–S–I chain and targeted at one or more of D, P, S and/or I.)

Figure 1.7: Basic approach employed by the SMART and OPTIMA projects for basin management studies. (Diagram: inputs — driving forces (demographic, socio-economic) and pressures (water demand, water use, technology, institutions) — together with responses, feed the analysis; outputs cover state (water quality, water availability), costs and benefits; indicators include impacts (satisfaction of water demand, environmental health, level of social conflict), efficiency of the system and water allocation rules, which feed a comparative analysis across the case studies.)

1.5 SMART AND OPTIMA PROJECTS: GEDIZ CASE STUDY

The Gediz River Basin along the Aegean coast of Turkey (Figure 1.8 in the colour insert) demonstrates the entire range of prototypical water management problems in the Mediterranean region, and reflects the importance of the institutional and regulatory framework and the need for direct participation of the major actors and stakeholders in the decision-making process for sustainability (Harmancioglu, 2001). The case is studied within the scope of two EU INCO projects: SMART (Sustainable Management of Scarce Resources in the Coastal Zone; contract number ICA3-CT-2002-10006) and OPTIMA (Optimisation for Sustainable Water Management; contract number INCO-CT-2004-509091), sponsored by the EU FP5 and FP6 Programmes, respectively (www.ess.co.at/SMART; www.ess.co.at/OPTIMA). The SMART project was completed in 2005 and OPTIMA in 2007.

1.5.1 Data requirements

Basin profile
Data availability and quality are considered prerequisites for achieving the objectives of both SMART and OPTIMA in solving water supply/demand issues through the testing of strategies that can reconcile conflicting demands on scarce water resources. The database for the Turkish case study comprises the data required by the three primary components of the project, namely, the socio-economic framework and two quantitative analysis tools (models): WaterWare and TELEMAC. Initial investigations on data availability and quality were accomplished through a 'requirements and constraints analysis'. This analysis disclosed:

1. a description of the case study area;
2. information on data: availability, sources, coverage, formats, quality and cost;
3. key water issues;
4. key issues of change;
5. policy-relevant information;

which were then summarised as a profile of the basin, as in Table 1.2, to present the basic initial information.

Socio-economic data
The socio-economic data required by the projects were defined in the form of another analysis on 'socio-economic framework and guidelines'. These data provided inputs to the following four tasks:

1. population, demographic and migration policy analysis;
2. political and economic options adopted for the study areas;
3. competing water uses;
4. economic analysis of water resources.

Table 1.2: Compilation of basic information on Gediz Basin in the form of a basin profile.

Description | Data | Area | Reliability | Source
The geographic location of the case study in longitude and latitude | North: 38°04′–39°13′; East: 26°42′–29°45′ | Turkey | M | USGS
The drainage area (km²) | 18 000 km² | The Gediz river basin | M | State Hydraulic Works (DSI)
The resident population | 1 700 000 | The Gediz river basin | M | State Institution of Statistics (DIE)
The average annual rainfall | 700 mm | The Gediz river basin | M | State Institution of Meteorology (DMI) and State Hydraulic Works (DSI)
The cost of water per litre (Euros) | Bottled: €0.87 | National | M | Market prices
Surface water – the volume of annual mean discharge from the drainage area | 940 MCM (52 mm) | The Gediz river basin | M | State Hydraulic Works (DSI)
Ground water – the volume of water pumped to the surface for use each year | 160 MCM (9 mm) | The Gediz river basin | M | State Hydraulic Works (DSI)
Treated water – the volume of water treated annually | There is no use of treated water | The Gediz river basin | – | Municipalities
(Current annual water supply: 1100 MCM; per capita: 647 m³)
Annual volume of water required for domestic purposes | 133 MCM (7.4 mm) | The Gediz river basin | M | Municipalities and DSI
Annual volume of water required for agricultural purposes | 695 MCM (39 mm) | The Gediz river basin | M | DSI
Annual volume of water required for industrial purposes | 54 MCM (3 mm) | The Gediz river basin | M | Municipalities
Annual volume of water required for touristic purposes | There is no tourism activity | The Gediz river basin | M | –
Annual volume of water required for environmental purposes | 3.6 MCM (0.2 mm) | The Gediz river basin | M | DSI
(Current annual water demand: 886 MCM; per capita: 521 m³)

Key issues: problems for discussion regarding the sustainable management of scarce water resources
Water issues – problems in supplying sufficient water to meet current and future demands: Urban and industrial wastewater and agricultural return flows deteriorate water quality. At Middle Gediz the quality classification has declined from class III to class IV. Over the entire basin, daily BOD loads from domestic and industrial discharges amount to 210 000 kg flowing into surface waters (DSI). According to the Turkish water quality classification, 60% of surface water and 30% of groundwater are in class IV (DSI). Water shortage is a significant problem due to frequent droughts and competition for water among various uses, mainly irrigation and domestic (DSI and municipalities). Supply (61 mm) versus demand (49.6 mm): there is no water left for allocation, whereas domestic demand increases at a rate of 2% (0.15 mm) each year and industrial demand at a rate of 10% (0.3 mm) each year.
Trends in demographic change and how they affect water supply and demand: Basin population is growing at a rate of 1.5%. There is considerable internal migration and rapid urbanisation. Total basin population increased from 1 100 000 in 1970 to 1 700 000 in 1997 (DIE).
Trends in land use change and how they affect water supply and demand: Urban areas are increasing at a rate of 2%, and industrial areas are growing at a rate of 10%, resulting in an increase in water demand. Urban population is increasing at a rate of 2%, while rural population is decreasing at a rate of 0.7% (DIE).
Trends in technological change and how they affect water supply and demand: New irrigation technologies are encouraged to improve irrigation efficiency through the reduction of water loss and changes in irrigation methods. This may reduce irrigation demand, or may cause an increase in irrigated areas using the same amount of water (DSI).
Trends in institutional change and how they affect water supply and demand: There are serious institutional, legal, social and economic drawbacks, which enhance water allocation and environmental pollution problems. There are constraints to achieving basin management objectives. Institutional evolution is slow in comparison with the rapid evolution of water management problems. The legislation used in current management practices is too old and cannot meet current demands.

Key regulations: existing rules that set constraints for the management of water resources
What the law says about water price: how is it determined? There is no specific regulation to determine the price of surface water or groundwater. If consumers use water for irrigation from the water distribution systems installed by the government, they pay for water considering the cost of the maintenance and operation of the system (Municipality Law, 1930). Domestic and industrial water prices are determined by local municipalities if the consumers use the public water distribution systems (DSI Law, 1953).
What the law says about water rights and water allocation: which uses or groups have priority? The law that governs surface water use rights in Turkey foresees that water is a public good which everyone is entitled to use, subject to the rights of prior users; there is no registration system for surface water rights or water use (Civil Law, 1926; Groundwater Law, 1960). Each landowner has the right to use groundwater on the condition that it is used for meeting personal needs and after obtaining permission from DSI (Groundwater Law, 1960).
What the law says about water quality: what standards are in place, and how is water quality monitored? There are too many regulations, and enforcement of the laws is poor: Municipality Law (1930); Water Products Law (1971); ISKI Law (1982); Environmental Law (1983); National Parks Law (1983); Environmental Protection Fund Regulation (1985); IZSU Law (1987); Water Pollution Control Regulation (WPCR) (1988); Administrative WPCR (1989); Technical WPCR (1991); Local Environmental Regulations (1993); Wetland Law (1995).

The socio-economic data were analysed and processed through a 'data compilation and analysis' into a consistent database. The use of these data led to the management scenarios described in Section 1.5.2. Figure 1.9 shows the districts within the Gediz River Basin to which the data presented in Tables 1.3–1.7 apply.

Figure 1.9: Districts within the Gediz Basin. (Map of the governmental districts in and around the basin.)

Geospatial data

Base maps
Within the SMART project, a number of digital maps in the form of several GIS layers are prepared for the Gediz River Basin by following a series of digital data capturing methods. The most fundamental of these is the digital elevation model (DEM), representing the topography throughout the whole basin area. The DEM is obtained from the USGS (US Geological Survey) EROS Data Center in the USGS–DEM format, with sufficient vertical resolution. These data have then been converted from geographical coordinates into the national coordinate system of Turkey, namely the Universal Transverse Mercator projection (Zone 35N for the region) on the ED-50 datum (Figure 1.10 – see colour insert); a sketch of this conversion is given below. The river network within the basin is first obtained from the digital chart of the world (DCW), which is then transferred to the same projection system as the DEM and modified in parts by spatial editing on the basis of the hydrographic information available on hard copy maps. In this modification, some river reaches are redrawn from the paper maps and some deleted, to obtain a network compatible with a selected level of hierarchy (Figure 1.11(a)). Landcover information was directly transferred from the geospatial database of an already completed project (IWMI, 1998) and examined for compatibility with the previously produced data layers (Figure 1.12 – see colour insert). The landcover map has a classification based on the USGS classification method, including land classes such as urban or built-up land; dry land; cropland and pasture; irrigated cropland and pasture; cropland/grassland mosaic; cropland/woodland mosaic; grassland; shrubland; mixed shrubland/grassland; mixed forest; deciduous needleleaf forest; evergreen broadleaf forest; water bodies; barren or sparsely vegetated; and bare ground tundra.
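The projection step described above can be reproduced with standard GIS tooling. A minimal sketch using pyproj is given below; the sample point is hypothetical, the source layer is assumed to be referenced to geographic WGS84 coordinates (EPSG:4326), and EPSG:23035 is the code for ED50 / UTM zone 35N:

```python
from pyproj import Transformer

# Geographic coordinates (assumed WGS84) -> ED50 / UTM zone 35N (EPSG:23035),
# the national projection used for the Gediz Basin layers.
transformer = Transformer.from_crs("EPSG:4326", "EPSG:23035", always_xy=True)

lon, lat = 27.5, 38.6            # hypothetical point inside the basin
easting, northing = transformer.transform(lon, lat)
print(f"E = {easting:.0f} m, N = {northing:.0f} m")
```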


Table 1.3: Urban population, growth rate and population density of districts within the Gediz Basin (source: www.die.gov.tr; www.tcmb.gov.tr).

Municipality | Population (1990) | Population (2000) | Growth 1990–2000 (%) | Yearly growth rate (%) | Density 1990 (ca/km²) | Density 2000 (ca/km²)
MANISA Centre | 221 694 | 278 555 | 26 | 0.23 | 181 | 227
Ahmetli | 19 194 | 18 852 | 2 | 0.2 | 64 | 63
Akhisar | 152 397 | 152 582 | 0 | 0 | 89 | 89
Alasehir | 91 362 | 93 760 | 3 | 0.3 | 90 | 92
Demirci | 60 184 | 59 314 | 1 | 0.1 | 41 | 41
Golmarmara | 18 524 | 17 831 | 4 | 0.4 | 67 | 65
Gordes | 40 399 | 38 110 | 6 | 0.6 | 43 | 40
Kirkagac | 45 608 | 48 303 | 6 | 0.6 | 83 | 88
Koprubasi | 12 182 | 10 851 | 11 | 1.2 | 41 | 37
Kula | 48 132 | 52 986 | 10 | 1 | 52 | 58
Salihli | 134 854 | 149 151 | 11 | 1 | 111 | 122
Sarigol | 34 682 | 35 621 | 3 | 0.3 | 99 | 101
Saruhanli | 72 093 | 68 134 | 5 | 0.6 | 86 | 81
Selendi | 25 415 | 26 061 | 3 | 0.3 | 36 | 37
Soma | 76 641 | 89 038 | 16 | 1.5 | 92 | 107
Turgutlu | 101 057 | 121 020 | 20 | 1.8 | 214 | 256
IZMIR Menemen | 76 043 | 114 457 | 51 | 4.2 | 110 | 165
IZMIR Kemalpasa | 56 075 | 73 114 | 30 | 2.7 | 86 | 112
USAK Centre | 145 505 | 179 458 | 23 | 2.1 | 111 | 137
Regional | 1 432 041 | 1 627 198 | 12 | 1.3 | 88 | 103

Table 1.4: Migration, death/birth rates and life expectancy values as of 1990 and 2000 (source: www.die.gov.tr; www.tcmb.gov.tr).

Life expectancy at birth (national, years): 1990 – total 66.4, male 64.2, female 68.7; 2000 – total 68.0, male 65.8, female 70.4.
Crude birth rate (regional, ‰): 1990 – 21; 2000 – 25.
Crude death rate (regional, ‰): 1990 – 8.6; 2000 – 9.3.
Regional migratory balance (MANISA, 1985–1990): permanent residence population (1990) 1 025 406; emigration 71 669; immigration 50 723; net migration 20 946; rate of net migration (%) – total 1.23, regional 1.04.
Growth of gross domestic product at market prices (national; no regional data available): 1990 – 9.4%; 2000 – 6.3%.
Activity rate (%): national 51 (1990), n/a (2000); regional 41 (1990).

Table 1.5: Agricultural income distribution by the main types of production (year: 1994) (source: www.die.gov.tr; www.tcmb.gov.tr).

Crops | National production (tons) | National value (million Euro) | National value of marketable production (million Euro) | Regional production (tons) | Regional value (million Euro) | Regional value of marketable production (million Euro)
Cereals | 26 934 400 | 2396 | 1440 | 355 913 | 32 | 20
Pulses | 1 678 606 | 495 | 349 | 14 378 | 6 | 4
Industrial crops | 13 826 794 | 1224 | 1176 | 109 716 | 107 | 104
Oil seeds | 1 859 052 | 315 | 295 | 59 394 | 1 | 1
Tuber crops | 6 315 000 | 1035 | 776 | 40 052 | 8 | 6
Fruits | 12 601 307 | 4268 | 3450 | 977 455 | 331 | 256
Vegetables | 17 778 965 | 2887 | 2334 | 697 900 | 97 | 79

Table 1.6: Agricultural income distribution by the main types of production (year: 1998) (source: www.die.gov.tr; www.tcmb.gov.tr).

Crops | National production (tons) | National value (million Euro) | National value of marketable production (million Euro) | Regional production (tons) | Regional value (million Euro) | Regional value of marketable production (million Euro)
Cereals | 33 060 972 | 5240 | 3144 | 346 248 | 51 | 31
Pulses | 1 599 360 | 763 | 534 | 14 158 | 5 | 4
Industrial crops | 23 485 669 | 3240 | 3134 | 129 541 | 263 | 256
Oil seeds | 2 391 105 | 565 | 528 | 74 932 | 2 | 1
Tuber crops | 7 720 000 | 1992 | 1452 | 52 131 | 16 | 12
Fruits | 13 933 034 | 604 | 5456 | 1 213 566 | 518 | 399
Vegetables | 21 151 592 | 5701 | 4613 | 895 565 | 172 | 139

Table 1.7: Percentage of tertiary employment (year: 2000) (source: www.die.gov.tr; www.tcmb.gov.tr).

Municipality | % of tertiary employment
MANISA Centre | 56.09
Ahmetli | 54.10
Akhisar | 59.31
Alasehir | 62.57
Demirci | 54.78
Golmarmara | 26.69
Gordes | 54.20
Kirkagac | 78.84
Koprubasi | 55.62
Kula | 47.84
Salihli | 62.68
Sarigol | 44.27
Saruhanli | 39.75
Selendi | 68.46
Soma | 56.98
Turgutlu | 52.09
Total of sub-districts and villages | 8.59
MANISA total | 25.38
IZMIR Menemen | 58.00
IZMIR Kemalpasa | 43.12
IZMIR total | 59.32
USAK Centre | 52.32
REGIONAL | 36.39
NATIONAL | 38.1

Soil information was obtained in the same way from the external database of the above-mentioned project and was included in the geospatial database of SMART with the same coordinate information used for the other layers (Figure 1.13 – see colour insert).

Geospatial data on hydrometeorological observations
The geospatial database for Gediz also includes three layers of geospatial information, each representing the geographic locations of observation points within the basin. Meteorological stations are placed inside and around the basin area, using the coordinate information given on the website of the General Directorate of State Meteorological Works (DMI), which is mainly responsible for the operation of these stations (Figure 1.11(b)). When inserting geographic locations into the database, a few errors were observed; these were removed following a cross-check against the hard copy maps distributed by DMI. Stream gauging stations are included in two different digital layers, again based on the coordinate information on the Internet, as some of these stations are operated by the General Directorate of State Hydraulic Works (DSI), while the rest are run by the General Directorate of Electrical Works Authority (EIE). Figures 1.11(c) and 1.11(d) show the geographical locations of the stream gauging stations operated by DSI and EIE, respectively.

Figure 1.11: (a) Stream network for the Gediz River; (b) meteorological stations within the Gediz Basin; (c) stream gauging stations within the Gediz Basin, operated by DSI; (d) stream gauging stations within the Gediz Basin, operated by EIE.

Data for modelling studies

Data compilation for WaterWare model applications
WaterWare (WRM) is an annual water budget model which simulates the river system on a daily basis; hence, daily hydrometeorological data within a certain year are compiled for the Gediz River Basin on the basis of its topological features. The data needs of WRM depend on the 'nodes' and connecting 'reaches' identified in the model; a toy sketch of this node–reach logic is given below. In the Gediz case, data on inflows, precipitation, temperature and dynamic demand series are compiled to satisfy the data needs of WaterWare. In addition, some physical features and characteristics of reservoirs, reaches and aquifers are considered in the data compilation process.
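The node–reach logic referred to above can be pictured with a toy daily budget. The sketch below is a simplified stand-in, not the actual WaterWare implementation; all names and numbers are hypothetical:

```python
# Toy daily water budget along a chain of reaches: each demand node takes
# what it can from the flow arriving at its reach, and the remainder is
# routed downstream. A simplified stand-in for the WaterWare node/reach logic.
def route_day(inflow, lateral_inflows, demands):
    flow, supplied = inflow, []
    for lateral, demand in zip(lateral_inflows, demands):
        flow += lateral                 # tributary/creek joining the reach
        take = min(demand, flow)        # delivery limited by available flow
        supplied.append(take)
        flow -= take                    # residual routed to the next reach
    return supplied, flow               # deliveries and basin outflow

# Hypothetical day: upstream release, two tributaries, three demand nodes (m3/s)
supplied, outflow = route_day(12.0, [1.5, 0.8, 0.0], [4.0, 6.5, 3.0])
print(supplied, round(outflow, 2))      # [4.0, 6.5, 3.0] and 0.8 m3/s outflow
```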


Hydrometeorological data
The hydrometeorological data for WRM applications in the Gediz Basin consist of outflows for the main subcatchments and meteorological data on precipitation and temperature, observed at particular gauging stations in the basin. Meteorological conditions in the case study area vary from year to year; between 1990 and 1995 in particular, a severe drought was experienced throughout the basin, which led to significant water scarcity. For the WRM analyses, the year 1991 is selected to represent drought conditions and the year 1982 to represent 'normal' conditions.

Flow time series
These are listed below.

1. Subcatchments and input flow time series: seven subcatchments in Gediz are identified for WRM applications. Two of them, Buldan and Afsar, situated in the southeastern part of the basin, have no observed outflow; their yields are therefore computed with a rainfall–runoff model. The outflows of the other five subcatchments are compiled from the annual flow almanacs of two governmental institutions, the State Hydraulic Works (DSI) and the Electrical Works Authority (EIE). The five subcatchments thus analysed are the Demirkopru Dam upstream, Gordes, Medar, Nif and Yigitler subcatchments. In addition, the daily flow time series of the downstream stream gauging station at Manisa Bridge are compiled for use in calibrating the model. The outflow time series of the other five minor tributaries and creeks joining the main river are also taken into account as lateral inflows to reaches; they are again compiled from the annual flow almanacs of DSI and EIE. All flow time series are uploaded to the WaterWare online database.
2. Dynamic water demand time series: 75% of the available surface water in the Gediz Basin is consumed by irrigation. There are many irrigation districts of different scales, ranging from 500 to 50 000 ha, with a total area of 107 000 ha. Four major irrigation districts, Alasehir, Adala, Ahmetli and Menemen, cover almost 90% of the entire irrigated area. For the preliminary analyses and calibration of WRM, the flows diverted from the river to the irrigation channels through weirs and reservoirs are compiled for the year 1991 and uploaded to the WaterWare online database.
3. Precipitation and temperature time series: precipitation and temperature data pertaining to each subcatchment were obtained by the Thiessen method. The locations of all meteorological stations in or near the Gediz Basin were already defined in the files previously uploaded to the geodatabase server, so the Thiessen polygons were easily obtained with the ArcInfo software. Finally, the representative precipitation and temperature time series for each subcatchment were calculated by weighting each station according to the spatial proportion of its polygon within the subcatchment area; a minimal sketch of this weighting is given after this list.
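The Thiessen weighting in item 3 reduces to an area-weighted sum once the polygon proportions are known. A minimal sketch follows; the station names, weights and values are hypothetical:

```python
# Areal precipitation for a subcatchment as a Thiessen-weighted average:
# each station's series is weighted by the fraction of the subcatchment
# area that its polygon covers (weights sum to 1).
def thiessen_average(series_by_station, area_fractions):
    n = len(next(iter(series_by_station.values())))
    return [
        sum(area_fractions[s] * series_by_station[s][t] for s in series_by_station)
        for t in range(n)
    ]

# Hypothetical daily precipitation (mm) at three stations and their weights
rain = {"Manisa": [0.0, 12.4, 3.1], "Salihli": [0.0, 9.8, 4.0], "Akhisar": [1.2, 15.0, 0.0]}
weights = {"Manisa": 0.45, "Salihli": 0.35, "Akhisar": 0.20}
print(thiessen_average(rain, weights))
```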


Compilation of data on reservoirs and reaches
In order to meet the data requirements of the WaterWare model, particular characteristics of the hydraulic structures on the Gediz River are identified and input to the model.

Reservoirs
The data required on reservoirs are: the geometry of the reservoir, height of the dam, flood level, storage capacity, dead storage, surface area, effective area, spill capacity and seepage coefficient. All of these data are obtained from the annual operation reports of the II. Regional Directorate of State Hydraulic Works and are presented in Tables 1.8 and 1.9.

Reaches
From among the characteristics of river reaches required by the WaterWare model, river length and river slope were computed using GIS techniques. Figure 1.14 (see colour insert) indicates the river reaches in the Gediz Basin within the basin boundaries. Selected attributes of the river theme used to determine the length and slope properties are shown in Figure 1.15.

Aquifers
The Gediz River Basin covers a large area and comprises several single aquifer systems, which are fed by precipitation and seepage from reservoirs. Among these, four aquifers situated in the major lowlands of the basin are intensively exploited for domestic and industrial water uses. The names and parameters of these aquifers as required by WRM are given in Table 1.10.

Further data
The OPTIMA project builds upon the results of SMART and intends to identify the best management plan for the basin. To realise the optimisation procedure involved, OPTIMA requires further data and couples the WaterWare modelling system with an optimisation component.

Table 1.8: Data on reservoir characteristics.

Characteristic | Demirkopru Res. | Afsar Res. | Buldan Res. | Marmara Lake | Gordes Res. | Yigitler Res.
Dam height (m) | 36 | 24 | 30 | 7.3 | 67 | 20
Flood level (m) | 36 | 24 | 30 | 7.3 | 64 | 20
Storage (10⁶ m³) | 1022 | 83.5 | 44 | 321.36 | 448.46 | 15.4
Dead storage (10⁶ m³) | 88.4 | 5.2 | 3 | 15 | 19.6 | 1.5
Surface area (km²) | 45 | 5 | 3 | 68 | 14.44 | 3
Effective area (km²) | 45.7 | 6.38 | 3 | 68.2 | 15 | 3.2
Spill capacity (m³/s) | 200 | 118 | 190 | 65 | 150 | 70
Seepage coefficient | 0.2 | 0.25 | 0.25 | 0.1 | 0.2 | 0.2

Table 1.9: Reservoir geometries defined in WRM (water level (m), surface area (km²) and volume (10⁶ m³) at ten levels per reservoir).

Demirkopru Res. – Level: 36, 33.5, 30, 26.5, 23, 19.5, 16, 9.5, 5.5, 2; Area: 45, 43.7, 40.2, 36.14, 32.89, 27.84, 22.8, 16.95, 14.09, 10.87; Vol.: 1022, 901.7, 754, 620.7, 499.9, 392.7, 305, 166.78, 112.19, 68.51.
Afsar Res. – Level: 24, 19.85, 16.55, 14.25, 12.05, 9.85, 7.65, 5.45, 3.25, 1.05; Area: 5, 4.9, 4.48, 3.97, 3.4, 2.83, 2.41, 2.06, 1.71, 1.35; Vol.: 83.5, 64.07, 48.09, 38.4, 30.3, 23.44, 17.72, 12.8, 8.65, 5.21.
Buldan Res. – Level: 30, 27, 22, 19, 16, 13, 10, 7, 4, 1; Area: 3, 2.55, 1.98, 1.66, 1.36, 1.1, 0.856, 0.709, 0.574, 0.463; Vol.: 44, 36.67, 25.36, 19.92, 15.4, 11.74, 8.81, 6.46, 4.55, 3.
Marmara Lake – Level: 7.3, 6.1, 5.5, 4.9, 4.5, 3.9, 3.3, 2.7, 2.1, 1.5; Area: 68, 60.07, 57.31, 54.52, 51.64, 49.6, 47.96, 42.34, 35.19, 24.96; Vol.: 321.36, 244.46, 209.2, 175.68, 143.8, 113.54, 84.28, 57.19, 33.62, 15.58.
Gordes Res. – Level: 64, 59, 49, 39, 29, 20, 14, 10, 6, 2; Area: 14.44, 13.44, 11.42, 9.38, 7.25, 5.48, 4.4, 3.73, 3.08, 2.45; Vol.: 448.46, 418.5, 318, 229.5, 150, 94, 65.6, 50, 36, 24.4.
Yigitler Res. – Level: 20, 2; Area: 3, 0.5; Vol.: 15.4, 1.5 (remaining entries not given).


Figure 1.15: Selected attributes of the river theme.

Table 1.10: The aquifers identified in WaterWare (WRM) and their parameters.

Parameter | Alasehir | Menemen | Salihli-Turgutlu | Sarikiz
Area (km²) | 940 | 300 | 1200 | 600
Average depth (m) | 80 | 80 | 90 | 80
Porosity (%) | 15 | 15 | 15 | 15
Percolation loss coefficient | 0.1 | 0.1 | 0.1 | 0.1
Deg. day coefficient | 3 × 10⁵ | 3 × 10⁵ | 3 × 10⁵ | 3 × 10⁵
Recharge coefficient | 0.75 | 0.75 | 0.70 | 0.65

In OPTIMA, the performance of the basin system is expressed in terms of criteria and indicators (any one of which can be used in the optimisation), such as the following.

• Overall water budget, balancing all inputs, losses, uses and outflows, including export and inter-basin transfer, and change in storage, including reservoirs and the groundwater system; additional information relates to the storage/extraction relationships and, thus, to the sustainability of the overall system in a long-term perspective.
• Technical efficiency of the system: this describes the ratio of useful demand satisfied to the losses through evaporation from reservoirs and seepage losses through the various conveyance systems.
• Supply/demand ratio, globally, for any and all individual demand nodes in the basin, or for any functional/sectoral grouping; this also includes any environmental water demands, low-flow constraints, wetland nourishing, and so on.
• Reliability of supply, measured at any or all demand nodes and control nodes, comparing the water available with user-defined needs/expectations as constraints.
• Development potential, which relates the unallocated water, summed over all demand points, to the total input: this, in principle, defines the amount of water that is available for further exploitation.
• Costs and benefits, derived from the useful demand satisfied and the added value derived, versus the costs of shortfalls (again at any or all nodes), as well as the costs of supplying the water in terms of (annualised) investment and the operating and maintenance costs of structures and institutions.
• Groundwater sustainability, which describes the ratio of content to the net withdrawal (balance of recharge, summed over natural and artificial) and evaporative and deep percolation losses, measured in years of reserves at current exploitation levels.

It follows from the above that the data requirements of OPTIMA are more exhaustive, including information on sectoral water use, irrigation districts, areas and crop patterns, basin institutions, stakeholders, and economic benefits and costs. Economic valuation (expressed as net present values or annualised costs, considering investments, operating costs and project or technology lifetimes) includes estimates of the cost of various alternative water technologies (from non-conventional supply options such as desalination, water harvesting, recycling and re-use, to new or bigger reservoirs, lining irrigation canals, more efficient irrigation technologies and water-saving showers) versus the benefits generated by supplying water for useful demands (Fedra and Harmancioglu, 2005). Different allocation scenarios, and also the use of different water technologies, lead to different cost–benefit ratios for the system. From the set of results generated, any number of constraints can be derived for the optimisation: both global criteria aggregated over all nodes and a yearly simulation run, such as the overall reliability of water supply, and node- and location-specific constraints defined as minimum or maximum flow (or supply) expectations, again with different temporal resolutions and aggregations (Cetinkaya and Harmancioglu, 2006; 2008; Cetinkaya et al., 2008). Table 1.11 presents an example of the constraints used in the optimisation, which constitute another set of indicators; these indicators were defined by basin stakeholders for the optimisation purposes of OPTIMA. A sketch of how such indicators are computed and screened is given below.
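To illustrate how such indicators become screening constraints, the sketch below (the node demands and deliveries are hypothetical) computes a supply/demand ratio and a reliability value for one demand node and tests them against rigid bounds of the kind listed in Table 1.11:

```python
# Screen a scenario against rigid indicator constraints (cf. Table 1.11):
# supply/demand ratio over the simulation, and reliability as the fraction
# of time steps in which demand is fully met.
def indicators(demand, supplied):
    sd_ratio = 100.0 * sum(supplied) / sum(demand)
    reliability = 100.0 * sum(s >= d for d, s in zip(demand, supplied)) / len(demand)
    return sd_ratio, reliability

# Hypothetical monthly demand vs. delivery at one node (10^6 m3)
demand = [10, 12, 20, 30, 42, 55, 60, 58, 40, 25, 15, 10]
supplied = [10, 12, 20, 28, 40, 50, 52, 50, 40, 25, 15, 10]

sd, rel = indicators(demand, supplied)
feasible = sd > 90 and rel > 90        # e.g. the DSI bounds of Table 1.11
print(round(sd, 1), round(rel, 1), feasible)
```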

1.5.2 Definition of indicators and management scenarios

Management scenarios are developed for the Gediz Basin on the basis of the approaches described in Figures 1.3 and 1.6. First, the baseline scenario is developed to identify the current status of the basin. Next, three types of future scenario are assessed: a business-as-usual (BAU) scenario, where all practices and characteristics of the basin are considered to remain the same in the future; an optimistic scenario, where practices are improved so that the indicators also improve; and a pessimistic scenario, where things change for the worse. For each scenario, economic, social and environmental indicators are computed, and the scenarios are then compared as in Figure 1.16.

Table 1.11: The range of 'rigid constraints' for optimisation (in the form of indicators), identified by basin stakeholders.

Indicator | Minimum and maximum value | State Hydraulic Works Authority
Overall supply/demand ratio | >70–95% | >90%
Reliability of supply | >70–95% | >90%
Benefit/cost ratio | >0.8–2.0 | >1.5
Economic efficiency (direct) | 0.10–0.70 €/m³ | 0.5 €/m³
Economic efficiency (indirect) | 0.08–0.9 €/m³ | 0.35 €/m³
Net benefit (direct or indirect) | 960–1200 €/ca | 1000 €/ca
Cost of water (direct) | 0.10–0.7 €/m³ | 0.4 €/m³
Cost of water (indirect) | 0.08–0.7 €/m³ | 0.25 €/m³

Figure 1.16: The structure of the SMART and OPTIMA projects, where indicators are identified for each scenario. (Diagram: scenario inputs — geographic and hydrologic, technology, external economic variables, decision variables, water allocation, institutional, water demand, water use, socio-economic, demographic, bathymetry, pollutants, hydrometeorology and hydrodynamic information — feed the socio-economic analysis and the WaterWare and TELEMAC models; outputs for each node N and time step t (costs, benefits, water quality, supply/demand rate) yield the indicators used in a comparative analysis.)

Both the SMART and OPTIMA projects involve socio-economic analysis and basin simulation through the WaterWare model (Harmancioglu et al., 2008). SMART also uses the TELEMAC model for modelling Izmir Bay to assess the pollutants discharged by the Gediz River into the bay, which is not discussed here. Table 1.12 presents the results of the three scenarios developed for the basin. At the final stage, WaterWare model simulations resulted in the management scenario results of Table 1.13, which were derived by using all the diverse types of data described in Section 1.5.1.

It is apparent in this case study that the development of basin management plans requires exhaustive amounts of data to be collected. The important issue here is that these data should be accurate and reliable in order to lead to sound decisions on the management of a basin (Harmancioglu, 2006; 2007). For the Gediz case, water quality management responses could not be assessed due to the lack of sufficient data; consequently, no possible actions can be foreseen for the future. This is an unfortunate situation resulting from data limitations, since one of the two key problems in Gediz is water pollution. Likewise, the economic efficiency of the Gediz system for water use/water supply issues cannot be evaluated, again due to the lack of sufficient and reliable data. It is fairly difficult to state the economic value per unit of water used in the basin. This again is an unfortunate situation resulting from data limitations, and it indicates a lack of awareness, responsibility and authority among the decision makers.

Table 1.12: Scenarios developed for the Gediz Basin.

Variables/driving forces | Baseline | BAU | Optimistic | Pessimistic
Birth control | Existing | Existing (partially successful) | Existing (successful) | Existing (unsuccessful)
Urban growth rate | 1.5%/year | 1.5%/year | 1%/year | 3%/year
Rural growth rate | 1%/year | 1%/year | 1%/year | 2%/year
Precipitation rate | 700 mm/year | 0% | 0% | 10%
Groundwater supply | 9 mm/year | 0% | 0% | 10%
Surface water supply | 59 mm/year | 0% | 0% | 10%
Groundwater pollution | Class IV | Class IV | Class III | Class IV
Basin-out water supply (surface and groundwater) | 0.2 mm/year | 0.2 mm/year | 0.4 mm/year | 0.5 mm/year
Domestic water use (surface and groundwater) | 7.4 mm/year | 0% | 0.5%/year | 2.5%/year
Industrial water use (groundwater) | 3 mm/year | 0% | 4%/year | 8%/year
Irrigation water use | 39 mm/year | 0%/year | −40% | 15%
Domestic water supply investments | Sufficient | Sufficient | Sufficient | Insufficient
Change in crop pattern | Cotton, grape, maize | Cotton, grape, maize | Grape, vegetable, maize | Cotton, grape
Irrigation m/o investments | Insufficient | Insufficient | Sufficient | Insufficient
Loss rate in irrigation system | 30% | 30% | 10% | 30%
Irrigated area | 1070 km² | 0% | 0% | 0%
Industrial water use (surface water) | 0 mm | 0 mm | 4 mm | 4 mm
Surface water quality | Class IV | Class IV | Class III | Class IV
Water exploitation awareness | Insufficient awareness | Insufficient awareness | Comprehensive awareness | Insufficient awareness

Table 1.13: WRM simulation results for the generated scenarios.

Summary of main scenario assumptions – Baseline: baseline scenario with observed values for 1991. Current BAU: crop pattern change expected; the time series (TS) of the year 1991 are used. Current optimistic: crop pattern change expected; the TS of the year 1982 are used. Current pessimistic: crop pattern change expected; the TS of the year 1982 are used.

Overall indicators | Baseline | Current BAU | Current optimistic | Current pessimistic
Supply/demand ratio (%) | 95.41 | 98.51 | 99.78 | 52.67
Global efficiency (%) | 89 | 89.6 | 90.1 | 90.3
Reliability (%) | 92.73 | 87.97 | 88.94 | 73.67
Total shortfall (%) | 3.2 | 1.68 | 0.27 | 134.96
Total unallocated (%) | 1.77 | 0.21 | 0.15 | 0.05
Flooding conditions (day) | 0 | 0 | 0 | 0

Sectoral water budget data (S/D ratio (%) / reliability (%)) | Baseline | Current BAU | Current optimistic | Current pessimistic
Domestic | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0
Agricultural | 94.66 / 87.5 | 97.96 / 71.16 | 99.16 / 71.37 | 52.03 / 54.73
Industrial | 100 / 100 | 100 / 100 | 100 / 100 | 100 / 100
Services | 0 / 0 | 0 / 0 | 0 / 0 | 0 / 0
Generic | 75.05 / 83.01 | 99 / 83.01 | 99 / 83.01 | 90 / 83.01
Total | 95.99 / 90.25 | 98.27 / 79.36 | 99.34 / 79.5 | 58.6 / 68.4

Global sectoral demand multipliers: not used in any scenario.

Groundwater budget (mm³/year; %) | Baseline | Current BAU | Current optimistic | Current pessimistic
Total input | 124.61 (100.00%) | 191.07 (100.00%) | 211.74 (100.00%) | 119.71 (100.00%)
Total output | 94.65 (75.96%) | 94.65 (49.54%) | 94.65 (44.70%) | 94.65 (79.07%)
Mass budget | 29.96 (24.04%) | 96.42 (50.46%) | 117.09 (55.30%) | 25.06 (20.93%)
Sustainable yield | 124.57 (99.97%) | 191.02 (99.98%) | 211.69 (99.98%) | 119.67 (99.96%)

Control node violations (total flow (mm³/year)) – Baseline: constraint 245.37, inflow 213.35; other scenarios: not regarded.

1.6 CONCLUDING REMARKS

It must be recognised today that there is a significant gap between the information needs on the environment and the information produced by current systems of data collection and management. Mankind has now developed the most sophisticated means of collecting, processing, storing and communicating data. There have been significant advances in instrumentation; in technologies for data collection, transmission, handling and archiving; in the development of new methods such as remote sensing and data transmission through satellites; and, finally, in the analysis and presentation of data, particularly in spatial dimensions, using computer-based models and GISs. However, data users and decision makers still suffer from poor information when they attempt to use the available data. Among the several data limitations and shortcomings of data management systems, one of the major reasons for this situation is that the basic requirements needed to ensure the accuracy and reliability of environmental data are often overlooked. Data validation in particular is poorly achieved. The problem is particularly significant in developing countries, where data validation is often hardly accomplished, and where the requirement that data should be managed within a system of activities is not fully recognised. In such countries, it is often the quantity, not the quality, of data that counts.

On the other hand, planners, managers, researchers and engineers should give due consideration to the recommendations expressed at international levels towards improved availability of information on the environment for better management. Examples include the Dublin statement of the International Conference on Water and Environment; Agenda 21 of UNCED; various workshops and meetings held by the World Meteorological Organisation (WMO), World Health Organisation (WHO), United Nations Educational, Scientific and Cultural Organisation (UNESCO), United Nations Environment Programme (UNEP), The World Bank, International Association of Hydrological Sciences (IAHS) and International Water Association (IWA); recent directives foreseen for the EU Community; and a number of international programmes, such as the EEA (European Environment Agency) work programme, the World Hydrological Cycle Observing System (WHYCOS) of WMO and the World Bank, and the Global Reference Information Database (GRID) and Global Environmental Monitoring System (GEMS) of UNEP, to name but a few.

In particular, Agenda 21 of UNCED has formulated a new outlook towards environmental management. In recognition of the environment as a continuum of air, soil and water resources, it was stated in Agenda 21 that the environment should be managed by an integrated approach. On the technical level, integrated management of the environment relies on two basic tools: data and modelling. The adoption of an integrated approach has also imposed new requirements on these tools. With respect to data, the problems that must be addressed today require interdisciplinary approaches and hence much more sharing of data and information than in the past. To this end, Agenda 21 has emphasised that the priority activities for environmental management should include the development of standard inter-calibrated procedures, measuring techniques, and data storage and management capabilities to ensure the production of sound information.

Another significant issue to be mentioned is that historical data should be used with caution, accounting for the following two factors.

1. The stationarity of most natural processes has become questionable owing to the recent problem of global climate change.
2. Methodologies and technologies in data sampling and analytical procedures have changed since the beginning of sampling by the existing networks.

In the above, attention should be drawn to the possibility of a climate change which is expected to impact all environmental processes in one way or another. Thus, its effect on collected data should be carefully studied since it can significantly change basic data characteristics such as homogeneity, consistency and stationarity. One of the most important requirements imposed on data collection strategies by the adoption of integrated approaches to environmental management is the need to collect widely varying types of data. Chapter 40 of Agenda 21 indicates that more and different types of data are to be collected on the status and trends of the ecosystem, natural resources, pollution and socio-economic variables at local, regional, national and international levels. To complete the picture, data collection on water uses should also be emphasised and encouraged since man’s impact on environmental regimes is becoming increasingly dominant. With respect to the production of information from available data, Agenda 21 also emphasises that national and international information data centres should establish continuous and accurate data collection systems. They should also use such new techniques as GISs, expert systems and models for analysis and assessment data. These requirements will be more challenging in the future when large amounts of satellite data will have to be processed and validated.


REFERENCES

Alpaslan, N., Harmancioglu, N.B. and Ozkul, S. (1993). Risk factors in assessment of compliance with standards. Proceedings of IAWPRC 1993 Conference on Risk, Risk Analysis Procedures and Epidemiological Confirmation, Los Angeles, California, August 1993, p. 13.
Beck, M.B. and Finney, B.A. (1987). Operational water quality management: problem context and evaluation of a model for river quality. Water Resources Research. 23, 11: 2030–2042.
Berryman, D., Bobee, B., Cluis, D. and Haemmerli, J. (1988). Nonparametric tests for trend detection in water quality time series. AWRA, Water Resources Bulletin. 24, 3: 545–556.
Cetinkaya, C.P. and Harmancioglu, N.B. (2006). Optimization methods applied to river basin management. Proceedings of the 5th International Conference on Engineering Computational Technology (Topping, B.H.V., Montero, G. and Montenegro, R. (Eds)), Las Palmas de Gran Canaria, 12–15 September 2006. Civil-Comp Press.
Cetinkaya, C.P. and Harmancioglu, N. (2008). Optimization of water resources management schemes. Proceedings of the International Conference on Fluvial Hydraulics, River Flow 2008, vol. 3, pp. 2313–2322.
Cetinkaya, C.P., Fistikoglu, O., Fedra, K. and Harmancioglu, N. (2008). Optimization methods applied for sustainable management of water-scarce basins. Journal of Hydroinformatics. 10, 1: 69–95.
Chapman, D. (Ed.) (1992). Water Quality Assessments – A Guide to the Use of Biota, Sediments and Water in Environmental Engineering. Chapman and Hall, London.
Clark, M.J. and Gardiner, J. (1994). Strategies for handling uncertainty in integrated river basin planning. In: Kirby, C. and White, W.R. (Eds), Integrated River Basin Development. John Wiley & Sons, Part VIII, pp. 437–445.
Crabtree, R.W., Cluckie, I.D. and Foster, C.F. (1987). Percentile estimation for water quality data. Water Research. 21, 5: 583–590.
Dendrou, S.A. and Delleur, J.W. (1979). Reliability concepts in planning storm drainage systems. In: McBean, E.A., Hipel, K.W. and Unny, T.E. (Eds), Reliability in Water Resources Management. Water Resources Publications, Fort Collins, pp. 295–321.
Fedra, K. (1997). Integrated environmental information systems: from data to information. In: Harmancioglu, N., Alpaslan, M., Ozkul, S. and Singh, V.P. (Eds), Integrated Approach to Environmental Data Management Systems. Kluwer Academic Publishers, NATO ASI Series, 2. Environment, vol. 31, pp. 367–378.
Fedra, K. and Harmancioglu, N.B. (2005). A web-based water resources simulation and optimization system. Proceedings of CCWI 2005 on Water Management for the 21st Century (Savic, D., Walters, G., King, R. and Khu, A-T. (Eds)), Centre of Water Systems, University of Exeter, vol. II, pp. 167–172.
Fisher, R.A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society, Series A. 222: 309–368.
Harmancioglu, N.B. (2001). Sustainable resource management in water-short basins. Proceedings of ASCE/EWRI World Water and Environmental Resources Congress 2001, Orlando, Florida, 20–24 May 2001.
Harmancioglu, N.B. (2003). Integrated data management: where are we headed? In: Harmancioglu, N.B., Ozkul, S.D., Fistikoglu, O. and Geerders, P. (Eds), Proceedings of the NATO ARW on Integrated Technologies for Environmental Monitoring and Information Production. Kluwer, NATO Science Series, IV. Earth and Environmental Sciences, vol. 23, pp. 3–16.
Harmancioglu, N.B. (2004). Sustainability indicators in water resources management (in Turkish). Proceedings of the 4th National Hydrology Congress, Istanbul, June 2004, pp. 9–18.
Harmancioglu, N.B. (2006). Integrated information base for sustainable water resources management. Proceedings of the NATO ARW on Integration of Information for Environmental Security (Coskun, G., Cigizoglu, K. and Maktav, D. (Eds)), Istanbul, September 2006. Springer, pp. 407–442.
Harmancioglu, N.B. (2007). Informed decision making for water resources management. Proceedings of DSI and WWC International Congress on River Basin Management, Antalya, Turkey, March 2007.
Harmancioglu, N.B. and Alpaslan, N. (1992). Risk factors in water quality assessment. Proceedings of the AWRA 28th Annual Symposium on Managing Water Resources During Global Change, Reno, Nevada, 1–5 November 1992, Symposium Session S-XC, pp. 299–308.
Harmancioglu, N.B. and Ozkul, S.D. (2006). Assessment of information production by streamflow data at varying time/space scales. European Geophysical Union General Assembly, 2–7 April 2006, Geophysical Research Abstracts, vol. 8, 10481.
Harmancioglu, N.B. and Singh, V.P. (1998). Entropy in environmental and water resources. In: Herschy, R.W. and Fairbridge, R.W. (Eds), Encyclopedia of Hydrology and Water Resources. Kluwer Academic Publishers, Dordrecht, pp. 225–241.
Harmancioglu, N.B. and Singh, V.P. (2002). Data accuracy and data validation. In: Encyclopedia of Life Support Systems (EOLSS), Knowledge for Sustainable Development, Vol. II, Theme 11 on 'Environmental and Ecological Sciences and Resources', Ch. 11.5 on 'Environmental Systems' (Sydow, A. (Ed.)), pp. 781–798.
Harmancioglu, N.B., Alpaslan, N. and Singh, V.P. (1992). Design of water quality monitoring networks. In: Chowdhury, R.N. (Ed.), Geomechanics and Water Engineering in Environmental Management. Taylor & Francis, ch. 8, pp. 267–296.
Harmancioglu, N.B., Alpaslan, N. and Ozkul, S. (1993). Quantification of risk components in water quality assessment and management. Proceedings of IAWPRC 1993 Conference on Risk, Risk Analysis Procedures and Epidemiological Confirmation, Los Angeles, California, August 1993.
Harmancioglu, N.B., Alpaslan, M.N. and Ozkul, S.D. (1997a). Conclusions and recommendations. In: Harmancioglu, N.B., Alpaslan, M.N., Ozkul, S.D. and Singh, V.P. (Eds), Proceedings of the NATO ARW on Integrated Approach to Environmental Data Management Systems, September 1996. Kluwer, NATO ASI Series, 2. Environment, vol. 31, part 9, pp. 423–436.
Harmancioglu, N.B., Alpaslan, M.N., Ozkul, S.D. and Singh, V.P. (Eds) (1997b). Proceedings of the NATO ARW on Integrated Approach to Environmental Data Management Systems, 16–20 September 1996. Kluwer Academic Publishers, NATO ASI Series, 2. Environment, vol. 31.
Harmancioglu, N.B., Singh, V.P. and Alpaslan, N. (Eds) (1998). Environmental Data Management. Kluwer Academic Publishers, Water Science and Technology Library, vol. 27.
Harmancioglu, N.B., Fistikoglu, O., Ozkul, S.D., Singh, V.P. and Alpaslan, N. (1999). Water Quality Monitoring Network Design. Kluwer, Water Science and Technology Library, vol. 33.
Harmancioglu, N.B., Ozkul, S.D., Fistikoglu, O. and Geerders, P. (Eds) (2003). Proceedings of the NATO ARW on Integrated Technologies for Environmental Monitoring and Information Production, 10–14 September 2001. Kluwer Academic Publishers, NATO Science Series, IV. Earth and Environmental Sciences, vol. 23.
Harmancioglu, N.B., Cetinkaya, C.P. and Geerders, P. (2004a). Transfer of information among water quality monitoring sites: assessment by an optimization method. Proceedings of EnviroInfo Conference 2004, 18th International Conference Informatics for Environmental Protection, Track 1: Sharing Environmental Knowledge, Session PS-11.
Harmancioglu, N.B., Geerders, P., Fistikoglu, O. and Ozkul, S. (2004b). The need for integration in environmental data management. Proceedings of the EWRA Symposium on Water Resources Management: Risks and Challenges for the 21st Century, vol. 1, pp. 707–712.
Harmancioglu, N.B., Fedra, K. and Barbaros, F. (2008). Analysis for sustainability in management of water scarce basins: the case of the Gediz River Basin in Turkey. Desalination. 226: 175–182.
Hipel, K.W. and McLeod, A.I. (1994). Time Series Modeling of Water Resources and Environmental Systems. Elsevier, Amsterdam, Developments in Water Science, no. 45.
Hipel, K.W., Lennox, W.C., Unny, T.E. and McLeod, A.I. (1975). Intervention analysis in water resources. Water Resources Research. 11, 6: 855–861.
Hipel, K.W., McLeod, A.I. and Weiler, R.R. (1988). Data analysis of water quality time series in Lake Erie. Water Resources Bulletin. 24, 3: 533–544.
Hirsch, R.M. (1988). Statistical methods and sampling design for estimating step trends in surface water quality. Water Resources Bulletin, AWRA. 24, 3: 493–503.
Hirsch, R.M. and Slack, J.R. (1984). A nonparametric trend test for seasonal data with serial dependence. Water Resources Research. 20, 6: 727–732.
Hirsch, R.M., Slack, J.R. and Smith, R.A. (1982). Techniques of trend analysis for monthly water quality data. Water Resources Research. 18, 1: 107–121.
Hirsch, R.M., Alexander, R.B. and Smith, R.A. (1991). Selection of methods for the detection and estimation of trends in water quality. Water Resources Research. 27, 5: 803–813.
Hughes, J.P. and Millard, P.S. (1988). A tau-like test for trend in the presence of multiple censoring points. AWRA, Water Resources Bulletin. 24, 3: 521–531.
Icaga, Y. (1994). Analysis of Trends in Water Quality Using Nonparametric Methods. PhD thesis in Civil Engineering, Dokuz Eylul University Graduate School of Natural and Applied Sciences, Izmir (advisor: N. Harmancioglu).
IWMI (International Water Management Institute) (1998). Research Program on Institutional Support Systems for Sustainable Management of Irrigation in Water-Short Basins. Revised description of a project being carried out by IWMI with the support of the German Federal Ministry for Economic Cooperation and Development (BMZ).
Lettenmaier, D.P. (1976). Detection of trends in water quality data from records with dependent observations. Water Resources Research. 12, 5: 1037–1046.
Lettenmaier, D.P. (1988). Multivariate nonparametric tests for trend in water quality. AWRA, Water Resources Bulletin. 24, 3: 505–512.
Lettenmaier, D.P., Hooper, E.R., Wagoner, C. and Faris, K.B. (1991). Trends in stream quality in the continental United States, 1978–1987. Water Resources Research. 27, 3: 327–339.
Loftis, J.C. and Ward, R.C. (1981). Evaluating stream standard violations using a water quality data base. Water Resources Bulletin. 17, 6: 1071–1078.
Matalas, N.C. and Langbein, W.B. (1962). Information content of the mean. Journal of Geophysical Research. 67: 3441–3448.
Matalas, N.C., Slack, J.R. and Wallis, J.R. (1975). Regional skew in search of a parent. Water Resources Research. 11: 815–826.
McLeod, A.I., Hipel, K.W. and Comancho, F. (1983). Trend assessment of water quality time series. AWRA, Water Resources Bulletin. 19, 4: 537–547.
Montgomery, R.H. and Reckhow, K.H. (1984). Techniques for detecting trends in lake water quality. AWRA, Water Resources Bulletin. 20, 1: 43–52.
Moss, M.E. (1970). Optimum operating procedure for a river gaging station established to provide data for design of a water supply project. Water Resources Research. 6: 1051–1061.
Moss, M.E. and Gilroy, E.J. (1980). Cost-effective streamgaging strategies for the Lower Colorado River Basin: the Blythe field office operations. US Geological Survey Open-File Report 80-1048.
Moss, M.E., Thomas, W.O. and Gilroy, E.J. (1985). The evaluation of hydrological data networks. In: Rodda, R. (Ed.), Facets of Hydrology II. John Wiley, Chichester, pp. 291–310.
Mulder, W.H. (1994). Water quality monitoring, forecasting and control. In: Advances in Water Quality Monitoring, Report of a WMO Regional Workshop, Vienna. World Meteorological Organization, Geneva, Switzerland, Technical Reports in Hydrology and Water Resources, no. 42, WMO/TD-no. 612, pp. 130–137.
NATO LG (1997). Report of the Linkage Grant Project on Assessment of Water Quality Monitoring Networks – Design and Redesign, supported by NATO Scientific Affairs Division, Project Code ENVIR.LG.950779, September 1995–September 1997 (this project was realised through the collaboration of six countries: USA, Russia, Hungary, Canada, Italy and Turkey).
Ozkul, S.D., Harmancioglu, N.B. and Singh, V.P. (2000). Entropy-based assessment of water quality monitoring networks. ASCE, Journal of Hydrologic Engineering. 5, 1: 90–100.
Sanders, T.G. (1988). Water quality monitoring networks. In: Stephenson, D. (Ed.), Water and Wastewater System Analyses. Elsevier, Developments in Water Science, no. 34, ch. 13, pp. 204–216.
Sanders, T.G. and Adrian, D.D. (1978). Sampling frequency for river quality monitoring. Water Resources Research. 14: 569–576.
Sanders, T.G., Ward, R.C., Loftis, J.C., Steele, T.D., Adrian, D.D. and Yevjevich, V. (1983). Design of Networks for Monitoring Water Quality. Water Resources Publications, Littleton, Colorado.
Schilperoort, T., Groot, S., Watering, B.G.M. and Dijkman, F. (1982). Optimization of the Sampling Frequency of Water Quality Monitoring Networks. 'Waterloopkundig' Laboratorium Delft, Hydraulics Laboratory, Delft, The Netherlands.
Shannon, C.E. (1948). A mathematical theory of communication. Bell System Technical Journal. 27: 379–423.
Singh, V.P. and Harmancioglu, N.B. (1997). Estimation of missing values with use of entropy. In: Harmancioglu, N.B., Alpaslan, M.N., Ozkul, S.D. and Singh, V.P. (Eds), Integrated Approach to Environmental Data Management Systems. Kluwer Academic Publishers, NATO ASI Series, 2. Environment, vol. 31, pp. 267–275.
UNCED (1992). Agenda 21: Programme of Action for Sustainable Development. United Nations, New York.
UNEP (2000). 130 indicators for sustainable development in the Mediterranean region. MAP, presentation note and recommendations formulated by the Mediterranean Commission on Sustainable Development and adopted by the Contracting Parties to the Barcelona Convention (Malta, 1999), Athens, Sophia Antipolis.
Van Belle, G. and Hughes, J.P. (1984). Nonparametric tests for trend in water quality. Water Resources Research. 20, 1: 127–136.
Walpole, R.E. and Myers, R.H. (1990). Probability and Statistics for Engineers and Scientists. Macmillan Publishing Company, New York.
Ward, R.C., Loftis, J.C. and McBride, G.B. (1986). The data-rich but information-poor syndrome in water quality monitoring. Environmental Management. 10: 291–297.
Warn, A.E. (1988). Auditing the quality of effluent discharges. In: Workshop on Statistical Methods for the Assessment of Point Source Pollution, 12–14 September, Canada Centre for Inland Waters, Burlington, Ontario, Canada.

CHAPTER 2

Application of Statistics in Earthquake Hazard Prediction

Endi Zhai

2.1 INTRODUCTION

Earthquakes, which are normally recurring phenomena of the earth's crust, have in the past caused considerable loss of property and life, and they are among the major environmental concerns for human beings. Scientists have never predicted a major earthquake. They do not know how, and they do not expect to know how any time in the foreseeable future. However, based on historical earthquakes and statistical data, probabilities can be calculated for potential future earthquakes. Virtually every important decision regarding the evaluation of earthquake effects on people and man-made facilities is made using some form of probabilistic earthquake hazard or earthquake risk analysis based on statistics of earthquake recurrence. The current earthquake design codes used by various countries and agencies adopt design ground motions that are based on probabilistic earthquake hazard analysis. Earthquake hazard analysis involves the quantitative estimation of ground-shaking hazards at a particular site. Earthquake hazards may be analysed probabilistically, when uncertainties in earthquake size, location and time of occurrence are explicitly considered.

To help understand some of the concepts used in probabilistic seismic hazard analysis, it is useful to think first of probabilistic hazard analysis as a statistical evaluation of the ground motions at a site from an artificial catalogue of future earthquakes. This model could be used to predict the occurrence of earthquakes during the next 1 000 000 years, for example. For each earthquake in this large catalogue, the ground motion at the site is predicted using attenuation relations. Since there is variability in the attenuation (the scatter in ground motion data, represented by the standard deviation of the attenuation relation), sometimes the ground motion will be high and sometimes it will be low, even for the same magnitude and distance to the site. In the end, we have a large artificial data set of peak ground motions that occur at the site in the next 1 000 000 years. We then compute how often different levels of ground motion occur in this artificial data set. For example, if a peak acceleration of 0.2g occurs 10 000 times in our 1 000 000 year data set, then 0.2g occurs about every 100 years, where g is acceleration due to gravity (32.2 feet/s²). Similarly, if 1.0g occurs only 500 times in our 1 000 000 year data set, then 1.0g occurs about every 2000 years. If we plot these return periods, then we build up a hazard curve. The shape of the hazard curve results from the random variations in the locations and magnitudes of the earthquakes in our artificial catalogue and from the random variations in the ground motion attenuation.

Of course, the models and assumptions that we used to build up the artificial data set may be wrong. If we use a different seismic source characterisation model or a different ground motion attenuation relation, we would arrive at a different catalogue of ground motions and potentially different return periods for the various levels of shaking. If we were to repeat the process for a different model, we would arrive at a different hazard curve. Therefore, different models of the seismic sources and attenuation relations lead to alternative hazard curves, whereas the random variations within a single model determine the shape of an individual hazard curve. The term 'aleatory uncertainty' is used to describe the random variations which lead to the shape of an individual hazard curve. The term 'epistemic uncertainty' is used to describe the scientific uncertainty in our model of the occurrence of earthquakes and ground motion, which leads to alternative hazard curves.

The example of using the seismic source characterisation to develop a 1 000 000 year catalogue is useful for an initial conceptual understanding of seismic hazard analysis; however, it is not what we are doing. That is, we are not trying to predict what will happen over the next 1 000 000 years, but rather we are estimating the probabilities that a level of ground motion will occur in the next year (or 100 years). If the earthquake process is stationary (i.e. earthquake rates do not change with time), then the two approaches give the same answer. There is often a misunderstanding of hazard analysis for low annual probabilities. For example, if we are discussing the 10⁻⁴ annual probability ground motion level (return period of 10 000 years), someone may comment that we do not know what will happen 10 000 years from now. However, a 10⁻⁴ annual probability is just that: the 1 in 10 000 chance that the ground motion occurs next year. It does not imply that the models are applicable to the next 10 000 years.

This chapter presents a mathematical formulation of probabilistic seismic hazard analysis and provides a design example of how to use statistical earthquake data to characterise earthquake sources and predict the earthquake hazard level within the design lifetime of a building structure at a site in southern California, United States of America.
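The artificial-catalogue idea can be made concrete with a short simulation. The following sketch is illustrative only – the event rate, magnitude range, attenuation coefficients and scatter are invented for the example and are not taken from any published relation – but it shows how counting exceedances in a long synthetic catalogue yields return periods:

```python
import numpy as np

rng = np.random.default_rng(42)

YEARS = 1_000_000           # length of the artificial catalogue
RATE = 0.05                 # assumed events/year above magnitude 5 (illustrative)
n_events = rng.poisson(RATE * YEARS)

# Truncated exponential (Gutenberg-Richter) magnitudes with b = 1, 5 <= M <= 7.5
beta = 1.0 * np.log(10.0)
m_lo, m_hi = 5.0, 7.5
u = rng.random(n_events)
mags = m_lo - np.log(1.0 - u * (1.0 - np.exp(-beta * (m_hi - m_lo)))) / beta

# Random distances and a toy attenuation relation with lognormal scatter
dists = rng.uniform(10.0, 100.0, n_events)                # km
ln_median = -4.0 + 1.0 * mags - 1.3 * np.log(dists)       # invented coefficients
pga = np.exp(ln_median + rng.normal(0.0, 0.6, n_events))  # peak acceleration, g

# Hazard curve: count exceedances of each test level, convert to return periods
for a in (0.05, 0.1, 0.2, 0.4):
    n_exc = int(np.count_nonzero(pga > a))
    if n_exc == 0:
        print(f"PGA > {a:.2f}g: no exceedances in the catalogue")
        continue
    rate = n_exc / YEARS
    print(f"PGA > {a:.2f}g: annual rate {rate:.2e}, "
          f"return period {1 / rate:,.0f} years")
```

Plotting the printed rates against the test levels traces out a hazard curve of the kind described above.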

2.2 MATHEMATICAL FORMULATION

The probabilistic seismic hazard analysis follows the standard approach first developed by Cornell (1968). The main change from the original work is that more parameters are randomised (a more complete description of the aleatory variables) and epistemic uncertainty is considered. In particular, the aleatory variability in the ground motion was not considered in the original work. The ground motion aleatory variability has a large effect on the hazard and cannot be ignored. The basic methodology involves computing how often a specified level of ground motion will be exceeded at the site. The hazard analysis computes the annual number of events that produce a ground motion parameter, A, that exceeds a specified level, a. This number of events per year, ν, is also called the 'annual frequency of exceedance'. The inverse of ν is called the 'return period'. The calculation of the annual frequency of exceedance, ν, involves the rate of earthquakes of various magnitudes, the rupture dimension of the earthquakes, the location of the earthquakes relative to the site, and the attenuation of the ground motion from the earthquake rupture to the site. The annual rate of events from the ith source that produce ground motions that exceed a at the site is the product of the probability that the ground motion exceeds the test value given that an earthquake has occurred on the ith source and the annual rate of events with magnitude greater than m_min on the ith source:

$$\nu_i(A > a) = N_i(m_{\min})\, P_i[A > a \mid E_i(m \geq m_{\min})] \qquad (2.1)$$

where N_i(m_min) is the annual number of events with magnitude greater than m_min on the ith source and E_i(m_min) indicates that an event with magnitude ≥ m_min has occurred on the ith source. For multiple seismic sources, the total annual rate of events with ground motions that exceed a at the site is just the sum of the annual rates of events from the individual sources (assuming that the sources are independent):

$$\nu(A > a) = \sum_{i=1}^{N_{\mathrm{source}}} \nu_i(A > a) \qquad (2.2)$$

where ν and a are as previously defined.

2.2.1 Hazard for fault sources

Fault sources are modelled by multiple planes, which allow the strike of the fault to be changed. For planar sources (e.g. known faults), we need to consider the finite dimension and location of the rupture in order to compute the closest distance. Specifically, we need to randomise the rupture length, rupture width, rupture location along strike, rupture location down dip, and hypocentre location along the rupture length (for strike-slip faults). (Since rupture width and length are correlated, it is easier to consider the rupture area and rupture width and then back-calculate the rupture length.) The general form of the conditional probability for the ith fault is given by

$$P_i[A > a \mid E_i(m \geq m_{\min})] = \int_{RA=0}^{\infty}\int_{RW=0}^{\infty}\int_{Ex=0}^{1}\int_{Ey=0}^{1}\int_{x=0}^{1}\int_{m=M_{\min}}^{M_{\max}^{i}} f_{m_i}(m)\, f_{RA_i}(m)\, f_{RW_i}(m)\, f_{Ex_i}(Ex)\, f_{Ey_i}(Ey)\, f_{x_i}(x) \times P_i\big(A > a \mid m,\, r(Ex, Ey, RA, RW),\, x\big)\, \mathrm{d}m\, \mathrm{d}x\, \mathrm{d}Ey\, \mathrm{d}Ex\, \mathrm{d}RW\, \mathrm{d}RA \qquad (2.3)$$


where f_RW(m), f_RA(m), f_Ex(Ex), f_Ey(Ey), f_x(x), and f_m(m) are probability density functions for the rupture width, rupture area, rupture location along strike, rupture location down dip, hypocentre location in the rupture plane, and magnitude, respectively. The models used for these probability density functions are described later. For the fault normal component (FN), the probability of exceeding the ground motion a for a given magnitude, m, closest distance, r, and hypocentre location, x, is given by

$$P(A > a \mid m, r, x) = 1 - \Phi\left(\frac{\ln(a) - \ln[Sa_{FN}(m, r, x)]}{\sigma(m)}\right) \qquad (2.4)$$

where Sa_FN(m, r, x) and σ(m) are the median and standard deviation of the ground motion from the attenuation relations for the fault normal component, and Φ(·) is the normal probability integral given by

$$\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-u^2/2}\, \mathrm{d}u \qquad (2.5)$$

where z is a random variable.
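Equation 2.4 is simply a normal tail probability in logarithmic space. A minimal sketch of its evaluation follows; the median and standard deviation used are assumed placeholder values, not taken from any attenuation relation:

```python
from math import erf, log, sqrt

def phi(z: float) -> float:
    """Standard normal cumulative distribution function (Equation 2.5)."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def p_exceed(a: float, sa_median: float, sigma_ln: float) -> float:
    """Equation 2.4: P(A > a | m, r, x) for lognormally scattered ground motion."""
    return 1.0 - phi((log(a) - log(sa_median)) / sigma_ln)

# Assumed values: median Sa of 0.15g with sigma = 0.6 in ln units
print(f"P(A > 0.2g) = {p_exceed(0.2, 0.15, 0.6):.3f}")
```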

2.2.2 Probability of exceedance

The annual rate of events given in Equation 2.2 is not a probability; it can exceed 1. To convert the annual rate of events to a probability, we consider the probability that the ground motion exceeds test level a at least once during a specified time interval. At this step, a common assumption is that the occurrence of earthquakes is a Poisson process. That is, there is no memory of past earthquakes, so the chance of an earthquake occurring in a given year does not depend on how long it has been since the last earthquake. (Non-Poisson models are discussed later.) If the occurrence of earthquakes is a Poisson process, then the occurrence of peak ground motions is also a Poisson process. For a Poisson process, the probability of an event (e.g. ground motion exceeding a) occurring n times in time interval t is given by

$$P_n(t) = \frac{e^{-\nu t}(\nu t)^n}{n!} \qquad (2.6)$$

The probability that at least one event occurs (i.e. n ≥ 1) is 1 minus the probability that no events occur:

$$P(n \geq 1, t) = 1 - P_0(t) = 1 - e^{-\nu t} \qquad (2.7)$$

So the probability of at least one occurrence of ground motion level a in t years is given by

$$P(A > a, t) = 1 - \exp[-\nu(A > a)\, t] \qquad (2.8)$$

For t = 1 year, this probability is the annual hazard.
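Equations 2.7 and 2.8 connect the annual rate, the exposure time and the exceedance probability. The short sketch below inverts Equation 2.8 to recover the widely quoted result that a 10% probability of exceedance in 50 years corresponds to a return period of about 475 years, a figure used again in the example of Section 2.4:

```python
from math import exp, log

def p_at_least_one(nu: float, t: float) -> float:
    """Equation 2.8: P(A > a, t) = 1 - exp(-nu * t)."""
    return 1.0 - exp(-nu * t)

def rate_from_p(p: float, t: float) -> float:
    """Equation 2.8 inverted: nu = -ln(1 - p) / t."""
    return -log(1.0 - p) / t

nu = rate_from_p(0.10, 50.0)    # 10% probability of exceedance in 50 years
print(f"annual rate   = {nu:.6f} per year")
print(f"return period = {1.0 / nu:.0f} years")           # about 475 years
print(f"check: P in 50 years = {p_at_least_one(nu, 50.0):.3f}")
```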

2.2.3 Aleatory and epistemic uncertainty

The basic part of the hazard calculation is computing the integrals in Equation 2.3. All of the aleatory variables are inside the hazard integral. The randomness of the seismic source variables is characterised by the probability density functions, which are discussed below. The randomness of the attenuation relation is accounted for in the probability of exceeding the ground motion a for a given magnitude and closest distance. Epistemic (scientific) uncertainty is considered by using alternative models and/or parameter values for the probability density functions, attenuation relation, and activity rate. For each alternative model, we recalculate the hazard and compute alternative hazard curves. Epistemic uncertainty is typically handled using a logic tree approach for specifying the alternative models for the density function, attenuation relation, and activity rates.
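A minimal sketch of the logic tree idea follows. The two alternative hazard curves and their branch weights are invented for illustration; in practice each branch would correspond to an alternative source model or attenuation relation, with weights reflecting judged credibility:

```python
import numpy as np

# Test ground motion levels (g) and two hypothetical alternative hazard curves
# (annual exceedance rates), e.g. from two candidate attenuation relations
levels = np.array([0.05, 0.1, 0.2, 0.4])
curve_model_1 = np.array([2e-2, 6e-3, 1e-3, 1e-4])
curve_model_2 = np.array([3e-2, 1e-2, 2e-3, 3e-4])

# Logic tree branch weights (must sum to 1)
w1, w2 = 0.6, 0.4

# The weighted mean hazard curve combines the epistemic alternatives
mean_curve = w1 * curve_model_1 + w2 * curve_model_2
for a, rate in zip(levels, mean_curve):
    print(f"PGA > {a:.2f}g: weighted mean annual rate {rate:.2e}")
```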

2.2.4 Activity rate

There are two approaches to estimating the fault activity rate: historical seismicity and geological (and geodetic) information. If historical seismicity catalogues are used to estimate the activity rate, then the estimate of N(m_L) is usually based on fitting the truncated exponential model (discussed below) to the historical data. Maximum likelihood procedures are generally preferred over the least-squares method for estimating the activity rate and the b-value. When using geological information on slip-rates of faults, the activity rate is computed by balancing the energy build-up estimated from geological evidence with the total energy release of earthquakes. Knowing the dimension of the fault, the slip-rate, and the rigidity of the fault, we can balance the long-term seismic moment so that the fault is in equilibrium (e.g. Youngs and Coppersmith, 1985b). The seismic energy release is balanced by requiring the build-up of seismic moment to be equal to the release of seismic moment in earthquakes. The build-up of seismic moment is computed from the long-term slip-rate. The seismic moment, M₀ (in dyne cm), is given by

$$M_0 = \mu A D \qquad (2.9)$$

where μ is the rigidity of the crust, A is the area of the fault (in cm²), and D is the average displacement (slip) on the fault surface (in cm). The annual rate of build-up of seismic moment is given by

$$\dot{M}_0 = \mu A S \qquad (2.10)$$

where S is the slip-rate in cm/year. The seismic moment released during an earthquake is given by

$$\log_{10} M_0 = 1.5M + 16.05 \qquad (2.11)$$

where M is the moment magnitude of the earthquake. To balance the moment build-up and the moment release, the annual moment rate from the slip-rate is set equal to the sum of the moment released in all of the earthquakes that are expected to occur each year:

$$\mu A S = N(m_L) \int_{m=m_L}^{m_U} f_m(m)\, 10^{1.5m+16.05}\, \mathrm{d}m \qquad (2.12)$$

Given the slip-rate, fault area, and magnitude density function, the activity rate N(m_L) is given by

$$N(m_L) = \frac{\mu A S}{\displaystyle\int_{m=m_L}^{m_U} f_m(m)\, 10^{1.5m+16.05}\, \mathrm{d}m} \qquad (2.13)$$

2.2.5 Magnitude density distribution

The magnitude density distribution describes the relative number of large-magnitude and moderate-magnitude events that occur on the seismic source. Two alternative magnitude density functions are considered: the truncated exponential model and the characteristic model. The truncated exponential model is the standard Gutenberg–Richter model that is truncated at the minimum and maximum magnitudes and renormalised so that it integrates to unity. The density function for the truncated exponential model is given by

$$f_m(m) = \frac{\beta \exp[-\beta(m - m_L)]}{1 - \exp[-\beta(m_U - m_L)]} \qquad (2.14)$$

where β is ln(10) times the b-value. Regional estimates of the b-value are usually used with this model. The characteristic model assumes that more of the seismic energy is released in large-magnitude events than for the truncated exponential model. That is, there are fewer small-magnitude events for every large-magnitude event for the characteristic model than for the truncated exponential model. There are different models for the characteristic model. Two commonly used models are the characteristic model as defined by Youngs and Coppersmith (1985a) and the 'maximum magnitude' characteristic model. In this chapter, we will call these two models the characteristic model and maximum magnitude model, respectively. The density function for the generalised form of the Youngs and Coppersmith characteristic model is given by

$$f_m(m) = \frac{\beta \exp[-\beta(m - m_L)]}{1 - \exp[-\beta(m_U - \Delta m_2 - m_L)]}\,\frac{1}{1 + c} \quad \text{for } m < m_U - \Delta m_2$$

$$f_m(m) = \frac{\beta \exp[-\beta(m_U - \Delta m_1 - \Delta m_2 - m_L)]}{1 - \exp[-\beta(m_U - \Delta m_2 - m_L)]}\,\frac{1}{1 + c} \quad \text{for } m \geq m_U - \Delta m_2 \qquad (2.15)$$

where

$$c = \frac{\beta\, \Delta m_2 \exp[-\beta(m_U - \Delta m_1 - \Delta m_2 - m_L)]}{1 - \exp[-\beta(m_U - \Delta m_2 - m_L)]} \qquad (2.16)$$

In the Youngs and Coppersmith model, Δm₁ = 1.0 and Δm₂ = 0.5. The density functions themselves are similar at small magnitudes. However, when the geologic moment rate is used to set the annual rate of events, N(m_L), then there is a large impact on N(m_L) depending on the selection of the magnitude density function. The characteristic model has many fewer moderate-magnitude events than the truncated exponential model (about a factor of 10 in difference). Recent studies have found that the characteristic model does a better job of matching observed seismicity than the truncated exponential model when the total moment rate is constrained by the geological slip-rate.
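Equations 2.13 and 2.14 can be combined in a few lines of code. In the sketch below, the fault parameters (rigidity, fault area, slip-rate, b-value and magnitude range) are assumed for illustration only, in the CGS units of Equations 2.9 to 2.13:

```python
import numpy as np

def f_trunc_exp(m, m_lo, m_hi, b=1.0):
    """Truncated exponential magnitude density (Equation 2.14)."""
    beta = b * np.log(10.0)
    return beta * np.exp(-beta * (m - m_lo)) / (1.0 - np.exp(-beta * (m_hi - m_lo)))

# Assumed fault parameters (illustrative only)
mu = 3.0e11              # crustal rigidity, dyne/cm^2
area = 5.0e6 * 1.5e6     # 50 km x 15 km fault plane, in cm^2
slip_rate = 0.5          # long-term slip-rate, cm/year
m_lo, m_hi = 5.0, 7.0

# Equation 2.13: activity rate from the moment balance (numerical integral)
m = np.linspace(m_lo, m_hi, 4001)
dm = m[1] - m[0]
mean_moment = np.sum(f_trunc_exp(m, m_lo, m_hi) * 10.0 ** (1.5 * m + 16.05)) * dm
n_mlo = mu * area * slip_rate / mean_moment
print(f"N(m_L) = {n_mlo:.4f} events/year with M >= {m_lo}")
```

Substituting the characteristic density of Equation 2.15 for f_trunc_exp would show the large drop in N(m_L) described above.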

2.2.6 Rupture dimension density functions

For the rupture area and rupture width, the density function is determined from regression models which give the rupture area and rupture width as a function of magnitude. Wells and Coppersmith (1994) developed empirical models for rupture area and rupture width as follows:

$$\log_{10}(RA) = -3.49 + 0.91M \pm 0.24 \qquad (2.17)$$

$$\log_{10}(RW) = -1.01 + 0.32M \pm 0.15 \qquad (2.18)$$

The density functions f_RA(m) and f_RW(m) are log-normal distributions centred about the median values given by Equations 2.17 and 2.18, with the standard deviations shown. These distributions are truncated at ±2σ in the hazard calculations.
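A minimal sketch of how these density functions might be sampled, using the medians of Equations 2.17 and 2.18 with the log-normal scatter truncated at two standard deviations (truncation done here by simple rejection):

```python
import numpy as np

rng = np.random.default_rng(0)

def std_normal_trunc(n, lim=2.0):
    """Standard normal deviates truncated at +/- lim, by rejection sampling."""
    out = rng.normal(0.0, 1.0, n)
    while np.any(np.abs(out) > lim):
        bad = np.abs(out) > lim
        out[bad] = rng.normal(0.0, 1.0, bad.sum())
    return out

def sample_rupture(m, n=5):
    """Rupture area (km^2) and width (km) about the Equation 2.17-2.18 medians."""
    ra = 10.0 ** (-3.49 + 0.91 * m + 0.24 * std_normal_trunc(n))
    rw = 10.0 ** (-1.01 + 0.32 * m + 0.15 * std_normal_trunc(n))
    return ra, rw, ra / rw   # rupture length back-calculated, as in the text

ra, rw, rl = sample_rupture(6.5)
print("area (km^2): ", np.round(ra, 1))
print("width (km):  ", np.round(rw, 1))
print("length (km): ", np.round(rl, 1))
```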

2.2.7 Rupture location density functions

The centre of the rupture location is parameterised in terms of the normalised fault length and fault width. Ex is the fraction of the fault length (measured along the strike) and Ey is the fraction of the fault width (measured down the dip). The location of the centre of the rupture plane is assumed to be uniformly distributed over the fault plane. The resulting density functions f_Ex(Ex) and f_Ey(Ey) are unity.

2.2.8 Hypocentre location density function

For a given rupture dimension (length and width) and rupture location, the location of the hypocentre along the strike is parameterised in terms of the normalised rupture length. The location of the hypocentre is assumed to be uniformly distributed over the rupture plane. The resulting density function f_x(x) is unity. In the hazard analysis, a total of ten hypocentre locations evenly spaced along the rupture length are used for each magnitude, rupture location and rupture dimension.

2.3 EARTHQUAKE INTENSITY ATTENUATION RELATIONS

An attenuation relation is an equation or a table that describes how earthquake ground motion decreases as the distance to the earthquake increases. Because earthquake ground motion increases with magnitude, the attenuation relation also depends on magnitude. Strong motion data (statistical data recorded from past earthquakes) and geophysical attenuation models are used to establish the attenuation relations. Strong motion data consist of recordings of those earthquake ground motions capable of damaging buildings or of weakening soils or embankments under or near buildings. Strong motion recorders are placed in areas likely to be shaken, in order to maximise the opportunity for gaining new information when earthquakes occur. From the statistical analysis of the recorded strong motion data, three important characteristics are derived:

• how much the ground motion increases at a given distance as the magnitude increases;
• how much the ground motion decreases at a given magnitude as the distance increases;
• how much the subsurface soil column amplifies the ground motion.

Attenuation relations present the results of analysing strong motion data by showing how large the ground motions are expected to be for a certain earthquake magnitude and a certain distance from the earthquake. Usually the attenuation relations are obtained by a statistical process called regression. Given a specified mathematical equation, regression determines parameters for that equation. In some cases, regression is used to determine the remaining parameters when the other parameters are given by geophysical attenuation models. Then, given a magnitude, a distance, and a geologic site condition, the equation gives the average value of the ground motion expected. For a future earthquake, the actual ground motion will not be that average value, but rather a value in some uncertainty range around that average value. The regression also gives an estimate of that uncertainty range. The adjustment for geological site condition is sometimes determined by regression, but is also sometimes determined by physical models of the soil column effect.
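The regression step can be illustrated with synthetic data. In the sketch below, the 'true' functional form, coefficients and scatter are invented; ordinary least squares recovers both the coefficients and an estimate of the uncertainty range (the residual standard deviation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 'strong motion' records from an invented true model:
#   ln PGA = c0 + c1*M - c2*ln(R) + eps,  eps ~ N(0, sigma)
n = 500
mags = rng.uniform(4.5, 7.5, n)
dists = rng.uniform(5.0, 150.0, n)
c0, c1, c2, sigma_true = -4.0, 1.0, 1.3, 0.6
ln_pga = c0 + c1 * mags - c2 * np.log(dists) + rng.normal(0.0, sigma_true, n)

# Ordinary least squares regression for the attenuation coefficients
X = np.column_stack([np.ones(n), mags, -np.log(dists)])
coef, *_ = np.linalg.lstsq(X, ln_pga, rcond=None)
resid = ln_pga - X @ coef
print("fitted coefficients:", np.round(coef, 3))            # ~ (-4.0, 1.0, 1.3)
print("residual sigma (ln units):", round(resid.std(ddof=3), 3))  # ~ 0.6
```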

2.4 AN EXAMPLE OF EARTHQUAKE HAZARD PREDICTION USING HISTORICAL SEISMICITY DATA

A probabilistic seismic hazard analysis (PSHA) was conducted to evaluate the likelihood of various future earthquake shaking levels at a site in Needles, California, as reflected in peak horizontal acceleration values. This site was selected because it is quite distant from major active faults, so that the prediction of seismic hazard at the site is based largely on the statistics of historical earthquakes. Figure 2.1 shows the seismicity in the area surrounding North Needles that was used to estimate the recurrence used in the background seismic source characterisation. As shown in Figure 2.1, the selected rectangular area centred around North Needles is approximately 181 km wide by 276 km long. The largest recorded earthquake magnitude within the zone is 5.0. Historical earthquakes with magnitudes larger than 3 were used to evaluate the recurrence of the background seismic zone.

Figure 2.1: Historical earthquakes in the vicinity of Needles, CA, 1993–2001.

In Figure 2.2, the recurrence for the background seismic zone based on recorded seismicity is shown by solid diamonds, with vertical lines indicating uncertainties (plus/minus one standard deviation) in the data (Weichert, 1980). The historical seismicity was fitted using a truncated exponential recurrence model, as shown by the solid line in Figure 2.2. The slope (or b-value) of the fitted recurrence used in the present study was obtained using the approach proposed by Weichert (1980).

Figure 2.2: Exponential recurrence model: number of events per year with magnitude ≥ M plotted against magnitude M (recorded seismicity and fitted exponential model).
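Weichert's (1980) maximum likelihood method accounts for unequal observation periods across magnitude bands. The sketch below shows the simpler equal-observation-period maximum likelihood estimate on a synthetic catalogue, to illustrate the principle; the completeness magnitude and catalogue are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic catalogue above the completeness magnitude m0 = 3.0, drawn from
# an exponential (Gutenberg-Richter) law with true b-value 0.95
b_true, m0 = 0.95, 3.0
mags = m0 + rng.exponential(1.0 / (b_true * np.log(10.0)), size=400)

# Equal-period maximum likelihood b-value and its large-sample standard error
b_hat = np.log10(np.e) / (mags.mean() - m0)
se = b_hat / np.sqrt(len(mags))
print(f"b = {b_hat:.2f} +/- {se:.2f}")
```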

For the selected background seismic zone, the b-value obtained was 0.95. The maximum magnitude used to evaluate the recurrence of the background source zone was conservatively selected as 6.5. Because the recurrence shown in Figure 2.2 was developed for the entire zone, it can be further normalised to yield the recurrence per unit square kilometre. This recurrence was used to evaluate the seismic hazard of background seismicity at the site.

A logic tree approach was included as part of the PSHA for this example to incorporate the effect of uncertainties on the analysis. This approach allows the specification of multiple values, and the assignment of weighting factors on these values to reflect the judged likelihood of each being the true state of nature. For this study, the effects of uncertainties in seismic source characterisation (i.e. seismogenic depth, maximum earthquake magnitude, slip-rate, and segmentation scenario) and attenuation relationships were included as part of the logic tree uncertainty analysis. The selection of appropriate attenuation relationships is based upon the general subsurface site conditions. Based on the logs of soil borings obtained at the site and the general geology, the attenuation relationships by Idriss (1987; 1994) for deep soil site conditions, Abrahamson and Silva (1997) for soil site conditions, and Sadigh et al. (1997) for soil site conditions were used in the PSHA.

The results of the PSHA, expressed in terms of horizontal peak ground acceleration plotted against the mean number of events per year that result in that peak horizontal acceleration being exceeded at the site – also referred to as the 'annual frequency of exceedance' – are shown in Figure 2.3. The contributions of the two seismic source types (identified fault sources and background seismicity) to the total seismic hazard at the site are also shown in this figure. As can be seen, the main contributor to the total hazard at the site is the background seismicity (statistics of historical earthquakes). Based on the hazard curve shown in Figure 2.3, the computed horizontal peak ground acceleration corresponding to a 10% probability of exceedance in 50 years (also called the 475-year return period) is 0.122g, where g is acceleration due to gravity.

Figure 2.3: Annual frequency of exceedance plotted against peak horizontal acceleration (background seismicity, fault sources and total).


The magnitude and distance contributions to this shaking level are shown in Figure 2.4 (see colour insert). Based on this figure, an earthquake event of magnitude 6.25 is recommended for use with the peak ground acceleration above. The computed uniform hazard response spectra at 3%, 5% and 7% of critical damping, for a 475-year average return period (ARP), are presented in Figure 2.5. Their maximum response spectral accelerations are 0.388g, 0.325g and 0.283g, respectively, at a period of 0.3 s.

2.5 SUMMARY

This chapter has provided an application of statistics to earthquake hazard prediction. A theoretical background was discussed, which provided a detailed description of how statistical data and uncertainties are considered in probabilistic earthquake hazard prediction. Earthquake ground motion attenuation relations have been developed using the regression method, based on recorded earthquakes at various site conditions. An example of how to apply historical seismicity data in predicting future earthquake shaking levels for building design was elaborated.

Figure 2.5: Design-level response spectra for the 475-year return period earthquake: horizontal pseudo-spectral acceleration (g) plotted against period (s), at 3%, 5% and 7% of critical damping.


REFERENCES

Abrahamson, N.A. and Silva, W.J. (1997). Empirical response spectral attenuation relations for shallow crustal earthquakes. Seismological Research Letters. 68, 1: 94–127.
Cornell, C.A. (1968). Engineering seismic risk analysis. Bulletin of the Seismological Society of America. 58: 1583–1606.
Idriss, I.M. (1987). Earthquake Ground Motions. Lecture Notes, Course on Strong Ground Motion. EERI, Pasadena, CA.
Idriss, I.M. (1994). Personal communication through Dr Paul G. Somerville.
Sadigh, K., Chang, C.-Y., Egan, J.A., Makdisi, F. and Youngs, R.R. (1997). Attenuation relationships for shallow crustal earthquakes based on California strong motion data. Seismological Research Letters. 68, 1: 180–189.
Weichert, D. (1980). Estimation of the earthquake recurrence parameters for unequal observation periods for different magnitudes. Bulletin of the Seismological Society of America. 70, 4: 1337–1347.
Wells, D.L. and Coppersmith, K.J. (1994). New empirical relationships among magnitude, rupture length, rupture width, rupture area, and surface displacement. Bulletin of the Seismological Society of America. 84, 4: 974–1002.
Youngs, R.R. and Coppersmith, K.J. (1985a). Development of a fault-specific recurrence model. Earthquake Notes (abs.). 56, 1: 16.
Youngs, R.R. and Coppersmith, K.J. (1985b). Implications of fault slip rates and earthquake recurrence models to probabilistic seismic hazard estimates. Bulletin of the Seismological Society of America. 75: 939–964.

CHAPTER 3

Adaptive Sampling of Ecological Populations

Jennifer A. Brown

3.1 INTRODUCTION

Adaptive sampling refers to sampling designs where the protocol for data collection changes, evolves or adapts during the course of the survey. There are many different designs that can be used for collecting data in the field. Simple random sampling, stratified sampling, systematic sampling and cluster sampling are all examples of commonly used standard designs. These designs specify the data collection process rather than the data collection device. The design may specify that 30 plots of 1 m² be placed at random within the study site. This is an example of a simple random sample design with a sample size of n = 30. The device, or sampler, is the 1 m² plot that is used to collect measurements from the sample unit. If the interest is in a plant species' density, the measurement from each sample unit or plot will be a count of the observed plants of the species. This example introduces the concept of 'design' in a simple way: the design is about how to collect the data in the field.

Carrying on with this simple example, there may often be a better way to design the survey than using simple random sampling, especially in large field surveys. Simple random sampling will, if carried out correctly, always give an unbiased estimate of the true characteristic of interest (e.g. the overall population total), but it does leave a lot to chance. There are usually more sophisticated designs that can be used. Stratified sampling is a good example. Consider a study where there is some information on the different soil types and it is known that there is a relationship between soil type and species density. By separating the study area into groups, or strata, of similar soil types, and taking samples from within each stratum, the overall precision of the survey will be at least as good as if a simple random sample were taken.



As an example of adaptive sampling, if a stratified design were used for the plant survey and strata were defined based on some auxiliary information such as soil type, the question of how much sample effort to allocate to each stratum needs to be addressed. The theory tells us that effort should be allocated among strata on the basis of relative within-stratum variance, stratum size and within-stratum sampling costs (Cochran, 1977). But what if this within-stratum variance is not known? Here is your first example of adaptive sampling. You could sample all the strata with some effort and use these preliminary first-phase results to estimate the within-stratum variance, to decide what effort should be allocated where in the second phase. This two-phased stratified design proposed by Francis (1984) is one of the early examples of an adaptive design. As information is gained during the course of the survey, the design for data collection evolves.

There are many examples of adaptive sampling in environmental science, and in this chapter we review some of them, with particular reference to populations that are rare and clustered. A text on adaptive sampling was published in 1996 by Thompson and Seber, which presents much of the theoretical background and development of estimators. In this chapter, the focus is on some adaptive sampling examples, and readers are directed to the reference papers or to Thompson and Seber (1996) for background reading. The emphasis is on rare and clustered populations simply because these are often the most challenging to survey and are also very common in environmental science. An excellent discussion of the definition of rare and clustered is given in McDonald (2004). Rare and clustered populations are challenging to survey because, when things are rare, without some targeted field effort most of the time in the field is spent finding nothing – and keeping the field crew focused is challenging! Statistically, the challenge is that with rare and clustered populations the survey precision can be low. Consider the simple random sample of 30 plots with counts of plants, for a population with a density of 2 plants/m². The variance among the 30 plot counts would be about 2 if the population were randomly distributed, but could be over 40 if the same total number of plants were observed in only two of the plots (with the other 28 plots having zero counts), as the sketch below illustrates. Any design that helps target field effort to where the species of interest is should help with both the logistics of the field survey and its statistical precision. This is where adaptive sampling can be extremely useful – it makes sense to target field effort to where the rare plants or animals are.
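A quick numerical check of this contrast (counts are simulated for the spatially random case; the clustered case places all 60 plants in two of the 30 plots, as in the text):

```python
import numpy as np

rng = np.random.default_rng(3)

# 30 plots of 1 m^2, mean density 2 plants/m^2 (60 plants in total)
random_counts = rng.poisson(2.0, 30)          # spatially random population
clustered_counts = np.zeros(30)
clustered_counts[:2] = 30                     # all plants in just two plots

print("random:    mean %.1f, variance %.1f"
      % (random_counts.mean(), random_counts.var(ddof=1)))      # variance ~ 2
print("clustered: mean %.1f, variance %.1f"
      % (clustered_counts.mean(), clustered_counts.var(ddof=1)))  # ~ 58
```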

3.2 ADAPTIVE CLUSTER SAMPLING

Adaptive cluster sampling was introduced by Thompson (1990). It was developed for situations where you know the population is rare and clustered, but the exact location of the clusters is not known. The design can start with a random sample. Prior to sampling, a threshold value, C, is chosen, and if any of the units in the initial sample meet or exceed this threshold, y_i ≥ C, then neighbouring units are sampled. Any unit that meets or exceeds this threshold is considered to have 'met the condition'. If any of these neighbouring units meets the condition, its neighbouring units are selected, and so on. In this way, as sampling continues for any cluster that is detected in the initial sample, the shape and size of the cluster can be described. The final sample is the collection of clusters that were detected in the initial sample, along with any of the sample units that were in the initial sample but below the threshold.

The concept of the design is very attractive to practising field scientists. The idea of searching in the neighbourhood of the location where a rare plant (or animal) has been found is very appealing. This design takes that intuitive behaviour and puts it in the framework of probability sampling. The technicality of adaptive cluster sampling lies in calculating the probability of selecting a sample unit. In simple random sampling, the probability of selecting any sample unit is the same for each unit. With adaptive cluster sampling, the probability of a unit being selected is more complicated, because there can be many different ways for a unit, if it is part of an aggregate, to appear in the sample – it could be in the sample because it was in the initial selection, or because any one of its neighbouring units was in the initial sample. The reason why it is important to estimate these probabilities is that they are used in the calculation of the statistic of interest. This statistic is typically an estimate of the population mean or population total, and the associated estimate of the variance of the sample statistic. The design is a form of unequal probability sampling, and there are two very useful estimators that can be used: the Horvitz–Thompson and the Hansen–Hurwitz estimators. Without going into the details, which are well explained in Thompson and Seber (1996), the preferred formula is the Horvitz–Thompson estimator.

Two terms need to be defined to distinguish 'networks' and 'clusters'. A network is the collection of units around the unit in the initial sample that triggered neighbourhood searching; all these units will have met the condition. Neighbourhood searching is an adaptive process, and for neighbourhood searching to stop, units must have been measured and their value found to be below the threshold. These units are called 'edge units'. Together, networks and edge units make up a cluster. One additional technicality that is helpful to understand is that any unit in the initial sample that does not meet the condition is considered a network of size one. With this definition of a network, the entire study area can be divided up into distinct, non-overlapping networks. Some of these networks will be only one unit in size; others will be larger than one unit (Figure 3.1). With the condition y_i ≥ 1, the 200-quadrant study area can be divided up into 187 distinct networks. The top right-hand corner has a network of size 5, the aggregate to the left of this is a network of size 7, and there is one more network of size 4. All the other quadrants represent networks of size 1, even those with counts in them, because for these there are no counts in the neighbouring quadrants. The estimator for the population total allows for unequal probability selection because the larger networks have more chance of being selected than the smaller networks. The large size 5 network in the top right-hand corner will appear in the final sample if any of its five units are in the initial sample, whereas a size 1 network has only one chance of being in the final sample. The Horvitz–Thompson estimate of the population total, τ̂, is calculated using these distinct networks.


Figure 3.1: A population of blue-winged teal ducks (from Smith et al., 1995). There are 200 quadrants of 25 km² each, and the count of observed ducks in each quadrant is shown. The left side of the figure shows an initial sample of 10 quadrants, and the right side shows the final sample. One network larger than one quadrant in size is selected in the final sample (the network has seven quadrants in it). The condition to trigger adaptive selection was y_i ≥ 1. A neighbourhood was defined as the surrounding four quadrants.

$$\hat{\tau} = \sum_{k=1}^{K} \frac{y^*_k z_k}{\alpha_k} \qquad (3.1)$$

where y*_k is the total of the y values in network k; z_k is an indicator variable equal to one if any unit in the kth network is in the initial sample, and zero otherwise; and α_k is the initial intersection probability. This initial intersection probability is the probability that at least one of the units in the network will be in the initial sample. For the kth network of size x_k, the initial intersection probability is


$$\alpha_k = 1 - \binom{N - x_k}{n} \Big/ \binom{N}{n}$$

where N is the size of the study area, x_k is the size of the network, and n is the size of the initial sample. The estimate of the variance of the estimated population total has a complicated-looking formula because joint inclusion probabilities, α_jk, need to be calculated, that is, the probability that both networks j and k appear in the initial sample:

$$\alpha_{jk} = 1 - \left[\binom{N - x_j}{n} + \binom{N - x_k}{n} - \binom{N - x_j - x_k}{n}\right] \Big/ \binom{N}{n} \quad \text{where } j \neq k$$

and α_jk = α_j where j = k. The estimator of the variance, V̂ar(τ̂), is

$$\widehat{\mathrm{Var}}(\hat{\tau}) = \sum_{j=1}^{K} \sum_{k=1}^{K} y^*_j\, y^*_k \left(\frac{\alpha_{jk} - \alpha_j \alpha_k}{\alpha_j \alpha_k}\right) z_j z_k \qquad (3.2)$$

Using the example in Figure 3.1 of the blue-winged teal (Smith et al., 1995), an initial sample of 10 quadrants is taken. The survey was designed with the threshold condition y_i ≥ 1 and with the neighbourhood defined as the four surrounding quadrants. Only one quadrant in the initial sample triggered adaptive selection of the surrounding quadrants. The final sample size was 16, but in total many more than 16 units had to be visited: the four neighbouring units are always visited, or checked to see if ducks are present, but only those in the initial sample or in a network are used in calculating the sample estimators. These other units are the edge units, and are visited and checked so that the 'edge' of the networks can be defined. With a simple condition like y_i ≥ 1, these units only needed to be checked to see if ducks were present or absent. However, with a condition like y_i ≥ 10, ducks within the units would need to be counted to know whether there were fewer than 10, something that may be more time-consuming than simply checking for presence. The Horvitz–Thompson estimate of the population total (Equation 3.1) is

$$\hat{\tau} = \sum_{k=1}^{187} \frac{y^*_k z_k}{\alpha_k} = \frac{13\,753 \times 1}{0.3056} + 0 + \cdots + 0 = 45\,003$$

The only non-zero term in this equation is for the network of size 7. The values of all the other terms are zero. The other nine networks that were selected in the initial sample were only one quadrant in size and had y* = 0. All the other networks that were not selected have z_k = 0. The initial intersection probability for the size 7 network is calculated as follows:


$$\alpha_1 = 1 - \binom{200 - 7}{10} \Big/ \binom{200}{10} = 0.3056$$

X187 X187  y Æ jk  Æ j Æ k z z V^ a r(^) ¼ y j k j k j¼1 k¼1 Æ jÆ k



Æ11  Æ1 Æ1 Æ12  Æ1 Æ2 z1 z2 z1 z1 þ y1 y2 ¼ y1 y1 Æ1 Æ 1 Æ1 Æ2

 Æ187187  Æ187 Æ187 z z þ . . . þ y y 187 187 187 187 Æ187 Æ187



Æ11  Æ1 Æ1 Æ12  Æ1 Æ2 z1 z2 z1 z1 þ y1  0  ¼ y1 y1 Æ1 Æ 1 Æ1 Æ2

Æ187187  Æ187 Æ187   0 þ . . . þ y187 y187 Æ187 Æ187 ¼ 13 7532

0:3056  0:30562 11 0:30562

¼ 4:2978  108 Fortunately, software packages can now assist with these calculations, for example, SAMPLE at www.lsc.usgs.gov/aeb/davids/acs/ (Morrison et al., 2008). Whether adaptive cluster sampling is an efficient design or not depends first on how clustered the population is and, second, on the survey design. As a general principle, the more clustered the population is, the more efficient adaptive cluster sampling is compared with simple random sampling. In simple random sampling, few design choices need to be made. The design choices are about the sample unit size and shape, and the total sample size. In adaptive cluster sampling, designing the survey is more complex. In addition to choices about the sample unit size and shape, the size of the initial sample needs to be selected, and the criteria for adaptive selection and the neighbourhood need to be defined (e.g. the surrounding two, four or eight neighbouring units). There is considerable literature on how to design an efficient survey and much of this is reviewed in Smith et al. (2004) and Turk and Borkowski (2005). Again, staying


with general principles only, efficient designs are those where the final sample size is not excessively larger than the initial sample size and where the networks are small. This can be achieved by using a high threshold for adapting and a small neighbourhood definition (Brown, 2003). One final design issue is a concern often raised with adaptive designs: the size of the final sample is not known prior to sampling, which makes planning the field work difficult. Restricting the final sample size by a stopping rule has been discussed by Brown and Manly (1998), Salehi and Seber (2002), and Lo et al. (1997). Another approach is an inverse sampling design, where surveying stops once a set number of non-zero units have been selected (Christman and Lan, 2001; Seber and Salehi, 2004). Fortunately, along with the growing literature on how to design an adaptive cluster sample, there is growing literature on its application to a range of environmental situations. Some recent examples are the use of adaptive cluster sampling for surveys of plants (Philippi, 2005), waterfowl (Smith et al., 1995), seaweed (Goldberg et al., 2006), shellfish (Smith et al., 2003), marsupials (Smith et al., 2004), forests (Talvitie et al., 2006; Magnussen et al., 2005), herpetofauna (Noon et al., 2006), larval sea lampreys (Sullivan et al., 2008), sediment load in rivers (Arabkhedri et al., 2010), hydroacoustic surveys (Conners and Schwager, 2002) and fish eggs (Smith et al., 2004; Lo et al., 1997). This discussion has focused on adaptive cluster sampling where the initial sample is selected at random. The design can also be applied to systematic sampling (Thompson, 1991a; Acharya et al., 2000), stratified sampling (Thompson, 1991b; Brown, 1999) and two-stage sampling (Salehi and Seber, 1997).

3.3 ADAPTIVE ALLOCATION FOR STRATIFIED AND TWO-STAGE SAMPLING

Alternative (but related) designs to adaptive cluster sampling are designs based on stratified and two-stage sampling. These conventional designs can have an adaptive component added in the same way that the adaptive component is added to simple random sampling in adaptive cluster sampling. Stratified and two-stage sampling are both designs where the study area is sectioned into strata or primary units. With stratified sampling, all strata are selected and a sample is taken within each, usually by simple random sampling. With two-stage sampling, a selection of primary units is chosen and, within each chosen primary unit, a sample is taken, usually by simple random sampling. The essential difference between the two categories of designs is whether there is some selection of the primary units or not: in stratified sampling, all primary units are selected and are called strata. The estimation of the statistic of interest (usually the mean or the total) differs between the two categories to reflect the different sources of variation in the sample: in two-stage sampling, only a selection of primary units is chosen, so there is an additional component of uncertainty. Stratified and two-stage sampling can be very useful for sampling rare and


clustered populations in the same way that adaptive cluster sampling is useful. If the location and size of the clusters are known, or can be approximated by some auxiliary information such as a habitat suitability score, then by matching the size and shape of the primary unit or stratum to these clusters, survey effort can be targeted specifically there. This can be done by allocating considerably more survey effort to the primary units or strata that are known to contain the clusters of interest, thus increasing the survey intensity in the areas of interest. Very often, the location and size of clusters are not known, and even habitat maps and habitat prediction models for species distribution have some uncertainty. In these situations, sectioning the study area into strata or primary units should be based on the idea of minimising the within-primary-unit or within-stratum variance, so that the units within each primary unit are as similar as possible. Other considerations are often related to field logistics. Natural features in the field, such as catchment boundaries or fence lines, can be used. The size of primary units may represent what sampling can be achieved by the field crew in one day, simplifying planning of field work to a primary unit per day. One way of applying adaptive sampling to a stratified design was described in the introduction to this chapter. In the two-phase stratified design proposed by Francis (1984), the survey area is sectioned into strata and initial survey effort is allocated based on the best available information, using the standard approach of putting more effort into the more variable strata. After this initial phase of surveying, the preliminary information can be used to improve the estimate of the strata variability, and the remaining survey effort can be allocated to the strata where it will be most effective in reducing the overall sample variance. The adaptive allocation in the second phase is done to adjust for, or to make up for, any shortcomings in the initial allocation of effort. The initial allocation of effort should be done using the best information available prior to the survey, and the final allocation is done using the information gained during the survey. A similar scheme was proposed by Jolly and Hampton (1990). In the first phase, a conventional stratified sample is taken from the population. Then, using the first-phase sample results to estimate within-stratum variance, the remaining sample units are added one by one to individual strata. At each step of this sequential allocation of sample units, the stratum that is allocated the unit is chosen on the basis of where the greatest reduction in variance will be achieved. For some populations, rather than using within-stratum variance as the criterion for adaptive allocation, the square of the stratum mean is preferred (Francis, 1984). Once this desktop exercise of allocating the second-phase effort is complete, the actual field sampling resumes and the second phase is conducted. The final estimates are based on the pooled information from the first and second phases. This does result in a small bias, and bootstrapping has been proposed for bias correction and variance estimation (Manly, 2004). The design has been extended to surveying multiple populations (Manly et al., 2002). Smith and Lundy (2006) used a modified design to conduct a stratified sample of sea scallops.
Based on the within-stratum mean from the first phase, a fixed amount of effort was allocated to each stratum where the mean was above a threshold value.


They used the Rao–Blackwell method (Thompson and Seber, 1996) to derive an unbiased estimate for the population. Another example of adaptive allocation, this time with two-stage sampling, is adaptive two-stage sequential sampling (Brown et al., 2008). Using much the same concept as the two-phase stratified design, an initial sample is taken from selected primary units. Then, in the second phase, additional units are allocated to the primary units in proportion to the number of observed units that exceed a threshold value: the $i$th primary unit receives $g_i \lambda$ additional units, where $g_i$ is the number of sampled units in the $i$th primary unit that exceed the threshold value and $\lambda$ is a multiplier. In the simulation study in Brown et al. (2008), the blue-winged teal duck population (Smith et al., 1995; Salehi and Smith, 2005) is divided into eight primary units. In the first phase, samples are taken from within selected primary units. Different levels of survey intensity were trialled, varying both the number of selected primary units and the number of units taken from within each selected primary unit. Second-phase effort is then allocated, again using different levels of survey intensity with varying levels of the multiplier $\lambda$ and varying levels of the threshold value. The design allows survey effort to be intensified at the locations where, in this case, the ducks from this rare and clustered population are found. The simulation results show gains in survey efficiency (measured by reduced sample variance) when the adaptive component is added to the conventional two-stage design. The range of designs used in the simulations allows some general observations to be made about two-stage (and stratified) sampling and about adaptive allocation. It is important to ensure that adequate effort is available for the adaptive allocation that occurs in the second phase of sampling if large gains in efficiency are to be realised. For the same total effort, Brown et al. (2008) recommend putting less effort into the initial sample of the selected primary sample units, to ensure that more effort is available for the sequential allocation of additional units, rather than the reverse. Another recommendation was that the threshold value used to 'trigger' adaptive allocation of additional units should be relatively high. These recommendations are consistent with what is recommended for adaptive cluster sampling (Brown, 2003; Smith et al., 2004). A third example of adaptive sampling applied to conventional stratified or two-stage sampling is the complete allocation stratified design (Salehi and Brown, 2010). This is a simplified design for adaptive stratified sampling: if any unit in a stratum has a value that exceeds a threshold, the stratum is completely surveyed. It is simplified in two ways. First, the rule to decide whether a stratum is to be allocated additional survey effort does not require the first-phase survey in the stratum to be completed. Second, the instruction to the field crew on how much additional effort is required is simply to survey the entire stratum. The complete allocation stratified design merges the best features of some of the previous adaptive designs. In adaptive cluster sampling, the appeal is that it allows field biologists to do what they intuitively want to do in surveys of rare and clustered populations: having searched endlessly without seeing anything, once they do see an individual (or a threshold number of individuals), they are reluctant to leave and want to stay searching around it.
The adaptive searching of the neighbourhood in adaptive


cluster sampling is similar to conducting a complete search in the vicinity of the found individual. In complete allocation, once an individual is observed, the neighbourhood is completely searched. The difference is that in adaptive cluster sampling the neighbourhood is not defined prior to sampling and, for some populations, can be excessively large (Brown, 2003), whereas in complete allocation stratified sampling the searched neighbourhood is defined and constrained by the stratum boundary. The estimate of the population total for complete allocation stratified sampling, $\hat{\tau}_{st}$, is

$$\hat{\tau}_{st} = \sum_{h=1}^{\gamma} \frac{y_h^*}{\pi_h} \qquad (3.3)$$

where $y_h^*$ is the total of the $y$ values in the $h$th stratum, $\gamma$ is the number of strata that were completely surveyed, and $\pi_h$ is the probability that the whole stratum $h$ is selected:

$$\pi_h = 1 - \binom{N_h - m_h}{n_h} \bigg/ \binom{N_h}{n_h}$$

where, for the $h$th stratum, $N_h$ is the size of the stratum, $m_h$ is the number of non-empty units in the stratum, and $n_h$ is the size of the initial sample in the stratum. An unbiased variance estimator for complete allocation stratified sampling, $\widehat{\mathrm{Var}}[\hat{\tau}_{st}]$, is

$$\widehat{\mathrm{Var}}[\hat{\tau}_{st}] = \sum_{h=1}^{\gamma} \frac{(1 - \pi_h)\, y_h^{*2}}{\pi_h^2} \qquad (3.4)$$

As an example, case-study data from a previously unpublished study by Brown are provided. The study is of a rare buttercup found in the South Island of New Zealand. The buttercups in the study are within the Lance McCaskill Nature Reserve. The Castle Hill buttercup (Ranunculus crithmifolius subsp. paucifolius) is one of New Zealand's rarest plants. Locations of buttercup plants observed in a study conducted in November 1998 are mapped within 10 × 10 m quadrants (Figure 3.2). The study area was divided into 12 strata, each containing 25 quadrants. In the first phase, a simple random sample of size 3 was taken from each of the strata (Figure 3.2), and buttercups were observed in samples from three of the strata. In the second phase, these three strata were surveyed completely. The total final sample size is therefore (3 × 25) + (9 × 3) = 102. For the three strata that triggered adaptive sampling, the probabilities $\pi_h$ that the whole stratum is selected are

$$\pi_5 = 1 - \binom{11}{3} \bigg/ \binom{25}{3} = 0.928, \qquad \pi_6 = 1 - \binom{17}{3} \bigg/ \binom{25}{3} = 0.704, \qquad \pi_8 = 1 - \binom{11}{3} \bigg/ \binom{25}{3} = 0.928$$

Figure 3.2: A population of Castle Hill buttercups. There are 300 quadrants of 100 m² each, with the count of buttercups shown in each occupied quadrant. The study area is sectioned into 12 strata; three quadrants are sampled from each stratum. [The map of quadrant counts is not reproduced here.]


In the other nine strata, no plants were observed. The estimate of the total number of plants (Equation 3.3) is

$$\hat{\tau}_{st} = \sum_{h=1}^{3} \frac{y_h^*}{\pi_h} = \frac{66}{0.928} + \frac{11}{0.704} + \frac{73}{0.928} = 165.4$$

The estimated variance of this estimate (Equation 3.4) is

$$\widehat{\mathrm{Var}}[\hat{\tau}_{st}] = \sum_{h=1}^{3} \frac{(1 - \pi_h)\, y_h^{*2}}{\pi_h^2} = \frac{(1 - 0.928)\,66^2}{0.928^2} + \frac{(1 - 0.704)\,11^2}{0.704^2} + \frac{(1 - 0.928)\,73^2}{0.928^2} = 878.4$$
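These estimates can be checked with the same kind of script used for the teal example. In this minimal Python sketch, the non-empty unit counts $m_h$ (14, 8 and 14) are inferred from the $\pi$ values quoted above, so treat them as an assumption rather than as reported data:

```python
from math import comb

# Castle Hill buttercup example: 12 strata of N_h = 25 quadrants,
# initial sample n_h = 3 per stratum. Three strata were completely
# surveyed; each (y_h, m_h) pair gives the stratum total and the
# number of non-empty quadrants (m_h inferred from the pi values).
Nh, nh = 25, 3
strata = [(66, 14), (11, 8), (73, 14)]

pi = [1 - comb(Nh - m, nh) / comb(Nh, nh) for _, m in strata]
tau_st = sum(y / p for (y, _), p in zip(strata, pi))
var_st = sum((1 - p) * y**2 / p**2 for (y, _), p in zip(strata, pi))

print([round(p, 3) for p in pi])   # [0.928, 0.704, 0.928]
print(round(tau_st, 1))            # ~165.4
print(round(var_st, 1))            # ~878.5
```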

3.4 DISCUSSION

The addition of an adaptive selection component to conventional sample designs gives a wide range of flexible and useful sampling techniques. Adaptive sampling has special application to sampling rare and clustered populations because it allows survey effort to be targeted to where any plant or animal of interest has been found. Various designs have been introduced in this chapter, with the focus on how adaptive selection can be added to existing, well-known designs; here, the discussion has been limited to simple random sampling and to two-stage and stratified sampling. In adaptive cluster sampling, extra effort is allocated to the neighbourhood of units in the initial sample that triggered adaptive selection, typically a unit where the count of the plant or animal within it is greater than zero. With stratified or two-stage designs, adaptive allocation of additional effort is triggered on the basis of a measure of the whole stratum (or primary unit). Two-phase stratified sampling (Francis, 1984), adaptive two-stage sequential sampling (Brown et al., 2008) and complete allocation stratified sampling are examples of adaptive sampling applied to two-stage and stratified sampling. There is a growing amount of literature on adaptive sampling. To assist this expanding field, Salehi and Brown (2010) suggest the use of the terms 'adaptive searching' and 'adaptive allocation' to distinguish two categories. Adaptive searching refers to designs, such as adaptive cluster sampling, where the neighbourhood is searched. In contrast, in adaptive allocation, extra effort is initiated once a collection of units (e.g. the stratum or the primary unit) has been sampled. The distinction between the two classes is based on where and when the decision to allocate extra effort can be


made: immediately after an individual sample unit is measured, or once a collection of units has been completely sampled. The other terminology that is established, but needs to be adhered to, is the distinction between two-stage and two-phase sampling. In this chapter, two-stage sampling is where a selection of primary units is sampled and, in the second stage, a sample is taken within each. Two-phase sampling in this chapter is where an initial sample is taken and, on the basis of information from that sample, additional effort is allocated in the second phase. In adaptive searching, the decision to conduct the second phase of sampling occurs concurrently with the first phase; in adaptive allocation, the decision to conduct the second-phase sampling occurs after the first phase. All the designs discussed are remarkably efficient, giving estimates of populations that have lower variance than the conventional design without the adaptive selection. However, as with conventional sampling, the survey must be carefully designed to realise these efficiency gains, in terms of the size and number of strata, the allocation of effort to the first phase (i.e. the initial sample before the additional effort is allocated), the threshold used to trigger adaptive allocation and, for adaptive cluster sampling, the neighbourhood definition.

ACKNOWLEDGEMENTS

The author would like to thank Mohammad Salehi, Mohammad Moradi, and David Smith for ongoing collaborative work in this field. Also, thanks are given to Miriam Hodge, to whom the author is always indebted.

REFERENCES

Acharya, B., Bhattarai, G., de Gier, A. and Stein, A. (2000). Systematic adaptive cluster sampling for the assessment of rare tree species in Nepal. Forest Ecology and Management. 137: 65–73.
Arabkhedri, M., Lai, F.S., Noor-Akma, I. and Mohamad-Roslan, M.K. (2010). An application of adaptive cluster sampling for estimating total suspended sediment load. Hydrology Research. 41, 1: 63–73.
Brown, J.A. (1999). A comparison of two stratified sampling designs: adaptive cluster sampling and a two-phase sampling design. Australian and New Zealand Journal of Statistics. 41, 4: 395–404.
Brown, J.A. (2003). Designing an efficient adaptive cluster sample. Environmental and Ecological Statistics. 10: 95–105.
Brown, J.A. and Manly, B.F.J. (1998). Restricted adaptive cluster sampling. Environmental and Ecological Statistics. 5: 47–62.
Brown, J.A., Salehi, M.M., Moradi, M., Bell, G. and Smith, D.R. (2008). An adaptive two-stage sequential design for sampling rare and clustered populations. Population Ecology. 50, 3: 239–245.
Christman, M.C. and Lan, F. (2001). Inverse adaptive cluster sampling. Biometrics. 57: 1096–1105.
Cochran, W.G. (1977). Sampling Techniques, 3rd edition. Wiley, New York.
Conners, M.E. and Schwager, S.J. (2002). The use of adaptive cluster sampling for hydroacoustic surveys. ICES Journal of Marine Science. 59, 6: 1314–1325.
Francis, R.I.C.C. (1984). An adaptive strategy for stratified random trawl surveys. New Zealand Journal of Marine and Freshwater Research. 18: 59–71.
Goldberg, N.A., Heine, J.N. and Brown, J.A. (2006). The application of adaptive cluster sampling for rare subtidal macroalgae. Marine Biology. 151: 1343–1348.
Jolly, G.M. and Hampton, I. (1990). A stratified random transect design for acoustic surveys of fish stocks. Canadian Journal of Fisheries and Aquatic Sciences. 47: 1282–1291.
Lo, N.C.H., Griffith, D. and Hunter, J.R. (1997). Using restricted adaptive cluster sampling to estimate Pacific hake larval abundance. California Cooperative Oceanic Fisheries Investigations Report. 37: 160–174.
Magnussen, S., Kurz, W., Leckie, D.G. and Paradine, D. (2005). Adaptive cluster sampling for estimation of deforestation rates. European Journal of Forest Research. 124, 3: 207–220.
Manly, B.F.J. (2004). Using the bootstrap with two-phase adaptive stratified samples from multiple populations at multiple locations. Environmental and Ecological Statistics. 11: 367–383.
Manly, B.F.J., Akroyd, J.M. and Walshe, K.A.R. (2002). Two-phase stratified random surveys on multiple populations at multiple locations. New Zealand Journal of Marine and Freshwater Research. 36: 581–591.
McDonald, L.L. (2004). Sampling rare populations. In: Thompson, W.L. (Ed.) Sampling Rare and Elusive Species. Island Press, Washington DC, pp. 11–42.
Morrison, L.W., Smith, D.R., Nichols, D.W. and Young, C.C. (2008). Using computer simulations to evaluate sample design: an example with the Missouri bladderpod. Population Ecology. 50: 417–425.
Noon, B.R., Ishwar, N.M. and Vasudevan, K. (2006). Efficiency of adaptive cluster and random sampling in detecting terrestrial herpetofauna in a tropical rainforest. Wildlife Society Bulletin. 34: 59–68.
Philippi, T. (2005). Adaptive cluster sampling for estimation of abundances within local populations of low-abundance plants. Ecology. 86: 1091–1100.
Salehi, M.M. and Brown, J.A. (2010). Complete allocation sampling: an efficient and easily implemented adaptive sampling design. Population Ecology. 52, 3: 451–456.
Salehi, M.M. and Seber, G.A.F. (1997). Two-stage adaptive cluster sampling. Biometrics. 53: 959–970.
Salehi, M.M. and Seber, G.A.F. (2002). Unbiased estimators for restricted adaptive cluster sampling. Australian and New Zealand Journal of Statistics. 44: 63–74.
Salehi, M.M. and Smith, D.R. (2005). Two-stage sequential sampling: a neighborhood-free adaptive sampling procedure. Journal of Agricultural, Biological and Environmental Statistics. 10: 84–103.
Seber, G.A.F. and Salehi, M.M. (2004). Adaptive sampling. In: Armitage, P. and Colton, T. (Eds) Encyclopedia of Biostatistics, Volume 1, 2nd edition. Wiley and Sons, Chichester, pp. 59–65.
Smith, S.J. and Lundy, M.J. (2006). Improving the precision of design-based scallop drag surveys using adaptive allocation methods. Canadian Journal of Fisheries and Aquatic Sciences. 63: 1639–1646.
Smith, D.R., Villella, R.F. and Lemarié, D.P. (2003). Application of adaptive cluster sampling to low-density populations of freshwater mussels. Environmental and Ecological Statistics. 10: 7–15.
Smith, D.R., Brown, J.A. and Lo, N.C.H. (2004). Application of adaptive cluster sampling to biological populations. In: Thompson, W.L. (Ed.) Sampling Rare and Elusive Species. Island Press, Washington DC, pp. 75–122.
Smith, D.R., Conroy, M.J. and Brakhage, D.H. (1995). Efficiency of adaptive cluster sampling for estimating density of wintering waterfowl. Biometrics. 51: 777–788.
Sullivan, W.P., Morrison, B.J. and Beamish, F.W.H. (2008). Adaptive cluster sampling: estimating density of spatially autocorrelated larvae of the sea lamprey with improved precision. Journal of Great Lakes Research. 34: 86–97.
Talvitie, M., Leino, O. and Holopainen, M. (2006). Inventory of sparse forest populations using adaptive cluster sampling. Silva Fennica. 40: 101–108.
Thompson, S.K. (1990). Adaptive cluster sampling. Journal of the American Statistical Association. 85: 1050–1059.
Thompson, S.K. (1991a). Adaptive cluster sampling: designs with primary and secondary units. Biometrics. 47: 1103–1115.
Thompson, S.K. (1991b). Stratified adaptive cluster sampling. Biometrika. 78: 389–397.
Thompson, S.K. and Seber, G.A.F. (1996). Adaptive Sampling. Wiley, New York.
Turk, P. and Borkowski, J.J. (2005). A review of adaptive cluster sampling: 1990–2003. Environmental and Ecological Statistics. 12: 55–94.

CHAPTER 4

Statistics in Environmental Policy Making and Compliance in Surface Water Quality in California, USA

Jian Peng

Many states and environmental groups fault EPA for delays in issuing guidance and assistance needed to carry out the provisions of the law. EPA and others are critical of states, in turn, for not reaching beyond conventional knowledge and institutional approaches to address their water quality problems. Environmental groups have been criticized for insufficient recognition of EPA's and states' need for flexibility to implement the Act. Finally, Congress has been criticized for not providing adequate funding and resources to meet EPA and state needs. (Claudia Copeland, 2001, Implementing Clean Water Act, see www.ncseonline.org/NLE/CRSreports/water/h2o-15.cfm)

4.1 INTRODUCTION

This chapter outlines the statistical theories applied in environmental policy-making processes and in environmental compliance practice in the State of California, USA, with an emphasis on surface water quality. Statistics is a critical tool for collecting and interpreting environmental data. Environmental samples are often collected from an effectively infinite pool (e.g. water, soil or air), so sample representativeness is a perpetual issue. Physical, chemical and biological processes can complicate environmental problems, and statistical manipulations are often needed during investigation and data interpretation to reflect these processes. Moreover, regulatory and compliance decisions often bring significant economic and legal challenges, and statistics is often as important as the physical and biological sciences underlying environmental regulations, which are frequently subject to close technical and legal scrutiny. In California, the popularity of the lottery and Las Vegas casinos might seem to suggest that Californians are oblivious to statistics. However, California environmental policies show otherwise, since they are often heavily dosed, if not explicitly decorated,


with statistical sciences. In most instances, the policies are presented in plain language free from statistical terms, leaving the statistics and scientific justifications to underlying technical supporting documents (TSDs). In other cases, such as the Clean Water Act (CWA) 303(d) Listing Policy, statistical jargon and calculations take centre stage and can appear quite formidable to laymen. Owing to limits of space and of the author's expertise, this chapter cannot cover every aspect of environmental policy making and compliance. Rather, a general regulatory framework with an emphasis on surface water issues is provided for readers, who should have a basic knowledge of environmental science, policy and statistics. The references listed at the end of this chapter should facilitate further reading and research as needed.

4.2 CLEAN WATER ACT AND PORTER–COLOGNE WATER QUALITY CONTROL ACT

The Clean Water Act (CWA) (1972) is the cornerstone of surface water quality protection in the USA. The CWA uses both regulatory and non-regulatory tools to sharply reduce direct pollutant discharges into the waters of the USA, to finance municipal wastewater treatment facilities, and to manage polluted runoff in order to restore and maintain the chemical, physical and biological integrity of the nation's waters. Figure 4.1 summarises the basic framework of the CWA. First, water quality standards (WQSs) are established and monitoring is conducted to determine whether the WQSs are attained. Based on that determination, a water body is declared unimpaired or impaired according to the listing criteria; impaired waters are put on the CWA List of Impaired Waters, the CWA 303(d) list. Once a water body is on the list, total maximum daily loads (TMDLs) are required to be developed to bring water quality into compliance. To do so, implementation measures are needed, including National Pollutant Discharge Elimination System (NPDES) permitting, Section 401 programmes, Section 404 programmes, and funding programmes (e.g. the CWA 319(h) Grant Programme) to support water quality improvement projects.

Figure 4.1: Schematic flowchart of the Clean Water Act (CWA) and Porter–Cologne Act (US EPA, 2010). [Flowchart: set goals and water quality standards (WQSs); conduct monitoring; meeting WQSs? If not, list under 303(d) and develop strategies and controls (TMDLs); apply antidegradation; implement strategies through the NPDES, Section 401, Section 319, Section 404 and the State Revolving Fund (SRF).]

Analogous to the CWA as the national environmental law on water quality, the Porter–Cologne Act (1969) is the principal law that governs water quality in California. It establishes a comprehensive programme to protect water quality and its beneficial uses. Unlike the CWA, the Porter–Cologne Act applies to both surface water and groundwater. As the state-level counterpart of the CWA, the Porter–Cologne Act preceded the CWA but was revised to comply fully with it. The following sections are organised according to the logical flowchart of the CWA shown in Figure 4.1, with topics in sequential order: WQSs (Section 4.3), environmental monitoring design (Section 4.4), the CWA 303(d) Listing Policy (Section 4.5), TMDLs (Section 4.6) and implementation measures (Section 4.7).

4.3 STATISTICS IN ENVIRONMENTAL STANDARDS AND WATER QUALITY CRITERIA

A WQS, as defined in the CWA, consists of a numeric or narrative water quality criterion, designated uses ('beneficial uses' in California), an antidegradation policy, and implementation procedures, as shown in Figure 4.2. As one of the key elements of the CWA, the development of water quality criteria (WQC) is the first, and perhaps the most important, step in environmental policy making and one of the primary tools for managing water quality (US EPA, 2003). Numeric criteria provide a precise basis for deriving water quality-based effluent limitations (WQBELs) in NPDES permits and waste load allocations (WLAs) for TMDLs to control pollutant discharges. However, despite the levels of protection intended by water quality protection programmes, the criteria often do not specify these levels; that is, the probability that adverse events will occur is unknown. One of the problems is the 'one-size-fits-all' methodology used in their development (Reiley et al., 2003). For this reason, WQC tend to be conservative in most cases. Ideally, three different types of criteria should be established for each pollutant: national criteria that apply to the entire USA, regional criteria for a state or geographical region, and site-specific criteria. These types of criteria are progressively more site-specific, chemical-specific and organism-specific. In format, WQC can be narrative or numeric. In terms of environmental media, there are WQC for water, tissue and sediment. In terms of length of exposure, there are chronic (>4 days' exposure) and acute (instantaneous) criteria. WQC can also be categorised into aquatic life, human health (drinking water or consumption of fish/aquatic life) and wildlife criteria. For human health criteria guidance, the US Environmental Protection Agency (EPA) evaluates many diverse toxicity studies, whose results feed into a reference dose or cancer potency estimate (usually based on a 10⁻⁶ cancer risk) that, along with a number of exposure factors and a determination of risk level, results in a guidance criterion. For aquatic life, the EPA evaluates many diverse aquatic toxicity studies to determine chronic and acute toxicity, taking into account how other factors (such as pH, temperature or hardness) affect toxicity. The EPA also, to the extent possible, addresses bioaccumulation or bioconcentration to protect aquatic-dependent wildlife (US EPA, 2000).

Figure 4.2: Components of a Water Quality Standard (WQS). [Diagram: beneficial uses; criteria (numeric or narrative); antidegradation policy; implementation procedures; applied to waters of the United States/State.]
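To make the 10⁻⁶ risk arithmetic concrete, the sketch below works through the classic simplified textbook form of a human health criterion for a carcinogen. The 70 kg body weight, 2 L/day water intake, 0.0065 kg/day fish consumption, bioconcentration factor and potency value are all illustrative defaults, not the EPA's current methodology or any promulgated criterion:

```python
def carcinogen_criterion_ug_l(risk, potency, bw_kg=70.0,
                              water_l_day=2.0, fish_kg_day=0.0065, bcf=1.0):
    """Simplified human health criterion for a carcinogen (ug/L): the
    risk-specific dose spread over drinking-water intake plus fish
    consumption. potency is a cancer slope factor in (mg/kg-day)^-1."""
    mg_per_l = (risk * bw_kg) / (potency * (water_l_day + fish_kg_day * bcf))
    return mg_per_l * 1000.0  # convert mg/L to ug/L

# Hypothetical pollutant: slope factor 1.5 (mg/kg-day)^-1, 10^-6 risk
print(round(carcinogen_criterion_ug_l(1e-6, 1.5, bcf=50.0), 5))  # ~0.02 ug/L
```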

4.3.1 Process of setting WQSs

Water quality criteria are usually developed based on tests with surface water and epibenthic organisms, where the principal route of exposure is direct intake of water. To derive a chronic effect level, such as the criterion continuous concentration (CCC), there are two approaches: the threshold effect concentration approach (such as the 'no observed effect' concentration (NOEC)), and statistical analysis of the distribution of available data to project a concentration that is protective at a specified level. The latter approach is preferred if sufficient data are available (Reiley et al., 2003). In this approach, chronic toxicity data from all tested species (at least eight families; US EPA, 1986) are fitted to a statistical distribution; then an extrapolation or interpolation approach is used to estimate a given percentile within the sensitivity distribution (see Figure 4.3). This approach can be put simply as 'to protect 95% of species 95% of the time'. Many statistical techniques can be used to derive uncertainty bounds/intervals around the point estimate for the distribution. The main sources of uncertainty in this approach are the goodness of fit of the lognormal or logistic model to the data, and the leveraging of the fitted relationship. It is therefore important to inspect the data visually for distribution patterns. If there are insufficient data to calculate WQC, then the acute-to-chronic ratio (ACR) is used to estimate chronic thresholds based on at least three different families. The ACR is the ratio between the acute LC50 and the chronic value (the geometric mean of the NOEC and the 'low observed effect' concentration (LOEC)) for a given species, where both end points were established in the same laboratory under similar test conditions.

Figure 4.3: Derivation of the criterion concentration from the distribution of chronic data, 'to protect 95% of species 95% of the time'. The small bell curve represents the uncertainty around the 5th percentile of the larger curve (Reiley et al., 2003). [Plot: number of species affected against toxicant concentration; the 5th percentile in the tail of the distribution marks the toxicant concentration that will protect 95% of species with 95% certainty.]
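As an illustration of the distributional approach, the sketch below fits a lognormal species sensitivity distribution by moments and reads off its 5th percentile. The eight chronic values are hypothetical, and the sketch deliberately omits the uncertainty bounds discussed above:

```python
import math
from statistics import NormalDist, mean, stdev

def hc5(chronic_values):
    """5th percentile of a lognormal species sensitivity distribution,
    fitted by moments on the log10-transformed chronic toxicity values."""
    logs = [math.log10(v) for v in chronic_values]
    return 10 ** (mean(logs) + NormalDist().inv_cdf(0.05) * stdev(logs))

# Hypothetical chronic values (ug/L) for eight tested families
print(round(hc5([12, 25, 40, 58, 90, 140, 210, 400]), 1))  # ~10.8 ug/L
```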

4.3.2 NTR and CTR

The National Toxics Rule (NTR) specifies the national criteria that have been promulgated for 14 states since 1992. The California Toxics Rule (CTR) specifies the federally promulgated WQC for the State of California. These criteria have been derived using the procedures discussed above. The CTR is applicable to all of California's inland surface waters, enclosed bays and estuaries, for all purposes and programmes under the CWA, except for waters within Native American tribal jurisdictions (where the NTR applies). Water quality criteria are numeric limitations on chemical concentrations, but they are derived from end points based on biological effects. Therefore, if the correlation between

102

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

concentrations of chemicals and biological effects is affected by other factors, those factors need to be accounted for in the criteria. For example, for many metals (Cd, Cu, Cr, Pb, Ni, Ag, Se and Zn), the toxic effect levels (the criterion maximum concentration (CMC) and criterion continuous concentration (CCC)) are affected by hardness and by other chemical interactions that lower bioavailability, as evidenced by the ratio of field toxicity to laboratory toxicity, or water effect ratio (WER). The CMC and CCC can be calculated as

$$\mathrm{CMC} = \mathrm{WER} \times \mathrm{ACF} \times \exp\left[ m_A \ln(\mathrm{hardness}) + b_A \right] \qquad (4.1)$$

$$\mathrm{CCC} = \mathrm{WER} \times \mathrm{ACF} \times \exp\left[ m_C \ln(\mathrm{hardness}) + b_C \right] \qquad (4.2)$$

where ACF (the acute conversion factor), $m_A$, $m_C$, $b_A$ and $b_C$ are all chemical-specific constants for any given metal species (CTR: US EPA, 2000). After hardness (as CaCO₃) is measured, site-specific CMC and CCC values can be calculated from the above equations.
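The hardness equations translate directly into code. In the sketch below, the constants `m_a`, `b_a` and the conversion factor are placeholders chosen only to resemble typical values for a divalent metal; the actual chemical-specific constants must be taken from the CTR tables (US EPA, 2000), and WER = 1 assumes no site-specific water effect ratio has been derived:

```python
import math

def cmc_ug_l(hardness_mg_l, m_a, b_a, acf=1.0, wer=1.0):
    """Hardness-dependent criterion maximum concentration (Equation 4.1).
    Equation 4.2 for the CCC has the same form with m_C and b_C."""
    return wer * acf * math.exp(m_a * math.log(hardness_mg_l) + b_a)

# Placeholder constants, not the promulgated CTR values for any metal
print(round(cmc_ug_l(hardness_mg_l=100.0, m_a=0.9422, b_a=-1.700, acf=0.96), 1))
```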

4.3.3 Sediment quality objectives

The State Water Resources Control Board recently adopted Phase I Sediment Quality Objectives (SQOs) for the protection of benthic communities, based on a multiple lines of evidence (MLOE) approach that integrates sediment chemistry, sediment toxicity and benthic community information in order to achieve the following narrative SQO: 'Pollutants in sediments shall not be present in quantities that, alone or in combination, are toxic to benthic communities in bays and estuaries of California' (California SWRCB, 2009a). Similar to water column WQC, which are based on an organism's direct interaction with water, SQOs are based on adverse effects due primarily to direct interaction between the organism and the sediment. In the MLOE approach, the sediment chemistry line of evidence (LOE) is evaluated using the California logistic regression model (CA LRM) and the chemical score index (CSI). The CA LRM predicts the probability of sediment toxicity associated with the concentrations of 12 toxic chemicals. The CSI uses empirical thresholds to predict the benthic community disturbance associated with a slightly different set of 12 toxicants; it is not described here owing to space limitations. The CA LRM value is the maximum probability of toxicity from the individual chemical models (the $p$ value), as calculated by

$$p = \frac{e^{B_0 + B_1 \log(x)}}{1 + e^{B_0 + B_1 \log(x)}} \qquad (4.3)$$

where $p$ is the probability of observing a toxic effect; $B_0$ and $B_1$ are chemical-specific regression parameters provided in the guidance document (California SWRCB, 2009); and $x$ is the concentration of the chemical. The maximum $p$ value over the 12 toxicants, $p_{\max}$, is used to classify the sediment chemistry LOE into categories for the subsequent overall station evaluation. For the sediment toxicity LOE, different test organisms can be used in toxicity tests, and a minimum of one short-term survival test and one long-term sublethal test should


be conducted. Depending on the percentage survival rate and the statistical significance of differences between test and control samples, the toxicity of the sediment is classified into categories that are used later in the station-level evaluation. The benthic condition LOE is assessed using all four recommended methods: the benthic response index (BRI), the index of biotic integrity (IBI), the relative benthic index (RBI) and the river invertebrate prediction and classification system (RIVPACS). The results from all four methods are combined to evaluate the overall benthic condition. The BRI is discussed here as an example. This index is an abundance-weighted average pollution tolerance score of the organisms occurring in a sample, calculated by

$$\mathrm{BRI} = \frac{\sum_{i}^{N} P_i \, (n_i)^{0.25}}{\sum_{i}^{N} (n_i)^{0.25}} \qquad (4.4)$$

where $N$ is the total number of species; $n_i$ is the abundance (number of individuals) of the $i$th benthic species; and $P_i$ is the pollution tolerance score of the $i$th benthic species. After all three LOEs are evaluated, the chemistry and toxicity LOEs are combined to evaluate chemically mediated effects, and the toxicity and benthic community LOEs are combined to evaluate biological effects. Finally, chemically mediated effects and biological effects are combined to evaluate the overall station-level sediment quality, in five categories ranging from 'unimpacted' to 'clearly impacted'. The classification can then be used for regulatory purposes such as 303(d) listing.
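The two index calculations (Equations 4.3 and 4.4) are sketched below in minimal Python. The regression parameters, concentrations, abundances and tolerance scores are made up for illustration (the real $B_0$ and $B_1$ values come from the SQO guidance document), and base-10 logarithms are assumed for $\log(x)$:

```python
import math

def lrm_probability(conc, b0, b1):
    """CA LRM probability of toxicity for one chemical (Equation 4.3)."""
    t = b0 + b1 * math.log10(conc)
    return math.exp(t) / (1 + math.exp(t))

def bri(abundances, tolerance_scores):
    """Benthic response index (Equation 4.4): pollution tolerance scores
    averaged with fourth-root-transformed abundance weights."""
    weights = [n ** 0.25 for n in abundances]
    return sum(w * p for w, p in zip(weights, tolerance_scores)) / sum(weights)

# Illustrative values only, not taken from the SQO guidance
p_max = max(lrm_probability(c, b0, b1)
            for c, b0, b1 in [(0.8, -2.5, 1.1), (12.0, -3.0, 0.9)])
print(round(p_max, 3), round(bri([120, 35, 6], [15, 40, 80]), 1))
```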

4.4 STATISTICS IN ENVIRONMENTAL SAMPLING DESIGN

All environmental investigations start with sampling the environmental media for a clear and unbiased understanding of their physical, chemical and biological integrity. To this end, a statistically sound sampling design combined with a rigorous quality assurance/quality control (QA/QC) programme can provide scientifically defensible data, especially for a large-scale and/or long-term monitoring programme. Depending on the scope and objectives of the study, cost-effectiveness considerations, and a priori knowledge of the spatial and temporal variability of pollutants, sampling designs can range from simple random sampling to stratified, systematic, long-term sampling requiring exquisite statistical design (Gilbert, 1987). Two representative examples are given below.

4.4.1 Surface Water Ambient Monitoring Programme

California has a Surface Water Ambient Monitoring Programme (SWAMP) that oversees the State’s monitoring activities. To ensure statewide consistency, SWAMP


specifies the protocols and methodologies to be used for sampling, data analysis and data reporting. SWAMP is uniquely positioned to promote collaboration with other entities by proposing conventions for monitoring design, measurement indicators, data management, quality assurance and assessment strategies, so that data from many programmes can be used in integrated assessments that answer critical management questions, such as 303(d) listing. The SWAMP guidance document lists 25 'objectives for monitoring' to ensure that monitoring programmes are designed to answer regional and/or site-specific questions related to beneficial uses. These characteristics have made SWAMP a powerful tool for environmental regulation. For example, the sampling design, analytical QA/QC and close association with beneficial uses make it an ideal tool for 303(d) listing purposes (see Section 4.5). For these reasons, the SWRCB has invested considerable effort in making SWAMP easy to use, well supported and widely adopted for various environmental programmes at all levels (SWRCB, 2005a; Puckett, 2002). For example, nearly all environment-related monitoring programmes or projects now require a SWAMP-compatible quality assurance project plan (QAPP). The SWRCB SWAMP website has a user-friendly electronic advisor (see http://swamp.waterboards.ca.gov/swamp/qapp_advisor/).

4.4.2 Southern California Bight Regional Monitoring Programme

To improve the efficacy of existing monitoring programmes and to improve capacity for regional assessments, the Southern California Coastal Water Research Project (SCCWRP) initiated a series of monitoring efforts throughout the Southern California Bight (SCB) in 1994, 1998, 2003 and 2008 (see Figure 4.4 in the colour insert). A stratified sampling design was employed, dividing the SCB into subgroups (strata) that are internally homogeneous; each stratum was then sampled randomly. Statistically, this scheme gives more precise estimates of the overall average and inventory than simple random sampling. This sampling design has enabled SCCWRP to address many regional environmental issues (e.g. Bay et al., 2005; SCCWRP, 2007). Interestingly, the Bight 2003 sampling missed the notorious 'Station 6C', which is located on the Palos Verdes Shelf and is orders of magnitude more contaminated by DDTs (1,1,1-trichloro-2,2-di(4-chlorophenyl)ethane) and PCBs (polychlorinated biphenyls) than other sites. This caused some difficulty in estimating the DDT inventory of the SCB. On the positive side, the stratified approach enabled a good estimate of large-scale concentration gradients across the SCB, so the DDT flux to the open ocean and its contribution to the global background could be calculated (Zeng et al., 2005).
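The precision argument for stratification is easy to demonstrate numerically. The toy Python sketch below (not SCCWRP's design or data) estimates a region-wide mean by sampling each stratum independently and weighting the stratum sample means by relative stratum size:

```python
import random

def stratified_mean(strata, n_per_stratum, seed=1):
    """Stratified random sampling: sample each stratum independently and
    weight the within-stratum sample means by relative stratum size."""
    rng = random.Random(seed)
    total = sum(len(s) for s in strata)
    return sum(
        (len(units) / total)
        * (sum(rng.sample(units, n_per_stratum)) / n_per_stratum)
        for units in strata
    )

# Toy population: a clean stratum and a contaminated one (arbitrary units)
clean = [random.gauss(1.0, 0.2) for _ in range(400)]
hot = [random.gauss(10.0, 3.0) for _ in range(100)]
print(round(stratified_mean([clean, hot], n_per_stratum=20), 2))
```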

4.5 CALIFORNIA STATE 303(D) LISTING POLICY

Many water quality planning activities in California start with the listing of the water body in question on the impaired water body list, the CWA Section 303(d) list, owing to exceedances of WQSs or other violations. Because the listing necessitates the


development of the dreaded TMDLs (discussed in the next section), regulatory agencies and regulated entities alike put great effort into the process. In a nutshell, the 303(d) listing is a process where the Regional Board assembles all available water quality data, compares them against applicable WQC, and makes a decision on whether to list or de-list a water-body–pollutant combination based on default binomial statistics (other statistical methods could also be used in the listing decision) (California SWRCB, 2004a). Because sampling data are imperfect and randomly variable, verifying compliance using statistics is the preferred way (California SWRCB, 2004b). Using statistics in 303(d) listing improves the decision-making process, but it does not change the WQC or the calculation of effluent limits. Since issues about WQC and monitoring have been discussed above, this section focuses on the rationale of the final step, that is, the listing decision based on statistical evaluations.

The 303(d) listing, as a decision-making process using experimental/observational data, is itself a statistical hypothesis test. In statistics and other scientific disciplines, likely hypotheses are never proven; instead they are simply not rejected until another hypothesis takes their place (California SWRCB, 2004b). Therefore, rejecting a hypothesis with sufficient confidence is a stronger argument than retaining one because of insufficient confidence in rejecting it. Hypothesis testing begins by selecting a null hypothesis, which is usually believed to be true or simply acts as a basis for argument. Whether the null hypothesis is rejected or not, information about the sample population is inferred with a known degree of confidence. Sometimes, an alternate hypothesis ($H_a$) can be considered if the null hypothesis is rejected. Owing to the high cost of developing and implementing TMDLs for a listed water body, and because it takes more samples (and thus more effort and cost) to de-list a water body than to list it (discussed later in this section), the initial decision to list or not to list should have enough statistical significance that the chance of an incorrect decision is sufficiently small. Similar to a judicial trial, where a suspect is presumed innocent until proven guilty beyond reasonable doubt, the 303(d) listing process starts with a null hypothesis ($H_0$) that the water body is unimpaired. The alternative hypothesis ($H_a$) is that the water body is impaired. The choice of a null hypothesis seems arbitrary, but its form is important because it is desirable to control both Type I (false positive) and Type II (false negative) errors. Type II errors are not easily controlled by statistical manipulations, but they can be effectively controlled by increasing the sample size and/or increasing the effect size, which is the 'gray region' where the consequences of decision errors are insignificant. The probability of observing $k_{\text{list}}$ or more exceedances in $N$ samples when the true exceedance rate is $r_1$ is

$$\alpha = P(k \geq k_{\text{list}} \mid r_1, N) = \sum_{k = k_{\text{list}}}^{N} \frac{N!}{k!\,(N-k)!}\, r_1^{\,k} \,(1 - r_1)^{N-k} \qquad (4.5)$$

where $\alpha$ is the probability of a Type I error, i.e. the probability of incorrectly listing a clean water body; $k$ is the number of exceedances; $k_{\text{list}}$ is the minimum number of exceedances required to list; $N$ is the total number of samples; and $r_1$ is the exceedance rate. Equation (4.5) can be calculated easily using the Excel function BINOMDIST():

$$\alpha = \mathrm{BINOMDIST}(N - k_{\text{list}},\, N,\, 1 - r_1,\, \mathrm{TRUE}) \qquad (4.6)$$

Similarly, the probability of not rejecting the alternate hypothesis is

$$\beta = \mathrm{BINOMDIST}(k_{\text{list}} - 1,\, N,\, r_2,\, \mathrm{TRUE}) \qquad (4.7)$$

where $r_2$ is the alternate exceedance rate, and $\beta$ is the Type II error for the alternate


hypothesis $H_a$ (the water body is impaired), that is, the probability of failing to reject the wrong hypothesis that the water body is unimpaired. A balanced $\alpha$ and $\beta$ error approach is one that selects $H_0$, $H_a$, $\alpha$ and $\beta$ such that the probability of erroneously listing an unimpaired water body is balanced by the probability of erroneously not listing an impaired water body. At the same time, both errors should be sufficiently small that the consequence of either error is insignificant. For example, in a listing-decision evaluation for a toxic pollutant, a total of 72 samples are collected. The null hypothesis is that the exceedance rate is less than 3% (10% for a conventional pollutant). The alternate hypothesis is that the exceedance rate is greater than 18% (25% for a conventional pollutant); note that in both cases the effect size is 15%. Based on the error-balancing scheme, the listing (i.e. not-to-exceed) threshold is determined by calculating pairs of $\alpha$ and $\beta$ values for numbers of exceedances ranging from 0 (no exceedances) to 72 (all samples exceeded the WQC). When the number of exceedances corresponds to a rate smaller than 3% (two or fewer exceedances) or greater than 18% (13 or more exceedances), $\alpha$ and $\beta$ are sufficiently different (one near 1, the other near 0) that no further consideration is needed. Between 2 and 13, $\alpha$ and $\beta$ converge, and at one point (here, seven exceedances) they are closest; that is, the Type I and Type II errors are approximately balanced at this point. The corresponding number of exceedances, 7, is the threshold 'not-to-exceed' number of exceedances, at which the probability ($\alpha$) of listing the water body when the exceedance rate is no greater than 3% is nearly equal to the probability ($\beta$) of not listing it when the exceedance rate is greater than 18%. Note that at the point of convergence, both $\alpha$ and $\beta$ (the error rates) must also be sufficiently small (<0.2). Figure 4.5 is a graphical representation of the process described above.

Figure 4.5: Calculation of the number of exceedances for a toxic pollutant required to put a water body on the CWA 303(d) list, based on a total number of samples of 72. The binomial calculations used here are listed in Table 4.3. [Plot: curves for alpha, beta and alpha − beta, showing probability (0 to 1.0) against number of samples (1 to 71); the curves converge near 7 exceedances.]


Table 4.3: Data set used in Figure 4.5. k is the number of exceedances tested by the binomial test. The total number of samples N = 72. The listing threshold is at k = 7, where α, β < 0.2 and |α − β| is minimised.

k    α        β        |α − β|
1    0.8884   0.0000   0.8884
2    0.6400   0.0000   0.6400
3    0.3672   0.0001   0.3671
4    0.1703   0.0005   0.1698
5    0.0653   0.0020   0.0633
6    0.0211   0.0064   0.0147
7    0.0059   0.0173   0.0114
8    0.0014   0.0399   0.0384
9    0.0003   0.0801   0.0798
10   0.0001   0.1428   0.1428
11   0.0000   0.2296   0.2296
12   0.0000   0.3370   0.3370

The de-listing of a water body from the 303(d) list follows a similar rationale, with $H_0$ and $H_a$ essentially switching places, but takes a more conservative approach: for the same number of samples, the threshold number of exceedances to de-list a water body is always one less than that to list it (State Listing Policy, California SWRCB, 2004a, tables 3.1 and 4.1). For example, six or fewer exceedances are allowed for 72 samples in order to de-list a water body.
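The α and β columns of Table 4.3 can be reproduced without Excel. The short Python sketch below mirrors the BINOMDIST() calls of Equations 4.6 and 4.7 for the 72-sample toxic pollutant example:

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), i.e. BINOMDIST(k, n, p, TRUE)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

N, r1, r2 = 72, 0.03, 0.18      # samples; null and alternate exceedance rates
for k_list in range(1, 13):
    alpha = binom_cdf(N - k_list, N, 1 - r1)   # Type I error (Equation 4.6)
    beta = binom_cdf(k_list - 1, N, r2)        # Type II error (Equation 4.7)
    print(k_list, round(alpha, 4), round(beta, 4))
# At k_list = 7 both errors are below 0.2 and closest: the listing threshold.
```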

4.6 TOTAL MAXIMUM DAILY LOADS

A TMDL is the maximum amount of a pollutant that a water body can receive and still safely meet WQSs. Each TMDL must account for all sources of the pollutant, including point sources (waste load allocations (WLAs)), non-point sources and natural background (load allocations (LAs)), and a margin of safety (MOS): TMDL = LA + WLA + MOS. TMDLs allocate allowable pollutant loads to each source and identify management measures that, when implemented, will ensure that WQSs are attained. In California, the Porter–Cologne Act requires TMDLs, with their associated implementation plans, to be adopted into the Basin Plans through the basin planning process (in contrast, the CWA does not require the implementation of TMDLs). Currently there are nearly 2000 water-body–pollutant combinations and over 400 TMDL projects in California, so the challenge of drafting and implementing these TMDLs is daunting. Statistics are applied to many aspects of TMDLs: WLAs, LAs, the MOS, monitoring, 303(d) listing, numeric targets, allocations, implementation and compliance assessment all involve statistics. Owing to the space limit, only one issue is discussed here: the translation of non-daily TMDLs into 'truly' daily loads as a result of a recent court case (Friends of the Earth vs. US EPA, 2007). TMDLs are often expressed as non-daily loads for a number of reasons. First, many WQSs (e.g. chronic criteria) are not daily standards, and concentration-based WLAs and LAs may be more appropriate. Second, many pollutant source types and receiving-water physico-chemical processes cannot be described in daily terms. Third, the assessment of cumulative loading impacts is more pertinent to achieving WQSs,


therefore long-term allocations are appropriate and informative from a management perspective (US EPA, 2007). For example, both the nutrient and the sediment TMDLs for the San Diego Creek watershed, Orange County, California (TMDLs completed before the court case in question), use annual and seasonal loads as end points, because the sediment and associated nutrient loadings are strongly seasonal and daily loadings are impractical. However, owing to the lawsuit above (Friends of the Earth vs. EPA and others; see US EPA, 2007), newly developed TMDLs have to incorporate a true daily term in both WLAs and LAs, so a conversion has to be made. There are methods to express loads in daily terms based on static or dynamic expressions. The static approach is more suitable where temporal variations are small or well defined, so that the maximum daily load value can be set to represent the allowable upper limit of load values consistent with the long-term average (LTA) required by the TMDL. A dynamic approach is required where flows and loads vary greatly, such as the sediment loads of a river in a semi-arid region. In this case, both flow and sediment loads need to be simulated fairly closely using dynamic models: a flow-duration model, a watershed model (such as the hydrologic simulation program FORTRAN (HSPF)) or other models (Flynn, 2003). Subsequently, a percentile (e.g. the 90th, 95th or 99th percentile) can be used to set the daily maximum loads. The choice of percentile depends on factors such as confidence in the original analysis representing field conditions and the type of error associated with the analysis, that is, the balance of Type I and Type II errors. If the loads are normally distributed, the maximum daily load can be calculated from the average daily load as

$$\mathrm{MDL} = \mu + Z_p \sigma = \mu + Z_p (\mathrm{CV}\,\mu) \qquad (4.8)$$

where MDL is the maximum daily limit; $\mu$ is the mean of the distribution (in this case, the average load needed to achieve the WQS); $\sigma$ is the standard deviation of the daily loads; CV is the coefficient of variation of the daily loads (standard deviation divided by the mean); and $Z_p$ is the z-score of the $p$th percentage point of the standard normal distribution (z-scores are published in basic statistical reference tables and are often available as a spreadsheet function, e.g. NORMSINV() in MS Excel; for the 95th percentile, $Z_{95\%}$ = 1.645, and for the 99th percentile, $Z_{99\%}$ = 2.326). If the loads are lognormally distributed instead (more often the case for pollutant loading), the MDL is calculated from the LTA as

$$\mathrm{MDL} = \mathrm{LTA} \times \exp\left( Z_p \sigma_y - 0.5 \sigma_y^2 \right) \qquad (4.9)$$

where $\sigma_y^2 = \ln(\mathrm{CV}^2 + 1)$ is the variance of the log-transformed daily loads.

For example, in the draft selenium TMDL for the Newport Bay watershed, Orange County, California, concentration-based allocations are set as the numeric targets, based on site-specific objectives for tissue concentrations in bird eggs and fish tissue. The corresponding water column guideline of 13 µg/L is used as a concentration-based allocation. This allocation is deemed a quarterly (90-day) average (i.e. LTA = 13 µg/L in Equation 4.9); therefore, the multiplier (the exponential term in the above


equation) is calculated from the CV of 0.35 and $Z_p$ = 2.291 to be 2.064. Therefore, the daily maximum 'load' is 13 × 2.064 ≈ 27 µg/L (SARWQCB, 2010). The fact that this TMDL is concentration-based has simplified the process of calculating the MDL. If it were load-based instead, the conversion would involve considering orders-of-magnitude variations in flow and in the associated selenium concentrations, and the modelling and calculations would be substantially more complicated.
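The selenium multiplier is a one-line computation. A minimal Python sketch, using the CV and $Z_p$ quoted above:

```python
import math

def mdl_multiplier(cv, z_p):
    """LTA-to-maximum-daily-load multiplier for lognormal loads (Equation 4.9)."""
    sigma2 = math.log(cv**2 + 1)      # variance of the log-transformed loads
    return math.exp(z_p * math.sqrt(sigma2) - 0.5 * sigma2)

m = mdl_multiplier(cv=0.35, z_p=2.291)
print(round(m, 3), round(13 * m, 1))  # close to the 2.064 and ~27 ug/L above
```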

4.7 IMPLEMENTATION OF ENVIRONMENTAL REGULATIONS

4.7.1 State Implementation Plan (SIP)

The Policy for Implementation of Toxics Standards for Inland Surface Waters, Enclosed Bays, and Estuaries of California, or State Implementation Plan (SIP), is the policy that implements the California Toxics Rule (CTR) by establishing a standardised approach for permitting discharges of toxic pollutants to non-ocean surface waters to ensure achievement of WQSs. Among the numerous aspects of this policy, an example is given below for the calculation of effluent limitations, which uses statistics-based conversions between chronic and acute WQSs. First, the effluent concentration allowance (ECA) is calculated by considering the WQC, flow-based dilution credit and background concentration (not detailed here). Second, the LTA discharge condition is calculated using the ECA corrected by the multipliers (US EPA, 1991)

ECA\ multiplier_{acute99} = e^{(0.5\sigma^2 - z\sigma)} \qquad (4.10)

and

ECA\ multiplier_{chronic99} = e^{(0.5\sigma_4^2 - z\sigma_4)} \qquad (4.11)

where σ is the standard deviation of the daily concentrations and σ² = ln(CV² + 1); CV is the coefficient of variation of the daily concentrations (standard deviation divided by the mean); and σ₄² = ln(CV²/4 + 1). Based on the multipliers above, two LTA discharge conditions can be calculated, and the lower one is used. Finally, to calculate the average monthly effluent limitation (AMEL) and the maximum daily effluent limitation (MDEL) based on the LTA, the following multipliers are used:

AMEL\ multiplier_{95} = e^{(z\sigma_n - 0.5\sigma_n^2)} \qquad (4.12)

and

MDEL\ multiplier_{99} = e^{(z\sigma - 0.5\sigma^2)} \qquad (4.13)

where σ is the standard deviation of the daily concentrations, σ² = ln(CV² + 1) and σ_n² = ln(CV²/n + 1), with n the number of samples per month; CV is the coefficient of variation of the daily concentrations (standard deviation divided by the mean); and Z_p is the z-score of the pth percentage point of the standard normal distribution (z-scores are published in basic statistical reference tables and are often included as a spreadsheet function (e.g. NORMSINV(y) in MS Excel); for the 95th percentile, Z_95% = 1.645, and for the 99th percentile, Z_99% = 2.326).
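To make the chain of multipliers concrete, here is a minimal sketch of the ECA → LTA → AMEL/MDEL calculation following Equations 4.10–4.13; the ECA, CV and n values are illustrative assumptions, not SIP defaults.

```python
# Sketch of the SIP/TSD effluent-limit statistics (Equations 4.10-4.13).
# The eca, cv and n inputs are hypothetical example values.
import math
from scipy.stats import norm

eca, cv, n = 10.0, 0.6, 4                    # illustrative inputs
z99, z95 = norm.ppf(0.99), norm.ppf(0.95)    # 2.326 and 1.645

s2 = math.log(cv**2 + 1)                     # sigma^2
s4_2 = math.log(cv**2 / 4 + 1)               # sigma_4^2 (4-day averaging)
sn_2 = math.log(cv**2 / n + 1)               # sigma_n^2 (n samples/month)

lta_acute = eca * math.exp(0.5 * s2 - z99 * math.sqrt(s2))
lta_chronic = eca * math.exp(0.5 * s4_2 - z99 * math.sqrt(s4_2))
lta = min(lta_acute, lta_chronic)            # the lower LTA governs

amel = lta * math.exp(z95 * math.sqrt(sn_2) - 0.5 * sn_2)
mdel = lta * math.exp(z99 * math.sqrt(s2) - 0.5 * s2)
print(f"LTA = {lta:.2f}, AMEL = {amel:.2f}, MDEL = {mdel:.2f}")
```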

4.7.2 National Pollutant Discharge Elimination System

As authorised by the CWA, the NPDES permit programme controls water pollution by regulating point sources that discharge pollutants into waters of the USA. California has been delegated authority for the NPDES permit programme, including stormwater permits, for all areas except Native American tribal lands. The Regional Boards and the State Board administer NPDES permits, with the EPA providing oversight. Note that the TMDL (discussed in Section 4.5), as powerful a regulatory tool as it is, is not self-executing in California, despite the fact that an implementation plan is required and the Basin Plan is amended for each adopted TMDL. Rather, TMDLs are executed via various permits, with NPDES permits being the primary vehicle. An NPDES permit provides two levels of control: technology-based limits, based on the ability of dischargers in the same industrial category to treat wastewater, and water quality-based limits, applied if technology-based limits are not sufficient to protect the water body. The previous subsection on the SIP described the process of deriving water quality-based effluent limits (WQBELs). For technology-based effluent limits, the CWA mandates that the EPA establish national technology-based regulations, known as effluent guidelines and pretreatment standards, to reduce pollutant discharges from categories of industry discharging directly to waters of the USA or indirectly through publicly owned treatment works (POTWs). The EPA has provided extensive guidance on technology-based effluent limits (e.g. US EPA, 2009).

4.7.3 Dry Weather Monitoring Programme in Orange County, California

Many municipal-level (county and city) environmental agencies in California spend more resources on NPDES permit compliance than on any other programme. Owing to the complexity of the NPDES system (e.g. SWRCB/SARWQCB, 2009), only the Dry Weather Monitoring (DWM) Programme of the Orange County Watersheds Programme (www.ocwatersheds.com) is described here as an example. The principal goal of the DWM Programme is to detect and eliminate illegal discharges and illicit connections to the municipal separate storm sewer systems (MS4s), which could convey pollutants that may cause or contribute to exceedances of receiving water quality objectives (Bernstein et al., 2008). Owing to the sheer number of potential sites, the programme has two distinct elements. The first, fixed/targeted element focuses on high-priority sites and those that have chronic problems. The second is a stratified probabilistic element that randomly selects sites every year to establish the regional urban background, which can then be used to prioritise sites in the future. The process of the second element is shown in Figure 4.6.


After the DWM sites have been selected, field teams collect end-of-pipe grab samples and measure a suite of physical and chemical parameters via a combination of field screening and laboratory analysis. Tolerance intervals are used to detect excursions of water quality for further action. Tolerance intervals (Figure 4.7) are a quantitative, rigorous method for incorporating and addressing the variability in background conditions when searching for data values that are significantly different from the background or, in the DWM case, indicative of an illegal discharge/illicit connection (Bernstein et al., 2008; Smith, 2002). Calculated from site data, a tolerance interval bound is the upper or lower confidence interval bound of a percentile of the background data distribution, as shown in Figure 4.7. Once one or more constituents at a site exceed the corresponding tolerance intervals, which often suggests an illegal discharge and/or illicit connection, the County of Orange or the responsible city (NPDES co-permittee) will be notified and further action will be taken to address the water quality issue.
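As an indicative sketch only (assuming approximately normal background data; the programme's actual bounds follow the random-model tolerance intervals of Smith (2002), which are not reproduced here), a one-sided upper tolerance bound with 90% coverage and 95% confidence can be computed from the noncentral t distribution:

```python
# Sketch: one-sided (90% coverage, 95% confidence) upper tolerance bound
# for roughly normal background data; all values are hypothetical.
import numpy as np
from scipy.stats import norm, nct

rng = np.random.default_rng(1)
background = rng.normal(8.0, 2.0, size=30)   # hypothetical background data

n = background.size
p, conf = 0.90, 0.95
# k-factor via the noncentral t construction for normal tolerance bounds
k = nct.ppf(conf, n - 1, norm.ppf(p) * np.sqrt(n)) / np.sqrt(n)
utl = background.mean() + k * background.std(ddof=1)

print(f"upper tolerance bound: {utl:.2f}")   # new values above it flag review
```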

ACKNOWLEDGEMENTS

This chapter is dedicated to Dr Teh-Fu Yen, who passed away on 12 January 2010 and who inspired the author's interest in the environmental sciences. Thanks are also given to Dr Stuart Goong (OC Watersheds Programme), Dr Yue Rong (Los Angeles Regional Water Board) and Steven Saiz (Central Coast Regional Water Board) for their careful review, and to Terri Reeder and Pavlova Vitale (both from the Santa Ana Regional Board) for their assistance. The author gratefully acknowledges Amanda Carr (OC Watersheds Programme) for her understanding and encouragement. The author is indebted to his statistician wife Yaling for her support and tolerance. Several anonymous reviewers also provided comments and suggestions. However, the author takes full responsibility for any errors in this chapter.

Figure 4.6: Dry weather monitoring site selection process for the NPDES Stormwater Programme of County of Orange, California, USA (Bernstein et al., 2008). [Flowchart: identify all facilities in the County database; add new developments into the database; select 39 inch pipes discharging to channels; define the final pool of potential sites; define strata based on watershed; estimate the urbanised area in each stratum; allocate the number of sites per stratum based on the proportion of urbanised area; randomly select sites to monitor.]


Figure 4.7: Schematic diagram of the tolerance interval for the dry weather monitoring programme (Bernstein et al., 2008; Smith, 2002). [Relative frequency versus parameter value, showing the underlying data distribution, its 0.90th quantile, and the distribution of tolerance interval bounds, with a 0.05 proportion of bounds less than the 0.90th quantile.]

REFERENCES

Bay, S.M., Mikel, T., Schiff, K., Mathison, S., Hester, B., Young, D. and Greenstein, D. (2005). Southern California Bight 2003 Regional Monitoring Program: I. Sediment Toxicity. Technical Report 451, Southern California Coastal Water Research Project (SCCWRP), Westminster, CA.
Bernstein, B., Moore, B., Sharp, G. and Smith, R. (2008). Assessing urban runoff program progress through a dry weather hybrid reconnaissance monitoring design. Environmental Monitoring and Assessment. 157, 1–4: 287–304.
California SWRCB (California State Water Resources Control Board) (2004a). Water Quality Control Policy for Developing California's Clean Water Act Section 303(d) List (commonly referred to as the State Listing Policy).
California SWRCB (California State Water Resources Control Board) (2004b). Functional Equivalent Document for Water Quality Control Policy for Developing California's Clean Water Act Section 303(d) List (commonly referred to as the State Listing Policy FED).
California SWRCB (California State Water Resources Control Board) (2004c). Functional Equivalent Document for Water Quality Control Policy for Developing California's Clean Water Act Section 303(d) List, Appendix D (draft), Interval estimators and hypothesis tests for data quality assessments in water quality attainment studies.
California SWRCB (California State Water Resources Control Board) (2005a). Comprehensive Monitoring and Assessment Strategy to Protect and Restore California's Water Quality. Surface Water Ambient Monitoring Programme (SWAMP), October 2005.
California SWRCB (California State Water Resources Control Board) (2005b). Policy for Implementation of Toxics Standards for Inland Surface Waters, Enclosed Bays, and Estuaries of California (State Implementation Plan (SIP)).
California SWRCB (California State Water Resources Control Board) (2009). Water Quality Control Plan for Enclosed Bays and Estuaries, Part 1. Sediment Quality (Sediment Quality Objectives (SQO)).
California SWRCB (California State Water Resources Control Board) and SARWQCB (Santa Ana Regional Water Quality Control Board) (2009). Waste Discharge Requirements for the County of Orange, Orange County Flood Control District, and the Incorporated Cities of Orange County within the Santa Ana Region. Order no. R8-2009-0030, NPDES no. CAS618030.
CWA (Clean Water Act) (1972). Federal Water Pollution Control Act Amendments of 1972, Title 33, ch. 26, Water pollution prevention and control, 33 U.S.C. § 1251 et seq. (commonly referred to as the Clean Water Act (CWA)).
Flynn, R.H. (2003). Development of Regression Equations to Estimate Flow Durations and Low-Flow-Frequency Statistics in New Hampshire Streams. US Geological Survey Water-Resources Investigations Report 02-4298. US Geological Survey, Reston, VA. See http://pubs.usgs.gov/wri/wri02-4298/wri02-4298.pdf.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold Company, New York.
Puckett, M. (2002). Quality Assurance Management Plan for the State of California's Surface Water Ambient Monitoring Program ('SWAMP'). California Department of Fish and Game, Monterey, CA. Prepared for the State Water Resources Control Board, Sacramento, CA.
Reiley, M.C., Stubblefield, W.A., Adams, W.G., Di Toro, D.M., Hodson, P.V., Erickson, R.J. and Keating, F.J. Jr (2003). Reevaluation of the state of the science for water-quality criteria development. Proceedings from the Pellston Workshop on Reevaluation of the State of the Science for Water-Quality Criteria Development, 25–30 June 1998, Fairmont Hot Springs, Montana. Society of Environmental Toxicology and Chemistry (SETAC), SETAC Press, Pensacola, Florida.
SARWQCB (Santa Ana Regional Water Quality Control Board) (2010). Draft Technical Staff Report for the Selenium Maximum Daily Loads and Site-Specific Objectives in Newport Bay Watershed, California.
Smith, R.W. (2002). The use of random-model tolerance intervals in environmental monitoring and regulation. Journal of Agricultural, Biological, and Environmental Statistics. 7, 1: 74–94.
SCCWRP (Southern California Coastal Water Research Project) (2007). Executive Summary of the Southern California Bight Regional Monitoring Program 2003 (Bight'03). SCCWRP Technical Report 386, SCCWRP, Westminster, CA.
State of California (1969). Porter–Cologne Water Quality Control Act (California Water Code Div. 7 et seq., commonly referred to as the Porter–Cologne Act).
US EPA (United States Environmental Protection Agency) (1986). Quality Criteria for Water (commonly referred to as the 'Gold Book'). United States Environmental Protection Agency, Office of Water Regulations and Standards, Washington DC.
US EPA (United States Environmental Protection Agency) (1991). Technical Support Document for Water Quality-Based Toxics Control. US EPA Office of Water, Washington DC.
US EPA (United States Environmental Protection Agency) (2000). Water Quality Standards; Establishment of Numerical Criteria for Priority Toxic Pollutants for the State of California. Rule (California Toxics Rule (CTR)), 40 CFR Part 131, Federal Register, vol. 65, no. 97, 18 May 2000.
US EPA (United States Environmental Protection Agency) (2003). Strategy for Water Quality Standards and Criteria – Setting Priorities to Strengthen the Foundation of Protecting and Restoring the Nation's Waters. US EPA Office of Science and Technology.
US EPA (United States Environmental Protection Agency) (2007). Technical Guidance on Options of Expression of Daily Loads in TMDL (draft). US EPA Office of Wetlands, Oceans, and Watersheds, Washington DC.
US EPA (United States Environmental Protection Agency) (2009). Technical support documents for the preliminary 2010 effluent guidelines program plan, October 2009.


US EPA (United States Environmental Protection Agency) (2010). Introduction to Clean Water Act. See www.epa.gov/watertrain/cwa/ (access date 17 January 2010).
Wheater, C.P. and Cook, P.A. (2000). Using statistics to understand the environment. In: Gardner, R. and Mannion, A.M. (Eds), Introductions to Environment Series: Environmental Science. Routledge (Taylor and Francis Group), London and New York.
Zeng, E.Y., Tsukada, D., Diehl, D.W., Peng, J., Schiff, K., Noblet, J. and Maruya, K. (2005). Distribution and mass inventory of total dichlorodiphenyldichloroethylene in the water column of the Southern California Bight. Environmental Science and Technology. 39: 8170–8176.

CHAPTER 5

Solving Complex Environmental Problems Using Stochastic Data Analysis: Characterisation of a Hydrothermal Aquifer Influenced by a Karst, Example of Rennes les Bains, France

Alain Mangin and Farid Achour

5.1 INTRODUCTION

Hydrothermal systems represent deep aquifers, the investigation and study of which are difficult and very costly. These difficulties are related to their significant depth, the long residence time of water in the aquifer, low groundwater velocities in case a tracer test is planned, and recharge area(s) that often are not localised. Thus, it is primarily geochemical, isotopic or temperature investigative approaches that are used to determine their characteristics. These standard investigative approaches can identify dominating chemical elements, residence time and sometimes the temperatures acquired at depth (Barnes, 1979; Blavoux, 1991; Blavoux and Berthier, 1985; Blavoux et al., 1982; Burger et al., 1985; Carrie, 1991; Elder, 1981; Ellis and Mahon, 1977; Ghafouri, 1968; Helgeson, 1969; de Launay, 1899; Michard, 1989; Moret, 1946; Rambaud, 1991; Schoeller, 1975; Schoeller and Schoeller, 1976; 1982; Valat, 1971).

Proper management of groundwater resources requires an accurate evaluation of the parameters (hydraulic properties) that control the movement and storage of water. To determine the hydrodynamic parameters of a deep aquifer such as a hydrothermal aquifer, a new approach based on the analysis of the fluctuation and evolution of piezometric levels is proposed in this chapter. Several methods are used, among them standard methods, such as the interpretation of pumping test results, and new methods, such as signal-processing techniques applied to the piezometric level time series collected within a 1460 m deep well tapping the hydrothermal aquifer of Rennes les Bains (France). These signal-processing techniques are correlation and spectral analyses, continuous Morlet wavelet analysis, orthogonal multiresolution wavelets, 1/f^n noise analysis, fractals, and reconstructed attractors. After identification of the effects of earth tides and barometric pressure on the hydrothermal aquifer, these techniques were applied to the piezometric level time series and yielded values for the storage coefficient and porosity. The results obtained are consistent with the information provided by geochemical, isotopic and thermal investigations. In addition, these techniques revealed the existence of a 'noise', which was analysed using reconstructed attractors; this noise was caused by the existence of a thermal convection. This thermal convection can be explained by the existence of a local hydrothermal karst located at the top of the Devonian formation, which corroborates the anomaly observed during the pumping test. The information obtained led to better management of this hydrothermal aquifer and indicated the need to reconsider the geological structure of this sector.

5.2 PRESENTATION OF THE RENNES LES BAINS SITE AND WATER GEOCHEMISTRY

The thermal spa of Rennes les Bains is located in Aude County, 50 km south of Carcassonne. The area corresponds to the foreland of the Pyrenees at the limit of the Mouthoumet Massif (Figure 5.1). The geology of Rennes les Bains is generally well known (Arthaud et al., 1976; Barnolas et al., 1996; Bessière, 1987; Bilotte, 1985; Bouchaala, 1991; Bresson, 1908; Khufuss, 1981; Yvroux, 1997; Von Gaertner, 1937), but many questions regarding the detailed geology remain unanswered. These questions include the organisation of the geological layers in the eastern part of the Mouthoumet Massif and the relationship of the massif with the Alet sedimentary basin.

Figure 5.1: Schematic map of the tectonic units of the Mouthoumet Massif, according to Bessière & Schulze. [Map showing the Rennes les Bains unit, the relative autochthonous Roc de Nitable units, the Félines-Palairac unit and the Serre de Quintillan unit, with the localities Lairière, Alet, Mouthoumet, Albas, Villerouge, Félines, Quintillan, Palairac, Pechcardou, Rennes les Bains, Tuchan, Padern and Durban.]


Thus, as demonstrated by Bessière (1987), the Cardou portion contains geological thrust faults related to the Pyrenean tectonics, suggesting that the deep structure is complex, as shown in the interpretive section produced by Yvroux (1997) (Figure 5.2). The aquifer formations are the Devonian and basal Carboniferous limestone, which are overlain by the impermeable formations of the Carboniferous Culm facies, thereby forming a confined aquifer.

Until 1994, attempts to characterise the thermal activity were exclusively based on the use of the following hot springs: Bains Forts, Gieulles Marie, la Reine, and Bains Doux. The hot springs have been exploited since Roman times. Following the discovery of a bacterial contamination within these springs, a well was installed by Aude County as a replacement for the springs. For practical reasons, the well could only be positioned 300 m from the hot springs. A thermal anomaly was observed at this well: while the temperature reached values of 45°C at the springs, it was only 34.5°C at the bottom of the well, at a depth of 1460 m. The measurement of the temperature gradient as a function of depth (Figure 5.3) provides a relatively low value of 1.3°C/100 m.

The geochemical analysis of the water from the well and the springs indicated that they originated from the same reservoir. These waters are classified as calcium bicarbonate waters. Several isotopic analyses were performed (Yvroux and Olive, 2004). Carbon-14 results indicate a very wide range of residence times, varying from 9000 years for the well to 16 000 years for the springs. Without discussing these results further, this indicates very long circulation times in the aquifer, and thus low hydraulic permeabilities.

Figure 5.2: Interpretive hydrogeological cross-section through Rennes les Bains (Yvroux, 1997). [NNE–SSW cross-section showing the Fontaine Salée branch, the Cardou branch, the recharge basin, the Rennes-les-Bains springs and the 1460 m well; legend: 1 Meso-Cenozoic, 2 Trias, 3 Carboniferous, 4 Devonian, 5 Ordovician–Silurian, 6 cold water, 7 thermal water.]


Figure 5.3: Temperature gradient of the Rennes les Bains well. [Temperature (°C) versus depth (m), 0–1500 m; fitted slope a = 0.013, r = 0.998.]

The oxygen isotope data provide an estimate of the altitude of the recharge area; according to the authors cited, this area would be found 10 km northeast of the springs, in the Valmigère-Bouisse sector.

5.3 ANALYSIS OF PIEZOMETRIC TIME SERIES

The groundwater level fluctuations in the Rennes les Bains well provide an excellent physical observable (a variable whose temporal evolution is similar to the temporal evolution of the system) that explains the dynamics of the hydrothermal aquifer (Bos, 1995; Heudron, 1999). By analysing the recorded time series, the dynamic properties of waters flowing within the aquifer can be determined. The available time series evaluated here consists of readings collected at 5 min intervals. The methods discussed here range from models to signal-processing methods, and even approaches related to dynamical systems.

5.3.1 Information provided by the pumping test

The data set considered here was collected between 27 June 1994 and 12 January 1995 but includes a number of interruptions; therefore, only a portion of the recordings can be analysed. The most complete portion of the record began on 4 October 1994 and extended for a period of 100 days with a constant pumping rate of 20 m³/h. The recovery period began on 13 January 1995 and extended beyond 27 February, at which date the curve was no longer usable owing to various disturbances. The duration of the recovery was estimated at 45 days.


Theis–Jacob Recovery Method
Given the available data, only one well was available and it was used for pumping; owing to various technical problems that occurred during the pumping, the study focused solely on the recovery curve. In this case, the graphic solution from the approximation method of Theis–Jacob (Castany, 1963; De Marsily, 1981) was used. This approach is judged appropriate because the aquifer is confined, has slow flow velocities, and the test included very long pumping and recovery times. This method is based on the determination of a representative straight line on the recovery curve, which is a semi-logarithmic plot of the residual drawdown (s) against a time equal to the sum of the time counted from the cessation of pumping (t) and the duration of the pumping (t_p), in seconds. The residual drawdown (s), in metres, is the difference between the piezometric level before pumping and the drawdown observed during the recovery. It should be noted that this method allows the calculation of the transmissivity, but not of the storage coefficient. The recovery curve can be written as

s = \frac{0.183\,Q}{T} \log\left(1 + \frac{t_p}{t}\right) \qquad (5.1)

with Q (m³/s) equal to the pumping rate and T (m²/s) equal to the transmissivity. The slope of the line (a) is equal to

a = \frac{0.183\,Q}{T} \qquad (5.2)

The value of T may then be recalculated.

Results
Of the 45 days of recovery (Figure 5.4), despite some technical incidents that occurred between 11 and 13 January, it was possible to employ the logarithmic Theis–Jacob approximation method. From 10 to 23 January, the recovery is very steep, and this part of the curve can be interpreted as indicating an unusually long-lasting capacity effect (Figure 5.5). This effect would indicate the presence of particular hydrogeological conditions in the area reached by the well. However, from 23 January, on the semi-logarithmic plots it is possible to fit a straight line with a slope of 4 (Figure 5.5). The transmissivity is therefore

T = \frac{0.183 \times 20}{4 \times 3600} = 2.5 \times 10^{-4}\ \mathrm{m^2/s} \qquad (5.3)

The thickness of the aquifer is estimated to be equal to the thickness of the carbonate formations of the Devonian and basal Carboniferous units in the area, or approximately 450 m. Under these conditions, the Darcy permeability would be

k \approx 5.5 \times 10^{-7}\ \mathrm{m/s} \qquad (5.4)

This permeability is quite consistent with what is known about the usual permeabilities of fractured limestones, and combined with the isotopic data, allows us to estimate a flow velocity of about 1 m/year.
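These few steps are easy to reproduce; the sketch below simply restates the arithmetic of Equations 5.2–5.4 with the values quoted in the text (20 m³/h pumping rate, slope of 4, 450 m thickness).

```python
# Sketch of the Theis-Jacob recovery arithmetic with the values above.
Q = 20 / 3600      # pumping rate: 20 m3/h expressed in m3/s
a = 4.0            # slope of Jacob's straight line on the semi-log plot
e = 450.0          # estimated aquifer thickness, m

T = 0.183 * Q / a  # transmissivity, m2/s  -> ~2.5e-4
k = T / e          # Darcy permeability, m/s -> ~5.6e-7 (5.5e-7 in the text)
print(f"T = {T:.2e} m2/s, k = {k:.2e} m/s")
```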


Figure 5.4: Recovery curve showing drawdown versus time. [Water level (m), 16–26 m, versus time from 1 January to 20 February 1995.]

Figure 5.5: Drawdown versus time on a semi-log plot with Jacob's straight line. [Residual drawdown (m) versus log(1 + tp/t); fitted straight line of slope a = 4, with an early capacitive effect visible.]

The estimated intrinsic permeability was calculated to be about 5 × 10⁻¹⁴ m² (1 Darcy corresponds to about 1 × 10⁻¹² m²). This calculated permeability represents the global permeability of the aquifer. An anomaly nevertheless remains in the vicinity of the well, resulting in a strong capacity effect that lasted 13 days. If the flow rate of the aquifer considered in the model is similar to the pumping rate, this effect would represent a volume of 11 000 m³.


5.3.2 Analysis of long-term variations

The long-term variations in the time series are represented by events occurring over several days to several months, given the recording time step (5 min) and the length of the available time series (less than 2 years). These variations help to explain the filling and/or emptying of the aquifer.

Methods used
The methods used for this study are correlation and spectral analyses. These methods are based simultaneously on the theoretical analysis of time series (Box and Jenkins, 1976; Bras and Rodriguez-Iturbe, 1993; Jenkins and Watts, 1968; Yevjevich, 1972) and on geostatistics (Matheron, 1965), and are considered to constitute a signal-processing technique (Max, 1980). They are applied here to describe the temporal variations of the studied system and to identify the physical processes that are involved (Mangin, 1981a; 1981b; 1984; 1994). These analyses decompose the signal using different time scales: a long-term trend; variations in the medium term, corresponding to structured elements approached by periodic functions; and short-term fluctuations, characterising random components. They are carried out either in the physical domain (correlogram, variogram) or in the frequency domain (spectral density function associated with different functions: amplitude, phase, coherence, gain). The analysis is performed on the time series separately (simple analysis) or in connection with series believed to be responsible for the observed variations (cross-correlation). The components, after identification, are extracted separately by using specific filters (linear filters such as the first-order differentiation filter and the uniform moving average filter (Barbut and Fourgeaud, 1971)) for further analysis.

Spectral analysis
Applying these methods to the piezometric time series revealed the existence of trends in which it is possible to distinguish two periods: one in which there is a monotonic decrease of the water level, and a second period, called the 'influenced period', in which the water level increases, but according to a succession of pulses. To analyse these trends, the period from July 1995 to April 1996, with a daily time step, is taken as an example (Figure 5.6). The trend represents the largest observed change in the time series, as it exceeds 10 m. It is interpreted as being associated with periods of discharge and recharge of the aquifer. The detailed analysis leads to the conclusion that the mechanisms are more complex than was previously thought. Indeed, overall increases and decreases of the piezometric level follow a seasonal periodicity, notably with the rise of the piezometric level in early winter. This finding suggests a regular succession of discharge in the summer and recharge after the autumn rains. However, as shown in Figure 5.6, during the summer of 1995, when rainfall was relatively high, there was no impact on the piezometric level. A strong link is seen between rainfall and piezometric level only from early December 1995.

The analysis of the rainfall–piezometric level relationship provides a curious result. At this scale, rain can be considered as quasi-random, so the cross-correlogram depicts a good image of the impulse response function.


Figure 5.6: Rainfall–piezometric level time series from July 1995 to April 1996 (Couiza, Rennes les Bains). [Rainfall (cm) and piezometry (m) versus time; the influenced period is marked.]

However, this response (Figure 5.7) appears fast and abrupt, with two days' delay at most, which would be expected of a karstic aquifer as known in the literature, but certainly not of a deep hydrothermal aquifer, especially given the calculated hydraulic permeabilities. Furthermore, the absence of geochemical variations during these events rules out a rapid contribution of water to the aquifer.
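A bare-bones sketch of the cross-correlogram computation follows; the rainfall and piezometric series here are synthetic stand-ins (with an imposed two-day delay), not the Couiza data.

```python
# Sketch: cross-correlogram r(lag) between two standardised daily series.
import numpy as np

def cross_correlogram(x, y, max_lag):
    """Correlation between x[t] and y[t + lag], lag = -max_lag..max_lag."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    lags = np.arange(-max_lag, max_lag + 1)
    r = []
    for k in lags:
        if k >= 0:
            r.append(np.mean(x[: x.size - k] * y[k:]))
        else:
            r.append(np.mean(x[-k:] * y[: y.size + k]))
    return lags, np.array(r)

# synthetic data: 'piezometry' responds to 'rainfall' with a 2-day delay
rng = np.random.default_rng(0)
rain = rng.gamma(0.3, 5.0, size=300)
piezo = np.roll(rain, 2) + rng.normal(0, 0.5, size=300)
lags, r = cross_correlogram(rain, piezo, 25)
print(lags[np.argmax(r)])   # -> 2, the imposed delay
```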

Figure 5.7: Cross-correlogram between rainfall and piezometric level for the period of 1 December 1995 to 5 May 1996 (Couiza, Rennes les Bains). [Correlation versus lag (days, −25 to +25), showing a sharp positive peak near zero lag.]


Therefore, to explain these variations, it must be recognised that they reflect only the transmission of pressure to the hydrothermal aquifer from another aquifer, one that receives substantial recharge and is most likely karstic. Indeed, such an aquifer exists and is known as the Turonian limestone aquifer; it was encountered during the completion of the drilling activities and is located within the first 100 m below the ground surface. This result is very important for management purposes, as it shows that the behaviour of the deep hydrothermal aquifer is associated with that of the Turonian karstic aquifer.

5.3.3 Study of diurnal and semi-diurnal variations, and identification of short-term components

After removing the long-term trends, variations with amplitudes in the range of 10–20 cm are observed, and their meaning is sought.

Continuous and orthogonal wavelet analyses
Besides the correlation and spectral analyses mentioned above, it became necessary to seek new methods to assess and analyse non-stationarities. The first method used is wavelet analysis. Wavelet analysis can be considered as a complement to correlation and spectral analyses. This analysis makes it possible to follow the evolution in the time domain of components at different time scales (short, medium and long term); it is a time-scale analysis, and the resulting graph is a scalogram. In addition, the choice of a wavelet instead of a sine–cosine function (Fourier transform) leads to a more precise representation of the analysed signal (Burke Hubbard, 1995). Developed in 1984 by Grossman and Morlet, the theoretical foundations have been explained by many authors (Abry, 1997; Arneodo et al., 1995; Daubechies, 1992; Mallat, 1999; Meyer and Roques, 1992; Torrence and Compo, 1998). Detailed formulation of the concept and applications of wavelet analyses can be found in the aforementioned references.

Two methods are used: continuous wavelet analysis and multiresolution wavelet analysis. Continuous wavelet analysis uses a wavelet whose expansion or contraction may be controlled (Morlet wavelet, Mexican hat, Haar function), so that the scale components may be analysed continuously. The multiresolution wavelet method uses an orthogonal basis following a dyadic discretisation, and the scale components are independent of each other. Such an approach demonstrates the non-stationarities very clearly, revealing the link between different scale levels and isolating each component independently so that its own evolution can be studied. Multiresolution wavelets are an efficient method for identifying or de-noising hydrogeological signals when other perturbations, such as pumping or other intermittent processes, are superimposed on the main signal. In this sense, these tests lead to a better identification of the processes that are responsible for the observed variations. When applied in geophysics and hydrology (Foufoula-Georgiou, 1994; Labat et al., 1999a; 1999b; 2000a; 2000b; 2001; 2002a; 2002b), these methods provide useful results in the case of very complex aquifer systems; a minimal decomposition sketch is given below.
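As an indicative sketch (assuming the PyWavelets package; the chapter's own computations used the wavelet tools cited above), a dyadic multiresolution decomposition that isolates individual scale components might look as follows.

```python
# Sketch of a dyadic multiresolution decomposition with PyWavelets.
# The synthetic series stands in for a piezometric record; 8192 values
# match the dyadic length mentioned in the text.
import numpy as np
import pywt

t = np.arange(8192) * 5 / 60.0                  # 5 min steps, in hours
series = (0.10 * np.sin(2 * np.pi * t / 12)     # semi-diurnal (tide-like)
          + 0.05 * np.sin(2 * np.pi * t / 24)   # diurnal
          + 0.02 * np.random.default_rng(0).normal(size=t.size))

coeffs = pywt.wavedec(series, 'db4', level=8)   # orthogonal, dyadic scales

def component(coeffs, keep, wavelet='db4'):
    """Reconstruct one scale component by zeroing all other coefficients."""
    parts = [c if i == keep else np.zeros_like(c)
             for i, c in enumerate(coeffs)]
    return pywt.waverec(parts, wavelet)

detail = component(coeffs, 3)   # a single mid-scale detail, studied alone
print(detail.shape)
```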


The second method used is based on fractals. Indeed, the investigation of structures in the short term, such as noise, uses another concept based on fractals (Abry, 1997; Hardy and Beier, 1994). Noise corresponds to the short-term components of the signal and is usually considered to be (totally) unstructured. However, the noise, being itself a time series, can be studied using its spectral density function. Hardy and Beier (1994) present a noise-type classification based on the slope (β) of the spectral density function of the studied time series, plotted in a bilogarithmic diagram (the 1/f^n noise method). If this spectral density function follows a power-law distribution, its slope in a bilogarithmic diagram will characterise the nature of the noise (Figure 5.8).

Figure 5.8: Noise classification (modified from Hardy, H.H. and Beier, R.A., Fractals in Reservoir Engineering, World Scientific, 1994. Reprinted with permission). [Log–log plots of spectral density, log(Amp) versus log(ω), for slopes ranging from random noise (−1 < β < 1) to Brownian motion (−3 < β < −1).]

• Gaussian noise is characterised by −1 < β < 1. β = 0 corresponds to pure white noise, 0 < β < 1 corresponds to a high-pass filter, and −1 < β < 0 corresponds to a low-pass filter. Gaussian noise means that the time series is composed of independent stochastic variables; the noise contains no information.
• Brownian noise is characterised by −3 < β < −1. −2 < β < −1 corresponds to anti-persistent Brownian noise, with a poor correlation between variables. −3 < β < −2 corresponds to persistent Brownian noise, with a memory effect and a high correlation between variables.
• When β < −3, the variables are not stochastic and the noise is structured.
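In practice, β can be estimated by fitting a straight line to the periodogram in bilogarithmic coordinates; the short sketch below does this for two synthetic series whose theoretical slopes are known.

```python
# Sketch: estimate the spectral slope (beta) by fitting the periodogram
# in bilogarithmic (log-log) coordinates.
import numpy as np

def spectral_slope(x, dt=1.0):
    x = x - x.mean()
    f = np.fft.rfftfreq(x.size, d=dt)[1:]     # drop the zero frequency
    p = np.abs(np.fft.rfft(x))[1:] ** 2       # periodogram
    slope, _ = np.polyfit(np.log10(f), np.log10(p), 1)
    return slope

rng = np.random.default_rng(0)
white = rng.normal(size=4096)    # expected beta ~ 0 (pure white noise)
brown = np.cumsum(white)         # expected beta ~ -2 (Brownian motion)
print(spectral_slope(white), spectral_slope(brown))
```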

Naturally occurring stresses on aquifers: solar, lunar and atmospheric
Aquifers are subjected to mechanical stresses from natural processes, such as forcing of the aquifer by ocean and earth tides or atmospheric-pressure loading. Earth and ocean tides are the product of lunar and solar tidal forces. Changes in barometric pressure are induced by variations in temperature and atmospheric circulation. Fluctuations of groundwater pressure due to these stresses are often reflected in the records of water-level monitoring wells.

The gravitational influences of the sun and moon cause sea level to rise slightly at some locations on the surface of the earth and to drop at other locations. These sea-level oscillations vary periodically with the changing positions of the sun and moon at a given location and are referred to as ocean tides. The inland part of the earth formation, at a distance from the surface loading, responds hydraulically as if an oscillatory pumping/injection sequence were being performed; these stresses are referred to as earth tides. The identification of the existence of these natural stresses within a given aquifer allows the estimation of hydraulic parameters of the aquifer, such as the effective porosity and the storage coefficient.

Detection of the barometric and earth tide effects
Of all the piezometric fluctuation recordings, only two periods are usable: the first extends from 13 June 1995 to 5 May 1996, the second from 4 May 1997 to 30 September 1997. For convenience, these periods will be split into several portions. In particular, it is necessary for some analyses to have 8192 values in order to comply with a dyadic scale. When analysed separately, all periods provided the same results. The results presented correspond to two analysed periods: the first from 1 August to 29 August 1997, with a time step of 5 min, and the second from 13 June to 22 August 1995, with a time step of 1 h.

The Morlet wavelet analysis, after applying a first-order differentiation filter to remove the trend, was performed on the first period (Figure 5.9 – see colour insert). Different periodicities (12 h, 24 h and weekly), along with a very important noise (see the scalogram in Figure 5.9), are revealed. In addition, this scalogram indicates that these periodicities are non-stationary and are interdependent. The noise is itself influenced by the diurnal and semi-diurnal components.

The spectral analysis (Figure 5.10) shows that the main components are the trend and the diurnal and semi-diurnal components. After using a first-order differentiation filter, which eliminates the trend, it is found (Figure 5.11) that the diurnal and semi-diurnal components are mostly masked by noise.


Figure 5.10: Spectral density functions calculated on piezometric levels between 1 and 29 August 1997 with a time step of 5 min (raw data). [Spectral density versus frequency, with peaks marked at 24 h and 12 h.]

Also shown is a component at 8 h, but this is an artefact attributable to the fact that the diurnal variations are abrupt, which introduces secondary peaks in the Fourier transform; on the scalogram, the 8 h periodicity is absent. After using a uniform moving average filter with a 24 h amplitude, and after eliminating the noise, the spectral density function (Figure 5.12) shows a strong peak at 12 h, which clearly reflects an effect attributable to the M2 earth tide (Arditty, 1978; Bos, 1995; Bredehoeft, 1967; Didier, 1997; Heudron, 1999; Melchior, 1978).

Figure 5.11: Spectral density functions calculated on piezometric levels between 1 and 29 August 1997 with a time step of 5 min (first-order filter). [Spectral density versus frequency, with peaks marked at 24 h, 12 h and 8 h.]

Figure 5.12: Spectral density functions calculated on piezometric levels between 1 and 29 August 1997 with a time step of 5 min (first-order filter followed by a uniform moving average filter with a 24 h amplitude). [Spectral density versus frequency, dominated by a strong peak at 12 h, with a smaller peak at 24 h.]

Using multiresolution wavelet analysis, it is possible to isolate the semi-diurnal component to determine its amplitude. The value of this amplitude was estimated to be 0.10 m. There is an approach applicable to the calculation of the storage coefficient (S) based on tidal effects (Bredehoeft, 1967; Jacob, 1940; Mangin, 1975; Marsaud et al., 1993), using the following formula

S = \frac{e\,\theta}{\mathrm{d}h} \qquad (5.5)

where θ corresponds to the cubic expansion of the M2 earth tide, equal to 4.5 × 10⁻⁸; e is the thickness of the aquifer (450 m); and dh is the variation induced by the earth tide. This results in

S = \frac{450 \times 4.5 \times 10^{-8}}{0.10} = 2 \times 10^{-4} \qquad (5.6)

The result shows a low storage coefficient and is consistent with the conditions usually observed for this type of confined aquifer. The diurnal and weekly components are caused by fluctuations in barometric pressure (Bredehoeft, 1967; Jacob, 1940; Mangin, 1975; Marsaud et al., 1993). The effect of pressure on confined aquifers results in a barometric efficiency (B) that can be used to calculate the porosity. To determine whether an effect is really due to the barometric pressure, a cross-correlation analysis of the pressure–piezometric level relationship was performed. Only the period with a 1 h time step was analysed, because pressure recordings with a time step shorter than 1 h did not exist. For that purpose, the period from June to August 1995, the only period for which there are barometric pressure recordings on an hourly basis, was taken into account. After August 1997, the barometric pressure was recorded every 3 h. The cross-correlogram indicates the existence of a relationship which, depending on the period taken into account, is proportional or inversely proportional; in this case, it is proportional (Figure 5.13). From the cross-correlation analysis, it is possible to determine, for the component at 24 h, the gain value that corresponds to the barometric efficiency. The gain function (Figure 5.14) shows a value of 0.72 for the component at 24 h. The following Jacob equation (Jacob, 1940) can be used to calculate the porosity

\varphi = \frac{E_w B S}{\rho g e} \qquad (5.7)

where φ corresponds to the porosity; S is the storage coefficient (equal to 2 × 10⁻⁴) from the previous calculation; e is the thickness of the aquifer (450 m); E_w is the modulus of elasticity of water (equal to 2.05 × 10⁹ N/m²); and ρ is the density of water (equal to 997 kg/m³):

\varphi = \frac{2.05 \times 10^{9} \times 0.72 \times 2 \times 10^{-4}}{997 \times 9.81 \times 450} \approx 7\% \qquad (5.8)
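Both estimates can be checked directly; the sketch below simply restates Equations 5.5–5.8 with the values quoted in the text.

```python
# Sketch reproducing Equations 5.5-5.8 with the values given in the text.
theta = 4.5e-8      # cubic expansion of the M2 earth tide
e = 450.0           # aquifer thickness, m
dh = 0.10           # semi-diurnal piezometric amplitude, m
S = e * theta / dh  # storage coefficient -> ~2e-4

Ew = 2.05e9         # elastic modulus of water, N/m2
B = 0.72            # barometric efficiency (gain at 24 h)
rho, g = 997.0, 9.81
phi = Ew * B * S / (rho * g * e)   # porosity -> ~0.07 (about 7%)
print(f"S = {S:.1e}, porosity = {phi:.1%}")
```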

This value is consistent with porosity values observed in fractured limestone formations.

As for the noise, when the spectral density function is represented in bilogarithmic coordinates, the slope of the spectrum equals −1, which corresponds to the limit between Gaussian noise and Brownian noise (Figure 5.15). Thus, the noise is likely to be random but non-stationary. To investigate the identified noise, the short-term component (at 10 min; Figure 5.16) was isolated; its correlogram (Figure 5.17) indicates the existence of a periodic variation. It is therefore clear that this component is random but is modulated by the tidal and barometric effects. These characteristics indicate the existence of a thermal convection (Combarnous, 1970).

Figure 5.13: Cross-correlogram between barometric pressure and piezometric fluctuation performed on filtered data (period from 13 June to 8 August 1995, with hourly time step). [Cross-correlation versus lag (days, −60 to +60), showing a proportional relationship.]


Figure 5.14: Gain function between barometric pressure and piezometric fluctuation performed on filtered data (period from 13 June to 8 August 1995, with hourly time step). [Gain versus frequency; the 24 h component has a gain of 0.72; gain values above 1 indicate amplification, below 1 attenuation.]

Figure 5.15: Noise analysis (1/f^n) from the spectral density function on logarithmic coordinates performed on raw piezometric data (period from 1 to 29 August 1997, with 5 min time step). [Log of the spectral density versus log of the frequencies, with a fitted slope of −1.]

Figure 5.16: Component at 10 min of the piezometric time series (period from 1 to 29 August 1997, with 5 min time step), obtained using multiresolution wavelet analysis. [Piezometric fluctuation (cm, ±5) versus time, 1–28 August 1997.]

Figure 5.17: Simple correlogram calculated from the time series depicted in Figure 5.16. [Correlation versus lag (0–600 min), showing a periodic oscillation.]

5.3.4 Research on the physical nature of the processes responsible for the noise

Dynamic systems and reconstructed attractors analysis
The method involves the study of dynamical systems and is based on reconstructed attractors (Alligood et al., 1996; Dahan Dalmedico et al., 1992; Solé and Manrubia, 1996; Bergé et al., 1992; 1994). This method aims to identify the nature of the physical processes responsible for the observed variations in time series. Any dynamic system is perfectly known if it is possible to represent its movement in space (the phase space), the dimensions of which are those of the state variables that characterise it and correspond to its coordinates; the number of independent state variables corresponds to the number of degrees of freedom of the system. The trajectory of the movement in the phase space determines a figure called the attractor. The attractor can be defined by a dimension, which can be an integer for deterministic systems or fractal for chaotic systems (Bergé and Pomeau, 1988; Bergé et al., 1984; Farmer and Sidorowich, 1987; Lorenz, 1963; Manneville, 1990; Paredes, 1995). For complex systems, the state variables are not always known, nor, a fortiori, are their degrees of freedom or the coupling structures that connect these variables (the nature of the differential equations). To study such a system, Takens (1981) suggests building, from a recording of an entity that reflects how the system functions, an attractor that is topologically equivalent to that of the system. The method is known as the method of delays, because the state space is constructed from coordinates that are derived from each other by introducing a delay into the time series:

X_1 = X[t];\ X_2 = X[t + \tau];\ X_3 = X[t + 2\tau];\ \ldots;\ X_D = X[t + (D-1)\tau] \qquad (5.9)

where τ is a delay that is determined from the time series by a method known as 'average mutual information' (Abarbanel, 1995). To estimate the number of degrees of freedom and the dimension of the attractor, the method used is that of the correlation integral proposed by Grassberger and Procaccia (1983). This method is used to calculate distances between points of the reconstructed attractor, using an algorithm proposed by Fowler and Roach (1991; 1993). Two important types of information are derived from the correlation integral: primarily, the fractal dimension of the attractor and the number of degrees of freedom; secondarily, the Kolmogorov–Sinai entropy, which informs on the nature of the analysed process, which can be deterministic, random or chaotic (Eckmann and Ruelle, 1985; Shuster, 1988).

Results
The method of reconstructed attractors has been applied to four different periods of the data set (portions of the time series that are usable): September 1995, December 1995, April 1996 and August 1997. The results obtained for these four periods are identical; therefore, only one of them, that of April 1996, will be taken as an example. A first attempt using the raw time series indicated the existence of deterministic components. In fact, the incidence of pressure and earth tide being very strong, the underlying processes, with lower amplitude (about 1 cm), are completely masked. Therefore, the analysis was repeated after filtering the time series. The filtering consisted of a first-order differentiation filter to eliminate the trend, a uniform moving average filter with an amplitude of 24 h to remove the tidal and barometric effects, followed by a uniform moving average filter with an amplitude of 10 h. The noise of interest corresponds to the residual. Using the short-term component obtained by multiresolution wavelets, the result would have been the same.
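For readers who wish to experiment, a compact, brute-force sketch of the Equation 5.9 delay embedding and of a Grassberger–Procaccia-style correlation sum follows; it uses a synthetic noisy oscillation, whereas the chapter relied on the Fowler and Roach algorithm for the distance calculations.

```python
# Sketch: delay embedding (Equation 5.9) and a correlation sum C(r);
# the attractor dimension is the slope of log C(r) versus log r in the
# scaling range. Brute force, suitable only for short series.
import numpy as np

def embed(x, dim, tau):
    n = x.size - (dim - 1) * tau
    return np.column_stack([x[i * tau: i * tau + n] for i in range(dim)])

def correlation_sum(points, r):
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    iu = np.triu_indices(len(points), k=1)   # distinct pairs only
    return np.mean(d[iu] < r)

# synthetic stand-in for the filtered noise: a noisy oscillation
t = np.linspace(0, 60, 1000)
x = np.sin(t) + 0.05 * np.random.default_rng(0).normal(size=t.size)
pts = embed(x, dim=5, tau=1)
for r in (0.05, 0.1, 0.2, 0.4):
    print(r, correlation_sum(pts, r))
```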


Because of the different filtering steps, the correlation integrals obtained are rather complex, but still usable (Figure 5.18(a) – see colour insert). The 'average mutual information' method allowed the determination, on the filtered time series, of a delay (τ) value of 1. On the graph showing the dimension of the attractor plotted against the embedding dimension (Figure 5.18(b) – see colour insert), from an embedding dimension of 5 the dimension of the attractor would be 2.8. Beyond this value, the dimension of the attractor increases monotonically. The first part of the curve can be interpreted as the physical identification of the noise; the second part indicates that errors increase and amplify, showing an increase in the number of degrees of freedom. However, when comparing this result with that obtained from experiments on Rayleigh–Bénard convection (Malraison et al., 1983), it is notable that these authors found an attractor dimension of 2.8 for an embedding dimension of 5. The other analysed portions of the time series provide the same values, except for one that reached a value of 4. Nevertheless, this dimension has also been reported by Bergé and Dubois (in chapter 6 of Bergé et al., 1992). These results confirm what had previously been anticipated, namely that the observed noise is the result of a thermal convection.

5.4 EVIDENCE OF THE PRESENCE OF A THERMAL CONVECTION

Taking into account the results on the functioning of the Rennes les Bains hydrothermal aquifer, the geochemical and isotopic data and the integration of thermal measurements have led to a review of the geology of this area. The interpretation is consistent with the cross-section shown in Figure 5.2. Moreover, this interpretation is consistent with all previous geological information, with a recharge area located toward Valmigère-Bouisse and very large residence times. As a whole, it conforms to the model proposed by Toth (1963). As seen on the cross-section, the well reached only the upper portion of the reservoir, which is supplied by the Cardou branch (Figure 5.2). The springs would receive water from a deeper area, at about 2000 to 2500 m, which would give a temperature gradient of 1.5°C per 100 m, quite similar to that actually obtained from the well (Mangin et al., 2004).

The identification and analysis of the impact of earth tides and barometric pressure on the aquifer allowed the authors to estimate a permeability of 5.5 × 10⁻⁷ m/s (an intrinsic permeability of about 5 × 10⁻¹⁴ m²) and a porosity of 7%. These data are also consistent with the geochemical and isotopic results, and with the carbonate rock type of the aquifer.

The different methods used to analyse the piezometric level fluctuations allowed the hydrodynamic functioning of a deep hydrothermal aquifer to be understood. The existence of a random non-stationary noise evolving according to earth tides and barometric pressure indicates the presence of a thermal convection. Indeed, it has been shown (Malraison et al., 1983; Toth, 1963) that convection can take place in porous media only for Rayleigh numbers greater than 40 (Malraison et al., 1983). The Rayleigh number in this type of environment is given by the following formula

Ra = \frac{\rho\,g\,b\,e\,K\,\Delta t}{D_t\,V_d} \qquad (5.10)

where ρ is the density of water (997 kg/m³), g is the gravitational acceleration (9.81 m/s²), b is the coefficient of thermal expansion of water with temperature (estimated at 0.88 × 10⁻⁴ /°C), e is the thickness of the aquifer (450 m), K is the intrinsic permeability, D_t is the thermal diffusivity (1.3 × 10⁻⁷ m²/s), V_d is the dynamic viscosity of water (about 1.3 × 10⁻³ kg/m s), and Δt is the temperature gradient (about 18°C). With a Rayleigh number of 40, the permeability in these conditions should be

K = \frac{Ra\,D_t\,V_d}{\rho\,g\,b\,e\,\Delta t} = \frac{40 \times 1.3 \times 10^{-7} \times 1.3 \times 10^{-3}}{997 \times 9.81 \times 0.88 \times 10^{-4} \times 450 \times 18} \approx 1 \times 10^{-12}\ \mathrm{m^2} \qquad (5.11)

Despite the approximations that were made, thermal convection requires a much greater permeability than that obtained; in fact, it should be at least 20 times greater. The manifestation of a thermal convection in the well therefore imposes the existence of a strong local increase in permeability. Such an increase is quite possible, however, if the existence of a hydrothermal karstic aquifer in this sector is recognised. This result would also explain the large capacity effect observed during the pumping test.
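This order-of-magnitude argument is easy to verify; the sketch below uses the parameter values given in the text.

```python
# Sketch checking Equations 5.10-5.11 with the values given in the text.
rho, g = 997.0, 9.81   # water density (kg/m3) and gravity (m/s2)
b = 0.88e-4            # thermal expansion coefficient of water, 1/degC
e = 450.0              # aquifer thickness, m
Dt = 1.3e-7            # thermal diffusivity, m2/s
Vd = 1.3e-3            # dynamic viscosity of water, kg/(m s)
dT = 18.0              # temperature difference, degC

K_required = 40 * Dt * Vd / (rho * g * b * e * dT)  # Ra = 40 threshold
K_measured = 5e-14     # intrinsic permeability from the pumping test
print(f"K required ~ {K_required:.1e} m2, "
      f"ratio = {K_required / K_measured:.0f}")
# -> ~1e-12 m2, roughly 20 times the measured intrinsic permeability
```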

5.5 CONCLUSION

Hydrothermal aquifers are still regarded in hydrogeology as particular aquifers that cannot be investigated with the conventional methods used for shallow aquifers. Additionally, their hydrodynamic investigation is always extremely costly owing to the need to install deep wells. Consequently, their investigation is generally based on geochemical and isotopic considerations. The use of new methods based on the analysis of piezometric level time series shows that it is possible to obtain parameters such as permeability, porosity and storage coefficient in the same way as for shallow aquifers, at almost no cost. The results obtained indicated the need to reconsider some aspects of the geology of the study area and, above all, highlighted the existence of a local hydrothermal karstic aquifer. This finding is fundamental for the management of this deep aquifer. The methodology developed during this work can be applied to investigate any well tapping a confined aquifer, such as water supply wells. The techniques used can be generalised to analyse any environmental problem.


REFERENCES

Abarbanel, H.D.I. (1995). Analysis of Observed Chaotic Data. Springer, New York.
Abry, P. (1997). Ondelettes et Turbulences; Multirésolution: Algorithmes de Décomposition, Invariance d'Échelle et Signaux de Pression (Wavelets and Turbulence; Multiresolution: Decomposition Algorithms, Scale Invariance and Pressure Signals). Diderot Editeur, Arts et Sciences.
Alligood, K.T., Sauer, T.D. and Yorke, J.A. (1996). Chaos. An Introduction to Dynamical Systems. Springer, New York.
Arditty, P. (1978). The Earth Tide Effects on Petroleum Reservoirs. Thesis, Stanford University, USA.
Arneodo, A., Argoul, F., Bacry, E., Elezgary, J. and Muzy, J.F. (1995). Ondelettes, Multifractales et Turbulence de l'ADN aux Croissances Cristallines (Wavelets, Multifractals and Turbulence from DNA to Crystalline Formations). Diderot Editeur, Arts et Sciences.
Arthaud, F., Burg, J.P. and Matte, Ph. (1976). L'évolution structurale hercynienne du Massif de Mouthoumet (Sud de la France) (Structural evolution of the Mouthoumet Massif (southern France)). Bulletin de la Société Géologique de France. 7, XVIII: 967–972.
Barbut, M. and Fourgeaud, C. (1971). Eléments d'Analyse Mathématique des Chroniques (Mathematical Analysis of Time Series). Collection Hachette Université.
Barnes, H.L. (Ed.) (1979). Geochemistry of Hydrothermal Ore Deposits. Wiley, New York.
Barnolas, A., Chiron, J.C. and Guerange, B. (1996). Synthèse Géologique et Géophysique des Pyrénées, Vol. 1, Introduction Géophysique, Cycle Hercynien (Geological and Geophysical Synthesis of the Pyrenees. Vol. 1, Geophysical Introduction, Hercynian Cycle). Bureau de Recherches Géologiques et Minières et de l'Instituto Technologico Geominero de España, pp. 159–161, 289–291, 312–313, 645–647.
Bergé, P. and Pomeau, Y. (1988). Le Chaos: Théorie et Expériences (The Chaos: Theory and Experiences). Editions Eyrolles, Paris.
Bergé, P., Pomeau, Y. and Vidal, Ch. (1984). L'Ordre dans le Chaos. Vers une Approche Déterministe de la Turbulence (Order in Chaos. Towards a Deterministic Approach to Turbulence). Hermann, Collection Enseignement des Sciences.
Bergé, P., Pomeau, Y. and Vidal, Ch. (1992). L'Ordre dans le Chaos. Vers une Approche Dynamique de la Turbulence. Collection Enseignement des Sciences, Hermann, Editions des Sciences et des Arts.
Bergé, P., Pomeau, Y. and Dubois-Gance, M. (1994). Des Rythmes au Chaos (From Rhythms to Chaos). Editions Odile Jacob, Paris.
Bessière, G. (1987). Modèle d'évolution polyorogénique d'un Massif Hercynien: le Massif de Mouthoumet (Aude) (Polygenic evolution model of the Hercynian Massif: the Mouthoumet Massif (Aude)). Thèse, Toulouse.
Bilotte, M. (1985). Le Crétacé Supérieur des Plates-formes Est-Pyrénéennes (Upper Cretaceous of the East Pyrenean Platforms). Strata, Toulouse, série 2, vol. 5, 438 pp.
Blavoux, B. (1991). Le forage: une façon moderne de protéger et gérer la ressource. La qualité de l'eau thermale coulera-t-elle toujours de sources? (Well drilling: a modern way to protect and manage the resource. Will the quality of thermal water last?) Journées Nationales Cité des Sciences et de l'Industrie (National Scientific Days, City of Science and Industry), La Villette, Paris, Colloque 1991, pp. 17–33.
Blavoux, B. and Berthier, F. (1985). Les originalités hydrogéologique et technologique des eaux minérales (Hydrogeological and technological originalities of mineral waters). Bulletin de la Société Géologique de France. 7: 1033–1044.
Blavoux, B., Dazy, J. and Sarraut-Reynauld, J. (1982). Information about the origin of thermomineral waters and gas by means of environmental isotopes. Journal of Hydrology. 56: 23–38.
Bos, C. (1995). Etude d'un aquifère thermal profond: le forage de Rennes-les-Bains (Aude) (Study of a deep thermal aquifer: the Rennes-les-Bains well (Aude)). Mémoire de DESU (Superior Studies Diploma Dissertation), Université de Toulouse.
Bouchaala, A. (1991). Hydrogéologie d'aquifères karstiques profonds et relation avec le thermalisme. Exemple de la partie occidentale du Massif de Mouthoumet (Aude) (Hydrogeology of deep karstic aquifers and its relation with thermalism. Example of the occidental part of the Mouthoumet Massif (Aude)). Doctorat (PhD Thesis), Université de Franche-Comté.
Box, G.E.P. and Jenkins, G.M. (1976). Time Series Analysis, Forecasting and Control, Revised edition. Holden Day, San Francisco.
Bras, R.L. and Rodriguez-Iturbe, I. (1993). Random Functions and Hydrology. Dover Publications, New York.
Bredehoeft, J.D. (1967). Response of well-aquifer system to earth tides. Journal of Geophysical Research. 72, 12: 3075–3085.
Bresson, A. (1908). Groupe primaire. Région des Corbières (Primary group. Corbières area). In: Carez, L. (Ed.), Résumé de la Géologie des Pyrénées Françaises (Summary of the Geology of the French Pyrenees). Mém. Serv. Carte Géologique de France, vol. V, pp. 3276–3284.
Burger, A., Recordon, E., Bovet, D., Cotton, L. and Saugy, B. (1985). Thermique des Nappes Souterraines (Thermics of Subterranean Aquifers). Presses Polytechniques Romandes.
Burke Hubbard, B. (1995). Les Ondes et les Ondelettes: la Saga d'un Outil Mathématique (Waves and Wavelets: the Saga of a Mathematical Tool). Pour la Science.
Carrie, J.C. (1991). Détermination de l'origine et du transit des eaux en vue de l'établissement des périmètres de protection du gîte thermal. La qualité de l'eau thermale coulera-t-elle toujours de sources? (Determination of the origin and transit of waters in order to establish protection perimeters for a thermal well. Will the quality of thermal water last?) Journées Nationales Cité des Sciences et de l'Industrie (National Scientific Days, City of Science and Industry), La Villette, Paris, Colloque 1991, pp. 41–49.
Castany, G. (1963). Traité Pratique des Eaux Souterraines (Practical Treatise on Groundwater). Dunod, Paris.
Combarnous, M. (1970). Convection Naturelle et Convection Mixte en Milieu Poreux (Natural Convection and Mixed Convection in Porous Media). Thèse de Doctorat en Sciences Physiques (PhD Thesis in Physical Sciences), Université de Paris.
Dahan Dalmedico, A., Chabert, J.L. and Chemla, K. (1992). Chaos et Déterminisme (Chaos and Determinism). Editions Seuil.
Daubechies, I. (1992). Ten Lectures on Wavelets. CBMS-NSF Series in Applied Mathematics, SIAM Publications, 61.
de Launay, L. (1899). Sources Thermominérales, Origine des Eaux Thermominérales et Chimiques (Thermo-mineral Springs, Origin of Thermo-mineral and Chemical Waters). Librairie Polytechnique Baudry, Paris.
De Marsily, G. (1981). Hydrogéologie Quantitative (Quantitative Hydrogeology). Editions Masson.
Didier, E. (1997). Mise en Evidence des Caractéristiques Hydrogéologiques de Trois Aquifères Hydrothermaux (Hautes-Pyrénées). Approche Systémique (Assessment of Hydrogeological Characteristics of Three Hydrothermal Aquifers (Upper Pyrenees). Systemic Approach). DEA (Masters Diploma), Paris, vol. XI.
Eckmann, J.P. and Ruelle, D. (1985). Ergodic theory of chaos and strange attractors. Reviews of Modern Physics. 57: 617.
Elder, J. (1981). Geothermal Systems. Academic Press, London.
Ellis, A.J. and Mahon, W.A.J. (1977). Chemistry and Geothermal Systems. Academic Press, London.
Farmer, J.D. and Sidorowich, J.J. (1987). Predicting chaotic time series. Physical Review Letters. 59, 8.
Foufoula-Georgiou, E. (1994). Wavelets in Geophysics. Academic Press, San Diego.
Fowler, A.D. and Roach, D.E. (1993). Dimensionality analysis of time series data – non-linear methods. Computers and Geosciences. 19, 1: 41–52.
Fowler, T. and Roach, D. (1991).
Dimensionality analysis of objects and series data. In: Nonlinear Dynamics, Chaos and Fractals. Geological Association of Canada, pp. 59–81. Ghafouri, R.M. (1968). Etude hydroge´ologique des sources thermomine´rales des Pyre´ne´es (Hydrogeological study of the Pyrenean thermo-mineral springs). The`se Doctorat (PhD Thesis) de l’Universite´ de Bordeaux. Grassberger, P. and Procaccia, I. (1983). Measuring the strangeness of strange attractors. Physica 9D: 353–371. North-Holland Publishing Company, Amsterdam. Hardy, H.H. and Beier, R.A. (1994). Fractals in Reservoir Engineering. World Scientific, Singapore.

138

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

Helgeson, H.C. (1969). Thermodynamics of hydrothermal systems at elevated temperatures and pressures. American Journal of Science, 267: 724–804. Heudron, D. (1999). Etude hydrodynamique des syste`mes hydrothermaux a` partir de l’exemple de Rennes-les-Bains (Aude) (Hydrodynamic study of hydrothermal systems through the example of the Rennes-les-Bains). Me´moire de DEA (Master Diploma), Universite´ Paris XI (Orsay). Khufuss, A. (1981). Ge´ologie et hydroge´ologie des Corbie`res me´ridionales: re´gion de Bugarach, Rouffiac-des-Corbie`res (Geology and hydrogeology of the meridional Corbie`res: Bugarach, Rouffiac-des-Corbie`res). The`se 3e` me cycle (PhD Thesis), University Paul Sabatier, Toulouse. Jacob, C.E. (1940). On the flow of water in an artesian aquifer. Transactions of the American Geophysical Union. 2: 574–786. Jenkins, G.M. and Watts, O.G. (1968). Spectral Analysis and its Applications. Holden-Day, San Francisco. Labat, D., Ababou, R. and Mangin, A. (1999a). Analyse en ondelettes en hydrologie karstique. 1e` re partie: analyse varie´e de pluies et de´bits de sources karstiques (Wavelet analysis in karstic hydrology. First part: univariate analysis of rainfall rates and karstic springs runoff). Comptes Rendus de l’Acade´mie des Sciences. 329: 873–879. Labat, D., Ababou, R. and Mangin, A. (1999b). Analyse en ondelettes en hydrologie karstique 2e partie: analyse en ondelettes et croise´es pluie – debit (Wavelet analysis in karstic hydrology. Second part: rainfall-runoff cross-wavelet analysis). Comptes Rendus de l’Acade´mie des Sciences. 329: 873–879. Labat, D., Ababou, R. and Mangin, A. (2000a). Rainfall-runoff relations for karstic springs: Part 1: Convolution and spectral analyses. Journal of Hydrology. 238: 123–148. Labat, D., Ababou, R. and Mangin, A. (2000b). Rainfall-runoff relations for karstic springs: Part 2: Continuous wavelet and discrete orthogonal multiresolution analyses. Journal of Hydrology. 238: 149–178. Labat, D., Ababou, R. and Mangin, A. (2001). Introduction of wavelet analyses to rainfall/runoffs relationship for karstic basin: the case of Licq-Atherey karstic system (France). Ground Water. 39, 4: 605–614. Labat, D., Mangin, A. and Ababou, R. (2002a). Rainfall-runoff relations for karstic springs: Multifractal analysis. Journal of Hydrology. 256: 176–195. Labat, D., Ababou, R. and Mangin, A. (2002b). Analyse multire´solution croise´e de pluies et de´bits de sources karstiques (Multiresolution cross-analysis of rainfall rates and karstic spring runoffs). Compte Rendus Ge´osciences. 334: 551–556. Lorenz, E.N. (1963). Determinic nonperiodic flow. Journal of the Atmospheric Sciences. 20: 130–141. Mallat, S. (1999). A Wavelet Tour of Signal Processing, 2nd edition. Academic Press, San Diego. Malraison, B., Atten, P., Berge´, P. and Dubois, M. (1983). Dimension d’attracteurs e´tranges: une de´termination expe´rimentale en re´gime chaotique de deux syste`mes convectifs (Dimension of strange attractors: an experimental determination in a chaotic regime of two convective systems). Comptes Rendus de l’Acade´mie des Sciences. 297: 209. Mangin, A. (1975). Contribution a` l’E´tude Hydrodynamique des Aquife`res Karstiques (Contribution to the Hydrodynamic Study of Karstic Aquifers). The`se Doctorat d’Etat (PhD Thesis), University of Dijon. Also published in Annales de Spe´le´ologie. (1974). 29, 3: 283–332; (1974). 29, 4: 495–601; (1975). 30, 1: 21–124. Mangin, A. (1981a). 
Utilisation des analyses corre´latoire et spectrale dans l’approche des syste`mes hydrologiques (Using spectral and correlation analyses to study hydrogeological systems). Comptes Rendus de l’Acade´mie des Sciences. Se´rie II, 293: 401–404. Mangin, A. (1981b). Apports des analyses corre´latoire et spectrale croise´es dans la connaissance des syste`mes hydrologiques (Contribution of spectral and correlation analyses in the investigation of hydrogeological systems). Comptes Rendus de l’Acade´mie des Sciences. Se´rie II, 293: 1011–1014. Mangin, A. (1984). Pour une meilleure connaissance des syste`mes hydrologiques a` partir des analyses corre´latoires et spectrales (For a better knowledge of hydrogeological systems through the use of correlation and spectral analyses). Journal of Hydrology. 67: 25–43. Mangin, A. (1994). Karst hydrogeology. In: Groundwater Ecology. Academic Press, New York.

SOLVING COMPLEX PROBLEMS USING STOCHASTIC DATA ANALYSIS

139

Mangin, A., Yvroux, M. and D’Hulst, D. (2004). Approche hydrodynamique d’un aquifere hydrothermal, influence par le karst. Exemple de Rennes les Bains, France (Hydrodynamic characterisation of a hydrothermal aquifer influenced by a karst. Example of Rennes les Bains, France). 10th Technical Day of the French National Committee of the International Association of Hydrogeologists. Hydrothermal Circulations in Limestones. Congress Acts, Carcassonne, France. Manneville, P. (1990). Structures Dissipatives. Chaos et Turbulence (Dissipative Structures. Chaos and Turbulence). Ale´a-Saclay, Gif-sur-Yvette. Marsaud, B., Mangin, A. and Bel, F. (1993). Estimation des caracte´ristiques physiques d’aquife`res profonds a` partir de l’incidence barome´trique et des mare´es terrestres (Estimation of the physical characteristics of deep aquifers by barometric incidence and Earth tides). Journal of Hydrology. 144: 85–100. Matheron, G. (1965). Les Variables Re´gionalise´es et leur Estimation (Regionalised Variables and their Estimation). Masson, Paris. Max, J. (1980). Me´thodes et Techniques de Traitement du Signal et Applications aux Mesures Physiques (Technical Methods of Signal Processing and Applications to Physical Measures). Masson, Paris. Melchior, P. (1978). The Tides of the Planet Earth. Pergamon Press, Paris, 609 pp. Meyer, Y. and Roques, S. (1992). Progress in Wavelet Analysis and Applications. Edition Frontieres. Michard, G. (1989). Equilibres Chimiques dans les Eaux Naturelles (Chemical Equilibrium of Natural Waters). Editions Publisud. Moret, L. (1946). Les Sources Thermomine´rales. Hydroge´ologie. Ge´ochimie. Biologie (Thermomineral Waters. Hydrogeology. Geochemistry. Biology). Masson, Paris. Paredes, C. (1995). Aplicacion de la Geometria Fractal en las Ciencas de la Tierra (Application of Fractal Geometry in Earth Sciences). Tesis Doctoral (PhD Thesis), Universidad Politecnica de Madrid, 285 pp. Rambaud, A. (1991). La gestion des eaux thermales face aux pollutions: quels avenirs? La qualite´ de l’eau thermale coulera-t-elle toujours de sources? (The management of thermal waters facing pollution: what future? Will the quality of thermal water always flow from sources?) Journe´es Nationales Cite´ des Sciences et de l’Industrie, La Vilette Paris, Colloque 1991, pp. 33–37. Schoeller, H. (1975). Les proble`mes thermiques et chimiques des eaux thermals (Thermal and chemical problems of thermal waters). Proceedings of the 119 Association Internationale des Sciences Hydrologiques Symposium of Grenoble, pp. 1–8. Schoeller, H. and Schoeller, M. (1976). Calcul de la tempe´rature des sources thermomine´rales a` leur origine profonde (Estimation of the temperature of thermo-mineral springs at their deep origin). Comptes Rendus de l’Acade´mie des Sciences. 283, D: 753–756. Schoeller, H. and Schoeller, M. (1982). Les eaux thermomine´rales des Pyre´ne´es (Thermo-mineral water of the Pyrenees). Presse Thermale et Climatique. 119, 2: 81–86. Shuster, H. (1988). Deterministic Chaos. Verlagsgesellschaft, Weinheim. Sole´, R.V. and Manrubia, S.C. (1996). Orden y Caos en Sistemas Complejos (Order in Chaos and Complex Systems). UPC, Barcelona. Takens, F. (1981). Detecting Strange Attractors in Turbulence. Springer, Lecture Notes in Mathematics, Vol. 898. Torrence, C. and Compo, G.P. (1998). A practical guide to wavelet analysis. Bulletin of the American Meteorological Society, 79, 1: 61–78. Toth, J. (1963). A theoretical analysis of groundwater flow in small drainage basins. Journal of Geophysical Research. 
68, 16: 4795–4812. Valat, J.L. (1971). Etude hydroge´ologique des sources thermales de Rennes-les-Bains (Hydrogeological study of thermo-mineral springs of Rennes-les-Bains). The`se 3e` me cycle (1e` re partie) (PhD Thesis), Montpellier. Von Gaertner, H.R. (1937). Montagne Noire und Massiv von Mouthoumet als Teile des sudwesteuropaischen Variszikums (Black Mountain with Mouthoumet Massif and the Variscan folds of southwestern France). Abhandlungen der Gesellschaft der Wissenschaften zu Gottingen, Math. Phys. III, Berlin.

140

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

Yevjevich, V. (1972). Stochastic Processes in Hydrology. Water Resources Publications, Fort Collins, Colorado. Yvroux, M. (1997). Karst et thermalisme dans le Massif du Mouthoumet: l’exemple de Rennes les Bains (Karst and thermalism in the Mouthoumet Massif: example of Rennes les Bains). Spe´le´ologie Aude. 6: 72–84. Yvroux, M. and Olive, P. (2004). Circulations hydrothermales dans la partie occidentale du Massif de Mouthoumet (Aude) (Hydrothermal circulations in the occidental part of the Mouthoumet Massif, Aude). 10th Technical Day of the French National Committee of the International Association of Hydrogeologists. Hydrothermal Circulations in Limestones. Congress Acts, Carcassonne, France.

CHAPTER

6

Application of Statistics in the Evaluation and Optimisation of Environmental Sampling Plans

Meng Ling and Jeff Kuo

6.1

INTRODUCTION

A sampling plan (also referred to as a monitoring programme in this chapter) is an indispensable part of remedial investigation at contaminated sites. A sampling plan is typically implemented to detect potential hazards or chemical releases, monitor environmental compliance and/or confirm the progress of corrective actions. A technically sound sampling plan can lead to effective and efficient monitoring of environmental conditions, provide crucial information for remedial design, support negotiation with regulatory agencies, and result in cost savings.

This chapter discusses strategies and procedures for the evaluation and optimisation of an existing environmental sampling plan. Statistical tests and procedures commonly used in environmental data analysis are applicable to evaluate the effectiveness of an existing sampling plan. Many statistical approaches have specifically been developed for this purpose. Results obtained from using statistical methods that are robust and have been widely used in the evaluation of sampling plans are more acceptable to environmental professionals and regulatory agencies.

In this chapter, we introduce a multi-component approach to evaluate an existing environmental sampling plan. This approach was originally developed to optimise existing ground water sampling plans, but its principles can be applied to other environmental sampling plans or monitoring programmes. The content of this chapter is organised into four sections. Section 6.1 provides an introduction. Section 6.2 describes the evaluation approach and each of its components in detail. Section 6.3 presents application examples of three remediation sites in the USA. A summary of the chapter is provided in Section 6.4.


6.2

APPROACH

The multi-component approach for the evaluation and optimisation of an existing environmental sampling plan consists of five components:

1. trend analysis of concentrations in local areas;
2. trend analysis of site-wide or population-wise concentrations;
3. analysis of sampling locations;
4. analysis of sampling frequencies;
5. comprehensive assessment.

Trend analyses utilise historical monitoring data and statistical methods to assess the concentration trends in local areas and the overall temporal variations of environmental contamination at a site as a whole. These results provide a better understanding of the conditions and evolution of the contamination, and serve as a foundation for subsequent specific evaluations. Analyses of sampling locations and frequencies provide a specific evaluation as to where and how often to sample. The analysis typically involves the use of advanced statistical methods to assess the monitoring network spatially and to evaluate the sampling frequency temporally. The first four components of the approach produce preliminary recommendations with regard to removal or addition of sampling locations and the need for adjustment of sampling frequency. In the final comprehensive assessment stage, the preliminary recommendations are refined with professional judgement by taking non-technical factors into consideration, to produce final recommendations. The specific statistical methods introduced in this chapter were chosen because of their usefulness, robustness and applicability, and to a large extent based on the experiences of the authors. They include Mann–Kendall trend analysis for individual locations (Aziz et al., 2003a), box–whisker plot and sign test for site-wide trend analysis (US EPA, 2006), the Delaunay method for sampling location analysis (Ling et al., 2005), and the modified cost-effective sampling (CES) method for sampling frequency analysis (Ling, 2003). The following sections describe each specific statistical method and evaluation rationale in detail.

6.2.1

Concentration trend analysis for individual locations

Mann–Kendall trend analysis (Aziz et al., 2003a) was developed based on the Mann–Kendall test (Gilbert, 1987). The Mann–Kendall test can be viewed as a non-parametric test for zero slope of the first-order regression of concentration data versus time. This procedure does not require knowledge of the statistical distribution of the data and can be used with data sets that include irregular sampling intervals and missing data. The Mann–Kendall test can also be used with concentration data reported as trace or lower than the reporting or method detection limit because it uses only the relative magnitudes of the data rather than their measured values. The Mann–Kendall test has advantages in cases where data outliers would produce biased estimates of the slope obtained from the least-squares method.
The Mann–Kendall test is one of the trend analysis methods recommended by the US Environmental Protection Agency (US EPA, 2006).

The Mann–Kendall trend analysis determines the concentration trend by considering three factors: the Mann–Kendall statistic S, the confidence level in the trend, and the coefficient of variation (COV) of the concentration data. The Mann–Kendall statistic S measures the trend in the data: positive values indicate an increase in concentrations over time and negative values indicate a decrease in concentrations over time. The strength of the trend is proportional to the magnitude of the Mann–Kendall statistic S (i.e. a large value indicates a strong trend). The Mann–Kendall statistic, S, is defined as

$$S = \sum_{k=1}^{n-1} \sum_{j=k+1}^{n} \mathrm{sgn}(x_j - x_k) \qquad (6.1)$$

where $x$ is the observed value in time-sequential order, $n$ is the number of observations in the data set, and $\mathrm{sgn}(x_j - x_k)$ is an indicator function that takes the value 1, 0 or $-1$ according to the sign of $x_j - x_k$ ($j > k$). The indicator function is calculated as follows:

$$\mathrm{sgn}(x_j - x_k) = \begin{cases} 1 & \text{if } x_j - x_k > 0 \\ 0 & \text{if } x_j - x_k = 0 \\ -1 & \text{if } x_j - x_k < 0 \end{cases} \qquad (6.2)$$

The confidence level in the trend is determined by consulting the S statistic and the sample size n in a Kendall probability table such as the one reported in Hollander and Wolfe (1973). The COV is given by

$$\mathrm{COV} = \frac{s}{\bar{x}} \qquad (6.3)$$

where $s$ is the sample standard deviation and $\bar{x}$ is the sample average.

Using the above statistics, concentration trends in the Mann–Kendall trend analysis are further classified into six categories:

4. 5. 6.

increasing trend (I) – Mann–Kendall statistic S greater than 0 with the confidence level greater than or equal to 95%; probably increasing trend (PI) – Mann–Kendall statistic S greater than 0 with the confidence level greater than or equal to 90%, but less than 95%; no trend (NT) – Mann–Kendall statistic S greater than 0 with the confidence level less than 90%; or Mann–Kendall statistic S less than 0, the confidence level less than 90%, and the COV greater than 1; stable trend (S) – Mann–Kendall statistic S less than 0, the confidence level less than 90%, and the COV less than 1; probably decreasing trend (PD) – Mann–Kendall statistic S less than 0 with the confidence level greater than or equal to 90%, but less than 95%; decreasing trend (D) – Mann–Kendall statistic S less than 0 with the confidence level greater than or equal to 95%.


The trends determined above reflect the direction as well as uncertainty (or confidence) of a trend. A graphical illustration of the concentration trends by the Mann–Kendall trend analysis is provided in Figure 6.1. The Mann–Kendall trend analysis requires a minimum of four samples, and a minimum of six samples for a reliable analysis. The Mann–Kendall trend analysis has been built into the Monitoring and Remediation Optimisation System (MAROS), a public software program developed by the Air Force Center for Environmental Excellence (Aziz et al., 2003b; Ling et al., 2004). In addition to showing trends in individual locations, the Mann–Kendall trends can be assessed collectively to determine an overall trend of a site. Such an assessment is usually conducted using a certain weighting system; examples can be found in Newell et al. (2006) and Aziz et al. (2003a).
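As a toy illustration of such a weighting system (the scores below are invented for illustration and are not those used in the cited schemes), the per-well categories could be mapped to numbers and averaged:

```python
# Hypothetical scores: negative values reward improving (decreasing) trends.
SCORES = {'D': -1.0, 'PD': -0.5, 'S': -0.25, 'NT': 0.0, 'PI': 0.5, 'I': 1.0}

def site_trend_score(well_trends):
    """Average trend score across wells; below zero suggests an improving site."""
    return sum(SCORES[t] for t in well_trends) / len(well_trends)

print(site_trend_score(['D', 'D', 'PD', 'S', 'NT', 'I']))  # -> about -0.29
```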

6.2.2

Trend analysis of site-wide or population-wise concentrations

Figure 6.1: The concentration trends defined in the Mann–Kendall trend analysis. [Figure: the six trend categories arranged from decreasing to increasing; strong trends correspond to confidence ≥ 95%, probable trends to confidence between 90% and 95%, and, below 90% confidence, a stable trend when COV < 1 or a fluctuating/no trend when COV ≥ 1.]

Figure 6.2: Illustration of the box–whisker plot. [Figure: contaminant concentrations (µg/L, logarithmic scale from 10 to 100 000) for the four quarters of 2008, marking the maximum, 75th percentile, 50th percentile, 25th percentile and minimum.]

The box–whisker plot provides a convenient graphical examination of contaminant concentration distribution (US EPA, 2006). When plotted together for multiple sampling events, the population distributions of various sampling events can be compared to assess qualitatively the overall trend of contaminant distribution at a site. In a conventional box–whisker plot, the minimum, the 25th percentile (or lower quartile), the 50th percentile (or median), the 75th percentile (or upper quartile), and the maximum of the contaminant data from a particular sampling event are documented in a chart format to illustrate the centre, spread, and skew of the data (Figure 6.2). The length of the central box (the central 50%) indicates the spread of the data, while the length of the whiskers shows the range of the distribution. If the upper box and whisker have approximately the same lengths as those of the lower box and whisker, the data are distributed symmetrically. If the upper box and whisker are longer than the lower box and whisker, then the data are right-skewed, and vice versa.

The box–whisker plot can be plotted in a logarithmic scale to better illustrate the distribution of contaminant concentrations, which usually differ by orders of magnitude. The box–whisker plot can also be used for data reported as below the laboratory reporting limits (or non-detects). Data reported as below the reporting limits are normally quantified as 50% or 100% of the reporting limits, unless a more sophisticated method is used. For example, in Figure 6.2, the concentration axis is in a logarithmic scale and data reported as non-detects were quantified as 50% of their respective reporting limits. The overlap of the minimum value and the 25th percentile in the second quarter of 2008, as well as the overlap of the three quartiles in the fourth quarter of 2008, were due to the large percentage of data reported as non-detects. Although this simple quantification method may obscure the actual distribution of the data, it still provides a quick and meaningful way of comparing several sampling events.

Quantitative evaluation is often needed to test for a statistically significant difference in the overall plume concentration levels between different sampling events. For site-wide comparison with the same monitoring network, paired two-sample tests are preferred as they have higher statistical power than non-paired tests. Such tests, as recommended by the US EPA (2006), include the paired t-test, the sign test and the Wilcoxon signed rank test. The selection of which test to use largely depends on the percentage of non-detects. For a contaminated site under remediation, the percentage of non-detects is usually high and will increase over time as a result of effective remediation. When the percentage of non-detects is between 50% and 90%, the US EPA (2006) recommends using the test of proportions.

The sign test (US EPA, 2006) is a statistical method that can provide a quantitative evaluation of population difference and thus the site-wide concentration trend. The sign test can be considered a special case of the test of proportions, that is, the median or 50% proportion. It is a non-parametric method for testing median difference. The sign test can be used no matter what the underlying distributions may be, therefore avoiding violation of key statistical assumptions implicit in many statistical testing methods. In sign tests, the non-detects can simply be quantified as 50% or 100% of their respective reporting limits. The paired differences between two sets of site-wide concentration data are calculated and used to derive a test statistic that reflects the number of differences greater than zero. This test statistic is then compared to the critical value(s) under a prescribed significance level to draw conclusions. For multiple sampling events, the sign test can be performed between consecutive sampling events or between two chosen sampling events to provide an overall trend evaluation over the period of interest. Detailed procedures of the sign test can be found in US EPA (2006), which is available online for public access.
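As a concrete sketch of the procedure just described, the code below substitutes non-detects at half of their reporting limits and runs a paired sign test between two sampling events using the exact binomial distribution (scipy.stats.binomtest, SciPy 1.7+). The function and variable names are illustrative, not taken from US EPA (2006).

```python
from scipy.stats import binomtest

def substitute_nd(values, reporting_limits, detected, fraction=0.5):
    """Quantify non-detects at a fraction (here 50%) of the reporting limit."""
    return [v if d else fraction * rl
            for v, rl, d in zip(values, reporting_limits, detected)]

def paired_sign_test(event_a, event_b, alpha=0.05):
    """Sign test on paired, same-well results from two site-wide events.
    Ties (zero differences) carry no information and are discarded."""
    signs = [(b > a) - (b < a) for a, b in zip(event_a, event_b)]
    n_pos = signs.count(1)
    n = n_pos + signs.count(-1)
    result = binomtest(n_pos, n, p=0.5, alternative='two-sided')
    return result.pvalue, result.pvalue < alpha

# Concentrations mostly falling between two events -> significant decrease:
before = [120, 95, 300, 14, 8, 55, 41, 73, 260, 19, 33, 12]
after = [60, 40, 180, 9, 10, 22, 15, 30, 120, 7, 11, 5]
print(paired_sign_test(before, after))  # p ~ 0.006, significant at 5%
```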

6.2.3

Analysis of sampling locations

The Delaunay method (Ling, 2003) can be used to eliminate redundant sampling locations and/or recommend new locations for additional sampling based on the adequacy of a monitoring network. The redundancy reduction part of the method is introduced here, as it is more often utilised in evaluating an existing monitoring programme. Applications of the sampling augmentation part of the Delaunay method can be found in Ling (2003) and Ling et al. (2005). The Delaunay method identifies redundant sampling locations by defining the relative importance of a given monitoring location with respect to the entire monitoring network. The Delaunay method is based on the Delaunay triangulation of a monitoring network (Figure 6.3 – see colour insert). The dash-dot blue lines connecting the monitoring locations form Delaunay triangles and the dashed yellow lines depict Voronoi diagrams. The dashed yellow lines are formed by bisecting the edges of the Delaunay triangles. Delaunay triangulation is a widely used spatial interpolation method for solving spatial distribution problems (Isaaks and Srivastava, 1989; Okabe et al., 1992).

The relative importance of a monitoring location is estimated through the calculation of the slope factor (SF). The SF of any location, $N_0$, is defined as

$$\mathrm{SF} = \left| \frac{EC_{N_0} - C_{N_0}}{\max(EC_{N_0},\ C_{N_0})} \right| \qquad (6.4)$$

where $EC_{N_0}$ is the estimated concentration and $C_{N_0}$ is the measured concentration at location $N_0$, both in logarithmic scale, and 'max' returns the larger of the two. The estimated concentration $EC_{N_0}$ at location $N_0$ is computed as the inverse-distance-weighted average of logarithmic concentrations at the location's natural neighbours (i.e. vertices of the Voronoi diagram containing this location). The SF value ranges from 0 to 1. A value of 0 means that the concentration at a location can be exactly estimated by its surrounding locations; thus, sampling at this location provides little new information to existing knowledge of the plume. A value larger than 0 indicates some estimation error; the larger the estimation error, the more important the monitoring location is. Locations with relatively large estimation errors cannot be eliminated from a monitoring network, whereas locations with SF values less than a certain threshold are potential candidates for elimination.

To ensure that eliminating sampling locations from the monitoring network will not cause significant information loss, two information loss measures were developed: the concentration ratio (CR) and the area ratio (AR). The concentration ratio, CR, is defined as

$$\mathrm{CR} = \frac{C_{\mathrm{avg,Current}}}{C_{\mathrm{avg,Original}}} \qquad (6.5)$$

where $C_{\mathrm{avg,Current}}$ is the average plume concentration estimated after eliminating sampling locations, and $C_{\mathrm{avg,Original}}$ is the average plume concentration estimated from all monitoring locations. The average plume concentration is calculated as the area-weighted average of the average concentrations of all Delaunay triangles:

$$C_{\mathrm{avg}} = \frac{\sum_{i=1}^{N} TC_i \, TA_i}{\sum_{i=1}^{N} TA_i} \qquad (6.6)$$

where $N$ is the number of Delaunay triangles in the triangulation, $TC_i$ is the average concentration for each Delaunay triangle, and $TA_i$ is the area of each Delaunay triangle. The area ratio, AR, which measures the change in the areal coverage of the reduced network, is defined as

$$\mathrm{AR} = \frac{A_{\mathrm{Current}}}{A_{\mathrm{Original}}} \qquad (6.7)$$

where $A_{\mathrm{Current}}$ is the area of the polygon formed by joining the peripheral wells in the reduced network, and $A_{\mathrm{Original}}$ is the polygon area occupied by the original network.

By evaluating the values of CR and AR, the impact of eliminating sampling locations on the estimate of the average plume concentration and the triangulation area can be assessed. CR and AR values close to 1 indicate that the plume information loss is not significant. Conversely, CR and AR values approaching 0 represent a large estimation discrepancy and thus indicate greater information loss. By setting an acceptable level of information loss, the validity of eliminating sampling locations can be judged. Because the AR criterion allows for higher significance of wells that define the outer boundary of the plume, eliminating these wells is more difficult than eliminating wells inside the plume. Since the Delaunay method could result in different well configurations for each individual sampling event, a method for evaluating multiple sampling events was developed as follows (a code sketch of this procedure is given after the list):

3. 4. 5. 6.

calculate the SF values for each sampling location for all selected sampling events; calculate the overall SF value by averaging the SF values for each location across the selected sampling events, weighted by the number of locations contained in each sampling event; eliminate the most redundant location (the one with the smallest overall SF value that is less than the prescribed or recommended SF threshold); calculate the overall CR and AR values after each elimination by averaging CR and AR values across the selected sampling events; confirm or restore the elimination based on the overall CR and AR values; repeat steps 3–5 until all eligible locations are examined.

The above method assumes that a similar spatial pattern for the plume exists in the time period defined by the selected sampling events. This is often the case for plumes approaching steady state and/or plumes subject to remedial control or natural attenuation. Using this analysis, the overall importance of a sampling location over a selected time period can be measured. Because the Delaunay method is based on the analysis of a single contaminant, multiple contaminants may lead to multiple analyses. A common way to avoid too many analyses that might lead to conflicting results is to select one or a few ‘indicator’ contaminants with the highest health or environmental risk. For example, for ground water contaminated by petroleum hydrocarbons, benzene or total BTEX (benzene, toluene, ethylbenzene, and xylenes) are often used to evaluate the general behaviour of the dissolved plume. In the cases when conflicting results exist, a location will be eliminated only if it can be eliminated for all contaminants of concern. The Delaunay method has been applied to many sites for evaluating ground water monitoring networks and recommending sampling locations (Aziz et al., 2003a; Ling et al., 2005; Ling et al., 2006; US EPA, 2004). The Delaunay method has been incorporated into the MAROS software (Ling et al., 2004) for free access by environmental practitioners.

6.2.4

Analysis of sampling frequencies

The modified CES method (Ling, 2003), developed based on the CES approach (Johnson et al., 1996), can be used to determine the most efficient sampling schedule for a given sampling location. The CES approach (Johnson et al., 1996) determines the sampling frequency by first setting a provisional frequency based on the most recent data (e.g. the last three years of data). Concentrations are regressed versus time using the least-squares method to derive a first-order linear slope, termed the rate of change. Based on the magnitude of the rate of change, a monitoring location is routed along one of four paths. The lowest rate, for example 0–10 µg/L per year, leads to an annual sampling frequency. The highest rate, for example 30+ µg/L per year, leads to a quarterly schedule. A rate of change between these two extremes is qualified by variability information, with higher variability leading to a higher sampling frequency (quarterly) and lower variability leading to a lower sampling frequency (semiannual). Variability is characterised using a distribution-free version of the COV, such as the range of the data divided by the median concentration. Next, the CES approach adjusts the provisional frequency based on the overall data. If the rate of change based on the overall data is significantly greater than that based on the most recent data, the provisional frequency is re-estimated with the overall data instead of the most recent data. Otherwise, the provisional frequency determined earlier is kept. The frequency determined above can be further reduced based on health or environmental risk. The frequency can be reduced by one level (i.e. from semiannual to annual or from quarterly to semiannual) if the recent maximum concentration for a compound is less than one half of its maximum contaminant level (MCL). Biennial sampling could also be applied if, for example, the recommended sampling frequency was annual for the previous three years.

The modified CES method considers not only the magnitude of the concentration trend and data variability, but also the direction and uncertainty of the trend (i.e. whether the trend is increasing or potentially decreasing, etc.). In the modified CES method, the Mann–Kendall analysis described in Section 6.2.1 is used to characterise both the direction and uncertainty of the concentration trend. The modified CES method determines sampling frequency by first estimating the overall and recent concentration trends from the historical and recent monitoring data, respectively. Two 'provisional' sampling frequencies, the overall frequency and the recent frequency, are then determined from the respective concentration trends and rates of change (Figure 6.4). The overall and recent provisional frequencies are assessed to determine the recommended sampling frequency (i.e. quarterly, semiannual, annual or biennial) through an improved set of decision rules, with the decision process illustrated in Figure 6.5. The following elements are emphasised throughout the decision process (a brief code sketch follows the list below):

• a high rate of concentration change should be tracked with a high frequency of sampling;
• an increasing trend is more of a concern than a decreasing trend;
• flexibility is needed for adjusting the monitoring frequency over the life of the monitoring programme.
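To make the mechanics concrete, here is a small Python sketch. The rate of change is the least-squares slope described above; provisional_frequency is one plausible, assumed reading of the Figure 6.4 matrix (the exact MAROS cell-by-cell rules are not reproduced here), with illustrative rate bands of 10 and 30 µg/L per year; and relax_for_risk applies the half-MCL relaxation rule quoted earlier. All names are illustrative.

```python
import numpy as np

FREQUENCIES = ['quarterly', 'semiannual', 'annual', 'biennial']

def rate_of_change(t_years, conc):
    """First-order least-squares slope of concentration versus time."""
    return float(np.polyfit(t_years, conc, 1)[0])

def provisional_frequency(rate, trend, bands=(10.0, 30.0)):
    """Assumed reading of Figure 6.4: high rates are sampled often, and
    increasing or uncertain trends ('I', 'PI', 'NT') are treated more
    conservatively than stable or decreasing ones ('S', 'PD', 'D')."""
    low, high = bands
    rising = trend in ('I', 'PI', 'NT')
    if abs(rate) >= high:
        return 'quarterly'
    if abs(rate) >= low:
        return 'quarterly' if rising else 'semiannual'
    return 'semiannual' if rising else 'annual'

def relax_for_risk(freq, recent_max, mcl):
    """Reduce the frequency by one level when the recent maximum
    concentration is below one half of the MCL."""
    if recent_max < 0.5 * mcl and freq != FREQUENCIES[-1]:
        return FREQUENCIES[FREQUENCIES.index(freq) + 1]
    return freq

# A medium, falling rate with a decreasing trend and low recent maxima:
freq = provisional_frequency(-12.0, 'D')              # -> 'semiannual'
print(relax_for_risk(freq, recent_max=2.0, mcl=5.0))  # -> 'annual'
```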


Figure 6.4: Decision matrix for determining the provisional sampling frequency in the modified CES method. [Figure: a matrix of rate of change (low, low-medium, medium, medium-high, high) against Mann–Kendall trend (I, increasing; PI, probably increasing; NT, no trend; S, stable; PD, probably decreasing; D, decreasing); cells assign Q (quarterly), S (semiannual) or A (annual) sampling.]

These decision elements are similar to those cited by US EPA (1999). The proposed sampling frequency through the improved decision rules can be either less frequent or more frequent than the existing frequency, and may vary with each future re-evaluation as more monitoring data become available. The modified CES method has been incorporated into the MAROS software (Aziz et al., 2003a; Ling et al., 2004) and has been applied in many environmental monitoring programmes.

6.2.5

Comprehensive assessment

Figure 6.5: The decision process for the modified CES method. [Figure: flowchart. Determine two provisional frequencies (the recent frequency and the overall frequency); if the overall frequency is more frequent than the recent frequency, adjust the recent frequency using specific decision rules; the (adjusted) recent frequency becomes the new provisional frequency; if the maximum concentration in the recent data is less than half of the MCL, adjust the provisional frequency using specific decision rules; the (adjusted) provisional frequency is the final frequency.]

Comprehensive assessment is the process of reviewing and refining the results from the above technical analyses by applying professional judgement and taking into account non-technical factors. In the comprehensive assessment, recommendations from the technical components of the sampling plan evaluation are either accepted, rejected or revised to produce final recommendations. Some factors to be considered include sufficiency of data, regulatory requirements and field applicability. For example, implementing a recommended sampling plan may be more challenging at sites with active industrial operations than at those without, and recommendations from purely technical analyses may not meet certain regulatory needs. Site-specific hydraulic and hydrostratigraphic conditions also need to be considered, including the location of a monitoring well relative to other wells and potential receptors, hydraulic and geologic conditions proximate to a well, ground water flow or seepage velocity, recent concentrations and concentration trends, and the possible presence of continuing sources. Based on the technical approach described, the final conclusions and recommendations result from technically sound statistical methodologies and critical professional judgement. The recommendations may also include contingency plans for cases where changing conditions are observed.

6.3

SITE APPLICATIONS

6.3.1

Trend analysis for a former chemical packaging facility

The Mann–Kendall trend analysis was conducted to provide evidence to support the site closure request for a former chemical packaging facility in the western USA.


Contamination was reported in the early 1990s in the ground water and soil underlying the site. To remediate the site, active remedial activities as well as monitored natural attenuation have been conducted. The remedial measures have successfully mitigated the contamination and the request for site closure was submitted to the regulatory agency. From the standpoint of sampling plan evaluation, site closure is equivalent to phase-out or termination of the entire monitoring programme. In response to concerns regarding chloroform attenuation at the site, an additional investigation was conducted in early 2009 to provide further evidence to support site closure. Six selected monitoring wells (MW-1 to MW-6), located in different areas of the site, were sampled and their concentration trends were analysed.

Chloroform concentration data from the 1990s to 2009 were used in the Mann–Kendall trend analysis. Results from the trend analysis indicated that all monitoring wells except MW-5 have a decreasing trend in concentration (Table 6.1). The Mann–Kendall statistic for the wells with a decreasing trend ranges from −20 to −1609 and the corresponding confidence levels range from 95.5% to 100%. This is solid evidence that chloroform attenuation through processes such as degradation and dilution is effectively occurring at the site.

Although the Mann–Kendall trend analysis indicated an increasing trend at MW-5, chloroform concentrations at this location have been below 5 µg/L since 1997 and have shown a decreasing tendency since September 2005 (Figure 6.6). Since there are only four records after September 2005, a decreasing trend from the Mann–Kendall trend analysis could not be reliably concluded. MW-5 is a downgradient well located approximately 1200 feet from the source area. The fluctuation of concentrations observed at MW-5 suggested possible contributions from other sources. A sample collected from a potable water faucet located near MW-5 in March 2005 showed an elevated concentration of chloroform. The use of potable water in the vicinity and the resulting infiltration into the aquifer make it a potential source of chloroform to ground water.

Based on the above analyses, it was concluded that chloroform is attenuating effectively at the site. A report detailing this trend analysis, with other supporting evidence, was submitted to the regulatory agency as a supplementary document to obtain approval.

Table 6.1: Mann–Kendall trend analysis based on data from 1985 to 2009.

Well    Number of samples    Number of detects    Coefficient of variation    Mann–Kendall statistic    Confidence in trend    Concentration trend
MW-1    19                   19                   0.49                        −91                       99.9%                  D
MW-2    23                   23                   0.35                        −139                      100%                   D
MW-3    21                   21                   0.59                        −81                       99.3%                  D
MW-4    91                   91                   3.88                        −1609                     100%                   D
MW-5    21                   19                   0.40                        81                        99.3%                  I
MW-6    10                   8                    0.62                        −20                       95.5%                  D

Note: D, decreasing; PD, probably decreasing; S, stable; NT, no trend; I, increasing; PI, probably increasing.

Figure 6.6: Chloroform concentrations over time at monitoring well MW-5. [Figure: time series from December 1996 to October 2008; concentrations (µg/L) plotted on a 0 to 10 scale.]

6.3.2

Sampling location evaluation for an industrial site

Selection of sampling locations for future monitoring was needed at an industrial site in the central USA contaminated with chlorinated solvents. The contaminants of greatest concern at the site are trichloroethene (TCE) and its degradation product cis-1,2-dichloroethene (cis-1,2-DCE). Release of TCE to the ground water was believed to originate from waste drums buried on-site several decades ago. This resulted in a ground water plume approximately 1500 feet long migrating into off-site areas. Numerous monitoring wells have been installed since the 1990s and various remedial measures have been taken. The waste drums were removed and contaminated soils were excavated to the extent practicable. Owing to the fractured bedrock geological setting in this region, remedial measures were limited to hydraulic containment and control of the ground water plume. On-site ground water extraction was initiated in 2004 and off-site extraction was started a few years later. An evaluation was conducted to explore the potential of having a reduced well network for future monitoring. The evaluation included the following analyses and considerations.

• Sampling location analysis was conducted with the Delaunay method in the MAROS software, using TCE and cis-1,2-DCE concentration data between 1990 and 2008 from a network of 63 monitoring wells (Figure 6.7). A recommendation was made for the removal of 12 locations, which had the highest level of redundancy over the entire monitoring history.


• The same sampling location analysis was conducted as described above, but with data between 2004 and 2008. A recommendation was made for the removal of 17 locations due to plume stability after the extraction system was put into operation.
• The Mann–Kendall trend analysis was carried out with TCE and cis-1,2-DCE concentration data between 1990 and 2008 for each monitoring well. The results reflected the overall concentration trend at each location. The total number of decreasing, probably decreasing, and stable trends was essentially the same as the total number of no trend, probably increasing, and increasing trends.
• The Mann–Kendall trend analysis was carried out with TCE and cis-1,2-DCE concentration data between 2004 and 2008 for each monitoring well. The results were more representative of the current and future conditions at the site than those using data between 1990 and 2008. There were more decreasing, probably decreasing, and stable trends than no trend, probably increasing, and increasing trends.
• The location of each monitoring well relative to the dissolved plumes, and its screen intervals relative to site stratigraphic units, were recorded. This information allowed a qualitative evaluation of the importance of a location. For example, a location serving as a sentry well at the downgradient edge of the dissolved plume will be kept for sampling even if removal is recommended by the Delaunay method.

Balancing the above factors and considering that the data after 2004 would yield more representative future plume trends, the final recommendation was to remove 14 monitoring wells from future sampling (Figure 6.7). It was also recommended that periodic re-evaluation (e.g. once every three years) be conducted as the site and plume conditions change over time.

6.3.3

Comprehensive evaluation for a former bulk fuel terminal site

A comprehensive evaluation of a monitoring plan was conducted for a former bulk fuel terminal site on the west coast of the USA. Extensive ground water monitoring was conducted as part of a remedial investigation in response to the presence of petroleum hydrocarbon contamination in the subsurface. The objective of this task was to evaluate and optimise the existing ground water monitoring programme at the site. All aspects of the evaluation approach introduced in this chapter were used. All statistical analyses, except the sign test, were conducted using the MAROS software. These statistical analyses produced preliminary recommendations regarding removal of sampling locations and adjustment of sampling frequencies. The last component of the evaluation (i.e. comprehensive assessment) refined the preliminary recommendations by considering non-technical and site-specific factors to produce final recommendations. The following sections describe in detail the site information and the results of the evaluation.


Figure 6.7: Selection of sampling locations for future monitoring. Triangles represent wells to be sampled and crosses represent wells to be removed. [Site map with north arrow and a scale bar from 0 to 600 feet.]

Site information and existing ground water monitoring programme

The site is relatively flat and entirely paved with asphalt concrete. A harbour borders the site to the west. The site was a fuel terminal from the 1920s to the 1970s. Operations included storage of refined petroleum in aboveground and underground storage tanks. Products stored in these tanks included diesel, gasoline, gasoline additives, heating oil and other oil products. In the 1970s and 1980s, the fuel terminal facilities were dismantled and the site was repaved for construction of cargo terminals.

Shallow geology at the site from top (youngest) down includes an anthropogenic fill, an aquitard, and a deeper aquifer. The fill unit consists predominantly of fine to medium sand underlying the pavement to a depth of about 14 to 21 feet. The water table surface is located within the fill. The average hydraulic conductivity of the fill material is approximately 3 feet/day. The aquitard consists of dark organic marine clay, silty clay, and isolated lenses of sand and silt. The thickness of this unit ranges from approximately 0.5 to 7 feet. The hydraulic conductivity of the aquitard is
significantly less than the units above and below it. The deeper aquifer is approximately 50 feet thick beneath the site and is predominantly sand. The average hydraulic conductivity of this unit is about 3 feet/day. Ground water in the deeper aquifer is semi-confined. Depth to ground water in the fill generally ranges from approximately 5 to 10 feet below the ground surface. Ground water flow in the fill is generally west toward the harbour at an average hydraulic gradient of approximately 0.001, with localised areas where the gradient may be up to 0.003. The vertical gradient between the fill and the upper portion of the deeper aquifer is generally downward. Seasonal fluctuations of ground water levels are usually less than 1 foot except for those monitoring wells under tidal influence or artificial stresses. Measurable liquid phase hydrocarbons (i.e. thickness greater than 0.01 foot) were last observed in February 2004 in two monitoring wells.

Various investigation and remedial activities have been conducted since 1979. The current investigation and remedial programme was initiated in 2003, which led to the installation and operation of a vapour extraction and air sparging (VE/AS) system. The VE/AS system was designed to remove flammable or toxic gases from the soil vapour, and to reduce concentrations of constituents of potential concern (COPCs) in ground water. The primary COPCs at the site include benzene, gasoline-range organics (GRO), and diesel-range organics (DRO). Prior to VE/AS operation, much of the fill was impacted by dissolved benzene, GRO, and DRO at concentrations greater than site-specific action levels (Figure 6.8(a) – see colour insert). VE/AS operation has reduced both the spatial extent and magnitude of the dissolved constituents in the fill (Figure 6.8(b) – see colour insert). Concentrations of dissolved benzene and GRO have decreased significantly as a result of VE/AS operation: the number of monitoring wells with concentrations exceeding the action levels has been reduced by over 70%, and the percentage of non-detects exceeds 50% in some sampling events. The spatial extent and concentrations of DRO have also decreased, but not to the extent of the more volatile benzene and GRO. Vertically, COPC concentrations are greater in the fill than in the underlying water-bearing units.

The existing monitoring well network includes 52 monitoring wells. Forty-three of them are less than 20 feet deep and screened across the water table within the fill overlying the aquitard. The other nine monitoring wells are deeper than 20 feet and screened in the deeper aquifer beneath the aquitard. The monitoring well network includes four sets of nested monitoring wells. At each well cluster, one monitoring well is screened across the water table within the fill, one is screened within the upper portion of the deeper aquifer, and one is screened within the lower portion of the deeper aquifer. Ground water monitoring was performed inconsistently prior to 2003. A regular monitoring programme has been implemented since 2003: monitoring was conducted quarterly from the fourth quarter of 2003 through 2004 and has been conducted semiannually since. All 52 wells are included in the sampling plan for each monitoring event.


Individual location and site-wide trend analysis

Concentration trends of the primary COPCs for each monitoring well were evaluated with the Mann–Kendall trend analysis. Box–whisker plots were produced for the entire data sets from different sampling events to reveal the overall concentration trends, and sign tests were conducted to evaluate if the differences between the sampling events were statistically significant. Data from as early as 2000 were utilised in the analysis. Duplicates were averaged and concentrations reported as non-detects were quantified as half of their respective reporting limits.

The Mann–Kendall trend analysis results are provided in Table 6.2. Concentrations of benzene, GRO and DRO in the majority of monitoring wells were decreasing, probably decreasing or stable. For the two monitoring wells identified as having a probably increasing trend for benzene, all reported concentrations were less than the reporting limits. For 13 of the 21 monitoring wells with a stable trend or no trend for GRO, all, or all but one, of the reported concentrations were less than the reporting limits. The only increasing trend for DRO was caused by a couple of abnormally elevated concentrations reported since the second half of 2005.

The box–whisker plots for the period from the first half of 2004 to the first half of 2008 indicated site-wide decreasing trends for benzene, GRO, and DRO (Figure 6.9). In terms of the upper quartile (i.e. the 75th percentile) of the concentration distribution, benzene concentrations have decreased by more than two orders of magnitude, GRO by one and a half orders of magnitude, and DRO by one order of magnitude. In addition, the range of concentrations narrowed for most sampling events over time. Owing to the large percentage of non-detects, the sign test was performed with a significance level of 5% for the consecutive sampling events compared in the box–whisker plots. For benzene, decreasing trends were confirmed for all the consecutive sampling events except from the first half of 2005 to the first half of 2006. For GRO, decreasing trends were confirmed for all the consecutive sampling events except from the first half of 2007 to the first half of 2008. For DRO, the significant decreasing trend was only confirmed from the first half of 2005 to the first half of 2006. However, the sign test was significant from the first half of 2004 to the first half of 2008, confirming a decreasing trend over time.

Table 6.2: Former bulk fuel terminal – Mann–Kendall trend analysis results.

Trend                        Benzene    GRO    DRO
Decreasing trend             30         22     17
Probably decreasing trend    5          8      9
Stable trend                 6          10     11
No trend                     9          11     14
Probably increasing trend    2*         1*     0
Increasing trend             0          0      1

* The 'probably increasing trend' is the result of increased reporting limits after 2003; all results for these wells since 2000 are less than the reporting limits.


Figure 6.9: Box–whisker plots for (a) benzene, (b) GRO and (c) DRO. [Figure: three panels of concentrations (µg/L, logarithmic scale) for the first halves of 2004 through 2008.]


The mostly decreasing and stable trends of concentrations in individual monitoring wells and the site-wide decrease in concentration levels are clearly supportive of the effectiveness of the remedial measures and that the contaminant plumes are shrinking or stable. Sampling locations and frequency The analyses of sampling locations and frequency were conducted with the Delaunay method and the modified CES method, respectively, for each of the primary COPCs. The sampling location analysis was applied to the 43 monitoring wells screened in the fill unit. Concentration data were grouped by sampling events and averaging was taken if there were multiple results at one location. The default parameters in the MAROS software were used for the Delaunay method, that is a location can be removed only when its slope factor is less than 0.1 and the area ratio and concentration ratio are greater than 0.95 (equivalent to an information loss of less than 5%). Besides, many key monitoring wells (e.g. the downgradient peripheral wells) were set not to be removed even if the algorithm determines so. The sampling frequency analysis was applied to all monitoring wells using data since 2004, when the VE/AS system started in full operation. The default parameters of the modified CES method, that is the rate of change parameters based on the action levels, were used in the analysis. Results of sampling location analysis suggest that ineffective monitoring wells exist and can be removed. Four to seven monitoring wells are found redundant depending on the COPCs: • • •

• benzene: MW-14, MW-15, MW-24, MW-26, MW-52, MW-53 and MW-55;
• GRO: MW-14, MW-24, MW-26, MW-52, MW-53 and MW-55;
• DRO: MW-15, MW-41, MW-42A and MW-47.
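The spatial redundancy idea behind the Delaunay method can be illustrated with a leave-one-out interpolation test: triangulate the remaining wells, estimate the concentration at the candidate well from its neighbours, and compare the estimate with the measurement. The sketch below is a simplified stand-in for the MAROS algorithm, not the algorithm itself; the slope-factor formula is only one common form of such a comparison, and the coordinates and concentrations are invented.

```python
# Hedged illustration of a leave-one-out spatial redundancy screen in the
# spirit of the Delaunay method (uses scipy's Delaunay-based interpolator).
import numpy as np
from scipy.interpolate import LinearNDInterpolator

# Hypothetical well coordinates (m) and benzene concentrations (ug/L).
xy = np.array([[0, 0], [100, 10], [50, 90], [160, 70], [90, 40]], float)
conc = np.array([120.0, 35.0, 60.0, 5.0, 48.0])

def slope_factor(i):
    """Leave well i out, interpolate its concentration from the rest, and
    return |estimated - measured| / (estimated + measured) in [0, 1]."""
    keep = np.arange(len(conc)) != i
    interp = LinearNDInterpolator(xy[keep], conc[keep])
    est = float(np.asarray(interp(xy[i])).ravel()[0])
    if np.isnan(est):            # well lies outside the triangulated hull:
        return 1.0               # it cannot be reproduced, so keep it
    return abs(est - conc[i]) / (est + conc[i]) if (est + conc[i]) > 0 else 0.0

for i in range(len(conc)):
    sf = slope_factor(i)
    verdict = 'candidate for removal' if sf < 0.1 else 'retain'
    print(f'well {i}: slope factor = {sf:.3f} -> {verdict}')
```

In the MAROS procedure this spatial test is combined with the area-ratio and concentration-ratio checks quoted above before a well is flagged as redundant.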

Results of the sampling frequency analysis suggest that the sampling frequency at most monitoring well locations can be reduced to biennial or annual (Table 6.3). Of the seven monitoring wells recommended for quarterly sampling, three were flagged because of abnormally elevated DRO concentrations, one because of a limited number of samples, and the other three because of fluctuating concentrations. The recommendation of less frequent sampling for most monitoring wells is consistent with the observation that the primary COPC plumes at the site are shrinking or stable.

Table 6.3: Former bulk fuel terminal – sampling frequency analysis results.

Frequency     Benzene   GRO   DRO
Biennial           32    26    12
Annual             16    23    33
Semiannual          2     1     4
Quarterly           2     2     3


Comprehensive assessment

Results of the above analyses indicate that some monitoring wells can be removed from the monitoring network, and that the frequency of sampling can be reduced, without impairing the technical integrity of the monitoring programme. These preliminary results were further evaluated in the comprehensive assessment by considering factors such as proximity to sewers, regulatory requirements and the potential disruption of cargo operations.

For each monitoring well considered redundant for one or more COPCs, further judgement was applied to arrive at a final recommendation. For example, the results of the Delaunay method suggested removal of MW-14 for benzene and GRO but not for DRO. However, all reported DRO concentrations are less than the action level; in addition, this well is not near a source area and is surrounded by other monitoring wells. Therefore, the final recommendation for MW-14 is to remove it from the monitoring network. As another example, monitoring well MW-42A is recommended for removal for DRO but not for the other COPCs, and the final recommendation is to retain this well to monitor all COPCs. Repeating this evaluation process, removal of six monitoring wells from the network is justified.

Results of the modified CES analysis suggest that 45 to 49 monitoring wells be sampled biennially or annually, depending on the COPC. Although different sampling frequencies for different COPCs may be suggested for a given well, the final recommendation is to sample each well at the most conservative frequency over all COPCs. For example, if annual sampling is suggested for benzene and GRO but semiannual for DRO, the final recommendation is to conduct semiannual sampling for all three COPCs. In addition, all biennial frequencies are adjusted to annual in the final recommendation to be consistent with general regulatory practice.

Monitoring wells suggested for quarterly sampling are recommended for semiannual sampling after further evaluation. For instance, the preliminary suggestion of quarterly sampling for MW-54 was caused by a couple of abnormally elevated DRO concentrations; absent the abnormal results, the modified CES analysis would have indicated less frequent sampling. In addition, the results of the Mann–Kendall trend analysis indicate no apparent DRO trend at this location. Semiannual sampling for MW-54 is therefore considered conservative. Similar judgement calls were applied to other wells to conclude that semiannual sampling was sufficiently conservative even where quarterly sampling was suggested by the modified CES method.

For the 46 monitoring wells remaining after the recommended removal of six wells, the final recommendation is that 34 wells be sampled annually and 12 wells semiannually (Table 6.4). It is also recommended that a re-evaluation be conducted when significant changes in site and/or contamination conditions occur.

Table 6.4: Monitoring programme optimisation final recommendations.

Well ID    Sampling schedule    Well ID     Sampling schedule
MW-13      Annual               MW-51       Annual
MW-14      Remove               MW-52       Annual
MW-15      Remove               MW-53       Remove
MW-16      Annual               MW-54       Semiannual
MW-17      Semiannual           MW-55       Remove
MW-18      Annual               MW-56       Annual
MW-24      Annual               MW-57       Annual
MW-25      Annual               MW-58       Annual
MW-26      Annual               MW-59       Annual
MW-34      Annual               MW-60       Annual
MW-35      Annual               MW-61       Annual
MW-36      Semiannual           MW-62       Semiannual
MW-37A     Annual               MW-A1       Annual
MW-38      Annual               MW-A2       Annual
MW-39      Semiannual           MW-A3       Annual
MW-40      Annual               TW-1BB      Semiannual
MW-41      Remove               MW-2B       Annual
MW-42A     Semiannual           MW-37B      Annual
MW-43      Annual               MW-37C      Annual
MW-44      Annual               MW-42B      Semiannual
MW-45      Semiannual           MW-42C      Semiannual
MW-46A     Annual               MW-46B      Annual
MW-47      Remove               MW-46C      Annual
MW-48A     Semiannual           MW-48B      Annual
MW-49      Annual               MW-48C      Annual
MW-50      Annual               MW-MG3A     Semiannual

6.4 SUMMARY



A multi-component approach for the evaluation of existing environmental sampling plans was introduced here. This evaluation approach integrates statistical analyses for concentration trend, sampling location and sampling frequency with a comprehensive assessment of other factors to optimise a sampling plan. Statistical procedures that are robust and widely used, as well as methods specifically designed for such evaluation, were introduced. Three case applications in environmental remediation were presented to demonstrate the usefulness and applicability of the evaluation approach and of each statistical method. The framework of this evaluation approach can readily be supplemented with other statistical methods and adapted to environmental data of different characteristics for the evaluation of other environmental sampling plans.

REFERENCES

Aziz, J.J., Ling, M., Rifai, H.S., Newell, C.J. and Gonzales, J.R. (2003a). MAROS: A decision support system for optimizing monitoring plans. Ground Water. 41, 3: 355–367.
Aziz, J.J., Newell, C.J., Ling, M., Rifai, H.S. and Gonzales, J.R. (2003b). Monitoring and Remediation Optimization System (MAROS) 2.0 Software User's Guide. Developed for Air Force Center for Environmental Excellence (AFCEE), Brooks AFB, San Antonio, TX. Available from www.afcee.brooks.af.mil/products/rpo/ltm.asp.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York.
Hollander, M. and Wolfe, D.A. (1973). Nonparametric Statistical Methods. Wiley, New York.
Isaaks, E.H. and Srivastava, R.M. (1989). An Introduction to Applied Geostatistics. Oxford University Press, New York.
Johnson, V., Tuckfield, R.C., Ridley, M.N. and Anderson, R.A. (1996). Reducing the sampling frequency of ground water monitoring wells. Environmental Science and Technology. 30, 1: 355–358.
Ling, M. (2003). Optimizing Existing Long-term Ground Water Monitoring Plans with Innovative Methods. PhD thesis, University of Houston, Houston, Texas, USA.
Ling, M., Rifai, H.S., Aziz, J.J., Newell, C.J., Gonzales, J.R. and Santillan, J.M. (2004). Strategies and decision-support tools for optimizing long-term groundwater monitoring plans – MAROS 2.0. Bioremediation Journal. 8, 3–4: 109–128.
Ling, M., Rifai, H.S. and Newell, C.J. (2005). Optimizing groundwater long-term monitoring networks using Delaunay triangulation spatial analysis techniques. Environmetrics. 16, 6: 635–657.
Ling, M., Johnson, J. and Twiford, J. (2006). Ground water monitoring program optimization at an active industrial terminal. Proceedings of the 2006 Petroleum Hydrocarbons and Organic Chemicals in Ground Water: Prevention, Assessment, and Remediation Conference, 6–7 November 2006, Houston, Texas. National Ground Water Association (NGWA), Westerville, Ohio, pp. 330–343.
Newell, C.J., Cowie, I., McGuire, T.M. and McNab, W.W. Jr (2006). Multiyear temporal changes in chlorinated solvent concentrations at 23 monitored natural attenuation sites. Journal of Environmental Engineering. 132, 6: 653–663.
Okabe, A., Boots, B. and Sugihara, K. (1992). Spatial Tessellations: Concepts and Applications of Voronoi Diagrams. Wiley, New York.
US EPA (US Environmental Protection Agency). (1999). Use of Monitored Natural Attenuation at Superfund, RCRA Corrective Action, and Underground Storage Tank Sites. Directive 9200.4-17P, Final Draft.
US EPA (US Environmental Protection Agency). (2004). Demonstration of Two Long-Term Groundwater Monitoring Optimization Approaches. Office of Solid Waste and Emergency Response, September 2004, EPA 542-R-04-001a.
US EPA (US Environmental Protection Agency). (2006). Data Quality Assessment: Statistical Methods for Practitioners – EPA QA/G-9S. Office of Environmental Information, EPA/240/B-06/003.

CHAPTER 7

Statistical Accounting for Uncertainty in Modelling Transport in Environmental Systems

James Weaver, Jordan Ferguson, Matthew Small, Biplab Mukherjee and Fred Tillman

7.1 INTRODUCTION

Models are frequently used to predict the future extent of ground-water contamination, using estimates of hydraulic conductivity, porosity, hydraulic gradient, biodegradation rate and other parameters. Often, the best estimates or averages of these parameters are used as model inputs, and the model transforms them into output concentrations, which are presumed themselves to be best estimates. Despite this apparent certainty, all properties of the subsurface are actually uncertain because of imperfect measurement methods, and are subject to point-to-point variability because of geological heterogeneity. Simple models generally neglect the effects of geological heterogeneity, so a specific model may not be capable of representing the physical setting. Ignoring such a mismatch between model capability and natural system characteristics introduces uncertainty in model results.

The US Environmental Protection Agency (US EPA, 2009) summarised uncertainty caused by data uncertainty and model uncertainty and showed, conceptually, that the two are linked (Figure 7.1). As the model is made more complicated (e.g. more information on heterogeneity is included, or more processes are included), the model uncertainty decreases. At the same time the data uncertainty increases, because it is difficult to obtain the necessary additional data at all locations. Conceptually, the minimum level of uncertainty is balanced somewhere between the endpoints. Minimum uncertainty may also be related to characteristics of the problem (US EPA, 2009), as some processes may dominate model behaviour at certain scales. For example, near-field air quality simulations depend more on building wakes, release conditions and size, turbulent diffusion rates, and land use effects than they do on chemical reactions.


Figure 7.1: Hypothetical relationship between uncertainty and model complexity (US EPA, 2009), which relates decreasing uncertainty in the model to a corresponding increase in uncertainty in the data required to apply an increasingly complex model. [The plot shows total uncertainty, data uncertainty and model uncertainty against model complexity, with a minimum total uncertainty between the endpoints.]

Conversely, far-field simulations depend more on atmospheric chemistry (US EPA, 2009). Thus, the selection of appropriate model capabilities can reduce overall uncertainty.

Models commonly are viewed as useful tools for understanding contaminant transport (Oreskes et al., 1994) and for determining future risk (ASTM, 1995). Yet despite more than 50 years of work in developing and testing, the degree of predictive capability of subsurface transport models has not been established (e.g. Eggleston and Rojstaczer, 2000; Miller and Gray, 2002). Models are better suited to providing a framework for understanding transport than to making inerrant predictions of future exposure and risk. Commonly, models are calibrated to field data to demonstrate their ability to reproduce contaminant behaviour at a site. This process implies a degree of correctness in understanding and provides the first step in demonstrating predictive ability.

The term 'predictive error' is used to describe the fact that model predictions will not be accurate. Specifically, predictive error describes variability that may exist about a 'best-estimate' prediction, and is calculated using methods based upon error propagation theory. Where models are calibrated, the idea of minimum predictive error applies (Moore and Doherty, 2006; Tonkin and Dougherty, 2009), where it has been shown that although calibration can be used to seek a minimum-error-variance parameter set, additional variability exists that is beyond identification through calibration.


At best, a family of parameter sets can be found by calibration-constrained Monte Carlo analyses (Tonkin and Dougherty, 2009), each of which is a plausible driver for the observed model output. At the opposite extreme – screening sites for additional effort – sufficient data may not have been collected for calibration.

How, then, should models be used in situations where they cannot or will not be calibrated? This is a major issue in assessing the impact of future changes in environmental policies. Brownfield redevelopment provides a prime example: there is no alternative to using models to estimate the potential for contaminated vapour ingress (called vapour intrusion) into yet-to-be-built structures and to determine the amount of clean-up still needed at the site. Impacts of ethanol use as a major fuel component, siting of fuelling stations in public water supply capture zones, and projected land use and climate changes are other examples of future predictions that must rely on uncalibrated models. Necessarily, results from these analyses have an inherent uncertainty which should be accounted for in policy decisions.

Part of the uncertainty follows from the nature of environmental models, which have an inherent empirical basis. Generally, they can be considered to be composed of two parts: relationships between physical, chemical and/or biological quantities; and empirical coefficients describing specific situations. A familiar example from groundwater illustrates this character. Darcy's law describes the flow of water in a porous material. Developed in the 1850s from experimentation with sand filters, it relates the flow through the porous material, q (L/T), to the hydraulic head gradient, dh/dl (unitless), through an empirical coefficient called the hydraulic conductivity, K (L/T):

$$ q = -K \frac{dh}{dl} \tag{7.1} $$

The flow of water in response to a gradient in head can be determined from this relationship only if the value of the empirical coefficient is known. Hydraulic conductivity can range from 10⁻¹² m/day for dense crystalline rock to 10⁴ m/day for gravel (Charbeneau, 2000). Lack of knowledge of the value of the empirical coefficient is equivalent to lack of knowledge of the flow, as Darcy's law itself does not contain that information. Every transport principle discussed below shares this dependence on empirical coefficients.

Extensive study of the fundamental scientific nature of models shows that (Oreskes, 2003):

• accurate prediction is achievable only in special cases, usually characterised by small numbers of parameters or by naturally repetitive systems (e.g. planetary motion);
• accurate prediction follows from iterative correction of models, based on discrepancies between observations and predictions;
• accurate prediction can follow from faulty conceptual models.

In The Discovery of Neptune, Morton Grosser (1962) described the sequence of events that led to the discovery of Neptune. A known perturbation of the orbit of Uranus suggested the presence of an outer planet. Before computers and numerical models existed, complex mathematical expressions for planetary orbits served as predictive tools.


Each time a new model was developed, its predictions were used to try to observe the postulated planet. Up to the point when Neptune was at last discovered, there was a sequence of modelling attempts, followed by observations, which served as motivation to improve the predictions. Oreskes (2003) points out that some problems, like planetary motion, are more amenable to a prediction-coupled-with-observation paradigm.

A commonly seen example – weather forecasting – points to a way forward from this less-than-promising situation. Because experience shows that a specific, quantitative prediction of future weather is difficult to the point of being beyond possibility, weather forecasts are given in terms of possibilities, probabilities and ranges, as in:

    Sun and clouds mixed with a slight chance of thunderstorms during the afternoon. Humid. High 89F. Winds W at 10 to 20 mph. Chance of rain 30%. (Italic emphasis added.)

Not only is the forecast given in these terms, it is received and accepted with all its uncertainties. Such forecasts are based on scientific understanding of climate and weather, a large monitoring network, models and interpretation of their results, and presentation of the results.

This chapter presents background on subsurface transport models and outlines the sources of uncertainty in their application. With this knowledge as background, three examples of uncertainty analysis are presented to address the uncertainties in predictions from forward-looking model applications. First is the application of a simplified transport model for subsurface contamination. Simple models obviously appeal because of their ease of use, perceived low cost and acceptance by certain environmental programmes. Largely unappreciated, though, is the range of plausible outcomes that results from recognising that their input parameters are not known precisely. In the second example, vapour intrusion of gaseous contaminants from the ground to residential indoor air is similarly examined, but the uncertainty analysis is used to generate generic bounding parameter sets. If these parameter sets exist, and the model applies to a specific residence, the sets can be used to make environmental decisions based on a scientifically established worst-case analysis. Alternatively, averaged parameter inputs are often used to represent an analyst's best estimate of parameter values; the latter does not, however, guarantee environmental protection. The final example again considers subsurface transport in groundwater. Here, spatially variable groundwater velocity is considered, as are probabilities associated with the variable inputs. This extends the analysis to allow a weather-forecast-like prediction, in which the prediction can be given in terms of possibilities, probabilities and ranges.

7.2 MODEL BACKGROUND

Subsurface transport models are based on mass conservation of water and other chemical constituents, which usually are present in concentrations low enough that they do not affect the flow of the water. Thus, mass conservation principles are applied separately to the flow of water and the transport of a chemical.


Both simple and complex models are based on this fundamental paradigm.

The simplest transport models are based on an analytical solution of the transport equation, which results from exact solution of the transport equation under restrictive assumptions: homogeneous aquifer; one-dimensional, steady-state, uniform groundwater flow; vertical and lateral transport by dispersion only. In other words, the aquifer's properties must be assumed not to change from point to point, and groundwater flow goes in only one direction (one-dimensional), does not change with time (steady state) and does not vary from point to point (uniform). Since groundwater flow is restricted to one dimension, transport laterally and vertically must be driven only by dispersion. These restrictive assumptions are necessary to find an analytical solution to the equation, and result less from assessing the physical nature of the aquifer than from the historical, mathematical approach to obtaining a solution. Furthermore, no analytical transport solution is known for a general flow field (i.e. where velocities vary because of pumping, irregular boundaries, transient flow, etc.). Commonly known examples of these models include the approximate Domenico models (Domenico, 1987¹; Domenico and Robbins, 1985), AT123D (Yeh, 1981), BIOSCREEN (Newell et al., 1996), BIOCHLOR (Aziz et al., 2000), the ASTM RBCA tier 2 models (ASTM, 1995) and the EPA Soil Screening Criteria (US EPA, 1996).

Because all aquifers are heterogeneous, the applicability of analytical solutions to the transport equation would appear to be very limited. Their simplicity, however, facilitates their usage, perhaps even beyond their true scientific limits. Some aquifers, such as the 'sand box' aquifers (Long Island, Southern New Jersey, Florida, Cape Cod, Central Michigan; see, for example, Olcott (1995)), come closest to meeting the requirements for homogeneity. In other cases, applying analytical solutions adds to uncertainty in model results. A mismatch between model assumptions and the field site may be folded into the value of a fitted parameter during calibration. The resulting fitted parameter no longer represents the physical parameter alone, but also the mismatch in model assumptions.

Approaches in which both groundwater velocity and contaminant distributions are allowed to vary in time and from point to point greatly increase the flexibility of subsurface models. In these, the mass conservation principle leads to a governing equation for water and a separate equation for the contaminant. Because of the general limitations of analytical transport equations, and to provide the most flexibility, these equations are solved by numerical approximations to the governing equations.² Given these two major approaches to modelling transport – analytical and numerical – an application requires determination of the appropriate modelling approach, based on the site conceptual model, values of parameters for groundwater flow and contaminant transport, and uncertainty of model outputs.

¹ For analysis of the approximation in the Domenico models see Domenico and Schwartz (1998), Srinivasan et al. (2007) and West et al. (2007).
² Hybrids and exceptions abound: analytical element models for groundwater flow allow analytical solutions to be patched together to form irregular domains with spatially variable flow patterns, and particle tracking models use the groundwater velocity pattern to approximate contaminant transport, to cite two examples.


For homogeneous geologic media, the standard mathematical statement of the transport equation is

$$ R\frac{\partial c}{\partial t} = -v_x\frac{\partial c}{\partial x} - v_y\frac{\partial c}{\partial y} - v_z\frac{\partial c}{\partial z} + D_x\frac{\partial^2 c}{\partial x^2} + D_y\frac{\partial^2 c}{\partial y^2} + D_z\frac{\partial^2 c}{\partial z^2} - \lambda c \tag{7.2} $$

where R is the retardation factor (unitless); c is concentration (M/L³); vx, vy and vz are the x-, y- and z-direction seepage velocities (L/T); Dx, Dy and Dz are the dispersion coefficients (L²/T); and λ is a first-order loss coefficient (T⁻¹). The retardation factor is defined by

$$ R = 1 + \frac{\rho_b f_{oc} K_{oc}}{\phi} \tag{7.3} $$

where ρb is the bulk density (M/L³), φ is the porosity (L³/L³), foc is the fraction organic carbon (unitless) and Koc is the organic carbon partition coefficient (L³/M). The velocities are defined by locally applying Darcy's law:

$$ v_x = -\frac{K_x}{\phi}\frac{\partial h}{\partial x}, \qquad v_y = -\frac{K_y}{\phi}\frac{\partial h}{\partial y}, \qquad v_z = -\frac{K_z}{\phi}\frac{\partial h}{\partial z} \tag{7.4} $$

where Kx, Ky and Kz are the hydraulic conductivities in the three coordinate directions (L/T), and h is the hydraulic head (L). Equation 7.2 provides a framework that relates the three-dimensional velocities, sorption (via the retardation factor), dispersion and biodegradation. The values of the coefficients Kx, Ky, Kz, φ, Dx, Dy, Dz, ρb, foc, Koc and λ determine the details of a specific application. These coefficients (also called input parameters) are of at least equal importance to the equations in determining model results.
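To make the role of these coefficients concrete, the short Python sketch below evaluates Equations 7.3 and 7.4 for a single set of parameter values; the numbers are invented for the illustration and are not taken from any site in this chapter.

```python
# Evaluate the retardation factor (Equation 7.3) and the x-direction seepage
# velocity (Equation 7.4) for one hypothetical parameter set.
rho_b = 1.7      # bulk density, kg/L (M/L^3)
phi = 0.3        # porosity (L^3/L^3)
f_oc = 0.001     # fraction organic carbon (unitless)
K_oc = 83.0      # organic carbon partition coefficient, L/kg (L^3/M)
K_x = 10.0       # hydraulic conductivity, m/day (L/T)
dh_dx = -0.001   # hydraulic head gradient (unitless)

R = 1.0 + rho_b * f_oc * K_oc / phi          # Equation 7.3
v_x = -(K_x / phi) * dh_dx                   # Equation 7.4
print(f'R = {R:.2f}, v_x = {v_x:.4f} m/day, retarded velocity = {v_x / R:.4f} m/day')
```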

7.3 PARAMETER DATA

The transport equation contains unknown parameters: groundwater velocity, the dispersion coefficient, the retardation factor and the first-order loss rate. Each parameter may be variable; obviously, the groundwater velocity varies from point to point in most cases. The retardation factor depends on porosity, bulk density, fraction organic carbon and organic carbon partition coefficient; all of these can be spatially variable, the latter owing to the variable nature of organic carbon. Biodegradation rates may vary with changing conditions that control degradation (i.e. the oxygen concentration). Lastly, the dispersion coefficient largely represents variable groundwater velocity through heterogeneous aquifer materials.


The dispersion coefficient has been found to be scale dependent and to show orders-of-magnitude variability at a given scale. Because higher velocity clearly increases the rate of spreading, the dispersion coefficient, D (L²/T), is represented as the product of the dispersivity, α (L), and the groundwater velocity, v (L/T): D = αv. Dispersivity was originally presumed to be a property of the aquifer. Gelhar et al. (1992) compiled published dispersivity data, categorised its reliability and presented plots of dispersivity versus scale (Figure 7.2). Longitudinal dispersivity was found to be scale dependent. The cause of this phenomenon is heterogeneity: as the plume expands, dispersion-induced spreading increases, because the plume encounters more velocity variations caused by heterogeneity. Figure 7.2 also shows order-of-magnitude variability at a given scale. This data tabulation does not provide precise estimates of dispersivity, but rather broad ranges of values. Although the mechanisms of dispersion are fairly well understood, this parameter is not determined at many contaminated sites. There are several methods for estimating dispersivity based on plume length, including Xu and Eckstein (1995), US EPA (1986) and ASTM (1995). Xu and Eckstein (1995) developed a regression formula for the Gelhar et al. (1992) data (curve on Figure 7.2), but the method does not represent the underlying variability.

Figure 7.2: Data showing the scale dependence of dispersivity and its order-of-magnitude variation at a given scale (Gelhar, L.W., Welty, C. and Rehfeldt, K.R. (1992). A critical review of data on field-scale dispersion in aquifers. Water Resources Research. 28, 7: 1955–1974. Reproduced by kind permission of the American Geophysical Union). [Longitudinal dispersivity is plotted against scale (m) on logarithmic axes, with points classed by reliability (high, intermediate, low) and the Xu and Eckstein regression curve shown.]


Note that these methods are based on analyses of published literature and do not represent specific sites. Gelhar et al. (1992) determined that the reliability of the majority of the published data was low, so estimates from these methods should be considered to have low site-specific reliability. Calibration, therefore, is the only way to estimate this parameter at specific sites. The first author's experience in simulating plumes on Long Island illustrates the relationship between heterogeneity and dispersivity (Weaver, 1996). The upper glacial aquifer of Long Island consists of relatively uniform sands and gravels; as a consequence, the calibrated model dispersivity was at the low end of the range developed by Gelhar et al. (1992). Using an average, or a value from a regression formula, would have given a higher value that did not represent the site.

As noted above, dispersivity represents underlying velocity variation in individual pores, in collections of pores and through aquifers. In the aquifer, velocity variation is caused by the nature of the flow patterns and by heterogeneity. Groundwater flow velocities vary because of the topographic locations of recharge and discharge areas. Recharge occurring at high elevation (mountains) sets up flow cells with discharges at low elevations (valleys). Travel times in these aquifers can vary from months for short flow paths to thousands of years for deep regional flow systems (Charbeneau, 2000). Velocities vary along various flow paths simply because of the geometry of the flow system (e.g. Domenico and Schwartz, 1998). Velocities increase where flow lines converge, especially at the discharge points. Similarly, in the capture zone model presented below, the velocities vary along the curving streamlines as water approaches the pumping well (Figure 7.3); velocities increase along flow paths that converge on the well.

When heterogeneity is considered, velocity variations become less regular and are influenced by variation in hydraulic conductivity. Heterogeneity can result from deposition of differing materials (layered), differing depositional energy (trending), or contact between alluvium and rock (contact) (Charbeneau, 2000). The resulting hydraulic conductivity can be different in different directions – called anisotropy. Hydraulic conductivity also exhibits spatial correlation, typically weaker vertically than laterally. Because transport depends in part on velocity (advection), transport is facilitated in highly conductive zones and suppressed in low-conductivity zones. Combining these effects by imposing a flow system due to topography or well pumpage drives flow through the aquifer, but favours high-conductivity zones.

Parameter data to describe flow and transport come from various sources and usually represent a mixture of site-specific measurements, estimates, literature values and user-guide-supplied defaults. When the model is calibrated to field data, calibrated parameters are added to the list. Thus, the data are heterogeneous in their origin, quality and degree of site-specific applicability. Added to these is point-to-point variability in most properties. Even a seemingly chemical-specific parameter, like the organic carbon partition coefficient, can be spatially variable because of variability in soil organic matter. Similarly, the 'forcing function' of the problem – the mass and timing of the release – is normally unknown.³
³ For almost all sites with chemical contamination, the history of the release(s) is unknown because of non-existent records, use of different chemicals at various times, and the hidden nature of some incidents.


Figure 7.3: Well and capture zone containing a group of two LUST sites (labelled A and B) and a group of four LUST sites. [Plan view with x and y coordinates in metres.]

Parameters describing the release (mass and timing), however, must be specified precisely to match the mathematical requirements of a simulation model. Such specification is akin to exact knowledge of the release, which is usually impossible to obtain. As a consequence, we are justified in treating both the parameters and the forcing functions as variables.

7.4 TRANSPORT IN UNIFORM AQUIFERS

In an example presented by Weaver et al. (2002), an analytical subsurface transport model was run using all possible combinations of the estimated ranges of the input parameters (512 combinations in total). The parameters comprise the standard set for an analytical solute transport model (van Genuchten and Alves, 1982). Since the model is transient, each parameter set's result is represented by a breakthrough curve (Figure 7.4), which is the time history of contamination at a specified receptor location. From the collection of breakthrough curves, certain extremes were plotted: the earliest and latest arrival times, the highest and lowest peak concentrations, and the longest and shortest durations of concentration above the risk threshold.


Figure 7.4: Example breakthrough curve showing the definition of peak concentration, first arrival time and duration. The average concentration is defined from this curve by the time-averaged concentration over the duration. [Concentration (µg/L) is plotted against time since release (days).]

An example breakthrough curve with these quantities illustrated is shown in Figure 7.5 (see colour insert). A fourth result is also used: the highest and lowest average concentration, c̄, defined by

$$ \bar{c} = \frac{\int_{\text{duration}} c(t)\, dt}{\int_{\text{duration}} dt} \tag{7.5} $$

where c(t) is the time-varying concentration at the receptor location (M/L³). The importance of Equation 7.5 is that a very similar quantity appears in a risk calculation, albeit based on an assumed-constant concentration (US EPA, 1989). Equation 7.5 represents the combination of mass and duration, and in every transient example examined, the parameters generating the extreme average concentration differ from those generating either the extreme peak concentration or the extreme duration.

Breakthrough curves for the extreme events show that the model produces results with widely differing qualitative character (Figure 7.5 – see colour insert). Early breakthrough is associated with high-concentration, short-duration contaminant pulses.


Late breakthrough is associated with much lower, broader curves. The breakthrough curve resulting from the average values of all parameters arrives in the middle of the distribution with a mid-level concentration. Although in one sense this represents averaged behaviour, it is clear that the averaged-parameter curve does not represent the extremes in either qualitative shape or specific quantitative results.
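The range-of-parameters experiment is easy to reproduce in outline. The sketch below enumerates every combination of low and high estimates for nine parameters (2⁹ = 512 runs) and records first arrival, peak and duration for each breakthrough curve. The one-dimensional erfc-type solution used here is a generic stand-in for the analytical model of Weaver et al. (2002), and all parameter values are invented for the illustration.

```python
# Enumerate all 2**9 = 512 low/high parameter combinations and summarise each
# breakthrough curve, in the spirit of the range-of-parameters analysis.
import itertools, math
import numpy as np
from scipy.special import erfc

def breakthrough(t, x, v, D, R, lam, c0):
    """Continuous-source 1-D advection-dispersion breakthrough (erfc form)
    with a first-order loss term applied over time; illustrative only."""
    tr = np.maximum(t / R, 1e-9)                       # retarded time
    arg = (x - v * tr) / (2.0 * np.sqrt(D * tr))
    return 0.5 * c0 * erfc(arg) * np.exp(-lam * t)

x = 100.0                                              # receptor distance, m
t = np.linspace(1.0, 4000.0, 800)                      # days
ranges = {'v': (0.05, 0.5), 'D': (0.5, 5.0), 'R': (1.0, 3.0),
          'lam': (1e-4, 1e-3), 'c0': (100.0, 1000.0)}
for k in ('p6', 'p7', 'p8', 'p9'):                     # inert placeholders so
    ranges[k] = (0.0, 1.0)                             # the count matches 512

summaries = []
for combo in itertools.product(*ranges.values()):
    p = dict(zip(ranges, combo))
    c = breakthrough(t, x, p['v'], p['D'], p['R'], p['lam'], p['c0'])
    above = c > 1.0                                    # 1 ug/L threshold
    first = t[above][0] if above.any() else math.inf
    duration = (t[1] - t[0]) * above.sum()
    summaries.append((first, c.max(), duration))
print(f'{len(summaries)} runs; earliest arrival = {min(s[0] for s in summaries):.0f} days')
```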

7.5 VAPOUR INTRUSION OF HAZARDOUS COMPOUNDS INTO INDOOR AIR

For intrusion of contaminants into residential indoor air living spaces from subsurface contamination, Tillman and Weaver (2006, 2007) performed a similar uncertainty analysis using the Johnson–Ettinger model (JEM; Johnson and Ettinger, 1991). The JEM is based on a simplified conceptualisation of the subsurface and the building, and treats each as a compartment, with steady-state flux between them. JEM results are often given in terms of the 'alpha' parameter (α), which is defined by

$$ \alpha = \frac{C_B}{C_S} = \frac{A \exp(B)}{\exp(B) + A + \dfrac{A}{C}\left[\exp(B) - 1\right]} \tag{7.6} $$

where CB is the concentration in the building (M/L³), CS is the concentration in the source (M/L³), and the dimensionless quantity A is given by

$$ A = \frac{D_T^{eff} A_B}{Q_B L_T} \tag{7.7} $$

where DTeff is the effective diffusion coefficient of the contaminant in soil (L²/T), AB is the subsurface foundation area (L²), QB is the volumetric flow rate of air in the building (L³/T) and LT is the distance from the contamination to the bottom of the foundation (L). The air flow rate in the building, QB, is broken down into the building volume, VB (L³), and the air exchange rate, EB (T⁻¹). The dimensionless quantity B is given by

$$ B = \frac{Q_s L_C}{D_C^{eff} N A_B} \tag{7.8} $$

where Qs is the soil gas flow rate into the building (L³/T), LC is the thickness of the foundation (L), DCeff is the effective diffusion coefficient for the contaminant in the crack (L²/T), and N is the crack ratio (dimensionless). The crack ratio is defined by

$$ N = \frac{A_C}{A_B} \tag{7.9} $$

where AC is the area of the crack (L²). Johnson and Ettinger (1991) assumed that the floor/wall cracks and openings are filled with dust and dirt characterised by a density, porosity and moisture content similar to those of the underlying soil, to justify equating DTeff and DCeff.


The diffusion coefficient is estimated from the Millington–Quirk relationship

$$ D_T^{eff} = D_A \frac{\theta_A^{10/3}}{\phi^2} + \frac{D_W}{H} \frac{\theta_W^{10/3}}{\phi^2} \tag{7.10} $$

where DA is the air phase diffusion coefficient (L²/T), θA is the air-filled porosity (L³/L³), φ is the porosity (L³/L³), DW is the water phase diffusion coefficient (L²/T), θW is the water content (L³/L³) and H is the Henry's Law coefficient (unitless) (see US EPA, 2004). A depth-weighted average diffusion coefficient is used to average out the effects of layering in the vadose zone (Johnson and Ettinger, 1991). The dimensionless quantity C is given by

$$ C = \frac{Q_s}{Q_B} \tag{7.11} $$
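The Millington–Quirk estimate (Equation 7.10) is often the only practical route to the effective diffusion coefficient when site measurements are absent. A minimal sketch, with all soil property values chosen only for illustration:

```python
# Effective diffusion coefficient from the Millington-Quirk relationship
# (Equation 7.10); the property values below are hypothetical.
D_A = 0.76      # air-phase diffusion coefficient, m^2/day
D_W = 8e-5      # water-phase diffusion coefficient, m^2/day
H = 0.22        # dimensionless Henry's Law coefficient
phi = 0.35      # total porosity
theta_W = 0.12  # water-filled porosity
theta_A = phi - theta_W   # air-filled porosity

D_eff = D_A * theta_A ** (10.0 / 3.0) / phi ** 2 \
        + (D_W / H) * theta_W ** (10.0 / 3.0) / phi ** 2
print(f'D_eff = {D_eff:.3e} m^2/day')
```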

The indoor air concentration, CB, itself is given by

$$ C_B = \alpha_{SG} C_{SG} \tag{7.12} $$

for a soil gas source, and

$$ C_B = \alpha_{GW} C_{GW} H \tag{7.13} $$

for a groundwater source. The coefficients αSG and αGW are the alpha factors calculated for soil gas and groundwater sources, respectively. Taking this a step further, for carcinogenic compounds at a specified cancer risk the allowable building concentration, CB-A, can be calculated from

$$ C_{B\text{-}A} = \frac{TR \cdot AT}{URF \cdot EF \cdot ED} \tag{7.14} $$

where TR is the target risk level (unitless), AT is the averaging time (T), URF is the inhalation unit risk factor ((M/L³)⁻¹), EF is the exposure frequency (T/T) and ED is the exposure duration (T) (based on US EPA, 2001).

Like the aquifer transport example, there are several parameters of the model that are not necessarily measured at specific sites. Further, the model often is not calibrated, because of the intrusiveness of indoor air sampling; it is therefore typically used as an uncalibrated prediction tool. With parameter uncertainty, it is clear that an analysis of the dependence of model results on the inputs and their variability is needed. One-at-a-time sensitivity analysis is intended to determine the sensitivity of the model to a unit change in each input parameter taken separately. Although this procedure places parameters on an equal footing and provides insight into the relative sensitivity of the model to each parameter, it does not account for synergy between parameters. Tillman and Weaver (2006) showed that the predicted uncertainty in model results was higher if parameters were allowed to vary simultaneously than if they were varied one at a time. When these effects were considered, the overall sensitivity of the JEM was shown to increase by two orders of magnitude in some cases.
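A direct transcription of Equations 7.6–7.11 into code makes this parameter dependence easy to probe. The function below is a minimal sketch of the alpha calculation as defined above, using one effective diffusion coefficient for both soil and crack; the example input values are hypothetical and are not US EPA defaults.

```python
# Minimal sketch of the Johnson-Ettinger 'alpha' factor (Equations 7.6-7.11).
import math

def jem_alpha(D_eff, A_B, Q_B, L_T, Q_s, L_C, A_C):
    """Attenuation factor alpha = C_B / C_S for the compartment model,
    equating the soil and crack diffusion coefficients (D_T = D_C)."""
    N = A_C / A_B                                 # crack ratio (7.9)
    A = D_eff * A_B / (Q_B * L_T)                 # (7.7)
    B = Q_s * L_C / (D_eff * N * A_B)             # (7.8)
    C = Q_s / Q_B                                 # (7.11)
    eB = math.exp(B)
    return A * eB / (eB + A + (A / C) * (eB - 1.0))   # (7.6)

# Hypothetical inputs: D_eff (m^2/day), areas (m^2), flows (m^3/day), lengths (m).
alpha = jem_alpha(D_eff=0.05, A_B=100.0, Q_B=1200.0, L_T=2.0,
                  Q_s=0.5, L_C=0.15, A_C=0.05)
print(f'alpha = {alpha:.2e}')   # indoor concentration per unit source concentration
```

Wrapping such a function in a loop over simultaneously varied inputs is what exposes the parameter synergy discussed above.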


Although this observation is valuable in itself, the uncertainty analysis can also be used to rank the relative importance of the parameters. Using ranges of parameter values determined by the US EPA, and estimates where needed, simulations for sand, sandy loam and clay showed that reducing the variability of a few parameters accomplished most of the uncertainty reduction in the results. Although the ranking of these parameters varied among the soil types, the air exchange rate, soil moisture content, building mixing height and porosity were always among the most important parameters of the model. Therefore, effort spent on improving estimates of these would reduce the model uncertainty the most.

Is the model's response consistent over all ranges of parameter values? If the answer is no, then an uncertainty analysis is needed for each specific run of the model. For the simple groundwater transport model, the response was consistent for first arrival time, peak concentration and duration of contamination (Weaver et al., 2002); it was not consistent for the average concentration over the duration of contamination. For the JEM, consistent behaviour was found over a range of depths and soil types; thus, the JEM can be used to generate a generic prescription for analysing a worst-case scenario, as follows. By picking parameter ranges from EPA guidance documents when available, or reasonable estimates when not, Tillman and Weaver (2007) established that a generic worst case existed. The analysis was conducted over a range of soil types and depths. The results were that the model produced the most protective risk estimates when the low values of effective saturation, air exchange rate, mixing height, source depth and residual moisture content were selected, along with the high values of porosity and soil gas flow rate. For this evaluation the model was insensitive to temperature and crack width, which could take either value. An analyst can be certain that, within the assumptions of the model, these choices give model results that represent the worst possible exposure.

7.6 CONTAMINATION OF MUNICIPAL WELL FIELDS

One limitation of the range-based method for uncertainty analysis is its inability to assign a probability to any of the outcomes. Each result of the model must be assumed to be equally probable, so the approach cannot distinguish between common and rare events. To advance beyond this problem, Monte Carlo simulation is used. The basic idea of Monte Carlo simulation is to replace deterministic inputs with probability distributions which, in practice, may be of any type. The model is run with parameters chosen randomly from these distributions, and a number of outputs are generated for statistical analysis.

A key limitation of the simple Monte Carlo method is that hundreds or thousands of runs of the model are needed to sample the entire range of the input distributions. This results because there is a low probability of picking extreme values from the distributions when picking parameters randomly. Techniques have been developed to minimise this problem, notably Latin hypercube sampling, in which samples are taken from specified subintervals of the distributions, forcing sampling of the extreme events. Large savings in the number of runs can be achieved with such techniques.
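The stratified sampling idea is illustrated below with scipy's Latin hypercube sampler; the three triangular input distributions are placeholders chosen for the example rather than the chapter's actual parameter set.

```python
# Latin hypercube sampling: each parameter's [0, 1) range is split into n
# strata and each stratum is sampled exactly once, so even a modest run count
# covers the tails that plain random sampling tends to miss.
import numpy as np
from scipy.stats import qmc, triang

n_runs, n_params = 100, 3
sampler = qmc.LatinHypercube(d=n_params, seed=1)
u = sampler.random(n=n_runs)                  # uniform samples in [0, 1)^d

# Map the uniform samples through triangular distributions (min, mode, max):
# hydraulic conductivity (m/day), porosity, gradient -- all illustrative.
bounds = [(10.0, 55.0, 100.0), (0.25, 0.30, 0.35), (0.0009, 0.0010, 0.0011)]
params = np.column_stack([
    triang.ppf(u[:, j], c=(m - lo) / (hi - lo), loc=lo, scale=hi - lo)
    for j, (lo, m, hi) in enumerate(bounds)
])
print(params.min(axis=0), params.max(axis=0))  # extremes are well covered
```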


Transport of contaminants from multiple leaking underground storage tank (LUST) sites to municipal water supply wells was simulated using an approximate approach developed by Small (2003). It allows for progressive leaching of a stationary gasoline source, using a set of compartments to represent the spatial distribution of constituents within the gasoline. Transport to a receptor is assumed to follow streamlines, defined by solving the equation for groundwater flow (Figure 7.6). This method fits the paradigm described above, where the flow of water is solved separately from the transport of the contaminant. We exploit this feature to gain flexibility in choosing methods for each part of the problem. In this chapter we use methods that are well suited to Monte Carlo analysis, because they use only small amounts of computer time, facilitating thousands of runs of the model.

The gasoline leaching model uses an advective leaching model formulated over a number of compartments in the smear zone (Small, 2003). Applying mass balance to a single compartment gives

$$ \frac{dM}{dt} = Q c_{in} - Q c_{out} \tag{7.15} $$

Figure 7.6: Conceptual illustration of transport from multiple sources along streamlines to a receptor well: (1) compartmentalised source; (2) input to aquifer; (3) advection and decay along each streamline; (4) well integrates contributions from all sources.


where M is the total mass of the chemical of concern (COC) in the gasoline (M), Q is the volumetric flow rate through the smear zone (L³/T), cin is the COC concentration in water entering the gasoline smear zone (M/L³) and cout is the exiting COC concentration (M/L³). The input COC concentration to the gasoline lens, cin, is normally zero, but it allows compartments to be linked, where the output from an up-gradient compartment becomes the input to a down-gradient compartment. Using this feature to extend the model to multiple compartments gives the model for the source concentration

$$ c_{source} = c_0\, e^{-(\alpha + \lambda + \beta)t} \sum_{j=0}^{N-1} \frac{(\alpha t)^j}{j!} \tag{7.16} $$

where c0 is the initial aqueous concentration at the source (M/L³), λ is an aqueous, first-order biodegradation rate in the source zone (T⁻¹), N is the number of compartments (unitless) and t is time (T). Smear zone leaching by water is expressed by the quantity α (T⁻¹), and volatilisation from the top of the compartment by β (T⁻¹); these are defined by

$$ \alpha = \frac{Q}{V_{box} B_x}, \qquad \beta = \frac{D_{eff}\, \phi\, K_h A_t}{z B_x V_{box}} \tag{7.17} $$

where Q is the flow rate through the compartment (L³/T), which has a total volume of Vbox (L³); Bx is the bulk partition coefficient (unitless), defined below; Deff is the effective diffusion coefficient (Equation 7.10) (L²/T); φ is the porosity (L³/L³); Kh is the dimensionless Henry's Law coefficient (unitless); At is the area of the top of the compartment (L²); and z is the depth to the smear zone (L). Flow through the compartment is reduced by blockage from oil-filled pores according to

$$ k_r = \left(\frac{S_w - S_{wr}}{1 - S_{wr}}\right)^{\epsilon} \tag{7.18} $$

where kr is the relative permeability (unitless), Sw is the fraction of the pore space filled by water (called the water saturation) in the smear zone (L³/L³), Swr is the residual water saturation (L³/L³) and ε is a factor related to the pore size distribution (unitless). The bulk partition coefficient, Bx, expresses equilibrium partitioning among the phases presumed to exist in the smear zone: water, gasoline and solids. Bx is calculated from

$$ B_x = S_w + S_g K_{gw} + \frac{\rho_b f_{oc} K_{oc}}{\phi} \tag{7.19} $$

where Sg is the fraction of the pore space filled by gasoline (L³/L³) and Kgw is the gasoline-to-water partition coefficient (L³/L³), defined from

$$ c_g = K_{gw} c_w \tag{7.20} $$

where cg is the contaminant concentration in gasoline (M/L³) and cw is the contaminant concentration in water (M/L³). Equations 7.16–7.20 are used to generate the time-dependent concentration at each source in the capture zone (Figure 7.6).
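Equation 7.16 is cheap to evaluate, which is what makes the approach practical inside a Monte Carlo loop. The sketch below codes the source term directly; every parameter value is hypothetical and chosen only to produce a plausible-looking leaching curve.

```python
# Source concentration from the multi-compartment leaching model (Eq. 7.16):
# an exponential decline multiplied by a truncated series over the N boxes.
import math

def c_source(t, c0, alpha, lam, beta, N):
    """Aqueous concentration leaving the source at time t (Equation 7.16)."""
    series = sum((alpha * t) ** j / math.factorial(j) for j in range(N))
    return c0 * math.exp(-(alpha + lam + beta) * t) * series

# Hypothetical values: c0 in ug/L, rates in 1/day, N compartments.
for t in (0.0, 365.0, 1825.0, 3650.0):
    print(t, round(c_source(t, c0=10000.0, alpha=2e-4, lam=5e-4, beta=1e-5, N=4), 1))
```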


The properties – volume of the release, composition of the gasoline, size of the source, date of release – can be different for each source. The sources combine to generate the potential impact on the water supply well in the second phase of the solution. Transport to a receptor is assumed to be dominated by advection and first-order loss, which is taken as a biodegradation loss. Once at a receptor well, the concentration is reduced by dilution in the well bore, because some streamlines contribute no contamination. Thus the well concentration, cwell, is

$$ c_{well} = D\, c_{source} \exp\left(-t_{travel}\, R\, \lambda_{plume}\right) \tag{7.21} $$

where D is the well bore dilution factor (unitless), csource is the source concentration given by Equation 7.16 (M/L³), ttravel is the travel time from source to well (T), R is the retardation factor and λplume is the biodegradation rate constant for the aquifer (T⁻¹). Streamline positions and travel times are calculated from a method developed by Bear and Jacobs (1965) for a single well in a uniform flow field. The travel time is calculated from

$$ \bar{t} = \bar{x} + \ln\left[\frac{\sin\theta}{\sin(\bar{y} + \theta)}\right] \tag{7.22} $$

where θ is the angle that the regional flow makes with the positive x axis (radians), and x̄, ȳ and t̄ are the dimensionless distances and travel time, defined by

$$ \bar{x} = \frac{2\pi q_o}{Q_b} x, \qquad \bar{y} = \frac{2\pi q_o}{Q_b} y, \qquad \bar{t} = \frac{2\pi q_o^2}{n Q_b} t \tag{7.23} $$

where qo is the regional groundwater seepage velocity (L/T), Qb is the pumping rate per unit length of well screen (L²/T) and n is the porosity.

The next step in predicting impacts on municipal water supply wells requires selecting parameters for the transport model and the groundwater flow model. Values used in the triangular distributions for the Monte Carlo simulations are shown in Tables 7.1–7.3. For the triangular distribution, values are assigned for the endpoints – the cumulative probabilities of 0.0 and 1.0 – and for the mid-point – the average of the minimum and maximum values shown in the tables. These were chosen to provide results similar to the range-of-parameter method, in the absence of presumed knowledge of specific statistical distributions. Running the model 1000 times gives the predicted breakthrough curves for the well.

The ranges of the variables in Tables 7.1–7.3 were based on the authors' experience, assigned a range of 10%, or held constant. Clearly, the range of variability of the inputs directly influences the resulting variability in the model results. In this case we emphasise the variability in aquifer hydraulic conductivity by assigning it an order-of-magnitude variation. Hydraulic conductivity is fundamentally important for the regional aquifer flow and for the induced aquifer velocity needed to generate the specified well pumping rate. Contaminant transport is directly dependent on velocity, as evidenced by the differing travel times and durations found in the results. We specified a release of 300–500 gallons (1150–1900 litres) of gasoline to represent a small release with moderate (50%) uncertainty. Because of their dependence on the emplaced-source size, the concentrations achieved in the model depend on the source-volume variability.
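The Bear and Jacobs transformation is equally compact in code. The sketch below evaluates Equations 7.22 and 7.23 for a hypothetical source location; note that the 2π factor, the sign conventions and the porosity n follow the reconstruction above, and all input numbers are illustrative.

```python
# Dimensionless travel time from a source at (x, y) to a pumping well at the
# origin in a uniform regional flow, following the form of Eqs 7.22-7.23.
import math

q_o = 0.2        # regional seepage velocity, m/day
Q_b = 75.0       # pumping rate per unit screen length, m^2/day
n = 0.3          # porosity
theta = math.radians(45.0)   # regional flow angle from +x axis
x, y = 185.0, 20.0           # hypothetical source location relative to well, m

xb = 2.0 * math.pi * q_o * x / Q_b
yb = 2.0 * math.pi * q_o * y / Q_b
tb = xb + math.log(math.sin(theta) / math.sin(yb + theta))   # Eq. 7.22
t_days = tb * n * Q_b / (2.0 * math.pi * q_o ** 2)           # invert Eq. 7.23
print(f'travel time ~ {t_days:.0f} days (if the source lies in the capture zone)')
```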


Table 7.1: One-thousand-run Monte Carlo simulation: parameter values varied by a percentage of an average value.

Parameter                                      Range                    Percentage
Well pumping rate, Qw                          2250–2750 m³/day         10%
Aquifer thickness, b                           27–33 m                  10%
Hydraulic gradient(a), i                       0.0009–0.0011            10%
Epsilon, ε                                     5.4–6.6                  10%
Residual water saturation, Swr                 0.09–0.11                10%

Parameter                                      Benzene                        MTBE                           Percentage
Henry's Law constant(b,c), Kh                  0.198–0.242                    0.027–0.033                    10%
Soil sorption coefficient(d), Koc              0.0747–0.0913 m³/kg            0.0127–0.0155 m³/kg            10%
Gasoline/water partition coefficient(e), Kgw   315–385                        13.95–17.05                    10%
Diffusion coefficient in air(f), Dair          0.6642–0.8118 m²/day           0.576–0.704 m²/day             10%
Diffusion coefficient in water(f), Dwater      6 × 10⁻⁵ – 7.4 × 10⁻⁵ m²/day   5 × 10⁻⁵ – 6.2 × 10⁻⁵ m²/day   10%
Half-life, t½,source                           365–1095 days                  1825–5475 days                 50%
Half-life, t½,plume                            365–1095 days                  1825–5475 days                 50%

(a) Charbeneau (2000). (b) Meylan and Howard (1990). (c) Fischer et al. (2004). (d) Montgomery (1996). (e) Cline et al. (1991). (f) SPARC Calculator (see http://sparc.chem.uga.edu/sparc/).

Table 7.2: One-thousand-run Monte Carlo simulation: parameter values varied over a specified range.

Parameter                         Range
Porosity(a), φ                    0.25–0.35
Flow direction(b), θ              −65° to 65°
Fraction organic carbon, foc      0.0001–0.001
Hydraulic conductivity, K         10–100 m/day
Volume of gasoline, Vgas          300–500 gallons
Oil saturation, So                0.1–0.2

(a) Charbeneau (2000). (b) Wilson et al. (2000).

By setting the benzene concentration to 1% of the gasoline, we limited one source of variability. If the gasoline met US reformulated gasoline (RFG) standards, this was a reasonable assumption for a forward-looking evaluation of RFG behaviour (Weaver et al., 2010).


Table 7.3: One-thousand-run Monte Carlo simulation: parameter values held constant.

Parameter                                    Value
Solids density, ρsolids                      2.65 g/cm³
Water saturation in vadose zone, Sw,vadose   0.25
Air saturation in vadose zone, Sa,vadose     0.75
Source height, H                             1 m
Depth to source, z                           2 m
Density of gasoline, ρgas                    0.72 g/cm³

7.7 ONE SOURCE SIMULATION

A simulation for benzene released from one source located 185 m (610 ft) from the well was used as the first example (Figures 7.7–7.10). The model parameters combine in ways that generate fast-arriving, short-duration, high-concentration pulses at one extreme, and slow-arriving, long-duration, low-concentration pulses at the other. The highest peak and average concentrations, and the longest-duration breakthrough curves, contrast with the averaged-parameter result (Figure 7.7). Although the model has 18 input parameters, not all have a major impact on the model results. Diffusion coefficients in air and water, for example, have low variability and only influence diffusion transport in the vadose zone. When volatilisation is low, these two parameters have little impact on well bore concentrations.

Each set of input parameters generates a characteristic response from the model. The most important observation from the Monte Carlo results is that a simulation using the average values of all input parameters (hereafter the 'averaged-parameter' simulation) does not represent the true behaviour of the model with uncertain parameters. This single averaged-parameter result fails to account for the diverse output the model can produce from varying inputs. Three extreme results were compared with each other and with the averaged values of each parameter (Tables 7.4–7.6 and Figure 7.7). For each extreme, the biodegradation rate was the lowest possible value (highest half-life of 1095 days). Low biodegradation is necessary to generate breakthrough curves with the highest concentrations – peak and average – and the longest duration of contamination at the well.

The highest peak concentration of about 30 µg/L appeared in an early-arriving case (Table 7.6), resulting largely from a high initial aqueous concentration, low biodegradation, high pumpage and a low regional velocity. High source volume, coupled with a low fuel–water partition coefficient, generates high source concentration, other factors being equal. Early arrival at the well assures less degradation because of less time in the aquifer, and thus high concentration. The low regional velocity allows the high well pumpage, and the resulting high groundwater velocity, to dominate the result at this location. These factors combined to generate the fast-moving, high-concentration breakthrough curve.


Figure 7.7: Selected results from the 1000-simulation model run with one LUST source, showing the breakthrough curves with the peak concentration, highest average concentration and longest duration. For comparison, the simulation with all parameters set to their average values (averaged-parameter simulation) is shown. [Concentration (µg/L) is plotted against time since release (days).]

The longest-duration result (1860 days) was associated with mass released at the lowest rate (highest gasoline/water partition coefficient) and had the lowest pumping and regional flow rates; therefore, low groundwater velocity to the well dominated the result. Biodegradation was also low, so the contamination that moved slowly toward the well was able to persist during the relatively long transport time. The highest average concentration result (1.6 µg/L) occurs with the lowest pumping rate and a lower initial aqueous concentration data set, not with the highest peak concentration data set. The regional flow rate was relatively high. The relatively low initial concentration and transport velocity to the well interact to maximise the average concentration.

A simple average of the input parameter values gave a breakthrough curve that does not resemble any of the extreme results which, by definition, are low-probability events that are expected to diverge from an average. However, the extremes provide insight into what could be the most problematic situations. This comparison shows that reliance on averaged-parameter simulations conceals much of the behaviour of the model, and it clearly suggests that a more insightful approach, such as a Monte Carlo uncertainty analysis, is needed.


Figure 7.8: Capture zones and source locations for possible simulation results: (a) average flow direction of 45° and medium value of regional flow; (b) average flow direction with higher regional flow, which narrows the capture zone; (c) flow direction of 110° with medium regional flow; (d) flow direction of 110° with higher regional flow velocity, which decreases the width of the capture zone. In (b) and (d), one of the LUST sites located well within the capture zone at the lower regional flow velocity is now on the edge of the capture zone (b) or just outside the capture zone (d). [Four plan-view panels with x and y coordinates in metres.]

The average breakthrough curve, as influenced by uncertainty, is determined from all of the individual breakthrough curves (Figure 7.9). For much of its duration, it is lower in concentration than the averaged-parameter breakthrough curve. For a given pumping rate, increasing the regional velocity decreases the width of the capture zone (Figure 7.8). When the variable flow direction is also included, a source near the boundary may be inside the capture zone for some simulations and outside for others. The lower 1000-run simulation average concentration is caused by the fairly large proportion of times that the source is not in the capture zone of the well, resulting in many simulations that produce no concentration at the well. The 1000-simulation breakthrough curve arrives earlier than the averaged-parameter breakthrough curve, because many of the individual simulations have higher velocity and earlier breakthrough. The duration of the 1000-simulation curve is longer for the same reason: many individual results have durations longer than in the averaged-parameter case.

183

STATISTICAL ACCOUNTING FOR UNCERTAINTY IN MODELLING TRANSPORT

Figure 7.9; Average breakthrough curve from 1000 simulation run with a single source. For comparison purposes the averaged-parameter breakthrough curve and threshold concentration of 1 g/L are shown. 2.5

Average simulation: 1000 simulation average Averaged parameter Threshold

Concentration: µg/L

2.0

1.5

1.0

0.5

0 0

500

1000

1500

2000

2500

Time since release: days

longer for the same reason: many individual results have durations longer than the averaged-parameter case. Cumulative probability curves, generated from the individual simulations, provide the means to assign probabilities to the results. For example, at 500 days, the cumulative probability of the concentration lying above the threshold concentration of 1 g/L was about 20% (Figure 7.10). Because sources can lie outside the capture zone, the modelled concentrations can be zero at any time during the simulation. This feature of the problem caused the width of the distributions to increase from 500 days to 1000 days, as the mean concentration increased and zero concentration always remains a possibility. Consequently, the cumulative probability of the concentration being above 1 g/L increased from 20% to about 40% over this time period. Later, the average concentration dropped and the width of distribution decreased until, by 2000 days, none of the simulations produced concentrations above the threshold.

184

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

Figure 7.10: Cumulative probability curves for 1000-simulation run with one site at selected times. 1.0

Cumulative probability

0.8

0.6 Cumulative probability: 500 days 1000 days 1500 days 2000 days 2500 days

0.4

0.2

0 0.001

0.01

0.1

1

10

100

Concentration: µg/L

7.8

TWO, FOUR AND SIX SOURCE SIMULATIONS

When more sources potentially are located within the capture zone, there is a possibility of higher contamination at the well. Simulations in which each of the six sources was included illustrate various features of multiple source situations (Figure 7.11). The previous single source results (source A, Figure 7.3) showed an average concentration of about 1.5 g/L. When the source closer to the well was included (source B, Figure 7.3), the average peak concentration increased to about 6.7 g/L. Individual examination of each breakthrough curve (Figure 7.3, source A) shows that the nearest source (source B) contributed most to the peak concentration because it arrived soonest at the well. The second source (source A) was located further away from the well and its contamination arrives later with lower concentration. For these 1000-run simulations, the parameters were fixed for all sources; meaning that the difference between source A and source B was caused by biodegradation during the

Maximum peak concentration Maximum average concentration Earliest arrival time Longest duration Averaged-parameters

Extreme

2694 2250 2750 2250 2500

Pumping rate (m3 /day) 0.097 0.34 0.44 0.037 0.20

Q (m/day) 0.25 0.25 0.25 0.3 0.3

Porosity

22 77 100 10 55

K (m/day) 0.1 0.2 0.2 0.2 0.15

So

5.4 6.6 6.6 6.6 6.0



0.1 0.11 0.11 0.11 0.10

Swr

315 347 385 385 350

Kgw

0.0747 0.0747 0.0913 0.0747 0.083

Koc (m3 /kg)

0.198 0.198 0.242 0.198 0.22

Kh

Table 7.4: One-thousand-run Monte Carlo simulation run parameters for extreme results: maximum peak concentration, maximum average concentration, earliest arrival time and longest duration. These are compared against the results for the averaged parameters.

STATISTICAL ACCOUNTING FOR UNCERTAINTY IN MODELLING TRANSPORT

185

186

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

Table 7.5: One-thousand-run Monte Carlo simulation parameters which did not influence the extreme results. Parameter

Value

Thickness Gradient Angle foc Gasoline volume Dair Dwater Source and plume half life

27 m 0.0011 0.35 0.0001 1.89 m3 0.66 cm2 /s 6 3 105 cm2 /s 1095 days

Table 7.6: Characteristic outputs for each extreme result and the averaged-parameter simulation. Arrival time Average Peak (days) concentration concentration (mg/L) (mg/L) Maximum peak concentration Maximum average concentration Earliest arrival time Longest duration Averaged-parameters

30 12 12 4.0 1.5

0.56 1.6 1.5 1.0 0.22

280 300 260 420 440

Duration (days) 160 940 840 1860 400

longer travel time to the well. Because of the lag in arrival time and the resulting reduction in concentration, the effect of source A on the combined breakthrough curve was to increase the concentration after the peak concentration had passed. The remaining four sources are located farther away from the well, so the arrival time of their contamination was delayed even more. Taken together, these four sources contributed a peak concentration of about 2.2 g/L to the well, and when combined, the four sources added enough concentration to the well to produce a new peak concentration of 6.9 g/L. Like the previous two examples, the averaged-parameter breakthrough curve does not represent all possible breakthrough curves that can occur at the well (Figure 7.12).3 When six potential sources were located in the capture zone, the variety of responses increased because some combinations of sources did not contribute to the breakthrough during a specific simulation. This behaviour is best shown by the highpeak and high-average-concentration curves. The high-peak curve had contributions from only the nearest two wells (A and B on Figure 7.3), because the four more distant 3 In contrast to the previous single-well example, these curves do not represent the extreme responses of the model, but show examples of breakthrough curves.

STATISTICAL ACCOUNTING FOR UNCERTAINTY IN MODELLING TRANSPORT

Figure 7.11: Average breakthrough curve for (a) two LUST sites and (b) six LUST sites. 8 Two sources: One source (40, 120) One source (40, 180) Sources combined Threshold concentration

Concentration: µg/L

6

4

2

0 0

500

1000

1500

2000

2500

(a) Six sources:

8

Two sources combined Four sources Six sources combined Threshold concentration

Concentration: µg/L

6

4

2

0 0

500

1000 1500 Time since release(s): days (b)

2000

2500

187

188

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

Figure 7.12: Example breakthrough curves for selected extreme parameters with six LUST sites. Concentration: µg/L

25 Duration:

20

Long

15 10 5 0

Concentration: µg/L

25 Arrival time:

20

Early Late

15 10 5 0

Concentration: µg/L

50 Concentration: High average High peak

40 30 20 10 0

Concentration: µg/L

25 20

Averaged parameters

15 10 5 0 0

500

1000

1500

2000

2500

Time since releases: days

wells were outside the capture zone and contribute nothing to the well. In contrast, the high average concentration curve showed six distinct peaks contributing contamination to the well. The other example results showed similar behaviour to the one-source case. The long-duration breakthrough curve is characterised by a sustained relatively low-level concentration. Early arriving breakthrough curves tend to have high peak

189

STATISTICAL ACCOUNTING FOR UNCERTAINTY IN MODELLING TRANSPORT

concentrations due to high velocity and, consequently, reduced opportunity for biodegradation. On average, late arriving curves have lower concentrations for the opposite reason. Cumulative probability curves from 500 days to 2500 days show that initial concentrations (say 500 days) declined on average as time passed (say 2500 days) (Figure 7.13). At later times, fewer concentrations were above the 1 g/L threshold concentration. At 500 days, roughly 90% of the individual simulation results are above 1 g/L, but by 2500 days, almost all concentrations were below the threshold. As the average concentration declined, so did width of the concentration distribution. Compared to the single source simulation (Figure 7.10), the average concentrations were higher, but showed a similar decrease in probability of concentration above the threshold. With the average breakthrough curve and the cumulative probabilities, the average concentration at each time can be assigned a probability, making the output closer to how a weather forecast is presented. From the cumulative probability curve for 500 days (Figure 7.13), we can estimate that the probability is 20% that the concentration is 3 g/L or less; 50% that it is 6 g/L or less; and 80% that it is

Figure 7.13: Cumulative probability curves for six LUST sites from the 1000-simulation model run. 1.0

Cumulative probability

0.8

0.6

0.4 Cumulative probability: 500 days 1000 days 1500 days 2000 days 2500 days Threshold concentration

0.2

0 0.001

0.01

0.1

1

Concentration: µg/L

10

100

190

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

10 g/L or less. These results link the uncertainty in the model inputs to a forecast concentration and associated probability.

7.9

CONCLUSION

Simulation models provide the best means for quantitative integration of environmental transport phenomena; they are limited, however, by uncertainty and unaccounted spatial variability in parameter values. Despite the mathematical precision inherent in the relationships between physical, chemical and biological quantities, the parameters of these equations have a fundamentally empirical basis. Consequently, the parameters are uncertain because measurement methods themselves represent approximations and errors occur. Particularly for subsurface transport at real field sites, spatial variability is always present on a finer scale than that of the characterisation programme. The combined uncertainty in parameters contributes to uncertainty in model results. Another contributor is the forcing function which describes the timing and mass of the release. For leaking underground storage tank sites, these are rarely known because tanks are buried for fire safety and leaks normally cannot be observed. When forwardlooking analyses are performed and future predictions made, sites may not even be characterised and few data may be available. In these cases especially, the effect of parameter, forcing function and spatial uncertainties must be accounted for in decision making. Monte Carlo simulation provides a simple method for accounting of uncertainty, if the model is sufficiently simple that multiple runs can be accomplished practically. Uncertainty analysis is particularly necessary for analytic solutions of the transport equation. The restrictive assumptions of these models require the assumption of homogeneous aquifers; one-dimensional, steady-state, uniform groundwater flow; and vertical and lateral transport by dispersion only. By definition, these models do not match geometric features of real aquifers. Under such circumstances, uncertainty analysis shows the range of results the model can produce. Heterogeneity and other flow features may require more comprehensive approaches, but the uncertainty analysis at least alerts the analyst to the variable character of the outputs. An important aspect of this variability is that, in the cases presented above, the averaged-parameter simulation did not represent the range of possible solutions. If, as in the case of the parameter range calculation, all combinations of parameters are equally possible, then an extreme case cannot be discounted as being the significant result of the model. A decision maker should consider that an early-arriving, highconcentration breakthrough curve is as likely as the averaged-parameter or any other result of the model. When probabilities are included – as in the capture zone analysis – they can be assigned to the concentrations. Given that the input probabilities are accurate, the analyst can then show where the model results lie in the context of the output probabilities. The question that needs an answer is, ‘Given uncertainty in the model results, can a decision be made?’ The answer may be yes, if the probability ranges are small, the risks low, or the exposures limited. It may be no, if the range of model predictions is too wide for confident, defensible decisions.

STATISTICAL ACCOUNTING FOR UNCERTAINTY IN MODELLING TRANSPORT

191

The approaches discussed in this chapter are only the beginning for an environmental assessment. Although the model results can provide an estimate of the environment conditions and their probabilities, evaluation of those model results is necessary to move from a totally forward-looking analysis to one where the uncertainties are minimised. Observation of the predicted quantities, which here are concentrations in an aquifer, indoor air, or a municipal water supply well, are needed to show that the model predictions are reasonable. As noted by Oreskes (2003), one necessary feature for improving predictability is that the observation should occur on a short enough time scale that an observation can be made. For problems like LUST-sitecaused aquifer contamination, observations made during sampling rounds prior to clean-up or site closure provide the data necessary to evaluate models of the type described in this chapter. Because transport times stretch to years, sufficient time is available to evaluate model results with field data. Thus, all the tools necessary to predict and then evaluate model performance are available for these problems.

ACKNOWLEDGEMENT This chapter has been reviewed in accordance with the US Environmental Protection Agency’s peer and administrative review policies and approved for publication. Mention of trade names or commercial products does not constitute endorsement or recommendation for use. The US EPA provided J. M. Ferguson’s funding under EP09D000556, and B. Mukherjee’s and F. Tillman’s funding under CR 83323201.

REFERENCES ASTM (1995). Standard Guide for Risk-Based Corrective Action Applied at Petroleum Release Sites. ASTM Designation E 1739095. American Society for Testing and Materials, West Conshohocken, Pennsylvania. Aziz, C.E., Newell, C.J., Gonzales, J.R., Haas, P., Clement, T.P. and Sun, Y. (2000). BIOCHLOR Natural Attenuation Decision Support System, User’s Manual Version 1.0. United States Environmental Protection Agency, Cincinnati, Ohio, EPA/600/R-00/008. Bear, J. and Jacobs, M. (1965). On the movement of water bodies injected into aquifers. Journal of Hydrology. 3: 37–57. Charbeneau, R.J. (2000). Groundwater Hydraulics and Pollutant Transport. Prentice Hall. Cline, P.V., Delfino, J.J. and Rao, P.S.C. (1991). Partitioning of aromatic constituents into water from gasoline and other complex solvent mixtures. Environmental Science and Technology. 25: 914–920 (American Chemical Society Journals, web: 10 March 2010). Domenico, P.A. (1987). An analytical model for multidimensional transport of a decaying contaminant species. Journal of Hydrology. 91: 49–58. Domenico, P.A. and Robbins, G.A. (1985). A new method of contaminant plume analysis. Ground Water. 23, 4: 476–485. Domenico, P.A. and Schwartz, F.W. (1998). Physical and Chemical Hydrogeology, 2nd edn. John Wiley and Sons. Eggleston, J.R. and Rojstaczer, S.A. (2000). Can we predict subsurface mass transport? Environmental Science and Technology. 34: 4010–4017.

192

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

Fischer A., Muller, M. and Klasmeier, J. (2004). Determination of Henry’s law constant for methyl tert butyl ether (MTBE) at groundwater temperatures. Chemosphere. 54: 689–694. Gelhar, L.W., Welty, C. and Rehfeldt, K.R. (1992). A critical review of data on field-scale dispersion in aquifers. Water Resources Research. 28, 7: 1955–1974. Grosser, M. (1962). The Discovery of Neptune. Harvard University Press. Johnson, P.C. and Ettinger, R.A. (1991). Heuristic model for predicting the intrusion rate of contaminant vapors into buildings. Environmental Science and Technology. 25, 1445–1452. Meylan, W.M. and Howard, P.H. (1991). Bond contribution method for estimating Henry’s Law constants. Environmental Toxicology and Chemistry. 10.10: 1283–1293. Miller, C.T. and Gray, W.G. (2002). Hydrological research: just getting started. Ground Water. 40, 3: 224–231. Montgomery, J. (1996). Groundwater Chemicals Desk Reference, 2nd edn. CRC Press, Boca Raton, Florida. Moore, C. and Doherty, J. (2006). The cost of uniqueness in groundwater model calibration. Advances in Water Resources. 29, 4: 605–623. Newell, C.J., McLeod, R.K. and Gonzales, J.R. (1996). BIOSCREEN Natural Attenuation Decision Support System, User’s Manual Version 1.3. United States Environmental Protection Agency, Cincinnati, Ohio, EPA/600/R-96/087. Olcott, P.G. (1995). Groundwater Atlas of the United States, Segment 12, Hydrologic Investigations Atlas 730-M. United States Geologic Survey, Reston, Virginia. Oreskes, N. (2003). The role of quantitative models in science. In: Canham, C.D., Cole, J.J. and Lauenroth, W.K. (Eds), Models in Ecosystem Science. Princeton University Press, pp. 13–31. Oreskes, N., Shrader-Frechette, K. and Belitz, K. (1994). Verification, validation, and confirmation of numerical models in the earth sciences. Science. 263: 641–646. Small, M.C. (2003). Managing the Risks of Exposure to Methyl Tertiary Butyl Ether (MTBE) Contamination in Ground Water at Leaking Underground Storage Tank (LUST) Sites. Dissertation, Civil and Environmental Engineering, University of California at Berkeley. Srinivasan, V., Clement, T.P. and Lee, K.K. (2007), Domenico solution – is it valid? Ground Water. 45, 2: 136–146. Tillman, F.D and Weaver, J.W. (2006). Uncertainty from synergistic effects of multiple parameters in the Johnson and Ettinger (1991) vapor intrusion model. Atmospheric Environment. 40, 22: 4098–4112. Tillman, F.D and Weaver, J.W. (2007). Parameter sets for upper and lower bounds on soil-to-indoorair contaminant attenuation predicted by the Johnson and Ettinger vapor intrusion model. Atmospheric Environment. 41, 27: 5797–5806, DOI 10.1016/j.atmosenv.2007.05.033. Tonkin, M. and Dougherty, J. (2009). Efficient nonlinear predictive error variance for highly parameterized models. Water Resources Research. 45: DOI:10.1029/2007WR006678. US EPA (United States Environmental Protection Agency) (1989). Risk Assessment Guidance for Superfund Volume 1, Human Health Evaluation Manual (Part A), Interim Final. Office of Emergency and Remedial Response, Washington, DC, EPA/540/1-89/002. US EPA (United States Environmental Protection Agency) (1996). Soil Screening Guidance User’s Guide. Office of Solid Waste and Emergency Response, Washington, D.C., Publication 9355.4-23. US EPA (United States Environmental Protection Agency) (2001). Supplemental Guidance for Developing Soil Screening Levels for Superfund Sites, Peer Review Draft. Office of Solid Waste and Emergency Response, Washington, D.C., OSWER 9355.4-24. 
US EPA (United States Environmental Protection Agency) (2004). User’s Guide for Evaluating Subsurface Vapor Intrusion into Buildings. Office of Emergency and Remedial Response, Washington, D.C. US EPA (United States Environmental Protection Agency) (2009). Guidance on the Development, Evaluation, and Application of Environmental Models. US EPA, Washington, D.C., EPA/100/ K-09/003. van Genuchten, M.T. and Alves, W.J. (1982). Analytical Solutions of the One-Dimensional Convec-

STATISTICAL ACCOUNTING FOR UNCERTAINTY IN MODELLING TRANSPORT

193

tive-Dispersive Solute Transport Equation. United States Department of Agriculture, Agricultural Research Service, Technical Bulletin Number 1661. Weaver, J.W. (1996). Application of the hydrocarbon spill screening model to field sites. In: Reddi, L. (Ed.), Subsurface Environment: Assessment and Remediation. Proceedings of Non-Aqueous Phase Liquids (NAPLs). American Society of Civil Engineers, Washington, D.C., 12–14 November , pp. 788–799. Weaver, J.W., Tebes-Stevens, C. and Wolfe, K.L. (2002). Uncertainty in model predictions—plausible outcomes from estimates of input ranges. Proceedings of Brownfields 2002, Charlotte, North Carolina, 13–15 November 2002. Available online from http://www.epa.gov/athens/ learn2model/part-two/onsite/doc/UncertaintyInModelPredictions.pdf Weaver, J.W., Exum, L.R. and Prieto, L.M. (2010). Gasoline Composition Regulations Affecting LUST Sites. United States Environmental Protection Agency, Washington, D.C., 20460, EPA 600/R10/001. West, M.R., Kueper, B.H. and Ungs, M.J. (2007). On the use and error of approximation in the Domenico (1987) solution. Ground Water. 45, 2: 126–135. Wilson, J.T., Cho, J.S., Wilson, B.H. and Vardy, J.A. (2000). Natural Attenuation of MTBE in the Subsurface under Methanogenic Conditions. United States Environmental Protection Agency, Cincinnati, Ohio, EPA/600/R-00/006. Xu, M. and Eckstein, Y. (1995). Use of weighted least-squares method in evaluation of the relationship between dispersivity and scale. Groundwater. 33, 6: 905–908. Yeh, C.T. (1981). AT123D: Analytical Model to Transient 1-, 2-, and 3-Dimensional Waste Transport in Aquifers. Oak Ridge National Laboratory, Oak Ridge, Tennessee, ORNL-5602.

CHAPTER

8

Petroleum Hydrocarbon Forensic Data and Cluster Analysis Jun Lu

8.1

INTRODUCTION

Petroleum hydrocarbons are among the most pervasive contaminants in the environment. The cost to clean up these contaminants can be prohibitive depending on the nature and extent of the contamination. For sites where multiple responsible parties are potentially involved, it is almost without exception that environmental forensics techniques are sought to determine the contaminant source(s). In many cases where responsible parties are known, the forensics data are still used to locate the source(s), which aids in creating an understanding of the pathway, a key component of the site conceptual model. The forensic approach is also frequently taken to differentiate between current and historical releases in cases where there have been previously documented releases. With this type of approach, remediation can be expedited by limiting clean-up operations to the area that was impacted only by the current release(s). The most commonly used petroleum hydrocarbons data for environmental forensics analysis result from gas chromatography (GC). During a site investigation, if the samples analysed are limited, the relationship between samples may be determined quickly by a visual examination. However, if the scale of investigation involves a large number of samples, cluster analysis may be used to provide some preliminary groupings, which in turn help to determine potential sources or relationship among samples, thereby giving investigators additional insight into the make-up of the samples. The objective of this chapter is to demonstrate the use of cluster analysis as an exploratory tool in environmental forensics with GC-based petroleum hydrocarbon data. In order to explain cluster analysis and its use in environmental forensics, the chapter is organised into three sections. Section 8.2 provides an overview of cluster analysis. Section 8.3 describes the types of petroleum hydrocarbon data suitable for cluster analysis and precautions in using them. Section 8.4 provides real-life examples Practical Environmental Statistics and Data Analysis, edited by Yue Rong. # 2011 ILM Publications, a trading division of International Labmate Limited.

196

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

to illustrate how cluster analysis can be used to enhance the understanding of potential sources and the relationship(s) between samples. Because this chapter is written to demonstrate the usability of cluster analysis in comprehending forensics data, there will be no detailed step-by-step description of cluster analysis; rather the discussions presented herein are results oriented and applications are demonstrated using real-life examples.

8.2

CLUSTER ANALYSIS

Cluster analysis is the process of searching for patterns in a data set by grouping observations or objects into clusters (Rencher, 2002). Owing to its simple concept, cluster analysis can be used to discover structures in data without meeting rigorous statistical rules. There are two basic types of cluster analysis: hierarchical and partitional. Hierarchical clustering is most commonly used and will be discussed in this chapter. Cluster analysis is an ideal tool used in environmental forensics for rapidly determining if there is any similarity between samples collected from a site. The similarity between samples is presented in the form of a dendrogram, which is essentially a tree diagram. During cluster analysis, decisions must be faced regarding whether data should be standardised and which methods of distance measure and linkage should be used. Standardisation is to convert all variables to a common scale by subtracting the means and dividing by the standard deviation before the distance matrix is calculated. This is commonly done if the effects of scale differences need to be minimised or if the variables are in different units, or in the same units but with orders of magnitude difference in value. Because the hydrocarbon forensic data used in this chapter are all in the same units in the same data sets, no standardisation will be conducted for analysis. Commonly used methods of distance measure include Euclidean distance, city block distance, and Pearson correlation (Everitt et al., 2001). Euclidean distance (also called ‘distance as the crow flies’ or ‘2-norm distance’) is probably the most commonly chosen type of distance (Everitt et al., 2001). It is simply the geometric distance in the multidimensional space. It is computed as Dij ð ed Þ ¼

" r X

#1=2 ð xik  x jk Þ

2

(8:1)

k¼1

where xik and x jk are, respectively, the kth variable value of the p-dimensional observations for individuals i and j. City block distance describes distances on a rectilinear configuration. It is also referred to as rectilinear, taxicab or Manhattan distance because it measures distances travelled in such a street configuration (Everitt et al., 2001). The distance is computed as follows:

PETROLEUM HYDROCARBON FORENSIC DATA AND CLUSTER ANALYSIS

Dij ð cd Þ ¼

r X    xik  x jk 

197

(8:2)

k¼1

where xik and x jk are, respectively, the kth variable value of the p-dimensional observations for individuals i and j. Pearson correlation is not strictly a distance measure, rather it is a dissimilarity measure using correlation coefficients. The correlation is computed as   1  ij  ij ¼ 2 with Xr

ð xik  x jo Þ i1=2 Xr ð x  xio Þ2 ð x  x jo Þ2 k¼1 ik k¼1 jk

ij ¼ hX r

k¼1

(8:3)

where 1X xik r k¼1 r

xio ¼

where xik and x jk are, respectively, the kth variable value of the p-dimensional observations for individuals i and j. Methods of linkage include single, complete and average (Everitt et al., 2001). Single linkage is the method in which the distance between two clusters is determined by the distance of the two closest objects (nearest neighbours) in the different clusters (Sneath, 1957); this rule tends to produce chain-like clusters. Complete linkage is the method in which the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e. by the ‘furthest neighbours’) (Sorensen, 1948); this method tends to find compact clusters with equal diameters. Average linkage (unweighted pair group) is the method in which the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters (Sokal and Michener, 1958); this method tends to join clusters with small variances. Unlike single and complete linkage, this method takes account of cluster structure and is relatively robust. Based on the measurement of distance and linkage methods described above, there are nine possible combinations. Considering other types of less commonly used methods, which are not discussed here, the number of combinations can be larger. Logically, one would think that it would be beneficial to minimise the number of calculations by choosing the most appropriate combinations for the data sets of interest. For a data set with two to three variables, a screening data analysis (e.g. scatter plot) may be done to get a sense of data structure so that the ‘best’ combination(s) of clustering methods may be chosen. However, for petroleum hydrocarbon forensics analysis, the variables can vary from a minimum of five to over tens of thousands of data points. Therefore, it is almost impossible to determine what the ‘best’ combination of methods is for the data set.

198

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

Cluster analysis can be used as an exploratory tool for the data by considering all the methods. The number of calculations seems large; however, with the speed of an ordinary computer, each calculation can be done fairly quickly. The groupings from each combination are reviewed to obtain insights of the data that may shed light on certain aspects of the issue under investigation.

8.3

TYPES OF PETROLEUM HYDROCARBONS OR RELATED DATA FOR FORENSIC ANALYSIS

At petroleum-impacted sites, typical analysis during site investigation includes US Environmental Protection Agency (EPA) Method 8015, EPA Method 8260B and EPA 8270. These analytical methods are implemented through GC/flame ion detector (FID) and GC/mass spectrometry (MS). Measurements include total petroleum hydrocarbon, benzene, toluene, ethylbenzene and xylenes, oxygenates (methyl tertiary-butyl ether, tertiary butyl alcohol, etc.) and 16 priority pollutant polycyclic aromatic hydrocarbons. While these analyses meet regulatory requirements to determine chemicals of concern and their extent of contamination, the value of these data is very limited in an environmental forensics investigation (Stout et al., 2002). Analysis for forensics investigation is conducted with the same instruments, but aims to obtain data that are used to determine the ‘fingerprint’ or nature of the petroleum hydrocarbons. Two types of GC-based forensic data are described as follows.

8.3.1

GC/FID analyses

Typically, the analysis starts from GC/FID. The output from the instrument is two sets of data: retention time versus responses. These data can be presented in four forms: (i) carbon chain (C4–C5, C6–C7, etc.); (ii) chromatogram (i.e. a plot of signals of the compounds versus time); (iii) a list of identifiable compounds with relative concentrations; and (iv) raw data containing information of intensity of signals (voltage) and retention time in various formats (e.g. .cdf). All of these data, except for the chromatograms, may be used for cluster analysis. Carbon chain data Carbon chain analysis is typically conducted at conventional laboratories. The purpose of this type of analysis is to determine the distribution of hydrocarbons that may be used to gain some preliminary understanding of types of products released. Figure 8.1 presents the results of a crude sample as an example. As can be observed, no individual compounds are identified in the analysis; instead all compounds are integrated into a broad array of hydrocarbon ranges (C8–C9, C10–C11, etc., see the tabulated data in Figure 8.1). In the example, the hydrocarbons peaked at C8–C9 and gradually sloped down to C35 (see the chart in Figure 8.1). The analyses are broad, but a great deal can still be learned from the sample as its hydrocarbon range and general patterns mimic a type of crude oil.

199

PETROLEUM HYDROCARBON FORENSIC DATA AND CLUSTER ANALYSIS

Figure 8.1: Example of carbon chain analyses. Carbon range

Result (mg/kg)

DRO (C13–C22)

260 000

ORO (C23–C32)

120 000

EFH (C13–C32)

370 000

GRO (C4–C12)

150 000

EFH (C8–C9)

81 000

EFH (C10–C11)

59 000

EFH (C12–C13)

66 000

EFH (C14–C15)

61 000

EFH (C16–C17)

54 000

EFH (C18–C19)

50 000

EFH (C20–C21)

41 000

EFH (C22–C23)

38 000

EFH (C24–C25)

30 000

EFH (C26–C27)

24 000

EFH (C28–C29)

23 000

EFH (C30–C31)

17 000

EFH (C32–C35)

17 000

EFH (C36–C40)

0

90 000 70 000 60 000 50 000 40 000 30 000 20 000

Carbon range

EFH (C36–C40)

EFH (C32–C35)

EFH (C30–C31)

EFH (C28–C29)

EFH (C26–C27)

EFH (C24–C25)

EFH (C22–C23)

EFH (C20–C21)

EFH (C18–C19)

EFH (C16–C17)

EFH (C14–C15)

EFH (C12–C13)

0

EFH (C10–C11)

10 000

EFH (C8–C9)

Concentration: mg/kg

80 000

200

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

It should be pointed out that carbon chain data may be used to determine the relationship among samples only for sites with a well-understood history of release(s). For sites with complicated history of releases and/or weathered hydrocarbons, groupings generated from cluster analysis may be misleading. List of identifiable compounds As described previously, the carbon chain analysis does not specify individual compounds, but provides an array of hydrocarbon ranges present in the sample so that a general category of hydrocarbons can be determined. In most cases, the general nature of the analysis does not meet the need of fingerprinting hydrocarbons. Table 8.1 shows an example of fingerprinting results for a sample of gasoline product when specific compounds are identified. As can be seen, a total of 93 compounds are identified. Depending on the objective of the analysis, these compounds may be analysed as a whole, as a subset, or as ratios of selected sets of compound pairs used for analysis. The entire list of compounds may be used for determining potential sources if the petroleum products are relatively unaltered or for degree of weathering if samples are known to come from the same source. To determine the source relationship among samples with weathered samples, the weathering effect may be minimised by removing affected components or by using ratios of compounds that would be affected by a certain weathering mechanism on the same magnitude. In the case where affected components are to be removed, it is important that the dominant weathering mechanism(s) be identified. For example, if it is determined that evaporation is the dominant weathering mechanism, the compounds with high vapour pressure (low-molecularweight compounds) shall be removed. If it is determined that dissolution is the dominant weathering mechanism, the compounds with high solubility shall be removed (single-ring aromatics such as benzene and toluene). The use of ratios of compounds of interest is beneficial because with ratios concentration effects are minimised. In addition, the use of ratios tends to induce a self-normalising effect on the data since variations due to fluctuations of instrument Table 8.1: GC/FID results of an LNAPL sample. Analyte Propane Isobutane Isobutene Butane/Methanol trans-2-Butene cis-2-Butene 3-Methyl-1-butene Isopentane 1-Pentene 2-Methyl-1-butene

Concentration, Analyte % 0 0 0 0 0 0 0 0.37 0 0

3-Methylheptane 2,2,5-Trimethylhexane n-Octane 2,2-Dimethylheptane 2,4-Dimethylheptane Ethylcyclohexane 2,6-Dimethylheptane Ethylbenzene m + p Xylenes 4-Methyloctane

Concentration, % 0.75 0.31 2.85 0.11 0.34 1.33 0.7 0.36 0.63 1.02 ( continued)

201

PETROLEUM HYDROCARBON FORENSIC DATA AND CLUSTER ANALYSIS

Table 8.1: ( continued ) Analyte

Concentration, Analyte %

Concentration, %

Pentane trans-2-Pentene cis-2-Pentene/t-Butanol 2-Methyl-2-butene 2,2-Dimethylbutane Cyclopentane 2,3-Dimethylbutane/MTBE* 2-Methylpentane 3-Methylpentane Hexane trans-2-Hexene 3-Methylcyclopentene 3-Methyl-2-pentene cis-2-Hexene 3-Methyl-trans-2-pentene Methylcyclopentane 2,4-Dimethylpentane Benzene 5-Methyl-1-hexene Cyclohexane 2-Methylhexane/TAME 2,3-Dimethylpentane 3-Methylhexane 1-trans-3Dimethylcyclopentane 1-cis-3-Dimethylcyclopentane 2,2,4-Trimethylpentane

0.65 0 0 0 0.16 0 0.17 3.95 3.21 5.08 0 0.11 0 0.09 0.36 5.12 1.09 0 0.36 3.48 4.35 2.73 5.65 2.21

2-Methyloctane 3-Ethylheptane 3-Methyloctane o-Xylene 1-Nonene n-Nonane Isopropylbenzene 3,3,5-Trimethylheptane 2,4,5-Trimethylheptane n-Propylbenzene 1-Methyl-3-ethylbenzene 1-Methyl-4-ethylbenzene 1,3,5-Trimethylbenzene 3,3,4-Trimethylheptane 1-Methyl-2-ethylbenzene 3-Methylnonane 1,2,4-Trimethylbenzene Isobutylbenzene sec-Butylbenzene n-Decane 1,2,3-Trimethylbenzene Indane 1,3-Diethylbenzene 1,4-Diethylbenzene

0.98 0.27 1.4 0 0.06 1.75 0.34 0.23 0.18 0.99 0.11 0.64 1.69 0.58 0.23 0.05 0.76 0.07 0.22 1.23 0.32 0.45 0.77 0.47

2.95 1.26

1.29 0.58

n-Heptane

5.43

Methylcyclohexane

5.67

2,5-Dimethylhexane

0.8

2,4-Dimethylhexane 2,3,4-Trimethylpentane Toluene/2,3,3Trimethylpentane 2,3-Dimethyl hexane 2-Methylheptane 4-Methylheptane 3,4-Dimethylhexane 3-Ethyl-3-methylpentane 1,4-Dimethylcyclohexane

1.16 0.79 0.5

n-Butyl benzene 1,3-Dimethyl-5-Ethyl benzene 1,4-Dimethyl-2ethylbenzene 1,3-Dimethyl-4ethylbenzene 1,2-Dimethyl-4ethylbenzene Undecane 1,2,4,5-Tetramethylbenzene 1,2,3,5-Tetramethylbenzene

0 1.43 1.01

1,2,3,4-Tetramethylbenzene Naphthalene 2-Methyl-naphthalene 1-Methyl-naphthalene 1-Methyl-2-propylbenzene

1.24 0.39 1.43 0.75 –

0.98 2.47 1.23 0.24 2.69 0

1.01 0.31 3.03

202

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

operating conditions, operator and matrix effects are minimised (Wang and Christensen, 2006). For example, for gasoline-range hydrocarbons, 4-methylheptane and 2-methylheptane have identical or very similar vapour pressure and solubility, see Table 8.2 (US National Library of Medicine, 1994). By using ratios of these compounds in the matrix, evaporation and dissolution can be eliminated or minimised. Therefore, the groupings developed in the cluster analysis will help to better determine the relationship between samples in a relatively unaltered state. Raw data Typically, forensics specialists use the chromatogram to fingerprint the sample through pattern recognition. However, numeric data of each signal can also be obtained from built-in software in the instrument and used for cluster analysis. Depending on the sample media, GC/FID analysis can generate up to tens of thousands of data points (retention time versus voltage). Figure 8.2 shows a section of data in which a maximum locates at the data point #45307 or retention time of 25.1705 min. Of the 20 data points shown, 19 of them are at the ‘shoulder’ of the peak. All these data points may be used as input variables for preliminary analysis providing the software can handle it. However, as most of these data are non-essential, they may be removed for more efficient analysis. It should be pointed out that before data reduction the data should be examined to ensure that all data sets have the same number of data points that come from the same lab, with the same instrument and under the same patch. The raw data are typically exported from the instrument as a .cdf file, which can be imported and processed with various spectra software (e.g. Peakfit1 ). These data should be imported into the same spreadsheet of the Workbook (e.g. Excel) for examination and processing.

8.3.2

GC/MS analyses

GC/MS provides the most reliable data for identification of petroleum compounds. Two most common analyses are PIANO analyses and GC/MS full scan. PIANO analyses For gasoline-range hydrocarbons, PIANO analysis is commonly used to determine gasoline components. The term PIANO refers to the five groups of compounds of Table 8.2: Vapour pressure and solubility values for 4-methylheptane and 2-methylheptane. Compounds of interest

Vapour pressure (mm Hg at 258C)

Solubility at 258C

20.5 20.7

7.97 7.97

4-methylheptane 2-methylheptane Note: mm Hg indicates millimetres of mercury.

PETROLEUM HYDROCARBON FORENSIC DATA AND CLUSTER ANALYSIS

203

Figure 8.2: GC/FID raw data (retention time versus voltage).

petroleum products: paraffins (P), isoparaffins (I), aromatics (A), naphthenes (N) and olefins (O). These compounds are typically analysed with purge-and-trap GC/MS. However, in certain circumstances, these compounds can also be analysed with GC/ FID. Regardless of the method, the analysis is only suitable for gasoline-range hydrocarbons that are in the range of C4–C13. The output for this analysis may be in three forms: (i) a list of individual identifiable compounds; (ii) a summary of data in PIANO category; and (iii) a summary of data in carbon numbers. The first category of data is similar to those found using GC/FID analysis. The PIANO and carbon number data are described in the following paragraphs. Table 8.3 shows examples of three petroleum product samples. Sample A is a gasoline-range product that was likely produced during the performance era (i.e. prior to the 1960s); sample B is a light hydrocrackate, a refining stream that is dominated by isopentane; sample C is a reformate, a refining stream that is enriched with aromatics.. As each type of petroleum product is characterised by different combinations of PIANO, use of relative concentrations of the five categories may reveal relationships between samples that might not be seen by visual examination of the data.

204

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

Table 8.3: PIANO and carbon number results (A: gasoline range free product; B: light hydrocrackate free product; C: reformate). Sample ID Paraffins I-paraffins Olefins Naphthenes Aromatics Total unknowns C4 C5 C6 C7 C8 C9 C10 C11 C12

A

B

C

17.19 30.50 4.87 13.70 24.17 9.55 2.07 13.55 15.09 9.95 14.39 14.44 11.57 4.88 4.25

17.96 47.60 0.95 15.06 14.63 0.69 4.51 38.21 29.32 15.05 5.02 2.73 0.90 0.36 0.08

5.46 21.70 6.83 8.29 48.57 9.04 0.51 4.68 7.92 13.92 21.68 20.41 12.63 5.44 3.58

Relative concentrations of compounds in terms of carbon number category are also shown in the table. As compared with PIANO results, the carbon numbers reveal characteristics from a different perspective. For example, sample B has very low concentrations of C8 or above. With this, it is easy to see that this is a type of refining stream in which all components of C8 or heavier were removed during the refining process. GC/MS full scan If hydrocarbons are heavier than the gasoline range, a GC/MS full scan is conducted to characterise the chemical compounds. A GC/MS full scan identifies various classes of compounds including alkanes, isoalkanes, cycloalkanes, aromatics, bicyclane, sterane and terpane biomarkers and provides characterisation of petroleum hydrocarbons including crude oil, jet fuel, diesel and fuel oils. Figures 8.3(a) and (b) show two GC/MS chromatograms of ions 85 and 113 and relative concentrations of compounds in peak height and peak area. These data may be used for cluster analysis provided that the weathering effect is thoroughly evaluated and understood. For example, the isoprenoid compounds (e.g. IP 16, IP 18, pristine, phytane) may be used for slightly weathered samples because these compounds are relatively weathering resistant as compared with straight-chain hydrocarbons (Stout et al., 2002).

8.4

EXAMPLES

In this section, two examples are provided to illustrate the use of cluster analysis in environmental site characterisation. The first example shows groupings that resulted

205

PETROLEUM HYDROCARBON FORENSIC DATA AND CLUSTER ANALYSIS

from clustering PIANO and carbon number data and the appropriateness of each grouping. The second example illustrates how to use ratios to uncover groupings that might not be possible with individual compounds.

8.4.1

PIANO versus carbon number

The eight petroleum product samples were taken from monitoring wells at a petroleum refinery. Historically, there were multiple releases leading to the occurrence of intermediate refining streams and finished products in the shallow aquifer. Efforts were made to identify sources of these products so that the pathways of the historical releases would be better understood, leading to optimal remediation design. GC/FID analysis was conducted for these samples and results were presented in terms of over 300 compounds and summarised as PIANO and carbon numbers (Table 8.4). Based on the interpretation of chromatograms (not presented here), in combination with other site information, the eight samples appear to consist of four types of products, which are summarised below. • •

Sample G1 is a refining stream characterised by relatively high concentrations of paraffins and low concentrations of aromatics. Samples G2 and G3 are leaded gasoline, similar to modern gasoline in terms of hydrocarbon distribution.

Table 8.4: Concentrations of compounds in each PIANO category (volume %). Sample ID

Paraffins I-paraffins Olefins Naphthenes Aromatics Unknowns PIANO total Total C4 C5 C6 C7 C8 C9 C10 C11 C12 C4–C12

G1

G2

G3

G4

G5

G6

G7

G8

29.72 27.25 6.37 20.89 11.57 4.19 95.80 99.99 1.86 23.80 34.47 16.56 7.69 5.36 3.14 1.80 0.87 95.53

11.61 33.12 7.32 10.74 32.43 4.74 95.23 99.97 2.62 16.34 16.27 16.69 19.90 12.74 6.34 2.67 1.62 95.19

9.98 32.21 6.26 10.36 35.86 5.29 94.67 99.96 1.01 9.70 16.39 16.56 22.17 15.38 7.70 3.32 2.14 94.37

13.74 57.89 2.10 17.15 8.32 0.76 99.21 99.97 2.47 33.94 37.96 19.07 2.84 1.56 0.74 0.38 0.20 99.15

15.64 49.64 0.79 16.22 12.33 3.03 94.62 97.64 3.87 36.85 31.20 15.95 5.60 0.51 0.36 0.12 0.14 94.62

5.46 21.70 6.83 8.29 48.57 9.04 90.84 99.88 0.51 4.68 7.92 13.92 21.68 20.41 12.63 5.44 3.58 90.76

8.60 19.75 0.97 4.20 64.08 2.37 97.60 99.96 0.31 4.75 10.97 28.66 27.95 17.11 6.19 1.38 0.26 97.58

9.90 20.79 0.96 3.31 62.64 2.37 97.59 99.96 1.44 6.47 11.21 26.66 26.48 16.88 6.57 1.65 0.13 97.51

Figure 8.3: Selected ion chromatograms from GC/MS fall scan: (a) ion 85; (b) ion 113.

206 PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

207

(a)

PETROLEUM HYDROCARBON FORENSIC DATA AND CLUSTER ANALYSIS

Figure 8.3: (continued)

208

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

(b)

PETROLEUM HYDROCARBON FORENSIC DATA AND CLUSTER ANALYSIS

209

210 • •

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

Samples G4 and G5 are commingled products of light hydrocrackate and reformate. Samples G6, G7 and G8 are reformate (a refining stream). G6 is relatively unaltered while G7 and G8 are more weathered.

Owing to the fact that all the products are in the gasoline range and there are no pronounced differences in chemical characteristics, the cluster analysis was conducted to obtain additional information on the affinity or grouping of these products. PIANO data As can be observed from the table, the totals range from 97.6% to 100%. After removing the ‘unknowns’, the total percentage of PIANO ranges from 90.8% to 99.2%. The lower totals have resulted from relatively higher concentrations of unidentified compounds. It seems logical that the ‘unknown’ be redistributed to the existing five categories (i.e. PIANO) to preserve the total concentrations; however, because the ‘unknown’ does not come from each category on equal probability or have direct correlation with the existing percentage of each category, the redistribution is not justified and therefore not conducted. Nine analyses were conducted using a combination of the three methods of distance measurement and three methods of linkage as mentioned previously. Four was chosen as the number of clusters, based on groupings from preliminary analysis. Cluster analysis results are shown in Figures 8.4(a) to (i) – see colour insert. Examination of the results revealed that groupings from cluster analysis of various methods are consistent with groupings suggested based on the fingerprinting method, except for the combination methods that involve Pearson correlation. For the three methods involving Pearson correlation, the sample G6 is unexceptionally grouped with G2 and G3. Because the use of Pearson correlation as a distance measure is somewhat controversial (Everitt et al., 2001), the results from combination methods involving Pearson correlation may be ignored. However, the discrepancy between the analyses may provide additional insights for the relationship among the samples. With suggestions from cluster analysis, the findings using the Pearson correlation warrant examination of the affinity of sample G6 by forensics investigators. Carbon number data In Table 8.4, the totals range from 90.76% to 99.15%. As can be observed, the lower totals have resulted from relatively higher concentrations of unidentified compounds and potentially compounds of C13+. In this case, the only option is to use the data as they are. In a similar way to the PIANO data, the results from four cluster analyses are shown on Figures 8.5(a) to (i) – see colour insert. Examination of the results of carbon number data revealed that cluster analysis of various methods yields similar grouping to that of PIANO data regarding all the samples but G6. Affinity of G6 with other samples is complicated. In most methods, G6 is closer to G2 and G3 than G7 and G8. In methods of Manhattan/single,

PETROLEUM HYDROCARBON FORENSIC DATA AND CLUSTER ANALYSIS

211

Manhattan/average and Pearson/complete, G6 has similar affinity to G2/G3 and G7/ G8. Only with one method (Manhattan/complete) was G6 closer to G7 and G8. Based on the cluster analysis of the PIANO data in the forms of PIANO and carbon number, the grouping suggested by various methods is consistent with the preliminary analysis in general. The discrepancy (i.e. affinity of G6 with other samples) between cluster analysis groupings and those suggested based on fingerprinting analysis appears to provide additional information that might be of value for understanding the nature of the sample G6.

8.4.2

Raw data versus ratio analysis (an underground storage tank site)

The data for this example are from two leaking underground storage tank sites adjacent to each other. The monitoring system in site A is denoted as MW while that in site B is denoted GW. Based on current groundwater flow, site A is upgradient relative to site B. Monitoring wells MW1 in site A and GW1, GW2, GW3, GW4 and GW5 in site B all contain light non-aqueous phase liquid (LNAPL). The question is did the LNAPL in MW1 (site A) originate from the same source as those at site B or a separate source? LNAPL samples were collected from all six monitoring wells (i.e. MW1, GW1, GW2, GW3, GW4 and GW5). Laboratory analyses performed include: (i) GC/FID whole oil analysis (for carbon ranging from C3 to C44); (ii) oxygenated blending agents; (iii) ethylene dibromide/methylcyclopentadienyl manganese tricarbonyl and organic lead speciation GC/electron capture detector (ECD). Based on the carbon range and the distinct compound patterns, all six samples were identified as gasoline products. High concentrations of lead in all the samples indicated that the LNAPL at all six monitoring wells was at least partially leaded gasoline. Preliminary examination of the chromatograms suggests that they fall into three groups. The first group consists of the samples from MW1, GW4, and GW5. Within this group, GW4 and GW5 appeared to be identical. MW1 is similar to the other two samples except for certain compounds (e.g. a notable difference in 2-methylhexane and 3-methylhexane concentrations). The second group consists of samples from GW1 and GW2. They may be weathered products of the first group or could have originated from a different release. The third group consists of one sample, GW3, which contains significantly higher concentrations of aromatic compounds. This sample appeared to be either a weathered product of the LNAPL found in the second group contaminated with a product high in aromatic compounds or a completely unrelated product that is high in aromatics. Cluster analysis was performed independently from visual review of the chromatograms to explore potential grouping among the six samples. Preliminary analysis was conducted with the original data (Table 8.5). As there appear to be three groups based on preliminary analysis, three clusters were chosen for the dendrograms. Nine dendrograms with various combinations of distance measurement and linkage method are shown on Figures 8.6(a) to (i) – see colour insert.

212

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

Table 8.5: Concentrations of identified gasoline range compounds (%).

Isopentane Pentane 2,2-Dimethylbutane 2,3-Dimethylbutane/MTBE 2-Methylpentane 3-Methylpentane Hexane trans-2-Hexene 3-Methylcyclopentene 3-Methyl-2-pentene cis-2-Hexene 3-Methyl-trans-2-pentene Methylcyclopentane 2,4-Dimethylpentane 5-Methyl-1-hexene Cyclohexane 2-Methylhexane/TAME 2,3-Dimethylpentane 3-Methylhexane 1-trans-3-Dimethylcyclopentane 1-cis-3-Dimethylcyclopentane 2,2,4-Trimethylpentane n-Heptane Methylcyclohexane 2,5-Dimethylhexane 2,4-Dimethylhexane 2,3,4-Trimethylpentane Toluene/2,3,3-Trimethylpentane 2,3-Dimethyl hexane 2-Methylheptane 4-Methylheptane 3,4-Dimethylhexane 3-Ethyl-3-methylpentane 3-Methylheptane 2,2,5-Trimethylhexane n-Octane 2,2-Dimethylheptane 2,4-Dimethylheptane Ethylcyclohexane 2,6-Dimethylheptane Ethylbenzene m + p Xylenes 4-Methyloctane 2-Methyloctane

MW1

GW1

GW2

GW3

GW4

1.1 1.87 0.19 0.25 4.54 3.28 6.67 0.12 0.19 0.08 0.14 0.28 5.9 0.83 0.27 4 3.84 1.91 4.6 1.78 2.61 0.77 5.83 5.39 0.53 0.79 0.38 0.29 0.65 2.3 0.87 0.15 2.09 0.48 0.19 2.8 0.08 0.26 1.03 0.58 0.94 1.11 0.76 0.87

0.4 0.65 0.16 0.17 3.95 3.21 5.08 0 0.11 0 0.09 0.36 5.12 1.09 0.36 3.48 4.35 2.73 5.65 2.21 2.95 1.26 5.43 5.67 0.8 1.16 0.79 0.5 0.98 2.47 1.23 0.24 2.69 0.75 0.31 2.85 0.11 0.34 1.33 0.7 0.36 0.63 1.02 0.98

0.6 0.9 0.16 0.23 4.03 3.13 5.23 0 0.13 0 0.1 0.34 4.92 1.07 0.34 3.3 4.63 2.65 5.84 2.09 2.87 1.38 6.11 5.6 0.9 1.44 0.82 0.58 1.07 3.1 1.45 0.27 3.24 0.79 0.4 3.53 0.14 0.44 1.55 0.91 0.53 0.84 1.35 1.3

0.2 0.45 0.08 0.22 2.58 2.02 4.34 0.9 0.16 0.06 0.14 0.24 4.04 0.73 0.25 2.89 3.65 1.8 4.36 1.54 2.2 0.9 5.36 4.18 0.51 0.77 0.43 0.36 0.63 2.01 0.77 0.14 1.86 0.44 0.22 2.09 0.06 0.19 0.69 0.43 2 6.27 0.64 0.7

0.7 1.29 0.22 0.29 5.01 3.77 6.86 0.12 0.17 0.07 0.14 0.36 5.62 1.1 0.35 3.62 4.56 2.52 5.43 2.04 2.83 1.15 6.3 5.45 0.65 1.05 0.53 0.42 0.75 2.27 0.93 0.17 2.18 0.57 0.26 2.51 0.08 0.29 0.95 0.5 0.97 1.36 0.71 0.75

GW5 0.9 1.44 0.24 0.15 5.28 3.88 6.84 0.09 0.16 0.06 0.12 0.33 5.28 1 0.32 3.12 4.08 2.21 4.85 1.79 2.49 1 5.41 4.67 0.56 0.9 0.46 0.37 0.65 2.05 0.83 0.14 1.96 0.49 0.23 2.32 0.08 0.29 0.87 0.51 0.5 1.04 0.76 0.85 ( continued)

213

PETROLEUM HYDROCARBON FORENSIC DATA AND CLUSTER ANALYSIS

Table 8.5: ( continued )

3-Ethylheptane 3-Methyloctane o-Xylene 1-Nonene n-Nonane Isopropylbenzene 3,3,5-Trimethylheptane 2,4,5-Trimethylheptane n-Propylbenzene 1-Methyl-3-ethylbenzene 1-Methyl-4-ethylbenzene 1,3,5-Trimethylbenzene 3,3,4-Trimethylheptane 1-Methyl-2-ethylbenzene 3-Methylnonane 1,2,4-Trimethylbenzene Isobutylbenzene sec-Butylbenzene n-Decane 1,2,3-Trimethylbenzene Indane 1,3-Diethylbenzene 1,4-Diethylbenzene n-Butyl benzene 1,3-Dimethyl-5-Ethyl benzene 1,4-Dimethyl-2-ethylbenzene 1,3-Dimethyl-4-ethylbenzene 1,2-Dimethyl-4-ethylbenzene 1,2,4,5-Tetramethylbenzene 1,2,3,5-Tetramethylbenzene 1,2,3,4-Tetramethylbenzene Naphthalene 2-Methyl-naphthalene 1-Methyl-naphthalene

MW1

GW1

GW2

GW3

GW4

GW5

0.21 1.16 0 0.04 1.68 0.29 0.2 0.15 1.53 0.83 0.9 1.8 0.45 0.7 0.04 2.28 0.06 0.19 1.11 0.76 0.65 0.61 0.52 1.05 0.45 0.77 0.37 2.6 1.06 0.64 0.88 0.5 1.35 0.58

0.27 1.4 0 0.06 1.75 0.34 0.23 0.18 0.99 0.11 0.64 1.69 0.58 0.23 0.05 0.76 0.07 0.22 1.23 0.32 0.45 0.77 0.47 1.29 0.58 1.01 0.31 3.03 1.43 1.01 1.24 0.39 1.43 0.75

0.36 1.89 0 0.07 2.21 0.41 0.26 0.21 0.6 0.34 0.49 1.33 0.57 0.41 0.06 1.45 0.08 0.19 1.15 0.1 0.13 0.41 0.69 0.87 0.36 0.75 0.52 1.37 0.64 0.55 0.6 0 0.45 0.2

0.16 0.91 0.95 0.07 1.13 0.17 0.12 0.13 1.12 2.57 1.59 2.2 0.42 1.12 0.04 5.81 0.05 0.16 0.87 1.56 0.97 0.65 1.23 1.36 0.46 1.13 0.11 2.31 0.97 1.22 0.87 0.94 1.32 0.57

0.23 1.05 0 0.04 1.29 0.23 0.16 0.12 1.02 0.38 0.92 1.6 0.38 0.53 0.04 2.2 0.05 0.16 0.83 0.59 0.46 0.49 0.61 1.03 0.42 0.78 0.45 2 0.85 0.76 0.72 0.37 0.92 0.39

0.2 1.12 0 0.04 1.53 0.27 0.18 0.16 1.39 0.73 1.07 1.98 0.5 0.79 0.05 2.75 0.07 0.22 1.17 0.65 0.43 0.86 0.75 1.39 0.5 1.06 0.61 2.91 1.21 0.97 0.99 0.26 1.12 0.5

Review of the clusters reveals that the six samples fall into the same three groups as suggested based on a review of chromatograms. Based on the agreement on the groupings between the two independent analyses, it appears that LNAPL in MW1 was related to that found in the samples from site B, implying that LNAPL in site B is part of the LNAPL plume originating from site A. However, this interpretation does not appear consistent with the site history of operation and hydrogeological setting. The most likely reason for this is the weathering effect that

214

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

may have altered the products so that presence or absence of residual compounds and their relative percentages are not representative of the nature of the original products. To restore the original nature of the products, the weathering effects must be eliminated or minimised. In this example, ratios of compounds were used for this purpose. The concepts for using ratios of compounds are described in section 8.3. The ratios of 19 pairs of compounds were used for cluster analysis (Table 8.6). The vapour pressure and solubility properties for the pairs chosen were within 20% based on values published in various literature (Howard et al., 1991; Lyman et al., 1992; Montgomery, 2000; United States Library of Medicine, 1994). Nine dendrograms derived from ratio analysis are shown on Figures 8.7(a) to (i) – see colour insert. As compared with the groupings resulting from individual compounds, similarity of the ratio compounds found in the samples changes. The most noticeable change is that MW1 is singled out from all other samples. Among the five samples on site B, GW1 and GW2 are similar and three other samples (GW3, GW4 and GW5) are similar. The groupings based on ratios appear to be very consistent with site hydrogeology setting and site operation history. They helped in understanding the LNAPL distribution at both sites and therefore, aided the refinement of the remediation strategy.

8.5 CONCLUDING REMARKS

Cluster analysis is a simple tool that can be used in environmental site characterisation to identify potential sources and relationships among samples collected at sites of interest, through the groupings generated by the analysis. Many types of petroleum hydrocarbon data are suitable for cluster analysis, and the two examples given in section 8.4 demonstrate how it may be used creatively in real-life applications. Because of the many possible choices of distance measure and linkage method, the same data can yield multiple groupings; the key to successful use of cluster analysis is therefore to understand the objective of the analysis and the types and limitations of the data. Investigators should bear in mind that cluster analysis is a supplementary tool used for exploratory purposes: it can help in reaching a conclusion, but only when coupled with other information, especially the site operational history and hydrogeological setting.

ACKNOWLEDGEMENTS

The author would like to thank Joel Farrier and Travis Taylor for their encouragement in preparing this chapter, Kim Olsen for her editing and formatting of the text, tables and figures, and Dr Yue Rong for his constructive comments on the chapter.


Table 8.6: Ratios of selected 19 pairs of gasoline range compounds.

Ratio                                                          MW1    GW1    GW2    GW3    GW4    GW5
4-Methylheptane/2-Methylheptane                                0.38   0.50   0.47   0.38   0.41   0.40
trans-1,3-Dimethylcyclopentane/cis-1,3-Dimethylcyclopentane    0.68   0.75   0.73   0.70   0.72   0.72
3-Ethyl-3-methylpentane/3,4-Dimethylhexane                     13.93  11.21  12.00  13.29  12.82  14.00
3,4-Dimethylhexane/2,3-Dimethylhexane                          0.23   0.24   0.25   0.22   0.23   0.22
3-Ethyl-3-methylpentane/2,3-Dimethylhexane                     3.22   2.74   3.03   2.95   2.91   3.02
3-Methylhexane/2,3-Dimethylpentane                             2.41   2.07   2.20   2.42   2.15   2.19
2-Methylheptane/3-Ethyl-3-methylpentane                        1.10   0.92   0.96   1.08   1.04   1.05
2,3-Dimethylpentane/2,4-Dimethylpentane                        2.30   2.50   2.48   2.47   2.29   2.21
Ethylbenzene/m-Xylene (m/p-xyl)                                0.85   0.57   0.63   0.32   0.71   0.48
4-Methylheptane/3-Ethyl-3-methylpentane                        0.42   0.46   0.45   0.41   0.43   0.42
2-Methylheptane/3,4-Dimethylhexane                             3.54   2.52   2.90   3.19   3.03   3.15
2,5-Dimethylhexane/2,3,3-Trimethylpentane                      1.83   1.60   1.55   1.42   1.55   1.51
4-Methylheptane/3,4-Dimethylhexane                             5.80   5.13   5.37   5.50   5.47   5.93
2-Methylheptane/2,3-Dimethylhexane                             3.54   2.52   2.90   3.19   3.03   3.15
2,3-Dimethylhexane/2,3,3-Trimethylpentane                      2.24   1.96   1.84   1.75   1.79   1.76
4-Methylheptane/2,3-Dimethylhexane                             1.34   1.26   1.36   1.22   1.24   1.28
3-Ethyl-3-methylpentane/2,3,3-Trimethylpentane                 7.21   5.38   5.59   5.17   5.19   5.30
3,4-Dimethylhexane/2,3,3-Trimethylpentane                      0.52   0.48   0.47   0.39   0.40   0.38
Pentane (nC5)/Isopentane (iC5)                                 1.73   1.76   1.50   1.96   1.74   1.64

REFERENCES

Everitt, B.S., Landau, S. and Leese, M. (2001). Cluster Analysis, 4th edn. Arnold, London.
Howard, P.H., Boethling, R.S., Jarvis, W.F., Meylan, W.M. and Michalenko, E.M. (1991). Handbook of Environmental Degradation Rates. Lewis Publishers, Boca Raton, Florida.
Lyman, W.J., Reidy, P.J. and Levy, B. (1992). Mobility and Degradation of Organic Contaminants in Subsurface Environments. C.K. Smoley, Inc., Chelsea, Michigan.
Montgomery, J.H. (2000). Groundwater Chemicals – Desk Reference, 3rd edn. Lewis Publishers, Boca Raton, Florida.
Rencher, A.C. (2002). Methods of Multivariate Analysis, 2nd edn. Wiley-Interscience.
Sneath, P.H.A. (1957). The application of computers to taxonomy. Journal of General Microbiology. 17: 201–226.
Sokal, R.R. and Michener, C.D. (1958). A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin. 38: 1409–1438.
Sorensen, T. (1948). A method of establishing groups of equal amplitude in plant sociology based on similarity of species content and its application to analyses of the vegetation on Danish commons. Biologiske Skrifter. 5: 1–34.
Stout, S.A., Uhler, A.D., McCarthy, K.J. and Emsbo-Mattingly, S. (2002). Chemical fingerprinting of hydrocarbons. In: Murphy, B.L. and Morrison, R.D. (Eds), Introduction to Environmental Forensics. Academic Press, San Diego, California.
United States National Library of Medicine (1994). 5th edn. United States Government Printing Office, Washington D.C.
Wang, Z. and Christensen, J.H. (2006). Crude oil and refined product fingerprinting: applications. In: Morrison, R.D. and Murphy, B.L. (Eds), Environmental Forensics – Contaminant Specific Guide. Academic Press, Amsterdam.

CHAPTER 9

Anomaly Detection Methods for Hydrologists, Hydrogeologists and Environmental Engineers

Farid Achour, Jean-Pierre Laborde and Lynda Bouali

9.1 INTRODUCTION

Anomaly detection refers to the process of finding occurrences in data that do not conform to expected behaviour. These non-conforming occurrences are variously referred to as anomalies, strange observations, discordant observations, outliers, exceptions, aberrations or irregularities. Anomaly detection has been studied in the statistics community since the early 19th century, and over time a variety of techniques have been developed for specific application domains (e.g. insurance, fraud detection, cyber security, public health, agronomic research, industrial fault and damage detection, and military surveillance for enemy activity).

An anomaly can be either an error or an accurate observation. Statistics does not allow the identification of errors in time series; it only allows the identification of anomalies. The boundary between normal and anomalous occurrences is often not clear, so an observation flagged as anomalous close to that boundary may in fact be real, and vice versa.

There are two types of errors in environmental time series: accidental and systematic. Accidental errors do not necessarily recur when an observation is repeated under the same conditions; they often occur during reading, transcription or conversion of data. Systematic errors result from some bias in the measurement process and are not due to chance; they occur, for example, when a gauge is moved from its previous location, when a different recording device is used, when there is instrumental drift, or when there is a change in recording method or observer.

Ronald Coase, the 1991 Nobel laureate in economics, said: 'If you torture the data long enough, it will confess.' Although this notion is clearly important in the water and environmental sciences, one reality that hydrologists and hydrogeologists


are constantly facing is the absence of a reference time series that is known to be error free and against which the time series collected within the basin or aquifer could be compared. Failure to detect anomalies in time series in water and environmental projects can have significant economic, environmental and social impacts. To avoid these negative outcomes, we developed a new methodology for anomaly detection in hydrologic and hydrogeologic time series that we have been using successfully for the last decade.

This methodology is based on the concept of 'regional vectors': synthetic, error-free time series with a continuous spatial structure. The regional vectors are the factor scores of each factorial component representing a spatial structure, obtained after a principal component analysis (PCA) is performed on the entire dataset. Only principal components representing a spatial structure are taken into account when investigating anomalies; because accidental and systematic errors are randomly distributed in space, they are carried by the principal components with no spatial structure. The factor scores (i.e. the regional vectors) are the projections of the observations on the principal components; they are synthetic and unitless, and they summarise the information common to all variables — in this sense they filter out errors. Comparing a given time series with the regional vectors using Bois' ellipse statistical test yields the probability that a given value within the time series is an anomaly and therefore needs to be verified. This anomaly detection procedure, when performed at an early stage of a water and environmental study, can save a tremendous amount of time and money and should therefore be part of standard practice.

As previously stated, statistical tests do not allow engineers and scientists to detect errors; they make it possible to identify observations with a low probability of occurrence. The general assumption is that normal data belong to large and dense clusters, whereas anomalies belong to small and sparse clusters (Figure 9.1). It is the responsibility of the scientist or engineer to investigate the authenticity of potential anomalies by going back to the original field documents, when available, or by comparing data from one station with those from surrounding stations. The data can then be validated or corrected at the earliest appropriate opportunity in the project life cycle.

To date, anomaly detection has not been general practice in the hydrological and hydrogeological sciences. As a result, many groundwater numerical models developed as part of groundwater management programmes have been calibrated by comparing simulated groundwater levels against observed water levels containing errors; such a model may, for example, predict volumes of water that do not exist, confounding attempts to manage the resource. Similarly, in hydrology, dams are dimensioned according to the analysis of available runoff data: to calculate a dam's height, hydrologists use statistical models that rely on the past behaviour of a given river and characterise the runoff time series statistically by parametrising its distribution. Anomalies in the collected data will lead to an under- or overestimation of the height of the planned dam. Both scenarios can have disastrous financial consequences, to say nothing of safety concerns.


Figure 9.1: Anomalies in a two-dimensional dataset. G1 and G2: normal groups; O1, O2, O3, O4 and O5: anomalies.

9.2 DIFFERENT TYPES OF ERRORS

Errors may be present in hydrological and hydrogeological time series for a variety of reasons. Typically we can differentiate two types: accidental and systematic. Accidental errors usually affect some measurements locally and are randomly distributed in time and space across the realisations (observations) and variables (stations). Systematic errors uniformly affect certain portions of a time series, and those portions are randomly distributed in time and space. The reader is reminded that an anomaly can be an error or an accurate observation, such as an exceptional event (accidental anomaly) or climate change (systematic anomaly). In evaluating methods for detecting anomalies, it is important to balance their ability to flag anything that is likely to be a mistake against the risk of rejecting anomalies that represent real and accurate data. For this reason, the professional judgement of the scientist or engineer is crucial in interpreting an apparent anomaly.

9.3 ANOMALY DETECTION METHODS

9.3.1 Parametric and non-parametric methods

Parametric and non-parametric techniques are used to detect anomalies in time series. Parametric methods assume that the underlying probability distribution is known a priori; the most used parametric methods are the box plot rule, Grubbs' test, Student's t-test, regression-based models, minimum volume ellipsoid (MVE) estimation, autoregressive integrated moving average (ARIMA) models and principal component analysis (PCA). Non-parametric methods do not assume that the underlying probability distribution is known a priori; the most used are histogram-based and kernel-function methods.
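To make one of these screens concrete, the following is a minimal sketch of the box plot rule, which flags values lying beyond the quartile fences. The conventional 1.5 × IQR multiplier and the example values are assumptions of the illustration, not recommendations from this chapter.

```python
import numpy as np

def boxplot_outliers(x, k=1.5):
    """Return the values of x outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return x[(x < lower) | (x > upper)]

# Example: annual rainfall (mm) with one hypothetical transcription
# error (9600 recorded instead of 960).
rain = [611, 956, 955, 896, 457, 892, 622, 596, 784, 9600]
print(boxplot_outliers(rain))   # the erroneous value is flagged
```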

9.3.2 Comparison to a time series of reference

Double mass analysis

Double mass analysis, also called the 'double mass plot' or 'double mass curve', is a commonly used graphical approach for investigating the behaviour of hydrological or hydrogeological observations made at a number of locations. Points and/or a line are plotted for two stations, A and B: the x and y coordinates are the cumulated values of A plotted against the cumulated values of B for the same period. If both stations are affected to the same extent by the same phenomenon, the double mass curve should follow a straight line; a break in the slope of the curve indicates that conditions have changed at one location but not at the other (Figure 9.2).

Figure 9.2: Double mass plot (cumulative values for one station against cumulative values for the other, with a break point where the slope changes).

The drawbacks of the double mass analysis technique, illustrated computationally in the sketch that follows this list, are that:

• it is not easy to detect an outlier visually;
• it is not always easy to determine a breakpoint visually;
• even if a breakpoint is identified, we do not know which of the two stations produced the anomaly.
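Computationally, a double mass analysis amounts to little more than two cumulative sums. The sketch below is illustrative only: it builds the cumulated values for two synthetic, equally long records and compares the slope of the double mass curve over the two halves of the record as a crude indicator of a break; the synthetic data and the half-split are assumptions of the example.

```python
import numpy as np

def double_mass(a, b):
    """Cumulate two station records for a double mass plot and return
    the slope of the curve over the first and second halves."""
    ca, cb = np.cumsum(a, dtype=float), np.cumsum(b, dtype=float)
    mid = len(ca) // 2
    # Slope of cumulative-B against cumulative-A on each half; a marked
    # difference suggests a break in the double mass curve.
    s1 = np.polyfit(ca[:mid], cb[:mid], 1)[0]
    s2 = np.polyfit(ca[mid:], cb[mid:], 1)[0]
    return ca, cb, s1, s2

rng = np.random.default_rng(1)
a = rng.normal(800, 150, 30)            # station A, annual rainfall (mm)
b = a * 0.9 + rng.normal(0, 30, 30)     # station B, correlated with A
b[15:] += 150                           # systematic shift after year 15
ca, cb, s1, s2 = double_mass(a, b)
print(f"slope, first half: {s1:.2f}; second half: {s2:.2f}")
```

Note that, exactly as stated above, the slopes tell us that something changed but not which of the two stations is responsible.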

Regression residuals analysis

Let's assume that we have an error-free time series 'X', and that we plan to check for the presence of anomalies in another time series 'Y' that is reasonably correlated with X.

Underlying statistical hypotheses

In the following, we will assume that the analysed time series are realisations of random Gaussian variables of N dimensions, so that:

• marginal distributions are Gaussian;
• regressions are linear;
• conditional distributions are Gaussian.

Where the analysed data deviate from normality, appropriate transformations should be applied to the variables to obtain a distribution that is normally distributed. For example, although the statistical distributions of most climatic variables are positively skewed, we can often achieve a normal distribution by applying an anamorphosis to the raw data, such as $Y = \ln(X)$ or $Y = \sqrt{X}$. To construct the virtual time series we will use the results of a principal component analysis (PCA): the principal components are linear combinations of the entire dataset, they are orthogonal among themselves, and they have a mean of 0 and a variance of 1. We will also evaluate whether the values constituting the components of the PCA are Gaussian; if not, we will perform an anamorphosis on the original dataset.

Detection of accidental anomalies

It is known from the literature that the conditional distribution of Y knowing X is Gaussian, with a mean of

$\mu_{y|x} = \mu_y + r\,\dfrac{\sigma_y}{\sigma_x}\,(x - \mu_x)$   (9.1)

and a standard deviation of

$\sigma_{y|x} = \sigma_y\sqrt{1 - r^2}$   (9.2)

This means that the difference, $\varepsilon_i$, between a particular value $y_i$ and its conditional mean (the deviation from the regression line) is Gaussian, with zero mean and a standard deviation of

$\sigma_\varepsilon = \sigma_y\sqrt{1 - r^2}$   (9.3)

We also know that for a Gaussian population there is a 90% chance of having $-1.645 < U < +1.645$


and a 95% chance of having $-1.96 < U < +1.96$. It is therefore possible to compare the position of the experimental values relative to the most commonly used confidence intervals (90%, 95%, 98%). Points located outside these intervals have a low probability of being attributable to chance and will therefore need to be verified (Figure 9.3 – see colour insert); these points are identified as point, or accidental, anomalies. In practice, for a confidence level of 95% (U approximately between −2 and +2), all errors of four times $\sigma_y\sqrt{1 - r^2}$ will be systematically detected, while errors of twice $\sigma_y\sqrt{1 - r^2}$ will have a 50% chance of being detected (Figure 9.4 – see colour insert).

As shown in Figure 9.5, if too many errors have been identified, the 95% confidence interval can be artificially widened. As a result of widening the confidence interval, most if not all of the existing anomalies will escape detection. However, as shown in Figure 9.6, the distribution of the regression residuals is then no longer Gaussian, which itself indicates the underlying presence of anomalies.

Figure 9.5: Important accidental errors (original and erroneous points; both axes in mm).

Figure 9.6: Non-Gaussian regression residuals (Gaussian distribution fitted to the regression residuals; residuals in mm against Z values).

As a partial conclusion, the regression residuals analysis identifies as anomalies all the accidental errors whose deviation exceeds a multiple of the conditional standard deviation $\sigma_y\sqrt{1 - r^2}$, the multiple depending on the selected confidence interval (Table 9.1).

Table 9.1: Conditional standard deviation versus confidence interval.

Confidence interval at:    Absolute value of the standardised error greater than:
95%                        1.960
90%                        1.645
80%                        1.282

If we choose a confidence interval of 80%, most of the errors will be included in the identified anomalies, but the flagged anomalies will be too numerous and often due to chance. Our experience indicates that it is good practice to start with a confidence interval of 95% or 98% and then decrease it progressively, so that suspicious data are identified and can be corrected, retained or removed.
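The screening procedure just described can be written in a few lines. The sketch below assumes an error-free reference series x and flags the values of y whose deviations from the conditional mean of Equation 9.1 exceed the chosen multiple of the conditional standard deviation of Equations 9.2 and 9.3; the synthetic data and the injected error are assumptions made for illustration.

```python
import numpy as np

def accidental_anomalies(x, y, z=1.96):
    """Flag indices of y whose deviation from the conditional mean
    mu_y + r*(sigma_y/sigma_x)*(x - mu_x) exceeds z * sigma_y * sqrt(1 - r^2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    r = np.corrcoef(x, y)[0, 1]
    cond_mean = y.mean() + r * (y.std() / x.std()) * (x - x.mean())  # Eq. 9.1
    cond_std = y.std() * np.sqrt(1.0 - r**2)                         # Eq. 9.2
    resid = y - cond_mean                                            # Eq. 9.3
    return np.where(np.abs(resid) > z * cond_std)[0]

rng = np.random.default_rng(0)
x = rng.normal(900, 150, 30)             # reliable reference series
y = 0.8 * x + rng.normal(0, 40, 30)      # correlated series under test
y[12] += 4 * y.std()                     # one injected accidental error
print(accidental_anomalies(x, y))        # index 12 should be flagged
```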

Detection of systematic anomalies

It is common to encounter time series that become heterogeneous owing to gauge shifting or a change of measurement device or observer. Let us assume that the observed values Y became artificially higher (black dots) after an early observation period (white dots), and, as before, that the values contained in X are reliable. In Figure 9.7 a break of stationarity can be detected; in practice, however, all the dots look similar. Let us therefore consider the population as homogeneous and plot the regression line $y = f(x)$, as shown in Figure 9.8. In this example, the regression residuals $\varepsilon_i = y_i - \mu_{y|x}$ are generally negative during the first period (white dots) and generally positive during the second.

Figure 9.7: Break of stationarity (Y against X).

Figure 9.8: Regression residuals sign ($\varepsilon < 0$ in the first period, $\varepsilon > 0$ in the second).

If we construct a new variable $Z_i$ equal to the sum of the first i regression residuals, $Z_i = \sum_{j=1}^{i} \varepsilon_j$, we obtain a curve that first decreases and then increases, returning to 0 (the residuals have a zero mean). $Z_i$ is the sum of i realisations of a Gaussian random variable $\varepsilon$ with zero mean and standard deviation $\sigma_\varepsilon = \sigma_y\sqrt{1 - r^2}$; $Z_i$ is therefore itself a Gaussian random variable, and its mean is obviously 0 for any given i. Within a sample of size $n_e$, however, the successive draws of $\varepsilon$ are performed under the condition that $Z_{n_e} = 0$. This is reflected in the standard deviation of $Z_i$:

$\sigma_{Z_i} = \sigma_\varepsilon\,\sqrt{\dfrac{i\,(n_e - i)}{n_e - 1}}$   (9.4)

It is therefore straightforward to establish the confidence interval inside which there is a given probability of finding $Z_i$: a 95% chance of having

$-1.96\,\sigma_\varepsilon\sqrt{\dfrac{i(n_e - i)}{n_e - 1}} < Z_i < +1.96\,\sigma_\varepsilon\sqrt{\dfrac{i(n_e - i)}{n_e - 1}}$   (9.5)

and a 90% chance of having

$-1.645\,\sigma_\varepsilon\sqrt{\dfrac{i(n_e - i)}{n_e - 1}} < Z_i < +1.645\,\sigma_\varepsilon\sqrt{\dfrac{i(n_e - i)}{n_e - 1}}$   (9.6)

Bois' ellipse test

Bois' ellipse test is an extension of double mass analysis to which Bois added a robust statistical element: the cumulative residuals are evaluated against confidence intervals that trace out ellipses. Plotting the evolution of the cumulative regression residuals together with the 95% confidence interval (Figure 9.9 – see colour insert) shows that the cumulative regression residuals have clearly less than a 5% probability of being due to chance. We can therefore conclude that we are dealing with a systematic anomaly, which probably started between the 18th and 28th observations. As shown in Figure 9.10, the effect of two breaks of stationarity 'forces' the cumulative residuals to remain within the 95% confidence interval.

Figure 9.10: Time series with two breaks of stationarity (sum of the deviations from the conditional mean value, in mm, 1937–1997; breaks near 1961 and 1985).

However, we notice abrupt changes in the slope of $Z_i$ in 1961 and then again in 1985. It is therefore recommended to verify successively the stationarity of the period 1932–1985 and then of the period 1962–2001; within each period it is most likely that there is only one break of stationarity. Figures 9.11 and 9.12 confirm the existence of these breaks of stationarity at the 95% confidence level.

Figure 9.11: First period with break of stationarity (sum of the deviations from the conditional mean value, in mm, 1936–1981; break near 1961).

Figure 9.12: Second period with break of stationarity (sum of the deviations from the conditional mean value, in mm, 1964–2000; break near 1985).
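A minimal sketch of Bois' test is given below: it cumulates the regression residuals and compares them with the bounds of Equations 9.4 to 9.6 that trace out the 'ellipse'. The synthetic series and the injected break are assumptions made for illustration; this is not the authors' original implementation.

```python
import numpy as np

def bois_ellipse(x, y, z=1.96):
    """Cumulative regression residuals Z_i with Bois' confidence bounds."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    resid = y - (y.mean() + r * (y.std() / x.std()) * (x - x.mean()))
    sig_eps = y.std() * np.sqrt(1.0 - r**2)
    zi = np.cumsum(resid)
    i = np.arange(1, n + 1)
    bound = z * sig_eps * np.sqrt(i * (n - i) / (n - 1.0))   # Eq. 9.4
    return zi, bound, np.where(np.abs(zi) > bound)[0]

rng = np.random.default_rng(2)
x = rng.normal(900, 150, 40)             # reliable reference series
y = 0.8 * x + rng.normal(0, 40, 40)
y[20:] += 120                            # synthetic break of stationarity
zi, bound, outside = bois_ellipse(x, y)
print("observations outside the 95% ellipse:", outside)
```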

9.4 CONSTRUCTION OF A VIRTUAL TIME SERIES OF REFERENCE

As suggested above, it is almost impossible to have a time series X that is absolutely reliable. The common approach is to compare a series Y successively with X1, then X2, X3, . . ., to see whether the same anomalies are found each time and to determine whether they lie in Y or in X. This approach is tedious and not easy to computerise. For several years we have instead successfully used an anomaly detection methodology based on the results of PCA; the reasoning focuses on understanding the physical phenomena, and the results of the PCA are not difficult to interpret. Let us go back to the analysis of climatic events and the assumptions that can be made.

• Climatic data are often presented in the form of tables with column variables (different measurement points in space) and row observations (measurements at different times).
• We assume that the time series, or its transform, is complete and that the phenomena are Gaussian.
• The variables are distributed in space and there is spatial continuity in their chronological behaviour: the time series may have different means and standard deviations, but data from different sites are more strongly correlated the closer the sites are to one another (spatial continuity of the phenomenon for standardised variables).
• Measurements are representative of the observed phenomenon, but the time series may be contaminated with random accidental and/or systematic errors.
• The error distributions may have two different shapes: an 'irregular' shape for accidental errors, with many zero values, and an 'echelon' shape for systematic errors (Figures 9.13 and 9.14).
• Errors are randomly distributed in space, so there is no spatial structure to the errors.

If these assumptions are acceptable, principal component analysis allows us to 'filter' the information in the data matrix. The principal components are calculated in descending order of the eigenvalues of the matrix of correlation coefficients. This implies that the first component is the one on which the (centred and standardised) observations are least distorted in projection: the first component is the one that best explains all the variables, the second best explains what was not explained by the first, and so on. It is therefore to be expected that the errors (assumed random and therefore specific to a given variable) appear in the components of higher order.

Figure 9.13: Accidental errors with irregular shape.

Figure 9.14: Systematic errors with echelon shape.

The problem then arises of deciding when the components no longer explain the underlying phenomenon but mostly measurement errors. As the errors are randomly distributed in space, an 'error' component, caused by erroneous variables, will have no spatial structure. It is sufficient to study the spatial structure of the correlation coefficients between the variables and each component (the projection of the variables on the components) by analysing the resulting variogram. If the variogram of a component is continuous at the origin, the component represents information that is continuous in space and related to the studied phenomenon; by contrast, a pure nugget variogram indicates that the component primarily explains errors.

To clarify the concept of regional vectors further, assume that we have $n_v$ variables (stations) on which $n_o$ observations are recorded (e.g. rainfall, water levels); we therefore have a cloud of $n_o$ points in a space of $n_v$ dimensions. If these variables are correlated among one another, the cloud can be represented in a subspace of lower dimension. The objective of PCA is to identify this subspace under the following constraints: the axes of the subspace are orthogonal to each other, and each axis maximises the variance of the projections of the observations on it. The directions of these axes (the principal components) are given by the eigenvectors of the correlation coefficient matrix of the initial $n_v$ variables, and the regional vectors are defined as the projections of the observations on the principal components. Detailed presentations of principal component analysis, including the mathematical formulation of the factor scores and factor loadings, can be found in Legendre and Legendre (1998) and Jolliffe (2002).

We can reasonably affirm that the components with spatial and temporal continuity are those explaining the phenomenon being evaluated, and the projections of the variables on these first components represent the phenomenon filtered of measurement errors. Therefore, instead of comparing a variable Y with a real time series X, we compare it with the virtual values Y′ that Y would have had, given the values of the first components.
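To make the construction concrete, here is a sketch of how the regional vectors (factor scores) and factor loadings can be computed from the centred and standardised data matrix, together with a crude experimental variogram that can be inspected for a nugget effect. The eigendecomposition route and the simple distance binning are assumptions made to keep the example self-contained; they are not the authors' code.

```python
import numpy as np

def regional_vectors(data):
    """PCA of an (n_obs x n_stations) data matrix.

    Returns the factor scores (regional vectors; zero mean, unit
    variance), the factor loadings (correlation of each variable with
    each component) and the explained-variance ratios."""
    n = len(data)
    z = (data - data.mean(axis=0)) / data.std(axis=0)   # centre, standardise
    eigval, eigvec = np.linalg.eigh(z.T @ z / n)        # correlation matrix
    order = np.argsort(eigval)[::-1]                    # descending eigenvalues
    eigval, eigvec = eigval[order], eigvec[:, order]
    scores = (z @ eigvec) / np.sqrt(eigval)             # unit-variance scores
    loadings = eigvec * np.sqrt(eigval)                 # variable-component corr.
    return scores, loadings, eigval / eigval.sum()

def experimental_variogram(coords, values, nbins=6):
    """Crude experimental variogram of, for example, one column of the
    loadings; a flat (pure nugget) shape suggests an 'error' component."""
    d = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))
    g = 0.5 * (values[:, None] - values[None, :]) ** 2
    iu = np.triu_indices(len(values), k=1)
    bins = np.linspace(0.0, d[iu].max(), nbins + 1)
    which = np.clip(np.digitize(d[iu], bins) - 1, 0, nbins - 1)
    return np.array([g[iu][which == b].mean() for b in range(nbins)])

# Tiny usage example on random placeholder data.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, (25, 2))    # 25 gauges in a 100 km square
data = rng.normal(800, 120, (30, 25))    # 30 years x 25 stations
scores, loadings, ratio = regional_vectors(data)
print("explained variance of the first three components:", ratio[:3])
print(experimental_variogram(coords, loadings[:, 1]))
```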


Suppose we have $n_c$ components retained as presenting a spatial structure, and suppose that a variable Y (with mean $\bar{y}$ and standard deviation $s_y$) is correlated with the ith component with correlation coefficient $r_{YC_i}$ (the projection of Y on the ith component). Let $C_{ij}$ be the value taken by the ith component for the jth observation. The conditional mean of Y, given the components, for the jth observation is then

$\bar{y}_{C_j} = \bar{y} + s_y \displaystyle\sum_{i=1}^{n_c} C_{ij}\, r_{YC_i}$   (9.7)

and its conditional standard deviation is

$s_{y|C} = s_y \sqrt{1 - \displaystyle\sum_{i=1}^{n_c} r_{YC_i}^2}$   (9.8)

We can then use the series Y′ thus calculated in place of a reference variable X. In the environmental sciences it is not straightforward (if not impossible) to find a 'real' error-free time series to use as a starting dataset, because we are often not aware of the errors that a dataset contains. The underlying principle of any statistical anomaly technique is that 'an anomaly is an observation which is suspected of being partially or wholly irrelevant because it is not generated by the stochastic model assumed' (Anscombe and Guttman, 1960).
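In code, the comparison of a series Y with its virtual counterpart Y′ is a direct application of Equations 9.7 and 9.8. The sketch below assumes that the factor scores of the $n_c$ retained components are already available (for example, the first columns of the scores from the previous sketch) and that few enough components are retained for the square root in Equation 9.8 to remain real.

```python
import numpy as np

def virtual_reference(y, scores_kept, z=1.96):
    """Compare a series y with its virtual counterpart built from the
    retained components (Equations 9.7 and 9.8)."""
    y = np.asarray(y, dtype=float)
    # Correlation of y with each retained component (columns of scores_kept).
    r = np.array([np.corrcoef(y, c)[0, 1] for c in scores_kept.T])
    y_virtual = y.mean() + y.std() * scores_kept @ r       # Equation 9.7
    cond_std = y.std() * np.sqrt(1.0 - (r ** 2).sum())     # Equation 9.8
    flagged = np.where(np.abs(y - y_virtual) > z * cond_std)[0]
    return y_virtual, flagged
```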

For demonstration purposes, and to overcome the fact that we are never sure about the potential existence of anomalies in a dataset, we will work on a virtual dataset for which the statistical structure of the studied phenomenon is perfectly known, and then introduce errors into it. Assume that 25 rain gauges are randomly distributed inside a square area with a side length of 100 km and that, after performing a PCA on the whole dataset of annual rainfall, the first three components explain 90% of the variance: the first component depicts the temporal evolution, the second a north–south opposition and the third an east–west opposition. Under these hypotheses, we randomly generate 25 Gaussian time series of 30 years of annual rainfall (Table 9.2). The realistic outcome of the spatial distribution of the time series is shown in Figures 9.15(a) and 9.15(b) through the spatial projection of the factor loadings of the second and third components of the PCA.

The advantage of having generated these 25 Gaussian time series is that we can now add perfectly known errors. As shown in Table 9.3 (see colour insert), the dataset previously generated (Table 9.2) was contaminated with systematic and accidental errors of +150 mm (bold red cells) or −150 mm (bold blue cells), and with accidental errors of +10% or −10% (same colour code). We then analyse this dataset as if it were a real one in order to detect the anomalies that were introduced. As a first step, we perform a PCA and calculate the projections of the observations on the components (the factor scores).

Table 9.2: Original matrix.
Note: S: station; O: observation. The 30 × 25 matrix (observations O1–O30, stations S1–S25) is listed below in column groups; within each group, values run in observation order, O1 first.

S1–S5 (columns S1 to S5, left to right):
O1  611 670 867 499 504
O2  956 778 901 774 937
O3  955 982 1107 844 914
O4  896 733 915 855 942
O5  457 795 641 855 664
O6  892 976 1036 876 995
O7  622 587 736 663 664
O8  596 498 598 495 504
O9  784 692 697 662 642
O10 849 866 893 810 730
O11 576 512 414 576 619
O12 399 350 367 341 305
O13 1155 1116 1247 1009 1213
O14 821 721 921 753 797
O15 836 879 1043 776 764
O16 704 628 753 640 751
O17 748 729 936 647 766
O18 644 374 504 406 508
O19 717 980 842 881 826
O20 526 572 633 642 596
O21 742 649 725 637 707
O22 583 511 641 541 411
O23 747 632 727 635 757
O24 769 638 752 590 710
O25 581 789 623 719 659
O26 683 889 804 819 747
O27 881 878 787 705 909
O28 974 841 1072 829 976
O29 728 870 769 718 713
O30 1066 1141 1187 921 1025

S6 (O1–O30): 718 822 1080 929 785 1118 608 690 742 915 594 470 1182 951 994 581 867 453 955 674 848 605 723 916 763 1025 950 1151 838 1190
S7 (O1–O30): 686 892 948 968 701 949 725 540 660 720 529 436 1158 901 876 714 838 573 867 490 788 514 688 854 521 920 1009 1016 725 1133
S8 (O1–O30): 586 765 843 749 704 858 593 469 695 892 476 402 957 712 802 576 689 343 952 573 671 564 740 587 613 823 603 887 748 1059
S9 (O1–O30): 449 864 985 867 714 905 663 566 696 830 486 324 979 780 868 700 704 407 744 488 680 587 634 593 617 693 754 882 798 994
S10 (O1–O30): 402 679 939 822 696 957 609 598 748 971 567 294 952 688 727 576 749 401 871 488 633 432 552 531 671 750 560 754 775 980
S11 (O1–O30): 483 857 932 838 804 926 636 586 630 912 549 385 982 722 770 650 733 353 935 545 690 588 626 678 682 857 753 783 780 1089
S12 (O1–O30): 649 1029 1023 908 638 922 826 611 659 821 494 470 1135 883 971 765 777 675 859 414 708 547 778 784 621 778 885 1019 734 1106

S13–S22 (columns S13 to S22, left to right):
O1  829 612 615 720 825 725 648 531 520 723
O2  876 772 793 709 838 1000 972 774 939 710
O3  1166 962 942 984 1197 1193 1087 956 1072 1221
O4  855 866 894 853 798 877 868 828 813 885
O5  759 775 687 827 676 831 699 765 764 857
O6  1035 926 929 1020 1106 1122 1006 940 1054 1080
O7  727 572 659 583 638 574 658 589 596 633
O8  574 549 520 645 693 648 564 566 676 654
O9  648 715 635 669 588 668 683 640 680 608
O10 856 874 958 779 856 1027 884 842 825 987
O11 581 600 471 518 472 484 424 598 414 485
O12 442 411 245 384 345 363 389 402 398 402
O13 1225 940 1164 1009 1389 1336 1186 1002 1162 1288
O14 947 701 755 897 1014 958 886 623 889 941
O15 1043 728 878 915 986 1022 839 780 896 966
O16 727 661 649 555 644 769 686 588 667 598
O17 945 728 784 828 882 839 792 746 903 837
O18 504 455 439 433 583 470 533 387 419 527
O19 840 1027 920 993 902 1039 801 968 943 929
O20 627 636 538 626 624 685 663 495 542 788
O21 803 663 526 886 865 895 686 616 821 891
O22 625 572 557 575 512 593 500 588 614 475
O23 799 625 597 696 806 775 650 730 827 696
O24 909 702 608 732 983 839 739 671 906 633
O25 758 798 548 736 777 701 539 656 679 662
O26 858 840 777 826 920 885 771 855 1013 927
O27 926 667 809 705 1011 946 829 791 983 964
O28 1061 777 892 890 1141 1140 913 774 1031 1112
O29 799 773 929 933 824 943 811 805 924 950
O30 1194 1089 1012 1131 1162 1191 1152 1094 1127 1086

S23 (O1–O30): 767 921 1102 855 425 1228 718 601 704 824 506 346 1398 931 970 691 879 652 782 501 691 637 806 1013 537 773 960 1040 873 992
S24 (O1–O30): 793 755 1160 847 771 1101 637 747 606 908 387 412 1332 1058 1030 830 934 564 858 720 907 542 820 947 678 1035 1087 1224 912 1152
S25 (O1–O30): 676 810 1070 938 785 1022 520 705 635 903 509 271 1155 780 946 636 802 380 971 584 786 577 666 801 694 964 827 1058 980 1118

Figure 9.15: Spatial projection of factor loadings for: (a) C2; (b) C3.
The results, as shown in Figures 9.16(a), (b) and (c), indicate that the first three components barely change between the PCA performed on the error-free dataset (Table 9.2) and the PCA performed on the dataset containing errors (Table 9.3 – see colour insert): the first component remains very stable, and even the second and third components, which are more sensitive, remain similar. The factor scores remain close to the ones calculated before the introduction of errors. Similarly, the computed spatial variograms of the factorial components indicate that the first three components have a clear temporal and spatial structure (Figures 9.17(a) and 9.17(b)), whereas the components of higher order show a pure nugget effect, indicating an absence of information (Figures 9.17(c) and 9.17(d)). Even though we 'contaminated' the initial dataset by introducing errors, the first components, which carry the spatial structure, stayed quasi-unchanged; the errors were contained in the higher-order components. We can therefore state that comparing each time series with the linear combination of the first components of the PCA is equivalent to comparing it with a virtual error-free time series. In Table 9.4 (see colour insert) the detected errors are underlined, and the following may be noticed:

1. 22 point anomalies, among which six of the 12 imposed errors were found;
2. seven systematic anomalies, among which three of the five imposed errors were found.

The process must be iterative: once the errors have been identified and corrected, a new PCA allows the detection of further anomalies (the values taken by the components will not change significantly, but the correlation between variables will improve, allowing a more refined detection). In conclusion, the PCA components can be used as a reference time series against which the actual time series can be compared.

Figure 9.16: Influence of errors on the values taken by the factor scores for: (a) C1; (b) C2; (c) C3.

Figure 9.17: Variogram of the projection of the factor loadings for: (a) C2; (b) C3; (c) C4; (d) C5.

9.5 CASE STUDY

The methodology has been applied to annual rainfall time series collected at 27 stations/gauges from 1968 to 1997 in southeastern France (Figure 9.18 in the colour insert; Table 9.5). The objective was to evaluate the potential existence of anomalies within the dataset. A PCA was performed on the entire matrix, and the results (Table 9.6) showed that the first three components explain 90% of the variance: the first component explains the temporal behaviour of the rain within the studied area, while the second and third components explain the spatial behaviour.

Table 9.5: Annual rainfall data (mm). For each station the 30 values are listed in year order, 1968–1997.

Entrevaux: 1046.6 1034.5 1012 941.5 1397.3 932.6 854.3 1170.7 1295.2 1305 819.7 1238.8 700.1 890.1 724 876 1113.4 688 646.4 1136.9 808 771.5 917.2 556.7 1019.2 1056.1 1304.5 833.7 1449.2 1019.3
Sémaphore la Garoupe: 1060.6 768.1 482.1 859.7 1150.1 706.1 778.2 1075.4 1033.7 986.5 829.4 1099.4 633.3 619.1 571.5 639.1 984 665.6 686.8 977.9 626 491.9 647.8 733.6 709 810.8 892.3 827.5 1043.3 752.4
Golfe-Juan: 1217 895.1 557.4 944.5 1307 746.2 921.9 1223.8 1215.4 1117.5 836.1 1257 690.6 695.4 652.2 712.7 1012.3 746 834 1253.8 789.9 503.3 831.3 853.2 910.1 1054.8 999 942.2 1195 826.2
Mandelieu: 1162.8 892.5 673 956.2 1355.4 725.4 906 1092 1156.1 1031.3 789.3 1173.6 667.4 729.8 652.1 693.4 903.5 657.8 720.5 1003.7 622.8 542.9 703.4 740.7 677.4 1033.3 1003.8 797.2 1097.8 841.8
Bancairon: 991.9 825.9 868.2 943.8 1175.6 773.1 611 1038.7 1182.1 1358.6 681.5 1292.5 584.9 948.2 709.7 666.9 977 592.5 737.9 977.6 887.3 693.1 769 634.9 933.6 901.9 1030.5 845.2 1219.7 831.8
Le Lauron: 1260.5 883.1 889.4 1109.2 1331.8 807 999.7 1257 1240.2 1208.3 888.5 1472 673.7 878.2 689.9 782.5 1143.9 707.6 773.8 1141.3 720.3 601.3 803.4 816.9 1019.9 972.8 1174.5 925.2 1429.8 854.9
La Grave: 1012.3 709.8 766.3 950.6 1000.5 681.6 643.7 1040.2 1052.2 1116 731 1248.9 676.9 858 617.4 720.1 1028.2 668.6 668.5 1010.6 714.6 560.3 438.4 742.3 870.3 970.3 1111.8 798.7 1055.3 684.6


The map depicting the projection of the factor loadings on the second component (Figure 9.19(a) – see colour insert) shows a north–south opposition, explained by the increase in rainfall from the Mediterranean coast towards the Mercantour massif inland. The projection of the factor loadings on the third component (Figure 9.19(b) – see colour insert) indicates an east–west opposition, because rainfall decreases from east to west as we move away from the Alps. The existence of a spatial structure for the second and third components is corroborated by the spatial structure visible in the variograms (Figures 9.20(a) and 9.20(b)). After this first step, we used the first three components of the PCA as our regional vectors and performed Bois' test to check for the existence of anomalies in the time series.

Figure 9.20: Variogram of the projection of the factor loadings for: (a) C2; (b) C3 (lag distance in km).

Table 9.5: (continued)

Place Neuve: 1634.6 1345.1 1232.2 1467.7 2080.7 1121.8 1158.1 1800.5 1695.2 1747.2 1158.1 1794 936.7 1177.9 951 1070.9 1669.3 975 957.2 1557.5 1358.3 879.9 935.6 937.9 1289.4 1526.3 1682.2 1194.2 2182.9 1235.5
Le Clot: 1171.4 1094.3 1094.9 1061.3 1259.3 1015 868.5 1127.5 1514.9 1736.5 1007.5 1377.5 675.2 1114.1 954.2 991.3 1165.4 959.7 1033.8 1143.9 960.9 791.2 915.4 1061.6 1244.9 1174.5 1427.8 1203.5 1600 1137.8
Quartier de la gare: 1063.5 856 798.5 1066.7 1213.3 832.2 745.8 1224.9 1037.1 1286.6 859.7 1385 794.7 883.1 688.7 876.4 1083.7 690.1 720.5 1140.8 698 547 564 799.3 848 1078 1080.6 956.4 1214.4 755.5
Gendarmerie: 1088.1 951.6 1101.5 917.7 1161.2 815.4 730.2 955.3 1312.1 1347.5 811.6 1264.2 627.6 1012.2 836.4 787.5 1025.7 729 687.8 967.7 831.4 686.8 818.8 883.7 992.8 883.5 1293.6 974.7 1357.4 936.3
Hotel du Collet: 1177.8 833.5 1113.7 890.4 1123.5 791.7 697.3 1035.8 1070.1 1446.4 801.8 1322.4 624.9 880.7 853 757.3 1210.7 710.6 974.9 1112.1 922.1 744.2 755.2 876.3 988.6 980.6 1320.1 883.6 1637.6 1133.6
Le Serret: 1147.7 1023.1 1152.5 1390.9 1534.5 880.4 759.5 1222.6 1163.8 1357.3 982.1 1452.4 795.6 878.6 632.9 952.4 1164.9 729.6 714.7 1076.3 817.3 674 696 810 1014.7 1204.2 1299.3 964.6 1455 1003.8
Peira cava: 1206.1 1081.2 1138.8 1198.4 1404.2 879.5 760.6 1347.4 1377.7 1550.3 909.4 1671.6 807.3 942 812.4 901.9 1378.6 807.1 852.8 1394.5 903.4 772.9 831.1 923.6 1216.1 1233.6 1441.1 1149 1602.8 1127


By way of illustration, we present two scenarios. The first station checked is a weather station operated by professionals (Antibes La Garoupe); the second is a regular station operated by volunteers (Golfe Juan). Bois' test showed the absence of anomalies within the time series collected at Antibes La Garoupe (Figure 9.21), but it revealed a break of stationarity within the time series collected at Golfe Juan around 1984/1985 (Figure 9.22), suggesting that an anomaly occurred at the Golfe Juan station at that time. After reviewing the original field notes for the Golfe Juan station, we were able to conclude that the anomaly arose because the station was physically moved.

Figure 9.21: Bois' test for Antibes La Garoupe station (06004002), 1969–1997.

Figure 9.22: Bois' test for Golfe Juan station (06004004), 1969–1997.

Table 9.5: (continued)

Garavan: 957.9 827.8 594.8 769.9 956.5 631 649.8 911.4 969.8 1099.7 807.7 1118 572 658.3 639.9 731.2 988.9 596.8 615.7 995.8 603.3 569.5 446.3 793.6 867.5 975.3 818.4 840.5 1025.3 663.6
Aeroport: 1114.9 807.3 577.4 839.9 1089.3 711 761.2 1024.4 938.6 967.9 795.3 1203.9 538 551.4 533.2 617.2 807.8 551.3 596.8 934.3 552 424.9 644.5 871.9 980.2 913.3 858.4 717.4 1072.6 722.8
Chateauvieux: 1015.6 842.8 688.2 1088.3 1256.9 762.7 879.9 1047.8 1180.2 1121 889.6 1358.5 637.9 820.2 647.6 658.7 1023.4 730 662.3 1006 747.8 540.6 558.4 782.5 901.1 996.5 991.2 886.6 1170.1 751.6
Gendarmerie: 1098 868 1045.3 873.6 1094.2 925.2 748.9 1008.9 1289.1 1498.9 951.5 1372.6 637.4 1052.9 889.1 798 1124.9 743.8 979.2 1060 831.7 666.6 764.8 919.6 963 878 1155.3 849.8 1421 954.2
Semaphore: 953.8 758.2 523.5 867.6 810.5 502.4 693.7 836.2 913.3 967.4 714.9 1111.4 520.9 601.2 550.6 601.6 885.7 589 549.1 947.6 551.1 467.2 554.5 833 899.9 726.5 773.2 795.8 918.5 697
O.N.F.: 1259.3 925.5 1158.4 1134.5 1352.8 980 803.5 1291.4 1480.7 1684.8 966.3 1616.3 724.5 1216.9 1069.4 862.8 1362.9 737.4 990 1205 950.6 973.8 912.3 1018.5 1264.8 1295.5 1281.1 1063 1563 1273.6
Gendarmerie: 1101.7 929.3 1003.7 908.6 1207.1 862.2 643.4 1068.3 1434.3 1306.2 758.2 1331.7 635.9 994.6 760.6 645.4 1109.8 658.8 847.5 1180 864.4 828.2 1134.3 1075.8 920.7 991.7 1101.3 878.1 1361.7 867.9

Table 9.5: (continued)

Parra: 1302.2 1115.5 919.1 1466.7 1786.4 898.3 1130.1 1738 1475.3 1297.5 1160.7 1419.1 931.3 873 697 985 1216.3 742.2 890 1294.5 1143.2 772.2 917.4 1033.3 1117.3 1372 1701.7 1102.9 1649 873
Village: 1206.9 968.6 1085.2 1071.6 1425.4 940.2 782.8 1346.3 1359.7 1330.8 868 1327.9 704.4 875.1 718.1 835 1152.1 664.7 825.8 1399.8 964.5 667.1 684.4 708 1008.2 1141.7 1291.4 853.3 1552.1 936.8
Gendarmerie: 1088.4 1001.1 710.1 1075.2 1087.3 807.2 868.2 1008 1055 999.2 916.9 1343.3 713.1 771.7 696.8 749.4 1316.8 766.7 677 1459.9 732.7 571.8 613.4 749.1 1001.5 1191.8 1015.6 928.1 1162.5 872.8
BAN: 914.5 801.1 510.2 845.5 1145.2 731.7 883.6 1072.6 1351.2 1025.6 846.2 977.3 712.5 663.6 606.6 809.6 1099.1 650.6 777.5 961.6 645.1 473.3 669.5 701.1 639.7 922.3 972 758 1044.9 604.2
Place village: 1086.4 1051.1 1065.6 1115.6 1367.1 676.2 1008.9 1417.6 1446.4 1540.3 968.7 1259.8 926 941.7 680.2 859 1191.3 785.1 869.1 1192.2 1120.8 729.6 772.1 725.6 1086.7 1179.2 1200.1 880.8 1439.9 829.7
Dramont: 1060 863.5 544.3 930.8 1260.8 879.3 1038.7 1549 1478.2 1249.4 1134.2 1219.7 625.9 668.4 620.9 668.6 1157.6 631.4 765 907 561.8 341.1 661.6 528.7 659.6 769.8 924.3 840.9 882.3 832.8

Table 9.6: PCA results.

Regional vectors (factor scores), listed in year order 1968–1997:
C1: 0.82 0.18 0.51 0.30 1.46 0.72 0.66 1.02 1.37 1.56 0.38 1.78 1.29 0.49 1.16 0.84 0.77 1.20 0.88 0.82 0.76 1.56 1.09 0.65 0.03 0.40 0.89 0.21 1.73 0.36
C2: 0.73 0.51 2.32 0.95 1.06 0.01 1.97 1.56 0.03 1.76 1.17 0.11 1.04 1.30 0.71 0.37 0.29 0.48 0.16 0.76 0.53 0.89 0.34 0.04 0.60 0.63 0.84 0.27 1.47 0.86
C3: 0.97 0.01 0.98 0.08 1.89 0.41 0.60 1.49 1.27 0.12 0.16 1.89 0.56 0.08 0.50 0.22 0.66 0.42 0.14 1.34 1.12 0.26 1.01 2.17 1.61 0.36 1.12 1.05 0.49 0.31

Factor loadings, listed station by station in the order of Table 9.5:
C1: 0.90 0.92 0.91 0.91 0.92 0.97 0.94 0.95 0.86 0.94 0.87 0.84 0.91 0.96 0.91 0.89 0.95 0.86 0.87 0.88 0.83 0.87 0.95 0.84 0.84 0.89 0.77
C2: 0.15 0.33 0.32 0.28 0.28 0.08 0.01 0.01 0.38 0.11 0.40 0.44 0.03 0.16 0.10 0.22 0.17 0.39 0.14 0.36 0.30 0.25 0.07 0.25 0.34 0.00 0.38
C3: 0.27 0.02 0.05 0.11 0.08 0.00 0.11 0.18 0.05 0.08 0.04 0.04 0.09 0.08 0.29 0.20 0.07 0.06 0.41 0.09 0.01 0.26 0.15 0.29 0.20 0.25 0.20

Eigenvalues and explained variance:
C1: eigenvalue 21.65, explained variance 80.20%, cumulative 80.20%
C2: eigenvalue 1.80, explained variance 6.66%, cumulative 86.86%
C3: eigenvalue 0.79, explained variance 2.94%, cumulative 89.80%
C4: eigenvalue 0.62, explained variance 2.28%, cumulative 92.07%

9.6 CONCLUSION

Anomaly detection in hydrological, hydrogeological and environmental projects is an essential step that should be performed to ensure that reliable time series are being analysed and processed. The results of the analyses performed on time series are typically used to understand the studied phenomenon, and serve as a foundation for daily management and for making future predictions. The anomaly detection portion of an environmental project, when performed at all, typically consists of classic parametric or non-parametric tests applied without

taking into account the underlying spatial and temporal variability of the studied variable (water levels, rain, runoff, temperature and so on). In the water and environmental fields, the most commonly used test is regression analysis, where linear regressions are performed between stations that are not spatially distant from one another, with

the assumption that one of the stations can be used as a time series of reference and thus contains no errors. Unfortunately, this assumption can seldom be verified; if it is incorrect, the errors will propagate into all other parts of the project and may result in significant negative financial and economic impacts.

The proposed method, which uses regional vectors to detect anomalies in time series under the validated assumption that the regional vectors are error free, is a powerful tool. The outcome of the test is the probability that a given observation in the time series is an anomaly, thereby allowing


verification of the accuracy of the flagged observation by going back to the original field documents. This method offers substantial benefits over conventional methods and should therefore be part of every water or environmental project.

REFERENCES

Achour, F. (1997). Climatic Conditions and Water Availability in Semi-Arid Areas: Application of New Methodologies to the Chelif Basin, Algeria. PhD Dissertation, Université de Franche-Comté, France, 261 pp.
Achour, F., Laborde, J.P. and Assaba, M. (2006). Climatic change in North Africa during the last century and its impact on water availability. In: Managing Drought and Water Scarcity in Vulnerable Environments: Creating a Roadmap for Change in the United States, Longmont, Colorado, 18–20 September 2006.
Aggarwal, C. (2005). On abnormality detection in spuriously populated data streams. Proceedings of the 5th SIAM Data Mining Conference, Newport Beach, California, pp. 80–91.
Aggarwal, C.C. and Yu, P.S. (2001). Outlier detection for high dimensional data. Proceedings of the ACM SIGMOD International Conference on Management of Data, Santa Barbara, California. ACM Press, pp. 37–46.
Aggarwal, C.C. and Yu, P.S. (2008). Outlier detection with uncertain data. Proceedings of SIAM Data Mining (SDM), Atlanta, Georgia, 24–26 April 2008, pp. 483–493.
Agyemang, M., Barker, K. and Alhajj, R. (2008). A comprehensive survey of numeric and symbolic outlier mining techniques. Intelligent Data Analysis. 10, 6: 521–538.
Anscombe, F.J. and Guttman, I. (1960). Rejection of outliers. Technometrics. 2, 2: 123–147.
Assaba, M., Laborde, J.P. and Achour, F. (2006). Global and distributed modeling of runoff in northern Algeria. Proceedings of the 7th International Conference on Hydroinformatics, Nice, pp. 1551–1558.
Bakar, Z., Mohemad, R., Ahmad, A. and Deris, M. (2006). A comparative study for outlier detection techniques in data mining. Proceedings of the IEEE Conference on Cybernetics and Intelligent Systems. IEEE, pp. 1–6.
Barnett, V. and Lewis, T. (1994). Outliers in Statistical Data. John Wiley, Chichester.
Basu, S. and Meckesheimer, M. (2007). Automatic outlier detection for time series: an application to sensor data. Knowledge and Information Systems. 11, 2: 137–154.
Beckman, R.J. and Cook, R.D. (1983). Outliers. Technometrics. 25, 2: 119–149.
Bois, Ph. (1971). Une Méthode de Contrôle de Séries Chronologiques Utilisées en Climatologie et en Hydrologie. Laboratoire de Mécanique des Fluides, Université de Grenoble, Section hydrologie, 49 pp.
Bois, Ph. (1986). Contrôle des séries chronologiques corrélées par étude du cumul des résidus. Deuxièmes Journées Hydrologiques de l'Orstom, Montpellier, pp. 89–100.
Bouvier, C., Cisneros, L., Dominguez, R., Laborde, J.P. and Lebel, T. (2003). Generating rainfall fields using principal components (PC) decomposition of the covariance matrix: a case study in Mexico City. Journal of Hydrology. 278, 1–4: 107–120.
Buishand, T.A. (1984). Tests for detecting a shift in the mean of hydrological time series. Journal of Hydrology. 73: 51–69.
Chandola, V., Banerjee, A. and Kumar, V. (2007). Anomaly Detection: A Survey. Technical Report, University of Minnesota, August 2007.
Edgeworth, F.Y. (1887). On discordant observations. Philosophical Magazine. 23, 5: 364–375.
Fox, A.J. (1972). Outliers in time series. Journal of the Royal Statistical Society, Series B (Methodological). 34, 3: 350–363.
Jagadish, H.V., Koudas, N. and Muthukrishnan, S. (1999). Mining deviants in a time series database. Proceedings of the 25th International Conference on Very Large Data Bases. Morgan Kaufmann, pp. 102–113.
Jolliffe, I.T. (2002). Principal Component Analysis, 2nd edn. Springer.
Keogh, E., Lonardi, S. and Chiu, B.Y. (2002). Finding surprising patterns in a time series database in linear time and space. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press, New York, pp. 550–556.
Laborde, J.P. (2002). Méthodes de détection des anomalies et de comblement des lacunes dans les séries de données. Actes des Journées de Climatologie du Comité National Français de Géographie, Strasbourg, pp. 47–66.
Laborde, J.P. (2010). Geographical information and climatology for hydrology. In: GIS and Climatology. Collection Hermès, Lavoisier, pp. 195–232.
Laurikkala, J., Juhola, M. and Kentala, E. (2000). Informal identification of outliers in medical data. Fifth International Workshop on Intelligent Data Analysis in Medicine and Pharmacology, pp. 20–24.
Legendre, P. and Legendre, L. (1998). Numerical Ecology, 2nd English edn. Elsevier Science BV, Amsterdam, 853 pp.
Searcy, J.K. and Hardison, C.H. (1960). Double-mass Curves. US Geological Survey Water-Supply Paper 1541-B, pp. 31–66.
Wigbout, M. (1973). Limitations in the use of double-mass curves. Journal of Hydrology. 12: 132–138.

CHAPTER 10

Statistical Methods and Pitfalls in Environmental Data Analysis

Yue Rong

10.1 INTRODUCTION

A British politician once said: 'There are three kinds of lies: lies, damned lies, and statistics.' Statistics can indeed be very misleading if we do not watch for the pitfalls surrounding the statistical methods applied. Statistical methods are widely applied in the scientific arena, including environmental data analysis; however, the interpretation of statistical results may be misleading if the data are not analysed together with their site-specific physical meaning. Owing to the great uncertainties associated with the release timing, frequency and quantities of environmental pollutants, and with sampling locations in heterogeneous environmental media, environmental data usually span a wide range. This chapter addresses potential statistical pitfalls in analysing such wide-ranging environmental data.

Data analysis has always been the starting point for any argument in environmental forensics, and the pitfalls presented in this chapter can rob such arguments of their foundation. The case studies in this chapter serve as a warning to those who do not carefully evaluate statistical results before drawing environmental forensics conclusions or making regulatory decisions.

This chapter reviews five common statistical methods of interest to those working in environmental forensics and applies each of them in a case study demonstrating possible pitfalls and flaws. The data used in this chapter come from real cases of soil and groundwater contamination under regulatory oversight in Los Angeles, California, USA. The five statistical methods are:

1. percentile and confidence interval;
2. correlation coefficient;
3. regression analysis;
4. analysis of variance (ANOVA);
5. data trend analysis.

Much literature has been devoted to the discussion of statistical methods and their applications to various environmental problems; for example, Berthouex and Brown (1994); Davis (1986); Happel et al. (1998); Rong (1998; 1999a; 1999b); and US EPA (1989). These publications usually concentrate on how common statistical methods can be used to help solve environmental problems. Few of them, however, have focused on the pitfalls or flaws of the common statistical methods when applied to environmental data analysis. This chapter attempts to fill that gap by discussing the pitfalls with demonstrations using real case study data. Each of the five statistical methods is discussed in mathematical detail and in a type of case study where the method can be applied. Finally, all potential pitfalls are summarised with respect to each statistical method. Reminders and lessons drawn from the pitfalls are given at the end of the chapter.

10.2 ESTIMATION OF PERCENTILE AND CONFIDENCE INTERVAL

Both the percentile and the confidence interval are characteristics associated with the data distribution. As indicated earlier, environmental data often vary over a wide range owing to uncertainties associated with data collection, which spreads out the data distribution. Two commonly encountered data distributions are the normal and log-normal distributions. The density function of a normal distribution shows a symmetric 'bell' shape, while that of a log-normal distribution shows a 'long tail' skewed to the right.

10.2.1 Calculation of percentiles

To understand the data distribution, a percentile value usually needs to be calculated. Given the distribution mean and standard deviation, percentiles of a normal distribution can be calculated by using a normal distribution probability table as (Winkler and Hays, 1975)

    x_p = μ + Z_p σ    (10.1)

where x_p is the pth percentile value of variable x; μ and σ are the mean and standard deviation of the variable x distribution, respectively; and Z_p is the normal curve table value for the pth percent of probability under the probability density curve. For example, Z_0.95 = 1.645 is the 95th percentile value from the normal distribution table. For a standard normal distribution (μ = 0 and σ = 1), Equation 10.1 gives the 95th percentile value x_0.95 = 0 + 1.645 × 1 = 1.645.

Unlike the normal distribution, the log-normal distribution does not have a table of percentile values; we usually 'borrow' the normal probability distribution table for log-normal distribution calculations. If x is normally distributed with mean μ and standard deviation σ, that is x ~ Normal(μ, σ), then y = e^x is defined to be log-normally distributed. Since x is normally distributed, the logarithmic value of y, log(y) = x, is normally distributed, that is log(y) ~ Normal(μ, σ). If y is known to be log-normally distributed, then the natural logarithm of y, log_e(y), is normally distributed. For the pth percentile, Equation 10.1 becomes

    log_e(y_p) = μ + Z_p σ    (10.2)

Thus, the pth percentile of the log-normally distributed y is

    y_p = exp(μ_y + Z_p σ_y)    (10.3)

where y_p is the pth percentile value of the log-normally distributed variable y; and μ_y and σ_y are the logarithmic mean and logarithmic standard deviation of y, respectively. For example, soil concentration data from a log-normal distribution are transformed into logarithmic values that produce a mean of 2.724 and a standard deviation of 1.995. Thus, μ = 2.724, σ = 1.995 and Z_0.95 = 1.645, and Equation 10.3 yields y_0.95 = exp(μ + Z_0.95 σ) = exp(2.724 + 1.645 × 1.995) = 406. Therefore, the 95th percentile of this soil concentration distribution is equal to 406. By using Equation 10.3 backwards, we can also determine the percentage under the probability curve for a known y_p. For example, for the same case above, given y_p = 60 = exp(2.724 + Z_p × 1.995), we have Z_p = (ln 60 − 2.724)/1.995 = 0.69. From the normal probability distribution table, N(Z = 0.69) = 0.75. Therefore, a soil concentration of 60 lies at the 75th percentile of the log-normal distribution.
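These two calculations are easy to script. The following is a minimal sketch in Python, using scipy.stats.norm in place of the printed normal probability table; the function names are illustrative, not part of any standard library.

```python
import numpy as np
from scipy import stats

def lognormal_percentile(mu_log, sigma_log, p):
    """pth percentile of a log-normal variable (Equation 10.3)."""
    return np.exp(mu_log + stats.norm.ppf(p) * sigma_log)

def lognormal_percentile_rank(value, mu_log, sigma_log):
    """Equation 10.3 used backwards: percentile rank of a known value."""
    z = (np.log(value) - mu_log) / sigma_log
    return stats.norm.cdf(z)

print(lognormal_percentile(2.724, 1.995, 0.95))     # ~406, as in the text
print(lognormal_percentile_rank(60, 2.724, 1.995))  # ~0.75, the 75th percentile
```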

10.2.2 Calculation of confidence interval

To evaluate environmental sample data relative to a singular value such as a regulatory standard, a confidence interval for the sample mean is often calculated and compared with the standard (US EPA, 1989). Given the mean and standard deviation of a normal distribution, the confidence interval is calculated as (Walpole, 1974)

    X̄_(1−α) = X̄ ± Z_p (σ/n^(1/2))    (10.4)

where X̄_(1−α) is the confidence interval for the mean of variable x at level 1 − α; X̄ is the sample mean; n is the number of samples; and p = 1 − α/2 for a two-sided interval. For example, assuming a normal distribution with X̄ = 327, σ = 362, n = 16 and Z_0.975 = 1.96, we obtain X̄_(0.95) = [X̄ − Z_0.975(σ/n^(1/2)), X̄ + Z_0.975(σ/n^(1/2))] = [150, 505]. If the data follow a log-normal distribution, Equation 10.4 becomes

    Ȳ_(1−α) = exp[Ȳ ± Z_p (σ_y/n^(1/2))]    (10.5)

where Ȳ_(1−α) is the confidence interval for the mean of the log-normally distributed variable y at level 1 − α; Ȳ is the sample logarithmic mean; and σ_Y is the sample logarithmic standard deviation. For the same example as above, assuming a log-normal distribution with Ȳ = 5.08, σ_Y = 1.465, n = 16 and Z_0.975 = 1.96, we obtain Ȳ_(0.95) = exp[Ȳ − Z_0.975(σ_Y/n^(1/2)), Ȳ + Z_0.975(σ_Y/n^(1/2))] = [78, 329].

A potential pitfall is to assume a normal distribution and then use Equations 10.1 and 10.4 to calculate the percentile and confidence interval without knowledge of the data distribution. This may lead to a misinterpretation of the data, because environmental data so often show a log-normal distribution.
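A short Python sketch of Equations 10.4 and 10.5, applied to the benzene data of Table 10.1 below; the helper name mean_ci is illustrative, and the sample standard deviation is used for σ, matching the table.

```python
import numpy as np
from scipy import stats

def mean_ci(data, alpha=0.05, lognormal=False):
    """Two-sided (1 - alpha) confidence interval for the mean,
    per Equation 10.4 (normal) or Equation 10.5 (log-normal)."""
    x = np.log(data) if lognormal else np.asarray(data, dtype=float)
    z = stats.norm.ppf(1 - alpha / 2)
    half = z * x.std(ddof=1) / np.sqrt(len(x))
    lo, hi = x.mean() - half, x.mean() + half
    return (np.exp(lo), np.exp(hi)) if lognormal else (lo, hi)

benzene = np.array([810, 1200, 200, 980, 340, 320, 3.8, 480,
                    180, 38, 55, 110, 35, 230, 130, 120])
print(mean_ci(benzene))                  # roughly [150, 505]
print(mean_ci(benzene, lognormal=True))  # roughly [78, 329]
```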

10.2.3 Case study for percentile and confidence interval

The percentile and confidence interval can be used to evaluate pollutant concentrations relative to a numerical standard, for example a regulatory clean-up standard. To do so, we first need to know the data distribution, which determines the percentile and confidence interval. Using real field data, this case study compares the percentiles and confidence intervals obtained from a normal and a log-normal distribution and demonstrates the difference. Table 10.1 presents a set of groundwater monitoring data for benzene concentrations over a period of 16 quarters from 1996 to 1999. Since the data of 16 samples show a log-normal distribution, we should use Equations 10.3 and 10.5 to calculate the percentile and confidence interval.

Table 10.1: Case study data for percentile and confidence interval.

Sampling date                     Benzene (µg/L)   Log(benzene)
March 1996                             810             6.70
June 1996                             1200             7.09
September 1996                         200             5.30
December 1996                          980             6.89
February 1997                          340             5.83
May 1997                               320             5.77
July 1997                                3.8           1.34
October 1997                           480             6.17
March 1998                             180             5.19
June 1998                               38             3.64
August 1998                             55             4.01
December 1998                          110             4.70
March 1999                              35             3.56
May 1999                               230             5.44
August 1999                            130             4.87
December 1999                          120             4.79
Mean                                   327             5.08
Standard deviation                     362             1.465
Distribution type                      Normal          Log-normal
50th percentile (median)               327             161
95th percentile                        922             1788
Mean confidence interval at 95%        [150, 505]      [78, 329]

Table 10.1 also shows the difference between the normal and log-normal distributions in terms of the 95th percentile and the confidence interval for the mean at the 95% level. Since the log-normal distribution has a 'long tail' to the right, it is conceivable that its 95th percentile is greater than that of the normal distribution. However, the median of the normal distribution is more than twice the median of the log-normal distribution. The differences in the medians resulting from the normal and log-normal distributions produce very different mean confidence intervals. Because of the shape of the log-normal distribution, the confidence interval for the mean of this distribution is much narrower than that of the normal distribution. Therefore, it is important to determine the data distribution prior to computing the percentile and confidence interval. As shown here, we may reach a very different conclusion depending on which distribution is used. It is not prudent simply to assume a normal distribution.
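The chapter does not spell out how the log-normal shape of these data was established. One common screening approach, offered here as a hedged sketch rather than the author's method, is to run a normality test such as Shapiro-Wilk on both the raw and the log-transformed data and compare the results.

```python
import numpy as np
from scipy import stats

benzene = np.array([810, 1200, 200, 980, 340, 320, 3.8, 480,
                    180, 38, 55, 110, 35, 230, 130, 120])

# A small p-value argues against normality; a comfortable p-value on the
# log scale is consistent with a log-normal distribution.
for label, sample in [("raw", benzene), ("log", np.log(benzene))]:
    w, p = stats.shapiro(sample)
    print(f"{label}: W = {w:.3f}, p = {p:.3f}")
```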

10.3 CORRELATION COEFFICIENT

The correlation coefficient is a way to quantify the linear relationship between two random variables. It is calculated as (Berthouex and Brown, 1994)

    r_{x1,x2} = (1/n) Σ_{i=1}^{n} (x_{1i} − X̄_1)(x_{2i} − X̄_2)/(σ_{x1} σ_{x2})    (10.6)

where r_{x1,x2} is the correlation coefficient between variables x_1 and x_2; X̄_1 and X̄_2 are the means of x_1 and x_2, respectively; σ_{x1} and σ_{x2} are the standard deviations of x_1 and x_2, respectively; and n is the number of data points. The correlation coefficient ranges from −1 to +1, which implies a negative and a positive correlation, respectively. In general, r_{x1,x2} = 0 means no correlation at all, and an increasing coefficient in absolute value means an increasing degree of correlation. Equation 10.6 indicates that the correlation coefficient is essentially an index summing, over all data points, the paired comparisons of each variable with its own mean. If x_1 is greater than its mean (X̄_1) and x_2 is less than its mean (X̄_2), the corresponding term in Equation 10.6 is negative; if x_1 is greater than its mean and so is x_2, the corresponding term is positive. Equation 10.6 also shows that the correlation coefficient has an inverse relationship with the standard deviations. A potential pitfall is to use inconsistent data spanning a wide range of values to calculate a correlation coefficient, which can be skewed by a single pair of large data values.
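Equation 10.6 translates directly into a few lines of Python; this is a minimal sketch, using the population standard deviation (NumPy's default) to match the 1/n factor in the equation.

```python
import numpy as np

def correlation_coefficient(x1, x2):
    """Pearson correlation coefficient per Equation 10.6."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # (1/n) * sum of cross-deviations, divided by the product of std devs
    return np.mean((x1 - x1.mean()) * (x2 - x2.mean())) / (x1.std() * x2.std())
```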

10.3.1 Case study for correlation coefficient

The correlation coefficient can be used to study the relationship between soil and groundwater concentrations. Table 10.2 presents methyl tertiary butyl ether (MTBE) soil versus groundwater concentrations at 29 sites. The table contains two sets of data under columns A and B, and correlation coefficients were calculated between the soil and groundwater data for columns A and B, respectively. These two sets of data are identical except for the last pair of data points (case no. 29). The values of this last pair are not only the maximum among all the data, but also far greater in magnitude than the other pairs. If one pair of data points is far greater than the others in value, only this pair effectively counts and the other data become trivial in Equation 10.6. This phenomenon is clearly demonstrated in Table 10.2: the correlation coefficient is 0.7 with the last pair of data and 0.03 without it. Apparently, we could draw completely different conclusions from these two correlation coefficients, of which 0.7 indicates a moderately good correlation and 0.03 shows no correlation at all.

Table 10.2: MTBE case study data for correlation coefficient.

            Column A (MTBE)                       Column B (MTBE)
Case no.    Soil            Groundwater           Soil            Groundwater
            concentration   concentration         concentration   concentration
            (mg/kg)         (µg/L)                (mg/kg)         (µg/L)
1              15                43                  15                43
2              74.1          82 000                  74.1          82 000
3              15            18 700                  15            18 700
4               6.4          59 000                   6.4          59 000
5             120              9906                 120              9906
6              58               470                  58               470
7               0.53            130                   0.53            130
8             710              3100                 710              3100
9               0.87           2200                   0.87           2200
10            200            22 000                 200            22 000
11              7.3            1200                   7.3            1200
12              3.1             690                   3.1             690
13              3.5          12 000                   3.5          12 000
14             41           110 000                  41           110 000
15             22.6          14 000                  22.6          14 000
16            210            50 000                 210            50 000
17              0.24           2000                   0.24           2000
18             32                38                  32                38
19            315                45.2               315                45.2
20             57            24 000                  57            24 000
21            170                 5                 170                 5
22            160               290                 160               290
23              2.3            4000                   2.3            4000
24              1.5             410                   1.5             410
25             11.2          24 000                  11.2          24 000
26             90.3         180 000                  90.3         180 000
27              2.7            5800                   2.7            5800
28             10                61                  10                61
29            860         2 100 000                  —                 —
Correlation
coefficient                   0.700                                  0.03
Mean          110.3          94 003.0                83.6          22 360.3
Std dev.      204.2         387 917.5               147.2          41 141.0
Max.          860         2 100 000                 710           180 000

MTBE: methyl tertiary butyl ether; Std dev.: standard deviation.

Environmental data are usually widespread in magnitude, probably because pollutant concentrations are tested not only in the source areas but also in the background, and because of lack of knowledge of pollutant release timing, frequency and quantities. Therefore, field data should be screened for magnitude prior to computing the correlation coefficient; the standard deviation usually indicates the degree of data spread. In this case study, it is more realistic to use the correlation coefficient of 0.03 than 0.7 for any environmental decision, since the coefficient of 0.7 results from just one pair of maximum data points.
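A quick numerical check of this pitfall with NumPy, using the Table 10.2 values (np.corrcoef gives the same Pearson coefficient as Equation 10.6):

```python
import numpy as np

soil = np.array([15, 74.1, 15, 6.4, 120, 58, 0.53, 710, 0.87, 200,
                 7.3, 3.1, 3.5, 41, 22.6, 210, 0.24, 32, 315, 57,
                 170, 160, 2.3, 1.5, 11.2, 90.3, 2.7, 10, 860])
gw = np.array([43, 82000, 18700, 59000, 9906, 470, 130, 3100, 2200, 22000,
               1200, 690, 12000, 110000, 14000, 50000, 2000, 38, 45.2, 24000,
               5, 290, 4000, 410, 24000, 180000, 5800, 61, 2100000])

r_with = np.corrcoef(soil, gw)[0, 1]               # column A: ~0.70
r_without = np.corrcoef(soil[:-1], gw[:-1])[0, 1]  # column B: ~0.03
print(round(r_with, 2), round(r_without, 2))
```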

10.4 REGRESSION

Regression is a method of studying the relationship among variables, and is most often used to study two variables where one depends on the other. This relationship is approximated by a function fitted to field data, and the 'fitted function' is then used to predict one variable given the other. For example, the following equation is a simple linear regression model (Walpole, 1974)

    y = a + b·x + e    (10.7)

where y is the dependent variable; x is the independent variable; a is the linear intercept; b is the linear slope; and e is the error term of measurement. The coefficients a and b are usually determined by the 'least squares' method, which minimises the sum of squared deviations of all data points from the fitted line, making the error term as small as possible. Given a series of data x_i and y_i (i = 1, ..., n), the algebraic solution of the least squares method is

    a = Ȳ − b·X̄    (10.8)

    b = [Σ_{i=1}^{n} (x_i − X̄)(y_i − Ȳ)] / [Σ_{i=1}^{n} (x_i − X̄)²]    (10.9)

where X̄ is the mean of x and Ȳ is the mean of y. The 'goodness' of fit of a regression model can be measured by the index R²

    R² = 1 − [Σ_{i=1}^{n} (y_i − Y′_i)²] / [Σ_{i=1}^{n} (y_i − Ȳ)²]    (10.10)

where R² is the coefficient of determination (also called 'R squared'), and Y′ is the y predicted from Equation 10.7. R² ranges between 0 and 1; R² near unity implies an excellent fit of the regression line. The larger R² is, the better the fitted line, and vice versa. More complicated regression models, such as multiple linear regression or non-linear regression, are also available. The solutions for fitting the coefficients in those models are now usually built into spreadsheet computer programs, and R² is still used as a common indicator of goodness of fit. Algebraically, the regression method can be used to predict changes in one variable from the known other variable(s). A potential pitfall is to try to establish a regression relationship between two uncertain variables, which gives us an even more uncertain prediction as a result.

10.4.1 Case study for regression analysis

The regression method can be used to predict the concentration Y at a downgradient well from the concentration X at the source well, given field observations of a pollutant of interest over time. This case study has groundwater monitoring data for MTBE from 1996 to 1999 at the source well (MW1) and the downgradient well (MW2) (Table 10.3). The data show that the MTBE concentration at the source well MW1 reached its peak (59 000 µg/L) in February 1997 and gradually decreased after that. On the other hand, the MTBE concentration at the downgradient well MW2 gradually increased and reached its peak (20 000 µg/L) around May 1999. The spatial and temporal variation of MTBE concentrations seems reasonable: apparently, the MTBE plume originated at the source as detected at the source well, and the centre of mass of the plume has migrated downgradient, as indicated at both the source well and the downgradient well. The purpose of the regression model is to find a relationship between the MTBE concentrations at these two wells and then to use the concentration at the source well to predict the concentration at the downgradient well.

Table 10.3: MTBE groundwater concentrations over time for regression analysis.

                       MTBE (µg/L)
Sampling date       MW1          MW2
June 1996          21 000        2700
September 1996     37 000        5100
December 1996      16 000        4200
February 1997      59 000        5200
May 1997           56 000        4300
July 1997          48 000        5600
October 1997       45 000        7500
March 1998         26 000        8200
June 1998          35 000        6300
August 1998        35 000       12 000
December 1998      16 000       13 000
March 1999         13 000       15 000
May 1999           13 000       20 000
August 1999         7600        19 000
December 1999       3000        20 000
Mean               28 706.7      9873.3
Std dev.           17 803.8      6139.9
Max.               59 000       20 000
Min.                3000         2700

MTBE: methyl tertiary butyl ether; Std dev.: standard deviation; MW1: monitoring well no. 1; MW2: monitoring well no. 2.

Using the built-in regression function in the Excel spreadsheet program, based on the data in Table 10.3, we generate two regression lines: a linear model (R² = 0.48) and a logarithmic model (R² = 0.6), as shown in Figures 10.1 and 10.2, respectively. The two models are expressed as y = −0.24x + 16 755 (linear) and y = −5763 log_e(x) + 67 590 (logarithmic), respectively.

[Figure 10.1: Linear regression model for case study. MTBE (µg/L) at downgradient well MW2 plotted against MTBE (µg/L) at upgradient well MW1, with fitted line y = −0.2397x + 16 755, R² = 0.4832.]

[Figure 10.2: Logarithmic regression model for case study. MTBE (µg/L) at downgradient well MW2 plotted against MTBE (µg/L) at upgradient well MW1, with fitted curve y = −5763.1 ln(x) + 67 590, R² = 0.601.]

Now, the most recent monitoring data from February 2000 show 1400 µg/L MTBE at the source well MW1 and 13 000 µg/L MTBE at the downgradient well MW2, which are consistent with the trend of the historical data. However, using the regression models with x = 1400, the linear model yields y = −0.24 × 1400 + 16 755 = 16 419, and the logarithmic model yields y = −5763 log_e(1400) + 67 590 = 25 841. Comparing with the real field datum y = 13 000, the percent difference is (16 419 − 13 000)/13 000 = 0.263 for the linear regression model and (25 841 − 13 000)/13 000 = 0.988 for the logarithmic regression model. Although the predictions of the regression models stay within an order of magnitude of the field observations, the accuracy of the prediction can be off by 26% to 99%. We should note, however, that y = 13 000 is a single random observation; a series of observations might pin this value down more accurately. Surprisingly, although the logarithmic model fits better (R² = 0.6) than the linear model (R² = 0.48) in terms of the R² values, the logarithmic regression model produces a prediction far further from the field observation than the linear regression model does. Therefore, the fitting criterion R² may be used to evaluate the fit to the existing data set, but it is not really reliable as a measure of the predictive power of the regression models. If a regression model produces a prediction about 100% off from a field observation, as in this case, the model may not be meaningful to its users. We recognise that uncertainties are associated with the field data in time and location. These uncertainties can produce a wide-ranging data set, which in turn makes the regression models difficult to fit; that might be the reason for the regression models' inaccurate predictions. This case study reminds us that a wide-range data set calls for very careful application of the regression method. Otherwise, we simply carry the uncertainties over to the predictions, which may or may not be helpful to decision making based on environmental data.
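The fits and the February 2000 prediction check can be reproduced without a spreadsheet. The sketch below fits both models by least squares with np.polyfit and scores them with Equation 10.10; fit_and_r2 is an illustrative helper name.

```python
import numpy as np

mw1 = np.array([21000, 37000, 16000, 59000, 56000, 48000, 45000, 26000,
                35000, 35000, 16000, 13000, 13000, 7600, 3000])
mw2 = np.array([2700, 5100, 4200, 5200, 4300, 5600, 7500, 8200,
                6300, 12000, 13000, 15000, 20000, 19000, 20000])

def fit_and_r2(x, y):
    """Least-squares line (Equations 10.8-10.9) and R-squared (Equation 10.10)."""
    b, a = np.polyfit(x, y, 1)
    residual = y - (a + b * x)
    r2 = 1 - np.sum(residual**2) / np.sum((y - y.mean())**2)
    return a, b, r2

a_lin, b_lin, r2_lin = fit_and_r2(mw1, mw2)          # ~16 755, ~-0.24, R2 ~0.48
a_log, b_log, r2_log = fit_and_r2(np.log(mw1), mw2)  # ~67 590, ~-5763, R2 ~0.60

# Predict MW2 for the February 2000 source-well value of 1400 ug/L
# and compare with the observed 13 000 ug/L.
for a, b, x in [(a_lin, b_lin, 1400.0), (a_log, b_log, np.log(1400.0))]:
    y_hat = a + b * x
    print(round(y_hat), round((y_hat - 13000) / 13000, 3))
```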

10.5 ANALYSIS OF VARIANCE

Analysis of variance (ANOVA) is a method for testing the equality of the statistical means from two or more factors that contain different sources of variation (Davis, 1986). The calculation procedure is presented in Table 10.4, where

    SSA = Σ_{j=1}^{m} [(Σ_{i=1}^{k} X_ij)²/k] − (Σ_{j=1}^{m} Σ_{i=1}^{k} X_ij)²/n    (10.11)

    SSW = Σ_{j=1}^{m} Σ_{i=1}^{k} X_ij² − Σ_{j=1}^{m} [(Σ_{i=1}^{k} X_ij)²/k]    (10.12)

Table 10.4: Analysis of variance calculation table.

Source of variation   Sum of squares   Degrees of freedom   Mean squares   F-test
Among samples         SSA              m − 1                MSA            F = MSA/MSW
Within samples        SSW              n − m                MSW
Total variation       SST              n − 1

where SSA is the sum of squares among samples; SSW is the sum of squares within samples; SST = SSA + SSW is the total sum of squares; m is the number of factors (e.g. m = 2 in the case of groundwater concentrations at upgradient versus downgradient monitoring wells); n is the total number of sample points; k = n/m is the average number of sample points within factors; MSA = SSA/(m − 1) is the mean squares among samples; MSW = SSW/(n − m) is the mean squares within samples; and the F-test ratio is equal to MSA/MSW. The hypothesis of the F-test is that all means among factors (j = 1, ..., m) are equal. If the F-test ratio is greater than the F_α table value at the threshold α = 0.05, the hypothesis is rejected; otherwise, the hypothesis is accepted. In statistical hypothesis testing, the rejection of a hypothesis means concluding that it is false, but the acceptance of a hypothesis merely means there is insufficient evidence to reject it; it does not prove the hypothesis true (Walpole, 1974). Therefore, an ANOVA rejection of the hypothesis is a stronger argument than an acceptance of it; in other words, a rejection supports a stronger conclusion. This point is demonstrated in the following case studies. A potential pitfall is to draw a stronger conclusion than is warranted from the acceptance of the hypothesis.
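A minimal one-way ANOVA sketch in Python following the Table 10.4 layout; the helper anova_table is illustrative, and the example data are hypothetical. Each group total is squared and divided by its own size, which reduces to the chapter's k when the groups are equal.

```python
import numpy as np

def anova_table(groups):
    """One-way ANOVA sums of squares and F ratio (Equations 10.11-10.12)."""
    data = np.concatenate(groups)
    m, n = len(groups), data.size
    grand = data.sum()
    ssa = sum(g.sum()**2 / len(g) for g in groups) - grand**2 / n
    ssw = np.sum(data**2) - sum(g.sum()**2 / len(g) for g in groups)
    msa, msw = ssa / (m - 1), ssw / (n - m)
    return ssa, ssw, msa / msw

# Hypothetical example: three groups of four measurements each
groups = [np.array([1.0, 2.0, 2.5, 1.5]),
          np.array([3.0, 2.8, 3.5, 3.2]),
          np.array([1.2, 1.8, 2.2, 1.6])]
print(anova_table(groups))
```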

10.5.1 Case study I for ANOVA

ANOVA can be used to compare the mean concentrations of MTBE detected in sand/gravel type soils with those in fine-grained soils. Table 10.5 presents the MTBE soil concentration and soil type for each soil sample from a total of 29 sites. Soil type is presented by Unified Soil Classification System (USCS) symbols. This case study uses ANOVA to evaluate the difference between two groups of MTBE soil concentrations, divided by soil type: group one contains MTBE soil concentrations detected in predominantly clay/silt type soil, and group two those detected in predominantly sand/gravel type soil. The ANOVA calculation procedure of Table 10.4 produces an F-test ratio of 0.073. The F-table value at α = 0.05 with degrees of freedom (1, 27) is 4.21, which is greater than the F-test ratio calculated from the MTBE data. Therefore, the hypothesis of equal mean MTBE concentrations between the two groups is accepted.

Table 10.5: MTBE case study data for ANOVA by soil type.

MTBE soil concentration   Soil type at the highest   ANOVA group
(mg/kg)                   soil concentration         by soil type
6.4                       CL                         1
120                       CL                         1
22.6                      CL                         1
32                        CL                         1
57                        CL                         1
2.7                       SC                         1
200                       ML                         1
3.5                       ML                         1
41                        ML                         1
210                       ML                         1
315                       ML                         1
2.3                       ML                         1
11.2                      ML                         1
430                       ML                         1
15                        SM                         2
74.1                      SM                         2
15                        SM                         2
58                        SM                         2
0.87                      SM                         2
3.1                       SM                         2
160                       SM                         2
1.5                       SM                         2
0.53                      SP/SM                      2
710                       SW                         2
7.3                       SP/SW                      2
0.24                      SP                         2
170                       SG                         2
90.3                      SP                         2
10                        GW                         2

MTBE: methyl tertiary butyl ether. Soil type is expressed by USCS symbols (CL: clay; SC: clayey sand; ML: silt; SM: silty sand; SP: poorly graded sand; SW: well-graded sand; SG: gravelly sand; GW: well-graded gravel).

As discussed earlier, the acceptance of a hypothesis is a weak statistical argument. Accepting the hypothesis of equal means between MTBE detected in fine-grained soil and in coarse soil therefore does not support the strong conclusion that these means really are equal, but only the weak statement that we cannot tell the two means apart. It can only be concluded that the difference between the two groups of MTBE concentrations is not statistically significant. This implies that it is difficult to say whether MTBE is more likely to be detected in a sand/gravel type soil than in a fine-grained material type soil, or vice versa.
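The same test with scipy (scipy.stats.f_oneway handles the unequal group sizes directly):

```python
from scipy import stats

clay_silt = [6.4, 120, 22.6, 32, 57, 2.7, 200, 3.5, 41, 210, 315, 2.3, 11.2, 430]
sand_gravel = [15, 74.1, 15, 58, 0.87, 3.1, 160, 1.5, 0.53, 710,
               7.3, 0.24, 170, 90.3, 10]

f_stat, p_value = stats.f_oneway(clay_silt, sand_gravel)
f_crit = stats.f.ppf(0.95, 1, 27)   # F-table value, alpha = 0.05, df (1, 27)
print(round(f_stat, 3), round(f_crit, 2))  # ~0.073 < ~4.21: cannot reject equal means
```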

10.5.2 Case study II for ANOVA

ANOVA can be used to compare trichloroethylene (TCE) concentrations detected at upgradient and downgradient groundwater monitoring wells in order to evaluate the contribution from a potential source at an impacted site. Table 10.6 presents TCE groundwater monitoring data obtained from two upgradient wells (MW1 and MW2) and three downgradient wells (MW3, MW4 and MW5, the most downgradient) over a period of one year (four quarters). With the monitoring wells as the dividing factor (m = 5), the ANOVA calculation procedure of Table 10.4 yields an F-test ratio of 3.093. The F-table value at α = 0.05 with degrees of freedom (4, 15) is 3.06, which is smaller than the F-test ratio calculated from the TCE data. Therefore, the hypothesis of equal mean TCE concentrations among the upgradient and downgradient wells is rejected. It is then concluded that there is a statistically significant difference between the upgradient and downgradient wells. Therefore, there might be a contributing source at the subject site, because the TCE concentration at the downgradient wells may be higher than that at the upgradient wells.

Table 10.6: TCE case study data for ANOVA by well.

Well no.   Quarter   TCE concentration (µg/L)
MW1        1         1500
MW1        2          340
MW1        3         1900
MW1        4          460
MW2        1          300
MW2        2           75
MW2        3          430
MW2        4         1100
MW3        1         2200
MW3        2          420
MW3        3         2000
MW3        4          390
MW4        1         1100
MW4        2          370
MW4        3         1800
MW4        4          930
MW5        1         3150
MW5        2          985
MW5        3         3600
MW5        4         2100

TCE: trichloroethylene.

As discussed earlier, the rejection of a hypothesis presents a strong argument. However, the conclusion of this case study may be debatable, because the F-test ratio (3.09) is so close to the F-table value (3.06) that the conclusion could easily swing the other way given the uncertainties in field data and in laboratory analytical and reporting procedures. Facing this situation, we should use additional site-specific information that may further support or contradict the conclusion. Figure 10.3 plots the data from all wells and shows that the TCE concentration at the most downgradient well, MW5, is higher than at any other well at all times. This evidence further justifies the conclusion drawn from ANOVA. In addition, continued groundwater monitoring may be necessary to generate more data points with which to test the ANOVA result further.

[Figure 10.3: ANOVA data for all wells. TCE concentration (µg/L) at wells MW1–MW5 plotted against monitoring quarter.]
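And the corresponding check in scipy:

```python
from scipy import stats

wells = {
    "MW1": [1500, 340, 1900, 460],
    "MW2": [300, 75, 430, 1100],
    "MW3": [2200, 420, 2000, 390],
    "MW4": [1100, 370, 1800, 930],
    "MW5": [3150, 985, 3600, 2100],
}
f_stat, p_value = stats.f_oneway(*wells.values())
f_crit = stats.f.ppf(0.95, 4, 15)   # F-table value, alpha = 0.05, df (4, 15)
print(round(f_stat, 3), round(f_crit, 2))  # ~3.093 > ~3.06: reject equal means
```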

10.6 DATA TREND ANALYSIS

Statistics practitioners often use statistical methods to analyse environmental monitoring data in time sequence to see the trend of concentration changes over time. A popular method for groundwater monitoring data analysis is the Mann–Kendall method (Gilbert, 1987; Kendall, 1975). Let x_1, x_2, ..., x_j, ..., x_n be n data points, where x_j is a monitoring value at time j. The Mann–Kendall index is then calculated as

    S = Σ_{k=1}^{n−1} Σ_{j=k+1}^{n} MK(x_j − x_k)    (10.13)

where

    MK(x_j − x_k) = 1, if (x_j − x_k) > 0
    MK(x_j − x_k) = 0, if (x_j − x_k) = 0
    MK(x_j − x_k) = −1, if (x_j − x_k) < 0
    (k = 1, 2, ..., n − 1; j = k + 1, k + 2, ..., n)

and S is the total score of the Mann–Kendall index. MK is the individual Mann–Kendall score for one comparison, where x_j is the data point at time j and x_k is the data point at time k. A very high positive value of S indicates an increasing trend, and a very low negative value indicates a decreasing trend. The method also takes many other statistical steps (probability, hypothesis testing, etc.) to justify the significance of the trend further. Nevertheless, the flaws in the fundamental Equation 10.13 of the Mann–Kendall method for data trend analysis are discussed below.

The potential pitfall of this method is to analyse the temporal trend of the data without looking into the magnitude of the data values and the meanings of those values. This type of statistical method, which, like the Mann–Kendall method, compares data pairs, may be suitable where a 'yes or no' determination is needed, for example where there is detection versus no detection. However, the method does not take the magnitude of the data values into account: the MK score is obtained regardless of the magnitude of the values compared. For example, MK equals −1 if x_k = 10 000 and x_j = 10, but MK also equals −1 if x_k = 10 000 and x_j = 9999. Despite both MK scores being equal to −1, the former case shows a great decrease from 10 000 to 10, while the latter shows almost no decrease at all. The concentration magnitude matters in a situation where active remedial action is needed.

10.6.1 Case study for data trend analysis

Table 10.7 shows two cases of groundwater monitoring data for TCE over a monitoring period of four quarters. Both cases decline in every quarter, so both case 1 and case 2 have the same Mann–Kendall statistic, S = −6; in reality, however, case 1 shows very little decrease, while the decrease in case 2 is very significant over the four quarters. The Mann–Kendall method is therefore not able to distinguish case 1 from case 2.

Table 10.7: TCE groundwater monitoring data in four quarters (unit: µg/L).

            Case 1    Case 2
Quarter 1    1010     10 000
Quarter 2     995       5600
Quarter 3     980        479
Quarter 4     969         34

TCE: trichloroethylene.
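A direct implementation of Equation 10.13 confirms that the two cases are indistinguishable by S alone:

```python
import numpy as np

def mann_kendall_s(series):
    """Mann-Kendall statistic S (Equation 10.13): the sum of the signs
    of all pairwise differences x_j - x_k for j > k."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    return int(sum(np.sign(x[j] - x[k])
                   for k in range(n - 1) for j in range(k + 1, n)))

case1 = [1010, 995, 980, 969]
case2 = [10000, 5600, 479, 34]
print(mann_kendall_s(case1), mann_kendall_s(case2))  # -6 -6 for both cases
```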

10.7 SUMMARY AND CONCLUSIONS

In this chapter, we have reviewed five commonly used statistical methods for data analysis in environmental forensics and discussed the potential pitfalls associated with their application, using real case study data. For the estimation of the percentile and confidence interval, the potential pitfall is the automatic assumption of a normal distribution for environmental data, which so often show a log-normal distribution; the percentile and confidence interval determined from a normal distribution and from a log-normal distribution can be very different. For the correlation coefficient, the potential pitfall is the use of a wide range of data in which the maximum data points may trivialise the other, smaller data points and consequently skew the correlation coefficient. For regression analysis, the potential pitfall is the propagation of the uncertainties of the input variables into the regression model prediction, which may be even more uncertain. For ANOVA, the potential pitfall is treating the acceptance of a hypothesis, a weak argument, as if it implied a strong conclusion. For data temporal trend analysis, the potential pitfall is to analyse the data trend without looking into the magnitude of the data values and the meanings of those values. As demonstrated in this chapter, we may draw very different conclusions from statistical analysis if the pitfalls are not identified.

Environmental data so often show a wide range of values, owing to uncertainties associated with the pollutants' release timing, frequency, quantities and sampling locations in a heterogeneous environment, as well as with analytical testing and reporting procedures. This wide range may lead us into potential pitfalls and inappropriate conclusions in statistical analysis. When analysing this kind of wide-range data, traditional statistical methods may not be adequate to support a convincing conclusion, as demonstrated in this chapter. To use statistical methods effectively and correctly, we must understand the mathematical implications of every applied equation and the variation of the environmental data, so that the possible pitfalls can be detected. If possible, more than one statistical method should be applied to the environmental data to discover whether the methods support or contradict each other. It is always good practice to glance at the sample mean and standard deviation for a sense of the data distribution and variation range before applying any statistical method. Statistical conclusions should be further evaluated against site-specific conditions to find out whether the statistical results have any relevant physical meaning. Awareness of the potential statistical pitfalls leads us to more intelligent conclusions from statistical analysis in environmental forensics.

ACKNOWLEDGEMENT

This chapter is based on an article by the current author that was published in the journal Environmental Forensics, entitled 'Statistical methods and pitfalls in environmental data analysis' (copyright 2000, Vol. 1, pp. 213–220). The author would like to acknowledge the generous permission to use the original publication from Taylor and Francis (www.informaworld.com).


REFERENCES

Berthouex, P.M. and Brown, L.C. (1994). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton, Florida.
Davis, J.C. (1986). Statistics and Data Analysis in Geology. John Wiley, New York.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, New York.
Happel, A.M., Beckenback, E.H. and Halden, R.U. (1998). An Evaluation of MTBE Impacts to California Groundwater Resources. UCRL-AR-130897 (11 June). Lawrence Livermore National Laboratory (LLNL), Livermore, California.
Kendall, M.G. (1975). Rank Correlation Methods, 4th edn. Charles Griffin, London.
Rong, Y. (1998). Groundwater sampling: to purge or not to purge. Environmental Geosciences. 5, 2: 57–60.
Rong, Y. (1999a). Groundwater data analysis for methyl tertiary butyl ether. Environmental Geosciences. 6, 2: 76–81.
Rong, Y. (1999b). A study of vertical plume length for methyl tertiary butyl ether in the Charnock Wellfield investigation area. Environmental Geosciences. 6, 3: 123–129.
US EPA (US Environmental Protection Agency) (1989). Statistical Analysis of Ground-water Monitoring Data at RCRA Facilities. Section 8 (April). Office of Solid Waste Management Division, Washington D.C.
Walpole, R.E. (1974). Introduction to Statistics. Macmillan Publishing, New York.
Winkler, R.L. and Hays, W.L. (1975). Statistics: Probability, Inference, and Decision, 2nd edn. Holt, Rinehart and Winston Publishers, New York.


aquifers analysis of piezometric time series long-term variations 123–5 noise processes 130–4 pumping test information 120–2 short-term variations 125–32 effect of naturally occurring stresses 127–32 hydrothermal systems 117–19 residence/travel times 119, 170 see also Rennes les Bains hydrothermal aquifer ASTM-RBCA transport models 167 AT123D transport model 167 attenuation relationships 74 in earthquake study 77 attractor 133 dimension of 133, 134 see also reconstructed attractors average linkage method (in cluster analysis) 197 ‘average mutual information’ method 133, 134 averaged-parameter simulation 180, 190 background seismicity 77 barometric effects on aquifers 129 Rennes les Bains well data 129–30, 130, 131 benthic response index (BRI) 105 benzene case study data for percentile and confidence interval 246–7 in gasoline 179 groundwater monitoring data 156, 157, 158, 159, 160, 246 physical properties 179 BIOCHLOR model 167 biodegradation rates 168, 178, 180 BIOSCREEN model 167

262

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

blue-winged teal duck survey 84, 85, 89 Bois’ ellipse test 225–6, Fig. 9.9(colour) in case study 235, 236, 239–40 box–whisker plot 142, 144–6 applications 157, 158 break(s) of stationarity 223, 224, 225–6 in case study 236, 239 breakthrough curves 171–2 early breakthrough 172–3, 180, Fig. 7.5(colour) late breakthrough 173, Fig. 7.5(colour) one-source simulation model 180, 181 six-sources simulation model 186, 187–8, 188–9 two-sources simulation model 184, 186, 187 brownfield redevelopment 165 Brownian noise 126, 127 California earthquake hazard prediction study 75–8 Orange County Dry Weather Monitoring Programme 111–12, 112, 113 nutrient and sediment loading 109 statistics in environmental policy-making and compliance 97–112 Surface Water Ambient Monitoring Programme 103–4 California logistic regression model 102 California Toxics Rule (CTR) 101 implementation of 110–11 capture zone model 170, 171, 182 carbon chain analysis, petroleum hydrocarbons 198–200 carbon number analysis petroleum hydrocarbons 203, 204 cluster analysis 210–11, Fig. 8.5(colour) Castle Hill buttercup study 90–2 CES approach see cost-effective sampling chemical packaging facility, trend analysis for 151–2 chloroform, in ground water 152, 153 ‘city block distance’ 196–7 Clean Water Act (CWA, 1972) 98–9 List of Impaired Waters 98 see also CWA 303(d) list climatic data, presentation of 227 cluster analysis 196–8 examples 204–5, 210–14 petroleum hydrocarbons carbon number data 210–11, Fig. 8.5(colour)

PIANO data 210, Fig. 8.4(colour) coefficient of variance, in Mann–Kendall trend analysis 143 complete allocation stratified sampling 89–90, 92 case study 90–2 complete linkage method (in cluster analysis) 197 compliance failures, probabilistic estimation of 20–1 comprehensive assessment 142, 150–1 applications 154, 160 concentration trend analysis for local areas 142–4 applications 152, 157 for site-wide distributions 142, 144–6 applications 157, 158 conditional entropy 30–2 confidence interval calculation of 245–6 in case study 246–7 potential pitfalls 247, 258 confidence limits in compliance assessment 21–22 in Mann–Kendall trend analysis 143 confirmatory data analysis 28 constituents of potential concern (COPCs) in fuel terminal site 156 trend analysis for 157 contamination concentration distribution, graphical analysis of 142, 1144–6 continuous monitoring, advantages and disadvantages 19 continuous wavelet analysis 125 correlation analysis, Rennes les Bains well data 123, 124, 129–30 correlation coefficient calculation of 247 in case study 247–8 potential pitfalls 247, 249, 258 cost considerations 7 cost-effective sampling (CES) approach 149 modified method 142, 149–50, 151 applications 159, 160 decision process in 149–50, 151 in MAROS software 150 crack ratio (in vapour intrusion model) 173 criterion continuous concentration (CCC) calculation of 102 derivation of 100 criterion maximum concentration (CMC), calculation of 102

INDEX

cross-correlograms, Rennes les Bains well data 124, 130 cumulative probability curves one-source simulation 183, 184 six-source simulation 189 cumulative regression residuals temporal evolution of 225, Fig. 9.9(colour) see also Bois’ ellipse test CWA 303(d) listing 98, 105 California state policy 104–8 see also Clean Water Act dam design 218 Darcy permeability, Rennes les Bains aquifer 121, 134 Darcy’s law 165, 168 data decision-making process and 3–6 meaning of term 6 role in environmental management 2–3 transfer into information 7–10 data analysis 9–10 data collection systems 7 design 9, 12, 81 objectives and constraints 12 data distributions 244 data formats 5, 13 data management impacts of poor management 11 integration of 10–12 data management systems information gap 4 limitations 58 objectives and constraints 9 quality control in 4–5 see also environmental data management system(s) data preparation 35, 37 data presentation 5, 13 climatic data 227 data storage 9 data uncertainty, and model uncertainty 163, 164 ‘data-rich/information-poor syndrome’ 6, 12 databases, in decision support systems 1, 3, 35, 37 decision making for environmental management 34–40 information required for 2 decision-making process data and 3–6 for water resources management 35, 36 decision support systems (DSS) 1, 3, 35

263 components 1, 3, 35, 37 Delaunay method 142, 146–8 applications 153–4, 159, 160 area ratio (AR) 147–8 concentration ratio (CR) 147, 148 in MAROS software 148, 153, 159 multiple contaminants 148 multiple sampling method 148 slope factor (SF) 147, 148 Delaunay triangulation 146–7, Fig. 6.3(colour) delays, method of 133 dendrograms 196 examples 210, 211, 214, Figs 8.4–8.7(colour) diesel-range organics (DRO), monitoring of concentration in ground water 156, 157, 158, 159, 160 digital elevation model (DEM), Gediz River Basin 45, Fig. 1.10(colour) dispersion coefficient, in transport model 168, 169–70 dispersivity, scale dependence of 169 distance measures in cluster analysis 196–7 distribution-free methods 26 see also non-parametric methods Domenico models 167 double mass analysis 220–1 see also Bois’ ellipse test DPSIR approach 39, 40 Dry Weather Monitoring (DWM) Programme (Orange County, California) 111–12 tolerance intervals 112, 113 dynamical systems, study of 132–3 earth tide 127 effect on Rennes les Bains well levels 128–9 earthquake hazard prediction 67–79 example using historical data 74–8 earthquake intensity attenuation relationships 74 in example 77 edge units, in adaptive cluster sampling 83, 85 effluent limits 111 entropy measures 30–2 environmental data 4 characteristics 16–17 shortcomings 5–6, 12–13 environmental data analysis 15–34 selection of methodology 15 statistical methods and pitfalls in 243–59 environmental data management system(s) 7, 8 goal of 7

264

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

integrated approach 10–12 see also data management systems environmental forensics, statistical methods used in 195–214, 243–58 environmental management decision making for 34–40 development of information system for 35, 37 sustainability issues 37–40 environmental models, empirical basis 165 environmental processes, characteristics 4 environmental regulations surface water quality in California 98–9 implementation of 110–12 environmental sampling design, statistics in 103–4 environmental sampling plan(s), evaluation and optimisation of 141–61 EPA see US Environmental Protection Agency epistemic uncertainty 68, 71 error distributions, shapes 227 error types 217, 219 Euclidean distance 196 European Environment Agency (EEA), sustainability indicators 39, 40 exceedance probability in earthquake hazard prediction 70, 77 in water resources management 20 expert systems 37 in decision support systems 1, 3, 35, 37 exploratory data analysis 28, 198 extreme value(s) 15 estimation of 19–22 F-test 253 F-test ratio value calculation of 253 in case studies 253, 255 factor loadings 229 spatial projections 229, 231 in case study 234–5, Fig. 9.19(colour) spatial variograms for 231, 233 in case study 235, 239 factors scores 218 effect of errors on 229, 232 fault activity rate (earthquakes) 71–2 fault sources (earthquakes) 69–70, 77 first-order auto-regressive process, effective sample size for 25–26 Fisher information measure 30 use in hydrology 30 flow-adjusted water-quality concentrations 27 fractals, noise classification based on 126

freshwater data 3–4 fuel terminal site comprehensive evaluation of monitoring plan for 154–60 concentration trend analysis 157–9 sampling frequency analysis 159 sampling location analysis 159 ground water monitoring network 156, Fig. 6.8(colour) site information 155–6 vapour extraction and air sparging (VE/AS) system 156 gas chromatography/electron capture detector (GC/ECD) analysis, petroleum hydrocarbons 211 gas chromatography/flame ionisation detector (GC/FID) analysis petroleum hydrocarbons 198–202 carbon chain analysis 198–200 list of identifiable compounds 200–2 raw data 202 gas chromatography/mass spectrometry (GC/MS) analysis petroleum hydrocarbons 202–4 carbon number analysis 203, 204 PIANO analysis 202–4 gasoline leaching model 176–8 gasoline-range organics (GRO), monitoring of concentration in ground water 156, 157, 158, 159, 160 Gaussian noise 126, 127 Gediz River Basin (Turkey) 41–58 aquifers 53, 55 basin profile data 41, 42–4 districts within 45 flow time series 52 geospatial data 45, 50–1 base maps 45, 51, Fig. 1.10(colour) hydrometeorological observations 50–1 landcover information 45, 50 Fig. 1.12(colour) river network 45, 51 soil information 50, Fig. 1.13(colour) hydrometeorological data 50–1 location 41, Fig. 1.8(colour) modelling studies 51–3 reaches characteristics 53, 55, Fig. 1.14(colour) reservoir characteristics 53, 53 scenario development and testing 56–8, 59, 60–1 socio-economic data 41, 45, 46–50

265

INDEX

agricultural income 48–9 birth and death rates 47 employment rates 50 life expectancy values 47 migration data 47 population density and growth 46 geographical information systems (GIS), in decision support systems 1, 3, 35, 37 GEWEX, generation of data sets 12 goodness-to-fit coefficient (regression model) 249–50 ground water contamination of 148, 152, 153 fluctuations in Rennes les Bains well 120–34 in Gediz Basin 42 groundwater flow velocity Rennes les Bains aquifer 121 in transport model 168, 170 groundwater models, incorrect calibration of 218 groundwater monitoring ANOVA calculations 255 temporal data trend analysis 257 groundwater monitoring networks/ programmes, evaluation of 148, 154–60 Gutenberg–Richter model 72 Henry’s Law coefficient 174, 177 homogeneity requirements, in simple transport model 167 Horvitz–Thompson estimator 83–4, 85 hydraulic conductivity 165 fuel terminal site 155–6 range in various materials 134, 155–6, 165 Rennes les Bains aquifer 134 transport affected by 170 variations in 170, 178 hydrological time series 17 hydrometeorological observations, Gediz River Basin 50–1 hydrothermal aquifers characterisation of 117–35 standard investigative approaches 117 see also Rennes les Bains hypocentre location density function 73 hypothesis testing 23–5, 105, 253 and ANOVA 253, 254, 256 balanced Æ and  errors approach 107 in CWA 303(d) listing process 105–7 information blurring of see noise

meaning of term 6 requirements for decision-making process 2–3, 6–7 role in environmental management 2–3 transfer of data into 7–10 information content of data, measuring 30–2 information system for environmental management, development of 35, 37 information transmission 32 integrated environmental data management 10–12 intervention analysis 22, 26, 28 intrablock tests 27, 29 Jacob equation 130 Johnson–Ettinger model (JEM) 173, 175 joint entropy 31 joint inclusion probabilities 85, 86 karstic aquifer 118, 124, 125, 135 Kendall test 27, 28, 29 Kolmogorov–Sinai entropy 133 laboratory analyses, uncertainties caused by 9 landcover maps 45, Fig. 1.12(colour) leaking underground storage tank (LUST) sites contamination of municipal water supply by 176 ratio analysis vs raw data 211–14, Fig. 8.6–8.7(colour) in simulation models 171, 182 least squares method 249 light non-aqueous phase liquid (LNAPL), at leaking underground storage tank sites 211–14 linear regression model 249 in case study 251, 252 linear trends 23 detection of 23–4, 26–7 linkage methods in cluster analysis 197 log-normal distribution 244, 245 logarithmic regression model, in case study 251, 252 logic tree approach 76–7 Long Island aquifer 170 magnitude density distribution (earthquakes) 72–3 ‘Manhattan distance’ 196–7, 210–11 Mann–Kendall statistic 143, 256–7 Mann–Kendall test 27, 28, 29, 142–3 Mann–Kendall trend analysis 142–3, 256–7 applications 151–2, 154, 157, 160, 257

266

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

concentration trends in 143, 144 in MAROS software 144 in modified CES method 149 potential pitfalls 257, 258 Mann–Whitney test 24, 26, 29 MAROS see Monitoring and Remediation Optimisation System software mass conservation, in transport models 166–7 mean value(s) 15 estimation of 17–19 Mediterranean Action Plan (MAP), sustainability indicators 39 methyl tertiary butyl ether (MTBE) groundwater monitoring data 250–2 physical properties 179 soil concentrations relationship with groundwater concentrations 247–9 and soil types 253–4 Millington–Quirk relationship 174 model uncertainty, and data uncertainty 163, 164 modelling 7, 9–10 models calibration of 164–5 in decision support systems 1, 3, 35, 37 predictive error in 164 see also surface transport models; TELEMAC model; WaterWare (WRM) model monitoring programme(s) evaluation and optimisation of, for fuel terminal site 154–60 sampling location analysis 142, 146–8, 153–4 see also sampling plan(s) Monitoring and Remediation Optimisation System (MAROS) software, techniques built into 144, 148, 150, 153, 159 Monte Carlo simulation 175 in gasoline leaching model 178, 179–80 limitations 175 one-source simulation 180–4 time series 24 Morlet wavelet analysis 125 of Rennes les Bains well data 127, Fig. 5.9(colour) multiresolution wavelet analysis 125 Rennes les Bains well data 129, 132, 133 municipal water supply wells, contamination of 175–80

National Pollutant Discharge Elimination System (NPDES) permits 111 National Toxics Rule (NTR) 101 natural resources management, environmental data and information for 1–66 Needles (California, USA) earthquake hazard prediction study 75–8 seismic activity 74–5 neighbourhood searching 83 Neptune, discovery of 166 networks, adaptive cluster sampling 83 noise (in data) 13–15 analysis of Rennes les Bains data 130–1 causes 14 classification of 126–7 study of processes causing 132–3 non-exceedance probabilities 20 non-parametric methods meaning of term 26 for trend analysis/detection 22, 26–9, 142–4 non-stationarity, removal from series data 17, 18 normal distribution 244 normal distribution probability table 244, 245 OPTIMA project 1, 39, 40, 41–58 data requirements 56 optimization constraints 55–6 performance criteria and indicators 55–6 scenario development and testing 56–8, 59, 60–1 orthogonal wavelet analysis 125 outliers see anomalies paired t-test 146 parametric methods, for trend detection 22–6 Pearson correlation, as distance measure 197, 210 percentiles calculation of 244–5 in case study 246–7 potential pitfalls 247, 258 petroleum hydrocarbons contamination by 175–90, 195 GC/FID analyses 198–202 carbon chain analysis 198–200 list of identifiable compounds 200–2 raw numerical data 202, 203 GC/MS analyses 202–4 carbon number analysis 203, 204

INDEX

full scan 204, 206–9 list of identifiable compounds 203 PIANO analysis 202–4 identifiable compounds listed 200–2, 212–13 see also gasoline PIANO (paraffins/isoparaffins/aromatics/ naphthenes/olefins) analysis petroleum hydrocarbons 202–4 cluster analysis 210, Fig. 8.4(colour) examples for various petroleum products 203–4, 205 Poisson process 70 pollutant concentration, exceedance of critical level 20–1 population statistic 25 porosity of aquifer 130 Rennes les Bains aquifer 130, 134 Porter–Cologne Act (1969) 99 power of tests, in trend detection 24, 26–7, 28 prediction-with-observation approach 166 predictive error in models 164 primary models 34 principal component analysis (PCA) use in anomaly detection 221, 227, 228, 229, 231 case study 234, 235, 238 probabilistic seismic hazard analysis (PSHA) concepts used 67–8 example using historical data 74–8 mathematical formulation 68–73 pumping test results for Rennes les Bains well 121–2 Theis–Jacob recovery method 121, 122 rainfall data in case study 234–7 Bois’ test 235, 236, 239–40 location of monitoring stations 234, Fig. 9.18(colour) PCA results 234, 238 spatial projection of factor loadings 235, Fig. 9.19(colour) spatial variograms for factor loadings 235, 239 hypothetical dataset 229, 230 detected errors 231, Table 9.4(colour) with introduced errors 229, 231, Table 9.3(colour) rank sum test 28 rare and clustered populations 82 adaptive cluster sampling 82–7 adaptive two-stage sequential sampling 89

267 complete allocation stratified sampling 89–92 ratio analysis, petroleum hydrocarbons 200–1, 214, 215 Rayeigh number 135 Rayleigh–Benard convection 134 reconstructed attractors analysis 133 Grassberger–Procaccia method 133 Rennes les Bains well data 133–4, Fig. 5.19(colour) recoverability 37–8 regional vectors 218, 228, 239 regression analysis 249–50 case study 250–2 potential pitfalls 252, 258 regression equation 25 and earthquake intensity attenuation relationships 74 regression model 249 sediment toxicity evaluated using 102 regression residuals analysis 221–5 detection of accidental anomalies using 221–3, Fig. 9.4(colour) detection of systematic anomalies using 223–5 regression residuals plot 221–2, Fig. 9.3(colour) remedial sites, examples of monitoring plans 151–60 remote sensing data and technology 11, 34–5, 58 renewability 37 Rennes les Bains (France) hydrothermal aquifer circulation times 119 flow velocity 121 hydrogeological cross-section 119 location 118 permeability 121, 134 porosity 130, 134 thickness 121 Rennes les Bains springs 119 Rennes les Bains well 119 analysis of piezometric time series long-term variations 123–5 noise processes 130–4 pumping test information 120–2 short-term variations 125–32 capacity effect 121, 122, 135 effect of karstic aquifer 118, 124, 125, 135 groundwater level fluctuations 120–34 temperature gradient in 119, 120 thermal convection 118, 131, 134–5

268

PRACTICAL ENVIRONMENTAL STATISTICS AND DATA ANALYSIS

representative sampling 9 resilience 37 retardation factor, in transport model 168 return period, in earthquake hazard analysis 69, 78 runoff data 16 rupture dimension density functions 73 rupture location density functions 73 sample variance of residuals, plot vs sampling interval 18–19 sampling errors 21–2 sampling frequency 16, 19 sampling frequency analysis 142, 149–51 application(s) 159 sampling location analysis 142, 146–8 application(s) 153–4, 159 sampling locations 16 selection for future monitoring 153–4, 155 sampling plans evaluation and optimisation of 141–61 multicomponent approach 142–51 site applications 151–60 purpose 141 see also monitoring programme(s) sampling requirements 9 ‘sand box’ aquifers 167 scenario development and testing 37 Gediz Basin case study 56–8, 59, 60–1 OPTIMA project 56–7, 57 SMART project 57, 57 screening models 34 seasonality of water quality, and trend detection tests 27, 28, 29 secondary models 34 sediment quality objectives (SQOs) 102–3 seismic moment 71 sentry wells 154 Shannon entropy 30–2 sign test 142, 146 applications 157 signal-processing techniques 123–4 simple random sampling, compared with other techniques 81, 86 simulation models one-source simulation 180–4 six-source simulation 186, 187, 188–9 two-source simulation 184, 187 see also Monte Carlo simulation single linkage method (in cluster analysis) 197 SMART project 1, 39, 40, 41–58 smear zone leaching 177

soil concentrations MTBE relationship with groundwater concentrations 247–9 and soil types 253–5 soil maps 50, Fig. 1.13(colour) soil types MTBE soil concentrations influenced by 253–5 USCS classification 253, 254 Southern California Bight Regional Monitoring Programme 104, Fig. 4.4(colour) Spearman’s rho test 24, 26 spectral analysis, of Rennes les Bains well data 123–4, 127–9 spectral density function, noise analysis using 130–1 stationarity break(s) 223, 224, 225–6 in case study 236, 239 step trends 23 detection of 23, 24, 26 storage coefficient (for aquifers) 129 for Rennes les Bains aquifer 129 stratified sampling adaptive allocation for 82, 87–8, 89–90 meaning of term 81, 87 stream gauging stations, in Gediz Basin 51, 51 Student t-distribution 18, 24 Student t-statistic 18 subsurface transport models basis 166–8 predictive capability 164 simple models 167 surface water, in Gediz Basin 42 Surface Water Ambient Monitoring Programme (SWAMP) 103–4 surface water quality regulations 98–9 sustainability adoption as policy 2 in environmental management 37–40 systematic anomalies 219 detection of 223–5 systematic errors 217, 219 shape of error distribution 228 TELEMAC model 58 temporal data trend analysis 256–7 Theis–Jacob approximation method 121 threshold not-to-exceed exceedance, in CWA 303(d) listing process 107 time series rainfall data (in case study) 234–7


  reference time series
    comparison to 220–6
    construction of 227–33
  types of errors in 219
time series analysis, water quality data 23, 26–7, 28–9
time series data 16
total maximum daily loads (TMDLs) 98–9, 108–10
  expression as non-daily loads 108–9
transport in environmental systems, uncertainty in modelling 163–91
transport equation
  mathematical statement 168
  parameters in 168–71
  restrictive assumptions for analytic solution 167, 190
transport models see subsurface transport models
tree diagrams see dendrograms
trend analysis/detection
  Lettenmaier’s techniques 25–6, 26–7, 28, 29
  for local-area concentration trends 142–4, 151–2
    applications 151–2, 157, 159
  non-parametric methods 22, 26–9, 142–4, 146, 256–7
  parametric methods 22–6
  selection of method 28–9
  for site-wide concentration trends 142, 144–6
    applications 157–9
trends 15
  estimation of 22–9, 142–6
trichloroethene (TCE), in ground water 153, 255–6, 257
truncated exponential model 72
  and seismic data 75, 76
Turonian karstic aquifer 125
two-phase sampling, meaning of term 93
two-phase stratified sampling, adaptive allocation applied to 82, 88–9, 92
two-stage sampling
  adaptive allocation for 89
  meaning of term 87, 93
uncalibrated models 165
uncertainty analysis
  for simple transport model 166–73
  for subsurface transport in ground water 175–90

  for vapour intrusion model 173–5
uncertainty in modelling transport 163–91
uniform aquifers, transport in 171–3
US Environmental Protection Agency (EPA)
  on data uncertainty and model uncertainty 163, 164
  effluent guidelines 111
  soil screening criteria 167
  toxicity studies 99–100
vapour extraction and air sparging (VE/AS) system 156
vapour intrusion into indoor air 165, 173–5
variables, water quality 16
Voronoi diagrams 147, Fig. 6.3(colour)
water management studies
  Gediz Basin 41–58
  models used in 33–4
  statistics used
    driving forces 32
    observed data 33
    watershed system statistics 33
water pollution, total maximum daily loads 98–9, 108–10
water quality criteria (WQC) 99–100, 101
  alternative to 101
  factors affecting 102
water quality standards (WQSs)
  components 99, 100
  monitoring of 98
  process of setting 100–1
water quality variables 16
  estimation of mean values 17
water resources information systems 3–4
water resources management
  decision-making process for 34–5, 36
  sustainability issues 37–40
watershed system statistics 33
WaterWare (WRM) model 51
  data requirements 51–3
  Gediz Basin scenarios 58, 60–1
wavelet analysis 125
  of Rennes les Bains well data 127, Fig. 5.9(colour)
weather forecasting 166
Wilcoxon signed rank test 146
World Bank, sustainability indicators 38–9
Youngs and Coppersmith characteristic model 72–3


Figure 1.8 (colour plate): map of western Turkey locating the Gediz Basin near Izmir (Istanbul shown for reference).

Figure 1.10 (colour plate).

Figure 1.12 (colour plate).

Figure 1.13 (colour plate): soil map of the Gediz Basin.


Figure 1.14 (colour plate).

Figure 2.4 (colour plate): contribution (%) to the 475-year ARP seismic hazard as a function of magnitude (5 to 8) and distance (0 to 50 km).

Figure 4.4 (colour plate): map of the Southern California Bight Regional Monitoring Programme area (Santa Barbara, Ventura, Los Angeles, Orange and San Diego counties).

Figure 5.9 (colour plate): piezometric record of the Rennes les Bains well (piezometry, m, against time, 1–28 August 1997) and the corresponding wavelet scalogram (scales from 1 h 20 min to 3.5 days).

Figure 5.18 (colour plate): (a) reconstructed attractor of the piezometric fluctuations (April 1996), shown as log C against log l; (b) attractor dimension against embedding dimension, saturating at about 2.8.
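The saturation in Figure 5.18(b) is the signature of the Grassberger–Procaccia correlation-dimension estimate cited in the index. As an illustration only, the following Python sketch (synthetic sine-plus-noise signal, hypothetical delay tau and radii, not the book's code) builds the delay embedding, computes the correlation integral C(l) and reads the dimension from the slope of log C against log l:

```python
# Grassberger-Procaccia sketch on a synthetic signal (all values hypothetical).
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(5)
series = np.sin(np.linspace(0, 60, 2000)) + 0.05 * rng.normal(size=2000)

def correlation_dimension(x, m, tau=5, radii=np.logspace(-1.5, 0, 12)):
    """Slope of log C(l) vs log l for embedding dimension m and delay tau."""
    n = len(x) - (m - 1) * tau
    emb = np.column_stack([x[i * tau : i * tau + n] for i in range(m)])
    d = pdist(emb)                                   # all pairwise distances
    C = np.array([(d < r).mean() for r in radii])    # correlation integral C(l)
    good = C > 0                                     # keep radii with nonzero counts
    return np.polyfit(np.log(radii[good]), np.log(C[good]), 1)[0]

# The estimated slope should level off once m exceeds the attractor dimension.
for m in range(1, 7):
    print(f"embedding dimension {m}: slope ~ {correlation_dimension(series, m):.2f}")
```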

Figure 6.3 (colour plate): plan of the monitoring well network (MW-13 to MW-62 and MW-A1 to MW-A3); Voronoi polygons are connected by yellow lines and Delaunay triangles by blue lines.
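Tessellations like Figure 6.3 underpin the sampling location analysis listed in the index. A minimal sketch, assuming hypothetical well coordinates and using scipy.spatial rather than whatever software the authors used:

```python
# Voronoi/Delaunay sketch for a monitoring well network (coordinates hypothetical).
import numpy as np
from scipy.spatial import Voronoi, Delaunay

rng = np.random.default_rng(0)
wells = rng.uniform(0, 1000, size=(30, 2))   # hypothetical easting/northing, m

vor = Voronoi(wells)     # one polygon per well; small polygons mean dense coverage
tri = Delaunay(wells)    # triangles link each well to its natural neighbours

# Report each closed Voronoi polygon's area (shoelace formula); wells with small
# polygons are candidates for removal in a sampling-location analysis.
for i, region_index in enumerate(vor.point_region):
    region = vor.regions[region_index]
    if -1 in region or len(region) == 0:
        continue                             # open polygon on the hull; skip
    poly = vor.vertices[region]
    x, y = poly[:, 0], poly[:, 1]
    area = 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))
    print(f"well {i}: Voronoi polygon area = {area:.0f} m^2")
```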

Figure 7.5 (colour plate): simulated breakthrough curves (concentration, mg/L, against time, 0–10 000 days) for averaged parameters and for the cases of earliest first arrival with highest peak concentration, latest first arrival, shortest duration and longest duration.
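Curve families like Figure 7.5 arise when a simple transport model is run repeatedly with uncertain parameters. The sketch below is one plausible way to produce such curves, not the authors' method: it uses the leading term of the Ogata-Banks solution for a finite-duration source, and every numerical value (C0, x, pulse length, parameter ranges, threshold) is a hypothetical placeholder.

```python
# Monte Carlo breakthrough curves from a 1-D advection-dispersion sketch.
import numpy as np
from scipy.special import erfc

C0, x = 20.0, 100.0                     # hypothetical source strength (mg/L), distance (m)
t = np.linspace(1.0, 10000.0, 1000)     # time, days
t_pulse = 1000.0                        # hypothetical source duration, days

def c_cont(v, D, tt):
    """Leading Ogata-Banks term for a continuous source at concentration C0."""
    tt = np.maximum(tt, 1e-9)           # clip so tt <= 0 evaluates safely to ~0
    return 0.5 * C0 * erfc((x - v * tt) / (2.0 * np.sqrt(D * tt)))

rng = np.random.default_rng(1)
dt = t[1] - t[0]
for _ in range(5):                      # five parameter draws
    v = rng.uniform(0.01, 0.1)          # seepage velocity, m/day
    D = rng.uniform(0.5, 5.0)           # longitudinal dispersion, m^2/day
    C = c_cont(v, D, t) - c_cont(v, D, t - t_pulse)   # finite pulse by superposition
    above = C > 0.1                     # hypothetical 0.1 mg/L reporting threshold
    arrival = t[np.argmax(above)] if above.any() else float("nan")
    print(f"v={v:.3f} m/d, D={D:.2f} m2/d: arrival day {arrival:.0f}, "
          f"peak {C.max():.1f} mg/L, duration {above.sum() * dt:.0f} d")
```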

Figure 6.8 (colour plate): site plans (a) and (b) of the monitoring well network beside the harbour, showing the general ground water flow direction and the extents of benzene, GRO and DRO.

Figure 8.4 (colour plate): dendrograms (a) to (i) for observations G1 to G8, crossing single, average and complete linkage with Euclidean, Manhattan and Pearson distances (similarity scale 0–100).

Figure 8.5 (colour plate): dendrograms (a) to (i) for observations G1 to G8, crossing the same linkage methods and distance measures.

Figure 8.6 (colour plate): dendrograms (a) to (i) for observations MW1 and GW1 to GW5, crossing the same linkage methods and distance measures.

Figure 8.7 (colour plate): dendrograms (a) to (i) for observations MW1 and GW1 to GW5, crossing the same linkage methods and distance measures.
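Figures 8.4 to 8.7 systematically cross linkage rules with distance measures. A minimal sketch of how such a grid of dendrograms can be generated (hypothetical 8 x 5 data matrix; scipy's correlation distance stands in for the Pearson distance used in the figures):

```python
# Hierarchical clustering sketch: three linkage rules x three distance measures.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
samples = rng.normal(size=(8, 5))            # e.g. 8 wells x 5 analytes (hypothetical)
labels = [f"G{i}" for i in range(1, 9)]

for metric in ("euclidean", "cityblock", "correlation"):  # cityblock = Manhattan
    d = pdist(samples, metric=metric)        # condensed distance matrix
    for method in ("single", "average", "complete"):
        Z = linkage(d, method=method)
        dn = dendrogram(Z, labels=labels, no_plot=True)
        print(metric, method, "leaf order:", dn["ivl"])
```

As the figures show, the grouping of observations can change with both the linkage rule and the distance measure, which is why the book compares all nine combinations.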

Table 9.3 (colour plate): data matrix of 30 observations (O1 to O30) at 25 stations (S1 to S25). [Values not reproduced.]

Table 9.4 (colour plate): the same 30 x 25 data matrix with anomalies highlighted. Note: green shading indicates detected imposed errors; tan shading indicates newly detected anomalies. [Values not reproduced.]

Figure 9.3 (colour plate): regression of Y on X with confidence bands about the conditional mean m_y of half-width 1.96 s_y √(1 − r²) at the 95% level and 1.645 s_y √(1 − r²) at the 90% level; a point (x_i, y_i) with residual ε_i is to be verified at the 95% confidence level.

Figure 9.4 (colour plate): detection performance of the 95% confidence interval (−2 ≤ U < 2): errors of 4 s_y are detected systematically, while errors of 2 s_y are detected only every other time.
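The test behind Figures 9.3 and 9.4 flags an observation when its deviation from the conditional mean exceeds the band half-width 1.96 s_y √(1 − r²). A minimal sketch, with hypothetical gauge series and one imposed error:

```python
# Regression-residual test for accidental anomalies (all data hypothetical).
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(800.0, 150.0, size=50)           # reference series (e.g. neighbouring gauge)
y = 0.9 * x + rng.normal(0.0, 40.0, size=50)    # series under verification
y[17] += 160.0                                  # imposed error, ~4x the noise std dev

r = np.corrcoef(x, y)[0, 1]                     # correlation between the two series
s_y = y.std(ddof=1)
a, b = np.polyfit(x, y, 1)                      # conditional mean m_y(x) = a*x + b
half_width = 1.96 * s_y * np.sqrt(1.0 - r**2)   # 95% band of Figure 9.3

residuals = y - (a * x + b)
for i in np.flatnonzero(np.abs(residuals) > half_width):
    print(f"observation {i + 1}: residual {residuals[i]:+.0f} exceeds ±{half_width:.0f}")
```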

Figure 9.9 (colour plate): cumulative sum Z_i of deviations from the conditional mean (mm) against i, with the 95% confidence envelope ±1.96 σ_ε √(i(n_e − i)/(n_e − 1)) for n_e = 50; the excursion beyond the envelope marks the possible start of gauge shifting.
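Figure 9.9 tests for systematic anomalies: if the series is homogeneous, the cumulative residual Z_i stays inside the envelope ±1.96 σ_ε √(i(n_e − i)/(n_e − 1)). A minimal sketch, with a hypothetical shift imposed halfway through a synthetic record:

```python
# Cumulative-residual test for a systematic shift (all data hypothetical).
import numpy as np

rng = np.random.default_rng(4)
ne = 50
eps = rng.normal(0.0, 30.0, size=ne)   # residuals about the conditional mean, mm
eps[25:] += 25.0                       # hypothetical gauge shift from i = 26 onwards
eps -= eps.mean()                      # regression residuals sum to zero by construction

sigma = eps.std(ddof=1)
i = np.arange(1, ne + 1)
Z = np.cumsum(eps)                     # cumulative deviation Z_i (starts and ends near 0)
envelope = 1.96 * sigma * np.sqrt(i * (ne - i) / (ne - 1.0))

outside = np.flatnonzero(np.abs(Z) > envelope)
print("envelope first exceeded at i =", outside[0] + 1 if outside.size else None)
```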

Figure 9.18 (colour plate): map of the rain gauge network in south-east France, with stations including St Etienne, Isola, Entrevaux, St Sauveur, Guillaume, St Martin, Clans, Lucéram, Sigale, Levens, Coursegoule, La Colle, Nice, St Jean, Golfe Juan, Cannes, Fréjus, L'Escarène, Menton, La Turbie, St Vallier, Mons, Peillon, St Raphael and Antibes (scale bar 50 km).

Figure 9.19 (colour plate): contour maps, panels (a) and (b), over the Alpes and Mercantour area on a kilometric coordinate grid (x: 18 300 to 19 300; y: 9500 to 10 300), with mapped values ranging from about −0.3 to 0.44.