Quantitative Methods and Socio-Economic Applications in GIS [2nd ed.], ISBN 978-1-4665-8473-0 (1466584734)

The second edition of a bestseller, Quantitative Methods and Socio-Economic Applications in GIS (previously titled Quantitative Methods and Applications in GIS).


English | 301 pages [327] | 2014





Table of contents:
Section 1. GIS and basic spatial analysis tasks
Section 2. Basic quantitative methods and applications
Section 3. Advanced quantitative methods and applications


Quantitative Methods and Socio-Economic Applications in GIS

Second Edition

Fahui Wang

Boca Raton London New York

CRC Press is an imprint of the Taylor & Francis Group, an informa business

The cover image is used by permission. © 2014 Esri and its data providers.  All rights reserved.

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2015 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Version Date: 20141107
International Standard Book Number-13: 978-1-4665-8473-0 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com

In loving memory of Katherine Z. Wang To Lei and our three J’s (Jenny, Joshua, and Jacqueline)

Contents

List of Figures
List of Tables
Foreword
Preface
Author
List of Major GIS Datasets and Program Files
List of Quick References for Spatial Analysis Tasks

Section I  GIS and Basic Spatial Analysis Tasks

Chapter 1  Getting Started with ArcGIS: Data Management and Basic Spatial Analysis Tools
  1.1 Spatial and Attribute Data Management in ArcGIS
    1.1.1 Map Projections and Spatial Data Models
    1.1.2 Attribute Data Management and Attribute Join
  1.2 Spatial Analysis Tools in ArcGIS: Queries, Spatial Joins, and Map Overlays
  1.3 Case Study 1: Mapping and Analyzing Population Density Pattern in Baton Rouge, Louisiana
    1.3.1 Part 1: Mapping the Population Density Pattern across Census Tracts
    1.3.2 Part 2: Analyzing the Population Density Pattern across Concentric Rings
  1.4 Summary
  Appendix 1: Identifying Contiguous Polygons by Spatial Analysis Tools

Chapter 2  Measuring Distance and Time
  2.1 Measures of Distance
  2.2 Computing Network Distance and Time
  2.3 Distance Decay Rule
  2.4 Case Study 2: Computing Distances and Travel Time to Public Hospitals in Louisiana
    2.4.1 Part 1: Measuring Euclidean and Manhattan Distances
    2.4.2 Part 2: Measuring Travel Time
  2.5 Summary
  Appendix 2A: Valued Graph Approach to the Shortest Route Problem
  Appendix 2B: Estimating Travel Time Matrix by Google Maps API

Chapter 3  Spatial Smoothing and Spatial Interpolation
  3.1 Spatial Smoothing
    3.1.1 Floating Catchment Area (FCA) Method
    3.1.2 Kernel Density Estimation
  3.2 Point-Based Spatial Interpolation
    3.2.1 Global Interpolation Methods
    3.2.2 Local Interpolation Methods
  3.3 Case Study 3A: Mapping Place Names in Guangxi, China
    3.3.1 Part 1: Spatial Smoothing by the Floating Catchment Area Method
    3.3.2 Part 2: Spatial Interpolation by Various Methods
  3.4 Area-Based Spatial Interpolation
  3.5 Case Study 3B: Area-Based Interpolations of Population in Baton Rouge, Louisiana
    3.5.1 Part 1: Using the Areal Weighting Interpolation to Transform Data from Census Tracts to School Districts in 2010
    3.5.2 Part 2: Using the Target-Density Weighting (TDW) Interpolation to Interpolate Data from Census Tracts in 2010 to Census Tracts in 2000
  3.6 Summary
  Appendix 3A: Empirical Bayes Estimation for Spatial Smoothing
  Appendix 3B: Network Hierarchical Weighting Method for Areal Interpolation

Section II  Basic Quantitative Methods and Applications

Chapter 4  GIS-Based Trade Area Analysis and Application in Business Geography
  4.1 Basic Methods for Trade Area Analysis
    4.1.1 Analog Method and Regression Models
    4.1.2 Proximal Area Method
  4.2 Gravity Models for Delineating Trade Areas
    4.2.1 Reilly's Law
    4.2.2 Huff Model
    4.2.3 Link between Reilly's Law and Huff Model
    4.2.4 Extensions of the Huff Model
  4.3 Case Study 4A: Defining Fan Bases of Cubs and White Sox in Chicago Region
    4.3.1 Part 1: Defining Fan Base Areas by the Proximal Area Method
    4.3.2 Part 2: Defining Fan Base Areas and Mapping Probability Surface by Huff Model
    4.3.3 Discussion
  4.4 Case Study 4B: Estimating Trade Areas of Public Hospitals in Louisiana
    4.4.1 Part 1: Defining Hospital Service Areas by the Proximal Area Method
    4.4.2 Part 2: Defining Hospital Service Areas by Huff Model
  4.5 Concluding Remarks
  Appendix 4A: Economic Foundation of the Gravity Model
  Appendix 4B: A Toolkit for Implementing the Huff Model

Chapter 5  GIS-Based Measures of Spatial Accessibility and Application in Examining Health Care Access
  5.1 Issues on Accessibility
  5.2 Floating Catchment Area Methods
    5.2.1 Earlier Versions of Floating Catchment Area (FCA) Method
    5.2.2 Two-Step Floating Catchment Area (2SFCA) Method
  5.3 Gravity-Based and Generalized 2SFCA Models
    5.3.1 Gravity-Based Accessibility Index
    5.3.2 Comparison of the 2SFCA and Gravity-Based Methods
    5.3.3 Generalized 2SFCA Model
  5.4 Case Study 5: Measuring Spatial Accessibility to Primary Care Physicians in Chicago Region
    5.4.1 Part 1: Implementing the 2SFCA Method
    5.4.2 Part 2: Implementing the Gravity-Based Accessibility Model
    5.4.3 Discussion
  5.5 Concluding Comments
  Appendix 5A: A Property of Accessibility Measures
  Appendix 5B: A Toolkit of Automated Spatial Accessibility Measures

Chapter 6  Function Fittings by Regressions and Application in Analyzing Urban Density Patterns
  6.1 Density Function Approach to Urban and Regional Structures
    6.1.1 Urban Density Functions
    6.1.2 Regional Density Functions
  6.2 Function Fittings for Monocentric Models
    6.2.1 Four Simple Bivariate Functions
    6.2.2 Other Monocentric Functions
    6.2.3 GIS and Regression Implementations
  6.3 Nonlinear and Weighted Regressions in Function Fittings
  6.4 Function Fittings for Polycentric Models
    6.4.1 Polycentric Assumptions and Corresponding Functions
    6.4.2 GIS and Regression Implementations
  6.5 Case Study 6: Analyzing Urban Density Patterns in Chicago Urban Area
    6.5.1 Part 1: Function Fittings for Monocentric Models at the Census Tract Level
    6.5.2 Part 2: Function Fittings for Polycentric Models at the Census Tract Level
    6.5.3 Part 3: Function Fittings for Monocentric Models at the Township Level
  6.6 Discussions and Summary
  Appendix 6A: Deriving Urban Density Functions
  Appendix 6B: Centrality Measures and Association with Urban Densities
  Appendix 6C: OLS Regression for a Linear Bivariate Model

Chapter 7  Principal Components, Factor and Cluster Analyses, and Application in Social Area Analysis
  7.1 Principal Components Analysis
  7.2 Factor Analysis
  7.3 Cluster Analysis
  7.4 Social Area Analysis
  7.5 Case Study 7: Social Area Analysis in Beijing
  7.6 Discussions and Summary
  Appendix 7: Discriminant Function Analysis

Chapter 8  Spatial Statistics and Applications
  8.1 The Centrographic Measures
  8.2 Case Study 8A: Measuring Geographic Distributions of Racial–Ethnic Groups in Chicago Urban Area
  8.3 Spatial Cluster Analysis Based on Feature Locations
    8.3.1 Tests for Global Clustering Based on Feature Locations
    8.3.2 Tests for Local Clusters Based on Feature Locations
  8.4 Case Study 8B: Spatial Cluster Analysis of Place Names in Guangxi, China
  8.5 Spatial Cluster Analysis Based on Feature Values
    8.5.1 Defining Spatial Weights
    8.5.2 Tests for Global Clustering Based on Feature Values
    8.5.3 Tests for Local Clusters Based on Feature Values
  8.6 Spatial Regression
    8.6.1 Spatial Lag Model and Spatial Error Model
    8.6.2 Geographically Weighted Regression
  8.7 Case Study 8C: Spatial Cluster and Regression Analyses of Homicide Patterns in Chicago
    8.7.1 Part 1: Spatial Cluster Analysis of Homicide Rates
    8.7.2 Part 2: Regression Analysis of Homicide Patterns
  8.8 Summary
  Appendix 8: Spatial Filtering Methods for Regression Analysis

Section III  Advanced Quantitative Methods and Applications

Chapter 9  Regionalization Methods and Application in Analysis of Cancer Data
  9.1 Small Population Problem and Regionalization
  9.2 Spatial Order and the Modified Scale–Space Clustering (MSSC) Methods
  9.3 REDCAP Method
  9.4 Case Study 9: Constructing Geographical Areas for Analysis of Late-Stage Breast Cancer Risks in the Chicago Region
  9.5 Summary
  Appendix 9A: Poisson-Based Regression Analysis
  Appendix 9B: Toolkit of the Mixed-Level Regionalization Method

Chapter 10  System of Linear Equations and Application of Garin–Lowry Model in Simulating Urban Population and Employment Patterns
  10.1 System of Linear Equations
  10.2 Garin–Lowry Model
    10.2.1 Basic versus Nonbasic Economic Activities
    10.2.2 Model's Formulation
    10.2.3 An Illustrative Example
  10.3 Case Study 10: Simulating Population and Service Employment Distributions in a Hypothetical City
  10.4 Discussion and Summary
  Appendix 10A: Input–Output Model
  Appendix 10B: Solving a System of Nonlinear Equations
  Appendix 10C: Toolkit for Calibrating the Garin–Lowry Model
  Appendix 10D: Cellular Automata (CA) for Urban Land Use Modeling

Chapter 11  Linear Programming and Applications in Examining Wasteful Commuting and Allocating Healthcare Providers
  11.1 Linear Programming and the Simplex Algorithm
    11.1.1 LP Standard Form
    11.1.2 Simplex Algorithm
  11.2 Case Study 11A: Measuring Wasteful Commuting in Columbus, Ohio
    11.2.1 Issue of Wasteful Commuting and Model Formulation
    11.2.2 Data Preparation in ArcGIS
    11.2.3 Measuring Wasteful Commuting in an R Program
  11.3 Integer Programming and Location–Allocation Problems
    11.3.1 General Forms and Solutions for Integer Programming
    11.3.2 Location–Allocation Problems
  11.4 Case Study 11B: Allocating Health Care Providers in Baton Rouge, Louisiana
  11.5 Summary
  Appendix 11A: Hamilton's Model on Wasteful Commuting
  Appendix 11B: Coding Linear Programming in SAS
  Appendix 11C: Programming Approach to Minimal Disparity in Accessibility

Chapter 12  Monte Carlo Method and Its Application in Urban Traffic Simulation
  12.1 Monte Carlo Simulation Method
    12.1.1 Introduction to Monte Carlo Simulation
    12.1.2 Monte Carlo Applications in Spatial Analysis
  12.2 Travel Demand Modeling
  12.3 Examples of Monte Carlo–Based Spatial Simulation
  12.4 Case Study 12: Monte Carlo–Based Traffic Simulation in Baton Rouge, Louisiana
    12.4.1 Data Preparation and Program Overview
    12.4.2 Module 1: Interzonal Trip Estimation
    12.4.3 Module 2: Monte Carlo Simulation of Trip Origins and Destinations
    12.4.4 Module 3: Monte Carlo Simulation of Trip Distribution
    12.4.5 Module 4: Trip Assignment and Model Validation
  12.5 Summary

References

List of Figures

Figure 1.1 Dialog window for attribute query
Figure 1.2 Dialog window for spatial query
Figure 1.3 Dialog windows for projecting a spatial dataset
Figure 1.4 Dialog window for calculating a field
Figure 1.5 Dialog window for defining mapping symbols
Figure 1.6 Population density pattern in Baton Rouge in 2010
Figure 1.7 Dialog window for multiple ring buffer
Figure 1.8 Dialog window for the Dissolve tool
Figure 1.9 Dialog window for creating a graph in ArcGIS
Figure 1.10 Dialog window for spatial join
Figure 1.11 Population density patterns based on data at the census tract and block levels
Figure 1.12 Flow chart for Case Study 1
Figure A1.1 Rook versus queen contiguity
Figure A1.2 Workflow for defining queen contiguity
Figure 2.1 An example for the label-setting algorithm
Figure 2.2 Dialog window for geocoding hospitals based on geographic coordinates
Figure 2.3 Dialog window for geocoding hospitals based on street addresses
Figure 2.4 Dialog window for an attribute join
Figure A2.1 A valued-graph example
Figure A2.2 Dialog window for defining Toolbox properties
Figure A2.3 Google Maps API tool user interface for computing O-D travel time matrix
Figure A2.4 Estimated travel time by ArcGIS and Google
Figure 3.1 Floating catchment area method for spatial smoothing
Figure 3.2 Kernel density estimation
Figure 3.3 Zhuang and non-Zhuang place names in Guangxi, China
Figure 3.4 Dialog window for summarization
Figure 3.5 Zhuang place name ratios in Guangxi by the FCA method
Figure 3.6 Kernel density of Zhuang place names in Guangxi
Figure 3.7 Spatial interpolation of Zhuang place names in Guangxi by the IDW method
Figure 3.8 Population change rate in Baton Rouge 2000–2010
Figure 3.9 Flow chart for implementing the TDW method
Figure 4.1 Constructing Thiessen polygons for five points
Figure 4.2 Reilly's law of retail gravitation
Figure 4.3 Proximal areas for the Cubs and White Sox
Figure 4.4 Probability of choosing the Cubs by the Huff model
Figure 4.5 Proximal areas for public hospitals in Louisiana
Figure 4.6 Service areas for public hospitals in Louisiana by Huff model
Figure 4.7 Probability of visiting LSUHSC-Shreveport Hospital by Huff model
Figure A4.1 Interface for implementing the Huff model
Figure 5.1 Basic floating catchment area method in Euclidean distance
Figure 5.2 Two-step floating catchment area method in travel time
Figure 5.3 Conceptualizing distance decay in G2SFCA
Figure 5.4 Flow chart for implementing the 2SFCA in ArcGIS
Figure 5.5 Accessibility to primary care physician in Chicago region by 2SFCA
Figure 5.6 Accessibility to primary care physician in Chicago region by 2SFCA
Figure 5.7 Comparison of accessibility scores by the 2SFCA and gravity-based methods
Figure A5.1 Interface for implementing G2SFCA method
Figure 6.1 Regional growth patterns by the density function approach
Figure 6.2 Excel dialog window for regression
Figure 6.3 Excel dialog window for Format Trendline
Figure 6.4 Illustration of polycentric assumptions
Figure 6.5 Population density surface and job centers in Chicago
Figure 6.6 Density versus distance exponential trend line (census tracts)
Figure 6.7 Density versus distance exponential trend line (survey townships)
Figure 7.1 Scree plot and variance explained in principal components analysis
Figure 7.2 Major steps in principal components analysis and factor analysis
Figure 7.3 Dendrogram for a cluster analysis example
Figure 7.4 Conceptual model for urban mosaic
Figure 7.5 Districts and subdistricts in Beijing
Figure 7.6 Spatial patterns of factor scores in Beijing
Figure 7.7 Social areas in Beijing
Figure 8.1 Mean centers and ellipses for racial–ethnic groups in the Chicago area
Figure 8.2 SaTScan dialog windows for point-based spatial cluster analysis
Figure 8.3 A spatial cluster of Zhuang place names in Guangxi, China
Figure 8.4 ArcGIS dialog window for computing Getis–Ord General G
Figure 8.5 Clusters of homicide rates based on local Moran's Ii
Figure 8.6 Clusters of homicide rates based on Gi*
Figure 8.7 GeoDa dialog window for defining spatial weights
Figure 8.8 GeoDa dialog window for regression
Figure 8.9 Standard residuals in the GWR model
Figure 8.10 Spatial variations of coefficients from the GWR model
Figure 9.1 Female breast cancer death rates in Illinois for 2003–2007
Figure 9.2 Example of assigning spatial order values to areas
Figure 9.3 Example illustrating REDCAP
Figure 9.4 Interface windows in REDCAP
Figure 9.5 Late-stage breast cancer rates in zip code areas in the Chicago region in 2000
Figure 9.6 Distribution of late-stage breast cancer rates in the Chicago region in 2000
Figure 9.7 Screen shot for "Dissolve" in data aggregation
Figure 9.8 Late-stage breast cancer rates in newly defined areas in Chicago in 2000
Figure 9.9 Hot and cold spots of late-stage breast cancer rates in newly defined areas in the Chicago region in 2000
Figure A9.1 User interface of the MLR method
Figure 10.1 Interaction between population and employment distributions in a city
Figure 10.2 A simple city for illustration
Figure 10.3 Spatial structure of a hypothetical city
Figure 10.4 Population distributions in various scenarios
Figure 10.5 Service employment distributions in various scenarios
Figure A10.1 Interface of the Garin–Lowry model tool
Figure 11.1 TAZs with employment and resident workers in Columbus, Ohio
Figure 11.2 Interface of R
Figure 11.3 Five selected hospitals in the p-median model
Figure 12.1 Monte Carlo simulations of (a) resident workers, and (b) jobs
Figure 12.2 Traffic monitoring stations and adjacent areas in Baton Rouge
Figure 12.3 Workflow of the TSME
Figure 12.4 TSME interface for the intrazonal trip estimation module
Figure 12.5 TSME interface for the Monte Carlo simulation of O's and D's module
Figure 12.6 TSME interface for the Monte Carlo simulation of trips module
Figure 12.7 TSME interface for the trip assignment and validation module
Figure 12.8 Observed versus simulated traffic

List of Tables

Table 1.1 Types of Relationships in Combining Tables
Table 1.2 Types of Spatial Joins in ArcGIS
Table 1.3 Comparison of Spatial Query, Spatial Join and Map Overlay
Table 2.1 Solution to the Shortest Route Problem
Table 4.1 Fan Bases for Cubs and White Sox by Trade Area Analysis
Table 4.2 Population by Hospital Trade Areas in Louisiana
Table 5.1 Comparison of Accessibility Measures
Table A5.1 Items to Be Defined in the Accessibility Toolkit Interface
Table 6.1 Linear Regressions for a Monocentric City
Table 6.2 Polycentric Assumptions and Corresponding Functions
Table 6.3 Regressions Based on Monocentric Functions
Table 6.4 Regressions Based on Polycentric Assumptions 1 and 2
Table 7.1 Idealized Factor Loadings in Social Area Analysis
Table 7.2 Basic Statistics for Socioeconomic Variables in Beijing
Table 7.3 Eigenvalues from Principal Components Analysis
Table 7.4 Factor Loadings in Social Area Analysis
Table 7.5 Characteristics of Social Areas
Table 7.6 Zones and Sectors Coded by Dummy Variables
Table 7.7 Regressions for Testing Zonal versus Sector Structures
Table 8.1 Rotated Factor Patterns of Socioeconomic Variables in Chicago in 1990
Table 8.2 OLS and Spatial Regressions of Homicide Rates in Chicago
Table 9.1 Approaches to the Small Population Problem
Table 9.2 Descriptive Statistics for Female Breast Cancer by Zip Code and by Constructed New Areas in Chicago Metro Area in 2000
Table 9.3 Regression Results for Late-Stage Breast Cancer Risks in the Chicago Region in 2000
Table A9.1 Items to Be Defined in the MLR Toolkit Interface
Table 10.1 Simulated Scenarios of Population and Service Employment Distributions
Table 11.1 Location–Allocation Models
Table 11.2 Service Areas for the Clinics

Table 12.1 Major Tasks and Estimated Computation Time in Traffic Simulation

Foreword

This book introduces the reader in a gentle and unassuming way to the notion that the spatial structure of cities and regions is organized around ideas about the spatial geometry of cities in terms of distances, densities of occupation, and nearness or proximity, usually referred to as accessibility. These are the driving forces of the way spatial structures defining our cities self-organize into recognizable forms and functions, and during the last 50 years, they have been catalogued and researched using formal methods and models that provide a unifying sense of the way the physical form of cities is organized. In general, cities grow from some central location, traditionally the marketplace, that is often established accidentally or relates to some predominant natural advantage such as a river crossing or harbor. But as the city grows around this pole or center, it provides the essential structure of the city with its land uses and movement patterns, reinforcing the resulting configuration. Sometimes, when the centralizing forces are destroyed by those of decentralization, new hubs or centers emerge in the periphery—edge cities, thus generating landscapes which are polycentric, composed of multiple cores and clusters of different sizes which function in an autonomous whole. These are the theories and models that constitute the subject matter of this text.

Fahui Wang provides an excellent introduction to these various models and the methods that are used to link them to data and thence to prediction, but he does much more than this, for he casts all the models that he introduces into a framework which is dominated by desktop GIS, specifically ArcGIS. Not only is the text an excellent and cogent summary of the main theories that explain the spatial structure of our cities, it is a working manual for making these theories operational by turning theories into models that are then estimated or fitted to existing cities using the various software and extensions that have been developed in the field of GIS during the last 20 years. Readers are thus treated to a view of operational theory building and modeling in the social sciences with the focus on spatial structures and contemporary software, which ultimately empowers the persistent reader who works through the book with tools and methods for turning theory into practice.

This revised and extended edition of Fahui Wang's book, originally published as Quantitative Methods and Applications in GIS in 2006, is divided into three sections. Section I deals with getting started with ArcGIS, which focuses on key functions involving mapping population densities, computing distances and travel times, and interpolating and smoothing spatial surfaces from discrete points and areal data. This sets the scene for Section II, which deals with basic quantitative methods and applications, starting with defining trade areas or hinterlands, which are key factors in business geography. Measuring accessibilities that depend on gravity and distance comes next, with applications to health care, and this is followed by linear analysis that is used to transform nonlinear functions of population density into forms that can be estimated for cities. Extracting structure from data at different scales using principal components and factor analysis with clustering comes next, and this section of the book concludes with a new chapter on spatial statistics that also builds on clustering with respect to local and more global structures. Section III of the book treats more advanced topics: regionalization, which again relates to clustering; various linear methods of land use modeling and optimization; and finally, a new chapter on modeling of traffic based on Monte Carlo techniques.

A very nice feature of the book is the wealth of examples that are included. Different places, such as the state of Louisiana; Baton Rouge, Louisiana; Columbus, Ohio; and Chicago, Illinois in the United States; and Beijing and Guangxi Province of China, among other locations, are used, while the applications of these various techniques are to population density, retail trade, health care, hospital provision, social area analysis, the spatial incidence of cancer, and traffic.

In this book, Fahui Wang shows just how far GIS has progressed. Virtually everything that was developed prior to the GIS age can now be applied, further developed, and interpreted through the GIS lens, and in some respects, his examples illustrate the great range and diversity of potential applications which are the marks of a mature technology. In fact, GIS is becoming part of the routine tool kit that any analyst would use in studying data that varies across space. What is intriguing about the treatment here is that the edifice of theory of urban and regional systems that draws on locational analysis and social physics, although still advancing, is becoming increasingly integrated with GIS, and it is treatment of the subject area such as that developed here that shows how relevant these tools are to contemporary urban policy.

In fact, it is this focus on policy that marks the book. Those reading it will find that the author weaves together explanations of the theories involved with their translation into tools and models and their estimation using straightforward statistics, with notions about different ways of applying these models to real problems that have strong policy implications. At the end of the day, it is not only understanding cities better that is the quest for the tools introduced here but also understanding them in deep enough ways so that effective policies can be advanced and tested that will provide more sustainable and resilient cities—one of the challenges of near- and medium-term futures in the socioeconomic domain. This book sets a standard, shows how this can be achieved, and charts the way forward.

Michael Batty
Centre for Advanced Spatial Analysis
University College London, United Kingdom

Preface

One of the most important advancements in recent social science research (including applied social sciences and public policy) has been the application of quantitative or computational methods in studying complex human and social systems. Research centers in computational social sciences have flourished on major university campuses including Harvard University (http://www.iq.harvard.edu/), Stanford University (https://css-center.stanford.edu/), UCLA (http://ccss.ucla.edu/), University of Washington (http://julius.csscr.washington.edu/), and George Mason University (http://www.css.gmu.edu/). Many conferences have also been organized around this theme (http://computationalsocialscience.org/). Geographic Information System (GIS) has played an important role in this movement because of its capability of integrating and analyzing various data sets, in particular spatial data. The Center for Spatially Integrated Social Science (CSISS) at the University of California, Santa Barbara, funded by the National Science Foundation (1999–2007), has been an important force in promoting the usage of GIS technologies in social sciences. The Centre for Advanced Spatial Analysis (CASA) at University College London, UK, is also known for its leading efforts in applied GIS and geo-simulation research with a focus on cities. The growth of GIS has made it increasingly known as geographic information science (GISc), which covers broader issues such as spatial data quality and uncertainty, design and development of spatial data structures, social and legal issues related to GIS, and many others.

Many of today's students in geography and other social science-related fields (e.g., sociology, anthropology, business management, city and regional planning, public administration) share the same excitement about GIS. But their interest in GIS may fade away quickly if the GIS usage is limited to managing spatial data and mapping. In the meantime, a significant number of students complain that courses on statistics, quantitative methods, and spatial analysis are too dry and feel irrelevant to their interests. Over years of teaching GIS, spatial analysis, and quantitative methods, I have learned the benefits of blending them together and practicing them in case studies using real-world data. Students can sharpen their GIS skills by applying GIS techniques to detect hot spots of crime, or gain a better understanding of a classic urban land use theory by examining its spatial patterns in a GIS environment. When students realize that they can use some of the computational methods and GIS techniques to solve real-world problems in their own field, they become better motivated in class. In other words, technical skills in GIS or quantitative methods are learned in the context of addressing subject issues. Both are important for today's competitive job market.

This book is the result of my efforts to integrate GIS and quantitative (computational) methods, demonstrated in various policy-relevant socioeconomic applications. The applications are chosen with three objectives in mind. The first is to demonstrate the diversity of issues where GIS can be used to enhance studies related to socioeconomic issues and public policy. Applications range from typical themes in urban and regional analysis (e.g., trade area analysis, regional growth patterns, urban land use and transportation) to issues related to crime and health analyses. The second is to illustrate various computational methods. Some methods become easy to use once automated as GIS tools, and others rely on GIS to enhance the visualization of results. The third objective is to cover common tasks (e.g., distance and travel time estimation, spatial smoothing and interpolation, accessibility measures) and major issues (e.g., modifiable areal unit problem, rate estimates of rare events in small populations, spatial autocorrelation) that are encountered in spatial analysis.

One important feature of this book is that each chapter is task-driven. Methods can be better learned in the context of solving real-world problems. While each method is illustrated in a special case of application, it can be used to analyze different issues. Each chapter has one subject theme and introduces the method (or a group of related methods) most relevant to the theme. For example, spatial regression is used to examine the relationship between job access and homicide patterns; systems of linear equations are analyzed to predict urban land use patterns; linear programming is introduced to solve the problem of wasteful commuting and allocate healthcare facilities; and the Monte Carlo technique is illustrated in simulating urban traffic.

Another important feature of this book is the emphasis on implementation of methods. All GIS-related tasks are illustrated in the ArcGIS platform, and most statistical analyses are conducted in SAS. In other words, one may only need access to ArcGIS and SAS in order to replicate the work discussed in the book and conduct similar research. ArcGIS and SAS are chosen because they are the leading software packages for GIS and statistical analysis, respectively. Some specific tasks such as spatial clustering and spatial regression use free software that can be downloaded from the Internet. Most data used in the case studies are publicly accessible. Instructors and advanced readers may use the data sources and techniques discussed in the book to design their class projects or craft their own research projects. A link to the website is provided for downloading all data and computer programs used in the case studies (see "List of Major GIS Datasets and Program Files"). I plan to run a blog under my homepage (http://ga.lsu.edu/faculty/fahui-wang/) in support of the book.

The book has 12 chapters. Section I includes the first three chapters, covering some generic issues such as an overview of data management in GIS and basic spatial analysis tools (Chapter 1), distance and travel time measurement (Chapter 2), and spatial smoothing and interpolation (Chapter 3). Section II includes Chapters 4 through 8, covering some basic quantitative methods that require little or no programming skills: trade area analysis (Chapter 4), accessibility measures (Chapter 5), function fittings (Chapter 6), factor analysis (Chapter 7), and spatial statistics (Chapter 8). Section III includes Chapters 9 through 12, covering more advanced topics: regionalization (Chapter 9), a system of linear equations (Chapter 10), linear programming (Chapter 11), and Monte Carlo simulation (Chapter 12). Sections I and II may serve an upper-level undergraduate course. Section III may be used for a graduate course. It is assumed that readers have some basic GIS and statistical knowledge equivalent to one introductory GIS course and one elementary statistical course.

Each chapter focuses on one computational method except for the first chapter. In general, a chapter (1) begins with an introduction to the method, (2) discusses a theme to which the method is applied, and (3) uses a case study to implement the method using GIS. Some important issues, if not directly relevant to the main theme of a chapter, are illustrated in appendixes. Many important tasks are repeated in different projects to reinforce the learning experience (see "Quick Reference for Spatial Analysis Tasks").

My interest in quantitative methods has been very much influenced by my doctoral advisor, Jean-Michel Guldmann, in the Department of City and Regional Planning of The Ohio State University. I learned linear programming and solving a system of linear equations in his courses on static and dynamic programming. I also benefited a great deal from the mentorship of Donald Haurin in the Department of Economics of The Ohio State University. The topics on urban and regional density patterns and wasteful commuting can be traced back to his inspiring urban economics course. Philip Viton, also in the Department of City and Regional Planning of The Ohio State University, taught me much of the econometrics. I only wish I could have been a better student then.

It has been 8 years since the publication of the previous edition of this book. Several friends and many users, especially my students, have found some errors and given me valuable feedback. Several tasks in the previous edition had to be implemented in the deprecated ArcInfo workstation environment (and its associated AML program). The software, particularly ArcGIS, has advanced with increasing capacities and more user-friendly interfaces. Numerous people have urged me to automate some popular tasks in Python. I have also collected a few new tricks from my own research that I hope to share with a broader audience. The reasons for preparing a new version just kept piling on. It came to a point where I could not find any more excuses for not doing it when my sabbatical request was approved by Louisiana State University in early 2013. Inevitably I have lost a step or two over the years. The added administrative duty did not help either. I truly felt that the revision took more work than the previous edition. I used most of my summer and the sabbatical in the fall of 2013 on the book, and often found myself working on the book in my office very early in the mornings of weekdays and weekends for much of the spring semester of 2014. In this version, several popular tasks such as trade area delineation, spatial accessibility measures (2SFCA), mixed-level regionalization (MLR), and the Garin–Lowry model are automated as convenient toolkits in ArcGIS; and some common tasks such as REDCAP regionalization, the wasteful commuting measure, and Monte Carlo simulation are now implemented in more user-friendly programs. All case studies have been tested multiple times, and instructions are based on ArcGIS 10.2. I would like to believe that it was the amount of revision work that merited the investment of so much time. It could be an understatement to say that more than 60% of the materials (including case studies) are new. I hope that the readers will be convinced likewise that it is a worthy cause.

I am so grateful for the generous help from several individuals in preparing this second edition. Yujie Hu coded the program TSME used in the case study in Chapter 12 and coauthored Chapter 12. He also tested all case studies and corrected numerous errors. Haojie Zhu implemented the Python programs for three toolkits (trade area delineation, spatial accessibility measures, and Garin–Lowry model). Both Yujie and Haojie are PhD students in the Department of Geography and Anthropology, Louisiana State University. Lan Mu in the Department of Geography, University of Georgia, implemented the mixed-level regionalization method in Python and coauthored Appendix 9B in Chapter 9. Xinyue Ye in the Department of Geography, Kent State University, coauthored Appendix 10D on cellular automata for urban land use modeling in Chapter 10. Carmi J. Neiger of Elmhurst College tested the case studies in Chapters 1 through 4 and edited the four chapters. I thank Michael Batty for graciously writing the new Foreword on short notice. Finally, I would like to thank the editorial team at Taylor & Francis: acquisition editor, Irma Britton; production coordinator, Laurie Schlags; and many others, including typesetters, proofreaders, cartographers, and computer specialists. Thank you all for guiding me through the whole process.

This book is intended mainly to serve students in geography, urban and regional planning, public policy, and related fields. It can be used in courses such as (1) spatial analysis, (2) location analysis, (3) quantitative methods in geography, and (4) applications of GIS in business and social science. The book can also be useful for social scientists in general with research interests related to spatial issues. Some in urban economics may find the studies on urban structures and wasteful commuting relevant, and others in business may find the chapters on trade area analysis and accessibility measures useful. The case study on crime patterns may interest criminologists, and the one on regionalization and cancer analysis may find an audience among epidemiologists. Some basic GIS knowledge (e.g., one introductory GIS course) will help readers navigate through the case studies smoothly, but is not required. Additional material including the datasets and program files is available from the CRC website: http://www.crcpress.com/product/isbn/9781466584723.

Author

Fahui Wang is James J. Parsons Professor and Chair of the Department of Geography and Anthropology, Louisiana State University. He earned a BS degree in geography from Peking University, China, and an MA degree in economics and a PhD degree in city and regional planning from The Ohio State University. His research interests include GIS applications in human geography (urban, economic, and transportation), city and regional planning, and public policy (crime and health). His work has been supported by the National Science Foundation, U.S. Department of Energy, U.S. Department of Health and Human Services (Agency for Healthcare Research & Quality and the National Cancer Institute), U.S. Department of Housing and Urban Development, U.S. Department of Justice (National Institute of Justice and Office of Juvenile Justice and Delinquency Prevention), and the National Natural Science Foundation of China. He is on the editorial boards of several international journals including Annals of the Association of American Geographers, Annals of GIS, Applied Geography, and Chinese Geographical Science. He has published over 100 refereed articles.


List of Major GIS Datasets and Program Files*

Case Study | Data Folder | GIS Dataset | Program File | Study Area
1   | BatonRouge     | BR.gdb, Census.gdb        | (none)                                                | East Baton Rouge Parish
2   | Louisiana      | LA_State.gdb, LAState.gdb | (none)                                                | Louisiana
3A  | China_GX       | GX.gdb                    | (none)                                                | Guangxi, China
3B  | BatonRouge     | BR.gdb                    | (none)                                                | East Baton Rouge Parish
4A  | Chicago        | ChiRegion.gdb             | Huff Model.tbx                                        | Chicago Region
4B  | Louisiana      | LA_State.gdb, LAState.gdb | Huff Model.tbx                                        | Louisiana
5   | Chicago        | ChiRegion.gdb             | Accessibility.tbx                                     | Chicago Region
6   | Chicago        | ChiUrArea.gdb             | monocent.sas, polycent.sas, mono_twnshp.sas           | Chicago Urban Area
7   | Beijing        | BJSA.gdb                  | PCA_FA_CA.sas, BJreg.sas                              | Beijing, China
8A  | Chicago        | ChiUrArea.gdb             | (none)                                                | Chicago Urban Area
8B  | China_GX       | GX.gdb                    | (none)                                                | Guangxi, China
8C  | Chicago        | ChiCity.gdb               | PCA_FA.sas                                            | Chicago City
9   | Chicago/Chizip | Chizip shapefiles         | poisson_zip.sas, MixedLevel Regionalizaiton Tools.tbx | Chicago Region
10  | SimuCity       | SimuCity.gdb              | Garin-Lowry Model.tbx                                 | Hypothetical city
11A | Columbus       | Columbus.gdb              | WasteComm.R, LP.sas                                   | Columbus, Ohio
11B | BatonRouge     | BR.gdb                    | (none)                                                | East Baton Rouge Parish
12  | BatonRouge     | BRMSA.gdb                 | TSME.exe, Features To Text File.tbx                   | Baton Rouge MSA

* http://www.crcpress.com/product/isbn/9781466584723

List of Quick References for Spatial Analysis Tasks

Task | Section First Introduced (Step) | Figure Illustration | Sections (Steps) Repeated
2SFCA method (automated toolkit) | Appendix 5B | Figure A5.1 |
Areal weighting interpolation | Section 3.5.1 (1–3) | |
Attribute join | Section 2.4.1 (5) | Figure 2.4 | Section 3.3.1 (3), Section 4.3.2 (6)
Attribute query | Section 1.3.1 (1) | Figure 1.1 | Section 3.3.1 (4)
Dissolve | Section 1.3.2 (10) | Figure 1.8 |
Euclidean distance matrix computation | Section 2.4.1 (3) | | Section 3.3.1 (2), Section 4.3.2 (5)
Geocoding by address locator | Section 2.4.1 (1) | Figure 2.3 |
Geocoding by geographic coordinates | Section 2.4.1 (1) | Figure 2.2 |
Geographically weighted regression (GWR) | Section 8.7.2 (8) | |
Getis–Ord General G | Section 8.7.1 (3) | |
Huff model (automated toolkit) | Appendix 4B | Figure A4.1 |
IDW spatial interpolation | Section 3.3.2 (10) | |
Kernel density estimation (KDE) | Section 3.3.2 (9) | |
Local Gi* | Section 8.7.1 (4) | |
Local Moran's Ii (LISA) | Section 8.7.1 (4) | |
Mean center | Section 8.2 (1) | |
Moran's I | Section 8.7.1 (3) | |
Network dataset building | Section 2.4.2 (8) | |
OLS regression in ArcGIS | Section 8.7.2 (5) | |
Projection of spatial data | Section 1.3.1 (3) | Figure 1.3 |
Proximal area by Euclidean distance | Section 4.3.1 (2–3) | |
Proximal area by travel time | Section 4.4.1 (1–3) | |
Spatial join | Section 1.3.2 (13) | Figure 1.10 |
Spatial query | Section 1.3.1 (2) | Figure 1.2 |
Spatial regression in GeoDa | Section 8.7.2 (7) | |
Standard deviational ellipse | Section 8.2 (3) | |
Standard distance | Section 8.2 (2) | |
Summarize attribute | Section 3.3.1 (5) | Figure 3.4 | Section 6.5.1 (1)
Travel time O-D matrix computation | Section 2.4.2 (9–12) | | Section 4.3.1 (4), Section 4.3.2 (7)
Weighted centroids | Section 4.3.1 (1) | | Section 10.3 (1, 3), Section 11.2.2 (3)

Section I GIS and Basic Spatial Analysis Tasks

1 Getting Started with ArcGIS: Data Management and Basic Spatial Analysis Tools

A Geographic Information System (GIS) is a computer system that captures, stores, manipulates, queries, analyzes, and displays geographically referenced data. Among the diverse set of tasks a GIS can do, mapping remains the primary function. The first objective of this chapter is to demonstrate how GIS is used as a computerized mapping tool. The key skill involved in this task is the management of spatial and aspatial (attribute) data and the linkage between them. However, GIS goes beyond mapping and has been increasingly used in various spatial analysis tasks as GIS software has become more capable and also user-friendly for these tasks. The second objective of this chapter is to introduce some basic tools for spatial analysis in GIS. Given its wide usage in education, business and governmental agencies, ArcGIS has been chosen as the major software platform to implement GIS tasks in this book. Unless otherwise noted, studies in this book are based on ArcGIS 10.2. All chapters are structured in a similar fashion, beginning with conceptual discussions to lay out the foundation for methods, followed by case studies to acquaint the readers with related techniques. This chapter serves as a quick warm-up to prepare readers for more advanced spatial analysis in later chapters. Section 1.1 offers a quick tour of spatial and attribute data management in ArcGIS. Section 1.2 surveys basic spatial analysis tools in ArcGIS including queries, spatial joins and map overlays. Section 1.3 uses a case study to illustrate the typical process of GIS-based mapping and introduces some of the basic spatial analysis tools in ArcGIS. The chapter concludes with a brief summary. In addition, Appendix 1 demonstrates how to use GIS-based spatial analysis tools to identify spatial relationships such as polygon adjacency. Polygon adjacency defines spatial weights often needed in advanced spatial statistical studies such as spatial cluster and regression analyses (see Chapter 8).

1.1 SPATIAL AND ATTRIBUTE DATA MANAGEMENT IN ArcGIS

Since ArcGIS is chosen as the primary software platform for this book, it is helpful to have a brief overview of its major modules and functions. ArcGIS was released in 2001 by Environmental Systems Research Institute, Inc. (ESRI) with a full-featured graphic user interface (GUI) to replace the older versions of ArcInfo that were based on a command-line interface.


Prior to the release of ArcGIS, ESRI had an entry-level GIS package with a GUI called ArcView. ArcView is now referred to as ArcGIS for Desktop Basic. ArcGIS contains three major modules: ArcCatalog, ArcMap, and ArcToolbox. ArcCatalog views and manages spatial data files. ArcMap displays, edits, and analyzes the spatial data as well as attribute data. ArcToolbox contains various task-specific tools for data analysis such as data conversion, data management, spatial analysis, and spatial statistics. ArcCatalog and ArcMap can be launched separately from the Start button on the Windows taskbar, and either module can also be accessed by clicking its distinctive symbol inside the other modules. ArcToolbox is accessed in either ArcMap or ArcCatalog. ArcMap is used most often, and thus is the module being referred to in the text unless otherwise noted.

1.1.1 Map Projections and Spatial Data Models

GIS differs from other information systems because of its unique capability of managing geographically referenced or spatial (location) data. Understanding the management of spatial data in GIS requires knowledge of the data's geographic coordinate system with longitude and latitude values and its representations on various plane coordinate systems with (x,y) coordinates (map layers). Transforming the Earth's spherical surface to a plane surface is a process referred to as projection. The two map projections most commonly used in the US are the Universal Transverse Mercator (UTM) and the State Plane Coordinate System (SPCS). Strictly speaking, the SPCS is not actually a projection; instead, it is a set of more than 100 geographic zones or coordinate systems for specific regions of the US. To achieve minimal distortion, north-south oriented states or regions use a Transverse Mercator projection, and east-west oriented states or regions use a Lambert Conformal Conic projection. Some states (e.g., Alaska, New York) use more than one projection. For more details, one may refer to the ArcGIS online documentation, "Understanding Map Projections."

In ArcGIS, ArcMap automatically converts data of different coordinate systems to that of the first dataset added to the frame in map displays. This feature is commonly referred to as on-the-fly reprojection. However, this may be a time-consuming process if the dataset is very large. More importantly, one cannot obtain meaningful measurements such as distance or area on a dataset in a geographic coordinate system that is not projected. It is a good practice to use the same projection for all data layers in one project.

To check the existing projection for a spatial dataset in ArcGIS, one may use ArcCatalog by right-clicking the layer > Properties > XY Coordinate System, or use ArcMap by right-clicking the layer > Properties > Source. Projection-related tasks are conducted under ArcToolbox > Data Management Tools > Projections and Transformations. If the spatial data are vector based, proceed to choose Project to transform the coordinate systems. The tool provides the option to define a new coordinate system or import the coordinate system from an existing geodataset. If the spatial data are raster based, choose Define Projection (or other suitable options). Step 3 in Section 1.3 illustrates how to transform a spatial dataset from a geographic coordinate system to a projected coordinate system.
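The Project tool can also be run from Python with ArcPy. A minimal sketch is shown below; the workspace path is a placeholder to adjust, and the layer names follow step 3 of Case Study 1, where the coordinate system of an already-projected layer is borrowed for an unprojected one:

```python
import arcpy

arcpy.env.workspace = r"C:\QuantGIS\BatonRouge"  # placeholder path to the project folder

# Borrow the coordinate system of the projected layer BRCenter and apply it to
# the unprojected census tracts BRTrt (as in step 3 of Section 1.3)
sr = arcpy.Describe("BR.gdb/BRCenter").spatialReference
arcpy.Project_management("BR.gdb/BRTrt", "BR.gdb/BRTrtUtm", sr)
```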


Traditionally, a GIS uses either the vector or the raster data model to manage spatial data. A vector GIS uses geographically referenced points to construct spatial features of points, lines, and areas; a raster GIS uses grid cells in rows and columns to represent spatial features. The vector data structure is used in most socioeconomic applications, and also in most case studies in this book. The raster data structure is simple and widely used in environmental studies, physical sciences, and natural resource management. Most commercial GIS software can convert from vector into raster data, or vice versa. In ArcGIS, the tools are available under ArcToolbox > Conversion Tools.

Earlier versions of ESRI's GIS software used the coverage data model. Later, shapefiles were developed for the ArcView package. Since the release of ArcGIS 8, the geodatabase model has become available, and represents the current trend of object-oriented data models. The object-oriented data model stores the geometries of objects (spatial data) together with the attribute data, whereas the traditional coverage or shapefile model stores spatial and attribute data separately. Spatial data used for case studies in this book are provided almost entirely in geodatabases (with rare exceptions such as for network datasets).

Spatial and attribute data in socioeconomic analysis often come from different sources, and a typical task is to join them together in GIS for mapping and analysis. This involves attribute data management as discussed below.

1.1.2 Attribute Data Management and Attribute Join

A GIS includes both spatial and attribute data. Spatial data capture the geometry of map features, and attribute data describe the characteristics of map features. Attribute data are usually stored as tabular files or tables. ArcGIS reads several types of table formats. Shapefile attribute tables use the dBase format, ArcInfo Workstation uses the INFO format, and geodatabase tables use a relational database format. ArcGIS can also read several text formats, including comma-delimited and tab-delimited text.

Some of the basic tasks of data management can be done in either ArcCatalog or ArcMap, and some can be done in only one or the other. Creating a new table or deleting/copying an existing table is done only in ArcCatalog. Recall that ArcCatalog is for viewing and managing GIS data files. For example, a dBase table is created in ArcCatalog by right-clicking the folder where the new table will be placed > New > dBase Table. A table can be deleted, copied, or renamed in ArcCatalog by right-clicking the table > Delete (or Copy, Rename). Adding a new field or deleting an existing field in a table (e.g., the attribute table associated with a spatial feature or a standalone file) can be done in ArcCatalog or ArcMap. For example, to add a field to a table in ArcCatalog, right-click the table (or the layer) > Properties > click Fields > type the new field name into an empty row in the Field Name column and define its Data Type; to delete a field, simply choose the entire row for the field and click the delete key (and click Apply to confirm the changes). To add a field in ArcMap, right-click the table (or the layer) in the Table of Contents window > Open (or Open Attribute Table) > in the open table, click the Table Options dropdown icon > Add Field; to delete a field, choose the entire column of the field, right-click and choose Delete Field. Updating values for a field


in a table is done in ArcMap: on the open table > right-click the field, choose Field Calculator to access the tool. In addition, basic statistics for a field can be obtained in ArcMap by right-clicking the field and choosing Statistics. Step 5 in Section 1.3 demonstrates how to create a new field and update its values.

In GIS, an attribute join is often used to link information in two tables based on a common field (key). The table can be an attribute table associated with a particular geodataset or a standalone table. In an attribute join, the names of the fields used to link the two tables do not need to be identical in the two tables to be combined, but the data types must match. There are various relationships between tables when linking them: one-to-one, many-to-one, one-to-many, and many-to-many. For either a one-to-one or a many-to-one relationship, a join is used to combine two tables in ArcGIS. However, when the relationship is one-to-many or many-to-many, a join cannot be performed. ArcGIS uses a relate to link two tables while keeping the two tables separated. In a relate, one or more records are selected in one table, and the associated records are selected in another table. Table 1.1 summarizes the relationships and corresponding tools in ArcGIS.

A join or relate is performed in ArcMap. Under Table of Contents (TOC), right-click the spatial dataset or the table that is to become the target table, and choose Joins and Relates > Join (or Relate) > In the Join Data dialog window, choose "Join attributes from a table," and select (1) the common field in the target table, (2) the source table, and (3) the common field in the source table. It is important to understand the different roles of the target and source tables in the process. As a result, the target table is expanded by adding the fields from the source table. A join is temporary in the sense that no new data are created and the join relationship is lost once the project is exited without being saved. To preserve the resulting layer (or combined table), right-click it under Table of Contents > Data > Export Data (or Export) to save it in a new spatial dataset (table). If it is desirable to only save the joined attribute table of a spatial dataset, in the open attribute table, click the Table Options dropdown icon > Export. The latter can be termed "permanent join." Alternatively, one may use the tools in ArcToolbox to implement attribute joins: ArcToolbox > Data Management Tools > Joins > choose Add Join for a temporary join or Join Field for a permanent join. These ArcToolbox tools

TABLE 1.1 Types of Relationships in Combining Tables

Relationship | Match | Join or Relate in ArcGIS
One-to-one | One record in the target table → one record in the source table | Join
Many-to-one | Multiple records in the target table → one record in the source table | Join
One-to-many | One record in the target table → multiple records in the source table | Relate
Many-to-many | Multiple records in the target table → multiple records in the source table | Relate


are particularly useful for automating the process in programming (see step 8 in Section 5.4.1). Once the attribute information is joined to a spatial layer, mapping the data is also done in ArcMap. Right-click the layer and choose Properties > Symbology to invoke the dialog window. In the dialog, one can select a field to map, choose colors and symbols, and plan the layout. From the main menu bar, choose View > Layout View to see what the draft map looks like. Map elements (title, scale, north arrow, legends, and others) can be added to the map by clicking Insert from the main menu bar.
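As an illustration of that automation, a rough ArcPy sketch of the Add Join and Join Field tools mentioned above is given below. The standalone table, key field, and transferred field (income.dbf, GEOID10, MedInc) are hypothetical names for illustration, not files from the book's datasets:

```python
import arcpy

tracts = r"C:\QuantGIS\BatonRouge\BR.gdb\BRTrtUtm"  # placeholder path

# Temporary join: Add Join operates on a layer (or table view), not on the feature class itself
arcpy.MakeFeatureLayer_management(tracts, "tract_lyr")
arcpy.AddJoin_management("tract_lyr", "GEOID10", "income.dbf", "GEOID10", "KEEP_ALL")

# Permanent join: Join Field writes the chosen fields into the target table
arcpy.JoinField_management(tracts, "GEOID10", "income.dbf", "GEOID10", ["MedInc"])
```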

1.2 SPATIAL ANALYSIS TOOLS IN ArcGIS: QUERIES, SPATIAL JOINS, AND MAP OVERLAYS

Many spatial analysis tasks utilize the information on how spatial features are related to each other in terms of location. Spatial operations such as "queries," "spatial joins," and "map overlays" provide basic tools for conducting such tasks. Queries include attribute (aspatial) queries and spatial queries. An attribute query uses information in an attribute table to find attribute information (in the same table) or spatial information (features in a spatial data layer). Attribute queries are accessed in ArcMap: either (1) under Selection from the main menu bar, choose Select By Attributes, or (2) in an open table, click the Table Options dropdown icon > Select By Attributes. Both allow users to select spatial features based on an SQL query expression using attributes (or simply select attribute records from a standalone table). One may also select features on the screen either in a map or a table by using the option "Interactive Selection Method" under Selection in the main menu bar (or directly clicking the Select Features graphic icon under the main menu).

Compared to other information systems, a unique feature of GIS is its capacity to perform spatial queries, which find information based on locational relationships between features from different layers. The option Select By Location under Selection in the main menu bar searches for features in one layer based on their spatial relationships with features in another layer. The spatial relationships are defined by various operators including "intersect," "are within a distance of," "contain," "touch," etc. In Case Study 1 (Section 1.3), step 1 demonstrates the use of an attribute query, and step 2 is an example of a spatial query. The selected features in a spatial dataset (or selected records in a table), whether from an attribute query or a spatial query, can be exported to a new dataset (or a new table) by following the steps explained in Section 1.1.2.

While an attribute join utilizes a common field between two tables, a spatial join uses the locations of spatial features, such as overlap or proximity between two layers. Similar to the attribute join, the source layer and target layer play different roles in a spatial join. Based on the join, information from the source layer is processed and transferred to the features in the target layer, and the result is saved in a new spatial dataset. If one object in the source layer corresponds to one or multiple objects in the target layer, it is a simple join.


For example, by spatially joining a polygon layer of counties (source layer) with a point layer of school locations (target layer), attributes of each county (e.g., FIPS code, name, administrator) are assigned to schools that fall within the county boundary. If multiple objects in the source layer correspond to one object in the target layer, two operations may be performed: "summarized join" and "distance join." A summarized join summarizes the numeric attributes of features in the source layer (e.g., average, sum, minimum, maximum, standard deviation, or variance) and adds the information to the target layer. A distance join identifies the closest object (out of many) in the source layer from the matching object in the target layer, and transfers all attributes of this nearest object (plus a distance field showing how close they are) to the target layer. For example, one may spatially join a point layer of geocoded crime incidents (source layer) with a polygon layer of census tracts (target layer), and generate aggregated crime counts in those census tracts. There are a variety of spatial joins between different spatial features (Price, 2004, pp. 287–288). Table 1.2 summarizes all types of spatial joins in ArcGIS.

Spatial joins are accessed in ArcMap similar to attribute joins: right-click the target layer > choose Joins and Relates > Join. In the Join Data dialog window, choose "Join data from another layer based on spatial location" instead of "Join attributes from a table." In Section 1.3.2, step 13 illustrates the use of a spatial join, in particular, a case of a summarized join.

Map overlays may be broadly defined as any spatial analysis involving modifying features from different layers. The following reviews some of the most commonly used map overlay tools: Clip, Erase, Intersect, Union, Buffer, and Multiple Ring Buffer. A Clip truncates the features from one layer using the outline of another. An Erase trims away the features from the input layer that fall inside the erase layer. An Intersect overlays two layers and keeps only the areas that are common to both. A Union also overlays two layers but keeps all the areas from both layers. A Buffer creates areas by extending outward from point, line, or polygon features over a specified distance. A Multiple Ring Buffer generates buffer features based on a set of distances. In ArcGIS 10.2, the above map overlay tools are grouped under different toolsets in ArcToolbox > Analysis Tools: Clip is under the Extract toolset; Erase, Intersect and Union are under the Overlay toolset; and Buffer and Multiple Ring Buffer are grouped under the Proximity toolset. Other common map overlay tools used in case studies elsewhere in the book include Point Distance (see step 3 in Section 2.4.1) and Near (see step 2 in Section 4.3.1). In Section 1.3.2, step 7 uses the multiple ring buffer tool, and step 8 uses the intersect tool.

One may notice the similarity among spatial queries, spatial joins, and map overlays. Indeed, many spatial analysis tasks may be accomplished by any one of the three. Table 1.3 summarizes the differences between them. A spatial query only finds the information on the screen and does not create new datasets (unless one chooses to export the selected records or features). A spatial join always saves the result in a new layer. There is an important difference between spatial joins and map overlays. A spatial join merely identifies the location relationship between spatial features of input layers, and does not change existing spatial features or create new features.
In the process of a map overlay, some of the input features are split, merged, or dropped in generating a new layer. In general, a map overlay operation takes more computation time than a spatial join, and a spatial join takes more time than a spatial query.
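Both query types can also be scripted. The ArcPy sketch below is a rough, unofficial equivalent of steps 1 and 2 in Case Study 1 (Section 1.3), with the workspace path as a placeholder:

```python
import arcpy

arcpy.env.workspace = r"C:\QuantGIS\BatonRouge"  # placeholder path

# Attribute query: select a county by name and export the selection
arcpy.MakeFeatureLayer_management("Census.gdb/County", "county_lyr")
arcpy.SelectLayerByAttribute_management("county_lyr", "NEW_SELECTION",
                                        "NAMELSAD10 = 'East Baton Rouge Parish'")
arcpy.CopyFeatures_management("county_lyr", "BR.gdb/EBRP")

# Spatial query: select the tracts falling within the exported county and export them
arcpy.MakeFeatureLayer_management("Census.gdb/Tract", "tract_lyr")
arcpy.SelectLayerByLocation_management("tract_lyr", "WITHIN", "BR.gdb/EBRP")
arcpy.CopyFeatures_management("tract_lyr", "BR.gdb/BRTrt")
```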

TABLE 1.2 Types of Spatial Joins in ArcGIS

Source layer (S): Point; Target layer (T): Point
  Summarized join: For each point in T, find all the points in S closer to this point than to any other point in T and transfer the points' summarized attributes to T.
  Distance join: For each point in T, find its closest point in S and transfer attributes of that closest point to T.

Source layer (S): Line; Target layer (T): Point
  Summarized join: For each point in T, find all lines in S that intersect it and transfer the lines' summarized attributes to T.
  Distance join: For each point in T, find the closest line in S and transfer attributes of that line to T.

Source layer (S): Polygon; Target layer (T): Point
  Simple join: For each point in T, find the polygon in S containing the point and transfer the polygon's attributes to T.
  Distance join: For each point in T, find its closest polygon in S and transfer attributes of that polygon to T.

Source layer (S): Point; Target layer (T): Line
  Summarized join: For each line in T, find all the points in S that either intersect or lie closest to it and transfer the points' summarized attributes to T.
  Distance join: For each line in T, find its closest point in S and transfer the point's attributes to T.

Source layer (S): Line; Target layer (T): Line
  Simple join: For each line in T, find the line in S that it is part of and transfer the attributes of the source line to T.
  Summarized join: For each line in T, find all the lines in S that intersect it and transfer the summarized attributes of intersected lines to T.

Source layer (S): Polygon; Target layer (T): Line
  Simple join: For each line in T, find the polygon in S that it falls completely inside and transfer the polygon's attributes to T.
  Summarized join: For each line in T, find all the polygons in S crossed by it and transfer the polygons' summarized attributes to T.
  Distance join: For each line in T, find the polygon in S that it is closest to and transfer the polygon's attributes to T.

Source layer (S): Point; Target layer (T): Polygon
  Summarized join: For each polygon in T, find all the points in S that fall inside it and transfer the points' summarized attributes to T.
  Distance join: For each polygon in T, find its closest point in S and transfer the point's attributes to T.

Source layer (S): Line; Target layer (T): Polygon
  Summarized join: For each polygon in T, find all the lines in S that intersect it and transfer the lines' summarized attributes to T.
  Distance join: For each polygon in T, find its closest line in S and transfer the line's attributes to T.

Source layer (S): Polygon; Target layer (T): Polygon
  Simple join: For each polygon in T, find the polygon in S that it falls completely inside and transfer the attributes of the source polygon to T.
  Summarized join: For each polygon in T, find all the polygons in S that intersect it and transfer the summarized attributes of intersected polygons to T.

TABLE 1.3 Comparison of Spatial Query, Spatial Join, and Map Overlay

Basic Spatial Analysis Tool | Function | Whether a New Layer Is Created | Whether New Features Are Created | Computation Time
Spatial query | Finds information based on location relationships between features from different layers and displays it on screen | No (unless the selected features are exported to a new dataset) | No | Least
Spatial join | Identifies location relationships between features from different layers and transfers the attributes to the target layer | Yes | No | Between
Map overlay | Overlays layers to create new features and saves the result in a new layer | Yes | Yes (splitting, merging or deleting features to create new) | Most

1.3 CASE STUDY 1: MAPPING AND ANALYZING POPULATION DENSITY PATTERN IN BATON ROUGE, LOUISIANA

For readers with little previous exposure to GIS, nothing demonstrates the value of GIS and eases the fear of GIS complexity better than the experience of mapping data from the public domain with just a few clicks. As mentioned earlier, GIS goes far beyond mapping, and the main benefits lie in its spatial analysis capacity. The first case study illustrates how spatial and attribute information are managed and linked in a GIS, and how the information is used for mapping and analysis. Part 1 uses the 2010 US Census data to map the population density pattern across census tracts in a county, and Part 2 utilizes some basic spatial analysis tools to analyze the distance decay pattern of density across concentric rings from the city center. The procedures are designed so that most functions introduced in Sections 1.1 and 1.2 will be utilized.

1.3.1 Part 1: Mapping the Population Density Pattern across Census Tracts

A GIS project begins with data acquisition, and often uses existing data. For socioeconomic applications in the US, the Topologically Integrated Geographic Encoding and Referencing (TIGER) files from the Census Bureau are the major source for spatial data, and the decennial census data from the same agency are the major source for attribute data. Both can be accessed at the website www.census.gov. Advanced users may download (1) various TIGER (mostly 2010) data directly from the TIGER Products website http://www.census.gov/geo/maps-data/data/tiger.html and (2) the 2010 Census Data Products at


http://www.census.gov/population/www/cen2010/glance/index.html, and use other programs to process the Census Data and then ArcGIS to link them together. Here for convenience, we use the TIGER/Line® Shapefiles Prejoined with Demographic Data, available at http://www.census.gov/geo/maps-data/data/tigerdata.html, and specifically the "2010 Census Demographic Profile 1–Geodatabase Format (County and Census Tract)." The data include more than 200 items such as total population, age, sex, race, and ethnicity, average household size, total housing units, housing tenure, and many others (see the file DP_TableDescriptions.xls under the project folder "BatonRouge" for variable descriptions). The dataset is available for the whole US at three levels (state, county, and census tracts) prejoined with attribute variables, and is therefore sizeable. We prepared the data for Louisiana, and kept all attribute variables for the state and county levels, but only selected ones for the census tract level to reduce the dataset size. Another convenient source is the data DVDs that come with each ArcGIS software release from ESRI. For historical census data in GIS format, visit the National Historical Geographic Information System (NHGIS) website at www.nhgis.org.

Data needed for Part 1 of the project are provided under the folder BatonRouge:
1. Features State, County and Tract in geodatabase Census.gdb (representing the state, county, and census tract level, respectively).
2. Feature BRCenter in geodatabase BR.gdb (representing the city center of Baton Rouge).

Throughout this book, all computer file names, field names, and also some computer commands are in the Courier New font. The following is a step-by-step guide for the project. In many cases, GIS analysts need to extract a study region that is contained in a larger area. This project begins with deriving the census tracts in East Baton Rouge Parish ("parish" is a county-equivalent unit in Louisiana). The task is accomplished by steps 1 and 2.

Step 1. Using an attribute query to extract the parish boundary: In ArcMap, click the Add Data tab to add all three features in geodatabase Census.gdb and the feature BRCenter in geodatabase BR.gdb to the project. From the main menu, choose Selection > Select By Attributes >. In the dialog window (Figure 1.1), make sure that the Layer is County; in the bottom box, input NAMELSAD10 = 'East Baton Rouge Parish'*; and click Apply (or OK). The parish is selected and highlighted on the map. The selection is based on the attributes of a GIS dataset, and thus it is an attribute query. In the layers window under Table Of Contents (TOC), right-click the layer County > Data > Export Data >.

* For convenience and accuracy, input a field name such as "NAMELSAD10" by double-clicking the field from the top window. The parish name "East Baton Rouge Parish" is case sensitive.


FIGURE 1.1 Dialog window for attribute query.

In the dialog window, make sure that Selected features are in the top dropdown bar; under Output feature class, navigate to geodatabase BR.gdb and name it EBRP; and click OK. Click Yes to add the exported data to the map as a new layer.

Step 2. Using a spatial query to extract census tracts in the parish: From the main menu of ArcMap, choose Selection > Select By Location >. In the dialog window (Figure 1.2), check Tract under Target layer(s), choose EBRP as Source layer, choose "are within the source layer feature" under Spatial selection method for target layer feature(s), and click OK. All census tracts within the parish are selected. The selection is based on the spatial relationship between two layers, and thus it is a spatial query. Similar to step 1, return to the layers window, and right-click the layer Tract > Data > Export Data > export the selected features to a new feature BRTrt in geodatabase BR.gdb. Use the Add Data tab to add BRTrt to the map.

Step 3. Transforming the spatial dataset to a projected coordinate system: In the layers window, right-click the layer BRTrt > Properties >. In the Layer Properties window, click Source from the top bar. It shows the Extent of the feature in decimal degrees (dd) and its Geographic Coordinate System (GCS_North_American_1983).


FIGURE 1.2 Dialog window for spatial query.

Repeat on other layers (Tract, County and State) to confirm that they are all in a geographic coordinate system, as are all TIGER data. Now check the layer BRCenter. It has the Extent measured in meters (m) and a Projected Coordinate System (NAD_1983_UTM_Zone_15N).* For reasons discussed previously, we need to transform the census tracts dataset BRTrt in an unprojected geographic coordinate system to one in a projected coordinate system. For convenience, we use the projection system that is predefined in the layer BRCenter. Open ArcToolbox by clicking the icon on the menu bar above the ArcMap palette. In ArcToolbox, go to Data Management Tools > Projections and Transformations > double-click the tool Project. In the Project dialog window (as shown on the left of Figure 1.3), choose BRTrt for Input Dataset or Feature Class; *

* In ArcMap, as we move the cursor in a map, the bottom right corner displays the corresponding coordinates in the current map units, in this case, Decimal Degrees. In other words, maps in the data (or layout) view are in geographic coordinates, which is set by the coordinate system of the first layer added to the project.


FIGURE 1.3 Dialog windows for projecting a spatial dataset.

name the Output Dataset or Feature Class as BRTrtUtm (make sure that it is in geodatabase BR.gdb); under Output Coordinate System, click the graphic icon to activate the Spatial Reference Properties dialog window >. In this window (as shown on the right of Figure 1.3), click the Add Coordinate System tab and choose Import; browse to select BRCenter and click Add*; and click OK to close the dialog >. Click OK again to execute the Project tool. Use the Add Data tab to add the feature BRTrtUTM to the map. Check the coordinate system of the new dataset BRTrtUtm. Step 4. Calculating area in census tracts: Under File on the top-level menu, click New to exit the project, click OK when the New Document dialog appears, and click No to “Save changes to the Untitled.” Now first add the feature BRTrtUtm obtained from step 3 and then the other layers. Maps now change to the projected coordinate system in meters. Under Table Of Contents, right-click the layer BRTrtUtm >. Open Attribute Table >. In the table, click the Table Options tab (the first at the upper left corner) > Add Field >. In the Add Field dialog window, enter Area for Name, and choose Double for Type, and click OK. A new field Area is now added to the table as the last column. In the open table, scroll to the right side of the table, right-click the field Area > choose Calculate Geometry, and click Yes on the warning message >. In the Calculate Geometry dialog window, choose Area for Property and “Square Meters [sqm]” for Units, and click OK to update areas (click Yes again on the warning message). Step 5. Calculating population density in census tracts: Still on the open attribute table of BRTrtUtm, follow the above procedure to add a new field PopuDen (choose *

* Alternatively, expand Layers in the middle box > choose "NAD_1983_UTM_Zone_15N," which is the projection used in the layer BRCenter.


Float for Type) >. Right-click the field PopuDen > choose Field Calculator >. In the dialog window (Figure 1.4), under "PopuDen =," input the formula [DP0010001]/([Area]/1000000), and click OK to update the field.* Note that the map projection unit is meter, and thus the area unit is square meter. The denominator in the formula converts the area unit from square meters into square kilometers, the numerator DP0010001 is total population, and the unit of population density is persons per square kilometer. (In project instructions in other chapters of this book, the bracket signs for each variable may be dropped in a formula for simplicity; for example, the above formula may be written as PopuDen = DP0010001/(Area/1000000).) It is often desirable to obtain some basic statistics for a field. Still on the open table, right-click the field PopuDen > choose Statistics. The result includes count, minimum, maximum, sum, mean, standard deviation, and a frequency distribution. Close the table window. Step 6. Mapping population density across census tracts: In the layers window, right-click the layer BRTrtUtm > choose Properties. In the Layer Properties window, click Symbology > Quantities > Graduated Colors to map the field PopuDen

FIGURE 1.4 Dialog window for calculating a field.

* Similar to step 1, field names such as [DP0010001] and [Area] in the formula are entered by double-clicking the field names from the top box to save time and minimize typos.


FIGURE 1.5 Dialog window for defining mapping symbols.

(as shown in Figure 1.5). Click Apply to see the changes. Experiment with different classification methods, number of classes, and color schemes. From the main menu bar, choose View > Layout View to preview the map. Also from the main menu bar, choose Insert > Legend (Scale Bar, North Arrow, and others) to add map elements. When you are satisfied with the map, go to the main menu bar > File > Export Map to save the map product in a chosen format. The population density map shown in Figure 1.6 uses a manual classification for density classes. Save the project with the final map by going to the main menu bar > File > Save as.
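For readers who prefer scripting, steps 4 and 5 above can be reproduced with ArcPy. A minimal sketch using the same layer and field names (the path prefix is a placeholder):

```python
import arcpy

fc = r"C:\QuantGIS\BatonRouge\BR.gdb\BRTrtUtm"  # placeholder path to the projected tracts

arcpy.AddField_management(fc, "Area", "DOUBLE")
arcpy.AddField_management(fc, "PopuDen", "FLOAT")

# !shape.area! returns each polygon's area in the layer's projection units (square meters here)
arcpy.CalculateField_management(fc, "Area", "!shape.area!", "PYTHON_9.3")
arcpy.CalculateField_management(fc, "PopuDen",
                                "!DP0010001! / (!Area! / 1000000.0)", "PYTHON_9.3")
```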

1.3.2 Part 2: Analyzing the Population Density Pattern across Concentric Rings

A general observation from the population density map in Figure 1.6 is that densities near the city center are higher than in more remote areas. This is commonly referred to as the distance decay pattern for urban population densities. Part 2 of this case study analyzes such a pattern while illustrating several spatial analysis tools, mainly spatial join and map overlay, discussed in Section 1.2. In addition to the datasets used in Part 1, we will use the "2010 Census Population and Housing Unit Counts—Blocks" data, available from the same TIGER data website. The dataset at the block level delivers population information at a finer

FIGURE 1.6 Population density pattern in Baton Rouge in 2010 (legend classes in persons/sq km: 0–400, 401–800, 801–1200, 1201–1600, 1601–2580).

geographic resolution and provides an alternative measure to examine the pattern. The new dataset is: feature BRBlkUtm in geodatabase BR.gdb (census blocks in Baton Rouge, with fields such as HOUSING10 for number of housing units and POP10 for population). Step 7. Generating concentric rings around the city center: In ArcMap, add BRTrtUtm to the map. In ArcToolbox, choose Analysis Tools > Proximity > double-click the tool Multiple Ring Buffer. In the dialog window (Figure 1.7), choose BRCenter for Input Features, name the Output Feature Class Rings, input values 2000, 4000, 6000, and so on till 26,000 under Distances (using the + sign to add one by one), and click OK. The output feature has 13 concentric rings with an increment of 2 km. While the traditional Buffer tool generates a buffer area around a feature (point, line, or polygon), the Multiple Ring Buffer tool creates multiple buffers at defined distances. Both are popular map overlay tools. Step 8. Overlaying the census tracts layer and the rings layer: Again in ArcToolbox, choose Analysis Tools > Overlay > double-click the tool Intersect. In the dialog window, select BRTrtUtm and Rings as Input Features, name the Output Feature Class “TrtRings,” and click OK. Only the overlapping areas of the two input features are preserved in the output feature, where any census tract crossing multiple rings is split. This step uses another popular map overlay tool Intersect.
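Steps 7 and 8 can likewise be scripted. A rough ArcPy sketch, with the workspace path as a placeholder and names as in the case study:

```python
import arcpy

arcpy.env.workspace = r"C:\QuantGIS\BatonRouge\BR.gdb"  # placeholder path

# Step 7: 13 concentric rings, 2 km apart, around the city center;
# "distance" matches the field name used in the text, "ALL" keeps the buffers non-overlapping
distances = list(range(2000, 28000, 2000))  # 2000, 4000, ..., 26000 (meters)
arcpy.MultipleRingBuffer_analysis("BRCenter", "Rings", distances, "Meters", "distance", "ALL")

# Step 8: intersect tracts with the rings; tracts crossing several rings are split
arcpy.Intersect_analysis(["BRTrtUtm", "Rings"], "TrtRings")
```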


FIGURE 1.7 Dialog window for multiple ring buffer.

In ArcMap, right-click the layer TrtRings in the layers window > Open Attribute Table. Note that it inherits the attributes from both input features. Among the fields in the table, values for the fields such as DP0010001 (population) and Area are those for the whole tracts and thus no longer valid for the new polygons, but values for the fields such as PopuDen and distance remain valid if assuming that population is uniformly distributed within a tract and areas in a ring are at the same distance from the city center. Step 9. Calculating area and estimating population in the intersected areas: Still on the open attribute table of TrtRings, follow the procedure illustrated in steps 4 and 5 to add two new fields InterArea (choose Float for Type) and EstPopu (choose Float for Type); and use “Calculate Geometry” to update the area size for InterArea, and use the formula “ = [PopuDen]*([InterArea]/1000000)” to update EstPopu. Here, population in each new polygon is estimated by its population density (inherited from the tract layer) multiplied by its area. Step 10. Computing population and density in concentric rings: Another popular map overlay tool Dissolve is introduced here for data aggregation. In ArcToolbox, choose Data Management Tools > Generalization > double-click the tool Dissolve. In the dialog window (Figure 1.8), choose TrtRings for Input Features, name the Output Feature class DissRings, check distance for Dissolve_Field(s); for Statistics Field(s), choose a Field EstPopu and its Statistic Type “SUM,” and


FIGURE 1.8 Dialog window for the Dissolve tool.

choose another Field InterArea and its Statistic Type SUM; click OK. In essence, the Dissolve tool merges polygons sharing the same values of a Dissolve_Field (here distance) and also aggregates attributes according to the defined fields and their corresponding statistics. In ArcMap, right-click the layer DissRings in the layers window > Open Attribute Table. Similarly, add a new field PopuDen (choose Float for Type) and calculate it as “ = [SUM_EstPopu]/([SUM_InterArea]/1000000).” Note that the fields SUM_EstPopu and SUM_InterArea are total population and total area for each ring. Step 11. Graphing and mapping population density across concentric rings: Still on the open attribute table of DissRings, click the graphic icon for Table Options (at the top left corner) > Create Graph. In the dialog window (Figure 1.9), choose Vertical Bar for Graph type, DissRings for Layer/Table, PopuDen for Value field, and distance for X field; click Next > Finish. One may right-click the resulting graph > Copy as Graphic, and then paste to a project report (or Export to a file in a chosen format such as JPEG, PDF, etc.). Similar to step 6, one may map the population density pattern across the concentric rings in the study area. The following illustrates the process to analyze the concentric density pattern based on the block-level Census data. One key step uses a spatial join tool.
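Before turning to the block-level data, here is a rough ArcPy sketch of step 10's Dissolve and the ring-density calculation that follows it (path is a placeholder; field and layer names follow the case study):

```python
import arcpy

arcpy.env.workspace = r"C:\QuantGIS\BatonRouge\BR.gdb"  # placeholder path

# Step 10: merge the split polygons back by ring and sum estimated population and area
arcpy.Dissolve_management("TrtRings", "DissRings", "distance",
                          [["EstPopu", "SUM"], ["InterArea", "SUM"]])

# Ring density from the two summed fields (named SUM_EstPopu and SUM_InterArea by the tool)
arcpy.AddField_management("DissRings", "PopuDen", "FLOAT")
arcpy.CalculateField_management("DissRings", "PopuDen",
                                "!SUM_EstPopu! / (!SUM_InterArea! / 1000000.0)", "PYTHON_9.3")
```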


FIGURE 1.9 Dialog window for creating a graph in ArcGIS.

Step 12. Generating the block centroids: In ArcToolbox, choose Data Management Tools > Features > Feature To Point. In the dialog window, choose BRBlkUtm for Input Feature, enter BRBlkPt for Output Feature Class, check the “Inside (optional)” box, and click OK. Step 13. Using a spatial join to aggregate population from blocks to rings: In the layers window, right-click the layer DissRings > Joins and Relates > Join. In the dialog window (Figure 1.10), choose “Join data from another layer based on spatial location” in the top box*; for the box under 1, select BRBlkPt; for 2, check the first option (“Each polygon will be given a summary of …”), and the box next to Sum; for 3, name the resulting layer RingBlkP. Click OK to perform the join. The tool processes (in this case, sums up) attributes of the source layer (here, BRBlkPt) and joins the new attributes to the target layer (here, DissRings) based on the spatial relationship between the two layers. Although a new layer (here, RingBlkP) is created, it is essentially the same feature as the target layer with only the attributes updated. This is a spatial join. Step 14. Analyzing and comparing population and density estimates in rings: In the attribute table of RingBlkP, the field Sum_POP10 is the total population in each ring, the result of summing up the population of all blocks whose centroids fall

* This choice is a spatial join. Another option is "Join attributes from a table" (for an attribute join). Although not used in this case study, attribute join is used very often (see step 5 in Case Study 2 in Chapter 2).


FIGURE 1.10 Dialog window for spatial join.

within the ring, in contrast to the field SUM_EstPopu that was derived from the census tract data in step 10. Similar to step 10, add a new field PopuDen1, and calculate it as " = [Sum_POP10]/([SUM_InterArea]/1000000)." Recall that the field SUM_InterArea for total area in a ring remains valid and applicable here. One may map and graph the population density based on the field PopuDen1, and the results are very similar to those obtained from step 11. Figure 1.11 shows that the densities estimated from the two approaches are generally consistent with each other. By doing so, we examine whether a research finding is stable when data of different areal units are used. This is commonly referred to as the "modifiable areal unit problem" (MAUP) (Fotheringham and Wong, 1991). Figure 1.12 summarizes the steps for Part 1 and Part 2 of the case study. At the end of the project, one may use ArcCatalog to manage the data (e.g., deleting unneeded data to save disk space, renaming some layers, etc.).
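Steps 12 and 13 can also be scripted. One hedged sketch is below: a Sum merge rule is attached to the POP10 field through ArcPy field mappings, so each ring receives the total population of the block centroids it contains (the interactive dialog in step 13 labels the result Sum_POP10, whereas this approach writes the summed value back into a field named POP10):

```python
import arcpy

arcpy.env.workspace = r"C:\QuantGIS\BatonRouge\BR.gdb"  # placeholder path

# Step 12: block centroids forced to lie inside their blocks
arcpy.FeatureToPoint_management("BRBlkUtm", "BRBlkPt", "INSIDE")

# Step 13: spatial join with a Sum merge rule on POP10
fms = arcpy.FieldMappings()
fms.addTable("DissRings")
fms.addTable("BRBlkPt")
idx = fms.findFieldMapIndex("POP10")
fmap = fms.getFieldMap(idx)
fmap.mergeRule = "Sum"          # sum the population of all joined block points
fms.replaceFieldMap(idx, fmap)

arcpy.SpatialJoin_analysis("DissRings", "BRBlkPt", "RingBlkP",
                           "JOIN_ONE_TO_ONE", "KEEP_ALL", fms, "CONTAINS")
```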


Based on tract

Density (p/sq_km)

1200

Based on block

1000 800 600 400 200 0

0

5000

10,000

15,000 20,000 Distance (m)

25,000

30,000

FIGURE 1.11 Population density patterns based on data at the census tract and block levels.

FIGURE 1.12 Flow chart for Case Study 1: (a) Part 1. (b) Part 2.


1.4 SUMMARY

Major GIS and spatial analysis skills learned in this chapter include the following:
1. Map projections and transformation between them
2. Spatial data management (copying, renaming and deleting a spatial dataset)
3. Attribute data management (adding, deleting, and updating a field including calculating polygon areas in a table)
4. Mapping attributes
5. Attribute query versus spatial query
6. Attribute join versus spatial join
7. Map overlay operations (multiple ring buffer; intersect and dissolve; buffer, clip, and erase)

Other tasks introduced include finding spatial and attribute data in the public domain, generating centroids from a polygon layer to create a point layer, and graphing the relationship between variables. Projects in other chapters will also utilize these skills. This chapter also discusses key concepts such as different relationships between tables (one-to-one, many-to-one, one-to-many, and many-to-many), various spatial joins, and differences between spatial queries, spatial joins, and map overlays.

For additional practice of GIS-based mapping, readers can download the census data and TIGER files for a familiar area, and map some demographic (population, race, age, sex, etc.) and socioeconomic variables (income, poverty, educational attainment, family structure, housing characteristics, etc.) in the area.

APPENDIX 1: IDENTIFYING CONTIGUOUS POLYGONS BY SPATIAL ANALYSIS TOOLS

Deriving a polygon adjacency matrix is an important task in spatial analysis. For example, in Chapter 8, both the area-based spatial cluster analysis and spatial regression utilize the spatial weight matrix in order to account for spatial autocorrelation. Polygon adjacency is one of the several ways to define spatial weights that capture spatial relationships between spatial features (e.g., the neighboring features). Adjacency between polygons may be defined in two ways: (1) rook contiguity defines adjacent polygons as those sharing edges, and (2) queen contiguity defines adjacent polygons as those sharing edges or nodes (Cliff and Ord, 1973). This appendix illustrates how the queen contiguity can be defined by using some basic spatial analysis tools discussed in Section 1.2.

Here we use a census tract from the same study area BRTrtUtm in Case Study 1 to illustrate how to identify the contiguous polygons for one specific tract. Take a tract with its OBJECTID = 28540 as an example. As shown in Figure A1.1, based on the rook contiguity labeled "R," it has four neighboring tracts (28537, 28534, 28582, and 28568). Using instead the queen contiguity labeled "Q," tracts 28589 and 28573 are added as neighboring tracts in addition to the four rook-contiguous tracts. The following steps (shown in Figure A1.2) illustrate how to implement the process of identifying queen-contiguous polygons.


FIGURE A1.1 Rook versus queen contiguity.

FIGURE A1.2 Workflow for defining queen contiguity.

1. Use an attribute query to select the tract with "OBJECTID = 28540" and save it (say, as Zonei).
2. Buffer a small distance (e.g., 100 m) around the feature Zonei to create a layer Zonei_Buff.


3. Use Zonei_Buff to clip from the study area to generate a layer Zonei_Clip (be sure to set the XY Tolerance as 10 m with high accuracy).
4. Erase Zonei from Zonei_Clip to yield another new layer Zonejs (also set the XY Tolerance as 10 m).

Note that the resulting layer Zonejs contains all neighboring tracts (i.e., their IDs) of Zonei based on the queen contiguity. Each step uses a spatial analysis tool (highlighted in italics). The above process can be iterated for deriving a polygon adjacency matrix for the study area (Shen, 1994).
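The four steps lend themselves to a loop in ArcPy. A minimal sketch for one tract is given below; the path and the identifier field (GEOID10) are placeholders, and the Erase tool requires an Advanced license:

```python
import arcpy

tracts = r"C:\QuantGIS\BatonRouge\BR.gdb\BRTrtUtm"  # placeholder path
oid = 28540  # the example tract

# 1. Attribute query: save the tract of interest
arcpy.Select_analysis(tracts, "in_memory/zone_i", "OBJECTID = {0}".format(oid))
# 2. Buffer a small distance around it
arcpy.Buffer_analysis("in_memory/zone_i", "in_memory/zone_i_buf", "100 Meters")
# 3. Clip the study area with the buffer (10 m XY tolerance, as suggested above)
arcpy.Clip_analysis(tracts, "in_memory/zone_i_buf", "in_memory/zone_i_clip", "10 Meters")
# 4. Erase the tract itself, leaving only its queen-contiguous neighbors
#    (Erase requires an Advanced license)
arcpy.Erase_analysis("in_memory/zone_i_clip", "in_memory/zone_i", "in_memory/zone_js", "10 Meters")

# Read the neighbors' identifiers; GEOID10 stands in for whatever ID field the tracts carry
neighbors = [row[0] for row in arcpy.da.SearchCursor("in_memory/zone_js", ["GEOID10"])]
print(neighbors)
```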

2 Measuring Distance and Time

This chapter discusses a basic task encountered very often in spatial analysis: measuring distance and time. After all, spatial analysis is about how the physical environment and human activities vary across space—in other words, how these activities change with distances from reference locations or objects of interest. In many applications, once the distance or time measure is obtained, studies may be completed outside of a GIS environment. The advancement and wide availability of GIS have made the task much easier than it was in the past. The task of distance or time estimation is found throughout this book. For example, spatial smoothing and spatial interpolation in Chapter 3 utilize distance to determine which objects are considered in the computation and how much influence the objects have upon it. In trade area analysis in Chapter 4, distances (or time) between stores and consumers affect how frequently stores are visited. In the discussion of accessibility measures in Chapter 5, distance or time measures serve as the building blocks of either the floating catchment area method or the gravity-based method. Chapter 6 examines how population density or land use intensity declines with distance from a city or regional center. Measurement of distance and time can also be found in other chapters. This chapter is structured as follows. Section 2.1 provides an overview of various distance measures. Section 2.2 focuses on the computation of the shortest route distance (time traveled) through a network. Section 2.3 discusses the distance decay rule. A case study measuring the distances and network travel time between residents (at the census tract level) and public hospitals in Louisiana is presented in Section 2.4. Results from this case study will be used in Case Study 4B in Chapter 4. The chapter concludes with a brief summary in Section 2.5.

2.1 MEASURES OF DISTANCE

Distance measures include Euclidean distance, geodetic distance, Manhattan distance, network distance, topological distance, and others. Euclidean distance, also referred to as straight-line distance or "as the crow flies," is simply the distance between two points connected by a straight line on a flat surface. Prior to the widespread use of GIS, researchers needed to use mathematical formulas to compute the distance, and the accuracy was limited depending on the information available and tolerance of computational complexity. If a study area is small in terms of its geographic territory (e.g., a city or a county), Euclidean distance between two points (x1, y1) and (x2, y2) in Cartesian coordinates is approximated as


d_{12} = [(x_1 - x_2)^2 + (y_1 - y_2)^2]^{1/2}.    (2.1)

If the study area covers a large territory (e.g., a state or a nation), the geodetic distance is a more accurate measure. Geodetic distance between two points is the distance through a great circle assuming the earth as a sphere. Given the geographic coordinates of two points as (a, b) and (c, d) in radians that may be converted from decimal degrees, the geodetic distance between them is

d_{12} = r \arccos[\sin b \sin d + \cos b \cos d \cos(c - a)],    (2.2)

where r is the radius of the earth (approximately 6367.4 km).

As the name suggests, Manhattan distance describes a rather restrictive movement in rectangular blocks as in the New York City borough of Manhattan. Manhattan distance is the length of the change in the x direction plus the change in the y direction. For instance, the Manhattan distance between two nodes (x1, y1) and (x2, y2) in Cartesian coordinates is simply computed as

d_{12} = |x_1 - x_2| + |y_1 - y_2|.    (2.3)

Similar to Equation 2.1, the Manhattan distance defined by Equation 2.3 is meaningful only in a small study area (e.g., a city). Network distance is the shortest path (or least cost) distance through a transportation network, and will be discussed in detail in Section 2.2. Manhattan distance can be used as an approximation for network distance if the street network is close to a grid pattern. All the above measures of distance are metric. In contrast, topological distance emphasizes the number of connections (edges) between objects instead of their total lengths. For example, the topological distance between two locations on a transportation network is the fewest number of edges that takes to connect them. For the topological distance between polygons, it is 1 between two neighboring polygons, 2 if they are connected through a common neighbor, and so on. Topological distance measure is used in applications such as ecological modeling (Foltêtea et al., 2008), urban centrality index (Hillier, 1996), social network analysis (Freeman, 1979), modeling interaction between administrative units (Liu et al., 2014), and others. Other distance terms such as behavioral distance, psychological distance (e.g., Liberman et al., 2007), mental distance, and social distance emphasize that the aforementioned physical distance measures may not be the best approach to model spatial decision making that varies by individual attributes, by environmental considerations and by the interaction between them. For example, one may choose the safest but not necessarily the shortest route to return home, and the choice may be different at various times of the day. The routes from the same origin to the same destination may vary between a resident with local knowledge and a new visitor to the area.
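The three formulas are easy to verify with a few lines of Python. A minimal sketch following Equations 2.1 through 2.3 (the Baton Rouge and New Orleans coordinates below are rough values used only for illustration):

```python
import math

def euclidean(x1, y1, x2, y2):
    """Equation 2.1: straight-line distance in projected (x, y) coordinates."""
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

def geodetic(lon1, lat1, lon2, lat2, r=6367.4):
    """Equation 2.2: great-circle distance in km; inputs in decimal degrees."""
    a, b, c, d = map(math.radians, (lon1, lat1, lon2, lat2))
    cos_angle = math.sin(b) * math.sin(d) + math.cos(b) * math.cos(d) * math.cos(c - a)
    return r * math.acos(min(1.0, cos_angle))  # clamp guards against rounding above 1

def manhattan(x1, y1, x2, y2):
    """Equation 2.3: movement restricted to the x and y directions."""
    return abs(x1 - x2) + abs(y1 - y2)

# Rough coordinates for Baton Rouge and New Orleans (decimal degrees)
print(geodetic(-91.15, 30.45, -90.07, 29.95))  # great-circle distance in km
```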


In ArcGIS, simply click on the graphic tool (“measure”) in ArcMap to obtain the distance between two points (or a cumulative distance along several points). Distance is also computed as a by-product in many spatial analysis operations in ArcGIS. For example, a distance join (a spatial join method) in ArcGIS, as explained in Section 1.2, records the nearest distances between objects of two spatial data sets. In a distance join, the distance between lines or polygons is between their closest points. Similarly, the Near tool (under ArcToolbox > Analysis Tools > Proximity) computes the distance from each feature in one layer to its closest feature in another layer. Some applications need to use distances between any two points either within one layer or between different layers, and thus a distance matrix. The Point Distance tool in ArcToolbox is designed for this purpose, and is accessed in ArcToolbox > Analysis Tools > Proximity > Point Distance. In the output file from the above operations, if the value for the resulting field “DISTANCE” is 0, it could be that either the actual distance is indeed 0 (e.g., from a point to itself) or that the point is located beyond the search radius. The current ArcGIS version does not have a built-in tool for computing the less commonly used Manhattan distance, and the implementation of its computation in ArcGIS is discussed in Section 2.4. Computation of metric or topological network distances is discussed in the next section.
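For reference, the two proximity tools mentioned above can also be called from ArcPy. The layer and table names below (TractPts, Hospitals, DistTable) are hypothetical placeholders:

```python
import arcpy

# Near: adds NEAR_FID and NEAR_DIST fields to the input layer (distance to the closest facility)
arcpy.Near_analysis("TractPts", "Hospitals")

# Point Distance: writes a table of distances between all point pairs within the search radius
# (omit the radius to obtain a full origin-destination distance matrix)
arcpy.PointDistance_analysis("TractPts", "Hospitals", "DistTable", "50000 Meters")
```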

2.2 COMPUTING NETWORK DISTANCE AND TIME

A network consists of a set of nodes (or vertices) and a set of arcs (or edges or links) that connect the nodes. If the arcs are directed (e.g., one-way streets), the network is a directed network. A network without regard to direction may be considered a special case of directed network with each arc having two permissible directions. Finding the shortest path from a specified origin to a specified destination is the shortest route problem, which records the shortest distance or the least time (cost) if the impedance value (e.g., travel speed) is provided on each arc. Different methods for solving the problem have been proposed in the literature, including the label setting algorithm discussed in this section and the valued-graph (or L-matrix) method in Appendix 2A.

The popular label setting algorithm was first described by Dijkstra (1959). The method assigns "labels" to nodes, and each label is actually the shortest distance from a specified origin. To simplify the notation, the origin is assumed to be node 1. The method has four steps:

1. Assign the permanent label y1 = 0 to the origin (node 1), and a temporary label yj = M (a very large number) to every other node. Set i = 1.
2. From node i, recompute the temporary labels yj = min(yj, yi + dij), where node j is temporarily labeled and dij < M (dij is the distance from i to j).
3. Find the minimum of the temporary labels, say yi. Node i is now permanently labeled with value yi.
4. Stop if all nodes are permanently labeled; go to step 2 otherwise.
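The four steps translate directly into a short Python function. A minimal sketch is given below; the arc impedances are those of the example network in Figure 2.1 discussed next, and the arcs dictionary lists each undirected link in both directions:

```python
M = float("inf")

def label_setting(arcs, nodes, origin):
    """Return the shortest distance (permanent label) from origin to every node."""
    labels = {n: M for n in nodes}   # temporary labels (step 1)
    labels[origin] = 0
    permanent = set()
    while len(permanent) < len(nodes):
        # step 3: the temporarily labeled node with the smallest label becomes permanent
        i = min((n for n in nodes if n not in permanent), key=lambda n: labels[n])
        permanent.add(i)
        # step 2: recompute temporary labels reachable from i
        for (a, b), d in arcs.items():
            if a == i and b not in permanent:
                labels[b] = min(labels[b], labels[i] + d)
    return labels

arcs = {(1, 2): 25, (2, 1): 25, (1, 3): 55, (3, 1): 55, (2, 3): 40, (3, 2): 40,
        (2, 4): 45, (4, 2): 45, (2, 5): 50, (5, 2): 50, (3, 5): 30, (5, 3): 30,
        (4, 5): 35, (5, 4): 35}
print(label_setting(arcs, {1, 2, 3, 4, 5}, 1))
# shortest distances from node 1: node 2 -> 25, node 3 -> 55, node 4 -> 70, node 5 -> 75
```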


FIGURE 2.1 An example for the label-setting algorithm. (Panels a–f show the network and the node labels after each iteration; permanent labels are marked with an asterisk.)

The following example is used to illustrate the method. Figure 2.1a shows the network layout with nodes and links. The number next to a link is the impedance value for the link.

Following step 1, permanently label node 1 and set y1 = 0; temporarily label y2 = y3 = y4 = y5 = M. Set i = 1. A permanent label is marked with an asterisk (*). See Figure 2.1b.

In step 2, from node 1 we can reach nodes 2 and 3, which are temporarily labeled: y2 = min(y2, y1 + d12) = min(M, 0 + 25) = 25, and similarly y3 = min(y3, y1 + d13) = min(M, 0 + 55) = 55. In step 3, the smallest temporary label is min(25, 55, M, M) = 25 = y2. Permanently label node 2, and set i = 2. See Figure 2.1c.

Back to step 2, as nodes 3, 4, and 5 are still temporarily labeled. From node 2, we can reach temporarily labeled nodes 3, 4, and 5: y3 = min(y3, y2 + d23) = min(55, 25 + 40) = 55, y4 = min(y4, y2 + d24) = min(M, 25 + 45) = 70, and y5 = min(y5, y2 + d25) = min(M, 25 + 50) = 75.


Following step 3 again, the smallest temporary label is min(55, 70, 75) = 55 = y3. Permanently label node 3, and set i = 3. See Figure 2.1d. Back to step 2, as nodes 4 and 5 are still temporarily labeled. From node 3, we can reach only node 5 (still temporarily labeled): y5 = min(y5, y3 + d35) = min(75, 55 + 30) = 75. Following step 3, the smallest temporary label is min(70, 75) = 70 = y4. Permanently label node 4, and set i = 4. See Figure 2.1e. Back to step 2, as node 5 is still temporarily labeled. From node 4, we can reach node 5: y5 = min(y5, y4 + d45) = min(75, 70 + 35) = 75. Node 5 is the only temporarily labeled node, so we permanently label node 5. By now all nodes are permanently labeled, and the problem is solved. See Figure 2.1f.

The permanent labels yi give the shortest distance from node 1 to node i. Once a node is permanently labeled, we examine arcs "scanning" from it only once. The shortest paths are recovered by noting the "scanning node" each time a label is changed (Wu and Coppins, 1981, p. 319). The solution to the above example is summarized in Table 2.1.

Topological distances in a transportation network can be calibrated similarly by using the above algorithm. In a topological network, however, the impedance on a direct link between two connected nodes is simply coded as 1 (regardless of its physical length), and node pairs without a direct link remain unconnected (dij = M). Topological distances between polygons can be computed by constructing a topological network representing the polygon connectivity based on rook adjacency (see Appendix 1), and then applying the shortest route algorithm.

A transportation network has many network elements such as link impedances, turn impedances, one-way streets, overpasses, and underpasses that need to be defined (Chang, 2004, p. 351). Putting together a road network requires extensive data collection and processing, which can be very expensive or infeasible for many applications. For example, a road layer extracted from the TIGER/Line files does not contain nodes on the roads, turning parameters, or speed information. When such information is not available, one has to make various assumptions to prepare the network dataset. For example, in Luo and Wang (2003), speeds are assigned to different roads according to the CFCC codes (census feature class codes) used by the U.S. Census Bureau in its TIGER/Line files and to whether the roads are in urban, suburban, or rural areas. Wang (2003) uses regression models to predict travel speeds by land use intensity (business and residential densities) and other factors.

TABLE 2.1 Solution to the Shortest Route Problem

Origin-Destination Nodes    Arcs on the Route    Shortest Distance
1, 2                        (1, 2)               25
1, 3                        (1, 3)               55
1, 4                        (1, 2), (2, 4)       70
1, 5                        (1, 2), (2, 5)       75


Section 2.4 uses a case study to illustrate how the ArcGIS network analysis module can be used to estimate travel time through a road network by utilizing the commonly available TIGER files. The result is adequate for most planning purposes. For a more accurate estimate of travel time, consider using microscopic transportation simulation packages such as CORSIM by FHWA (2013) or TRANSCAD by Caliper (2013), which usually require more detailed road network data. Appendix 2B shows an alternative for estimating travel time with the Google Maps API, which taps into the dynamically updated transportation network data and routing rules maintained by Google and can thus yield a more current and reliable estimate of travel time.

2.3 DISTANCE DECAY RULE

The distance decay rule states that the interaction between physical or socioeconomic objects declines as the distance between them increases. As a result, the density, intensity, frequency, or other measures of a feature decline with distance from a reference point. It is also referred to as Waldo R. Tobler's (1970) first law of geography that "everything is related to everything else, but near things are more related than distant things."

There are many examples of the distance decay rule. For example, the volume of air passengers between cities or the number of commuters in a city tends to decline with trip distance (Jin et al., 2004; De Vries et al., 2009). Analysis of journey-to-crime data suggests that the number of crime incidents drops with increasing distance between offenders' residences and crime locations (Rengert et al., 1999). The frequency of hospitalization also declines with distance from patients' residences (Delamater et al., 2013).

The distance decay rule is the foundation for many spatial analysis methods. In Chapter 3, both the kernel density estimation and the inverse distance weighted methods use a distance decay function to interpolate the value at any location based on known values at observed locations. In Chapter 4, the delineation of trade areas is based on the assumption that the influence of a store (or the probability of customers visiting a store) declines with distance from the store until its influence (or probability of being visited) is equal to that of other stores. In Chapter 5, the measure of accessibility to a service facility is also modeled with a distance decay function. Chapter 6 focuses on identifying functions that capture the spatial pattern of density decay with distance from the city center.

Some continuous distance decay functions commonly found in the literature include the following:

1. The power function f(x) = x^(-β) (e.g., Hansen, 1959)
2. The exponential function f(x) = e^(-βx) (e.g., Wilson, 1969)
3. The Gaussian function f(x) = [1/(σ√(2π))] e^(-(1/2)[(x-μ)/σ]²) (e.g., Shi et al., 2012)
4. The log-logistic function f(x) = γ/(1 + (x/α)^β) (e.g., Delamater et al., 2013)

The distance decay rule can also be modeled as a discrete set of choices or in a hybrid form such as a kernel function (i.e., a continuous function within a certain distance range and fixed values beyond) (Wang, 2012, p. 1106).
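The four continuous functions above can be coded in a few lines; the sketch below uses illustrative parameter values only (they are not calibrated to any data set in this book).

# Continuous distance decay functions (parameter values are illustrative only).
import math

def power(x, beta=1.5):                 # f(x) = x^(-beta)
    return x ** (-beta)

def exponential(x, beta=0.1):           # f(x) = exp(-beta * x)
    return math.exp(-beta * x)

def gaussian(x, mu=0.0, sigma=10.0):    # normal density form of decay
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def log_logistic(x, alpha=10.0, beta=2.0, gamma=1.0):
    return gamma / (1.0 + (x / alpha) ** beta)

for d in (1, 5, 10, 20):                # decay: values shrink as distance d grows
    print(d, round(power(d), 4), round(exponential(d), 4),
          round(gaussian(d), 4), round(log_logistic(d), 4))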


How do we choose a distance decay function and set its parameter values? The question can only be answered by analyzing actual trip behavior to find the best-fitting function. Take the power function as an example. It is critical to find an appropriate value for the distance decay parameter β (or coefficient of distance friction). The power function is based on an earlier version of the gravity model such as

Tij = a Oi Dj dij^(-β)    (2.4)

where Tij is the number of trips between areas i and j, Oi is the size of an origin i (e.g., population in a residential area), Dj is the size of a destination j (e.g., a store size), a is a scalar (constant), dij is the distance between them, and β is the distance friction coefficient. Rearranging Equation 2.4 and taking logarithms on both sides yields

ln[Tij/(Oi Dj)] = ln a - β ln dij

The model can be estimated by a simple bivariate regression. A more general gravity model can be written as

Tij = a Oi^(α1) Dj^(α2) dij^(-β)    (2.5)

where α1 and α2 are the added exponents for origin Oi and destination Dj. The logarithmic transformation of Equation 2.5 is

ln Tij = ln a + α1 ln Oi + α2 ln Dj - β ln dij    (2.6)

which can be estimated by a multivariate regression model. In the conventional gravity model, the task is to estimate flows Tij given nodal attractions Oi and Dj. Some more recent work proposes the notion of a "reverse gravity model" that seeks to reconstruct the gravitational attractions Oi and Dj from network flow data Tij (O'Kelly et al., 1995). In both the conventional and reverse gravity models, the distance friction coefficient β is a critical parameter capturing the distance decay effect. It can be estimated by the regression method as illustrated above or by other methods such as linear programming (O'Kelly et al., 1995), particle swarm optimization (PSO) (Xiao et al., 2013), and Monte Carlo simulation (Liu et al., 2012).
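As a concrete illustration of the regression approach, the sketch below estimates β (and the scalar a) for the simple gravity model in Equation 2.4 by ordinary least squares on the log-transformed data. The flow, size, and distance arrays are made-up numbers for demonstration only.

# OLS estimation of ln[T_ij/(O_i D_j)] = ln a - beta * ln d_ij (log form of Equation 2.4).
# All arrays below are made-up numbers for demonstration only.
import numpy as np

T = np.array([120., 45., 30., 80., 15., 60.])                 # observed flows
O = np.array([5000., 5000., 5000., 8000., 8000., 8000.])      # origin sizes
D = np.array([200., 150., 100., 200., 150., 100.])            # destination sizes
d = np.array([5., 12., 20., 8., 25., 15.])                    # distances

y = np.log(T / (O * D))
X = np.column_stack([np.ones_like(d), np.log(d)])             # intercept and ln(d)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
ln_a, neg_beta = coef
print("a =", np.exp(ln_a), "beta =", -neg_beta)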

2.4 CASE STUDY 2: COMPUTING DISTANCES AND TRAVEL TIME TO PUBLIC HOSPITALS IN LOUISIANA

In many spatial analysis applications, a distance matrix between a set of origins and a set of destinations is needed. This case study explains how various distances and network travel time are computed in ArcGIS.


Louisiana has a larger portion of its population living in poverty, uninsured, and in poor health than most other states in the US. It currently operates 10 public hospitals formerly referred to as the "Charity Hospitals." These make up "Louisiana's Health Care Safety Net," that is, the default system of medical care for the uninsured, underinsured, Medicaid, and other vulnerable populations. In recent years, the hospitals have constantly been subject to budget cuts and are under threat of being privatized to save the state money.

This case study illustrates how to measure distances and travel time between residents (at the census tract level) and the public hospitals in Louisiana. As noted earlier, measuring distance and time is a fundamental task in spatial analysis. For instance, results from this project will be used in a case study in Chapter 4 that examines the service areas of hospitals, and could also be utilized in developing other analyses (e.g., accessibility measures discussed in Chapter 5, and optimizing the allocation of healthcare facilities in Chapter 11). Other than the hospitals data, the demographic and road data are from the Census (including the TIGER data) as discussed in Chapter 1.

The following data sets are provided under the data folder Louisiana:

1. A comma-separated text file Hosp_Address.csv (containing addresses, geographic coordinates, and the number of staffed beds for the 10 public hospitals in Louisiana).
2. Geodatabase LA_State.gdb in the UTM projection (containing the feature LA_Trt for the census tracts in Louisiana, and a feature dataset LA_MainRd including the single feature LA_MainRd for the main roads in Louisiana).*

2.4.1 Part 1: Measuring Euclidean and Manhattan Distances

Step 1. Geocoding hospitals: Two geocoding approaches are introduced: one based on geographic coordinates, and another (optional) based on street addresses. [Geocoding based on geographic coordinates] Activate ArcMap, and add a layer in an unprojected geographic coordinate system by navigating to the data folder of Case Study 1 in Chapter 1 and loading BatonRouge\Census.gdb\State to the project.† From the main menu, choose File > Add Data > Add XY Data > In the dialog window (Figure 2.2), choose Hosp_Address.csv; the default X Field and Y Field (LONGITUDE and LATITUDE) are correct; click OK to execute the command and also OK in the warning message window. The result is shown as a layer “Hosp_Address.csv Events.” Right-click the layer Hosp_Address.csv Events > Data > Export Data. Save it as a layer Hosp_geo, which is in an unprojected geographic coordinate system. *



Another geodatabase LAState.gdb includes additional features Hosp and TrtPt and network dataset LA_MainRd_ND, which are generated from steps 1, 2, and 8. The geodatabase is provided for your convenience (e.g., working on Case Study 4B in Section 4.4 without completing this case study). The coordinates recorded in the hospitals file Hosp_Address.csv are in geographic coordinates. It is important to implement this geocoding approach in a project environment that is also in an unprojected geographic coordinate system.


FIGURE 2.2 Dialog window for geocoding hospitals based on geographic coordinates.

Project the unprojected layer Hosp_geo to a new feature Hosp in the UTM projection (e.g., by importing the projection predefined in the layer Louisiana\ LA_State.gdb\LA_Trt). Make sure the feature Hosp is saved under the geodatabase LA_State.gdb. If needed, refer to step 3 in Case Study 1. [Optional: Geocoding based on street addresses] This approach requires a street layer not included in the data due to its large size. One may extract the street feature in Louisiana from the Census TIGER files (http://www.census.gov/cgi-bin/geo/ shapefiles2013/main) or from the data files that are delivered along with the ArcGIS software by the ESRI. After deriving the streets from one of these or another source, name the layer LA_Street. Exit from the previous project, and reactivate ArcMap. In ArcToolbox > Geocoding Tools > Create Address Locator >. In the dialog shown in Figure 2.3, for Address Locator Style, select US Address—Dual Ranges, and OK; for Reference Data, select LA_Street and make sure that its Role is Primary Table; for Output Address Locator, navigate to the project folder, name it LA_AddLoc, and click Save; click OK to execute it. In ArcToolbox, choose Geocoding Tools > Geocode Addresses >. In the dialog window, for Input Table, choose Hosp_Address.cvs; for Input Address Locator, choose LA_AddLoc; for Output feature Class, navigate to the project folder and name it Hosp1; click OK to execute the command. If the street file used


FIGURE 2.3 Dialog window for geocoding hospitals based on street addresses.

If the street file used for geocoding (LA_Street) is in an unprojected geographic coordinate system, follow a similar procedure to project the geocoded output to a new layer Hosp2 in the UTM projection. One may overlay the layers Hosp and Hosp2 and see that they are almost identical to each other.

Step 2. Generating census tract centroids: In ArcToolbox, choose Data Management Tools > Features > Feature To Point >. In the dialog window, for Input Features, under the project data folder Louisiana, choose LA_State.gdb and then LA_Trt; for Output Feature Class, navigate to the geodatabase LA_State.gdb and name the feature TrtPt; and check the option Inside. Click OK to execute the tool.

Step 3. Computing Euclidean distances: In ArcToolbox, choose Analysis Tools > Proximity > Point Distance >. In the dialog window, choose TrtPt as Input Features and Hosp as Near Features; for Output Table, choose the geodatabase LA_State.gdb and name the table Dist; make sure that the unit is meters, and click OK to execute the tool. There is no need to define a search radius as all distances are needed. Note that there are 1148 (tracts) × 10 (hospitals) = 11,480 records in the distance table, where the field INPUT_FID inherits the values of OBJECTID_1 in the attribute table of TrtPt, the field NEAR_FID inherits the values of OBJECTID in the attribute table of Hosp, and the field DISTANCE is the Euclidean distance between them in meters.


As explained earlier, Manhattan distance is only useful in an intraurban setting, where a study area is much smaller and the effect of the earth's curvature may be negligible. It is not an appropriate measure at a regional scale, as is the case here. The computation of Manhattan distances in the following steps 4–6 is only for demonstrating its implementation in ArcGIS and is indicated as "optional."

Step 4. Optional: Adding XY coordinates for tract centroids and hospitals: In ArcToolbox, choose Data Management Tools > Features > Add XY Coordinates > choose TrtPt as Input Features, and click OK. In the attribute table of TrtPt, results are saved in the fields POINT_X and POINT_Y. Repeat the same process to add XY coordinates for the layer Hosp. Since both layers are in a projected coordinate system, the XY coordinates are added in the projected coordinate system as well (in meters).

Step 5. Optional: Using an attribute join to attach the coordinates to the distance table: In ArcMap, right-click the table Dist in the layers window > choose Joins and Relates > Join >. In the dialog window shown in Figure 2.4, make sure that "Join attributes from a table" is chosen in the top box; for 1, choose the field INPUT_FID (which is from the table Dist); for 2, choose the layer TrtPt (actually its attribute table); for 3, choose the field OBJECTID_1 (which is from the attribute table of TrtPt); and click OK.

FIGURE 2.4 Dialog window for an attribute join.


This step uses an attribute join to join the attribute table of TrtPt (commonly referred to as the "source table") to the distance table Dist (referred to as the "target table") based on the common keys: INPUT_FID in the table Dist and OBJECTID_1 in the attribute table of TrtPt. For simplicity, such an attribute join operation will be stated in this book simply as: "join TrtPt (source table) to Dist (target table)." Unlike a spatial join that results in a new layer, an attribute join generates an expanded target table with no new table created.* Similarly, use the same tool to join the attribute table of Hosp to the table Dist based on the common keys: NEAR_FID in the table Dist and OBJECTID in the table of Hosp. Now the expanded table Dist contains the attributes from both TrtPt and Hosp, whose XY coordinates will be used in the next step.

Step 6. Optional: Computing Manhattan distances: In the updated table Dist, add a field ManhDist (as Type Float) and calculate it as ManhDist = abs([TrtPt.POINT_X] - [Hosp.POINT_X]) + abs([TrtPt.POINT_Y] - [Hosp.POINT_Y]). Refer to steps 4 and 5 in Section 1.3.1 for a detailed illustration of adding and calculating a field. The formula implements the definition of Manhattan distance in Equation 2.3. Note that a field name such as TrtPt.POINT_X is composed of its source table name (TrtPt) and the field name (POINT_X) for clarity. The Manhattan distances are in meters, and are always larger than (or in rare cases, equal to) the corresponding Euclidean distances.
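Outside ArcGIS, the same Euclidean and Manhattan distance matrices can be produced in a few lines of NumPy once the projected x-y coordinates are exported. The two coordinate arrays below are placeholders standing in for the tract centroids and hospitals; they are not the actual case study coordinates.

# Euclidean and Manhattan distance matrices from projected x-y coordinates (meters).
# The coordinate arrays are placeholders, not the actual TrtPt and Hosp coordinates.
import numpy as np

tracts = np.array([[675200., 3350100.], [681450., 3346900.]])                          # n x 2
hospitals = np.array([[672000., 3348000.], [690000., 3340000.], [700500., 3355000.]])  # m x 2

dx = tracts[:, 0][:, None] - hospitals[:, 0][None, :]   # n x m differences in x
dy = tracts[:, 1][:, None] - hospitals[:, 1][None, :]   # n x m differences in y

euclidean = np.sqrt(dx**2 + dy**2)        # comparable to the Point Distance output
manhattan = np.abs(dx) + np.abs(dy)       # Equation 2.3; never smaller than Euclidean
print(euclidean.round(1))
print(manhattan.round(1))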

2.4.2 Part 2: Measuring Travel Time

As network analysis often takes significant computation power and time, this project uses the main roads (LA_MainRd) instead of all streets as the road network for computing travel time between census tracts and hospitals. The main roads are extracted from the streets layer and include the Interstate, U.S., and state highways with Feature Class Codes (FCC) A11-A38. The spatial extent of roads in LA_MainRd is slightly larger than the state of Louisiana to permit some routes through the roads beyond the state borders.

Step 7. Preparing the road network source layer: The source feature dataset must have fields representing network impedance values such as distance and travel time. ArcGIS recommends naming these fields after the units of impedance (e.g., Meters and Minutes) so that they can be detected automatically. In ArcMap, open the attribute table of the feature class LA_MainRd, and add fields Meters and Minutes (as Type Float). Right-click the field Meters > Calculate Geometry > choose Length for Property and Meters for Units to calculate the length of each road segment. Right-click the field Minutes and calculate it as "= 60*([Meters]/1609)/[SPEED]." Note that the unit of speed is miles per hour, and the formula returns the travel time in minutes. Exit ArcMap.

An attribute join can also be accessed in ArcToolbox > Data Management Tools > Joins > Add Join. The join is temporary as the join fields from the source table will be removed from the target table when the project is closed. To make the join permanent, use ArcToolbox > Data Management Tools > Joins > Join Field.


Step 8. Building the network dataset in ArcCatalog: In ArcCatalog, choose Customize from the main menu > Extensions > make sure that Network Analyst is checked, and close the Extensions dialog. This turns on the Network Analyst module. In ArcCatalog, navigate to your project folder, right-click the feature dataset (not the single feature under it) LA_MainRd > New > Network Dataset. Follow the steps below to complete the process*:

• Name the New Network Dataset as LA_MainRd_ND, and click Next.
• Select the feature class that will participate in the network dataset: LA_MainRd, and click Next.
• In the dialog window for modeling turns, choose No, and click Next.
• In the defining connectivity window, click Next.
• In the modeling elevation window, choose None as our dataset contains no elevation fields, and click Next.
• In the window for specifying the attributes, Network Analyst searches for and assigns the relevant fields: Meters, Minutes, Oneway, and RoadClass > Choose Oneway and click Remove to remove this restriction > Click Next.
• In the window for establishing driving directions, choose No, and click Next.
• Click Finish to close the summary window.
• Click Yes to build the new network dataset.

The new network dataset LA_MainRd_ND and a new feature class LA_MainRd_ND_Junctions become part of the feature dataset LA_MainRd. Right-click the network dataset LA_MainRd_ND > Properties, review the Sources, Connectivity, Elevation, Attributes, and Directions defined above, and use Reset to modify any if needed.

Step 9. Starting Network Analyst in ArcMap: In ArcMap, to display the Network Analyst toolbar, choose Customize from the main menu > Toolbars > check Network Analyst. Add the feature dataset LA_MainRd to the active layers. All features associated with it (LA_MainRd_ND_Junctions, LA_MainRd, and LA_MainRd_ND) are displayed. Also add the layers for the origins (O) and the destinations (D): features TrtPt and Hosp.†



The source feature dataset does not contain reliable network attributes such as turn or one-way restrictions, elevation, accessible intersections, and so on. The above definitions are chosen for simplicity and also reduce the computation time in network analysis. Both the origin and destination features are outputs from Part 1 of this case study. However, we use the features saved in the geodatabase LA_State.gdb in the data folder instead of standalone shapefiles. A geodatabase indexes its feature ids (e.g., OBJECTID) beginning with 1, whereas a shapefile indexes its feature ids (e.g., FID) from 0. The former is consistent with the features used in the Network Analyst module, and thus makes it easy to join with the results from network analysis, as illustrated in Section 4.4.


Step 10. Activating the O-D Cost Matrix tool: On the Network Analyst toolbar, click the Network Analyst drop-down menu and choose New OD Cost Matrix. A composite network analysis layer "OD Cost Matrix" (with six empty classes: Origins, Destinations, Lines, Point Barriers, Line Barriers, and Polygon Barriers) is added to the layers window under Table Of Contents. Note that another layer OD Cost Matrix is added next to the three features associated with the network layer LA_MainRd_ND.* The same six empty classes are also added to the Network Analyst window under "OD Cost Matrix" (if not shown, click the icon next to the Network Analyst drop-down menu to activate it).

Step 11. Defining origins and destinations: In the Network Analyst window under "OD Cost Matrix,"

• Right-click Origins (0) > Load Locations >. In the dialog window, for Load From, choose TrtPt; for Sort Field, select OBJECTID_1; for Location Position, choose Use Geometry and set the Search Tolerance to 55,000 meters†; and click OK. 1148 origins are loaded.
• Right-click Destinations (0) > Load Locations >. In the dialog window, similarly, choose Hosp for Load From, select OBJECTID for Sort Field, keep the default search tolerance of 5,000 meters under Use Geometry, and click OK. 10 destinations are loaded.‡

Step 12. Computing the O-D cost (travel time) matrix: On the Network Analyst toolbar, click the Solve button. The solution is saved in the layer Lines under Table Of Contents or in the Network Analyst window. Right-click either one > Open Attribute Table. The table contains the fields OriginID, DestinationID, and Total_Minutes, representing the origin's ID (consistent with field OBJECTID_1 in the feature TrtPt), the destination ID (consistent with field OBJECTID in the feature Hosp), and the total minutes between them, respectively. On the open table, click the Table Options dropdown icon > Export > choose "dBASE Table" for the output type, and name the table ODTime.dbf. Some of the original fieldnames are truncated in the dBASE table (e.g., Destinatio for DestinationID, and Total_Minu for Total_Minutes). Save the project for future reference.

*





If it is also desirable to record the travel distances, right-click the layer OD Cost Matrix > Properties > under Accumulation, check both Meters and Minutes > OK. In essence, a route between an origin and a destination is composed of three segments: the segment from the origin to its nearest junction on the network (as the crow flies), the segment from the destination to its nearest junction on the network (as the crow flies), and the segment between these two junctions through the network. In the study area, many census tract centroids (particularly those in the open coastal area in the southeast) are far from the road network (e.g., >60 km). The default search tolerance (5000 m) would not be able to assign these tracts to their nearest junctions on the network and would result in a status of "Unlocated," but it is sufficient for loading the hospitals. One downside of a large search distance is the increased computation time. Do not load the origins and destinations multiple times, which would add duplicated records to the analysis.


2.5 SUMMARY

This chapter covers three basic spatial analysis tasks:

1. Measuring Euclidean distances
2. Measuring Manhattan distances
3. Measuring network distances or travel time

It also introduces the important GIS skill of geocoding, based on either geographic coordinates or street addresses. The value of measuring distance and time can be appreciated by the broad applications of the distance decay rule. Both Euclidean and Manhattan distances are fairly easy to obtain in GIS. Computing network distances or travel time requires road network data and also takes more steps to implement. Several projects in other chapters need to compute Euclidean distances, network distances, or travel time, and thus provide additional practice for developing this basic skill in spatial analysis.

APPENDIX 2A: VALUED GRAPH APPROACH TO THE SHORTEST ROUTE PROBLEM

The valued graph, or L-matrix, provides another way to solve the shortest route problem (Taaffe et al., 1996, pp. 272–275). For example, a network is shown in Figure A2.1. The network resembles the highway network in north Ohio, with node 1 for Toledo, 2 for Cleveland, 3 for Cambridge, 4 for Columbus, and 5 for Dayton. We use a matrix L1 to represent the network, where each cell is the distance on a direct (one-step) link. If there is no direct link between two nodes, the entry is M (a very large number). We enter 0 for all diagonal cells L1(i, i) because the distance from a node to itself is 0.

FIGURE A2.1 A valued-graph example. (The network connects node 1 = Toledo, 2 = Cleveland, 3 = Cambridge, 4 = Columbus, and 5 = Dayton, with link distances 1-2 = 116, 1-5 = 155, 2-3 = 113, 2-4 = 142, 3-4 = 76, and 4-5 = 77.)

L1 (one-step) matrix:

Nodes   1     2     3     4     5
1       0     116   M     M     155
2       116   0     113   142   M
3       M     113   0     76    M
4       M     142   76    0     77
5       155   M     M     77    0

Two-step connection 1-3:
(1,1) + (1,3) = 0 + M = M
(1,2) + (2,3) = 116 + 113 = 229
(1,3) + (3,3) = M + 0 = M
(1,4) + (4,3) = M + 76 = M
(1,5) + (5,3) = 155 + M = M

L2 (two-step) matrix:

Nodes   1     2     3     4     5
1       0     116   229   232   155
2             0     113   142   219
3                   0     76    153
4                         0     77
5                               0


The next matrix L2 represents two-step connections. All cells in L1 with values other than M remain unchanged because, in this example, no two-step connection is shorter than a one-step (direct) link. We only need to update the cells with the value M. For example, L1(1, 3) = M needs to be updated. All possible two-step links are examined:

L1(1, 1) + L1(1, 3) = 0 + M = M
L1(1, 2) + L1(2, 3) = 116 + 113 = 229
L1(1, 3) + L1(3, 3) = M + 0 = M
L1(1, 4) + L1(4, 3) = M + 76 = M
L1(1, 5) + L1(5, 3) = 155 + M = M

The cell value L2(1, 3) is the minimum of all the above links, which is L1(1, 2) + L1(2, 3) = 229. Note that it records not only the shortest distance from 1 to 3, but also the route (through node 2). Similarly, other cells are updated such as L2(1, 4) = L1(1, 5) + L1(5, 4) = 155 + 77 = 232, L2(2, 5) = L1(2, 4) + L1(4, 5) = 142 + 77 = 219, L2(3, 5) = L1(3, 4) + L1(4, 5) = 76 + 77 = 153, and so on. The final matrix L2 is shown in Figure A2.1. By now, all cells in L2 have values other than M, and the shortest route problem is solved. Otherwise, the process continues until all cells have values other than M. For example, L3 would be computed as

L3(i, j) = min{L1(i, k) + L2(k, j), for all k}
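The valued-graph procedure amounts to repeatedly applying a "min-plus" update to the distance matrix until no cell remains at M. A minimal Python sketch, using the Figure A2.1 distances, follows; it is an illustration rather than part of the book's programs.

# Valued-graph (L-matrix) sketch: L2(i,j) = min over k of L1(i,k) + L1(k,j), then L3, ...
# repeated until all cells are finite. Distances reproduce Figure A2.1.
M = float("inf")
L1 = [[0, 116, M, M, 155],
      [116, 0, 113, 142, M],
      [M, 113, 0, 76, M],
      [M, 142, 76, 0, 77],
      [155, M, M, 77, 0]]

def min_plus(A, B):
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

L = L1
while any(M in row for row in L):
    L = min_plus(L1, L)          # produces L2, L3, ... until every cell is finite

print(L[0])   # first row: [0, 116, 229, 232, 155], matching L2 in Figure A2.1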

APPENDIX 2B: ESTIMATING TRAVEL TIME MATRIX BY GOOGLE MAPS API

In 2005, Google launched the Google Maps API, a JavaScript API that allows customization of online maps. The Google Maps API enables one to estimate travel time without reloading the web page or displaying portions of the map. A Python program TravelTime.py (under the data folder "Louisiana/Scripts") was developed to use the Google Maps API for computing an O-D travel time matrix (Wang and Xu, 2011).* The program reads the origin (O) and destination (D) layers of point features in a geographic coordinate system. The data are fed into a Python tool that automates the process of estimating the travel time between one origin and one destination at a time by calling the Google Directions API. As long as the input features are in geographic coordinates, there is no need to have the coordinates physically residing in their attribute tables; the Arcpy module automatically extracts the location information from the input features. The iterations stop when the program reaches the last O-D combination. The result is saved in an ASCII file (with the file extension .txt) listing each origin, each destination, and the travel time (in seconds) between them.

Suggested citation when using this tool: Wang, F. and Y. Xu. 2011. Estimating O-D matrix of travel time by Google Maps API: Implementation, advantages and implications. Annals of GIS 17, 199–209.


FIGURE A2.2 Dialog window for defining Toolbox properties.

The program can be added as a tool under ArcToolbox by following the procedure below:

1. In ArcToolbox, right-click ArcToolbox > Add Toolbox. In the dialog window "Add Toolbox," navigate to your project folder, click the New Toolbox icon, and rename the default name "Toolbox.tbx" to "Google API."
2. Right-click the newly created toolbox "Google API" > Add > Script. In the dialog window "Add Script," enter the Name (ODTime), Label, and Description, and click Next > navigate to choose the Script File TravelTime.py and click Next > in the dialog window shown in Figure A2.2, enter "Origins" for Display Name, choose "Feature Class" for Data Type, and note that the default setting for Direction under Parameter Properties is "Input" (which is correct); similarly, enter "Destinations" and its Data Type "Feature Class" (and also "Input" as Direction); enter "ODTime" and its Data Type "Text File," and change the setting for Direction under Parameter Properties to "Output"; click Finish to complete creating the tool.


FIGURE A2.3 Google Maps API tool user interface for computing O-D travel time matrix.

3. Right-click the toolbox "Google API" > Save As to save it.*
4. Double-click the newly added tool ODTime to run it. As shown in Figure A2.3, a user only needs to define two point features in geographic coordinates as inputs (e.g., for both "Origins" and "Destinations," choose the hospitals layer Hosp_geo in a geographic coordinate system obtained from step 1 in Section 2.4.1) and a text file name for the result (e.g., for "ODTime," name it H2H_Time.txt).

The total computation time depends on the size of the O-D travel time matrix, the preset pause time between requests (3 s in our program), and the Internet connection speed.† In our test, it took approximately 5.5 minutes to run the example above with 100 O-D trips.

In an experiment, we computed the travel time between the city center and each census tract centroid in Baton Rouge, and found that travel time by the Google Maps API approach was consistently longer than that by the ArcGIS Network Analyst approach (Wang and Xu, 2011), as shown in Figure A2.4. The estimated travel time by either method correlates well with the (Euclidean) distance from the city center (with R² = 0.91 for both methods). However, the regression model of travel time against corresponding distances by Google has a significant intercept of 4.68 minutes (vs. a negligible 0.79-minute intercept in the model of travel time by ArcGIS). The 4.68-minute intercept by Google probably reflects the elements of starting and ending time on a trip (getting on and off a street on the road network). This is consistent with our daily travel experience and also with empirical data (Wang, 2003, p. 258). The regression model for the travel time by Google also has a slightly steeper slope (1.06) than the slope in the model for the travel time by ArcGIS (0.96), but this difference is minor.

Some advantages of using the Google Maps API approach include the convenience of not preparing a road network dataset and the ability to tap into Google's updated road data and routing algorithm accounting for the impact of traffic.



If you want to have this toolbox loaded by default every time you open ArcGIS, right-click on a blank area inside the toolbox tab > Save Settings > To Default. Use of the Google Geocoding API is subject to a query limit of 2500 geolocation requests per day to prevent abuse (http://code.google.com/apis/maps/documentation/geocoding/). A licensed Premier user may perform up to 100,000 requests per day (http://code.google.com/apis/maps/documentation/ distancematrix/).

FIGURE A2.4 Estimated travel time by ArcGIS and Google. (Travel time from the city center, in minutes, plotted against distance from the center, in km; fitted lines: Google, y = 1.06x + 4.68; ArcGIS, y = 0.96x + 0.79; R² = 0.91 for both.)

There are also some major limitations. The most important is that Google's query limit for geolocation requests (currently 2500 per day) may not meet the demand of many spatial analysis tasks. The computation time also seems long. Another drawback for many advanced researchers is that a user has neither control over the data quality nor any editing rights. Furthermore, the tool can only generate the current travel time, and thus is not suitable for work that needs travel times for a past period.
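To make the workflow described in this appendix concrete, the sketch below shows only the O-D looping, throttling, and output logic of such a program; request_travel_time() is a hypothetical placeholder for the actual call to the routing web service, whose request format, key requirements, and query limits should be checked against the current Google Maps API documentation.

# Skeleton of an O-D travel time matrix builder in the spirit of TravelTime.py.
# request_travel_time() is a hypothetical placeholder for the web-service call;
# consult the current API documentation for the real request format.
import time

def request_travel_time(origin, destination):
    raise NotImplementedError("replace with a call to the routing web service")

def build_od_matrix(origins, destinations, out_file, pause=3):
    with open(out_file, "w") as f:
        f.write("origin_id,destination_id,seconds\n")
        for oid, o_coord in origins.items():
            for did, d_coord in destinations.items():
                seconds = request_travel_time(o_coord, d_coord)
                f.write(f"{oid},{did},{seconds}\n")
                time.sleep(pause)   # throttle requests (3 s in the book's program)

# Example call (coordinates are placeholders):
# build_od_matrix({1: (30.45, -91.19)}, {1: (30.41, -91.18)}, "H2H_Time.txt")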

3 Spatial Smoothing and Spatial Interpolation

This chapter covers two generic tasks in GIS-based spatial analysis: spatial smoothing and spatial interpolation. Spatial smoothing and spatial interpolation are closely related, and both are useful in visualizing spatial patterns and highlighting spatial trends. Some methods (e.g., kernel density estimation) can be used in either spatial smoothing or interpolation. There is a variety of spatial smoothing and spatial interpolation methods; this chapter covers those that are most commonly used. Conceptually similar to moving averages (e.g., smoothing over a longer time interval), spatial smoothing computes averages using a larger spatial window. Section 3.1 discusses the concepts and methods for spatial smoothing. Spatial interpolation uses known values at some locations to estimate unknown values at other locations. Section 3.2 presents several popular point-based spatial interpolation methods. Section 3.3 uses a case study of place names in southern China to illustrate some basic spatial smoothing and interpolation methods. Section 3.4 discusses area-based spatial interpolation, which estimates data for one set of areal units with data for a different set of areal units. Area-based interpolation is useful for aggregation and integration of data based on different areal units. Section 3.5 presents a case study on spatial interpolation between different census areal units. The chapter is concluded with a brief summary in Section 3.6.

3.1 SPATIAL SMOOTHING

Similar to moving averages that are calculated over a longer time interval (e.g., 5-day moving average temperatures), spatial smoothing computes the value at a location as the average of its nearby locations (defined in a spatial window) to reduce spatial variability. Spatial smoothing is a useful method for many applications. One is to address the small population problem, which will be explored in detail in Chapter 9. The problem occurs for areas with small populations, where the rates of rare events such as cancer or homicide are unreliable because of random errors associated with small numbers. The occurrence of one case can give rise to unusually high rates in some areas, whereas the absence of cases leads to a zero rate in many areas. Another application is examining spatial patterns of point data by converting discrete point data into a continuous density map, as illustrated in Part 1 under Case Study 3A. This section discusses two common spatial smoothing methods (the floating catchment area method and kernel density estimation), and Appendix 3A introduces the empirical Bayes estimation.


3.1.1 Floating Catchment Area (FCA) Method

The floating catchment area (FCA) method draws a circle or square around a location to define a filtering window, and uses the average value (or density of events) within the window to represent the value at the location. The window moves across the study area until averages at all locations are obtained. The average values have less variability, and are thus spatially smoothed values. The FCA method may also be used for other purposes such as accessibility measures (see Section 5.2).

Figure 3.1 shows part of a study area with 72 grid-shaped tracts. The circle around tract 53 defines the window containing 33 tracts (a tract is included if its centroid falls within the circle), and therefore the average value of these 33 tracts represents the spatially smoothed value for tract 53. A circle of the same size around tract 56 includes another set of 33 tracts that defines a new window for tract 56. The circle centers on each tract centroid and moves across the whole study area until smoothed values for all tracts are obtained. Note that windows near the borders of a study area do not include as many tracts, and cause a lesser degree of smoothing. Such an effect is referred to as an edge effect.

The choice of window size has a significant effect on the result. A larger window leads to stronger spatial smoothing, and thus better reveals regional rather than local patterns; a smaller window generates the reverse effects. One needs to experiment with different sizes and choose one with balanced effects. Furthermore, the window size may vary to maintain a certain degree of smoothing across a study area. For example, for mapping a reliable disease rate, it is desirable to have the rate estimate in a spatial window including a population above a certain threshold (or of a similar size). In order to do so, one has to vary the spatial window size to obtain a comparable base population.

FIGURE 3.1 Floating catchment area (FCA) method for spatial smoothing. (A grid of 72 tracts numbered 11–98; circles of equal size around tracts 53 and 56 define their filtering windows.)


This is a strategy adopted in adaptive spatial filtering (Tiwari and Rushton, 2004; Beyer and Rushton, 2009).

Implementing the FCA method in ArcGIS is demonstrated in detail in Section 3.3.1. We first compute the distances between all objects, and then extract the distances less than or equal to the threshold distance.* In ArcGIS, we then summarize the extracted distance table by computing the average values of attributes by origins. Since the table contains only distances within the threshold, only those objects (destinations) within the window are included and form the catchment area in the summarization operation. This eliminates the need for programming that implements iterations of drawing a circle and searching for objects within the circle.
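The extract-and-summarize logic just described translates directly into a few lines of pandas once the pairwise distance table is exported. In the sketch below, the column names mirror the Point Distance output fields used later in this book (INPUT_FID, NEAR_FID, DISTANCE), while the values and the attribute table are placeholders.

# FCA smoothing sketch: average an attribute over all neighbors within a threshold distance.
# Column names mirror the Point Distance output; the values are placeholders.
import pandas as pd

dist = pd.DataFrame({"INPUT_FID": [1, 1, 1, 2, 2, 3],
                     "NEAR_FID":  [1, 2, 3, 1, 2, 3],
                     "DISTANCE":  [0., 8000., 12000., 8000., 0., 0.]})
attr = pd.DataFrame({"OBJECTID": [1, 2, 3], "value": [1, 0, 1]})

window = 10000   # threshold (10 km); vary it to control the degree of smoothing
joined = dist.merge(attr, left_on="NEAR_FID", right_on="OBJECTID")
within = joined[joined["DISTANCE"] <= window]
smoothed = within.groupby("INPUT_FID")["value"].mean()   # spatially smoothed values
print(smoothed)

The same grouping-and-averaging step yields, for example, the Zhuang place-name ratio computed in Section 3.3.1.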

3.1.2 Kernel Density Estimation

The kernel density estimation (KDE) bears some resemblance to the FCA method. Both use a filtering window to define neighboring objects. Within the window, the FCA method does not differentiate far and nearby objects, whereas the kernel density estimation weighs nearby objects more than far ones. The method is particularly useful for analyzing and displaying point data. The occurrences of events are shown as a map of scattered (discrete) points, which may be difficult to interpret. The kernel density estimation generates a density of the events as a continuous field, and thus highlights areas of concentrated events as peaks and areas of lower densities as valleys. The method may also be used for spatial interpolation.

A kernel function looks like a bump centered at each point xi and tapering off to 0 over a bandwidth or window. See Figure 3.2 for illustration. The kernel density at point x at the center of a grid cell is estimated to be the sum of bumps within the bandwidth:

fˆ(x) = (1/(nh^d)) Σ_{i=1}^{n} K((x - xi)/h)

FIGURE 3.2 Kernel density estimation. (A kernel function K( ) centered at each data point xi, tapering off over the bandwidth, summed over a grid.)

*

One may use the threshold distance to set the search radius in distance computation, and directly obtain the distances within the threshold. However, starting with a table for distances between all objects gives us the flexibility of experimenting with various window sizes.


where K( ) is the kernel function, h is the bandwidth, n is the number of points within the bandwidth, and d is the data dimensionality. Silverman (1986, p. 43) provides some common kernel functions. For example, when d = 2, a commonly used kernel function is defined as

fˆ(x) = (1/(nh²π)) Σ_{i=1}^{n} [1 - ((x - xi)² + (y - yi)²)/h²]²    (3.1)

where √[(x - xi)² + (y - yi)²] measures the deviation in x-y coordinates between points (xi, yi) and (x, y). Similar to the effect of window size in the FCA method, larger bandwidths tend to highlight regional patterns, and smaller bandwidths emphasize local patterns (Fotheringham et al., 2000, p. 46).

ArcGIS has a built-in tool for kernel density estimation (KDE). To access the tool, make sure that the Spatial Analyst extension is turned on by going to Customize on the main menu bar and selecting Extensions. The KDE tool can be accessed in ArcToolbox > Spatial Analyst Tools > Density > Kernel Density.
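A minimal NumPy sketch of Equation 3.1, evaluating the quartic kernel density at one location from a handful of made-up event points, follows; the coordinates and bandwidth are illustrative only.

# Kernel density at a location (x, y) using the kernel of Equation 3.1.
# Event coordinates and the bandwidth are made-up values for illustration.
import numpy as np

events = np.array([[2.0, 3.0], [4.5, 1.0], [3.0, 3.5], [9.0, 9.0]])
h = 5.0   # bandwidth

def kernel_density(x, y, points, h):
    d2 = (x - points[:, 0])**2 + (y - points[:, 1])**2
    inside = d2 <= h**2               # only points within the bandwidth contribute
    n = inside.sum()
    if n == 0:
        return 0.0
    return np.sum((1 - d2[inside] / h**2)**2) / (n * h**2 * np.pi)

print(kernel_density(3.0, 3.0, events, h))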

3.2 POINT-BASED SPATIAL INTERPOLATION

Point-based spatial interpolation includes global and local methods. A global interpolation utilizes all points with known values (control points) to estimate an unknown value. A local interpolation uses a sample of control points (e.g., points within a certain distance) to estimate an unknown value. As Tobler's (1970) first law of geography states, "everything is related to everything else, but near things are more related than distant things." The choice of global versus local interpolation depends on whether faraway control points are believed to have influence on the unknown values to be estimated. There are no clear-cut rules for choosing one over the other. One may consider the scale from global to local as a continuum. A local method may be chosen if the values are most influenced by control points in a neighborhood. A local interpolation also requires less computation than a global interpolation (Chang, 2004, p. 277). One may use validation techniques to compare different models. For example, the control points can be divided into two samples: one sample is used for developing the models, and the other sample is used for testing the accuracy of the models. This section surveys two global interpolation methods briefly, and focuses on three local interpolation methods.

3.2.1 Global Interpolation Methods

Global interpolation methods include trend surface analysis and regression models. Trend surface analysis uses a polynomial equation of x-y coordinates to approximate points with known values, such as z = f(x, y),


where the attribute value z is considered a function of the x and y coordinates (Bailey and Gatrell, 1995). For example, a cubic trend surface model is written as

z(x, y) = b0 + b1x + b2y + b3x² + b4xy + b5y² + b6x³ + b7x²y + b8xy² + b9y³

The equation is usually estimated by an ordinary least squares regression. The estimated equation is then used to project unknown values at other points. Higher-order models are needed to capture more complex surfaces, and in general yield higher R-square values (goodness-of-fit) or lower RMS.* However, a better fit for the control points is not necessarily a better model for estimating unknown values. Validation is needed to compare different models. If the dependent variable (i.e., the attribute to be estimated) is binary (i.e., 0 and 1), the model is a logistic trend surface model that generates a probability surface. A local version of trend surface analysis uses a sample of control points to estimate the unknown value at a location, and is referred to as local polynomial interpolation. ArcGIS offers up to a 12th-order trend surface model. To access the method, make sure that the Spatial Analyst extension is turned on; in ArcToolbox, choose Spatial Analyst Tools > Interpolation > Trend.

A regression model uses linear regression to find an equation that models a dependent variable based on several independent variables, and then uses the equation to estimate unknown points (Flowerdew and Green, 1992). Regression models can incorporate both spatial (not limited to x-y coordinates) and attribute variables, whereas trend surface analysis only uses x-y coordinates as predictors.
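As an illustration, the coefficients b0 through b9 of the cubic trend surface can be estimated by ordinary least squares on the polynomial terms of the control-point coordinates. The control points below are made-up numbers, not a real data set.

# Cubic trend surface fit by ordinary least squares (made-up control points).
import numpy as np

x = np.array([1., 3., 4., 6., 7., 9., 2., 5., 8., 10., 11., 12.])
y = np.array([2., 7., 1., 5., 9., 3., 11., 6., 10., 4., 8., 12.])
z = np.array([3., 8., 4., 9., 14., 7., 15., 10., 16., 9., 13., 20.])

# Design matrix with the ten cubic trend surface terms
X = np.column_stack([np.ones_like(x), x, y, x**2, x*y, y**2, x**3, x**2*y, x*y**2, y**3])
b, *_ = np.linalg.lstsq(X, z, rcond=None)

# Predict z at a new location (x0, y0)
x0, y0 = 6.5, 6.0
terms = np.array([1, x0, y0, x0**2, x0*y0, y0**2, x0**3, x0**2*y0, x0*y0**2, y0**3])
print(terms @ b)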

3.2.2 Local Interpolation Methods

The following discusses three popular local interpolators: inverse distance weighted, thin-plate splines, and Kriging.

The inverse distance weighted (IDW) method estimates an unknown value as the weighted average of its surrounding points, in which the weight is the inverse of distance raised to a power (Chang, 2004, p. 282). Clearly, the IDW utilizes Tobler's first law of geography. The IDW is expressed as

zu = [Σ_{i=1}^{s} zi diu^(-k)] / [Σ_{i=1}^{s} diu^(-k)],

where zu is the unknown value to be estimated at u, zi is the attribute value at control point i, diu is the distance between points i and u, s is the number of control points used in estimation, and k is the power. The higher the power, the stronger (faster) the effect of distance decay is (i.e., nearby points are weighted much higher than remote ones). In other words, distance raised to a higher power implies stronger localized effects.
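A minimal IDW sketch is given below; the control points, values, and the power k are illustrative numbers only.

# Inverse distance weighted (IDW) interpolation at an unknown location u.
# Control points, values, and the power k are illustrative.
import numpy as np

controls = np.array([[1.0, 1.0], [4.0, 2.0], [2.0, 5.0], [6.0, 6.0]])   # x, y of control points
values = np.array([10.0, 14.0, 8.0, 20.0])                              # z_i
u = np.array([3.0, 3.0])                                                 # location to estimate
k = 2.0                                                                  # distance power

d = np.sqrt(((controls - u) ** 2).sum(axis=1))
w = d ** (-k)                        # inverse distance weights
z_u = (w * values).sum() / w.sum()   # weighted average of surrounding values
print(z_u)

A larger k weights nearby control points more heavily, which corresponds to the stronger localized effects noted above.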

*

RMS (root mean square) is measured as RMS = √[Σ_{i=1}^{n} (z_{i,obs} - z_{i,est})²/n].


The thin-plate splines method creates a surface that predicts the values exactly at all control points, and has the least change in slope at all points (Franke, 1982). The surface is expressed as

z(x, y) = Σ_{i=1}^{n} Ai di² ln di + a + bx + cy,

where x and y are the coordinates of the point to be interpolated, di = √[(x - xi)² + (y - yi)²] is the distance from the control point (xi, yi), and Ai, a, b, and c are the n + 3 parameters to be estimated. These parameters are estimated by solving a system of n + 3 linear equations (see Chapter 10) such as

Σ_{i=1}^{n} Ai di² ln di + a + bxi + cyi = zi;

Σ_{i=1}^{n} Ai = 0,  Σ_{i=1}^{n} Ai xi = 0,  and  Σ_{i=1}^{n} Ai yi = 0.

Note that the first equation above represents n equations for i = 1, 2, …, n, and zi is the known attribute value at point i. The thin-plate splines method tends to generate steep gradients ("overshoots") in data-poor areas. Other methods such as spline with tension, completely regularized spline, multiquadric functions, and inverse multiquadric functions have been proposed to mitigate the problem (see Chang, 2004, p. 285). These advanced interpolation methods are grouped as radial basis functions (RBF).

Kriging (Krige, 1966) models the spatial variation as three components: a spatially correlated component, representing the regionalized variable; a "drift" or structure, representing the trend; and a random error. To measure spatial autocorrelation, Kriging uses the measure of semivariance (1/2 of variance):

γ(h) = (1/(2n)) Σ_{i=1}^{n} [z(xi) - z(xi + h)]²,

where n is the number of pairs of control points that are a distance (or spatial lag) h apart, and z is the attribute value. In the presence of spatial dependence, γ(h) increases as h increases. A semivariogram is a plot that shows the values of γ(h) along the y-axis and the distances h along the x-axis. Kriging fits the semivariogram with a mathematical function or model, and uses it to estimate the semivariance at any given distance, which is then used to compute a set of spatial weights. For instance, if the spatial weight for each control point i and a point s (to be interpolated) is Wis, the interpolated value at s is

zs = Σ_{i=1}^{ns} Wis zi,


where ns is the number of sampled points around the point s, and zs and zi are the attribute values at s and i, respectively. Similar to the kernel density estimation, Kriging can be used to generate a continuous field from point data.

In ArcGIS, all three basic local interpolation methods are available in the Spatial Analyst extension. To access the IDW (Kriging or thin-plate splines) method, go to ArcToolbox > Spatial Analyst Tools > Interpolation > IDW (Kriging or Spline). The four other advanced radial basis function (RBF) methods require the Geostatistical Analyst extension. In ArcMap, choose Customize from the main menu bar > Extensions > check the Geostatistical Analyst box, and close the dialog; also under Customize > Toolbars, check the Geostatistical Analyst box to add the toolbar. Click the Geostatistical Wizard icon and choose Radial Basis Functions under "Deterministic methods." Note that other basic interpolation methods such as Inverse Distance Weighting, Polynomial Interpolation, and Kriging are available in both Spatial Analyst and Geostatistical Analyst. If both extensions are available, the Geostatistical Analyst is recommended as it offers more information and a better interface (Chang, 2004, p. 298).
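To make the semivariance measure concrete, the sketch below computes an empirical semivariogram from scattered control points by grouping point pairs into distance (lag) bins; the sample coordinates, values, and bin edges are made-up.

# Empirical semivariogram: gamma(h) = (1/2n) * sum of squared value differences
# over the n point pairs whose separation falls in lag bin h. Sample data are made-up.
import numpy as np

pts = np.array([[0., 0.], [1., 0.], [0., 2.], [3., 1.], [2., 3.], [4., 4.]])
z = np.array([5., 6., 4., 9., 7., 12.])

# all pairwise separations and squared value differences
i, j = np.triu_indices(len(pts), k=1)
h = np.sqrt(((pts[i] - pts[j]) ** 2).sum(axis=1))
dz2 = (z[i] - z[j]) ** 2

bins = np.array([0., 1.5, 3., 4.5, 6.])
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (h >= lo) & (h < hi)
    if mask.any():
        gamma = dz2[mask].sum() / (2 * mask.sum())
        print(f"lag {lo:.1f}-{hi:.1f}: n={mask.sum()}, gamma={gamma:.2f}")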

3.3 CASE STUDY 3A: MAPPING PLACE NAMES IN GUANGXI, CHINA

This case study examines the distribution pattern of contemporary Zhuang place names (toponyms) in Guangxi, China, based on a study reported in Wang et al. (2012). Zhuang, part of the Tai language family, is the largest ethnic minority in China; most Zhuang live in the Guangxi Zhuang Autonomous Region (a provincial unit simply referred to as "Guangxi" here). The Sinification of ethnic minorities, such as the Zhuang, has been a long and ongoing historical process in China. The impact has been uneven in the preservation of Zhuang place names. The case study is chosen to demonstrate the benefit of using GIS in historical, linguistic, and cultural studies.

Mapping is a fundamental function of GIS. However, directly mapping Zhuang place names as in Figure 3.3 has limited value. Spatial analysis techniques such as spatial smoothing and spatial interpolation methods can help enhance the visualization of the spatial pattern of Zhuang place names.

A geodatabase GX.gdb is provided under the data folder China_GX, and includes the following features:

1. Point feature Twnshp for all townships in the region, with the field Zhuang identifying whether a place name is Zhuang (=1) or non-Zhuang (=0) (mostly Han).
2. Two polygon features County and Prov for the boundaries of the counties and the provincial unit, respectively.

3.3.1 Part 1: Spatial Smoothing by the Floating Catchment Area Method

We first test the floating catchment area method. Different window sizes are used to help identify an appropriate window size for an adequate degree of smoothing that balances the overall trends and local variability. Within the window around each place, the ratio of Zhuang place names among all place names is computed to represent the concentration of Zhuang place names around that place.


FIGURE 3.3 Zhuang and non-Zhuang place names in Guangxi, China.

In implementation, the key step is to utilize a distance matrix between any two places and extract the places that are within a specified search radius of each place.

Step 1. Direct mapping of Zhuang place names: In ArcMap, add all features in GX.gdb to the project, and right-click the layer Twnshp > Properties. In the Layer Properties window, click Symbology > Categories > Unique values > and click Add All Values (at the lower-left corner of the dialog window) to map the field Zhuang (as shown in Figure 3.3).

Step 2. Computing the distance matrix between places: In ArcToolbox, choose Analysis Tools > Proximity > Point Distance > choose Twnshp as both the Input Features and the Near Features, name the output table Dist_50km.dbf, type 50,000 for Search Radius, and click OK. Refer to step 3 in Section 2.4.1 for a detailed illustration of computing the Euclidean distance matrix. By defining a wide search radius of 50 km, the distance table allows us to experiment with various window sizes ≤50 km. In the table Dist_50km.dbf, the field INPUT_FID identifies the "from" (origin) place, and NEAR_FID identifies the "to" (destination) place.

Step 3. Attaching attributes of Zhuang place names to the distance matrix: In ArcMap, right-click the table Dist_50km in the layers window > choose Joins and Relates > Join. In the dialog window, make sure that "Join attributes from a table" is chosen in the top box; for 1, choose the field NEAR_FID; for 2, choose the layer Twnshp; for 3, choose the field OBJECTID; and click OK. Refer to step 5 of Section 2.4.1 for a detailed illustration of using an attribute join.


Hereafter in the book, such an attribute join operation is stated simply as: "join the attribute table of Twnshp (based on the field OBJECTID) to the target table Dist_50km.dbf (based on the field NEAR_FID)." The expanded table Dist_50km.dbf now contains the field Zhuang, which identifies each destination place as either Zhuang or non-Zhuang.

Step 4. Selecting places within a window around each place: For example, we define the window size with a radius of 10 km. Open the table Dist_50km.dbf > click the Table Options tab > Select By Attributes > enter the condition "Dist_50km.DISTANCE <= 10000". For each origin place, only those destination places within 10 km are selected.

Step 5. Summarizing places within a window around each place: On the open table Dist_50km.dbf with the selected records highlighted, right-click the field INPUT_FID and choose Summarize. In the dialog window (Figure 3.4), for 1, Dist_50km.INPUT_FID shows as the field to summarize; for 2, expand the field Twnshp.Zhuang and check Sum as the summary statistic; for 3, name the output table Sum_10km.dbf; note that "Summarize on the selected records only" is checked; and click OK. Alternatively, one may access the Summarize tool in ArcToolbox > Analysis Tools > Statistics > Summary Statistics.

FIGURE 3.4 Dialog window for summarization.

The Summarize operation on a table is similar to the spatial analysis tool Dissolve on a spatial dataset as used in step 10 in Section 1.3.2. It aggregates the values of a field or fields (here only Zhuang) to obtain basic statistics such as minimum, maximum, average, sum, standard deviation, and variance (here only "sum") by the unique values of the field to summarize (here INPUT_FID). In the resulting table Sum_10km.dbf, the first field INPUT_FID is from the source table; Count_INPUT_FID is the number of distance records within 10 km of each unique INPUT_FID and thus the total number of places around each place; and Sum_Zhuang is the number of Zhuang place names within the range (because a Zhuang place name has a value of 1 and a non-Zhuang place name has a value of 0).

Step 6. Calculating Zhuang place name ratios around each place: Add a new field ZhuangR to the table Sum_10km.dbf, and calculate it as ZhuangR = [Sum_Zhuang]/[Cnt_INPUT_]. Note that Cnt_INPUT_ is the truncated field name for Count_INPUT_FID. This ratio measures the proportion of Zhuang place names among all place names within the window centered at each place.

Step 7. Mapping Zhuang place name ratios: Join the table Sum_10km.dbf to the layer Twnshp based on the common keys INPUT_FID in Sum_10km.dbf and OBJECTID in Twnshp. Right-click the layer Twnshp > Properties > Symbology > choose Quantities and then Proportional symbols to map the field ZhuangR, which represents the Zhuang place name ratios within a 10-km radius around each place across the study area, as shown in Figure 3.5. This completes the FCA method for spatial smoothing, which converts a binary variable Zhuang into a continuous numerical variable ZhuangR.

Step 8. Sensitivity analysis: Experiment with other window sizes such as 5 km and 15 km and repeat steps 4–7. Compare the results with Figure 3.5 to examine the impact of window size. A larger window size leads to stronger spatial smoothing.
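The window selection and ratio computation in steps 4–6 can also be scripted outside ArcGIS. The following is a minimal pandas sketch, assuming the distance table and the place attribute table have been exported to CSV files; the file names are hypothetical, and the field names mirror those described above.

import pandas as pd

# Hypothetical exports of the distance table (steps 2-3) and place attributes
dist = pd.read_csv("Dist_50km.csv")    # columns: INPUT_FID, NEAR_FID, DISTANCE (meters)
places = pd.read_csv("Twnshp.csv")     # columns: OBJECTID, Zhuang (1 = Zhuang, 0 = non-Zhuang)

radius = 10000  # 10-km window; use 5000 or 15000 for the sensitivity analysis in step 8

# Attach the Zhuang flag of each destination place to the distance records (step 3)
d = dist.merge(places[["OBJECTID", "Zhuang"]], left_on="NEAR_FID", right_on="OBJECTID")

# Keep records within the window, then count places and sum Zhuang flags by origin (steps 4-5)
win = d[d["DISTANCE"] <= radius]
summary = win.groupby("INPUT_FID")["Zhuang"].agg(Cnt="count", Sum_Zhuang="sum").reset_index()

# Ratio of Zhuang place names within the window around each place (step 6)
summary["ZhuangR"] = summary["Sum_Zhuang"] / summary["Cnt"]
print(summary.head())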

3.3.2 Part 2: Spatial Interpolation by Various Methods

Step 9. Using the kernel density estimation method in Spatial Analyst: In the dialog window of ArcToolbox > Spatial Analyst Tools > Density > Kernel Density, choose Twnshp for Input point or polyline features, choose Zhuang for Population field, name the Output raster KDE_20km, input 20,000 for Search radius (which defines the bandwidth in KDE as 20 km), and keep other default choices. Click Environments; under Environment Settings, for Processing Extent, choose "Same as layer County" for Extent (which uses the spatial extent of the layer County to define the rectangular extent for the raster); under Raster Analysis, also choose County as the Mask (which filters out cells outside the boundary of the layer County); and click OK to execute the tool. A raster layer KDE_20km within the study area is generated. The kernel function is the same as previously discussed in Equation 3.1. By default, estimated kernel densities are categorized into nine classes, displayed as different hues. Figure 3.6 is based on five classes and the natural breaks (Jenks) classification with the county layer as the background. The kernel density map shows the distribution of Zhuang place names as a continuous surface so that areas of concentrated Zhuang place names can be easily spotted. However, the density values simply indicate relative degrees of concentration and cannot be interpreted as a meaningful ratio like ZhuangR in the FCA method.

FIGURE 3.5 Zhuang place name ratios in Guangxi by the FCA method.

FIGURE 3.6 Kernel density of Zhuang place names in Guangxi.

Step 10. Using the inverse distance weighting method in Geostatistical Analyst: In ArcMap, make sure that the Geostatistical Analyst extension is turned on: choose Customize from the main menu > Extensions > check Geostatistical Analyst, and Close. Also, under Customize > Toolbars, check Geostatistical Analyst to add the toolbar. Click the Geostatistical Analyst dropdown arrow > choose Geostatistical Wizard > choose Inverse Distance Weighting under Deterministic methods > choose Twnshp for Source Dataset and Zhuang for Data Field, and click Next. Use the default power of 2, Standard neighborhood type, 15 maximum and 10 minimum neighboring points, and other default settings, and click Next. Note the statistics under Prediction Errors (e.g., root mean square = 0.39), click Finish and then OK. A layer Inverse Distance Weighting Prediction Map is generated. Right-click it > Export the surface to a raster > keep Inverse Distance Weighting as the Input geostatistical layer, name the Output surface raster IDW, and click Environment. Similar to step 9, use the layer County to define both the Extent and Mask (under Raster Analysis), click OK and then OK again. A surface raster IDW for the study area is generated (shown in Figure 3.7). Note that all interpolated values are within the same range as the original, that is, between 0 and 1.

FIGURE 3.7 Spatial interpolation of Zhuang place names in Guangxi by the IDW method.

Experiment with other methods (Global/Local Polynomial Interpolation, Kriging, Empirical Bayesian Kriging, etc.) available in the Geostatistical Wizard.
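For readers who prefer scripting, the sketch below implements plain inverse distance weighting in Python with the default power of 2. It is a simplified stand-in for the Geostatistical Wizard run above: it uses all known points rather than the 15/10 neighborhood settings, and the coordinates and values in the toy example are made up.

import math

def idw(known, grid_pts, power=2.0):
    """known: list of (x, y, value); grid_pts: list of (x, y); returns predicted values."""
    preds = []
    for gx, gy in grid_pts:
        num, den, exact = 0.0, 0.0, None
        for x, y, v in known:
            d = math.hypot(gx - x, gy - y)
            if d == 0:                   # prediction point coincides with a known point
                exact = v
                break
            w = 1.0 / d ** power         # inverse distance weight
            num += w * v
            den += w
        preds.append(exact if exact is not None else num / den)
    return preds

# Toy example: two Zhuang places (value 1) and one non-Zhuang place (value 0)
known = [(0, 0, 1), (10, 0, 1), (0, 10, 0)]
print(idw(known, [(5, 5)]))   # interpolated value stays within the 0-1 range of the input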

3.4 AREA-BASED SPATIAL INTERPOLATION

Area-based (areal) interpolation is also referred to as cross-area aggregation, which transforms data from one system of areal units (source zones) to another (target zones). A point-based method such as Kriging or polynomial trend surface analysis can also be used as a bridge to accomplish the transformation from one areal unit to another. Specifically, the source polygon dataset is represented by its centroids and is first interpolated to a grid surface by a point-based interpolation method; then the values of the grid cells are aggregated to predict the attributes of the target polygon dataset. In ArcGIS, the Areal Interpolation tool is accessed under the Geostatistical Wizard. This approach is essentially point-based interpolation. The following discusses a couple of area-based interpolation methods, among others (Goodchild et al., 1993).

The simplest and most widely used is the areal weighting interpolator (Goodchild and Lam, 1980). The method apportions the attribute value from each source zone to target zones according to the areal proportion, and assumes that the attribute value is evenly distributed within each of the source zones. In fact, steps 8–10 in Section 1.3.2 already use this assumption to interpolate population from one areal unit (census tracts) to another (concentric rings), and the interpolation is done by passing the "population density" variable through the various polygon features.

More advanced methods may be used to improve the interpolation if additional information for the study area is available and utilized. The following discusses an interpolation method termed Target-Density Weighting (TDW), developed by Schroeder (2007), which is particularly useful for interpolating census data in the US. With s and t denoting the source and target zones, respectively, the task is to interpolate the variable of interest y from s to t. The analysis begins with a spatial intersect operation between the source and target zones that produces the segments (atoms) denoted by st. Assume that additional information, such as the ancillary variable z in the target zones t, is also available. The method first distributes the ancillary variable z in the target zones t to the atoms st proportionally to area sizes. Next, unlike the simple areal weighting interpolation that assigns weights proportional to the areas of atoms, the TDW interpolation assigns weights according to the value of the ancillary variable z to honor its pattern. In other words, the TDW assumes that y is more likely to be consistent with the pattern of z than simply proportional to areas. Finally, it aggregates the interpolated y in the atoms st to the target zones t. The TDW can be expressed as

\hat{y}_t = \sum_s \hat{y}_{st} = \sum_s \frac{\hat{z}_{st}}{\hat{z}_s} y_s = \sum_s \frac{(A_{st}/A_t) z_t}{\sum_t (A_{st}/A_t) z_t} y_s,    (3.2)

where A denotes the area, and other notations are as stated above.


The TDW method has been shown to be more accurate than the simple areal weighting interpolation in the temporal analysis of census data (Schroeder, 2007). It can be easily implemented in ArcGIS by a series of "join" and "summarize" operations. Another area-based interpolation method is similar in concept to the TDW in that it also utilizes an ancillary variable, in this case the road network (see Appendix 3B).

3.5 CASE STUDY 3B: AREA-BASED INTERPOLATIONS OF POPULATION IN BATON ROUGE, LOUISIANA

Transforming data from one areal unit to another is a common task in spatial analysis for integrating data of different scales or resolutions. If the areal units are strictly hierarchical (e.g., a state is composed of multiple whole counties, and a county is composed of multiple census tracts), the transformation does not require any interpolation but simply aggregates data from one areal unit to another. By using a table linking the correspondence between the units, the task can be accomplished without GIS. There are also cases in which one areal unit is not completely contained in the other but is assumed so for simplicity when the approximation yields negligible errors. For example, step 13 in Section 1.3.2 assumes that census blocks are entirely contained in the concentric rings and uses a spatial join to aggregate the population data. This case study uses areal interpolation methods for better accuracy.

The areal interpolation methods illustrated in this case study are the areal weighting interpolation and the Target-Density Weighting (TDW) interpolation. Data needed for the project are provided in the geodatabase BR.gdb under the folder "BatonRouge," which contains

1. Feature BRTrtUtm for census tracts in the study area in 2010
2. Feature BRUnsd for unified school districts in the study area in 2010
3. Feature BRTrt2k for census tracts in the study area in 2000

Note that the feature BRTrtUtm is the output from step 3 in Section 1.3.1.

3.5.1 Part 1. Using the Areal Weighting Interpolation to Transform Data from Census Tracts to School Districts in 2010

Step 1. Overlaying the census tract and the school district layers: In ArcToolbox, choose Analysis Tools > Overlay > Intersect. In the dialog window, select BRTrtUtm and BRUnsd as Input Features; for the Output Feature, navigate to the geodatabase BR.gdb, name it TrtSD and save it under BR.gdb,* and click OK. In the attribute table of TrtSD, the field Area is the area size of each census tract inherited from BRTrtUtm, and the field Shape_Area is the area size of each newly intersected polygon ("atom").

* By saving the output feature in a geodatabase instead of a standalone shapefile, ArcGIS automatically creates two fields, Shape_Length and Shape_Area. The latter is used in the next step.

Step 2. Apportioning the attribute to area sizes in intersected polygons: On the open attribute table of TrtSD, add a field EstPopu, and compute it as EstPopu = [DP0010001]*[Shape_Area]/[Area]. This is the interpolated population for each polygon (atom) in the intersected layer by the areal weighting method.* For example, census tract 44.03 (GEOID10 = 22033004403), with Area = 27,778,145 and a population of 5489, is now split into two polygons: the northeast one in the Central Community School District (with Shape_Area = 22,585,614, or 81.3%), and the southwest one in the East Baton Rouge Parish School District (with Shape_Area = 5,192,531, or 18.7%). Therefore, the estimated population is 5489 × 22,585,614/27,778,145 = 4463 in the northeast polygon, and 1026 in the other part.

Step 3. Aggregating data to school districts: On the attribute table of TrtSD, right-click the field UNSDLEA (school district codes) and choose Summarize > select EstPopu (Sum) in the second box, and name the output table SD_Pop.dbf. The interpolated population for school districts is in the field Sum_EstPopu in the table SD_Pop.dbf, which may be joined to the layer BRUnsd for mapping or other purposes.

In Section 1.3.1, the population in the rings was reconstructed from population density, which is passed on from the tracts to the atoms. It makes the same assumption of uniform population distribution.
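For comparison, the apportion-and-aggregate logic of steps 2–3 can be written in a few lines of pandas, assuming the attribute table of the intersected layer has been exported to a CSV file (the file name is hypothetical) with the fields used above.

import pandas as pd

atoms = pd.read_csv("TrtSD.csv")   # one record per tract-district "atom"

# Step 2 analog: areal weighting assumes population is uniform within each tract
atoms["EstPopu"] = atoms["DP0010001"] * atoms["Shape_Area"] / atoms["Area"]

# Step 3 analog: aggregate the apportioned population to school districts
sd_pop = atoms.groupby("UNSDLEA", as_index=False)["EstPopu"].sum()
print(sd_pop)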

3.5.2 Part 2. Using the Target-Density Weighting (TDW) Interpolation to Interpolate Data from Census Tracts in 2010 to Census Tracts in 2000

It is fairly common that census tract boundaries in a study area change from one decennial year (say, 2000) to another (say, 2010). In order to map and analyze the population change over time, one has to convert the population in two different years into the same areal unit. The task here is to interpolate the 2010 population in the 2010 census tracts to the 2000 census tracts so that population change rates can be assessed in the 2000 census tracts. The 2000 population pattern is used as an ancillary variable to improve the interpolation. According to the TDW, the layers of 2000 and 2010 census tracts are intersected to create a layer of atoms, and the ancillary variable (z or 2000 population) in the target zone (t or 2000 census tracts) is firstly distributed (disaggregated) across the atoms (st) assuming uniformity, and then aggregated in the source zone (s, i.e., 2010 census tracts), producing the proportion ( zˆst /zˆs ). The variable of interest (y or 2010 population) in the source zone is subsequently disaggregated to atoms according to this proportion to honor the spatial distribution pattern of the ancillary variable (z or 2000 population). The disaggregated y (2010 population) is then *

For validation, add a field popu_valid, and calculate it as popu_valid = popuden*Shape_ Area /1000000, which should be identical to EstPopu. The areal weighting method assumes that population is distributed uniformly within each census tract, and thus a polygon in the intersected layer resumes the population density of the tract, of which the polygon is a component.


aggregated in the target zone (2000 census tracts) to complete the interpolation process. In summary, the TDW re-distributes the population of 2010 to match the 2000 census tract boundary in a manner to honor the historical distribution pattern of the population of 2000. Step 4. Overlaying the two census tract layers: Similar to step 1, use the Intersect tool to intersect BRTrtUtm and BRTrt2 k to obtain InterTrt (make sure it is saved under the Geodatabase BR.gdb). Step 5. Disaggregating 2000 population to the atoms: On the open attribute table of InterTrt, add a field EstP2000 and calculate it as =[POP2000]*[Shape _ Area]/[Area _ 1], where Shape _ Area is the area size of atoms and Area _ 1 is the area size of 2000 tracts. Similar to step 2 in Section 3.5.1, this proportions the 2000 population to the areas of atoms. The field EstP2000 is zˆst . Step 6. Aggregating estimated 2000 population to 2010 tracts: On the open attribute table of InterTrt, right-click the field GEOID10 (2010 tract codes) and choose Summarize > select EstP2000 (Sum) in the second box, and name the output table Trt2010 _ P2k.dbf. The field Sum _ EstP2000 is zˆs. Step 7. Computing the ratio of 2000 population in atoms out of 2010 tracts: Under Table Of Content, right-click the layer InterTrt > Joins and Relates > Join. Use an attribute join to join the table Trt2010_P2k to the attribute table of InterTrt by their shared common key GEOID10. On the open attribute table of InterTrt, add a field P2k_rat and calculate it as =[InterTrt.EstP2000]/[Trt2010_ P2k.Sum_EstP2000], which is the proportion ( zˆst /zˆs ). Step 8. Disaggregating 2010 population to atoms according to the ratio: On the open attribute table of InterTrt, add another field EstP2010 and calculate it as =[InterTrt.DP0010001]*[InterTrt.P2k _ rat]. Step 9. Aggregating estimated 2010 population to 2000 tracts: On the open attribute table of InterTrt, right-click the field STFID (2000 tract codes) and choose Summarize > select EstP2010 (Sum) in the second box, and name the output table Trt2k _ P2010.dbf, where the field Sum _ EstP2010 is the interpolated 2010 population in the 2000 tracts. Step 10. Mapping the population change rate from 2000 to 2010 in 2000 tracts: Use an attribute join to join the table Trt2k _ P2010.dbf to the layer BRTrt2 k based on their common key STFID. Add a field ChngRate to the attribute table of BRTrt2 k and calculate it as “([Trt2k _ P2010.Sum _ EstP2010][BRTrt2 k.POP2000])/[BRTrt2 k.POP2000].” Map the field ChngRate (population change rate) as shown in Figure 3.8. Areas near the city center lost significant population, and suburban areas experienced much growth, particularly in the northwest and south. The process is illustrated in a flow chart as shown in Figure 3.9. The above project design uses the 2000 population pattern as an ancillary variable to facilitate transforming the 2010 population in the 2010 tracts to the 2000 tracts. By doing so, the population data in two different years (2000 and 2010) are in the same areal unit so that the changes can be assessed in the 2000 tracts. Similarly, one may transform (interpolate) the 2000 population data to the 2010 tracts, and assess the changes in the 2010 tracts. In that case, the ancillary variable will be the 2010 population.

FIGURE 3.8 Population change rate in Baton Rouge 2000–2010.

FIGURE 3.9 Flow chart for implementing the TDW method.


3.6 SUMMARY

Skills learned in this chapter include the following:

1. Implementing the FCA method for spatial smoothing
2. Kernel density estimation for mapping point data
3. Trend surface analysis (including logistic trend surface analysis)
4. Local interpolation methods such as inverse distance weighting, thin-plate splines, and Kriging
5. Areal weighting interpolation
6. Target-density weighting interpolation

Spatial smoothing and spatial interpolation are often used for mapping spatial patterns, as in Case Study 3A on mapping Zhuang place names in southern China. However, surface mapping is merely descriptive, and identified spatial patterns such as concentrations (or the lack thereof) can be arbitrary. Which concentrations are statistically significant rather than random? The answer relies on rigorous statistical analysis, for example, spatial cluster analysis, a topic to be covered in Chapter 8 (Case Study 8B uses the same data set to identify spatial clusters of Zhuang place names).

Area-based spatial interpolation is often used to convert data from different sources into one areal unit for an integrated analysis. It is also used to convert data from a finer to a coarser resolution for examining the modifiable areal unit problem (MAUP). For example, in Case Study 6 on urban density functions, the technique is used to aggregate data from census tracts to townships so that functions based on different areal units can be compared.

APPENDIX 3A: EMPIRICAL BAYES ESTIMATION FOR SPATIAL SMOOTHING

Empirical Bayes estimation is another commonly used method for adjusting or smoothing variables (particularly rates) across areas (e.g., Clayton and Kaldor, 1987; Cressie, 1992). Based on the fact that the joint probability of two events is the product of the probability of one event and the probability of the second event conditional upon the first event, Bayesian inference may be expressed as the inclusion of prior information or belief about a data set in estimating the probability distribution of the data (Langford, 1994, p. 143), that is,

Likelihood function × Prior belief = Posterior belief.

Using a disease risk as an example, the likelihood function can be said to be the number of Poisson-distributed observed cases across a study area. The prior belief is on the distribution of relative risks (rates) conditional on the distribution of observed cases: for example, relative risks in areas of larger population size are likely to be more reliable than those in areas of smaller population size. In summary, (1) the mean rate in the study area is assumed to be reliable and unbiased, (2) rates for large


population are adjusted less than rates for small population, and (3) rates follow a known probability distribution.

Assume that a common distribution, the gamma, is used to describe the prior distribution of rates. The gamma distribution has two parameters, namely, the shape parameter α and the scale parameter ν, with mean = ν/α and variance = ν/α². The two parameters α and ν can be estimated by a mixed maximum likelihood and moments procedure discussed in Marshall (1991). For an area i with population P_i and k_i cases of the disease, the crude incidence rate for the area is k_i/P_i. It can be shown that the posterior expected rate, or empirical Bayes estimate, is

E_i = \frac{k_i + \nu}{P_i + \alpha}.

If area i has a small population size, the values of k_i and P_i are small relative to ν and α, and the empirical Bayes estimate E_i will be "shrunken" toward the overall mean ν/α. Conversely, if area i has a large population size, the values of k_i and P_i are large relative to ν and α, and the empirical Bayes estimate E_i will be very close to the crude rate k_i/P_i. Compared to the crude rate k_i/P_i, the empirical Bayes estimate E_i is smoothed by the inclusion of ν and α.

Empirical Bayes (EB) estimation can be applied to a whole study area, where all rates are smoothed toward the overall rate. This is referred to as global empirical Bayes smoothing. It can also be applied locally by defining a neighborhood around each area and smoothing the rate toward its neighborhood rate. The process is referred to as regionalized empirical Bayes smoothing. A neighborhood for an area can be defined as all its contiguous areas plus itself. Contiguity may be defined as rook contiguity or queen contiguity (see Appendix 1 in Chapter 1), first-order or second-order contiguity, and so on.

GeoDa, a free package developed by Luc Anselin and his colleagues (http://sal.agecon.uiuc.edu/geoda_main.php), can be used to implement the EB estimation for spatial smoothing. The tools are available in GeoDa 0.9.5-i by choosing Map > Smooth > Empirical Bayes (or Spatial Empirical Bayes). The Empirical Bayes procedure smoothes the rates toward the overall mean in the whole study area, and thus is global EB smoothing. The Spatial Empirical Bayes procedure smoothes the rates toward a spatial "window" around an area (defined as the area and its neighboring areas based on a spatial weights file), and thus is local EB smoothing.
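As a numerical illustration, the sketch below computes global empirical Bayes estimates in Python for a handful of areas. It uses simple moment-based values for ν and α rather than the mixed maximum likelihood and moments procedure of Marshall (1991), and the counts and populations are made-up toy data.

# Toy data: disease counts k_i and populations P_i for five areas (made-up numbers)
k = [2, 15, 1, 40, 6]
P = [1000, 12000, 800, 30000, 5000]

n = len(k)
rates = [ki / Pi for ki, Pi in zip(k, P)]

# Moment-based estimates of the gamma prior (mean = nu/alpha, variance = nu/alpha^2)
m = sum(k) / sum(P)                                               # overall mean rate
v = sum(Pi * (r - m) ** 2 for Pi, r in zip(P, rates)) / sum(P) - m / (sum(P) / n)
v = max(v, 1e-12)                                                 # guard against a non-positive variance
alpha = m / v
nu = m * m / v

# Empirical Bayes estimate: E_i = (k_i + nu) / (P_i + alpha)
eb = [(ki + nu) / (Pi + alpha) for ki, Pi in zip(k, P)]
for r, e in zip(rates, eb):
    print(f"crude = {r:.5f}   EB = {e:.5f}")    # rates in small areas shrink toward m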

APPENDIX 3B: NETWORK HIERARCHICAL WEIGHTING METHOD FOR AREAL INTERPOLATION

Utilizing the road network information in the U.S. Census Bureau's TIGER files, Xie (1995) (also see Batty and Xie, 1994a,b) develops some network-overlaid algorithms to project population or other resident-based attributes from one areal unit to another. Residential houses are usually located along the sides of streets or along


roads. As a result, the distribution of population is closely related to the street network. Among the three algorithms (network length, network hierarchical weighting, and network house bearing methods), the network hierarchical weighting (NHW) method yields the most promising results. We use one example to illustrate the method.

Researchers on urban issues often use the Census Transportation Planning Package (CTPP) data* to analyze land use and transportation issues. The CTPP Urban Element data in most regions are aggregated at the Traffic Analysis Zone (TAZ) level.† For various reasons (e.g., merging with other census data), it may be desirable to interpolate the CTPP data from TAZs to census tracts. In this case, TAZs are the source zones and census tracts are the target zones. The following five steps implement the task:

1. Overlay the TAZ layer with the census tract layer to create an intersected TAZ-tract (polygon) layer, and overlay the TAZ-tract layer with the road network layer to create a control-net (line) layer.
2. Construct a weight matrix for different road categories, as population or business densities vary along various road classes.
3. Overlay the TAZ layer with the network layer, compute the lengths of various roads and then the weighted length within each TAZ, and allocate population or other attributes to the network.
4. Attach the result from step 3 (population or other attributes) to the control-net layer, and sum up the attributes within each polygon based on the TAZ-tract layer.
5. Sum up the attributes by census tracts to get the interpolated attributes within each census tract.
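To make the weighting logic concrete, the fragment below sketches steps 2–5 in pandas on a hypothetical table of road segments already intersected with the TAZ-tract atoms; the file names, column names, road classes, and class weights are all illustrative assumptions, not values prescribed by the method itself.

import pandas as pd

seg = pd.read_csv("ControlNet.csv")   # columns: TAZ_ID, TRACT_ID, LENGTH, ROAD_CLASS
taz = pd.read_csv("TAZ.csv")          # columns: TAZ_ID, WORKERS (attribute to interpolate)

# Step 2 analog: weights reflecting how residential density varies by road class
weights = {"local": 1.0, "collector": 0.5, "arterial": 0.2, "highway": 0.0}
seg["wlen"] = seg["LENGTH"] * seg["ROAD_CLASS"].map(weights)

# Step 3 analog: allocate each TAZ's attribute in proportion to weighted road length
seg["taz_wlen"] = seg.groupby("TAZ_ID")["wlen"].transform("sum")
seg = seg.merge(taz, on="TAZ_ID")
seg["alloc"] = seg["WORKERS"] * seg["wlen"] / seg["taz_wlen"]

# Steps 4-5 analog: sum the allocated attribute by census tract
tract_est = seg.groupby("TRACT_ID", as_index=False)["alloc"].sum()
print(tract_est.head())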

* For the CTPP 1990 and 2000, visit https://www.fhwa.dot.gov/planning/census_issues/ctpp/data_products/. For the CTPP 2006–2010, visit http://ctpp.transportation.org/Pages/5-Year-Data.aspx.
† For example, among the 303 CTPP Urban Element regions in 1990, 265 regions are summaries for TAZs, 13 for census tracts, and 25 for block groups (based on the file Regncode.asc distributed by the Bureau of Transportation Statistics, and summarized by the author).

Section II Basic Quantitative Methods and Applications

4 GIS-Based Trade Area Analysis and Application in Business Geography

"No matter how good its offering, merchandising, or customer service, every retail company still has to contend with three critical elements of success: location, location, and location" (Taneja, 1999, p. 136). Trade area analysis is a common and important task in the site selection of a retail store. A trade area is simply "the geographic area from which the store draws most of its customers and within which market penetration is highest" (Ghosh and McLafferty, 1987, p. 62). For a new store, the study of proposed trading areas reveals market opportunities with existing competitors (including those in the same chain or franchise) and helps decide on the most desirable location. For an existing store, it can be used to project market potentials and evaluate its performance. In addition, trade area analysis provides many other benefits for a retailer: determining the focus areas for promotional activities, highlighting geographic weaknesses in its customer base, projecting future growth, and others (Berman and Evans, 2001, pp. 293–294).

There are several methods for delineating trade areas: the analog method, the proximal area method, and the gravity models. The analog method is nongeographic and is often implemented by regression analysis. The proximal area method and the gravity models are geographic approaches, and can benefit from GIS technologies. The analog and the proximal area methods are fairly simple, and are discussed in Section 4.1. The gravity models are the focus of this chapter, and are covered in detail in Section 4.2.

Two case studies are presented in Sections 4.3 and 4.4 to illustrate how the two geographic methods (the proximal area method and the gravity models) are implemented in GIS. Case Study 4A draws from traditional business geography but with a fresh angle: instead of the typical retail store analysis, it analyzes the fan bases for two professional baseball teams in Chicago. Case Study 4B demonstrates how the techniques of trade area analysis are used in delineating service areas for public hospitals in Louisiana. Case Study 4A uses Euclidean distances to measure the spatial impedance for simplicity, and thus can be implemented without the Network Analyst module. Case Study 4B uses travel time through the road network and thus relies on several tools in the Network Analyst module. The chapter concludes with some remarks in Section 4.5.

4.1 BASIC METHODS FOR TRADE AREA ANALYSIS

4.1.1 Analog Method and Regression Models

The analog method, developed by Applebaum (1966, 1968), is considered the first systematic retail forecasting model founded on empirical data. The model uses an existing store or several stores as analogs to forecast sales in a proposed similar or analogous facility. Applebaum's original analog method did not use regression analysis. The method uses customer surveys to collect data on sample customers in the analogous stores: their geographic origins, demographic characteristics, and spending habits. The data are then used to determine the levels of market penetration (e.g., number of customers, population, average spending per capita) at various distances. The result is used to predict future sales in a store located in similar environments. Although the data may be used to plot market penetration at various distances from a store, the major objective of the analog method is to forecast sales, not to define trade areas geographically.

The analog method is easy to implement, but has some major weaknesses. The selection of analog stores requires subjective judgment (Applebaum, 1966, p. 134), and many situational and site characteristics that affect a store's performance are not considered. A more rigorous approach to advance the classical analog method is the use of regression models to account for a wide array of factors that influence a store's performance (Rogers and Green, 1978). A regression model can be written as

Y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n,

where Y represents a store's sales or profits, the x's are explanatory variables, and the b's are the regression coefficients to be estimated. The selection of explanatory variables depends on the type of retail outlet. For example, the analysis of retail banks by Olsen and Lord (1979) included variables measuring trade area characteristics (purchasing power, median household income, home ownership), variables measuring site attractiveness (employment level, retail square footage), and variables measuring level of competition (number of competing banks' branches, trade area overlap with branches of the same bank). Even for the same type of retail stores, regression models can be improved by grouping the stores into different categories and running a model on each category. For example, Davies (1973) classified clothing outlets into two categories (corner-site stores and intermediate-site stores) and found significant differences in the variables affecting sales. For corner-site stores, the top five explanatory variables are floor area, store accessibility, number of branches, urban growth rate, and distance to the nearest car park. For intermediate-site stores, the top five explanatory variables are total urban retail expenditure, store accessibility, selling area, floor area, and number of branches.

4.1.2 Proximal Area Method

A simple geographic approach for defining trade areas is the proximal area method, which assumes that consumers choose the nearest store among similar outlets (Ghosh


and McLafferty, 1987, p. 65). This assumption is also found in the classical central place theory (Lösch, 1954; Christaller, 1966). The proximal area method implies that customers consider only travel distance (or travel time as an extension) in their shopping choice, and thus the trade area is simply made up of the consumers who are closer to the store than to any other. Once the proximal area is defined, sales can be forecasted by analyzing the demographic characteristics within the area and surveying spending habits.

The proximal area method can be implemented in GIS in two ways. The first approach is consumer-based. It begins with a consumer location and searches for the nearest store among all store locations. The process continues until all consumer locations are covered. At the end, consumers that share the same nearest store constitute the proximal area for that store. When the spatial impedance is measured in Euclidean distance, this approach is implemented by utilizing the "Near" tool in ArcToolbox > Analysis Tools > Proximity > Near (see Case Study 4A). The ArcGIS Network Analyst module has a "Closest Facility" tool that implements the proximal area method when the spatial impedance is measured in network distance or travel time (see Case Study 4B).

The second approach is store-based. It constructs Thiessen polygons from the store locations, and the polygon around each store defines the proximal area for that store. The layer of Thiessen polygons may then be overlaid with that of consumers (e.g., a census tract layer with population information) to identify demographic structures within each proximal area. Figure 4.1a–c shows how the Thiessen polygons are constructed from five points. First, five points are scattered in the study area as shown in Figure 4.1a. Second, in Figure 4.1b, lines are drawn to connect points that are near each other, and lines are drawn perpendicular to the connection lines at their midpoints. Finally, in Figure 4.1c, the Thiessen polygons are formed by the perpendicular lines. In ArcGIS, Thiessen polygons can be generated from a point layer of store locations. The tool is accessed in ArcToolbox > Analysis Tools > Proximity > Create Thiessen Polygons.

FIGURE 4.1 Constructing Thiessen polygons for five points.

4.2 GRAVITY MODELS FOR DELINEATING TRADE AREAS

4.2.1 Reilly's Law

The proximal area method only considers distance (or time) in defining trade areas. However, consumers may bypass the closest store to patronize stores with better prices, better goods, larger assortments, or a better image. A store in proximity to other shopping and service opportunities may also attract customers from farther away than an isolated store because of multipurpose shopping behavior. Methods based on the gravity model consider two factors: distances from (or time to) stores and the attractions of stores.

Reilly's law of retail gravitation applies the concept of the gravity model to delineating trade areas between two stores (Reilly, 1931). Consider two stores 1 and 2 that are at a distance of d_{12} from each other (see Figure 4.2). Assume that the attractions of stores 1 and 2 are measured as S_1 and S_2 (e.g., in square footage of the store's selling area), respectively. The question is to identify the breaking point (BP) that separates the trade areas of the two stores. The BP is d_{1x} from store 1 and d_{2x} from store 2, that is,

d_{1x} + d_{2x} = d_{12}.    (4.1)

By the notion of the gravity model, the retail gravitation of a store is in direct proportion to its attraction and in inverse proportion to the square of distance. Consumers at the BP are indifferent between the two stores, and thus the gravitation of store 1 equals that of store 2, such that

S_1 / d_{1x}^2 = S_2 / d_{2x}^2.    (4.2)

FIGURE 4.2 Reilly's law of retail gravitation.


Using Equation 4.1, we obtain d_{1x} = d_{12} − d_{2x}. Substituting it into Equation 4.2 and solving for d_{1x} yields

d_{1x} = d_{12} / (1 + \sqrt{S_2/S_1}).    (4.3)

Similarly,

d_{2x} = d_{12} / (1 + \sqrt{S_1/S_2}).    (4.4)

Equations 4.3 and 4.4 define the boundary between two stores’ trading areas, and are commonly referred to as Reilly’s law.
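As a quick worked example of Equations 4.3 and 4.4, the short Python function below computes the breaking point between two stores; the store sizes and distance used in the call are made up.

import math

def breaking_point(d12, s1, s2):
    """Distance of the breaking point from store 1 (Reilly's law, Equation 4.3)."""
    return d12 / (1 + math.sqrt(s2 / s1))

# Example: stores 10 km apart; store 1 has 40,000 sq ft, store 2 has 10,000 sq ft
d1x = breaking_point(10.0, 40000, 10000)
print(d1x, 10.0 - d1x)   # about 6.67 km from the larger store, 3.33 km from the smaller one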

4.2.2 Huff Model

Reilly's law only defines trade areas between two stores. A more general gravity model method is the Huff model, which defines trade areas of multiple stores (Huff, 1963). The model's widespread use and longevity "can be attributed to its comprehensibility, relative ease of use, and its applicability to a wide range of problems" (Huff, 2003, p. 34). The behavioral foundation of the Huff model is similar to that of the multichoice logistic model: the probability that someone chooses a particular store among a set of alternatives is proportional to the perceived utility of each alternative. That is,

P_{ij} = U_j / \sum_{k=1}^{n} U_k,    (4.5)

where P_{ij} is the probability of an individual i selecting store j, U_j and U_k are the utilities of choosing stores j and k, respectively, and the k's are the alternatives available (k = 1, 2, …, n). In practice, the utility of a store is measured as a gravity kernel. As in Equation 4.2, the gravity kernel is positively related to a store's attraction (e.g., its size in square footage) and inversely related to the distance between the store and a consumer's residence. That is,

P_{ij} = S_j d_{ij}^{-\beta} / \sum_{k=1}^{n} (S_k d_{ik}^{-\beta}),    (4.6)

where S is a store's size, d is the distance, β > 0 is the distance friction coefficient, and other notations are the same as in Equation 4.5. Note that the gravity kernel in Equation 4.6 is a more general form than that in Equation 4.2, where the distance friction coefficient β is assumed to be 2. The term S_j d_{ij}^{-β} is also referred to as the potential, measuring the impact of store j on a demand location at i.


Using the gravity kernel to measure utility may be purely a choice of empirical convenience. However, the gravity model (also referred to as a "spatial interaction model") can be derived from individual utility maximization (Niedercorn and Bechdolt Jr, 1969; Colwell, 1982), and thus has an economic foundation (see Appendix 4A). Wilson (1967, 1975) also provided a theoretical base for the gravity model by an entropy-maximization approach. Wilson's work also led to the discovery of a "family" of gravity models: a production-constrained model, an attraction-constrained model, and a production-attraction-constrained or doubly constrained model (Wilson, 1974; Fotheringham and O'Kelly, 1989).

On the basis of Equation 4.6, consumers in an area visit stores with various probabilities, and an area is assigned to the trade area of the store that is visited with the highest probability. In practice, given a customer location i, the denominator in Equation 4.6 is identical for all stores j, and thus the highest value of the numerator identifies the store with the highest probability. The numerator S_j d_{ij}^{-β} is also known as the "gravity potential" of store j at distance d_{ij}. In other words, one only needs to identify the store with the highest potential to define the trade area. Implementation in ArcGIS can take full advantage of this property. However, if one desires to show a continuous surface of shopping probabilities for individual stores, Equation 4.6 needs to be fully calibrated. In fact, one major contribution of the Huff model is the suggestion that retail trade areas are continuous, complex, and overlapping, unlike the nonoverlapping geometric areas of central place theory (Berry, 1967).

Implementing the Huff model in ArcGIS utilizes a distance matrix between each store and each consumer location, and probabilities are computed by using Equation 4.6. The result is not simply trade areas with clear boundaries, but a continuous probability surface. Based on the probabilities, the traditional trade areas can certainly be defined as areas where residents choose a particular store with the highest probability.
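A minimal Python sketch of Equation 4.6: given one consumer location's distances to a set of stores and the store sizes, it returns the probabilities of visiting each store. The sizes, distances, and β value in the example are made up.

def huff_probabilities(sizes, dists, beta=2.0):
    """Huff model (Equation 4.6): probability of choosing each store from one location."""
    potentials = [s * d ** (-beta) for s, d in zip(sizes, dists)]   # S_j * d_ij^(-beta)
    total = sum(potentials)
    return [p / total for p in potentials]

# Example: three stores of different sizes at different distances from a consumer
print(huff_probabilities(sizes=[50000, 20000, 80000], dists=[3.0, 1.5, 6.0]))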

4.2.3 Link between Reilly's Law and Huff Model

Reilly's law may be considered a special case of the Huff model. In Equation 4.6, the choices are only two stores (k = 2), and P_{ij} = 0.5 at the breaking point. That is to say, S_1 d_{1x}^{-β} / (S_1 d_{1x}^{-β} + S_2 d_{2x}^{-β}) = 0.5. Assuming β = 2, the above equation is the same as Equation 4.2, based on which Reilly's law is derived. For any β, a general form of Reilly's law is written as

d_{1x} = d_{12} / [1 + (S_2/S_1)^{1/β}],    (4.7)

d_{2x} = d_{12} / [1 + (S_1/S_2)^{1/β}].    (4.8)

Based on Equation 4.7 or 4.8, if store 1 increases its size faster than store 2 (i.e., S_1/S_2 increases), d_{1x} increases and d_{2x} decreases, indicating that the breaking point


(BP) shifts toward store 2 and the trade area of store 1 expands. The observation is straightforward. It is also interesting to examine the impact of the distance friction coefficient on the trade areas. When β decreases, the movement of the BP depends on the store sizes:

1. If S_1 > S_2, that is, S_2/S_1 < 1, then (S_2/S_1)^{1/β} decreases, and thus d_{1x} increases and d_{2x} decreases, indicating that the larger store is expanding its trade area.
2. If S_1 < S_2, that is, S_2/S_1 > 1, then (S_2/S_1)^{1/β} increases, and thus d_{1x} decreases and d_{2x} increases, indicating that the smaller store is losing part of its trade area.

That is to say, when the β value decreases over time due to improvements in transportation technologies or the road network, travel distance matters to a lesser degree, giving an even stronger edge to larger stores. This explains some of the success of superstores in the new era of retail business.

4.2.4 Extensions of the Huff Model

The original Huff model does not include an exponent associated with the store size. A simple improvement over the Huff model in Equation 4.6 is expressed as

P_{ij} = S_j^{\alpha} d_{ij}^{-\beta} / \sum_{k=1}^{n} (S_k^{\alpha} d_{ik}^{-\beta}),    (4.9)

where the exponent α captures the elasticity of store size (e.g., a larger shopping center tends to exert more attraction than its size suggests because of scale economies). The improved model still uses only size to measure the attractiveness of a store. Nakanishi and Cooper (1974) proposed a more general form called the multiplicative competitive interaction (MCI) model. In addition to size and distance, the model accounts for factors such as store image, geographic accessibility, and other store characteristics. The MCI model measures the probability of a consumer at residential area i shopping at store j, P_{ij}, as

P_{ij} = \left( \prod_{l=1}^{L} A_{lj}^{\alpha_l} \right) d_{ij}^{-\beta} \Big/ \sum_{k \in N_i} \left[ \left( \prod_{l=1}^{L} A_{lk}^{\alpha_l} \right) d_{ik}^{-\beta} \right],    (4.10)

where A_{lj} is a measure of the lth (l = 1, 2, …, L) characteristic of store j, N_i is the set of stores considered by consumers at i, and other notations are the same as in Equations 4.6 and 4.9.

If disaggregate data of individual shopping trips, instead of aggregate data of trips from areas, are available, the multinomial logit model (MNL) is used to model shopping behavior (e.g., Weisbrod et al., 1984), written as

P_{ij} = \left( \prod_{l=1}^{L} e^{\alpha_l A_{lij}} \right) e^{-\beta d_{ij}} \Big/ \sum_{k \in N_i} \left[ \left( \prod_{l=1}^{L} e^{\alpha_l A_{lik}} \right) e^{-\beta d_{ik}} \right].    (4.11)

Instead of using a power function for the gravity kernel as in Equation 4.10, an exponential function is used in Equation 4.11. The model is estimated by multinomial logit regression.

The distance friction coefficient β is a key parameter in the gravity models, and its value may vary over time and across regions. Ideally, it needs to be derived from the existing travel pattern in a study area, as discussed in Section 2.3. Actual travel data may also suggest that a function other than the commonly used power function best captures the distance decay behavior.

4.3 CASE STUDY 4A: DEFINING FAN BASES OF CUBS AND WHITE SOX IN CHICAGO REGION

In Chicago, it is well known that, between the two Major League Baseball (MLB) teams, the Cubs outdraw the White Sox in fans. Many factors such as history, neighborhoods surrounding the ballparks, team managements, winning records, and others may contribute to the difference. In this case study, we attempt to investigate the issue from a geographic perspective. For illustrating trade area analysis techniques, only the population surrounding the ballparks is considered. The proximal area method is first used to examine which club has an advantage if fans simply choose the closer club. For methodology demonstration, we then consider winning percentage as the only factor measuring the attraction of a club,* and use the gravity model method to calibrate the probability surface. For simplicity, Euclidean distances are used for measuring proximity in this project (network travel time will be used in Case Study 4B), and the distance friction coefficient is assumed to be 2, that is, β = 2.

* Evidently this is an oversimplification. Despite its sub-par records for many years, the Cubs have earned the nickname "lovable losers" as one of the most followed clubs in professional sports. However, the record still matters, as tickets to Wrigley Field became harder to get when the Cubs made a rare playoff run as "recent" as in 2008.

The study area is defined as the 10 Illinois counties in the Chicago CMSA (Consolidated Metropolitan Statistical Area), as shown in the inset of Figure 4.3. The counties are (their Federal Information Processing Standards or FIPS codes in parentheses): Cook (031), DuPage (043), DeKalb (037), Grundy (063), Kane (089), Kankakee (091), Kendall (093), Lake (097), McHenry (111), and Will (197). In this book, the 10-county Chicago CMSA is referred to as the "Chicago Region," which is also the study area in Case Studies 5 and 9. This is in contrast to the smaller 6-county core area (mostly urbanized), referred to as the "Chicago Urban Area" or simply "Chicago Area" in this book, which is the study area for Case Studies 6 and 8A. The City of Chicago is even smaller and forms the study area for Case Study 8C. The inset of Figure 4.3 shows these three study areas. All project data associated with these three areas are grouped under the corresponding geodatabases "ChiRegion.gdb," "ChiUrArea.gdb," and "ChiCity.gdb," all under "Chicago."

Data needed for this project are all grouped in the geodatabase ChiRegion.gdb:

1. Polygon feature trt2k for census tracts (2000) in the study area
2. Point feature blkpt2k for census block centroids (2000) in the study area
3. Point feature CubsSox for the two MLB clubs

FIGURE 4.3 Proximal areas for the Cubs and White Sox.

The following explains how the above data sets were obtained and processed. The spatial and corresponding attribute data for the 2000 census tracts and blocks were extracted from the ESRI data DVDs and processed following procedures similar to those discussed in Section 1.3. For this project, only the population information is needed (available in the field POPU of trt2k and the field POP2000 of blkpt2k). One may find other demographic variables such as income, age, and sex also useful (available at the Census web site www.census.gov) and use them for more in-depth analysis.


The data set for the two clubs (CubsSox) was prepared by geocoding them based on the addresses of the two clubs (Chicago Cubs at the Wrigley Field, 1060 W Addison St, Chicago, IL 60613; and Chicago White Sox at the US Cellular Field, 333 W 35th St, Chicago, IL 60616). See step 1 in Section 2.4.1 for more detail. The clubs’ winning percentages (in the field winrat) were based on their records in 2003 (the last time the Cubs played in the NL Championship Series) as an example. From now on, project instructions will be brief unless a new task is introduced. One may refer to previous case studies for details if necessary.

4.3.1 Part 1. Defining Fan Base Areas by the Proximal Area Method

Step 1. Optional: Generating population-weighted centroids of census tracts: One may use the census blocks directly to implement the trade area analysis. Due to their large number (131,159) and the demanding computation time in other steps, this project uses the census tracts. Locations of the tracts will be represented by their population-weighted centroids instead of their geographic centroids for better accuracy. The differences between them may be significant, particularly in rural or peripheral suburban areas where population or business tends to concentrate in limited space (Luo and Wang, 2003). Population-weighted tract centroids are obtained by computing the weighted x–y coordinates based on block-level population data such as

x_c = \sum_{i=1}^{n_c} p_i x_i \Big/ \sum_{i=1}^{n_c} p_i \quad \text{and} \quad y_c = \sum_{i=1}^{n_c} p_i y_i \Big/ \sum_{i=1}^{n_c} p_i,    (4.12)

where x_c and y_c are the x and y coordinates of the population-weighted centroid of a census tract, respectively; x_i and y_i are the x and y coordinates of the ith block centroid within that census tract, respectively; p_i is the population of the ith census block within that census tract; and n_c is the total number of blocks within that census tract.

Implementing this task utilizes a spatial statistics tool as illustrated below. In ArcToolbox, choose Spatial Statistics Tools > Measuring Geographic Distributions > Mean Center. In the dialog window, choose the point layer of census blocks "blkpt2k" as the Input Feature Class, enter "TrtWgtCent" as the name for the Output Feature Class, choose the field "POP2000" as the Weight Field and the census tract id "CNTYTRT" as the Case Field, and click OK.* Note that the field CNTYTRT is a 9-digit unique code for a census tract (with the first 3 digits representing a county's code and the next 6 digits representing a tract's code). For the reason explained in the footnote below, it is recommended to use the point feature trtcent provided in the geodatabase ChiRegion.gdb for the remaining steps of this project. This step is designed to introduce the skill of generating weighted centroids.

The resulting tract centroids layer TrtWgtCent contains 1896 points, five fewer than the number of tracts (1901) in the tract polygon layer trt2k. Those five tracts have no population and thus make Equation 4.12 for computing the weighted x–y coordinates invalid. Not including them would not affect the results in subsequent analysis. The point feature trtcent provided in the geodatabase ChiRegion.gdb has the complete 1901 tract centroids by adding the geographic centroids of the missing five tracts.
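Equation 4.12 can also be evaluated with a short pandas script on an exported block centroid table; the CSV name and the coordinate column names below are assumptions, while the other fields mirror those described above.

import pandas as pd

blk = pd.read_csv("blkpt2k.csv")   # columns: CNTYTRT, POP2000, X, Y (block centroid coordinates)

blk["px"] = blk["POP2000"] * blk["X"]
blk["py"] = blk["POP2000"] * blk["Y"]
g = blk.groupby("CNTYTRT")[["px", "py", "POP2000"]].sum()

# Population-weighted tract centroids (tracts with zero population would divide by zero)
g["xc"] = g["px"] / g["POP2000"]
g["yc"] = g["py"] / g["POP2000"]
print(g[["xc", "yc"]].head())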


TABLE 4.1 Fan Bases for Cubs and White Sox by Trade Area Analysis

             By the Proximal Area Method
Club         2 Miles    5 Miles      10 Miles     Study Area    By Huff Model
Cubs         241,297    1,010,673    1,759,721    4,482,460     4,338,619
White Sox    129,396    729,041      1,647,852    3,894,141     4,037,447

Step 2. Finding the nearest clubs: In ArcToolbox, choose Analysis Tools > Proximity > Near. In the dialog window, select trtcent as the Input Features and CubsSox as the Near Features, and click OK. The attribute table of the layer trtcent now has two new fields: NEAR_FID identifies the nearest club from each tract centroid (1 for the Cubs and 2 for the White Sox), and NEAR_DIST is the distance between them in meters.*

Step 3. Mapping the proximal areas of the clubs: Use an attribute join to join the attribute table of trtcent to the tract polygon layer trt2k by the common field CNTYTRT, and map the proximal areas as shown in Figure 4.3 (in the Layer Properties window, Symbology > Categories > Unique values, select NEAR_FID for Value Field and click Add All Values).

Step 4. Summarizing results: Open the attribute table of trt2k, and summarize the population (POPU) in each trade area by club (i.e., NEAR_FID). Refer to step 5 in Section 3.3.1 for a detailed illustration of using the Summarize tool. Furthermore, on the table, use Table Options > Select By Attributes to create subsets of the table that contain tracts within 2 miles (=3218 m), 5 miles (=8045 m), 10 miles (=16,090 m), and 20 miles (=32,180 m), and summarize the total population near each club. The results are summarized in Table 4.1. It shows a clear advantage for the Cubs, particularly in short distance ranges. If resident income is considered, the advantage is even stronger for the Cubs.

4.3.2 Part 2. Defining Fan Base Areas and Mapping Probability Surface by Huff Model

Step 5. Computing the distance matrix between clubs and tracts: In ArcToolbox, choose Analysis Tools > Proximity > Point Distance. Select trtcent as the Input Features and CubsSox as the Near Features, and name the Output Table dist.dbf. In the distance file dist.dbf, the field INPUT_FID identifies the id from the input feature (trtcent), the field NEAR_FID identifies the id from the near feature (CubsSox), and the field DISTANCE is the Euclidean distance between them (in meters). It has 1901 (number of tracts) × 2 (number of clubs) = 3802 records.

*

One may also use a spatial join (specifically, the second option for point-to-point join, also termed “distance join” in Section 1.2) to accomplish the task.


Step 6. Measuring the potentials of the clubs: Join the attribute table of CubsSox (based on the field OBJECTID) to dist.dbf (based on the field NEAR_FID) so that the clubs' information (e.g., winning records) is attached to the distance file. Add a new field potent to dist.dbf, and calculate it as 1,000,000*[CubsSox.winrat]/([dist.DISTANCE]/1000)^2. Hereafter in the book, table names and bracket signs will be dropped from the field calculator formulas for simplicity. In other words, the formula is simply written as potent = 1,000,000*winrat/(DISTANCE/1000)^2. Note that the values of the potential do not have a unit, and multiplying by a constant 1,000,000 avoids values that are too small. The field potent returns the values for the numerator in Equation 4.6.

Step 7. Calculating the probabilities of visiting each club: On the table dist.dbf, use "Summarize" to sum the field potent by census tracts (i.e., INPUT_FID), and name the output table sum_potent.dbf.* Refer to step 5 in Section 3.3.1 for a detailed illustration of the use of "Summarize." In the resulting table sum_potent.dbf, the field Sum_potent is the summed potential of the two clubs for each tract, which is the denominator term in Equation 4.6. Join the table sum_potent.dbf back to dist.dbf (based on the common field INPUT_FID). On the joined table dist.dbf, add a field prob and calculate it as prob = potent/Sum_potent. The field prob completes the implementation of Equation 4.6, and indicates the probability of residents in each tract choosing a particular club.

Step 8. Mapping the probability surface: On the table dist.dbf, use Table Options > Select By Attributes to select the records with the criterion NEAR_FID = 1, and export the selected records to a new table Cubs_Prob.dbf. This extracts the probabilities of visiting the Cubs for all tracts. Join the table Cubs_Prob.dbf (based on the field INPUT_FID) to the tract point layer trtcent (based on the field OBJECTID). One may directly map the field prob in the layer trtcent to see the change in the probability of visiting the Cubs. We use the spatial interpolation techniques learned in Case Study 3A to map the probability surface for the Cubs, as shown in Figure 4.4. The inset is the zoom-in area near the two clubs, showing the change from one trade area to another along the 0.50-probability line. One may repeat the analysis for the White Sox, and the result will be the reverse of Figure 4.4, since the probability of visiting the White Sox = 1 minus the probability of visiting the Cubs.

Step 9. Defining fan bases by the Huff model: After the join in step 8, the attribute table of trtcent has a field prob indicating the probability of residents visiting the Cubs. Add a field cubsfan to the table, and calculate it as cubsfan = prob*POPU. Use the "Statistics" option to find the sum of the field cubsfan (4,338,619), which is the projected fan base for the Cubs by the Huff model. The remaining population is projected to be the fan base for the White Sox, that is, 8,376,066 (total population in the study area) − 4,338,619 = 4,037,447.

* If the newly added field potent from step 6 is not available for selection, remove the joins from dist.dbf and reattempt "Summarize."


FIGURE 4.4 Probability of choosing the Cubs by the Huff model (β = 2.0).

The process for implementing the Huff model is automated in a tool “Huff Model” under the toolkit Huff Model.tbx (available under the study area folder Chicago). The tool automatically generates the Euclidean distance table between customers and facilities. See Appendix 4B for its usage.
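For readers who prefer scripting the calculation outside the toolbox, the sketch below reproduces the logic of steps 5–7 with pandas. It is a minimal illustration, not the toolkit's source code: the exported file name dist.csv and the helper function name are assumptions, while the column names (winrat, DISTANCE, INPUT_FID, NEAR_FID) and β = 2.0 follow the case study.

```python
import pandas as pd

def huff_probabilities(dist, size_col="winrat", beta=2.0):
    """Add Huff-model potentials and probabilities to a long-format
    tract-club distance table (one row per tract-club pair)."""
    d = dist.copy()
    # potential = S_j * d_ij^(-beta); distance rescaled to km and inflated by 1,000,000
    d["potent"] = 1_000_000 * d[size_col] / (d["DISTANCE"] / 1000) ** beta
    # denominator of Equation 4.6: total potential reachable by each tract
    d["Sum_potent"] = d.groupby("INPUT_FID")["potent"].transform("sum")
    d["prob"] = d["potent"] / d["Sum_potent"]
    return d

# dist.csv: the distance table already joined with the club attributes (steps 5-6)
dist = pd.read_csv("dist.csv")
probs = huff_probabilities(dist, size_col="winrat", beta=2.0)
# probability of each tract choosing the Cubs (NEAR_FID = 1 in the case study)
cubs_prob = probs.loc[probs["NEAR_FID"] == 1, ["INPUT_FID", "prob"]]
print(cubs_prob.head())
```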

4.3.3 Discussion

The proximal area method defines trade areas with definite boundaries. Within a trade area, all residents are assumed to choose one club over the other. The Huff model computes the probabilities of residents choosing each club. Within each tract, a portion of the residents chooses one club and the remainder chooses the other. The Huff model seems to produce a more logical result, as real-world fans of different clubs often live in the same area (even in the same household). The model also accounts for the impact of each club's attraction, though measuring attraction is usually complex. The Huff model may also be used to define traditional trade areas with definite boundaries by assigning each tract to the club it visits with the highest probability. In this case, tracts with prob > 0.50 belong to the Cubs, and the remaining tracts to the White Sox.

4.4 CASE STUDY 4B: ESTIMATING TRADE AREAS OF PUBLIC HOSPITALS IN LOUISIANA

This section presents another case study of trade area analysis. Unlike the previous case study based on Euclidean distances, this one uses the ArcGIS Network Analyst module to define the service (trade) areas of 10 public hospitals in Louisiana based on travel time through the road network. In the Huff model, we assume β = 3.0* and use the number of beds to measure the attraction (i.e., S in Equation 4.6). Data sets needed for the project are the same as in Case Study 2 under the data folder Louisiana. Several data sets generated from Case Study 2 will also be used in this study:

1. The geodatabase LAState.gdb contains the census tracts layer LA_Trt, the hospitals layer Hosp generated from step 1 in Case Study 2, the census tract centroids layer TrtPt from step 2 in Case Study 2, and the major road network dataset LA_MainRd_ND under the feature dataset LA_MainRd (generated from step 8 in Case Study 2).
2. The O-D travel time matrix file ODTime.dbf (generated from step 12 of Case Study 2).

* We use β = 3.0 for convenience without deriving the actual distance decay behavior from hospital visitation data as discussed in Section 2.3. When smaller β values (e.g., 1.0, 1.5, 2.0, 2.5) are used, the effect of hospital size dominates that of travel time and leads to service areas for large hospitals (particularly the LSUHSC-Shreveport and the Interim LSU Public Hospital) extending to areas that are not contiguous. That may well be the case in the real world.

4.4.1 Part 1. Defining Hospital Service Areas by the Proximal Area Method

Step 1. Activating the Network Analyst module and the Closest Facility tool in ArcMap: In ArcMap, add the above data sets to the project (LA_Trt, Hosp, TrtPt, LA_MainRd and ODTime.dbf). Choose Customize from the main menu > Extensions > check Network Analyst. From the Network Analyst drop-down menu, choose New Closest Facility. A composite network analysis layer "Closest Facility" is added to the layers window.

Step 2. Finding the nearest hospital for each census tract: Under the Network Analyst window, right-click on Facilities (0) > Load Locations. In the dialog
window, choose Hosp for Load From, select OBJECTID as Sort Field, and under Use Geometry, set the Search Tolerance to 5000 m, and click OK to load the 10 hospital locations. Similarly, right-click on Incidents (0) > Load Locations > choose TrtPt for Load From, select OBJECTID_1 as Sort Field, and under Use Geometry, set the Search Tolerance to 55,000 m, and click OK to load the 1148 tract centroids. On the Network Analyst toolbar, click the Solve button. The solution is saved in the layer Routes under the Table Of Contents (and another in the Network Analyst window), which shows the routes from census tracts to their closest hospitals through the major roads. Right-click it > Open Attribute Table to examine the result, which clearly identifies the nearest hospital for each census tract.

Step 3. Defining and mapping the proximal (service) areas of hospitals: Join the attribute table of Routes to the point layer TrtPt (based on the common keys identifying unique census tracts: field IncidentID of Routes and field OBJECTID_1 of TrtPt) and export the joined attribute table to a dBASE file, say TrtPtNearHosp. Then, join the table TrtPtNearHosp to the polygon layer LA_Trt (based on their common key GEOID10). By doing so, we are able to join the attribute table of Routes to the original tract layer LA_Trt. Right-click LA_Trt > Data > Export Data to save the joined result as a new layer LA_Trt1a, where the field FacilityID identifies the closest hospital from each tract. Similar to step 10 in Section 1.3.2, in ArcToolbox, choose Data Management Tools > Generalization > Dissolve. In the dialog window, choose LA_Trt1a for Input Features, name the Output Feature class ProxArea, check FacilityID for Dissolve_Field(s); for Statistics Field(s), choose fields DP0010001 (total population), DP0090002 (Black population) and DP0100002 (Hispanic population), all with Statistic Type "SUM"; click OK. Also save the project for future reference. The output layer ProxArea defines the proximal areas of the 10 public hospitals in Louisiana (as shown in Figure 4.5). Table 4.2 presents some basic demographic information for the proximal areas.
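The dissolve at the end of step 3 can also be scripted. The snippet below is a minimal sketch, assuming ArcGIS's arcpy site package is available, that LA_Trt1a has already been created as described above, and that the geodatabase path shown is only a placeholder; layer and field names follow the case study.

```python
import arcpy

arcpy.env.workspace = r"C:\Louisiana\LAState.gdb"  # assumed path to the geodatabase

# Dissolve tracts by their nearest hospital (FacilityID) and sum the
# demographic fields, reproducing the ProxArea layer of step 3.
arcpy.management.Dissolve(
    in_features="LA_Trt1a",
    out_feature_class="ProxArea",
    dissolve_field="FacilityID",
    statistics_fields=[["DP0010001", "SUM"],   # total population
                       ["DP0090002", "SUM"],   # Black population
                       ["DP0100002", "SUM"]])  # Hispanic population
```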

4.4.2 Part 2. Defining Hospital Service Areas by the Huff Model

Step 4. Attaching hospital attributes to the OD travel time table: In the OD travel time table ODTime.dbf, the fields OriginID and Destinatio represent the origin's id (i.e., field OBJECTID_1 of layer TrtPt) and the destination's id (i.e., field OBJECTID of layer Hosp), respectively. Join the attribute table of Hosp (based on the field OBJECTID) to the table ODTime.dbf (based on the field Destinatio).

Step 5. Measuring hospital potentials: Add a new field Potent (data type Double) to ODTime.dbf, and calculate it as Potent = 10000*Beds/Total_Minu^3. Again, the potential values have no unit, and multiplying by a constant 10,000 avoids very small values.

Step 6. Identifying hospitals with the highest potential: In contrast to the proximal area method that assigns a tract to the service area of its closest hospital, the Huff model considers the impact of a hospital's attraction (here, its size in terms of

FIGURE 4.5 Proximal areas for public hospitals in Louisiana.

number of beds) in addition to the distance (travel time) factor. A tract belongs to the service area of a hospital if that hospital (among the 10 hospitals) exerts the highest influence (potential) on it. In other words, the task is to identify the maximum potential $S_j d_{ij}^{-\beta}$ for j = 1, 2, …, 10. For a particular tract i, the denominator $\sum_{k=1}^{n} (S_k d_{ik}^{-\beta})$ in Equation 4.6 is the same for any hospital j, and thus the highest potential implies the highest probability, that is, $P_{ij} = S_j d_{ij}^{-\beta} / \sum_{k=1}^{n} (S_k d_{ik}^{-\beta})$. This is implemented by using the Summarize tool in ArcGIS (see step 5 in Section 3.3.1). On the table ODTime.dbf, right-click the field OriginID > Summarize. In the Summarize dialog window, choose ODTime.OriginID as the field to summarize; for summary statistics, expand the field ODTime.Potent and check Maximum and Sum*; name the output table Sum_Potent.dbf; and click OK. In the resulting table Sum_Potent.dbf, the field Maximum_Potent is the identified maximum potential among the 10 hospitals for each tract, and the field Sum_Potent is the total potential of the 10 hospitals for each tract. Join the table Sum_Potent.dbf back to ODTime.dbf (based on the common key OriginID). On the expanded table ODTime.dbf, select records with the criterion ODTime.Potent = Sum_Potent.Max_Potent (1148 records

* If the newly created field is not visible for selection, save, exit, and then reopen the project to reattempt the task.

TABLE 4.2
Population by Hospital Trade Areas in Louisiana

                                                     Proximal Areas                                       Trade Areas by Huff Model
Hospital Name                                 Beds   Total Pop.   Black (%)         Hispanic (%)     Total Pop.   Black (%)         Hispanic (%)
Interim LSU Public Hospital                    186   1,100,570    406,861 (37.0)    88,688 (8.1)     1,274,385    441,019 (34.6)    94,343 (7.4)
L.J. Chabert Medical Center                     61     260,819     49,762 (19.1)    11,033 (4.2)       231,347     48,180 (20.8)     9,326 (4.0)
LSUHSC-Bogalusa Medical Center                  37      77,420     16,717 (21.6)     1,857 (2.4)        73,453     17,919 (24.4)     1,938 (2.6)
Lallie Kemp Regional Medical Center             18     214,352     52,119 (24.3)     7,458 (3.5)       137,962     38,613 (28.0)     4,847 (3.5)
University Medical Center                       65     592,206    173,936 (29.4)    16,722 (2.8)       595,925    166,527 (27.9)    17,262 (2.9)
W.O. Moss Regional Medical Center               17     299,820     67,119 (22.4)     8,575 (2.9)       214,196     54,331 (25.4)     5,653 (2.6)
Earl K Long Medical Center                      76     773,883    286,143 (37.0)    26,763 (3.5)       764,896    281,891 (36.9)    25,983 (3.4)
LSUHSC-Shreveport                              386     495,910    192,428 (38.8)    15,551 (3.1)       663,461    244,136 (36.8)    20,374 (3.1)
E.A. Conway Medical Center                     103     364,122    138,243 (38.0)     6,906 (1.9)       327,382    120,116 (36.7)     6,160 (1.9)
LSUHSC-Huey P. Long Regional Medical Center     44     354,270    103,557 (29.2)     9,007 (2.5)       250,298     74,127 (29.6)     6,671 (2.7)
Sum                                            993   4,533,372  1,486,885 (32.8)   192,560 (4.2)     4,533,304  1,486,859 (32.8)   192,557 (4.2)

FIGURE 4.6 Service areas for public hospitals in Louisiana by Huff model (β = 3.0).

selected),* and export the selected records to a table Hosp_MaxPotent.dbf, which identifies the hospital that has the highest influence (potential) on each tract.

Step 7. Mapping the service areas of hospitals by the Huff model: Join the table Hosp_MaxPotent.dbf (based on the field OriginID) to the attribute table of LA_Trt (based on the field OBJECTID_1), which now contains the field Destinatio identifying which hospital has the highest potential for a census tract. A map, as in Figure 4.6, shows the service areas of hospitals by the Huff model (the number next to each hospital is its bed size). In comparison to Figure 4.5 based on the proximal areas, the service areas of large hospitals such as the LSUHSC-Shreveport at the northwest corner and the Interim LSU Public Hospital at the southeast corner expand, while those of the smaller hospitals shrink. Table 4.2 summarizes the difference between the results by the two methods. If desirable, one may also physically construct the new service area boundary layer by using the Dissolve tool as in step 3 in Section 4.4.1. Appendix 4B discusses a toolkit that automates the implementation of the Huff model.

Step 8. Mapping the probability surface for a hospital's visitation: As explained in step 6, the service area of a hospital in Figure 4.6 includes tracts

* If the number of selected records is not 1148, a likely cause is the different precision used in the two fields, and one may use the criterion "ABS(ODTime.Potent − Sum_Potent.Max_Potent) < 0.001."

FIGURE 4.7 Probability of visiting LSUHSC-Shreveport Hospital by Huff model (β = 3.0).

whose residents visit the hospital with the highest probability. However, one major lesson from the Huff model is that residents are likely to visit all hospitals, each with a different probability. This step uses the largest hospital (i.e., LSUHSC-Shreveport) to illustrate calibrating and mapping the probabilities of hospital visitation. On the table ODTime.dbf, add a new field Prob and calculate it as Prob = Potent/Sum_Potent, which is the probability of a tract's residents visiting a hospital; then select the records for the LSUHSC-Shreveport (Destinatio = 8) and export them to a table Prob_Shrev. Similar to step 7, join the table Prob_Shrev to LA_Trt, and map the field Prob, that is, the probability of residents visiting the LSUHSC-Shreveport among the 10 hospitals, as shown in Figure 4.7.

Another tool, "Huff Model (w External Distance Table)" under the toolkit Huff Model.tbx (also available under the study area folder Louisiana), automates the above process while using a predefined distance (travel time) table between customers and facilities (as is the case here). See Appendix 4B for more details.
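As a scripting alternative to steps 4–8, the following pandas sketch works from an exported O–D travel time table. It is illustrative only: the exported file name ODTime.csv is an assumption, while the field names (OriginID, Destinatio, Total_Minu, Beds), the scale factor, and β = 3.0 follow the case study.

```python
import pandas as pd

beta = 3.0  # travel friction coefficient assumed in Case Study 4B

od = pd.read_csv("ODTime.csv")           # O-D travel times joined with hospital attributes
od["Potent"] = 10000 * od["Beds"] / od["Total_Minu"] ** beta

# probability of each tract visiting each hospital (Equation 4.6)
od["Prob"] = od["Potent"] / od.groupby("OriginID")["Potent"].transform("sum")

# Huff-model service areas: the hospital with the highest potential for each tract
service_area = od.loc[od.groupby("OriginID")["Potent"].idxmax(),
                      ["OriginID", "Destinatio"]]

# probability surface for one hospital, e.g., LSUHSC-Shreveport (Destinatio = 8)
prob_shrev = od.loc[od["Destinatio"] == 8, ["OriginID", "Prob"]]
print(service_area.head())
print(prob_shrev.head())
```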

4.5 CONCLUDING REMARKS

While the concepts of the proximal area method and the Huff model are straightforward, their successful implementation relies on adequate measurements of variables, which remain among the most challenging tasks in trade area analysis.


First, both methods use distance or time. The proximal area method is based on the commonly known "least-effort principle" in geography (Zipf, 1949). Road network distance or travel time is generally a better measure than the straight-line (Euclidean) distance. However, network distance or travel time may not be the best measure of travel impedance. Travel cost, convenience, comfort, or safety may also be important, as discussed in Section 2.1. Research indicates that people of various socioeconomic or demographic characteristics perceive the same distance differently, that is, there is a difference between cognitive and physical distances (Cadwallader, 1975). Defining network distance or travel time also depends on the particular transportation mode. This makes distance or time measurement more than a routine task. Accounting for interactions by telecommunication, the Internet, and other modern technologies adds further complexity to the issue.

Second, in addition to distance or time, the Huff model has two more variables: attraction and the travel friction coefficient (S and β in Equation 4.6). Attraction is measured by winning percentage in Case Study 4A and by hospital bed size in Case Study 4B. Both are oversimplifications. More advanced methods may be employed to consider more factors in measuring attraction (e.g., the multiplicative competitive interaction or MCI model discussed in Section 4.2.4). The travel friction coefficient β is also difficult to define as it varies across time and space, between transportation modes, by type of commodities, and so on.

For additional practice of trade area analysis methods, one may conduct a trade area analysis of chain stores in a familiar study area. Store addresses can be found on the Internet or in other sources (yellow pages, store directories), and geocoded by following the procedure discussed in Section 2.4.1. Population census data can be used to measure customer bases. A trade area analysis of the chain stores may be used to project market potentials and evaluate the performance of individual stores.

APPENDIX 4A: ECONOMIC FOUNDATION OF THE GRAVITY MODEL

The gravity model is often criticized, particularly by economists, for its lack of foundation in individual behavior. This appendix follows the work of Colwell (1982) in an attempt to provide a theoretical base for the gravity model. For a review of other approaches to derive the gravity model, see Fotheringham et al. (2000, pp. 217–234). Assume a trip utility function in a Cobb–Douglas form such as

u_i = a x^{\alpha} z^{\gamma} t_{ij}^{\tau_{ij}},   (A4.1)

where $u_i$ is the utility of an individual at location i, x is a composite commodity (i.e., all other goods), z is leisure time, $t_{ij}$ is the number of trips taken by an individual at i to j, $\tau_{ij} = \beta P_j^{\varphi} / P_i^{\xi}$ is the trip elasticity of utility that is directly related to the destination's population $P_j$ and inversely related to the origin's population $P_i$, and a, α, β, γ, φ, and ξ are positive parameters. Colwell (1982, p. 543) justifies this particular way of defining the trip elasticity of utility on the ground of central place theory: larger places serve many of the same functions as smaller places plus higher order functions not


found in smaller places, and thus the elasticity $\tau_{ij}$ is larger for trips from the smaller to the larger place than trips from the larger to the smaller place. The budget constraint is written as

p x + r d_{ij} t_{ij} = wW,   (A4.2)

where p is the price of x, r is the unit distance cost for travel, $d_{ij}$ is the distance between point i and j, w is the wage rate, and W is the time worked. In addition, the time constraint is

s x + h d_{ij} t_{ij} + z + W = H,   (A4.3)

where s is the time required per unit of x consumed, h is the travel time per unit of distance, and H is total time. Combining the two constraints in Equations A4.2 and A4.3 yields

(p + ws) x + (r d_{ij} + w h d_{ij}) t_{ij} + wz = wH.   (A4.4)

Maximizing the utility in Equation A4.1 subject to the constraint in Equation A4.4 yields the following Lagrangian function

L = a x^{\alpha} z^{\gamma} t_{ij}^{\tau_{ij}} - \lambda [ (p + ws) x + (r d_{ij} + w h d_{ij}) t_{ij} + wz - wH ].

On the basis of the four first-order conditions, that is, \partial L/\partial x = \partial L/\partial z = \partial L/\partial t_{ij} = \partial L/\partial \lambda = 0, we can solve for $t_{ij}$ by eliminating λ, x, and z:

t_{ij} = \frac{wH \tau_{ij}}{(r + wh) d_{ij} (\alpha + \gamma + \tau_{ij})}.   (A4.5)
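The closed form in Equation A4.5 can be checked symbolically. The SymPy sketch below is only a verification aid, not part of Colwell's derivation; it treats a single destination j (so the elasticity is a constant τ) and confirms that the candidate optimum implied by the Cobb–Douglas expenditure shares satisfies the tangency conditions and the combined constraint A4.4.

```python
import sympy as sp

# positive parameters and choice variables (single destination, so tau is a constant)
a, alpha, gamma, tau = sp.symbols('a alpha gamma tau', positive=True)
p, w, s, r, h, d, H = sp.symbols('p w s r h d H', positive=True)
x, z, t = sp.symbols('x z t', positive=True)

u = a * x**alpha * z**gamma * t**tau            # utility, Equation A4.1
spend = (p + w*s)*x + (r + w*h)*d*t + w*z       # left-hand side of Equation A4.4

# candidate optimum from Cobb-Douglas expenditure shares; t matches Equation A4.5
denom = alpha + gamma + tau
x_opt = w*H*alpha / ((p + w*s)*denom)
z_opt = H*gamma / denom
t_opt = w*H*tau / ((r + w*h)*d*denom)
subs = {x: x_opt, z: z_opt, t: t_opt}

# tangency (first-order) conditions and the budget-time constraint all reduce to zero
print(sp.simplify((sp.diff(u, x)/sp.diff(u, z) - (p + w*s)/w).subs(subs)))    # 0
print(sp.simplify((sp.diff(u, t)/sp.diff(u, z) - (r + w*h)*d/w).subs(subs)))  # 0
print(sp.simplify(spend.subs(subs) - w*H))                                    # 0
```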

It is assumed that travel cost per unit of distance r is a function of distance $d_{ij}$ such as

r = r_0 d_{ij}^{\sigma},   (A4.6)

where $r_0 > 0$ and $\sigma > -1$ so that total travel costs are an increasing function of distance. Therefore, the travel time per unit of distance, h, has a similar function

h = h_0 d_{ij}^{\sigma},   (A4.7)

so that travel time is proportional to travel cost. For simplicity, assume that the utility function is homogeneous to degree one, that is,

\alpha + \gamma + \tau_{ij} = 1.   (A4.8)

Substituting Equations A4.6, A4.7, and A4.8 into A4.5 and using $\tau_{ij} = \beta P_j^{\varphi} / P_i^{\xi}$, we obtain


t_{ij} = \frac{wH \beta P_i^{-\xi} P_j^{\varphi}}{(r_0 + w h_0) d_{ij}^{1+\sigma}}.   (A4.9)

Finally, multiplying Equation A4.9 by the origin's population yields the total number of trips from i to j

T_{ij} = P_i t_{ij} = \frac{wH \beta P_i^{1-\xi} P_j^{\varphi}}{(r_0 + w h_0) d_{ij}^{1+\sigma}},   (A4.10)

which resembles the gravity model in Equation 4.14.

APPENDIX 4B: A TOOLKIT FOR IMPLEMENTING THE HUFF MODEL*

A toolkit is developed to automate the implementation of the Huff model as discussed in Section 4.2.† Specifically, it includes two tools:

1. Huff model (based on Euclidean distances between customers and facilities)
2. Huff model with an external distance table

One may access a tool in the toolkit directly from ArcCatalog. Take the "Huff Model" tool as an example. In ArcCatalog, locate the toolkit Huff Model.tbx and expand it to display the aforementioned two tools > double-click the tool "Huff Model." The interface is shown in Figure A4.1, which uses the tool to implement the Huff model (steps 5–7 in Section 4.3.2). Below is a quick overview of items to define:

• Two items associated with the customer feature (the layer and its ID field)
• Three items for the facility feature (in addition to the layer and its ID, facility size defines the variable S in Equation 4.6)
• Distance decay function f(x), that is, one of the three continuous functions (power, exponential, and Gaussian), and its associated coefficient,‡ as discussed in Section 2.3
• Distance unit conversion factor, with a default value 0.001 (e.g., converting meters into kilometers)§





* Suggested citation for using this tool: Zhu H. and F. Wang. 2015. Appendix 4B: A toolkit for implementing the Huff model. In Quantitative Methods and Socioeconomic Applications in GIS (2nd ed.). Boca Raton, FL: Taylor & Francis. pp. 90–91.
† Under the study area folder Chicago, note the subfolder Scripts, which contains the Python codes for the toolkit. Be sure to copy the subfolder along with the file Huff Model.tbx for the program to be executable. The same files associated with the toolkit are also available under another study area folder Louisiana.
‡ The default values for the power and exponential functions are set at 3 and 0.03, respectively. For the Gaussian function, the coefficient is the value for σ (with a default value = 80) since μ = 0 for a monotonic distance decay pattern when assuming no crater around x = 0.
§ Since the Euclidean distances are automatically generated by the tool, the values may be large when a unit such as meter (e.g., in a UTM projection) is used. The direct use of distance values without a conversion could be especially problematic for the exponential or Gaussian distance decay function, which is not scale independent.

FIGURE A4.1 Interface for implementing the Huff model.

• Weight scale factor for inflating the values for potentials, with a default value 10,000 (as explained in step 6 in Section 4.3.2)
• Facility IDs (if one facility is chosen, a field in the output feature will save the probability of each customer location visiting the chosen facility; if several facilities are chosen, multiple fields will save the information for the selected facilities)
• Output feature for saving the results (including a field identifying the facility visited by each customer location with the highest probability, and the aforementioned field(s) for probabilities of visiting selected facilities)

It is often desirable for a user to prepare an external distance (or travel time) table between the supply and demand locations prior to the implementation of the Huff model because the automatically generated distance table is based on Euclidean distances. In such a case, use the tool "Huff Model w External Distance Table." The interface is very similar to Figure A4.1 other than four items for defining the external distance


table (table name, customer ID, facility ID, and travel impedance between them). The ID fields of customer and facility locations must be consistent with how the corresponding layers and ID fields are defined in the same interface. Since the distance (travel time) comes from the predefined table, no option is given to define a distance unit conversion factor.
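The three continuous decay options listed above can be expressed compactly. The function below is a hypothetical stand-alone sketch (not the toolkit's source code) using the default coefficients noted in the footnotes: 3 for the power function, 0.03 for the exponential, and σ = 80 for the Gaussian with μ = 0.

```python
import math

def distance_decay(d_km, kind="power", coef=None):
    """Distance decay weight f(d) for the Huff model; d_km is distance in km."""
    if kind == "power":          # f(d) = d^(-beta), default beta = 3
        beta = 3.0 if coef is None else coef
        return d_km ** (-beta)
    if kind == "exponential":    # f(d) = exp(-beta * d), default beta = 0.03
        beta = 0.03 if coef is None else coef
        return math.exp(-beta * d_km)
    if kind == "gaussian":       # f(d) = exp(-d^2 / (2 sigma^2)), mu = 0, default sigma = 80
        sigma = 80.0 if coef is None else coef
        return math.exp(-d_km ** 2 / (2 * sigma ** 2))
    raise ValueError("kind must be 'power', 'exponential', or 'gaussian'")

# example: decay weights for a 10 km trip under each assumption
print([distance_decay(10, k) for k in ("power", "exponential", "gaussian")])
```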

5

GIS-Based Measures of Spatial Accessibility and Application in Examining Health Care Access

Accessibility refers to the relative ease by which the locations of activities such as work, school, shopping, recreation, and health care can be reached from a given location. Accessibility is an important issue for several reasons. Resources or services are scarce, and their efficient delivery requires adequate access by people. The spatial distribution of resources or services is not uniform, and needs effective planning and allocation to match demands. Disadvantaged population groups (e.g., lowincome and minority residents) often suffer from poor access to certain activities or opportunities because of their lack of economic or transportation means. Access can thus become a social justice issue, which calls for careful planning and public policies by government agencies. Accessibility is determined by the distributions of supply and demand and how they are connected in space, and thus a classic issue for location analysis well suited for GIS to address. This chapter focuses on how spatial accessibility is measured by GIS-based methods. Section 5.1 overviews the issues on accessibility, followed by two GIS-based methods for defining spatial accessibility: the floating catchment area methods in Section 5.2 and the gravity-based method in Section 5.3. Section 5.4 illustrates how the two methods are implemented in a case study of measuring accessibility to primary care physicians in the Chicago region. Section 5.5 concludes the chapter with a brief summary.

5.1 ISSUES ON ACCESSIBILITY

Access may be classified according to two dichotomous dimensions (potential vs. revealed, and spatial vs. aspatial) into four categories, such as potential spatial access, potential aspatial access, revealed spatial access, and revealed aspatial access (Khan, 1992). Revealed accessibility focuses on actual use of a service, whereas potential accessibility signifies the probable utilization of a service. The revealed accessibility may be reflected by frequency or satisfaction level of using a service, and thus be obtained in a survey. Most studies examine potential accessibility, based on which planners and policy analysts evaluate the existing system of service delivery and identify strategies for improvement. Spatial access emphasizes the importance of spatial separation between supply and demand as a barrier or a facilitator,


whereas the aspatial access stresses nongeographic barriers or facilitators (Joseph and Phillips, 1984). Aspatial access is related to many demographic and socioeconomic variables. In the study on job access, Wang (2001b) examined how workers' characteristics such as race, sex, wage, family structure, educational attainment, and home ownership status affect commuting time and thus job access. In the study on healthcare access, Wang and Luo (2005) included several categories of aspatial variables: demographics such as age, sex, and ethnicity; socioeconomic status such as population under the poverty line, female-headed households, home ownership and income; environment such as residential crowdedness and housing units' lack of basic amenities; linguistic barrier and service awareness such as population without a high-school diploma and households linguistically isolated; and transportation mobility such as households without vehicles. Since these variables are often correlated to each other, they may be consolidated into a few independent factors by using the principal components and factor analysis techniques (see Chapter 7).

This chapter focuses on measuring potential spatial accessibility, an issue particularly interesting to geographers and location analysts. If the capacity of supply is less a concern, one can use simple supply-oriented accessibility measures that emphasize the proximity to supply locations. For instance, Onega et al. (2008) used minimum travel time to the closest cancer care facility to measure accessibility to cancer care service. Distance or time from the nearest provider can be obtained using the techniques illustrated in Chapter 2. Scott and Horner (2008) used the cumulative opportunities within a distance (travel time) range to measure job accessibility. Hansen (1959) used a simple gravity-based potential model to measure accessibility to jobs. The model is written as:

A_i^H = \sum_{j=1}^{n} S_j d_{ij}^{-\beta},   (5.1)

where $A_i^H$ is the accessibility at location i, $S_j$ is the supply capacity at location j, $d_{ij}$ is the distance or travel time between the demand (at location i) and a supply location j, β is the travel friction coefficient, and n is the total number of supply locations. The superscript H in $A_i^H$ denotes the measure based on the Hansen model, versus F for the measure based on the two-step floating catchment area method in Equation 5.2 and G for the measure based on the gravity model in Equation 5.3. The potential model values supplies at all locations, each of which is discounted by a distance term. The model does not account for the demand side. That is to say, the amount of population competing for the supplies is not considered to affect accessibility. The model is the foundation for a more advanced gravity-based method that will be explained in Section 5.3.

In most cases, accessibility measures need to account for both supply and demand because of scarcity of supply. Prior to the widespread use of GIS, the simple supply–demand ratio method computes the ratio of supply versus demand in an area (usually an administrative unit such as township or county) to measure accessibility. For example, Cervero (1989) and Giuliano and Small (1993) measured job accessibility by the ratio of jobs versus resident workers across subareas (central city and combined suburban townships) and used the ratio to explain intraurban variations of


commuting time. In the literature on job access and commuting, the method is commonly referred to as the jobs-housing balance approach. The U.S. Department of Health and Human Services (DHHS) uses the population-to-physician ratio within a “rational service area” (most as large as a whole county or a portion of a county or established neighborhoods and communities) as a basic indicator for defining physician shortage areas (Lee, 1991; GAO, 1995). In the literature on healthcare access and physician shortage area designation, the method is referred to as the regional availability measure (vs. the regional accessibility measure based on the gravity model) (Joseph and Phillips, 1984). The simple supply–demand ratio method has at least two shortcomings. First, it cannot reveal the detailed spatial variations within an area unit (usually large). For example, the job–housing balance approach computes the jobs–resident workers ratio and uses it to explain commuting across cities, but cannot explain the variation within a city. Secondly, it assumes that the boundaries are impermeable, that is, demand is met by supply only within the areas. For instance, in physician shortage area designation by the DHHS, the population-to-physician ratio is often calculated at the county level, implying that residents do not visit physicians beyond county borders. The next two sections discuss the floating catchment area (FCA) methods and some more advanced models. All consider both supply and demand, and overcome the shortcomings mentioned above.

5.2 FLOATING CATCHMENT AREA METHODS

5.2.1 Earlier Versions of the Floating Catchment Area (FCA) Method

Earlier versions of the floating catchment area (FCA) method are very much like the one discussed in Section 3.1 on spatial smoothing. For example, in Peng (1997), a catchment area is defined as a square around each location of residents, and the jobs–residents ratio within the square measures the job accessibility for that location. The catchment area “floats” from one residential location to another across the study area, and defines the accessibility for all locations. The catchment area may also be defined as a circle (Immergluck, 1998; Wang, 2000) or a fixed travel time range (Wang and Minor, 2002), and the concept remains the same. Figure 5.1 uses an example to illustrate the method. For simplicity, assume that each demand location (e.g., tract) has only one resident at its centroid and the capacity of each supply location is also one. Assume that a circle around the centroid of a residential location defines its catchment area. Accessibility in a tract is defined as the supply-to-demand ratio within its catchment area. For instance, within the catchment area of tract 2, total supply is 1 (i.e., only a), and total demand is 7. Therefore, accessibility at tract 2 is the supply-to-demand ratio, that is, 1/7. The circle floats from one centroid to another while its radius remains the same. The catchment area of tract 11 contains a total supply of 3 (i.e., a, b and c) and a total demand of 7, and thus the accessibility at tract 11 is 3/7. Here the ratio is based on the floating catchment area and not confined by the boundary of an administrative unit. The above example can also be used to explain the fallacies of this simple FCA method. It assumes that services within a catchment area are fully available to residents

FIGURE 5.1 Basic floating catchment area (FCA) method in Euclidean distance.

within that catchment area and residents use only those services. However, the distance between the supply and demand within the catchment area may exceed the threshold distance (e.g., in Figure 5.1, the distance between 13 and a is greater than the radius of the catchment area of tract 11). Furthermore, the supply at a is within the catchment of tract 2, but may not be fully available to serve demands within that catchment as it is also reachable by tract 11. This points out the need to discount the availability of a supplier by the intensity of competition for its service of surrounding demand.

5.2.2 Two-Step Floating Catchment Area (2SFCA) Method

A method developed by Luo and Wang (2003) overcomes the above fallacies. It repeats the process of "floating catchment" twice (once on supply locations and once on demand locations), and is therefore referred to as the two-step floating catchment area (2SFCA) method. First, for each supply location j, search all demand locations (k) that are within a threshold travel distance (d0) from location j (i.e., catchment area j), and compute the supply-to-demand ratio Rj within the catchment area:

R_j = \frac{S_j}{\sum_{k \in \{d_{kj} \le d_0\}} D_k},


where $d_{kj}$ is the distance between k and j, $D_k$ is the demand at location k that falls within the catchment (i.e., $d_{kj} \le d_0$), and $S_j$ is the capacity of supply at location j. Next, for each demand location i, search all supply locations (j) that are within the threshold distance (d0) from location i (i.e., catchment area i), and sum up the supply-to-demand ratios $R_j$ at those locations to obtain the accessibility $A_i^F$ at demand location i:

A_i^F = \sum_{j \in \{d_{ij} \le d_0\}} R_j = \sum_{j \in \{d_{ij} \le d_0\}} \left( \frac{S_j}{\sum_{k \in \{d_{kj} \le d_0\}} D_k} \right),   (5.2)

where dij is the distance between i and j, and Rj is the supply-to-demand ratio at supply location j that falls within the catchment centered at i (i.e., dij ≤ d 0). A larger value of AiF indicates a better accessibility at a location. The first step above assigns an initial ratio to each service area centered at a supply location as a measure of supply availability (or crowdedness). The second step sums up the initial ratios in the overlapped service areas to measure accessibility for a demand location, where residents have access to multiple supply locations. The method considers interaction between demands and supplies across areal unit borders, and computes an accessibility measure that varies from one location to another. Equation 5.2 is basically the supply-to-demand ratio (filtered by a threshold distance or filtering window). Figure 5.2 uses the same example to illustrate the 2SFCA method. Here we use travel time instead of Euclidean distance to define catchment area. The catchment area for supply a has one supply and eight residents, and thus carries a supply-todemand ratio of 1/8. Similarly, the ratio for catchment b is 1/4 and for catchment c is 1/5. The resident at tract 3 has access to a only, and the accessibility at tract 3 is equal to the supply-to-demand ratio at a (the only supply location), that is, Ra = 0.125. Similarly, the resident at tract 5 has access to b only, and thus its accessibility is Rb = 0.25. However, the resident at 4 can reach both supplies a and b (shown in an area overlapped by catchment areas a and b), and therefore enjoys a better accessibility (i.e., Ra + Rb = 0.375). The catchment drawn in the first step is centered at a supply location, and thus the travel time between the supply and any demand within the catchment does not exceed the threshold. The catchment drawn in the second step is centered at a demand location, and all supplies within the catchment contribute to the supplyto-demand ratios at that demand location. The method overcomes the fallacies in the earlier FCA methods. Equation 5.2 is basically a supply-to-demand ratio with only selected supplies and demands entering the numerator and the denominator, and the selections are based on a threshold distance or time within which supplies and demands interact. If a supply location falls within the catchment area from a demand site, the demand site is also within the catchment area from that supply facility. The distance or time matrix is calculated once, but used twice in the search in both steps. Travel time should be used if the distance is a poor measure of travel impedance.
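The toy example in Figure 5.2 can be verified with a few lines of code. The sketch below hard-codes only the facts stated above (catchment populations of 8, 4, and 5, and which supplies each highlighted tract can reach); it is an illustration of the arithmetic, not a general implementation.

```python
# supply-to-demand ratio of each facility's catchment (step 1)
R = {"a": 1 / 8, "b": 1 / 4, "c": 1 / 5}

# supplies reachable from selected demand tracts within the travel time threshold
reachable = {3: ["a"], 5: ["b"], 4: ["a", "b"]}

# accessibility of each tract = sum of the ratios it can reach (step 2)
access = {tract: sum(R[j] for j in supplies) for tract, supplies in reachable.items()}
print(access)  # {3: 0.125, 5: 0.25, 4: 0.375}
```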

FIGURE 5.2 Two-step floating catchment area (2SFCA) method in travel time.

The method can be implemented in ArcGIS. The detailed procedures are explained in Section 5.4.
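Beyond the toy example, the two steps translate directly into a generic table operation. The following pandas sketch is a minimal outline under stated assumptions (not the ArcGIS procedure of Section 5.4): it expects a long-format table with one row per demand–supply pair and columns named demand_id, supply_id, supply, demand, and dist, and it returns the accessibility score of Equation 5.2 for each demand location that has at least one supply within the threshold.

```python
import pandas as pd

def two_step_fca(od: pd.DataFrame, d0: float) -> pd.Series:
    """2SFCA accessibility from a long-format demand-supply distance table."""
    within = od[od["dist"] <= d0].copy()
    # step 1: supply-to-demand ratio R_j of each supply location's catchment
    demand_sum = within.groupby("supply_id")["demand"].transform("sum")
    within["R_j"] = within["supply"] / demand_sum
    # step 2: sum the ratios of all supplies reachable from each demand location
    return within.groupby("demand_id")["R_j"].sum()

# usage (file name assumed): od = pd.read_csv("distance_table.csv")
# scores = two_step_fca(od, d0=32180)   # e.g., 20 miles in meters
```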

5.3 GRAVITY-BASED AND GENERALIZED 2SFCA MODELS

5.3.1 Gravity-Based Accessibility Index

The 2SFCA method draws an artificial line (say, 15 miles or 30 min) between an accessible and inaccessible location. Supplies within that range are counted equally regardless of the actual travel distance or time (e.g., 2 miles vs. 12 miles). Similarly, all supplies beyond that threshold are considered as inaccessible, regardless of any difference in travel distance or time. The gravity model rates a nearby supply more accessibly than a remote one, and thus reflects a continuous decay of access in distance. The potential model in Equation 5.1 considers only the supply not the demand side (i.e., competition for available supplies among demands). Weibull (1976) improved the measurement by accounting for competition for services among residents (demands). Joseph and Bantock (1982) applied the method to assess healthcare accessibility. Shen (1998) and Wang (2001b) used the method for evaluating job accessibility. The gravity-based accessibility measure at location i can be written as

A_i^G = \sum_{j=1}^{n} \frac{S_j d_{ij}^{-\beta}}{V_j}, \quad \text{where } V_j = \sum_{k=1}^{m} D_k d_{kj}^{-\beta}.   (5.3)

$A_i^G$ is the gravity-based index of accessibility, where n and m are the total numbers of supply and demand locations, respectively, and the other variables are the same as in Equations 5.1 and 5.2. Compared to the primitive accessibility measure based on the Hansen model $A_i^H$, $A_i^G$ discounts the availability of a physician by the service competition intensity at that location, $V_j$, measured by its population potential. A larger $A_i^G$ implies better accessibility. This accessibility index can be interpreted similarly to the one defined by the 2SFCA method. It is essentially the ratio of supply S to demand D, each of which is weighted by travel distance or time to a negative power. The total accessibility score (i.e., the sum of individual accessibility indexes multiplied by the corresponding demand amounts), either by the 2SFCA or the gravity-based method, is equal to the total supply. Alternatively, the weighted average of accessibility across all demand locations is equal to the supply-to-demand ratio in the whole study area (see Appendix 5A for a proof).
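That conservation property is easy to confirm numerically. The snippet below is a quick check on randomly generated data (all names and numbers are illustrative), showing that the demand-weighted total of the gravity-based scores recovers the total supply, so the weighted mean equals the overall supply-to-demand ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, beta = 200, 8, 1.0                       # demand sites, supply sites, friction
D = rng.integers(100, 5000, m).astype(float)   # demand (e.g., population)
S = rng.integers(1, 50, n).astype(float)       # supply (e.g., physicians)
dist = rng.uniform(1.0, 60.0, (m, n))          # distances or travel times

W = dist ** (-beta)                            # distance decay weights d_ij^(-beta)
V = (D[:, None] * W).sum(axis=0)               # V_j = sum_k D_k d_kj^(-beta)
A = (S * W / V).sum(axis=1)                    # A_i^G, Equation 5.3

print(np.isclose((D * A).sum(), S.sum()))                      # total equals total supply
print(np.isclose((D * A).sum() / D.sum(), S.sum() / D.sum()))  # weighted mean = S/D ratio
```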

5.3.2 Comparison of the 2SFCA and Gravity-Based Methods

A careful examination of the two methods further reveals that the two-step floating catchment area (2SFCA) method is merely a special case of the gravity-based method. The 2SFCA method treats distance (time) impedance as a dichotomous measure, that is, any distance (time) within a threshold is equally accessible and any distance (time) beyond the threshold is equally inaccessible. Using d0 as the threshold travel distance (time), distance or time can be recoded as:

1. $d_{ij}$ (or $d_{kj}$) = ∞ if $d_{ij}$ (or $d_{kj}$) > d0
2. $d_{ij}$ (or $d_{kj}$) = 1 if $d_{ij}$ (or $d_{kj}$) ≤ d0

For any β > 0 in Equation 5.3, we have:

1. $d_{ij}^{-\beta}$ (or $d_{kj}^{-\beta}$) = 0 when $d_{ij}$ (or $d_{kj}$) = ∞
2. $d_{ij}^{-\beta}$ (or $d_{kj}^{-\beta}$) = 1 when $d_{ij}$ (or $d_{kj}$) = 1

In case 1, $S_j$ or $P_k$ will be excluded by being multiplied by zero, and in case 2, $S_j$ or $P_k$ will be included by being multiplied by one. Therefore, Equation 5.3 reduces to Equation 5.2, and thus the 2SFCA measure is just a special case of the gravity-based measure. Considering that the two methods have been developed in different fields for a variety of applications, this proof validates their rationale for capturing the essence of accessibility measures.

In the 2SFCA method, a larger threshold distance or time reduces variability of accessibility across space, and thus leads to stronger spatial smoothing (Fotheringham et al., 2000, p. 46; also see Section 3.1). In the gravity-based method, a lower value of


travel friction coefficient β leads to a lower variance of accessibility scores, and thus stronger spatial smoothing. The effect of a larger threshold travel time in the 2SFCA method is equivalent to that of a smaller travel friction coefficient in the gravity-based method. Indeed, a lower β value implies that travel distance or time matters less and people are willing to travel farther to seek a service. The gravity-based method seems to be more conceptually sound than the 2SFCA method. However, the 2SFCA method may be a better choice in some cases for two reasons. First, the gravity-based method tends to inflate accessibility scores in poor-access areas more than the 2SFCA method does (Luo and Wang, 2003), yet the poor-access areas are usually the areas of most interest to many public policy makers. In addition, the gravity-based method also involves more computation and is less intuitive. In particular, finding the value of the distance friction coefficient β requires additional data and work to define and may be region-specific (Huff, 2000).

5.3.3 Generalized 2SFCA Model

Despite the relative popularity of 2SFCA, the method's major limitation remains its dichotomous approach that defines a doctor as accessible or inaccessible by a cut-off distance (time). Many studies have attempted to improve it, and most are from healthcare-related applications. A kernel density function (Guagliardo, 2004) or a Gaussian function (Dai, 2010; Shi et al., 2012) has been proposed to model the distance decay effect (i.e., a continuously gradual decay within a threshold distance and no effect beyond). The catchment radius may also vary by provider types or neighborhood types (Yang et al., 2006), and by geographic settings (e.g., shorter in urban and longer in rural areas) (McGrail and Humphreys, 2009a). Weights can be assigned to different travel time zones so that the supply–demand interaction drops with travel time by multiple steps, referred to as "expanded 2SFCA (E2SFCA)" (Luo and Qi, 2009). McGrail and Humphreys (2009b) proposed a constant weight within 10 min, a zero weight beyond 60 min, and a weight of gradual decay between, and their method can be termed a "three-zone hybrid approach." These methods can be synthesized in one framework and vary in their conceptualization of distance decay in patient–physician interactions (Wang, 2012). Representing the distance decay effect in a function f(d), we can write the generalized 2SFCA model as:

A_i = \sum_{j=1}^{n} \left[ S_j f(d_{ij}) \Big/ \left( \sum_{k=1}^{m} P_k f(d_{kj}) \right) \right],   (5.4)

Figure 5.3a–f summarizes various ways of conceptualizing the distance decay function f(d): a continuous function such as a power function (Figure 5.3a) or a Gaussian function (Figure 5.3b), a discrete variable such as binary in 2SFCA (Figure 5.3c) or multiple weights in E2SFCA (Figure 5.3d), or a hybrid of the two such as a kernel function (Figure 5.3e) or a three-zone hybrid approach (Figure 5.3f).
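The unifying role of f(d) in Equation 5.4 can be made concrete with a small sketch. The functions below are illustrative stand-ins for the conceptualizations in Figure 5.3 (binary 2SFCA, multi-step E2SFCA weights, and a continuous Gaussian); the specific break points and weights are hypothetical, not values recommended by the text.

```python
import numpy as np

def f_binary(d, d0=30):
    """2SFCA: full weight within the catchment, zero beyond (Figure 5.3c)."""
    return (d <= d0).astype(float)

def f_steps(d, breaks=(10, 20, 30), weights=(1.0, 0.68, 0.22)):
    """E2SFCA-style step weights by travel-time zone (Figure 5.3d); values illustrative."""
    w = np.zeros_like(d, dtype=float)
    lower = -np.inf
    for b, wt in zip(breaks, weights):
        w[(d > lower) & (d <= b)] = wt
        lower = b
    return w

def f_gaussian(d, sigma=30):
    """Continuous Gaussian decay (Figure 5.3b)."""
    return np.exp(-d ** 2 / (2 * sigma ** 2))

def generalized_2sfca(S, P, d, f):
    """Equation 5.4: A_i = sum_j S_j f(d_ij) / (sum_k P_k f(d_kj))."""
    W = f(d)                               # m x n matrix of decay weights f(d_ij)
    V = (P[:, None] * W).sum(axis=0)       # competition term for each supply j
    # guard against supplies that no demand can reach (zero denominator)
    ratio = np.divide(S * W, V, out=np.zeros_like(W), where=V > 0)
    return ratio.sum(axis=1)
```

Swapping f_binary, f_steps, or f_gaussian into generalized_2sfca changes only the conceptualization of distance decay, which is exactly the point of the generalized framework.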

FIGURE 5.3 Conceptualizing distance decay in G2SFCA: (a) Gravity function, (b) Gaussian function, (c) 2SFCA, (d) E2SFCA, (e) kernel, and (f) three-zone hybrid approach.

What is the appropriate size for the catchment area(s)? Which is the best function to capture the distance decay effect? Any debate over the best function or the right size for catchment area cannot be settled without analyzing real-world travel behavior. See Section 2.3 for various methods for deriving the best-fitting function and related parameter(s).

5.4 CASE STUDY 5: MEASURING SPATIAL ACCESSIBILITY TO PRIMARY CARE PHYSICIANS IN CHICAGO REGION

This case study is based on the work reported in Luo and Wang (2003). It uses the same study area as in Case Study 4A (Section 4.3), that is, 10 Illinois counties in the Chicago Consolidated Metropolitan Statistical Area (CMSA). The following features in the geodatabase ChiRegion.gdb under the data folder Chicago are used for this project:

1. Feature trtcent for census tract centroids with the field POPU representing population extracted from the 2000 census
2. Feature zipcent for zip code area centroids with the field DOC00 representing the number of primary care physicians in each zip code area based on the 2000 Physician Master File of the American Medical Association (AMA)
3. Polygon features trt2k (census tracts) and cnty10 (counties) for reference and mapping


Both census tracts and zip code areas are represented by their population-weighted centroids derived by a process illustrated in step 1 in Section 4.3. This project uses the Euclidean distances to measure travel impedance for simplicity so that we can focus on implementing the accessibility measures in GIS.

5.4.1 Part 1. Implementing the 2SFCA Method

Step 1. Computing distances between census tracts and zip code areas: In ArcToolbox, use the Point Distance tool to compute the Euclidean distances between population locations (trtcent as Input Features) and physician locations (zipcent as Near Features) and save the Output Table as DistAll.dbf. Refer to step 3 in Section 2.4.1 for detailed illustration of using the Point Distance tool. The distance table has 1901 × 325 = 617,825 records. Step 2. Attaching population and physician data to the distance table: Join the attribute table of population (trtcent, based on the field OBJECTID) to the distance table DistAll.dbf (based on the field INPUT_FID), and then join the attribute table of physicians (zipcent, based on the field OBJECTID) to the distance table DistAll.dbf (based on the field NEAR_FID).* Step 3. Extracting distances within a catchment area: On the distance table DistAll.dbf, select the records with “DISTANCE ≤ 32,180 m” (i.e., 20 miles), and export to a new table Dist20mi.dbf, which has 229,227 records. This is similar to step 4 in Section 3.3.1 for detailed illustration of using an attribute query. The new distance table only includes those distances within the threshold of 20 miles,† and thus implements the selection conditions i ∈ {dij ≤ d0} and k ∈ {dkj ≤ d0} in Equation 5.2. Step 4. Summing population around each physician location: On the table Dist20mi.dbf, use the Summarize tool to sum up population POPU by physician locations NEAR _ FID, and save the result in a new table DocAvl20mi.dbf, which has 325 records (i.e., number of physician locations in zip code areas). Refer to step 5 in Section 3.3.1 for detailed illustration of using the Summarize tool. The field Sum _ POPU is the total population within the threshold distance from each physician location, and therefore implements calibration of the term Σ k ∈{dkj ≤ d0 } Dk in Equation 5.2. Step 5. Computing initial physician-to-population ratio at each physician location: Join the updated table DocAvl20mi.dbf back to the distance table Dist20mi.dbf (based on their common field NEAR_FID). On the table Dist20mi.dbf, add a field docpopR and compute it as docpopR = 1000*DOC00/Sum_POPU. This assigns an initial physician-to-population ratio to each physician location. This step computes the term (S j /Σ k ∈{dkj ≤ d0 } Dk ) in Equation 5.2. The ratio is inflated 1000 times to indicate the physician availability per 1000 residents. *



The step uses the tool “attribute join,” which is also used in steps 5 and 7 in this case study. We use the language, “join A to B,” where A represents the source table and B the destination table. See step 5 in Section 2.4.1 for detailed implementation. A reasonable threshold distance for primary care physician is 15 miles (travel distance). We set the search radius at 20 miles (Euclidean distance), roughly equivalent to a travel distance of 30 miles, i.e., twice the reasonable travel distance. This is perhaps an upper limit.


Step 6. Summing up physician-to-population ratios by population locations: On the updated table Dist20mi.dbf, use the Summarize tool again to sum up the initial physician-to-population ratios docpopR by population locations INPUT_FID, and save the result in a new table TrtAcc20mi.dbf, which has 1901 records (i.e., number of population locations in census tracts). If the newly added field docpopR from step 5 is not available for selection, remove the joins from Dist20mi.dbf (or export to a new table) and redo “Summarize” (similar to step 7 in Section 4.3.2). The field Sum_docpopR is the summed-up availability of physicians that are reachable from each residential location, and thus returns the accessibility score AiF in Equation 5.2. Figure 5.4 illustrates the process of 2SFCA implementation in ArcGIS from step 1 to 6. The layer/table names with critical field names are in the box, and the actions (tools) are numbered according to the corresponding steps. Step 7. Mapping accessibility: Use a permanent attribute join (ArcToolbox > Data Management Tools > Joins > Join Field) to join the table TrtAcc20mi.dbf (based on the field INPUT_FID) with the accessibility scores (in the field Sum_docpopR) to the attribute table of population layer trtcent (based on the field OBJECTID). This is to ensure that the accessibility scores are carried over in the next join (see step 5 in Section 2.4.1). Join the updated attribute table of trtcent to the census tract polygon layer trt2k (based on their common field CNTYTRT) for mapping.* Figure 5.5 shows the result using the 20-mile threshold. The accessibility exhibits a monocentric pattern with the highest score near the city center and declining outward. Like any accessibility study, results near the borders of the study area need to be interpreted with caution because of edge effects discussed in Section 3.1. In other words, physicians outside of the study area may also contribute to accessibility of residents near the borders but are not accounted for in this study. Step 8. Optional: Sensitivity analysis using various threshold distances: A sensitivity analysis can be conducted to examine the impact of using different threshold distances. For instance, the study can be repeated through steps 3–7 by using threshold distances of 15, 10, and 5 miles, and results can be compared. This process is automated in a tool “Two-Step Floating Catchment Area (2SFCA)” under the toolkit Accessibility.tbx (available under the study area folder Chicago). See Appendix 5B for its usage. One may also follow the steps in Section 2.4.2 to compute the travel time matrix between census tract centroids and zip code area centroids and update the analysis with travel time to measure travel impedance. Similarly, another tool “Two-Step Floating Catchment Area (2SFCA) (w External Distance Table)” under the toolkit Accessibility.tbx, also under the data folder Chicago, implements this task. Figure 5.6 shows the result by the 2SFCA using a catchment area size of 30-min travel time. While the general pattern is consistent with Figure 5.5 based on Euclidean distances, it shows the areas of high accessibility stretching along *

The feature ids in the census tract’s point layer (trtcent) and polygon layer (trt2 k) are indexed differently and thus it is not feasible to join the accessibility result (table TrtAcc20mi.dbf) directly to trt2k for mapping.

FIGURE 5.4 Flow chart for implementing the 2SFCA in ArcGIS.

FIGURE 5.5 Accessibility to primary care physician in Chicago region by 2SFCA (20-mile catchment).

expressways (particularly high around the intersections). Different threshold travel times can also be tested for sensitivity analysis. Table 5.1 (top part) shows the results by 2SFCA using the threshold time ranging from 20 to 50 min. See more discussion in Section 5.4.3.

5.4.2 Part 2. Implementing the Gravity-Based Accessibility Model

The process of implementing the gravity-based model is similar to that of the 2SFCA method. The differences from Part 1 are highlighted here. The gravity model utilizes all distances or the distances up to a maximum (i.e., an upper limit for one to visit a primary care physician). Therefore, step 3 in Part 1 for

FIGURE 5.6 Accessibility to primary care physician in Chicago region by 2SFCA (30-min catchment).

extracting distances within a threshold is skipped, and subsequent computation will be based on the original distance table DistAll.dbf.

Step 9. Computing population potential for each physician location: Similar to step 4, add a field PPotent to the table DistAll.dbf, and compute it as PPotent = POPU*DISTANCE^(−1) (assuming a distance friction coefficient β = 1.0 here). On the basis of the updated distance table, summarize the potential PPotent by physician locations NEAR_FID, and save the result in a new table DocAvlg.dbf (adding g to the table name to differentiate it from the filenames in Part 1).* The field Sum_PPotent calibrates the term $V_j = \sum_{k=1}^{m} D_k d_{kj}^{-\beta}$ in Equation 5.3, which is the population potential for each physician location.

Again, if the field “PPotent” is unavailable for selection, remove a join (e.g., the join to “trtcent”) in DistAll.dbf and redo the Summarize.

GIS-Based Measures of Spatial Accessibility in Health Care Access

107

TABLE 5.1
Comparison of Accessibility Measures

Method                 Parameter      Min.    Max.    Std Dev.   Mean    Weighted Mean
2SFCA                  d0 = 20 min    0       14.088  2.567      2.721   2.647
                       d0 = 25 min    0       7.304   1.548      2.592   2.647
                       d0 = 30 min    0.017   5.901   1.241      2.522   2.647
                       d0 = 35 min    0.110   5.212   1.113      2.498   2.647
                       d0 = 40 min    0.175   4.435   1.036      2.474   2.647
                       d0 = 45 min    0.174   4.145   0.952      2.446   2.647
                       d0 = 50 min    0.130   3.907   0.873      2.416   2.647
Gravity-based method   β = 0.6        1.447   2.902   0.328      2.353   2.647
                       β = 0.8        1.236   3.127   0.430      2.373   2.647
                       β = 1.0        1.055   3.362   0.527      2.393   2.647
                       β = 1.2        0.899   3.606   0.618      2.413   2.647
                       β = 1.4        0.767   3.858   0.705      2.433   2.647
                       β = 1.6        0.656   4.116   0.787      2.452   2.647
                       β = 1.8        0.562   4.380   0.863      2.470   2.647

Note: All based on travel time as spatial impedance.

Step 10. Computing accessibility attributable to individual physician locations: Similar to step 5, join the updated table DocAvlg.dbf to the distance table DistAll.dbf by physician locations (i.e., based on the common field NEAR _ FID). On the table DistAll.dbf, add a field R and compute it as R = 1000*DOC00*DISTANCE^(−1)/Sum_PPotent. This computes the term (S j dij−β / V j ) in Equation 5.3. Again, the multiplier 1000 is applied in computing R to avoid small values. Step 11. Summing up accessibility to all physician locations: Similar to step 6, on the updated DistAll.dbf, sum up R by population locations and name the output table TrtAccg.dbf. The field Sum_R is the gravity-based accessibility AiG in Equation 5.3. Similarly, the sensitivity analysis for the gravity-based method can be conducted by varying the β value. Luo and Wang (2003) experimented with various β in the range 0.6–1.8 by an increment of 0.2. The process is automated in a tool “Generalized 2SFCA” under the toolkit Accessibility.tbx. As illustrated in Appendix 5B, the tool implements the gravity-based model, which is a special case among the three distance decay functions (in this, the power function) provided by the tool. One may also follow the steps in Section 2.4.2 to compute the travel time matrix between census tract centroids and zip code area centroids and update the analysis with travel time to measure travel impedance. The tool “Generalized 2SFCA (w External Distance Table)” under the toolkit Accessibility.tbx implements this task.


Table 5.1 (bottom part) shows the results by the gravity-based method when travel time is used for measuring spatial impedance.

5.4.3 Discussion

As shown in Figures 5.5 and 5.6, the highest accessibility is generally found in the central city and declines outward to suburban and rural areas. This is most evident in Figure 5.5 when Euclidean distances are used. The main reason is the concentration of major hospitals in the central city. When travel times are used, Figure 5.6 shows that areas of highest accessibility are near major highway intersections.

Table 5.1 is compiled for comparison of various accessibility measures. As the threshold time in the 2SFCA method increases from 20 to 50 min, the variance (or standard deviation) of accessibility measures declines (also the range from minimum to maximum shrinks), which leads to stronger spatial smoothing. As the travel friction coefficient β in the gravity-based method increases from 0.6 to 1.8, the variance of accessibility measures increases, which is equivalent to the effect of smaller thresholds in the 2SFCA method. In general (within the reasonable parameter ranges), the gravity-based method has a stronger spatial smoothing effect than the 2SFCA method. This confirms the discussion in Section 5.3. The simple mean of accessibility scores varies slightly by different methods using different parameters. However, the weighted mean remains the same, confirming the property proven in Appendix 5A.

The 2SFCA using a certain threshold time can generate accessibility with the same variance as the gravity-based method using a certain friction coefficient. For instance, comparing the accessibility scores by the 2SFCA method with d0 = 50 min to the scores by the gravity-based method with β = 1.8, they have a similar variance. However, the distribution of accessibility scores differs. Figure 5.7a shows the distribution by the 2SFCA method (skewed toward high scores). Figure 5.7b shows the distribution by the gravity-based method (a more evenly distributed bell shape). Figure 5.7c plots them in one graph, showing that the gravity-based method tends to inflate the scores in low-accessibility areas.

This chapter focuses on various GIS-based measures of spatial accessibility. Aspatial factors as discussed in Section 5.1 also play important roles in affecting accessibility. Indeed, the U.S. Department of Health and Human Services designates two types of Health Professional Shortage Areas (HPSA): geographic areas and population groups. Generally, geographic-area HPSA is intended to capture spatial accessibility, and population-group HPSA accounts for aspatial factors. See Wang and Luo (2005) for an approach integrating spatial and aspatial factors in assessing healthcare access.

5.5 CONCLUDING COMMENTS

Accessibility is a common issue in many studies at various scales. For example, a local community may be interested in examining the accessibility to children's


[Figure 5.7 panels: (a) frequency histogram of accessibility scores by 2SFCA (d0 = 50 min); (b) frequency histogram of accessibility scores by the gravity-based method (β = 1.8); (c) scatter plot of accessibility scores by the gravity-based method (β = 1.8) against accessibility scores by 2SFCA (d0 = 50 min).]

FIGURE 5.7 Comparison of accessibility scores by the 2SFCA and gravity-based methods.


playgrounds, identifying underserved areas, and designing a plan of constructing new playgrounds or expanding existing ones (Talen and Anselin, 1998). It is also important to assess the accessibility to public parks as good access may encourage physical activity and promote a healthy lifestyle (Zhang et al., 2011). Research also suggests that spatial access to schools matters in student achievement (Talen, 2001; Williams and Wang, forthcoming).

Here some brief discussion is provided on research of job accessibility. There is a rich body of literature on the topic that rivals that of healthcare accessibility. Interested readers may apply the methods learned in this chapter to develop their own studies. Data needed for measuring job accessibility such as employment (supply side) and resident workers (demand side) can be extracted from the Census Transportation Planning Package (CTPP) (http://www.fhwa.dot.gov/planning/census_issues/ctpp/). The CTPP Part 1 contains residence tables for mapping resident workers, and Part 2 has place-of-work tables for mapping jobs.

A strong interest in studying job accessibility is attributable to its important implications in urban structure, spatial mismatch, unemployment, welfare reform, and others. For instance, poorer job accessibility is shown to be linked to higher crime rates in neighborhoods (Wang and Minor, 2002; also see Section 8.7.2). As income and housing costs place considerable economic constraints on residential mobility, the mobility for minority, low-income, and less advantaged residents is particularly limited. Therefore, poor job accessibility tied to one's residential location not only has an adverse effect on employment prospects, but also incurs high monetary and psychological costs for workers already in the labor force and increases their willingness to risk losing their jobs through involvement in deviant or criminal behavior (Wang, 2007).

APPENDIX 5A: A PROPERTY OF ACCESSIBILITY MEASURES

The accessibility index measured in Equation 5.2 or 5.3 has an important property: the total accessibility scores are equal to the total supply, or the weighted mean of accessibility is equal to the ratio of total supply to total demand in a study area. The following uses the gravity-based measure to prove the property (also see Shen, 1998, pp. 363–364). As shown in Section 5.3, the two-step floating catchment area (2SFCA) method in Equation 5.2 is only a special case of the gravity-based method in Equation 5.3, and therefore the proof also applies to the index defined by the 2SFCA method. Recall the gravity-based accessibility index for demand site i written as

A_i^G = \sum_{j=1}^{n} \frac{S_j d_{ij}^{-\beta}}{\sum_{k=1}^{m} D_k d_{kj}^{-\beta}}.   (A5.1)

The total accessibility (TA) is


TA = \sum_{i=1}^{m} D_i A_i^G = D_1 A_1^G + D_2 A_2^G + \cdots + D_m A_m^G.   (A5.2)

Substituting (A5.1) into (A5.2) and expanding the summation terms yield

TA = D_1 \sum_{j} \frac{S_j d_{1j}^{-\beta}}{\sum_{k} D_k d_{kj}^{-\beta}} + D_2 \sum_{j} \frac{S_j d_{2j}^{-\beta}}{\sum_{k} D_k d_{kj}^{-\beta}} + \cdots + D_m \sum_{j} \frac{S_j d_{mj}^{-\beta}}{\sum_{k} D_k d_{kj}^{-\beta}}

   = \left( \frac{D_1 S_1 d_{11}^{-\beta}}{\sum_k D_k d_{k1}^{-\beta}} + \frac{D_1 S_2 d_{12}^{-\beta}}{\sum_k D_k d_{k2}^{-\beta}} + \cdots + \frac{D_1 S_n d_{1n}^{-\beta}}{\sum_k D_k d_{kn}^{-\beta}} \right)
   + \left( \frac{D_2 S_1 d_{21}^{-\beta}}{\sum_k D_k d_{k1}^{-\beta}} + \frac{D_2 S_2 d_{22}^{-\beta}}{\sum_k D_k d_{k2}^{-\beta}} + \cdots + \frac{D_2 S_n d_{2n}^{-\beta}}{\sum_k D_k d_{kn}^{-\beta}} \right)
   + \cdots + \left( \frac{D_m S_1 d_{m1}^{-\beta}}{\sum_k D_k d_{k1}^{-\beta}} + \frac{D_m S_2 d_{m2}^{-\beta}}{\sum_k D_k d_{k2}^{-\beta}} + \cdots + \frac{D_m S_n d_{mn}^{-\beta}}{\sum_k D_k d_{kn}^{-\beta}} \right).

Rearranging the terms, we obtain

TA = \frac{D_1 S_1 d_{11}^{-\beta} + D_2 S_1 d_{21}^{-\beta} + \cdots + D_m S_1 d_{m1}^{-\beta}}{\sum_k D_k d_{k1}^{-\beta}}
   + \frac{D_1 S_2 d_{12}^{-\beta} + D_2 S_2 d_{22}^{-\beta} + \cdots + D_m S_2 d_{m2}^{-\beta}}{\sum_k D_k d_{k2}^{-\beta}}
   + \cdots + \frac{D_1 S_n d_{1n}^{-\beta} + D_2 S_n d_{2n}^{-\beta} + \cdots + D_m S_n d_{mn}^{-\beta}}{\sum_k D_k d_{kn}^{-\beta}}

   = S_1 \frac{\sum_k D_k d_{k1}^{-\beta}}{\sum_k D_k d_{k1}^{-\beta}} + S_2 \frac{\sum_k D_k d_{k2}^{-\beta}}{\sum_k D_k d_{k2}^{-\beta}} + \cdots + S_n \frac{\sum_k D_k d_{kn}^{-\beta}}{\sum_k D_k d_{kn}^{-\beta}}

   = S_1 + S_2 + \cdots + S_n.

Denoting the total supply in the study area as S (i.e., S = \sum_{i=1}^{n} S_i), the above equation shows that TA = S, that is, total accessibility is equal to total supply. Denoting the total demand in the study area as D (i.e., D = \sum_{i=1}^{m} D_i), the weighted average of accessibility is

W = \sum_{i=1}^{m} \left( \frac{D_i}{D} \right) A_i^G = (1/D)(D_1 A_1^G + D_2 A_2^G + \cdots + D_m A_m^G) = \frac{TA}{D} = \frac{S}{D},

which is the ratio of total supply to total demand.
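The property is easy to verify numerically. The following SAS sketch uses a small hypothetical example (two demand sites with D1 = 100 and D2 = 200, two supply sites with S1 = 5 and S2 = 8, and β = 1); it is only an illustration of the proof above, not one of the book's sample programs:

data od;                               /* i = demand site, j = supply site */
  input i j dem sup dist;
  datalines;
1 1 100 5 1
1 2 100 8 2
2 1 200 5 2
2 2 200 8 1
;
run;

proc sql;
  create table vtab as                 /* V_j = sum_k D_k d_kj^(-1) */
  select j, sum(dem * dist**(-1)) as vj
  from od group by j;

  create table acc as                  /* A_i^G = sum_j S_j d_ij^(-1) / V_j */
  select o.i, o.dem, sum(o.sup * o.dist**(-1) / v.vj) as ai
  from od o, vtab v
  where o.j = v.j
  group by o.i, o.dem;

  select sum(dem * ai) as ta,          /* total accessibility = S1 + S2 = 13 */
         sum(dem * ai) / sum(dem) as wmean   /* = 13/300 = S/D */
  from acc;
quit;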


APPENDIX 5B: A TOOLKIT OF AUTOMATED SPATIAL ACCESSIBILITY MEASURES*

A toolkit is developed to automate the spatial accessibility measures discussed in this chapter. Specifically, it includes four tools:

1. 2SFCA as discussed in Section 5.2
2. 2SFCA with an external distance table
3. Generalized 2SFCA as discussed in Section 5.3
4. Generalized 2SFCA with an external distance table

Take the tool “Generalized 2SFCA with an external distance table” as an example since it has the most items to define. The interface is shown in Figure A5.1, and the

FIGURE A5.1 Interface for implementing G2SFCA method.

* Suggested citation for using this tool: Zhu H. and F. Wang. 2015. Appendix 5B: A toolkit of automated spatial accessibility measures. In Quantitative Methods and Socioeconomic Applications in GIS (2nd ed.). Boca Raton, FL: Taylor & Francis. pp. 112–113.


TABLE A5.1
Items to Be Defined in the Accessibility Toolkit Interface

Item                                (1) 2SFCA   (2) 2SFCA w/ Ext. Table   (3) Generalized 2SFCA   (4) Generalized 2SFCA w/ Ext. Table
Supply layer                        X           X                         X                       X
Supply ID field                     X           X                         X                       X
Supply value field                  X           X                         X                       X
Demand layer                        X           X                         X                       X
Demand ID field                     X           X                         X                       X
Demand zone code field              X (opt)     X (opt)                   X (opt)                 X (opt)
Demand value field                  X           X                         X                       X
Distance matrix table                           X                                                 X
Distance matrix supply ID field                 X                                                 X
Distance matrix demand ID field                 X                                                 X
Distance matrix value field                     X                                                 X
Distance threshold                  X           X (opt)                   X (opt)                 X (opt)
Distance decay function (a)                                               X                       X
Distance decay coefficient (b)                                            X (opt)                 X (opt)
Output table                        X           X                         X                       X

Note: Opt, optional.
a Three options available: power, exponential, and Gaussian.
b Default values for the power decay coefficient, exponential decay coefficient, and Gaussian decay coefficient are 1, 0.0001, and 10,000, respectively.

items are also outlined in Table A5.1. Everything is self-explanatory except for the optional item "demand zone code field." This option is created for the convenience of linking the accessibility results back to the original demand layer (usually a polygon feature) for mapping. Its ID field can be different from its "offspring" (e.g., centroids), the point feature used as the demand layer.

The tool "Generalized 2SFCA" has a very similar interface, but is simpler. Also shown in Table A5.1, it does not need to define the items associated with an external distance table. The tool computes a Euclidean distance table itself. The tool "2SFCA" has the fewest items to define, and also internally generates a Euclidean distance table. Note that a threshold distance is a required input. For the tool "2SFCA with an external distance table," the distance threshold is set as "optional." When no threshold is defined here, all records in the distance matrix participate in the computation. This assumes that the predefined table has already excluded distances beyond a threshold. If the predefined table contains distances for all O-D pairs, the resulting accessibility would be uniform and thus not meaningful.

6 Function Fittings by Regressions and Application in Analyzing Urban Density Patterns

Urban and regional studies begin with analyzing the spatial structure, particularly population density patterns. As population serves as both supply (labor) and demand (consumers) in an economic system, the distribution of population represents that of economic activities. Analysis of changing population distribution patterns over time is a starting point for examining economic development patterns in a city or region. Urban and regional density patterns mirror each other: the central business district (CBD) is the center of a city whereas the whole city itself is the center of a region, and densities decline with distances both from the CBD in a city and from the central city in a region. While the theoretical foundations for urban and regional density patterns are different, the methods for empirical studies are similar and closely related. This chapter discusses how we find the best fitting function to capture the density pattern, and what we can learn about urban and regional growth patterns from this approach. The methodological focus is on function fittings by regressions and related statistical issues. Section 6.1 explains how density functions are used to examine urban and regional structures. Section 6.2 presents various functions for a monocentric structure. Section 6.3 discusses some statistical concerns on monocentric function fittings and introduces nonlinear regression and weighted regression. Section 6.4 examines various assumptions for a polycentric structure and corresponding function forms. Section 6.5 uses a case study in Chicago urban area to illustrate the techniques (monocentric vs. polycentric models, linear vs. nonlinear and weighted regressions). The chapter is concluded in Section 6.6 with discussions and a brief summary.

6.1 DENSITY FUNCTION APPROACH TO URBAN AND REGIONAL STRUCTURES

6.1.1 Urban Density Functions

Since the classic study by Clark (1951), there has been great interest in empirical studies of urban population density functions. This cannot be solely explained by the easy availability of data. Many researchers are fascinated by the regularity of urban


density pattern and its solid foundation in economic theory. McDonald (1989, p. 361) considers the population density pattern as "a critical economic and social feature of an urban area." Among all functions, the exponential function or Clark's model is the one used most widely:

D_r = a e^{br},   (6.1)

where Dr is the density at distance r from the city center (i.e., CBD), a is a constant or the CBD intercept, and b is another constant or the density gradient. Since the density gradient b is often a negative value, the function is also referred to as the negative exponential function. Empirical studies show that it is a good fit for most cities in both developed and developing countries (Mills and Tan, 1980).

The economic model by Mills (1972) and Muth (1969), often referred to as the Mills–Muth model, is developed to explain the empirical finding of urban density pattern as a negative exponential function. The model assumes a monocentric structure: a city has only one center, where all employment is concentrated. Intuitively, as everyone commutes to the city center for work, a household farther away from the CBD spends more on commuting and is compensated by living in a larger lot house that is cheaper in terms of price per area unit. The resulting population density exhibits a declining pattern with distance from the city center. Appendix 6A shows how the negative exponential urban density function is derived by the economic model. In the derivation, the parameter b in Equation 6.1 is the same as the unit cost of transportation or commuting. Therefore, the declining transportation cost over time, as a result of improvements in transportation technologies and road networks, is associated with a flatter density gradient (i.e., more gradual or less sharp decline of density with increasing distance from the city center). This helps us understand that urban sprawl or suburbanization (characterized by a flatter density gradient) is mainly attributable to transportation improvements.

However, economic models are "simplification and abstractions that may prove too limiting and confining when it comes to understanding and modifying complex realities" (Casetti, 1993, p. 527). The main criticisms lie in its assumptions of the monocentric city and unit price elasticity for housing, neither of which is supported by empirical studies. Wang and Guldmann (1996) developed a gravity-based model to explain the urban density pattern (see Appendix 6A). The basic assumption of the gravity-based model is that population at a particular location is proportional to its accessibility to all other locations in a city, measured as a gravity potential. Simulated density patterns from the model conform to the negative exponential function when the distance friction coefficient β falls within a certain range (0.2 ≤ β ≤ 1.0 in the simulated example). The gravity-based model does not make the restrictive assumptions as in the economic model, and thus implies wide applicability. It also explains two important empirical findings: (1) flattening density gradient over time (corresponding to smaller β) and (2) flatter gradients in larger cities. The economic model explains the first finding well, but not the second (McDonald, 1989, p. 380). Both the economic model and the gravity-based model explain the change of density gradient over time through


transportation improvements. Note that both the distance friction coefficient β in the gravity model and the unit cost of transportation in the economic model decline over time.

Another perspective that helps us understand the urban population density pattern is the space syntax theory (Hillier and Hanson, 1984), which emphasizes that urban structure is "determined by the structure of the urban grid itself rather than by the presence of specific attractors or magnets" (Hillier et al., 1993, p. 32). In other words, it is the configuration of a city's street network that shapes its structure in terms of land use intensity (employment and population density patterns). Based on the approach, various centrality measures derived from the street network capture a location's (un)attractiveness and thus explain the concentration (or lack of concentration) of economic (including residential) activities. Appendix 6B illustrates how centrality indices are measured and implemented and how the indices are associated with urban densities.

Earlier empirical studies of urban density patterns are based on the monocentric model, that is, how population density varies with distance from the city center. It emphasizes the impact of the primary center (CBD) on citywide population distribution. Since the 1970s, more and more researchers recognize the changing urban form from monocentricity to polycentricity (Ladd and Wheaton, 1991; Berry and Kim, 1993). In addition to the major center in the CBD, most large cities have secondary centers or subcenters, and thus are better characterized as polycentric cities. In a polycentric city, assumptions of whether residents need to access all centers or some of the centers lead to various function forms. Section 6.4 will examine the polycentric models in detail.

6.1.2 Regional Density Functions

The study of regional density patterns is a natural extension to that of urban density patterns when the study area is expanded to include rural areas. The urban population density patterns, particularly the negative exponential function, are empirically observed first, and then explained by theoretical models (e.g., the economic model and the gravity-based model). Even considering the Alonso’s (1964) urban land use model as the precedent of the Mills–Muth urban economic model, the theoretical explanation lags behind the empirical finding on urban density patterns. In contrast, following the rural land use theory by von Thünen (1966, English version), economic models for the regional density pattern by Beckmann (1971) and Webber (1973) were developed before the work of empirical models for regional population density functions by Parr (1985), Parr et al. (1988), and Parr and O’Neill (1989). The city center used in the urban density models remains as the center in regional density models. The declining regional density pattern has a different explanation. In essence, rural residents farther away from a city pay higher transportation costs for the shipment of agricultural products to the urban market and for gaining access to industrial goods and urban services in the city, and are compensated by occupying cheaper, and hence, more land (Wang and Guldmann, 1997). Similarly, empirical studies of regional density patterns can be based on a monocentric or a polycentric structure. Obviously, as the territory for a region is much


larger than a city, it is less likely for physical environments (e.g., topography, weather, and land use suitability) to be uniform across a region than a city. Therefore, population density patterns in a region tend to exhibit less regularity than in a city. An ideal study area for empirical studies of regional density functions would be an area with uniform physical environments, like the "isolated state" in the von Thünen model (Wang, 2001a, p. 233).

Analyzing the function change over time has important implications for both urban and regional structures. For urban areas, we can examine the trend of urban polarization versus suburbanization. The former represents an increasing percentage of population in the urban core relative to its suburbia, and the latter refers to a reverse trend with an increasing portion in the suburbia. For regions, we can identify the process as centralization versus decentralization. Similarly, the former refers to the migration trend from peripheral rural to central urban areas, and the latter is the opposite. Both can be synthesized into a framework of core versus periphery. According to Gaile (1980), economic development in the core (city) impacts the surrounding (suburban and rural) region through a complex set of dynamic spatial processes (i.e., intraregional flows of capital, goods and services, information and technology, and residents). If the processes result in an increase in activity (e.g., population) in the periphery, the impact is spread. If the activity in the periphery declines while the core expands, the effect is backwash. Such concepts help us understand core-hinterland interdependencies and various relationships between them (Barkley et al., 1996). If the exponential function is a good fit for regional density patterns, the changes can be illustrated as in Figure 6.1, where t + 1 represents a more recent time than t. In a monocentric model, we can see the relative importance of the city center; in a polycentric model, we can understand the strengthening or weakening of various centers.

In the remainder of this chapter, the discussion focuses on urban density patterns. However, similar techniques can be applied to studies of regional density patterns.

6.2 FUNCTION FITTINGS FOR MONOCENTRIC MODELS

6.2.1 Four Simple Bivariate Functions

In addition to the exponential function (Equation 6.1) introduced earlier, three other simple bivariate functions for the monocentric structure have often been used:

D_r = a + br,   (6.2)

D_r = a + b \ln r,   (6.3)

D_r = a r^{b}.   (6.4)

Equation 6.2 is a linear function, Equation 6.3 is a logarithmic function, and Equation 6.4 is a power function. The parameter b in all the above four functions is expected to be negative, indicating declining densities with distances from the city center.

[Figure 6.1 panels: (a) spread (decentralization) shown as Dr versus r and (b) its log-transform ln Dr versus r, at times t and t + 1; (c) backwash (centralization) shown as Dr versus r and (d) its log-transform ln Dr versus r.]

FIGURE 6.1 Regional growth patterns by the density function approach.

Equations 6.2 and 6.3 can be easily estimated by an ordinary least squares (OLS) linear regression. Equations 6.1 and 6.4 can be transformed to linear functions by taking logarithms on both sides:

\ln D_r = A + br,   (6.5)

\ln D_r = A + b \ln r.   (6.6)

Equation 6.5 is the log-transform of Equation 6.1, and Equation 6.6 is the log-transform of Equation 6.4. The intercept A in both Equations 6.5 and 6.6 is just the log-transform of the constant a (i.e., A = ln a) in Equations 6.1 and 6.4. The value of a can be easily recovered by taking the reverse of the logarithm, that is, a = e^A. Equations 6.5 and 6.6 can also be estimated by a linear OLS regression. In regressions for Equations 6.3 and 6.6 containing the term ln r, samples should not include observations where r = 0 to avoid taking logarithms of zero. Similarly, in Equations 6.5 and 6.6 containing the term ln Dr, samples should not include those where Dr = 0 (with zero population).

Take the log-transform of the exponential function in Equation 6.5 as an example. The two parameters, intercept A and gradient b, characterize the density pattern in a city. A smaller value of A indicates a lower density around the central city; and a lower value of b (in terms of absolute value) represents a flatter density pattern. Many cities have experienced a lower intercept A and a flatter gradient b over time, representing a common trend of urban sprawl and suburbanization. The changing pattern is similar to Figure 6.1a, which also depicts decentralization in the context of regional growth patterns.
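As a preview of the implementations discussed in Sections 6.2.3 and 6.3, a minimal SAS sketch of fitting the log-transformed functions in Equations 6.5 and 6.6 is given below; the dataset name density and field names DIST and POPDEN are assumptions for illustration only:

data mono;
  set density;
  if popden > 0 and dist > 0;   /* restrictions: exclude zero density or distance */
  lnpopden = log(popden);
  lndist   = log(dist);
run;

proc reg data=mono;
  model lnpopden = dist;        /* Eq. 6.5: exponential; a recovered as exp(A) */
  model lnpopden = lndist;      /* Eq. 6.6: power */
run;

The intercept reported for the first model is A, from which the exponential function's constant is recovered as a = e^A.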

6.2.2 Other Monocentric Functions

In addition to the four simple bivariate functions discussed above, three other functions are also used widely in the literature. One was proposed by Tanner (1961) and Sherratt (1960) independently from each other, commonly referred to as the Tanner–Sherratt model. The model is written as

D_r = a e^{b r^{2}},   (6.7)

where the density Dr declines exponentially with distance squared r^2. Newling (1969) incorporated both the Clark's model and the Tanner–Sherratt model, and suggested the following model:

D_r = a e^{b_1 r + b_2 r^{2}},   (6.8)

where the constant term b1 is most likely to be positive and b2 negative, and other notations remain the same. In Newling's model, a positive b1 represents a density crater around the CBD, where population density is relatively low due to the presence of commercial and other nonresidential land uses. According to Newling's model, the highest population density does not occur at the city center, but rather at a certain distance away from the city center.

The third model is the cubic spline function used by some researchers (e.g., Anderson, 1985; Zheng, 1991) in order to capture the complexity of urban density pattern. The function is written as

D_x = a_1 + b_1 (x - x_0) + c_1 (x - x_0)^{2} + d_1 (x - x_0)^{3} + \sum_{i=1}^{k} d_{i+1} (x - x_i)^{3} Z_i^{*},   (6.9)

where x is the distance from the city center, Dx is the density there, x_0 is the distance of the first density peak from the city center, x_i is the distance of the ith knot from the city center (defined by either the 2nd, 3rd, ... density peak or simply even intervals across the whole region), and Z_i^* is a dummy variable (= 0 if x is inside the knot; = 1 if x is outside of the knot). The cubic spline function intends to capture more fluctuations of the density pattern (e.g., local peaks in suburban areas), and thus cannot be strictly defined as a monocentric model. However, it is still limited to examining density variations related to distance from the city center regardless of directions, and thus assumes a concentric density pattern.

6.2.3 GIS and Regression Implementations

The density function analysis only uses two variables: one is Euclidean distance r from the city center and the other is the corresponding population density Dr. Euclidean distances from the city center can be obtained using the techniques explained in Section 2.1. Identifying the city center requires knowledge of the study area, and is often defined as a commonly recognized landmark site by the public. In the absence of a commonly recognized city center, one may use the location with the highest job concentration to define it, or follow Alperovich's (1982) approach to test various locations and choose the one that produces the highest R2 in density function fittings. Density is simply computed as population divided by area size. Area size is saved in the field "Shape_Area" in any polygon feature in a geodatabase, or added by following step 4 in Section 1.3.1. Once the two variables are obtained in GIS, the data set can be exported to an external file for regression analysis.

Linear OLS regression is available in many software packages. For example, one may use the widely available Microsoft Excel. Here, our illustration is based on Excel 2013. Make sure that the Analysis ToolPak is installed in Excel.* Open the distance and density data as an Excel workbook, add two new columns to the workbook (e.g., lnr and lnDr), and compute them as the logarithms of distance and density, respectively. From the main menu bar, select Data > Data Analysis > Regression to activate the regression dialog window shown in Figure 6.2. By defining the appropriate data ranges for X and Y variables, Equations 6.2, 6.3, 6.5, and 6.6 can all be fitted by an OLS linear regression in Excel. Note that Equations 6.5 and 6.6 are the log-transformations of the exponential function (Equation 6.1) and power function (Equation 6.4), respectively. Based on the results, Equation 6.1 or 6.4 can be easily recovered by computing the coefficient a = e^A and keeping the coefficient b unchanged.

Alternatively, one may use the Chart Wizard in Excel to obtain the regression results for all four bivariate functions along with their x-y scatter plots. From the main menu, choose Insert > select the Insert Scatter (X, Y) tool to generate a graph depicting how density varies with distance. Then right-click the data point on the graph and choose "Add Trendline" to activate the dialog window shown in Figure 6.3. All four functions (linear, logarithmic, exponential, and power) are available for selection. Check the options "Display equation on chart" and "Display R-squared value on chart" to have regression results shown on the graph. The "Add Trendline" tool outputs the regression results for the four original bivariate functions without the need of log-transformations,† but does not report as many statistics as the "Regression" tool.

Both the Tanner–Sherratt model (Equation 6.7) and Newling's model (Equation 6.8) can be estimated by linear OLS regression on their log-transformed forms (see Table 6.1). In the Tanner–Sherratt model, the X variable is distance squared (r^2).



* In Microsoft Office 2013, File > Options > Add-Ins > If Analysis ToolPak is not listed under "Active Application Add-ins", choose it under the list of "Inactive Application Add-ins" and click Go; check Analysis ToolPak in the Add-Ins window and OK to activate it.
† The regression results reported here are still based on linear OLS regression by using the log-transformations in Equations 6.5 and 6.6. The power or exponential function shown on the graph is reconstructed from the linear OLS regression result though the computation is done internally. This is different from nonlinear regressions that will be discussed in Section 6.3.


FIGURE 6.2 Excel dialog window for regression.

In Newling’s model, there are two X variables (r and r 2). Newling’s model has one more explanatory variable (i.e., r 2) than the Clark’s model (exponential function), and thus always generates a higher R2 regardless of the significance of the term r 2. In this sense, Newling’s model is not comparable to the other five bivariate models in terms of fitting power. Table 6.1 summarizes the models. Fitting the cubic spline function (Equation 6.9) is similar to that of other monocentric functions with some extra work in preparing the data. First, sort the data by the variable “distance” in an ascending order. Second, define the constant x0, and calculate the terms (x − x0), (x − x0)2, and (x − x0)3. Third, define the constants xi (i.e., x1, x2, …), and compute the terms ( x − xi )3 Z i* . Take one term ( x − x1 )3 Z1* as an example: (1) set the values = 0 for those records with x ≤ x1, and (2) compute the values = (x − x1)3 for those records with x > x1. Finally, run a multivariate regression, where the Y variable is density Dx, and the X variables are (x − x0), (x − x0)2, (x − x0)3, ( x − x1 )3 Z1* , ( x − x2 )3 Z 2* , and so on. The cubic spline function contains multiple X variables, and thus its regression R2 tends to be higher than other models.

6.3 NONLINEAR AND WEIGHTED REGRESSIONS IN FUNCTION FITTINGS

In function fittings for the monocentric structure, two statistical issues merit more discussion. One is the choice between nonlinear regression directly on the


FIGURE 6.3 Excel dialog window for Format Trendline.

exponential or power function versus linear regression on its log-transformation (as discussed in Section 6.2). Generally they yield different results since the two have different dependent variables (Dr in nonlinear regression versus lnDr in linear regression) and imply different assumptions of the error term (Greene and Barnbrock, 1978).

TABLE 6.1
Linear Regressions for a Monocentric City

Model            Function Used in Regression     Original Function            X Variable(s)   Y Variable   Restrictions
Linear           Dr = a + br                     Same                         r               Dr           None
Logarithmic      Dr = a + b ln r                 Same                         ln r            Dr           r ≠ 0
Power            ln Dr = A + b ln r              Dr = a r^b                   ln r            ln Dr        r ≠ 0 and Dr ≠ 0
Exponential      ln Dr = A + br                  Dr = a e^(br)                r               ln Dr        Dr ≠ 0
Tanner–Sherratt  ln Dr = A + br^2                Dr = a e^(br^2)              r^2             ln Dr        Dr ≠ 0
Newling's        ln Dr = A + b1 r + b2 r^2       Dr = a e^(b1 r + b2 r^2)     r, r^2          ln Dr        Dr ≠ 0


We use the exponential function (Equation 6.1) and its log-transformation (Equation 6.5) to explain the differences. The linear regression on Equation 6.5 has the error term ε such that

\ln D_r = A + br + \varepsilon.   (6.10)

That implies multiplicative errors and weights equal percentage errors equally, since D_r = a e^{br+\varepsilon}. The nonlinear regression on the original exponential function (Equation 6.1) assumes additive errors and weights equal absolute errors equally, such as

D_r = a e^{br} + \varepsilon.   (6.11)

The ordinary least squares (OLS) regression seeks the optimal values of A (or a) and b so that the residual sum of squares (RSS) is minimized. For the linear regression based on Equation 6.10, it is to minimize

RSS = \sum_i (\ln D_i - A - b r_i)^{2}.

For the nonlinear regression based on Equation 6.11, it is to minimize

RSS = \sum_i (D_i - a e^{b r_i})^{2},

where i indexes individual observations.

For linear regression, a popular measure for a model's goodness of fit is R2 (coefficient of determination). Appendix 6C illustrates how the parameters in a bivariate linear OLS regression are estimated and how R2 is defined. In essence, R2 is the portion of the dependent variable's variation (termed the "total sum of squares," TSS) explained by a regression model (termed the "explained sum of squares," ESS), that is, R2 = ESS/TSS = 1 − RSS/TSS.

There are several ways to estimate the parameters in a nonlinear regression (Griffith and Amrhein, 1997, p. 265), and all use iterations to gradually improve guesses. For example, the modified Gauss-Newton method uses linear approximations to estimate how RSS changes with small shifts from the present set of parameter estimates. Good initial estimates (i.e., those close to the correct parameter values) are critical in finding a successful nonlinear fit. The initialization of parameters is often guided by experience and knowledge. Note that R2 is no longer applicable to nonlinear regression as the identity "ESS + RSS = TSS" no longer holds and the residuals do not add up to 0. However, one may use a pseudo-R2 (defined similarly as 1 − RSS/TSS) as an approximate measure of goodness of fit.


Which is a better method for estimating density functions, linear or nonlinear regression? The answer depends on the emphasis and objective of a study. The linear regression is based on the log-transformation. By weighting equal percentage errors equally, the errors generated by high-density observations are scaled down (in terms of percentage). However, the differences between the estimated and observed values in those high-density areas tend to be much greater than in low-density areas (in terms of absolute value). As a result, the total estimated population in the city could be off by a large margin. On the contrary, the nonlinear regression minimizes the residual sum of squares (RSS) directly based on densities instead of their logarithms. By weighting equal absolute errors equally, the regression limits the errors (in terms of absolute value) contributed by high-density samples. As a result, the total estimated population in the city is often closer to the actual value than the one based on linear regression, but the estimated densities in low-density areas may be off by high percentages.

Another issue in estimating urban density functions concerns randomness of sample (Frankena, 1978). A common problem for census data (not only in the US) is that high-density observations are many and are clustered in a small area near the city center, whereas low-density ones are fewer and spread in remote areas. In other words, high-density samples may be overrepresented as they are concentrated within a short distance from the city center, and low-density samples may be underrepresented as they spread across a wide distance range from the city center. A plot of density versus distance will show many more observations in short distances and fewer in long distances. This is referred to as nonrandomness of sample, and causes biased (usually upward) estimators. A weighted regression can be used to mitigate the problem. Frankena (1978) suggests weighting observations in proportion to their areas. In the regression, the objective is to minimize the weighted RSS. For the same reason explained for nonlinear regression, R2 is no longer valid for a weighted regression, and a pseudo-R2 may be constructed from calculated RSS and TSS. See Wang and Zhou (1999) for an example. Some researchers favor samples with uniform area sizes. In Case Study 6 (Section 6.5.3 in particular), we will also analyze population density functions based on survey townships of approximately the same area size.

Estimating the nonlinear regression or weighted regression requires the use of advanced statistical software. For example, in SAS, if the DATA step uses POPDEN for population density, LNPOPDEN for the logarithm of density, DIST for distance, and area_km2 for area size, the following SAS statements implement the nonlinear regression (using SAS procedure NLIN) for the exponential function:

PROC NLIN;                          /* procedure for nonlinear regression */
  PARMS a = 1000 b = -0.1;          /* initialize parameters */
  MODEL POPDEN = a * exp(b * DIST); /* code the function */

The statement PARMS (or PARAMETERS) assigns initial values for parameters a and b in the iteration process. If the model does not converge, experiment with different initial values until a solution is reached.


SAS also has a procedure REG to run OLS linear regressions. A weighted regression for the logarithmic transform of the exponential function is implemented by adding a statement to define the weight variable:

PROC REG;                   /* procedure for linear regression */
  MODEL LNPOPDEN = DIST;    /* exponential model */
  WEIGHT area_km2;          /* define the weight variable */

See the sample SAS program monocent.sas included in the data folder for detail. All SAS programs used in this book are tested in SAS 9.3.

6.4 FUNCTION FITTINGS FOR POLYCENTRIC MODELS

Monocentric density functions simply assume that densities are uniform at the same distance from the city center regardless of directions. Urban density patterns in some cities may be better captured by a polycentric structure. In a polycentric city, residents and business value access to multiple centers, and therefore population densities are functions of distances to these centers (Small and Song, 1994). Centers other than the primary or major center at the CBD are called subcenters.

6.4.1 Polycentric Assumptions and Corresponding Functions

A polycentric density function can be established under several alternative assumptions:

Assumption 1. If the influences from different centers are perfectly substitutable so that only the nearest center (CBD or subcenter) matters, the city is composed of multiple monocentric subregions. Each subregion is the proximal area for a center (see Section 4.1), within which various monocentric density functions can be estimated. Taking the exponential function as an example, the model for the subregion around the ith center is

D = a_i e^{b_i r_i},   (6.12)

where D is the density of an area, r_i is the distance between the area and its nearest center i, and a_i and b_i (i = 1, 2, ...) are parameters to be estimated.

Assumption 2. If the influences are complementary so that some access to all centers is necessary, then the polycentric density is the product of those monocentric functions (McDonald and Prather, 1994). For example, the log-transformed polycentric exponential function is written as:

\ln D = a + \sum_{i=1}^{n} b_i r_i,   (6.13)


where D is the density of an area, n is the number of centers, r_i is the distance between the area and center i, and a and b_i (i = 1, 2, ...) are parameters to be estimated.

Assumption 3. Most researchers (Griffith, 1981; Small and Song, 1994) believe that the relationship among the influences of various centers is between assumptions 1 and 2, and the polycentric density is the sum of center-specific functions. For example, a polycentric model based on the exponential function is expressed as:

D = \sum_{i=1}^{n} a_i e^{b_i r_i}.   (6.14)

The above three assumptions are based on Heikkila et al. (1989).

Assumption 4. According to the central place theory, the major center at the CBD and the subcenters play different roles. All residents in a city need access to the major center for higher order services; for other lower order services, residents only need to use the nearest subcenter (Wang, 2000). In other words, everyone values access to the major center and access to the nearest center. Using the exponential function as an example, the corresponding model is

\ln D = a + b_1 r_1 + b_2 r_2,   (6.15)

where r_1 is the distance from the major center, r_2 is the distance from the nearest center, and a, b_1, and b_2 are parameters to be estimated.

Figure 6.4 illustrates the different assumptions for a polycentric city. Residents need access to all centers under assumption 2 or 3, but effects are multiplicative in 2 and additive in 3. Table 6.2 summarizes the above discussion.

[Figure 6.4 diagram: residents, the major center, subcenters, proximal area boundaries, and linkages under assumptions (1), (2) or (3), and (4).]

FIGURE 6.4 Illustration of polycentric assumptions.


TABLE 6.2
Polycentric Assumptions and Corresponding Functions

Label  Assumption                                                     Model (Exponential as an Example)         X Variables                                                    Sample                 Estimation Method
1      Access only to the nearest center is needed                    ln D = A_i + b_i r_i                      Distance r_i from the nearest center i (one variable)          Areas in subregion i   Linear regression (a)
2      Access to all centers is necessary (multiplicative effects)    ln D = a + \sum_{i=1}^{n} b_i r_i         Distances from each center (n variables r_i)                   All areas              Linear regression
3      Access to all centers is necessary (additive effects)          D = \sum_{i=1}^{n} a_i e^{b_i r_i}        Distances from each center (n variables r_i)                   All areas              Nonlinear regression
4      Access to CBD and the nearest center is needed                 ln D = a + b_1 r_1 + b_2 r_2              Distances from the major and nearest centers (two variables)   All areas              Linear regression (b)

a This assumption may also be estimated by nonlinear regression on D = a_i e^{b_i r_i}.
b This assumption may also be estimated by nonlinear regression on D = a_1 e^{b_1 r_1} + a_2 e^{b_2 r_2}.

6.4.2 GIS and Regression Implementations

Analysis of polycentric models first requires the identification of multiple centers. Ideally, these centers should be based on the distribution of employment (e.g., Gordon et al., 1986; Giuliano and Small, 1991; Forstall and Greene, 1998). In addition to traditional choropleth maps, Wang (2000) used surface modeling techniques to generate isolines (contours) of employment density,* and identified employment centers based on both the contour value (measuring the threshold employment density) and the size of area enclosed by the contour line (measuring the base value of total employment). With the absence of employment distribution data, one may use surface modeling of population density to guide the selection of centers (Wang and Meng, 1999).† See Section 3.2 for various surface modeling techniques. Surface modeling is descriptive in nature. Only rigorous statistical analysis of density functions can answer whether the potential centers identified from surface modeling indeed exert influence on surrounding areas and how the influences interact with each other. Once the centers are identified, GIS prepares the data of distances and densities for analysis of polycentric models. For assumption 1, only the distances from the nearest centers (including the major center) need to be computed by using the Near tool in ArcGIS. For assumption 2 or 3, the distance between each area and every center needs to be obtained by the Point Distance tool in ArcGIS. For assumption 4, *



* The contour map of employment density was based on the logarithm of density. The direct use of density values would lead to crowded contour lines in high-density areas.
† Population density peaks may not qualify as centers as we have learned from Newling's model for the monocentric structure. Commercial and other business entities often dominate the land use in an employment center, which exhibits a local population density crater.


it requires two distances: the distance between each area and the major center and the distance between each area and its nearest center. The two distances are obtained by using the Near tool in ArcGIS twice. See Section 6.5.2 for detail.

Based on assumption 1, the polycentric model is degraded to monocentric functions (Equation 6.12) within each center's proximal area, which can be estimated by the techniques explained in Sections 6.2 and 6.3. Equation 6.13 for assumption 2 and Equation 6.15 for assumption 4 can be also estimated by simple multivariate linear regressions. However, Equation 6.14 based on assumption 3 needs to be estimated by a nonlinear regression, as shown below. Assuming a model of two centers with DIST1 and DIST2 representing the distances from the two centers respectively, a sample SAS program for estimating Equation 6.14 is similar to the program for estimating Equation 6.1 such as:

PROC NLIN;
  PARMS a1 = 1000 b1 = -0.1 a2 = 1000 b2 = -0.1;
  MODEL DEN = a1*exp(b1*DIST1) + a2*exp(b2*DIST2);

6.5 CASE STUDY 6: ANALYZING URBAN DENSITY PATTERNS IN CHICAGO URBAN AREA

Chicago has been an important study site for urban studies. The classic urban concentric model by Burgess (1925) was based on Chicago and led to a series of studies on urban structure, forming the so-called Chicago School. This case study uses the 2000 census data to examine the urban density patterns in the Chicago urban area. The study area is limited to the core six-county area in Chicago CMSA (Consolidated Metropolitan Statistical Area) (county codes in parentheses): Cook (031), DuPage (043), Kane (089), Lake (097), McHenry (111), and Will (197). The area is smaller than the 10-county study area used in Case Studies 4A and 5 as we are interested in mostly urbanized areas (see inset in Figure 6.5). In order to examine the possible modifiable areal unit problem (MAUP), the project analyzes the density patterns at both the census tract and survey township levels. The MAUP refers to sensitivity of results to the analysis units for which data are collected or measured, and is well known to geographers and spatial analysts (Openshaw, 1984; Fotheringham and Wong, 1991).

The census tract feature trt2k in geodatabase ChiRegion.gdb for the Chicago 10-county region is used to extract census tracts in this study area. Its field POPU is the population data in 2000. In addition, the following features in geodatabase ChiUrArea.gdb under the data folder Chicago are provided:

1. Feature polycent15 contains 15 centers identified as employment concentrations* from a previous study (Wang, 2000).

These are job centers, and do not necessarily have the highest population densities (see Figure 6.5). The classic urban economic model for explaining urban density functions assumes a single center at the CBD where all jobs are located. Polycentric models extend the assumption to multiple job centers in a city. Therefore, centers should be based on the employment instead of population distribution pattern. For cities in the US, the data source for employment location is a special census data: CTPP (Census Transportation Planning Package). See Section 5.5.

[Figure 6.5 map: population density surface classified as ≤1000, 1000–2000, 2000–4000, 4000–8000, and >8000 persons per square kilometer (p/sq_km), with the 15 numbered job centers (including the CBD and O'Hare), county boundaries, and a 0–40 km scale bar.]

FIGURE 6.5 Population density surface and job centers in Chicago.

2. Feature twnshp contains 115 survey townships in the study area, providing an alternative areal unit that is relatively uniform in area size.
3. Feature cnty6 defines the study area.

This study uses ArcGIS for spatial analysis tasks such as distance computation and areal interpolation, Microsoft Excel for simple linear regression and graphs, and SAS for more advanced nonlinear and weighted regressions.

6.5.1 Part 1: Function Fittings for Monocentric Models at the Census Tract Level

Step 1. Data preparation in ArcGIS: extracting study area and CBD location: Use a spatial query (Select by Location) to select features from trt2k (in geodatabase


ChiRegion.gdb) that have their centroids in cnty6, and export the selected features to a new layer cnty6trt. It contains 1837 census tracts in the six-county area. Add a field popden to the attribute table of cnty6trt, and calculate it as popden = 1000000*POPU/Shape_Area, which is population density in persons per square kilometer. Based on the polygon feature cnty6trt, create a point feature cnty6trtpt for tract centroids.* See step 2 in Section 2.4.1. Based on the feature polycent15, select the point with CENT15_ = 15 and export it to a new layer monocent, which identifies the location of the CBD (i.e., the only center based on the monocentric assumption).

Step 2. Mapping population density surface in ArcGIS: Use the surface modeling techniques learned in Section 3.3 to map the population density surface in the study area. A sample map is shown in Figure 6.5. Note that job centers are not necessarily the peak points of population density, though suburban job centers in general are found to be near the local density peaks.

Step 3. Computing distances between tract centroids and CBD in ArcGIS: Use the ArcToolbox analysis tool Near to compute distances between tract centroids (cnty6trtpt) and the CBD (monocent). In the updated attribute table of cnty6trtpt, the field NEAR_FID identifies the nearest center (in this case the only point in monocent) and the field NEAR_DIST is the distance of each tract centroid from the center. Add a field DIST_CBD and calculate it as DIST_CBD = NEAR_DIST/1000, which is the distance from the CBD in kilometers.†

Step 4. Running bivariate linear regression in Excel‡: Make sure that the Analysis ToolPak is activated in Excel. Open cnty6trtpt.dbf in Excel, and save it as monodist_trt in Excel workbook format. From the main menu bar, select Data > Data Analysis > Regression to activate the dialog window shown in Figure 6.2. Input the Y range (data range for the variable popden) and X range (data range for the variable DIST_CBD). The linear regression results may be saved in a separate worksheet by checking the option "New Worksheet Ply" under "Output options." In addition to estimated coefficients and R2, the output includes corresponding standard errors, t statistics, and p values for the intercept and distance.

Step 5. Additional monocentric function fittings in Excel: In the Excel file monodist_trt, add and compute new fields dist_sq, lndist, and lnpopden, which are the distance squared, logarithm of distance, and logarithm of population density,§ respectively. For density functions with logarithmic terms (lnr or lnDr), the distance (r) or density (Dr) value cannot be 0. In our case, five tracts have Dr = 0. Following common practice, we add 1 to the original variable popden when

* Alternatively, one may obtain the tract centroids in the study area by utilizing the tract centroid feature trtcent in geodatabase ChiRegion.gdb.
† Step 8 in Part 2 uses the "Near" tool again, which will update both fields NEAR_FID and NEAR_DIST. The field DIST_CBD is added here also to preserve the distances from the CBD.
‡ One may also refer to step 5 in Section 8.7.2 in Chapter 8 for implementing the OLS regression in ArcGIS.
§ For example, use the formula =LN( ) for natural logarithms in Excel.


TABLE 6.3
Regressions Based on Monocentric Functions

                                                Census Tracts (n = 1837)                          Townships (n = 115)
Technique   Function                            a (or A)    b                         R2          a (or A)   b                        R2
Linear      Dr = a + br                         7188.15     −120.13                   0.324       3535.46    −48.17                   0.489
            Dr = a + b ln r                     12,067.76   −2739.74                  0.347       9330.86    −2175.69                 0.687
            ln Dr = A + b ln r                  10.09       −0.8132                   0.300       14.09      −2.16                    0.604
            ln Dr = A + br                      8.80        −0.0417                   0.384       9.03       −0.06                    0.695
            ln Dr = A + br^2                    8.29        −0.0006                   0.346       7.58       −0.0005                  0.629
            ln Dr = A + b1 r + b2 r^2           8.83        b1 = −0.0452;             0.384       9.40       b1 = −0.08;              0.699
                                                            b2 = 0.00005 (a)                                 b2 = 0.0002 (a)
Nonlinear   Dr = a r^b                          12,141.8    −0.3807                   0.616 (b)   16,557.6   −0.73                    0.631 (b)
            Dr = a e^(br)                       10,016.1    −0.0471                   0.690 (b)   9434.0     −0.06                    0.839 (b)
            Dr = a e^(br^2)                     8167.3      −0.0019                   0.696 (b)   7043.0     −0.0019                  0.840 (b)
            Dr = a e^(b1 r + b2 r^2)            6334.6      b1 = 0.0578;              0.698 (b)   8455.2     b1 = 0.0383;             0.842 (b)
                                                            b2 = −0.0044                                     b2 = −0.0006
Weighted    ln Dr = A + br                      8.58        −0.0559                   0.552 (b)   8.97       −0.06                    0.677 (b)

Note: Results in bold indicate the best fitting models.
a Not significant (all others significant at 0.001).
b Pseudo-R2 calculated from RSS and TSS reported by SAS using the formula pseudo-R2 = 1 − RSS/TSS.

computing lnDr, that is, the column lnpopden is computed as ln(popden + 1) to avoid taking the logarithm of zero.* Repeat step 4 to fit the logarithmic, power, exponential, Tanner–Sherratt, and Newling's functions. Use Table 6.1 as a guideline on what data to use for defining the X variable(s) and Y variable. All regressions, except for Newling's model, are simple bivariate models.† Regression results are summarized in Table 6.3.

Step 6. Drawing graphs and adding trend lines in Excel: In the Excel file monodist_trt, add an X-Y scatter graph depicting how density varies with distance. Click the data points on the graph > Add Trendline (see Figure 6.3 for a sample dialog window). Check the options "Display equation on chart" and "Display R-squared value on chart" to have regression results added to the trend lines. Test the four bivariate functions (linear, logarithmic, exponential, and power) one at a time.



The choice of adding 1 (instead of 0.2, 0.5 or others) is arbitrary, and may bias the coefficient estimates. However, different additive constants have a minimal impact on significance testing as standard errors grow proportionally with the coefficients and thus leave the t values largely unchanged (Osgood, 2000, p. 36). One may also fit the cubic spline function in Excel. For example, by assigning arbitrary values to the parameters xi such as x0 = 1, x1 = 5, x2 = 10, x3 = 15, the cubic spline model has an R2 = 0.43, but most of the terms are not statistically significant.

[Figure 6.6: x-y scatter of density (per sq_km) against distance (km) for census tracts, with the fitted exponential trend line y = 6606.5e^(−0.042x), R^2 = 0.3837.]

FIGURE 6.6 Density versus distance exponential trend line (census tracts).

Figure 6.6 shows the exponential trend line superimposed on the X-Y scatter graph of density versus distance.* Note that the exponential and power regressions are based on the logarithmic transformation of the original functions, and recovered to the exponential and power function forms by computing the coefficient a = e^A (e.g., for the exponential function, 6606.5 = e^8.7958, see Figure 6.6). Results from the trend line tool are the same as those obtained by the regression tool.

Step 7. Implementing nonlinear and weighted regressions in SAS: Nonlinear and weighted regression models need to be estimated in SAS. A sample SAS program monocent.sas is provided in the data folder. The SAS program implements all linear regressions and their corresponding weighted regressions. The power, exponential, Tanner–Sherratt, and Newling's functions are also fit by nonlinear regression. Regression results are summarized in Table 6.3. Among comparable functions, the ones with the highest R2 are highlighted. Both the linear and nonlinear regressions indicate that the exponential function has the best fit among the bivariate models (Newling's model has two explanatory terms and thus is not comparable). A weighted regression on the exponential function is presented as an example.
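For readers who prefer to see the weighted fit written out rather than opening monocent.sas, a hypothetical sketch is given below; the CSV export and the area field name AREA_KM2 are assumptions for illustration and may differ from the actual sample program:

proc import datafile='monodist_trt.csv' out=trt dbms=csv replace;
run;

data trt;
  set trt;
  lnpopden = log(popden + 1);   /* add 1 to avoid log of zero, as in step 5 */
run;

proc reg data=trt;
  model lnpopden = dist_cbd;    /* log-transformed exponential function */
  weight area_km2;              /* weight by tract area (Frankena, 1978) */
run;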

6.5.2

part 2: Function Fittings For polycentric Models at the census tract level

Step 8. Computing distances between tract centroids and their nearest centers in ArcGIS: Use the analysis tool "Near" to compute distances between tract centroids (cnty6trtpt) and their nearest centers (polycent15). In the updated attribute table for cnty6trtpt, the fields NEAR_FID and NEAR_DIST identify the

*

The exponential (or power) trendline is not available when the original “popden” is used as y-axis because of the five tracts with zero density. Create a new column, calculate it as “popden + 1”, and use it as y-axis for graphing and adding trendline for these two functions.


nearest center from each tract centroid and the distance between them, respectively.* Add a field D_NEARC and calculate it as D_NEARC = NEAR_DIST/1000. This distance field will be used to test the polycentric assumptions 1 and 4.
Step 9. Computing distances between tract centroids and all centers in ArcGIS: Use the analysis tool Point Distance to compute distances between tract centroids (cnty6trtpt) and all 15 centers (polycent15), and name the output table Trt2PolyD.dbf, which has 1837 × 15 = 27,555 records. Join the attribute table of cnty6trtpt to Trt2PolyD.dbf to attach the tract density information, and export it to a new file PolyDist.dbf. The file will be used to test the polycentric assumptions 2 and 3.
Step 10. Fitting polycentric functions in SAS: While all linear regressions may be implemented in Excel as shown in Part 1, the program polycent.sas under the data folder Chicago fits all functions listed in Table 6.2. Linear regressions are adopted for fitting functions based on assumptions 1, 2, and 4, and nonlinear regression is used to fit the function based on assumption 3. Assumption 1 implies several simple monocentric functions, each of which is based on the center's proximal area. Assumption 2 leads to a multivariate linear function for the whole study area. See Table 6.4 for the regression results.
The function based on assumption 3 is a complex nonlinear function with 30 parameters to estimate (two for each center). After numerous trials, the model does not converge and no regression result can be obtained. For illustration, a model with only two centers (center 15 at the CBD and center 5 at the O'Hare airport) is obtained:

  D = 9971.8 e^(−0.0456 r15) − 6630.7 e^(−0.3387 r5), with pseudo-R2 = 0.692,
      (38.65)   (−21.21)        (−1.58)   (−2.38)

where the t-values in parentheses (each estimated as a coefficient's value divided by its standard error) indicate that the CBD is far more significant than the O'Hare airport center.
The function based on assumption 4 is obtained by a linear regression:

  lnD = 8.8603 − 0.0396 rCBD − 0.0128 rcent, with R2 = 0.387.
        (199.37)  (−28.53)       (−3.34)

The corresponding t values in parentheses imply that both distances, from the CBD and from the nearest center, are statistically significant, but the distance from the CBD is far more important.
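As a minimal illustration of how the assumption-1 fittings might be scripted, the following SAS sketch runs one proximal-area regression per center, using the fields created in step 8 (NEAR_FID for the nearest center and D_NEARC for the distance in km). The dataset name trtdata and the density variable lnpopden are assumptions rather than the actual names used in polycent.sas.

proc sort data=trtdata;
   by near_fid;
run;

proc reg data=trtdata;
   by near_fid;                    /* one regression lnD = Ai + bi*ri per center's proximal area */
   model lnpopden = d_nearc;
run;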

6.5.3

part 3: Function Fittings For Monocentric Models at the toWnship level

Step 11. Using area-based interpolation to estimate population in townships in ArcGIS: In ArcToolbox, use the analysis tool Intersect to overlay cnty6trt and twnshp, and name the output layer twntrt. On the attribute table of twntrt, add a field *

Note that the “Near” tool is already executed once in step 3 of Part 1, and the fields NEAR _ FID and NEAR _ DIST are updated.


TABLE 6.4
Regressions Based on Polycentric Assumptions 1 and 2 (n = 1837 Census Tracts)

Assumption 1: lnD = Ai + bi·ri for center i's proximal area. Assumption 2: lnD = a + Σ_{i=1}^{n} bi·ri for the whole study area.

Center      No.        Assumption 1                              Assumption 2
Index i     Tracts     Ai           bi            R2             bi
1           185        7.300***      0.1577***    0.185          −0.1110***
2           105        7.4106***     0.1511***    0.267          −0.0628*
3           400        8.3691***    −0.0464***    0.110          −0.0152
4            76        7.6878***    −0.0513**     0.114          −0.0659**
5            50        5.534***      0.3671***    0.282           0.1394***
6            68        7.0748***    −0.0338       0.045           0.1138***
7            51        7.2460***    −0.0001       0.000          −0.0528
8            47        7.5772***    −0.0698*      0.127           0.1065***
9            60        7.1751***    −0.0174       0.008          −0.0344
10          103        7.4755***    −0.0673***    0.349          −0.0539***
11           87        7.3180***    −0.0584***    0.292          −0.0378*
12           22        6.6131***    −0.0283       0.012           0.0801**
13           27        8.1187***    −0.2535***    0.633          −0.0468**
14           65        7.1778***    −0.0810***    0.457          −0.0046
15          491        8.2310***     0.0430*      0.011          −0.0272**

Assumption 2: a = 11.01***, R2 = 0.429, sample size = 1837.
Note: *Significant at 0.05; **Significant at 0.01; ***Significant at 0.001.

InterArea and use the tool Calculate Geometry to update it, which is the area size of the intersected feature (see step 4 in Section 1.3.1). Add another field EstPopu to twntrt, and calculate it as EstPopu = popuden*InterArea/1000000. Here we use the population density inherited from the layer cnty6trt (see step 9 in Section 1.3.1).* On the attribute table of twntrt, summarize EstPopu by the field RNGTWN (township ids), and name the output table twn _ pop.dbf. The field Sum _ EstPopu in the table twn _ pop.dbf is the estimated population in townships. Step 12. Computing density and distance from CBD for townships in ArcGIS: Join the table twn _ pop.dbf to the attribute table of twnshp based on their common field RNGTWN. Add a field popden to the joined table and calculate it as popden = 1000000* Sum _ EstPopu/Shape _ Area. Based on the polygon feature twnshp, generate a point feature twnshppt as their centroids (see step 2 in Section 2.4.1). Use the analysis tool Near to obtain the distances of survey townships (twnshppt) from the CBD (monocent). The attribute table of twnshppt now contains the variables needed for function fittings: population density (in the truncated *

An alternative is to use the areal weighting interpolation method implemented in Section 3.5.1.

fieldname twnshp_pop), distance (in field NEAR_DIST), and area (in the truncated fieldname twnshp_S_1).
FIGURE 6.7 Density versus distance exponential trend line (survey townships). [X-Y scatter of density (per sq. km) against distance (km), with the fitted trend line y = 8339.8e^(−0.061x), R2 = 0.6948.]
Step 13. Monocentric function fittings in SAS: For convenience, one may feed the data twnshppt.dbf to a slightly modified SAS program mono_twnshp.sas (also provided in the data folder) to run both the linear and nonlinear regressions. Results are also shown in Table 6.3. Given the small sample size, polycentric functions are not tested. Figure 6.7 shows the fitted exponential function curve based on the interpolated population data for survey townships. Survey townships are much larger than census tracts and have far fewer observations. It is not surprising that the function is a better fit at the township level (as shown in Figure 6.7) than at the tract level (as shown in Figure 6.6).
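As a hypothetical sketch of the workflow in step 13 (assuming SAS can read the dBASE file directly, which requires SAS/ACCESS Interface to PC Files), the township data could be imported and fit as follows; the field names popden and NEAR_DIST follow step 12, while the dataset name twnshp is an assumption.

proc import datafile="twnshppt.dbf" dbms=dbf out=twnshp replace;
run;

data twnshp;
   set twnshp;
   lnpopden = log(popden + 1);     /* log of density, as in the tract-level analysis */
   dist = near_dist / 1000;        /* distance from the CBD in km */
run;

proc reg data=twnshp;
   model lnpopden = dist;          /* exponential function in log form */
run;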

6.6 DISCUSSIONS AND SUMMARY As shown in Table 6.3, among the five bivariate monocentric functions (linear, logarithmic, power, exponential, and Tanner–Sherratt), the exponential function has the best fitting overall. It generates the highest R2 by linear regressions using both census tract data and survey township data. Only by nonlinear regressions, the Tanner–Sherratt model has an R2 slightly higher than the exponential model. Newling’s model has two terms (distance and distance squared), and thus its R2 is not comparable to the bivariate functions. In fact, Newling’s model is not a very good fit here since the term “distance squared” is not statistically significant by linear regression on either census tract data or survey township data. Data aggregated at the survey township level smooth out variations in both densities and distances. As expected, R2 is higher for each function obtained from the survey township data than that from the census tract data. We here use the exponential function as an example to compare the regression results by linear and nonlinear regressions. The nonlinear regression generates


a  higher intercept than the linear regression for census tracts (8167.3 > e8.7958 = 6606.5), but a lower intercept for survey townships (7043.0 < e9.0288 = 8339.8). Recall the discussion in Section 6.3 that nonlinear regression tends to value high-density areas more than low-density areas in minimizing residual sum squared (RSS). In the case of using the census tract data, the logarithmic transformation in linear regression reduces the error contributions by high-density areas, and thus its fitting intercept tends to swing lower; in the case of using the township data, the effect is the opposite. The trend for the slope is consistent in using the two area units, i.e., a flatter slope by nonlinear than by linear regression. In weighted regression, observations are weighted by their area sizes. The area size for survey townships is considered comparable except for some incomplete townships on the east border. Therefore, the results between weighted and unweighted (OLS linear) regression are very similar when based on the survey township data (using the exponential function as an example), but are different when based on the census tract data. As seen in Table 6.4, the regression results based on the first polycentric assumption reveal that within most of the proximal areas, population densities decline with distances from their nearest centers. This is particularly true for suburban centers. But there are areas with a reversed pattern with densities increasing with distances, particularly in the central city (e.g., centers 1, 2, 5, and 15; refer to Figure 6.5 for their locations). This clearly indicates density craters around the downtown or near-downtown job centers because of significant nonresidential land uses (commercial, industrial, or public) around these centers. The analysis of density function fittings based on polycentric assumption 1 enables us to examine the effect of centers on the local areas surrounding the centers. The regression result based on assumption 2 indicates distance decay effects for most centers: seven of the coefficients bi have the expected negative sign and are statistically significant, and four more have the expected negative sign, though not statistically significant. The regression based on assumption 3 is most difficult to implement, particularly when the number of centers is large. The regression result based on assumption 4 indicates that the CBD exerts the dominant effect on the density pattern, and the effects from the nearest subcenters are also significant. Function fitting is commonly encountered in many quantitative analysis tasks to characterize how an activity or event varies with distance from a source. Urban and regional studies suggest that in addition to population density, land use intensity (reflected in its price), land productivity, commodity prices and wage rate may all experience some “distance decay effect” (Wang and Guldmann, 1997), and studies of their spatial patterns may benefit from the function fitting approach. Furthermore, various growth patterns (backwash vs. spread, centralization vs. decentralization) can be identified by examining the changes over time and analyzing whether the changes vary in directions (e.g., northward vs. southward, along certain transportation routes, etc.).

APPENDIX 6A: DERIVING URBAN DENSITY FUNCTIONS This appendix discusses the theoretical foundations for urban density functions: an economic model following Mills (1972) and Muth (1969) (also see Fisch, 1991), and a gravity-based model based on Wang and Guldmann (1996).


Mills–Muth economic model: Each urban resident intends to maximize his/her utility, that is, Max U(h, x), by consuming h amount of land (i.e., housing lot size) and x amount of everything else. The budget constraint y is given as y = ph·h + px·x + t·r, where ph and px are the prices of land and everything else respectively, t is the unit cost for transportation, and r is the distance to the city center where all jobs are located. The utility maximization yields a first-order condition given by

  dU/dr = 0 = h·(dph/dr) + t.   (A6.1)

Assume that the price elasticity of the land demand is −1 (often referred to as the assumption of "negative unit elasticity for housing demand"), that is,

  h = ph^(−1).   (A6.2)

Combining (A6.1) and (A6.2) yields the negative exponential rent gradient:

  (1/ph)·(dph/dr) = −t.   (A6.3)

As population density D(r) is given by the inverse of lot size h (i.e., D(r) = 1/h), we have D(r) = 1/(ph^(−1)) = ph. Substituting into (A6.3) and solving the differential equation yields the negative exponential density function

  D(r) = D0·e^(−tr).   (A6.4)
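For completeness, the intermediate step of solving the differential equation may be written out as follows (with D0 denoting the density at r = 0):

  Substituting D(r) = ph into (A6.3) gives (1/D)·(dD/dr) = −t.
  Integrating both sides with respect to r: ∫(1/D) dD = −∫ t dr, so lnD = −t·r + c.
  Taking exponentials: D(r) = e^c · e^(−tr) = D0·e^(−tr), which is Equation A6.4.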

Gravity-based model: Consider a city composed of n equal-area tracts. The population in tract j, xj, is proportional to the potential there, that is,

  k·xj = Σ_{i=1}^{n} xi / dij^β,   (A6.5)

where dij is the distance between tracts i and j, β is the distance friction coefficient, and n is the total number of tracts in the city. This proposition assumes that the population at a particular location is determined by its location advantage or accessibility to all other locations in the city, measured as a gravity potential. Equation A6.5 can also be written, in matrix notation, as

  kX = AX,   (A6.6)

where X is a column vector of n elements (x1, x2, …, xn), A is an n × n matrix with terms involving the dij's and β, and k is an unknown scalar. Normalizing one of the population terms, say x1 = 1, (A6.6) becomes a system of n equations with n unknown variables that can be solved by numerical analysis. Assuming a transportation network made of circular and radial roads (see Section 10.3 in Chapter 10) that defines the distance term dij and given a β value, the population distribution pattern can be simulated in the city. The simulation indicates that the density pattern conforms to the negative exponential function when β is within a particular range (i.e., 0.2 ≤ β < 1.5). When β is larger (i.e., 1.5 ≤ β ≤ 2.0), the log-linear function becomes a better fit. Therefore, the model suggests that the best-fitting density function may vary over time and in different cities.

APPENDIX 6B: CENTRALITY MEASURES AND ASSOCIATION WITH URBAN DENSITIES
Among various centrality indices (Kuby et al., 2005), three are most widely used: closeness (CC), betweenness (CB) and straightness (CS), measuring a location being close to all others, being the intermediary between others, and being accessible via a straight route to all others. All are based on a street network that is composed of nodes and edges connecting them. The distance between nodes can be measured in a dual graph that does not differentiate edge lengths (i.e., topological distance as discussed in Section 2.1), or by a primal approach that represents road segments (edges) with lengths (Jiang and Claramunt, 2004). The latter has been increasingly adopted in recent studies.
Closeness centrality CC measures how close a node is to all the other nodes along the shortest paths of the network. CC for a node i is defined as:

  Ci^C = (N − 1) / Σ_{j=1, j≠i}^{N} dij,   (A6.7)

where N is the total number of nodes in the network, and dij is the shortest distance between nodes i and j. In other words, CC is the inverse of average distance from this node to all other nodes.
Betweenness centrality CB measures how often a node is traversed by the shortest paths connecting all pairs of nodes in the network. CB is defined as:

  Ci^B = [1 / ((N − 1)(N − 2))] Σ_{j,k=1; j≠k≠i}^{N} njk(i) / njk,   (A6.8)

where njk is the number of shortest paths between nodes j and k, and njk(i) is the number of these shortest paths that contain node i. CB captures a special property for a place: it does not act as an origin or a destination for trips, but as a pass-through point.


Straightness centrality CS measures how much the shortest paths from a node to all others deviate from the virtual straight lines (Euclidean distances) connecting them. CS is defined as:

  Ci^S = [1 / (N − 1)] Σ_{j=1; j≠i}^{N} dij^Eucl / dij,   (A6.9)

where dijEucl is the Euclidean distance between nodes i and j. CS measures the extent to which a place can be reached directly along a straight line from all other places in a city. The global centrality indices are calculated on the street network of the whole study area, and local centrality indices for any location are based on the street network within a radius (catchment area) around the location. One may use the ArcGIS Network Analyst module to prepare a street network dataset (see step 8 in Section 2.4.2), and download and install the Urban Network Analysis tool by Sevtsuk et  al. (2013) to implement the computation of the centrality indices. To examine the association between centrality and land use intensity (e.g., population density), one needs to convert the two features (e.g., centrality values at nodes of a street network vs. population density in census tracts) into one data frame. One approach is to convert both into a raster dataset by kernel density estimation as discussed in Section 3.1.2, and use a tool in ArcGIS (ArcToolbox > Spatial Analyst Tools > Multivariate > Band Collection Statistics, and check the option “Compute covariance and correlation matrices”) to examine the correlation between them. In a study reported in Wang et al. (2011), the centrality indices outperform the economic monocentric model in explaining the intraurban variation of population density significantly.

APPENDIX 6C: OLS REGRESSION FOR A LINEAR BIVARIATE MODEL A linear bivariate regression model is written as yi = a + bxi + ei, where x is a predictor (independent variable) of y (dependent variable), e is the error term or residual, i is the index for individual cases or observations, and a and b are parameter estimates (often referred to as “intercept” and “slope,” respectively). The predicted value for y by the linear model is yˆi = a + bxi . Residuals e measure prediction error, that is, ei = yi − yˆi = yi − (a + bxi ). When e > 0, the actual yi is higher than the predicted yˆi (i.e., underprediction). When e < 0, the actual yi is lower than the predicted yˆi (i.e., overprediction). A


perfect prediction results in a zero residual (e = 0). Either underprediction or overprediction contributes to inaccuracy, and therefore the total error or the residual sum squared (RSS) is

  RSS = Σ_i ei^2 = Σ_i (yi − a − b·xi)^2.   (A6.10)

Denoting the mean of Yi by Ȳ, total sum squared (TSS), defined as TSS = Σ_i (Yi − Ȳ)^2, represents the total variation of dependent variable Yi, and explained sum squared (ESS), defined as ESS = Σ_i (Ŷi − Ȳ)^2, reflects the variation explained by the model. The difference between them is RSS (i.e., RSS = TSS − ESS). The ordinary least square (OLS) regression seeks the optimal values of a and b so that RSS is minimized. Here xi and yi are observed, and a and b are the only variables. The optimization conditions for minimizing Equation A6.10 are

  ∂RSS/∂a = −2 Σ_i (yi − a − b·xi) = 0,   (A6.11)
  ∂RSS/∂b = 2 Σ_i (yi − a − b·xi)(−xi) = 0.   (A6.12)

Assuming that n is the total number of observations, we have Σ_i a = na. Solving Equation A6.11 for a, we get

  a = (1/n) Σ_i yi − (b/n) Σ_i xi.   (A6.13)

Substituting Equation A6.13 into Equation A6.12 and solving for b yields

  b = [n Σ_i xi·yi − (Σ_i xi)(Σ_i yi)] / [n Σ_i xi^2 − (Σ_i xi)^2].   (A6.14)

Substituting Equation A6.14 back into Equation A6.13 yields the solution to a. How good is the regression model? A popular measure for goodness of fit is R2 = ESS/TSS = 1 − RSS/TSS. It captures the portion of Y's variation from its mean explained by the model.
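A small numerical illustration (with made-up data) may help. Suppose three observations (x, y) = (1, 2), (2, 3), (3, 5). Then n = 3, Σxi = 6, Σyi = 10, Σxi·yi = 23, and Σxi^2 = 14. By Equation A6.14, b = (3 × 23 − 6 × 10)/(3 × 14 − 6^2) = 9/6 = 1.5, and by Equation A6.13, a = 10/3 − (1.5/3) × 6 = 1/3 ≈ 0.33. The predicted values are 1.83, 3.33, and 4.83, so RSS ≈ 0.17; with TSS ≈ 4.67, R2 = 1 − 0.17/4.67 ≈ 0.96.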

7

Principal Components, Factor and Cluster Analyses, and Application in Social Area Analysis

This chapter discusses three important multivariate statistical analysis methods: principal components analysis (PCA), factor analysis (FA), and cluster analysis (CA). PCA and FA are often used together for data reduction by structuring many variables into a limited number of components (factors). The techniques are particularly useful for eliminating variable collinearity and uncovering latent variables. Applications of the methods are widely seen in socioeconomic studies (e.g., Section 8.7.1). While the PCA and FA group the variables, the CA classifies observations into categories according to similarity among their attributes. In other words, given a data set as a table, the PCA and FA reduce the number of columns and the CA reduces the number of rows. By reducing data dimension, benefits of PCA and FA include uncovering latent variables for easy interpretation and removing multicollinearity for subsequent regression analysis. In many socioeconomic applications, variables extracted from census data are often correlated with each other, thus contain information duplicated to some extent. PCA and FA use fewer components or factors to represent the original variables, and thus simplify the structure for analysis. Resulting component or factor scores are uncorrelated to each other (if not rotated or orthogonally rotated), and thus can be used as independent explanatory variables in regression analysis. Despite the commonalities, PCA and FA are “both conceptually and mathematically very different” (Bailey and Gatrell, 1995, p. 225). PCA uses the same number of components to simply transform the original data, and thus is strictly a mathematical transformation. FA uses fewer factors to capture most of the variations among the original variables with error terms, and thus is a statistical analysis process. PCA attempts to explain the variance of observed variables, whereas FA intends to explain their intercorrelations (Hamilton, 1992, p. 252). Social area analysis is used to illustrate the techniques as it employs all three methods. The interpretation of social area analysis results also leads us to a review and comparison of three classic models on urban structure, namely, the concentric zone model, the sector model, and the multi-nuclei model. The analysis demonstrates how analytical statistical methods synthesize descriptive models into one framework. Beijing, the capital city of China, in the midst of forming its social areas after decades under a socialist regime, is chosen as the study area for a case study. 143


Sections 7.1 through 7.3 discuss principal components analysis, factor analysis and cluster analysis, respectively. Section 7.4 reviews social area analysis. A case study on the social space in Beijing is presented in Section 7.5. The chapter is concluded with discussion and a brief summary in Section 7.6.

7.1 PRINCIPAL COMPONENTS ANALYSIS
For convenience, the original data of observed variables Xk are first standardized so that variables do not depend on measurement scale and thus are comparable to each other. Denote the mean and the standard deviation for a series of data Xk as X̄ and σ, respectively. Data standardization involves the process of converting the data series Xk into a new series Zk such that Zk = (Xk − X̄)/σ. By doing so, the resulting data series Zk has the mean equal to 0 and the standard deviation equal to 1. Principal components analysis (PCA) transforms data of K observed variables Zk (likely correlated with each other to some degree) to data of K principal components Fk that are independent from each other:

  Zk = lk1·F1 + lk2·F2 + … + lkj·Fj + … + lkK·FK.   (7.1)

The components Fj can also be expressed as a linear combination of the original variables Zk:

  Fj = a1j·Z1 + a2j·Z2 + … + akj·Zk + … + aKj·ZK.   (7.2)

The components Fj are constructed to be uncorrelated with each other, and are ordered such that the first component F1 has the largest sample variance (λ1), F2 the second largest, and so on. The variances λj corresponding to various components are termed eigenvalues, and λ1 > λ2 > …. A larger eigenvalue represents a larger share of the total variance in all of the original variables captured by a component and thus indicates a component of more importance. That is to say, the first component is more important than the second, and the second is more important than the third, and so on. For readers familiar with the terminology of linear algebra, the eigenvalues λj (j = 1, 2, …, K) are based on the correlation matrix R (or, less commonly, the covariance matrix C) of the original variables Z. The correlation matrix R is written as

  R = | 1     r12   …   r1K |
      | r21   1     …   r2K |
      | …     …     …   …   |
      | rK1   rK2   …   1   |,

where r12 is the correlation between Z1 and Z 2, and so on. The coefficients (a1j, a2j, …, aKj) for component Fj in Equation 7.2 comprise the eigenvector associated with the jth largest eigenvalue λj.
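A simple two-variable case illustrates the idea. For K = 2 standardized variables with correlation r, the correlation matrix R = [1 r; r 1] has eigenvalues λ1 = 1 + r and λ2 = 1 − r, with eigenvectors (1, 1)/√2 and (1, −1)/√2. The components are therefore F1 = (Z1 + Z2)/√2 and F2 = (Z1 − Z2)/√2; if, say, r = 0.6, the first component alone accounts for λ1/K = 1.6/2 = 80% of the total variance.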


Since standardized variables have variances of 1, the total variance of all variables also equals the number of variables, that is, λ1 + λ2 + … + λK = K. Therefore, the proportion of total variance explained by the jth component is λj/K. In PCA, the number of components equals the number of original variables, so no information is lost in the process. The newly constructed components are a linear function of the original variables and have two important and useful properties: (a) they are independent of each other, and (b) their corresponding eigenvalues reflect their relative importance. The first property ensures that multicollinearity is avoided if the resulting components are used in regression, and the second property enables us to use the first fewer components to capture most of the information in a multivariate data set. By discarding the less important components and keeping only the first few components, we achieve data reduction. If the J largest components (J < K) are retained, Equation 7.1 becomes

  Zk = lk1·F1 + lk2·F2 + … + lkJ·FJ + vk,   (7.3)

where the discarded components are represented by the residual term vk:

  vk = lk,J+1·FJ+1 + lk,J+2·FJ+2 + … + lkK·FK.   (7.4)

Equations 7.3 and 7.4 are sometimes termed a principal components factor model. It is not a true factor analysis (FA) model.
The SAS procedure for principal components analysis (PCA) is PRINCOMP. It reports the correlation matrix of input variables, eigenvalues of the correlation matrix, and eigenvectors. It also generates a scree plot and a graph for proportions of variance explained by components, which help researchers decide the number of principal components to retain. The following sample SAS statements implement the PCA on 14 variables (x1 through x14) and export the result to an SAS dataset PCOMP:

proc princomp out = PCOMP(replace = yes);
   var x1-x14;
run;

7.2 FACTOR ANALYSIS
Factor analysis (FA) is based on an assumption different from that of PCA. FA assumes that each variable has common and unique variance, and the method seeks to create factors that account for the common, not the unique, variance. FA uses a linear function of J unobserved factors Fj, plus a residual term uk, to model each of K observed variables Zk:

  Zk = lk1·F1 + lk2·F2 + … + lkJ·FJ + uk.   (7.5)


The same J factors, Fj, appear in the function for each Zk, and thus are termed common factors. Equation 7.5 looks like the principal components factor model in Equation 7.3, but the residuals uk in Equation 7.5 are unique to variables Zk, termed unique factors, and uncorrelated. In other words, after all the common factors are controlled for, correlations between the unique factors become zero. On the other hand, residuals vk, defined as in Equation 7.4, are linear functions of the same discarded components, and thus correlated. In order to understand this important property of FA, it is helpful to introduce the classical example used by psychologist Charles Spearman, who invented the method about 100 years ago. Spearman hypothesized that a variety of tests of mental ability such as scores in mathematics, vocabulary and other verbal skills, artistic skills, and logical reasoning ability could all be explained by one underlying factor of general intelligence (g). If g could be measured and controlled for, there would be no correlations among any tests of mental ability. In other words, g was the only factor common to all the scores, and the remainder was the unique factor indicating the special ability in that area. When both Zk and Fj are standardized, the lkj in Equation 7.5 are standardized coefficients in the regression of variables Zk on common factors Fj, also termed factor loadings. In other words, the loadings are correlation coefficients between the standardized Zk and Fj, and thus range from −1 to +1. For example, lk1 is the loading of variables Zk on standardized component F1. Factor loading reflects the strength of relations between variables and factors. Estimates of the factors are termed factor scores. For instance, the estimate for Fj (j = 1, 2, …, J; and J < K) is: Fˆ j = c1 j Z1 + c2 j Z 2 +  + cKj Z K ,

(7.6)

where ckj is the factor score coefficient for the kth variable on the jth factor. Equation 7.6 is used to generate factor scores for corresponding observations. The computing process of FA is more complex than that of PCA. A commonly used procedure for FA is principal factor analysis, which utilizes the results from PCA in iterations to obtain the final factor structure and factor scores. The PCA extracts the principal components based on the eigenvalues of correlation matrix R, and its computation is straightforward. The principal factor analysis extracts principal factors of a modified correlation matrix R*. Whereas elements on the major diagonal of R are 1, these elements in R* are replaced by estimates of communality (a variable’s variance shared by the common factors). For instance, the communality of variable Zk in Equation 7.5 equals the proportion of Zk ’s variance explained by the common factors, that is, ∑ Jj =1 lkj2 . After an initial number of factors is decided, the communality and the modified correlation matrix R* are estimated through an iterative process until there is little change in the estimates. Based on the final modified correlation matrix R* and extracted principal components, the first principal component defines the first factor, the second component defines the second factor, and so forth. In addition to the principal factor analysis method, there are also several alternative methods for FA based on maximum-likelihood estimation.


How do we decide the number of factors to use? Eigenvalues provide a basis for judging which factors are important and which are not. In deciding the number of factors to include, one has to make a tradeoff between the total variance explained (higher by including more factors) and interpretability of factors (better with fewer factors). A rule of thumb is to include only factors with eigenvalues greater than 1. Since the variance of each standardized variable is 1, a factor with λ < 1 accounts for less variance than an original variable’s variance, and thus does not serve the purpose of data reduction. Using a value of 1 as a cutoff value to identify “important” factors is arbitrary. A scree graph plots eigenvalues against factor number, and provides a more useful guidance. Figure 7.1 shows the scree graph of eigenvalues (as reported in step 1 of Case Study 7 in Section 7.5). The graph levels off after component 4, indicating that components 5–14 account for relatively little additional variance. Therefore, four components (factors) are retained to account for a cumulative 70% variance. Other techniques such as statistical significance testing based on the maximum-likelihood FA may also be used to choose the number of factors. However, most importantly, selecting the number of factors must be guided by understanding the true forces underlying the research issue and supported by theoretical justifications. There is always a tradeoff between the total variance explained (higher by using more factors) and interpretability of factors (simpler with fewer factors). Initial results from FA are often hard to interpret as variables load across factors. While fitting the data equally well, rotation generates simpler structure by attempting to find factors that have loadings close to ±1 or 0. This is done by maximizing the loading (positive or negative) of each variable on one factor and minimizing the loadings on the others. As a result, we can detect more clearly which factors mainly capture what variables (not others) and therefore label the factors adequately. The derived factors are considered latent variables representing underlying dimensions that are combinations of several observed variables. Variance explained

FIGURE 7.1 Scree plot and variance explained in principal components analysis (PCA). [Left panel: scree plot of eigenvalues against principal component number; right panel: proportion and cumulative proportion of variance explained by component.]


Orthogonal rotation generates independent (uncorrelated) factors, an important property for many applications. A widely used orthogonal rotation method is varimax rotation, which maximizes the variance of the squared loadings for each factor, and thus polarizes loadings (either high or low) on factors. Oblique rotation (e.g., promax rotation) generates even greater polarization but allows correlation between factors. As a summary, Figure 7.2 illustrates the major steps in PCA and FA: 1. The original data set of K observed variables with n records is first standardized. 2. PCA then generates K uncorrelated components to account for all the variance of the K variables, and suggests the first J components accounting for most of the variance. 3. FA uses only J (J < K) factors to model interrelations of the original K variables. 4. A rotation method can be used to load each variable strongly (close to ±1) on one factor and very little (near zero) on the others for easier interpretation. In essence, PCA attempts to explain all of the variance in the original variables, whereas FA tries to explain their intercorrelations (covariances). Therefore, PCA is used to construct composite variables that reproduce the maximum variance of the original variables, and FA is used to reveal the relationships between the original variables and unobserved underlying dimensions. Despite the differences, PCA and FA share much common ground. Both are primarily used for data reduction, uncovering latent variables, and removing multicollinearity in regression analysis. Many consider PCA a special case of FA.

FIGURE 7.2 Major steps in principal components analysis (PCA) and factor analysis (FA). [Schematic: (1) the original data set of n records by k variables is standardized into Z scores; (2) PCA produces k components and their loadings; (3) FA retains j (j < k) factors and their loadings; (4) rotation maximizes each variable's loading on one factor for easier interpretation.]


The SAS procedure for FA is FACTOR. Its report includes eigenvalues of the correlation matrix, factor pattern (variable loadings on factors), and standardized scoring coefficients. In the following sample SAS statements, the factor analysis is implemented to consolidate 14 variables x1 through x14 into four factors and adopts the varimax rotation technique. It exports the factor scores to an SAS dataset FACTSCORE.

proc factor out = FACTSCORE(replace = yes) nfact = 4 rotate = varimax;
   var x1-x14;
run;

7.3 CLUSTER ANALYSIS
Cluster analysis (CA) groups the observations according to the similarity among their attributes. As a result, the observations within a cluster are more similar than observations between clusters as measured by the clustering criterion. Note the difference between CA and another similar multivariate analysis technique—discriminant function analysis (DFA). Both group the observations into categories based on the characteristic variables. The difference is that the categories are unknown in CA but known in DFA. See Appendix 7 for further discussion on DFA.
Geographers have a long-standing interest in cluster analysis (CA) and have used it in applications such as regionalization and city classification. A key element in deciding assignment of observations to clusters is "attributive distance," in contrast to the various spatial distances discussed in Chapter 2. The most commonly used attributive distance measure is Euclidean distance:

  dij = [ Σ_{k=1}^{K} (xik − xjk)^2 ]^(1/2),   (7.7)

where xik and xjk are the kth variable of the K-dimensional observations for individuals i and j, respectively. When K = 2, Euclidean distance is simply the straight line distance between observations i and j in a two-dimensional space. Other distance measures include Manhattan distance and others (e.g., Minkowski distance, Canberra distance) (Everitt et al., 2001, p. 40). The most widely used clustering method is the agglomerative hierarchical method (AHM). The method produces a series of groupings: the first consists of single-member clusters, and the last consists of a single cluster of all members. The results of these algorithms can be summarized with a dendrogram, a tree diagram showing the history of sequential grouping process. See Figure 7.3 for the example illustrated below. In the diagram, the clusters are nested, and each cluster is a member of a larger and higher level cluster. For illustration, an example is used to explain a simple AHM, the single linkage method or the nearest neighbor method. Consider a data set of four observations with the following distance matrix:

FIGURE 7.3 Dendrogram for a cluster analysis example. [Tree diagram with distance (1.0–5.0) on the vertical axis and data points 1–4 on the horizontal axis; clusters C1, C2, and C3 form at distances 3, 4, and 5, respectively.]

  D1 (observations 1–4):
      1:  0
      2:  3  0
      3:  6  5  0
      4:  9  7  4  0

The smallest nonzero entry in the above matrix D1 is (2 → 1) = 3, and therefore individuals 1 and 2 are grouped together to form the first cluster C1. Distances between this cluster and the other two individuals are defined according to the nearest neighbor criterion:

  d(12)3 = min{d13, d23} = d23 = 5
  d(12)4 = min{d14, d24} = d24 = 7.

A new matrix is now obtained with cells representing distances between cluster C1 and individuals 3 and 4:

  D2:
      (12):  0
      3:     5  0
      4:     7  4  0

The smallest nonzero entry in D2 is (4 → 3) = 4, and thus individuals 3 and 4 are grouped to form a cluster C2. Finally, clusters C1 and C2 are grouped together, with distance equal to 5, to form one cluster C3 containing all four members. The process is summarized in the dendrogram in Figure 7.3, where the height represents the distance at which each fusion is made.


Similarly, the complete linkage (furthest neighbor) method uses the maximum distance between pair of objects (one object from one cluster and another object from the other cluster); the average linkage method uses the average distance between pairs of objects; and the centroid method uses squared Euclidean distance between individuals and cluster means (centroids). Another commonly used AHM is the Ward's method. The objective at each stage is to minimize the increase in the total within-cluster error sum of squares given by

  E = Σ_{c=1}^{C} Ec,

where

  Ec = Σ_{i=1}^{nc} Σ_{k=1}^{K} (xck,i − x̄ck)^2,

in which xck,i is the value for the kth variable for the ith observation in the cth cluster, and x̄ck is the mean of the kth variable in the cth cluster.
Each clustering method has its advantages and disadvantages. A desirable clustering should produce clusters of similar size, densely located, compact in shape, and internally homogeneous (Griffith and Amrhein, 1997, p. 217). The single linkage method tends to produce unbalanced and straggly clusters, and should be avoided in most cases. If the outlier is a major concern, the centroid method should be used. If compactness of clusters is a primary objective, the complete linkage method should be used. The Ward's method tends to find same-size and spherical clusters, and is recommended if no single overriding property is desired (Griffith and Amrhein, 1997, p. 220). The case study in this chapter also uses Ward's method.
The choice for the number of clusters depends on objectives of specific applications. Similar to the selection of factors based on the eigenvalues in factor analysis, one may also use a scree plot to assist the decision. In the case of Ward's method, a graph of R2 versus the number of clusters helps choose the number, beyond which little more homogeneity is attained by further mergers.
In SAS, the procedure CLUSTER implements the cluster analysis, and the procedure TREE generates the dendrogram. The following sample SAS statements use Ward's method for clustering, and cut off the dendrogram at 9 clusters:

proc cluster method = ward outtree = tree;
   id subdist_id;       /* variable for labeling ids */
   var factor1-factor4; /* variables used */
run;
proc tree out = bjcluster ncl = 9;
   id subdist_id;
run;
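A quick way to produce the R2-versus-number-of-clusters graph mentioned above is sketched below; it assumes that the OUTTREE= dataset tree written by PROC CLUSTER contains the variables _NCL_ (number of clusters) and _RSQ_ (R2) at each merger, which is the usual case for Ward's method.

proc sgplot data=tree;
   series x=_ncl_ y=_rsq_;         /* scree-type plot: R-squared against number of clusters */
   xaxis label="Number of clusters";
   yaxis label="R-squared";
run;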

7.4

SOCIAL AREA ANALYSIS

The social area analysis was developed by Shevky and Williams (1949) in a study of Los Angeles, and was later elaborated on by Shevky and Bell (1955) in a study of San


Francisco. The basic thesis is that the changing social differentiation of society leads to residential differentiation within cities. The studies classified census tracts into types of social areas based on three basic constructs: economic status (social rank), family status (urbanization), and segregation (ethnic status). Originally, the three constructs were measured by six variables: economic status was captured by occupation and education; family status by fertility, women labor participation, and single-family houses; and ethnic status by percentage of minorities (Cadwallader, 1996, p. 135). In factor analysis, an idealized factor loadings matrix probably looks like Table 7.1. Subsequent studies using a large number and variety of measures generally confirmed the validity of the three constructs (Berry, 1972, p. 285; Hartschorn, 1992, p. 235). Geographers made an important advancement in social area analysis by analyzing the spatial patterns associated with these dimensions (e.g., Rees, 1970; Knox, 1987). The socioeconomic status factor tends to exhibit a sector pattern (Figure 7.4a): tracts with high values for variables such as income and education form one or more sectors, and low-status tracts form other sectors. The family status factor tends to form concentric zones (Figure 7.4b): inner zones are dominated by tracts with small families with either very young or very old household heads, and tracts in outer zones are mostly occupied by large families with middle-aged household heads. The ethnic status factor tends to form clusters, each of which is dominated by a particular ethnic group (Figure 7.4c). Superimposing the three constructs generates a complex urban mosaic, which can be grouped into various social areas by cluster analysis (see Figure 7.4d). By studying the spatial patterns from social area analysis, three classic models for urban structure, Burgess’s (1925) concentric zone model, Hoyt’s (1939) sector model, and Ullman–Harris (Harris and Ullman, 1945) multi-nuclei model, are synthesized into one framework. In other words, each of the three models reflects one specific dimension of urban structure, and is complementary to each other. There are at least three criticisms of the factorial ecological approach to understanding residential differentiation in cities (Cadwallader, 1996, p. 151). First, the results from social area analysis are sensitive to research design such as variables selected and measured, analysis units, and factor analysis methods. Secondly, it is still a descriptive form of analysis and fails to explain the underlying process

TABLE 7.1 Idealized Factor Loadings in Social Area Analysis Occupation Education Fertility Female labor participation Single-family house Minorities

Economic Status

Family Status

Ethnic Status

I I O O O O

O O I I I O

O O O O O I

Note: Letter I denotes a number close to 1 or –1; O denotes a number close to 0.


FIGURE 7.4 Conceptual model for urban mosaic. [Panel (a): socioeconomic status (high vs. low SES) forms sectors; panel (b): family status (small vs. large families) forms concentric zones; panel (c): ethnic status forms clusters of ethnic enclaves; panel (d): overlaying the three yields the urban mosaic.]

that causes the pattern. Thirdly, the social areas identified by the studies are merely homogeneous, but not necessarily functional regions or cohesive communities. Despite the criticisms, social area analysis helps us to understand residential differentiation within cities, and serves as an important instrument for studying intraurban social spatial structure. Applications of social area analysis can be seen in cities in developed countries, particularly in cities in North America (see a review by Davies and Herbert, 1993), and also in cities in developing countries (e.g., Abu-Lughod, 1969; Berry and Rees, 1969).

7.5 CASE STUDY 7: SOCIAL AREA ANALYSIS IN BEIJING This case study is developed on the basis of a research project reported in Gu et al. (2005). Detailed research design and interpretation of the results can be found in the original paper. This section shows the procedures to implement the study with emphasis


on illustrating the three statistical methods discussed in Sections 7.1 through 7.3. In addition, the study illustrates how to test the spatial structure of factors by regression models with dummy explanatory variables. Since the 1978 economic reforms in China, and particularly the 1984 urban reforms including the urban land use reform and the housing reform, urban landscape in China has changed significantly. Many large cities have been on the transition from a self-contained work-unit neighborhood system to more differentiated urban space. As the capital city of China, Beijing offers an interesting case to look into this important change in urban structure in China. The study area was the contiguous urbanized area of Beijing including four innercity districts (Xicheng, Dongcheng, Xuanwu, and Chongwen)* and four suburban districts (Haidian, Chaoyang, Shijingshan, and Fengtai) with a total of 107 subdistricts (jiedao) in 1998. Some subdistricts on the periphery of those four suburban districts were mostly rural and are thus excluded from the study (see Figure 7.5). The study area had a total population of 5.9 million, and the analysis unit “subdistrict” had an average population of 55,200 in 1998.

FIGURE 7.5 Districts and subdistricts in Beijing. [Map of the study area showing the four inner-city districts (Xicheng, Dongcheng, Xuanwu, Chongwen) and the four suburban districts (Haidian, Chaoyang, Shijingshan, Fengtai), with district boundaries and the study area outlined.]*

Xuanwu and Chongwen are now merged into Xicheng and Dongcheng, respectively.


The following data sets are provided under the data folder Beijing:
1. Feature subdist in geodatabase BJSA.gdb contains 107 urban subdistricts.
2. Text file bjattr.csv is the attribute data set.
In the attribute table of subdist, the field sector identifies four sectors (1 for NE, 2 for SE, 3 for SW, and 4 for NW), and the field ring identifies four zones (1 for the most inner zone, 2 for the next, and so on). Sectors and zones are needed for testing spatial structures of social space. The text file bjattr.csv has 14 socioeconomic variables (X1 to X14) for social area analysis (see Table 7.2). Both the attribute table of subdist and the text file bjattr.csv contain the common field ref_id identifying the subdistricts.
Step 1. Executing the principal components analysis (PCA), factor analysis (FA), and cluster analysis (CA) in SAS: Use the sample SAS program PCA_FA_CA.sas under the project folder to implement this step.
After reading the dataset bjattr.csv, the first SAS procedure PRINCOMP implements the PCA. The eigenvalues of the correlation matrix are reported, as shown

TABLE 7.2
Basic Statistics for Socioeconomic Variables in Beijing (n = 107)

Index   Variable                                   Mean        Std Dev.     Minimum    Maximum
X1      Population density (persons/km2)           14,797.09   13,692.93    245.86     56,378.00
X2      Natural growth rate (‰)                    −1.11       2.79         −16.41     8.58
X3      Sex ratio (M/F)                            1.03        0.08         0.72       1.32
X4      Labor participation ratio (%)a             0.60        0.06         0.47       0.73
X5      Household size (persons/household)         2.98        0.53         2.02       6.55
X6      Dependency ratiob                          1.53        0.22         1.34       2.14
X7      Income (yuan/person)                       29,446.49   127,223.03   7,505.00   984,566.00
X8      Public service density (no./km2)c          8.35        8.60         0.05       29.38
X9      Industry density (no./km2)                 1.66        1.81         0.00       10.71
X10     Office/retail density (no./km2)            14.90       15.94        0.26       87.86
X11     Ethnic enclave (0,1)d                      0.10        0.31         0.00       1.00
X12     Floating population ratio (%)e             6.81        7.55         0.00       65.59
X13     Living space (m2/person)                   8.89        1.71         7.53       15.10
X14     Housing price (yuan/m2)                    6,686.54    3,361.22     1,400.00   18,000.00

a Labor participation ratio is the percentage of persons in the labor force out of the total population at the working ages, that is, 18–60 years for males and 18–55 years for females.
b Dependency ratio is the number of dependents divided by the number of persons in the labor force.
c Public service density is the number of governmental agencies, non-profit organizations, educational units, hospitals, postal, and telecommunication units per square kilometer.
d Ethnic enclave is a dummy variable identifying whether a minority (mainly Muslims in Beijing) or migrant concentrated area was present in a subdistrict.
e Population includes permanent residents (those with a registration status in the hukou system) and floating population (those migrants without a permanent resident status). The ratio is the floating population as a share of the total population.


TABLE 7.3
Eigenvalues from Principal Components Analysis

Component   Eigenvalue   Proportion   Cumulative
1           4.9231       0.3516       0.3516
2           2.1595       0.1542       0.5059
3           1.4799       0.1057       0.6116
4           1.2904       0.0922       0.7038
5           0.8823       0.0630       0.7668
6           0.8286       0.0592       0.8260
7           0.6929       0.0495       0.8755
8           0.5903       0.0422       0.9176
9           0.3996       0.0285       0.9462
10          0.2742       0.0196       0.9658
11          0.1681       0.0120       0.9778
12          0.1472       0.0105       0.9883
13          0.1033       0.0074       0.9957
14          0.0608       0.0043       1.0000

Note: Eigenvalues larger than 1 (components 1–4) correspond to the retained factors.

in Table 7.3. The first four components have eigenvalues greater than 1, and account for a cumulative 70.4% of total variance explained. The second SAS procedure FACTOR implements the FA. Four factors are used to capture most of the information in the original 14 variables. The output factor scores are saved in a text file factscore.csv containing the original 14 variables and the four factor scores. The Varimax rotation technique is used to polarize the variable loadings. Table 7.4 presents the rotated factor structure (variables are reordered to highlight the factor loading structure). The factors are labeled to reflect major variables loaded: a. “Land use intensity” is by far the most important factor explaining 35.16% of the total variance, and captures mainly six variables: three density measures (population density, public service density, and office and retail density), housing price, and two demographic variables (labor participation ratio and dependency ratio). b. “Neighborhood dynamics” accounts for 15.42% of the total variance and includes three variables: floating population ratio, household size, and living space. c. “Socioeconomic status” accounts for 10.57% of the total variance and includes two variables: average annual income per capita and population natural growth rate. d. “Ethnicity” accounts for 9.22% of the total variance and includes three variables: ethnic enclave, sex ratio, and industry density.


TABLE 7.4
Factor Loadings in Social Area Analysis

Variable                     Land Use     Neighborhood   Socioeconomic   Ethnicity
                             Intensity    Dynamics       Status
Public service density        0.8887       0.0467         0.1808          0.0574
Population density            0.8624       0.0269         0.3518          0.0855
Labor participation ratio    −0.8557       0.2909         0.1711          0.1058
Office/retail density         0.8088      −0.0068         0.3987          0.2552
Housing price                 0.7433      −0.0598         0.1786         −0.1815
Dependency ratio              0.7100       0.1622        −0.4873         −0.2780
Household size                0.0410       0.9008        −0.0501          0.0931
Floating population ratio     0.0447       0.8879         0.0238         −0.1441
Living space                 −0.5231       0.6230        −0.0529          0.0275
Income                        0.1010       0.1400         0.7109         −0.1189
Natural growth rate          −0.2550       0.2566        −0.6271          0.1390
Ethnic enclave                0.0030      −0.1039        −0.1263          0.6324
Sex ratio                    −0.2178       0.2316        −0.1592          0.5959
Industry density              0.4379      −0.1433         0.3081          0.5815

Note: Values in bold indicate the largest loading of each variable on one factor among the four.

The third SAS procedure CLUSTER implements the CA, and produces a complete dendrogram tree of clustering. The procedure PROC TREE uses the option NCL = 5 to define the number of clusters, based on which the dendrogram tree is cut off. The program saves the result in a text file cluster5.csv (rename the fieldname cluster to cluster5 for clarification). Repeat the cluster analysis by changing the option to NCL = 9, and save the result to cluster9.csv (rename the fieldname cluster to cluster9 for clarification). For instance, cluster 2 identified in the five-cluster scenario is further divided to clusters 2, 4, and 5 in the ninecluster scenario. Each cluster represents a social area. Step 2. Mapping factor patterns in ArcGIS: In ArcGIS, open the layer subdist, and join the text file factscore.csv to it based on the common key ref _ id. Map the field factor1 (“land use intensity”), factor2 (“neighborhood dynamics”), factor3 (“socioeconomic status”), and factor4 (“ethnicity”), as shown in Figures 7.6a–d. Export the expanded attribute table of subdist to a new file bjf4score. dbf, which contains the factor scores, and the fields ring (identifying zones) and sector (identifying sectors). It will be used for regression analysis in step 4. Step 3. Mapping social areas in ArcGIS: Similar to step 2, join both cluster9. csv and cluster5.csv to the layer subdist in ArcGIS, and map the social areas as shown in Figure 7.7. The five basic social areas are shown in different area patterns, and the nine detailed social areas are identified by their cluster numbers. For understanding the characteristics of each social area, use the “summarize” tool on the merged table in ArcGIS to compute the mean values of factor scores within each cluster. The results are reported in Table 7.5. The clusters are labeled by analyzing the factor scores and the locations relative to the city center.
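Alternatively, the same cluster means can be computed in SAS. The sketch below assumes a dataset bjclus (a hypothetical name) that joins the factor scores in factscore.csv with the cluster labels in cluster9.csv.

proc means data=bjclus mean;
   class cluster9;                 /* average factor scores within each of the nine clusters */
   var factor1-factor4;
run;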

FIGURE 7.6 Spatial patterns of factor scores in Beijing. [Panels (a)–(d) map the subdistrict scores of factor 1 (land use intensity), factor 2 (neighborhood dynamics), factor 3 (socioeconomic status), and factor 4 (ethnicity), each classified into five score ranges.]

Step 4. Testing spatial structure by regression with dummy explanatory variables: Regression models can be constructed to test whether the spatial pattern of a factor is better characterized as a zonal or sector model (Cadwallader, 1981). Based on the circular ring roads, Beijing is divided into four zones, coded by three dummy variables (x2, x3, and x4). Similarly, three additional dummy variables (y2, y3, and y4) are used to code the four sectors (NE, SE, SW, and NW). Table 7.6 shows how the zones and sectors are coded by the dummy variables. A simple linear regression model for testing the zonal structure can be written as Fi = b1 + b2 x2 + b3 x3 + b4 x4 ,

(7.8)

FIGURE 7.7 Social areas in Beijing. [Map labeling each subdistrict with its nine-cluster number; area patterns distinguish the five basic social areas (see Table 7.5 for the cluster labels).]

where Fi is the score of a factor (i = 1, 2, 3, and 4), the constant term b1 is the average factor score in zone 1 (when x2 = x3 = x4 = 0, also referred to as the reference zone), and the coefficient b2, b3, or b4 is the difference of average factor scores between zone 1 and zone 2, zone 3, or zone 4, respectively. Similarly, a regression model for testing the sector structure can be written as

$F_i = c_1 + c_2 y_2 + c_3 y_3 + c_4 y_4$,   (7.9)

where notations have interpretations similar to those in Equation 7.8. Based on the file bjf4score.dbf, one may use Excel or SAS to create and compute the dummy variables x2, x3, x4, y2, y3, and y4 according to Table 7.6, and run regression models in Equations 7.8 and 7.9. The results are presented in Table 7.7. A sample SAS program BJreg.sas is provided in the data folder for reference.
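The sample SAS program handles this, but the same dummy-variable regressions are easy to reproduce elsewhere. The sketch below is a minimal illustration, assuming bjf4score.dbf has been exported to a CSV file (the name bjf4score.csv is hypothetical) with the factor scores and the ring and sector fields coded 1–4.

```python
# Minimal sketch of the zonal (Eq. 7.8) and sector (Eq. 7.9) dummy-variable regressions.
# Assumes a hypothetical bjf4score.csv with columns factor1..factor4, ring (1-4), sector (1-4).
import numpy as np
import pandas as pd

df = pd.read_csv("bjf4score.csv")

def dummy_regression(y, group):
    """Regress y on dummies for groups 2-4; group 1 is the reference category."""
    X = np.column_stack([np.ones(len(y))] +
                        [(group == g).astype(float) for g in (2, 3, 4)])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r2 = 1 - ((y - X @ b) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return b, r2

for f in ["factor1", "factor2", "factor3", "factor4"]:
    _, r2_zone = dummy_regression(df[f].values, df["ring"].values)
    _, r2_sect = dummy_regression(df[f].values, df["sector"].values)
    print(f, "zonal R2:", round(r2_zone, 3), "sector R2:", round(r2_sect, 3))
```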

7.6 DISCUSSIONS AND SUMMARY

In Table 7.7, R2 indicates whether the zonal or sector model is a good fit, and an individual t statistic (in parenthesis) indicates whether a coefficient is statistically significant (i.e., whether a zone or a sector is significantly different from the reference zone


TABLE 7.5
Characteristics of Social Areas (Clusters)

Five       Nine Clusters                                               No. of         Averages of Factor Scores
Clusters                                                               Subdistricts   Land Use     Neighborhood   Socioeconomic   Ethnicity
                                                                                      Intensity    Dynamics       Status
1          1. Suburban moderate-density                                21             −0.2060      0.6730         −0.6932         0.3583
2          2. Inner suburban moderate-income                           23             −0.4921      −0.5159        −0.0522         0.4143
           4. Inner city moderate-income                               22             0.8787       −0.1912        0.5541          0.1722
           5. Outer suburban moderate-income                           21             −0.8928      −0.8811        0.0449          −0.7247
3          3. Outer suburban manufacturing with high floating pop.     6              −1.4866      2.0667         0.3611          0.1847
           9. Outer suburban with highest floating population          1              0.1041       5.7968         −0.2505         −1.8765
4          7. Inner city high-income                                   2              0.7168       0.9615         5.1510          −0.8112
           8. Inner city ethnic enclave                                1              1.8731       −0.0147        1.8304          4.3598
5          6. Inner city low-income                                    10             2.0570       0.0335         −1.1423         −0.7591

TABLE 7.6
Zones and Sectors Coded by Dummy Variables

Zones: Index and Location        Codes                     Sectors: Index and Location   Codes
1. Inside 2nd Ring               x2 = x3 = x4 = 0          1. NE                         y2 = y3 = y4 = 0
2. Between 2nd and 3rd Rings     x2 = 1, x3 = x4 = 0       2. SE                         y2 = 1, y3 = y4 = 0
3. Between 3rd and 4th Rings     x3 = 1, x2 = x4 = 0       3. SW                         y3 = 1, y2 = y4 = 0
4. Outside 4th Ring              x4 = 1, x2 = x3 = 0       4. NW                         y4 = 1, y2 = y3 = 0

or sector). Clearly, the land use intensity pattern fits the zonal model well, and the negative coefficients b2, b3, and b4 are all statistically significant and indicate declining land use intensity from inner to outer zones. The neighborhood dynamics pattern is better characterized by the sector model, and the positive coefficient c4 (statistically significant) confirms high portions of floating population in the northwest sector of Beijing. The socioeconomic status factor displays both zonal and sector patterns, but a stronger sector structure. The negative coefficients b3 and b4 (statistically significant) in the socioeconomic status model imply that factor scores decline toward the third zone and fourth zone; and the positive coefficient c3 (statistically significant) indicates higher factor scores in the southwest sector, mainly because


TABLE 7.7
Regressions for Testing Zonal versus Sector Structures (n = 107)

                        Land Use Intensity      Neighborhood Dynamics    Socioeconomic Status     Ethnicity
Zonal model    b1       1.2980*** (12.07)       −0.1365 (−0.72)          0.4861** (2.63)          −0.0992 (−0.51)
               b2       −1.2145*** (−7.98)      0.0512 (0.19)            −0.4089 (−1.57)          0.1522 (0.56)
               b3       −1.8009*** (−11.61)     −0.0223 (−0.08)          −0.8408** (−3.16)        −0.0308 (−0.11)
               b4       −2.1810*** (−14.47)     0.4923 (1.84)            −0.7125** (−2.75)        0.2596 (0.96)
               R2       0.697                   0.046                    0.105                    0.014
Sector model   c1       0.1929 (1.14)           −0.3803** (−2.88)        −0.3833** (−2.70)        −0.2206 (−1.32)
               c2       −0.1763 (−0.59)         −0.3511 (−1.52)          0.4990* (2.01)           0.6029* (2.06)
               c3       −0.2553 (−0.86)         0.0212 (0.09)            1.6074*** (6.47)         0.4609 (1.58)
               c4       −0.3499 (−1.49)         1.2184*** (6.65)         0.1369 (0.69)            0.1452 (0.63)
               R2       0.022                   0.406                    0.313                    0.051

Note: Values in parentheses indicate t values; *Significant at 0.05; **Significant at 0.01; ***Significant at 0.001. Results in bold indicate models with an overall statistical significance.

of two high-income subdistricts in Xuanwu District. The ethnicity factor does not conform to either the zonal or sector model. Ethnic enclaves scatter citywide, and may be best characterized by a multiple nuclei model.

Land use intensity is clearly the primary factor forming the concentric social spatial structure in Beijing. From the inner city (clusters 4, 6, 8, and 9) to inner suburbs (clusters 1 and 2) and to remote suburbs (clusters 3, 5, and 7), population densities as well as densities of public services, offices, and retail declined along with land prices. The neighborhood dynamics factor, mainly the influence of floating population, is the second factor shaping the formation of social areas. Migrants are attracted to economic opportunities in the fast-growing Haidian District (cluster 1) and manufacturing jobs in Shijingshan District (cluster 3). The effects of the third factor (socioeconomic status) can be found in the emergence of the high-income areas in two inner city subdistricts (cluster 8), and the differentiation between middle-income (cluster 1) and low-income areas in suburbs (clusters 2, 3, and 5). The fourth factor of ethnicity does not come into play until the number of clusters is expanded to nine.

In Western cities, the socioeconomic status construct is a dominant force in forming a sector pattern, along with the family structure construct featuring a zonal pattern and the ethnicity construct exhibiting a multi-nuclei pattern. In Beijing, the factors of socioeconomic status and ethnicity remain effective but move to less prominent roles, and the family status factor is almost absent.

Census data and corresponding spatial data (e.g., TIGER files in the US) are conveniently available for almost any city in developed countries, and implementing social area analysis in these cities is fairly easy. However, reliable data sources are often a large obstacle for social area studies in cities in developing countries, and future studies can certainly benefit from data of better quality, that is, data with more socioeconomic, demographic, and housing variables and in smaller geographic units.


APPENDIX 7: DISCRIMINANT FUNCTION ANALYSIS

Certain categorical objects bear some characteristics, each of which can be measured in a quantitative way. The goal of discriminant function analysis (DFA) is to find a linear function of the characteristic variables and use the function to classify future observations into the above known categories. DFA is different from cluster analysis, in which categories are unknown. For example, we know that females and males bear different bone structures. Now some body remnants are found, and we can identify the genders of the bodies by DFA.

Here we use a two-category example to illustrate the concept. Say, we have two types of objects, A and B, measured by p characteristic variables. The first type has m observations, and the second type has n observations. In other words, the observed data are XijA (i = 1, 2, …, m; j = 1, 2, …, p) and XijB (i = 1, 2, …, n; j = 1, 2, …, p). The objective is to find a discriminant function R such that

$R = \sum_{k=1}^{p} c_k X_k - R_0$,   (A7.1)

where ck (k = 1, 2, …, p) and R0 are constants. After substituting all m observations XijA into R, we have m values of R(A). Similarly, we have n values of R(B). The R(A)'s have a statistical distribution, and so do the R(B)'s. The goal is to find a function R such that the distributions of R(A) and R(B) are as far apart from each other as possible. This goal is met by two conditions:

1. The mean difference $Q = \bar{R}(A) - \bar{R}(B)$ is maximized.
2. The variances $F = S_A^2 + S_B^2$ are minimized (i.e., with narrow bands of distribution curves).

That is equivalent to maximizing V = Q/F by selecting the coefficients ck. Once the ck's are obtained, we simply use the pooled average of the estimated means of R(A) and R(B) to represent R0:

$R_0 = \dfrac{m\bar{R}(A) + n\bar{R}(B)}{m + n}$.   (A7.2)

For any given sample, we can calculate its R value, and compare it to R0. If it is greater than R0, it is category A; otherwise, it is category B. DFA is implemented in PROC DISCRIM, or other procedures such as PROC STEPDISC or PROC CANDISC in SAS.
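As an illustration only (the chapter's implementation is SAS PROC DISCRIM), the two-group problem above has a classical closed-form answer, Fisher's linear discriminant, in which the coefficients ck are proportional to the inverse pooled covariance matrix times the difference of the two group mean vectors. The sketch below assumes two numpy arrays XA and XB holding the m × p and n × p observations.

```python
# Minimal sketch of a two-group linear discriminant (Fisher's solution), following
# the notation of Appendix 7; the book itself relies on SAS PROC DISCRIM.
import numpy as np

def two_group_discriminant(XA, XB):
    """XA: (m, p) observations of category A; XB: (n, p) observations of category B."""
    m, n = len(XA), len(XB)
    mean_a, mean_b = XA.mean(axis=0), XB.mean(axis=0)
    # Pooled within-group covariance matrix
    S = (np.cov(XA, rowvar=False) * (m - 1) +
         np.cov(XB, rowvar=False) * (n - 1)) / (m + n - 2)
    c = np.linalg.solve(S, mean_a - mean_b)            # discriminant coefficients c_k
    Ra, Rb = XA @ c, XB @ c                            # scores of the two samples
    R0 = (m * Ra.mean() + n * Rb.mean()) / (m + n)     # cutoff, Equation A7.2
    return c, R0

def classify(x, c, R0):
    """Assign a new observation x to category A if its score exceeds R0, else to B."""
    return "A" if x @ c > R0 else "B"
```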

8 Spatial Statistics and Applications

Spatial statistics analyzes the pattern, process, and relationship in spatial (geographic) data. Although built upon statistical concepts and methods, spatial statistics has some unique capacities that are beyond regular statistics. Some major spatial statistical concepts were raised several decades ago, but related applications were initially limited because of their requirements of intensive computation. Recent advancements in GIS and the development of several free packages on spatial statistics have stimulated greater interest and wider usage. This chapter serves as an overview of three fundamental tasks in spatial statistics: measuring geographic distributions, spatial cluster analysis, and spatial regression models.

Measuring geographic distributions intends to capture the characteristics of a feature's distribution such as its center, compactness, and orientation by some descriptive statistics. For example, by mapping the historical trend of the population center in the US (http://www.census.gov/geo/www/cenpop/MeanCenter.html), one can tell that the population in the west has grown faster than the rest of the country and, to a lesser degree, the south has outpaced the north. By drawing the standard distances for various crimes in a city, we may detect that one type of crime tends to be more geographically confined than others. Based on the ellipses of an ethnic group's settlement in different eras, historians can examine the migration trend and identify whether the trend is related to a transportation route or some environmental barriers along certain directions. The techniques computing the center, standard distance, and ellipse are also referred to as "centrographic measures." The centrographic measures are intuitive and easy to interpret, but merely descriptive and do not put any statistical significance on a revealed pattern.

Spatial cluster analysis detects unusual concentrations or nonrandomness of events in space with a rigorous statistical test. One type of spatial cluster analysis focuses on the pattern of feature locations (usually discrete features) only, and another examines feature values (usually for analyzing contiguous areas). The former requires exact locations of individual occurrences, whereas the latter uses aggregated rates in areas. Therefore, they are also referred to as point-based and area-based cluster analysis, respectively. Data availability dictates which methods are used. The common belief that point-based methods are better than area-based methods is not well grounded (Oden et al., 1996).

Two application fields utilize spatial cluster analysis extensively. In crime studies, it is often referred to as "hot spot" analysis. Concentrations of criminal activities or hot spots in certain areas may be caused by (1) particular activities such as drug trading (e.g., Weisburd and Green, 1995), (2) specific land uses such as skid row areas and bars, or (3) interaction between activities and land uses such as thefts at bus stops and transit stations (e.g., Block and Block, 1995). Identifying hot spots is useful for police and crime prevention units to target their efforts on limited areas. Health-related research is another field with wide usage of spatial cluster analysis. Does the disease exhibit any spatial clustering pattern? What areas experience a high or low prevalence of disease? Elevated disease rates in some areas may arise simply by chance alone and thus have no public health significance. The pattern generally warrants study only when it is statistically significant (Jacquez, 1998). Spatial cluster analysis is an essential and effective first step in any exploratory investigation. If the spatial cluster patterns of a disease do exist, case–control, retrospective cohort, and other observational studies can follow up.

Nonrandomness of events in spatial distribution indicates the existence of spatial autocorrelation, a common issue encountered in analysis of geographic data. Spatial autocorrelation violates the assumption of independent observations in ordinary least squares (OLS) regression discussed in Chapter 6, and necessitates the usage of spatial regression models. Spatial regression models include global models such as the spatial lag and spatial error models with constant coefficients for explanatory variables, and local models such as geographically weighted regression with coefficients varying across a study area.

This chapter begins with a discussion of centrographic measures in Section 8.1, followed by a case study of racial–ethnic distributions in Chicago in Section 8.2. Section 8.3 examines spatial cluster analysis based on feature locations, followed by a case study of Zhuang place names (or toponyms) in southern China in Section 8.4. Section 8.5 covers spatial cluster analysis based on feature values, and Section 8.6 introduces spatial regression. A case study of homicide patterns in Chicago in Section 8.7 illustrates both spatial cluster analysis based on feature values and spatial regression. The chapter is concluded by a brief summary in Section 8.8.

8.1 THE CENTROGRAPHIC MEASURES

Similar to the mean in regular statistics that represents the average value of observations, the mean center may be interpreted as the average location of a set of points. As a location has x and y coordinates, the mean center's coordinates $(\bar{X}, \bar{Y})$ are the average x coordinate and average y coordinate of all points $(X_i, Y_i)$ for i = 1, 2, …, n, respectively, such as

$\bar{X} = \sum_i X_i / n; \quad \bar{Y} = \sum_i Y_i / n$.

If the points are weighted by an attribute variable wi, the weighted mean center has coordinates such as

$\bar{X} = \dfrac{\sum_i (w_i X_i)}{\sum_i w_i}; \quad \bar{Y} = \dfrac{\sum_i (w_i Y_i)}{\sum_i w_i}$.


Two other measures of the center are used less often: median center and central feature. Median center is the location having the shortest total distance to all points. Central feature is the point (among all input points) that has the shortest total distance to all other points. Median center or central feature finds the location that is the most accessible. The difference is that the central feature must be one of the input points and the median center does not need to be. Computing the central feature is straightforward by selecting the point with the lowest total distance from others. However, the location of the median center can only be found approximately by experimenting with locations around the mean center and identifying the one with the lowest total distance from all points (Mitchell, 2005). In ArcToolbox, the three center measures are available under Spatial Statistics Tools > Measuring Geographic Distributions. If lines or areas are used as input features, they are represented by the coordinates of the center points of the lines or the centroids of the areas, respectively.

Similar to the standard deviation (SD) in regular statistics that captures the degree of total variations of observations from the mean, the standard distance is the average difference in distance between the points and their mean center, such as

$SD = \sqrt{\sum_i (X_i - \bar{X})^2 / n + \sum_i (Y_i - \bar{Y})^2 / n}$.

For the weighted standard distance, it is written as

$SD_w = \sqrt{\dfrac{\sum_i w_i (X_i - \bar{X})^2}{\sum_i w_i} + \dfrac{\sum_i w_i (Y_i - \bar{Y})^2}{\sum_i w_i}}$.

In the graphic representation, it is a circle around the mean center with a radius equal to the standard distance. The larger the standard distance, the more widely dispersed the features are from the mean center. Therefore, the standard distance measures the compactness of the features. Similarly, lines and areas are represented by their centers and centroids, respectively, in calibrating their standard distances.

The ellipse further advances the standard distance measure by identifying the orientation of the features and the average differences in distance along the long axis and the short axis. In GIS, the orientation of the long axis (i.e., the new y-axis) is determined by rotating from 0° so that the sum of the squares of the distances between the features and the axis is minimal, and the short axis (i.e., the new x-axis) is then perpendicular to the derived long axis. The lengths of the two axes are twice the standard deviation along each axis, which is written as

$SD_x = \sqrt{\sum_i (X_i - \bar{X})^2 / n}$ and $SD_y = \sqrt{\sum_i (Y_i - \bar{Y})^2 / n}$.

Therefore, it is also termed the "standard deviational ellipse."


In the graphic representation, the ellipse indicates that the features are more dispersed along the long axis than the short axis from the mean center. In ArcToolbox, both the standard distance and the directional distribution (standard deviational ellipse) tools are also grouped under Spatial Statistics Tools > Measuring Geographic Distributions.
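To make these definitions concrete, the short sketch below computes the (weighted) mean center and standard distance directly from coordinate arrays; it only restates the formulas above in code and is not a substitute for the ArcGIS tools. The sample coordinates and weights are hypothetical.

```python
# Minimal numpy sketch of the centrographic measures defined above.
import numpy as np

def mean_center(x, y, w=None):
    """(Weighted) mean center of points with coordinates x, y and optional weights w."""
    w = np.ones_like(x) if w is None else w
    return np.sum(w * x) / np.sum(w), np.sum(w * y) / np.sum(w)

def standard_distance(x, y, w=None):
    """(Weighted) standard distance around the (weighted) mean center."""
    w = np.ones_like(x) if w is None else w
    xc, yc = mean_center(x, y, w)
    return np.sqrt(np.sum(w * (x - xc) ** 2) / np.sum(w) +
                   np.sum(w * (y - yc) ** 2) / np.sum(w))

# Hypothetical point coordinates and weights (e.g., tract centroids and populations)
x = np.array([1.0, 2.0, 4.0, 7.0])
y = np.array([3.0, 5.0, 2.0, 6.0])
pop = np.array([120.0, 80.0, 200.0, 50.0])
print(mean_center(x, y, pop), standard_distance(x, y, pop))
```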

8.2 CASE STUDY 8A: MEASURING GEOGRAPHIC DISTRIBUTIONS OF RACIAL–ETHNIC GROUPS IN CHICAGO URBAN AREA

This case study illustrates the application of several centrographic measures in analyzing the geographic distribution of racial–ethnic groups in Chicago Urban Area. Similar to Case Study 6, the study area is the core six-county (Cook, DuPage, Kane, Lake, McHenry, and Will) area in the Chicago metropolitan area at the census tract level. What is different from Case Study 6 is that the data are based on the 2010 Census, specifically the 2010 Census Demographic Profile (County and Census Tract) as discussed in Section 1.3. The 2010 census tract feature Trt2010 in geodatabase ChiUrArea.gdb for the study area is provided. It contains the following fields of interest to this case study: total population and population counts of major racial–ethnic groups* (e.g., DP0010001 for total population, DP0090001 for White, DP0090002 for Black, DP0090004 for Asian, and DP0100002 for Hispanic).

Step 1. Generating the mean centers: Activate ArcMap and add Trt2010 to the project. Access the tool by choosing ArcToolbox > Spatial Statistics Tools > Measuring Geographic Distributions > Mean Center. In the dialog window, select Trt2010 as Input Feature Class; navigate to the project folder and name the Output Feature Class as White_Center; select DP0090001 for the Weight Field; and click OK. The result is the mean center for White. Repeat the process to obtain the mean centers for Black, Asian, and Hispanic, and save them as Black_Center, Asian_Center, and Hispanic_Center, respectively. With reference to the location of White's mean center, the Black's is the farthest away to the southeast, the Asian's is slightly to the northeast, and the Hispanic's is slightly to the southeast.

Step 2. Generating the standard distances: Similarly, use ArcToolbox > Spatial Statistics Tools > Measuring Geographic Distributions > Standard Distance. Entries for fields in the dialog window are similar to step 1 except for naming the Output Feature Classes as White_StdDist, Black_StdDist, Asian_StdDist, and Hispanic_StdDist for the standard distances for White, Black, Asian, and Hispanic, respectively. The descending order of standard distance sizes is White > Hispanic > Asian > Black. In other words, Blacks are most spatially concentrated, followed by Asians, Hispanics, and Whites.

In the study area, White, Black, and Asian account for 65.7%, 18.7%, and 7.0% of total population, respectively. Either American Indian and Alaska Native or Native Hawaiian and Other Pacific Islander accounts for Analyzing Patterns > Average Nearest Neighbor.

8.3.2 Tests for Local Clusters Based on Feature Locations

For many applications, it is also important to identify cluster locations or local clusters. Even when a global clustering test does not reveal the presence of overall clustering in a study region, there may be some places exhibiting local clusters. The geographical analysis machine (GAM) developed by Openshaw et al. (1987) first generates grid points in a study region, then draws circles of various radii around each grid point, and finally searches for circles containing a significantly high prevalence of cases. One shortcoming of the GAM method is that it tends to generate a high percentage of “false positive” circles (Fotheringham and Zhan, 1996). Since


many significant circles overlap and contain the same cluster of cases, the Poisson tests that determine each circle's significance are not independent, and thus lead to the problem of multiple testing.

The test by Besag and Newell (1991) only searches for clusters around cases. Say, k is the minimum number of cases needed to constitute a cluster. The method identifies the areas that contain the k − 1 nearest cases (excluding the centroid case), then analyzes whether the total number of cases in these areas* is large relative to the total risk population. Common values for k are between 3 and 6 and may be chosen based on sensitivity analysis using different k. As in the GAM, clusters identified by Besag and Newell's test often appear as overlapping circles. But the method is less likely to identify false positive circles than the GAM, and is also less computationally intensive (Cromley and McLafferty, 2002, p. 153). Other point-based spatial cluster analysis methods include that of Rushton and Lolonis (1996), among others.

The following discusses the spatial scan statistic by Kulldorff (1997), implemented in SaTScan. SaTScan is a free software program developed by Kulldorff and Information Management Services Inc., available at http://www.satscan.org (current version 9.1.1). Its main usage is to evaluate reported spatial or space–time disease clusters and to see if they are statistically significant. Like the GAM, the spatial scan statistic uses a circular scan window to search the entire study region, but takes into account the problem of multiple testing. The radius of the window varies continuously in size from 0% to 50% of the total population at risk. For each circle, the method computes the likelihood that the risk of disease is higher inside the window compared to outside the window.

The spatial scan statistic uses either a Poisson-based model or a Bernoulli model to assess statistical significance. When the risk (base) population is available as aggregated area data, the Poisson-based model is used, and it requires case and population counts by areal units and the geographic coordinates of the points. When binary event data for case–control studies are available, the Bernoulli model is used, and it requires the geographic coordinates of all individuals. The cases are coded as ones and controls as zeros. For instance, under the Bernoulli model, the likelihood function for a specific window z is

$L(z, p, q) = p^c (1-p)^{n-c} q^{C-c} (1-q)^{(N-n)-(C-c)}$,   (8.1)

where N is the combined total number of cases and controls in the study region, n is the combined total number of cases and controls in the window, C is the total number of cases in the study region, c is the number of cases in the window, p = c/n (probability of being a case within the window), and q = (C − c)/(N − n) (probability of being a case outside the window). The likelihood function is maximized over all windows, and the “most likely” cluster is the one that is least likely to have occurred by chance. The likelihood ratio for the window is reported and constitutes the maximum likelihood ratio test *

The number may be slightly larger than k since the last (farthest) area among those nearest areas may contain more than one case.


statistic. Its distribution under the null hypothesis and its corresponding p-value are determined by a Monte Carlo simulation approach (see Chapter 12). The method also detects secondary clusters with the highest likelihood function for a particular window that does not overlap with the most likely cluster or other secondary clusters.
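The Bernoulli likelihood in Equation 8.1 is simple enough to evaluate directly. The sketch below computes the log-likelihood ratio of one candidate window against the constant-risk (null) likelihood; it is a bare-bones illustration of the quantity that SaTScan maximizes over all windows, not a re-implementation of SaTScan, and the example counts are hypothetical.

```python
# Minimal sketch: Bernoulli log-likelihood ratio for one scan window (Eq. 8.1).
# c, n: cases and total points (cases + controls) inside the window;
# C, N: cases and total points in the whole study region.
import numpy as np

def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * np.log(y)

def bernoulli_llr(c, n, C, N):
    p = c / n                          # probability of being a case inside the window
    q = (C - c) / (N - n)              # probability of being a case outside the window
    loglik = (xlogy(c, p) + xlogy(n - c, 1 - p) +
              xlogy(C - c, q) + xlogy((N - n) - (C - c), 1 - q))
    p0 = C / N                         # constant risk everywhere under the null
    loglik0 = xlogy(C, p0) + xlogy(N - C, 1 - p0)
    return loglik - loglik0

# Hypothetical window: 30 cases among 50 points, in a region of 100 cases among 400 points
print(bernoulli_llr(c=30, n=50, C=100, N=400))
```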

8.4 CASE STUDY 8B: SPATIAL CLUSTER ANALYSIS OF PLACE NAMES IN GUANGXI, CHINA

This project extends the toponymical study of Zhuang in Guangxi, China, introduced in Section 3.3 in Chapter 3, which uses spatial smoothing and interpolation techniques to map the relative concentrations of Zhuang place names (toponyms). Mapping is merely descriptive, and cannot identify whether concentrations of Zhuang place names in some areas are random or statistically significant. The answer lies in rigorous statistical analysis, in this case, spatial cluster analysis. The software SaTScan is used to implement the study. The project uses the same data sets as in Case Study 3A: mainly the point feature Twnshp in geodatabase GX.gdb, where the field Zhuang identifies whether a place name is Zhuang (=1) or non-Zhuang (=0).

Step 1. Preparing data in ArcGIS for SaTScan: Implementing the Bernoulli model in SaTScan requires three data files: a case file (containing location id and the number of cases in each location), a control file (containing location id and the number of controls in each location), and a coordinate file (containing location id and Cartesian coordinates or latitude and longitude). The three files can be extracted in the SaTScan Import Wizard after all attributes are defined in ArcGIS. In the attribute table of Twnshp, the field Zhuang already defines the case number (=1) for each location and thus the case file. For defining the control file, open the attribute table of Twnshp in ArcMap, add a new field NonZhuang, and calculate it as NonZhuang = 1 − Zhuang. Right click the layer Twnshp > Data > Export Data and save it as a shapefile TwnshpPt. For defining the coordinates file, use ArcToolbox > Data Management Tools > Features > Add XY Coordinates. In the dialog window, select TwnshpPt as the Input Features and click OK to execute the tool. Two new fields POINT_X and POINT_Y are added to the attribute table of TwnshpPt (i.e., in the dBase file TwnshpPt.dbf).

Step 2. Executing spatial cluster analysis in SaTScan: Activate SaTScan, and choose Create New Session. A "New Session" dialog window is shown in Figure 8.2. Under the first tab "Input," for "Case File," click the second input tool "Import case file" (as highlighted in the left window in Figure 8.2), and select TwnshpPt.dbf as the input file > In the "Import Wizard" window (as shown in the right window in Figure 8.2), choose "Bernoulli model" in the top bar; under Source File Variable, choose OBJECTID for "Location ID" and Zhuang for "Number of Cases"; click "Next" at the bottom > click "Execute" to close the Import Wizard dialog. Still under the tab "Input," this time for "Control File," follow similar steps to use the Import Wizard for defining the Control File. The only difference is to choose NonZhuang for "Number of Controls."

FIGURE 8.2 SaTScan dialog windows for point-based spatial cluster analysis.

For the Coordinates File, also follow similar steps. Note that in the Import Wizard window, select "Cartesian (x,y) Coordinates" in the top bar, and choose OBJECTID for "Location ID," POINT_X for "X" and POINT_Y for "Y" under Source File Variable. Under the second tab "Analysis," the default settings are okay ("Purely Spatial" under "Type of Analysis," "Bernoulli" under "Probability Model," and "High Rates" under "Scan For Areas With"). Under the third tab "Output," browse to your project folder and name the Results File as ZhuangCls, and check all five boxes under "dBase." Finally, click the Execute Session button under the main menu to run the program. Results are saved in various dBase files sharing the file name ZhuangCls, among which ZhuangCls.gis.dbf is most useful and will be used in the next step.

Step 3. Mapping the spatial cluster of Zhuang place names: In ArcMap, add the dBase file ZhuangCls.gis.dbf to the project. Its field CLUSTER identifies whether a place is included in a cluster. Here only one (primary) cluster is identified with its value being 1 (being "null" for those not included in the cluster).* In order to facilitate the attribute join, add a new field OBJECTID (defined as Long Integer) and calculate it as OBJECTID = LOC_ID.† Join the file ZhuangCls.gis.dbf to the attribute table of TwnshpPt using the common key OBJECTID. Figure 8.3 uses distinctive symbols to highlight the place names in the cluster and those not in the cluster. The circle is drawn manually to show the approximate extent of the cluster.

* If more clusters exist, the value would be 2 for the secondary cluster, 3 for the third, and so on.
† SaTScan imports the location ID (i.e., LOC_ID) from the source file's field OBJECTID (defined as integer) but saves it as a text string (though the values are unchanged). The attribute join in ArcGIS requires the common keys to have the same data type definition.



FIGURE 8.3 A spatial cluster of Zhuang place names in Guangxi, China.

One may refer back to Figure 3.3 showing the distribution of Zhuang and non-Zhuang place names in the study area. The spatial cluster analysis confirms that the major concentration of Zhuang place names is in the west.

8.5 SPATIAL CLUSTER ANALYSIS BASED ON FEATURE VALUES

This section first discusses various ways of defining spatial weights, and then introduces two types of spatial statistic indices available in ArcGIS. Spatial cluster analysis methods based on feature values include tests for global clustering and corresponding tests for local clusters. The former are usually developed earlier than the latter. Other methods include Rogerson’s (1999) R statistic (a spatial version of the well-known chi-square goodness-of-fit statistic) and others.

8.5.1 Defining Spatial Weights

Spatial cluster analysis methods based on feature values utilize a spatial weights matrix to define spatial relationships of observations. Defining spatial weights can be based on distance (d):

1. Inverse distance (1/d)
2. Inverse distance squared (1/d²)
3. Distance band (= 1 within a specified critical distance and = 0 outside of the distance)
4. A continuous weighting function of distance such as $w_{ij} = \exp(-d_{ij}^2 / h^2)$, where dij is the distance between areas i and j, and h is referred to as the bandwidth (Fotheringham et al., 2000, p. 111). The bandwidth determines the importance of distance, that is, a larger h corresponds to a larger sphere of influence around each area.

Defining spatial weights can also be based on polygon contiguity (see Appendix 1), where wij = 1 if the area j is adjacent to i, and 0 otherwise.

All the above methods of defining spatial weights can be incorporated in the Spatial Statistics tools in ArcGIS. In particular, it is defined at the stage of "Conceptualization of Spatial Relationships," which provides the options of "Inverse Distance," "Inverse Distance Squared," "Fixed Distance Band," "Zone of Indifference," "Contiguity Edges Only" (i.e., rook contiguity), "Contiguity Edges Corners" (i.e., queen contiguity), and "Get Spatial Weights from File." See Figure 8.4. All methods that are based on distance use the geometric centroids to represent areas, and the distances are defined as either Euclidean or Manhattan distances. If a weights matrix file is used, it should contain three fields: from feature ID, to feature ID, and weight.

FIGURE 8.4 ArcGIS dialog window for computing Getis–Ord General G.
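As a concrete illustration of these definitions (not of the ArcGIS dialog), the sketch below builds an inverse-distance weight matrix from centroid coordinates and uses it to compute a row-standardized spatial lag, the quantity defined later in Equation 8.3. The coordinates and attribute values are hypothetical.

```python
# Minimal sketch: inverse-distance spatial weights and a row-standardized spatial lag.
import numpy as np

def inverse_distance_weights(xy, band=None):
    """xy: (n, 2) array of area centroids. Returns w_ij = 1/d_ij with a zero diagonal,
    optionally set to 0 beyond a critical distance band."""
    d = np.sqrt(((xy[:, None, :] - xy[None, :, :]) ** 2).sum(axis=2))
    w = np.where(d > 0, 1.0 / np.where(d > 0, d, 1.0), 0.0)
    if band is not None:
        w[d > band] = 0.0
    return w

def spatial_lag(w, x):
    """Average value of x in neighboring areas, weighted by w (row-standardized)."""
    return (w @ np.asarray(x, dtype=float)) / w.sum(axis=1)

# Hypothetical centroids and attribute values for four areas
xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
x = [10.0, 12.0, 9.0, 20.0]
w = inverse_distance_weights(xy)
print(spatial_lag(w, x))
```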

8.5.2 Tests for Global Clustering Based on Feature Values

Moran’s I statistic (Moran, 1950) is one of the oldest indicators that detect global clustering (Cliff and Ord, 1973). It detects whether nearby areas have similar or dissimilar attributes overall, that is, positive or negative spatial autocorrelation respectively. Moran’s I is calculated as:


$I = \dfrac{N \sum_i \sum_j w_{ij} (x_i - \bar{x})(x_j - \bar{x})}{\left(\sum_i \sum_j w_{ij}\right) \sum_i (x_i - \bar{x})^2}$,   (8.2)

where N is the total number of areas, wij is the spatial weight linking areas i and j, xi and xj are the attribute values for areas i and j, respectively, and $\bar{x}$ is the mean of the attribute values. It is helpful to interpret Moran's I as the correlation coefficient between a variable and its spatial lag. The spatial lag for variable x is the average value of x in neighboring areas j, defined as

$x_{i,-1} = \dfrac{\sum_j w_{ij} x_j}{\sum_j w_{ij}}$.   (8.3)

Therefore, Moran's I varies between −1 and +1. A value near +1 indicates that similar attributes are clustered (either high values near high values or low values near low values); and a value near −1 indicates that dissimilar attributes are clustered (either high values near low values or low values near high values). If a Moran's I is close to 0, it indicates a random pattern or absence of spatial autocorrelation.

Getis and Ord (1992) developed the General G statistic. General G is a multiplicative measure of overall spatial association of values that fall within a critical distance (d) of each other, defined as

$G(d) = \dfrac{\sum_i \sum_{j \neq i} w_{ij}(d)\, x_i x_j}{\sum_i \sum_{j \neq i} x_i x_j}$,   (8.4)

where xi and xj can only be positive variables. A large G value (a positive Z score) indicates clustering of high values. A small G (a negative Z score) indicates clustering of low values. The Z score value is a measure of statistical significance. For example, a Z score larger than 1.96, 2.58, and 3.30 (or smaller than −1.96, −2.58, and −3.30) corresponds to a significance level at 0.05, 0.01, and 0.001, respectively. A Z score near 0 indicates no apparent clustering within the study area.

In addition, Geary's C (Geary, 1954) also detects global clustering. Unlike Moran's I using the cross product of the deviations from the mean, Geary's C uses the deviations in intensities of each observation with one another. It is defined as

$C = \dfrac{(N - 1) \sum_i \sum_j w_{ij} (x_i - x_j)^2}{2 \left(\sum_i \sum_j w_{ij}\right) \sum_i (x_i - \bar{x})^2}$.   (8.5)

The values of Geary’s C typically vary between 0 and 2 (although 2 is not a strict upper limit), with C = 1 indicating that all values are spatially independent from


each other. Values between 0 and 1 typically indicate positive spatial autocorrelation, while values between 1 and 2 indicate negative spatial autocorrelation, and thus Geary's C is inversely related to Moran's I.

For either the Getis–Ord General G or the Moran's I, the statistical test is a normal z test such as $z = (\text{index} - \text{expected}) / \sqrt{\text{variance}}$. If z is larger than 1.96 (critical value), it is statistically significant at the 0.05 (5%) level; if z is larger than 2.58, it is statistically significant at the 0.01 (1%) level; and if z is larger than 3.29, it is statistically significant at the 0.001 (0.1%) level. For a general overview of significance testing for Moran's I, G(d), and Geary's C, interested readers may refer to Wong and Lee (2005). For detailed theoretical explanations, readers should consult Cliff and Ord (1973) and Getis and Ord (1992).

In ArcToolbox, choose Spatial Statistics Tools > Analyzing Patterns. The tools Spatial Autocorrelation (Moran's I) and High–Low Clustering (Getis–Ord General G) implement the computations of Moran's I and Getis–Ord General G, respectively. GeoDa and CrimeStat also have similar tools.
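To see Equation 8.2 in action outside ArcGIS, the sketch below computes global Moran's I from a weight matrix and adds a simple permutation test; the permutation (pseudo p-value) approach is an addition here for illustration and is not the normal z test that the ArcGIS tool reports. The weight matrix and values are hypothetical.

```python
# Minimal sketch: global Moran's I (Eq. 8.2) with a random-permutation pseudo p-value.
import numpy as np

def morans_i(w, x):
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    num = (w * np.outer(z, z)).sum()       # sum_i sum_j w_ij (x_i - xbar)(x_j - xbar)
    return len(x) * num / (w.sum() * (z ** 2).sum())

def moran_permutation(w, x, n_perm=999, seed=0):
    """Observed I and a pseudo p-value from randomly reassigning values to areas."""
    rng = np.random.default_rng(seed)
    obs = morans_i(w, x)
    sims = np.array([morans_i(w, rng.permutation(np.asarray(x, dtype=float)))
                     for _ in range(n_perm)])
    extreme = (sims >= obs).sum() if obs >= sims.mean() else (sims <= obs).sum()
    return obs, (extreme + 1) / (n_perm + 1)

# Hypothetical rook-contiguity weights for a 2 x 2 grid of areas, and attribute values
w = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
print(moran_permutation(w, [10.0, 12.0, 9.0, 11.0]))
```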

8.5.3 Tests for Local Clusters Based on Feature Values

Anselin (1995) proposed a local Moran index or Local Indicator of Spatial Association (LISA) to capture local pockets of instability or local clusters. The local Moran index for an area i measures the association between a value at i and values of its nearby areas, defined as

$I_i = \dfrac{x_i - \bar{x}}{s_x^2} \sum_j [w_{ij}(x_j - \bar{x})]$,   (8.6)

where $s_x^2 = \sum_j (x_j - \bar{x})^2 / n$ is the variance, and other notations are the same as in (8.2). Note that the summation over j does not include the area i itself, that is, j ≠ i. A positive Ii means either a high value surrounded by high values (high–high) or a low value surrounded by low values (low–low). A negative Ii means either a low value surrounded by high values (low–high) or a high value surrounded by low values (high–low).

Similarly, Getis and Ord (1992) developed the Gi statistic, a local version of the global or General G statistic, to identify local clusters with statistically significant high or low attribute values. The Gi statistic is written as

$G_i = \dfrac{\sum_j (w_{ij} x_j)}{\sum_j x_j}$,   (8.7)

where the summations over j may or may not include i. When the target feature (i) is not included, it is the Gi statistic, which captures the effect of the target feature on its surrounding ones. When the target feature (i) is included, it is called the Gi∗ statistic, which detects hot spots or cold spots. ArcGIS computes Gi∗ statistic. A high Gi value


indicates that high values tend to be near each other; and a low Gi value indicates that low values tend to be near each other. The Gi statistic can also be used for spatial filtering in regression analysis (Getis and Griffith, 2002), as discussed in Appendix 8. For an overview of significance testing for the local Moran’s I and local Gi, one may refer to Wong and Lee (2005). For in-depth theoretical formulations of the tests, please refer to Anselin (1995) and Getis and Ord (1992). The tools are available in ArcToolbox > Spatial Statistics Tools > Mapping Clusters > Cluster and Outlier Analysis (Anselin Local Morans I) for computing the local Moran’s I, or Hot Spot Analysis (Getis–Ord Gi∗) for computing the local Gi∗. In analysis of disease or crime risks, it may be interesting to focus only on local concentrations of high rates or the high–high areas. In some applications, all four types of associations (high–high, low–low, high–low, and low–high) have important implications. For example, Shen (1994, p. 177) used the Moran’s I to test two hypotheses on the impact of growth-control policies in San Francisco area. The first is that residents who are not able to settle in communities with growth-control policies would find the second best choice in a nearby area, and consequently, areas of population loss (or very slow growth) would be close to areas of population growth. This leads to a negative spatial autocorrelation. The second is related to the so-called NIMBY (Not In My Backyard) symptom. In this case, growth-control communities tend to cluster together; so do the pro-growth communities. This leads to a positive spatial autocorrelation.
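A local counterpart of the earlier global sketch is given below: it evaluates Equation 8.6 for every area and labels the four association types (high–high, low–low, high–low, low–high) from the signs of the deviation and its spatial lag. The conditional permutation test used by ArcGIS and GeoDa to assess significance is omitted for brevity, and the example data are hypothetical.

```python
# Minimal sketch: local Moran's I_i (Eq. 8.6) and the four association types.
import numpy as np

def local_morans_i(w, x):
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    s2 = (z ** 2).sum() / len(x)            # variance as defined for Eq. 8.6
    w = np.array(w, dtype=float)
    np.fill_diagonal(w, 0.0)                # the summation over j excludes i
    return (z / s2) * (w @ z)

def association_type(w, x):
    """Label each area HH, LL, HL, or LH from the signs of z_i and its spatial lag."""
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    lag = (w @ z) / w.sum(axis=1)
    return np.where(z >= 0, np.where(lag >= 0, "HH", "HL"),
                            np.where(lag >= 0, "LH", "LL"))

w = [[0, 1, 1, 0],
     [1, 0, 0, 1],
     [1, 0, 0, 1],
     [0, 1, 1, 0]]
x = [10.0, 12.0, 9.0, 11.0]
print(local_morans_i(w, x), association_type(w, x))
```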

8.6 SPATIAL REGRESSION

8.6.1 Spatial Lag Model and Spatial Error Model

The spatial cluster analysis detects spatial autocorrelation, in which values of a variable are systematically related to geographic location. In the absence of spatial autocorrelation or spatial dependence, the ordinary least square (OLS) regression model can be used. It is expressed in matrix form

$y = X\beta + \varepsilon$,   (8.8)

where y is a vector of n observations of the dependent variable, X is an n × m matrix for n observations of m independent variables, β is a vector of m regression coefficients, and ε is a vector of n random errors or residuals, which are independently distributed with a mean of zero. When spatial dependence is present, the residuals are not independent from each other, and the OLS regression is no longer applicable. This section discusses two commonly used models of the maximum likelihood estimator.

The first is a spatial lag model (Baller et al., 2001) or spatially autoregressive model (Fotheringham et al., 2000, p. 167). The model includes the mean of the dependent variable in neighboring areas (i.e., spatial lag) as an extra explanatory variable. Denoting the weights matrix by W, the spatial lag of y is written as Wy. The element of W in the ith row and jth column is $w_{ij} / \sum_j w_{ij}$, as defined in Equation 8.3. The model is expressed as


$y = \rho W y + X\beta + \varepsilon$,   (8.9)

where ρ is the regression coefficient for the spatial lag, and other notations are the same as in Equation 8.8. Rearranging Equation 8.9 yields $(I - \rho W)y = X\beta + \varepsilon$. Assuming the matrix $(I - \rho W)$ is invertible, we have

$y = (I - \rho W)^{-1} X\beta + (I - \rho W)^{-1}\varepsilon$.   (8.10)

This reduced form shows that the value of yi at each location i is not only determined by xi at that location (like in the OLS regression model), but also by the xj at other locations through the spatial multiplier $(I - \rho W)^{-1}$ (not present in the OLS regression model). The model is also different from the autoregressive model in time-series analysis, and cannot be calibrated by the SAS procedures for time-series modeling such as AR or AMAR.

The second is a spatial error model (Baller et al., 2001) or spatial moving average model (Fotheringham et al., 2000, p. 169) or simultaneous autoregressive (SAR) model (Griffith and Amrhein, 1997, p. 276). Instead of treating the dependent variable as autoregressive, the model considers the error term as autoregressive. The model is expressed as

$y = X\beta + u$,   (8.11)

where u is related to its spatial lag such as

$u = \lambda W u + \varepsilon$,   (8.12)

where λ is the spatial autoregressive coefficient, and the second error term ε is independent. Solving Equation 8.12 for u and substituting into Equation 8.11 yield the reduced form

$y = X\beta + (I - \lambda W)^{-1} \varepsilon$.   (8.13)

This shows that the value of yi at each location i is affected by the stochastic errors εj at all other locations through the spatial multiplier $(I - \lambda W)^{-1}$. Estimation of either the spatial lag model in Equation 8.10 or the spatial error model in Equation 8.13 is implemented by the maximum likelihood (ML) method (Anselin and Bera, 1998). The case study in Section 8.7 illustrates how the spatial lag


and the spatial error models are implemented in GeoDa using the algorithms developed by Smirnov and Anselin (2001). The current GeoDa version is 1.4.6, available for free download (https://geodacenter.asu.edu/software/downloads). Anselin (1988) discusses the statistics to decide which model to use. The statistical diagnosis rarely suggests that one model is preferred to the other (Griffith and Amrhein, 1997, p. 277).
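The reduced form in Equation 8.10 also suggests a quick way to build intuition: simulate data with a known ρ and β and note how a plain OLS fit behaves. The sketch below does only that simulation with a hypothetical random contiguity structure; it is not a maximum likelihood estimator and is not how GeoDa calibrates the models.

```python
# Minimal sketch: simulate y from the spatial lag reduced form y = (I - rho W)^(-1) (X b + e).
import numpy as np

rng = np.random.default_rng(1)
n, rho, beta = 100, 0.5, np.array([2.0, -1.0])

# Hypothetical symmetric contiguity structure, then row-standardized weights W
a = rng.random((n, n)) < 0.05
A = np.triu(a, 1)
A = (A | A.T).astype(float)
W = A / np.maximum(A.sum(axis=1, keepdims=True), 1.0)

X = np.column_stack([np.ones(n), rng.normal(size=n)])
e = rng.normal(scale=0.5, size=n)
y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + e)   # reduced form, Eq. 8.10

# Naive OLS that ignores the spatial lag; its estimates are biased when rho != 0
b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print("true beta:", beta, "OLS estimate:", np.round(b_ols, 3))
```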

8.6.2 Geographically Weighted Regression

Geographically Weighted Regression (GWR) allows the regression coefficients to vary over space to investigate the spatial nonstationarity (Fotheringham et al., 2002). The model can be expressed as

$y = \beta_{0i} + \beta_{1i} x_1 + \beta_{2i} x_2 + \cdots + \beta_{mi} x_m + \varepsilon$,   (8.14)

where the β's have subscript i indicating that the coefficients vary with location i. GWR estimates the coefficients at each location such as

$\beta_i = (X' W_i X)^{-1} X' W_i Y$,   (8.15)

where βi is the coefficient set for location i, and Wi is the diagonal matrix with diagonal elements being the weights of observations for location i. Equation 8.15 represents weighted regressions run locally at every observation location using the weighted least squares approach. Usually a monotone decreasing function (e.g., a Gaussian kernel) of distance is used to assign weights to observations. Closer observations are weighted more than distant ones in each local regression.

GWR yields an improved R2 over OLS regression because it has more parameters to be estimated. This does not necessarily indicate a better model. The corrected Akaike's Information Criterion (AICc) index is used to measure the relative performance of a model since it builds in a tradeoff between the data fit of the model and the model's complexity (Joseph et al., 2012). A smaller AICc value indicates a better data fit and a less complicated model. In addition to removing the spatial autocorrelation of residuals, the major benefit of GWR is that it reports local coefficients, local t values, and local R2 values that can be mapped to examine their spatial variations.

In ArcGIS, GWR is accessed in ArcToolbox > Spatial Statistics Tools > Modeling Spatial Relationships > Geographically Weighted Regression. For those familiar with R, the package spgwr implements the GWR algorithm.
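Equation 8.15 amounts to a weighted least squares fit repeated at each location, with kernel weights that decay with distance. The sketch below estimates the local coefficients at a single target location using a Gaussian kernel; the bandwidth value and the simulated data are arbitrary placeholders (in practice the bandwidth is chosen by cross-validation or AICc).

```python
# Minimal sketch: GWR local coefficients at one location (Eq. 8.15) with a Gaussian kernel.
import numpy as np

def gwr_local_coefficients(X, y, coords, target, bandwidth):
    """X: (n, m+1) design matrix with a leading column of ones; coords: (n, 2) locations."""
    d = np.sqrt(((coords - target) ** 2).sum(axis=1))
    w = np.exp(-(d ** 2) / (bandwidth ** 2))          # Gaussian kernel weights
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # (X' W_i X)^(-1) X' W_i y

# Hypothetical data: 50 locations and one explanatory variable
rng = np.random.default_rng(0)
coords = rng.random((50, 2)) * 10
x1 = rng.normal(size=50)
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.3, size=50)
X = np.column_stack([np.ones(50), x1])

beta_i = gwr_local_coefficients(X, y, coords, target=coords[0], bandwidth=3.0)
print(np.round(beta_i, 3))   # local intercept and slope at the first location
```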

8.7 CASE STUDY 8C: SPATIAL CLUSTER AND REGRESSION ANALYSES OF HOMICIDE PATTERNS IN CHICAGO

Most crime theories suggest, or at least imply, an inverse relationship between legal and illegal employment. The strain theory (e.g., Agnew, 1985) argues that crime results from the inability to achieve desired goals, such as monetary success, through conventional means like legitimate employment. The control theory (e.g., Hirschi, 1969) suggests that individuals unemployed or with less desirable employment have less to lose by engaging in crime. The rational choice (e.g., Cornish and Clarke, 1986) and economic theories (e.g., Becker, 1968) argue that people make rational


choices to engage in a legal or illegal activity by assessing the cost, benefit, and risk associated with it. Research along this line has focused on the relationship between unemployment and crime rates (e.g., Chiricos, 1987). According to the economic theories, the job market probably affects economic crimes (e.g., burglary) more than violent crimes including homicide (Chiricos, 1987).

Support for the relationship between job access and homicide can be found in the social stress theory. According to the theory, "[h]igh stress can indicate the lack of access to basic economic resources and is thought to be a precipitator of … homicide risk" (Rose and McClain, 1990, pp. 47–48). Social stressors include any psychological, social, and economic factors that form "an unfavorable perception of the social environment and its dynamics," particularly unemployment and poverty that are explicitly linked to social problems including crime (Brown, 1980).

Most literature on the relation between the job market and crime has focused on the link between unemployment and crime using large areas such as the whole nation, states, or metropolitan areas (Levitt, 2001). There may be more variation within such units than between them, and intraurban variation of crime rates needs to be explained by examining the relationship between the local job market and crime (e.g., Bellair and Roscigno, 2000). Wang and Minor (2002) argued that not every job was an economic opportunity for all, and only an accessible job was meaningful. They proposed that job accessibility, reflecting one's ability to overcome spatial and other barriers to employment, was a better measure of local job market condition. Their study in Cleveland suggested an inverse relationship between job accessibility and crime, and stronger (negative) relationships with economic crimes (including auto theft, burglary, and robbery) than violent crimes (including aggravated assault, homicide, and rape). Wang (2005) further extended the work to focus on the relationship between job access and homicide patterns with refined methodology, based on which this case study is developed.

The following data sets are provided under the study area folder Chicago:

1. A polygon feature citytrt in the geodatabase ChiCity.gdb contains 846 census tracts in the City of Chicago (not including the O'Hare airport tract because of its unique land use and noncontiguity with other tracts).
2. A text file cityattr.txt contains tract ids and 10 corresponding socioeconomic attribute values based on the 1990 census.

In the attribute table of citytrt, the field cntybna is each tract's unique id, the field POPU is population in 1990, the field JA is job accessibility measured by the methods discussed in Chapter 5 (a higher JA value corresponds to better job accessibility), and the field CT89_91 is the total homicide count for a 3-year period around 1990 (i.e., 1989, 1990, and 1991). Homicide data for the study area are extracted from the 1965–1995 Chicago homicide data set compiled by Block et al. (1998), available through the National Archive of Criminal Justice Data (NACJD) at www.icpsr.umich.edu/NACJD/home.html. Homicide counts over a period of 3 years are used to help reduce measurement errors and stabilize rates. Note that the job market for defining job accessibility is based on a much wider area (mostly the urbanized six counties: Cook, Lake, McHenry, Kane, DuPage, and Will) than the city of Chicago.


TABLE 8.1
Rotated Factor Patterns of Socioeconomic Variables in Chicago in 1990

Variable                     Factor 1:               Factor 2:                Factor 3:
                             Concentrated            Concentrated Latino      Residential
                             Disadvantages           Immigration              Instability
Public assistance            0.93120                 0.17595                  −0.01289
Female-headed households     0.89166                 0.15172                  0.16524
Black                        0.87403                 −0.23226                 −0.15131
Poverty                      0.84072                 0.30861                  0.24573
Unemployment                 0.77234                 0.18643                  −0.06327
Non-high school diploma      0.40379                 0.81162                  −0.11539
Crowdedness                  0.25111                 0.83486                  −0.12716
Latinos                      −0.51488                0.78821                  0.19036
New residents                −0.21224                −0.02194                 0.91275
Renters occupied                                     0.20098                  0.77222

Note: Values in bold indicate the largest loading of each variable on one factor among the three.

Data for defining the 10 socioeconomic variables and population are based on the STF3A files from the 1990 census and are measured in percentage. In the text file cityattr.txt, the first column is tract ids (i.e., identical to the field cntybna in the GIS layer citytrt) and the 10 variables are in the following order:

1. Families below the poverty line (labeled "poverty" in Table 8.1)
2. Families receiving public assistance ("public assistance")
3. Female-headed households with children under 18 ("female-headed households")
4. "Unemployment"
5. Residents who moved in the last 5 years ("new residents")
6. Renter-occupied homes ("renter-occupied")
7. Residents without high school diplomas ("no high school diploma")
8. Households with an average of more than one person per room ("crowdedness")
9. Black residents ("Black")
10. Latino residents ("Latinos")

8.7.1 Part 1: Spatial Cluster Analysis of Homicide Rates

Step 1. Optional: Factor analysis on socioeconomic variables in SAS: Use SAS or other statistical software to conduct factor analysis based on the 10 socioeconomic covariates contained in cityattr.txt. Save the result (factor scores and the tract ids), and join it to the GIS layer citytrt to view the distribution patterns of factor scores. For example, the SAS program PCA_FA.SAS under the folder Chicago runs the principal components analysis and factor analysis. It provides another


opportunity to practice the methods discussed in Chapter 7. This step is optional, as the result (factor scores FACTOR1, FACTOR2, and FACTOR3) is already provided in the attribute table of citytrt.

The principal components analysis result shows that three components (factors) have eigenvalues greater than 1 and are thus retained. These three factors capture 83.4% of the total variance of the original 10 variables. Table 8.1 shows the rotated factor patterns. Factor 1 (accounting for 53.6% of the variance among the three factors) is labeled "concentrated disadvantages," and captures five variables (public assistance, female-headed households, black, poverty, and unemployment). Factor 2 (accounting for 27.1% of the variance among the three factors) is labeled "concentrated Latino immigration," and captures three variables (residents with no high school diplomas, households with more than one person per room, and Latinos). Factor 3 (accounting for 19.3% of the variance among the three factors) is labeled "residential instability," and captures two variables (new residents and renter-occupied homes). The three factors are used as control variables (socioeconomic covariates) in the regression analysis of job access and homicide rate. The higher the value of a factor, the more disadvantageous a tract is in terms of socioeconomic characteristics.

Step 2. Computing homicide rates in ArcGIS: In ArcMap, on the GIS layer citytrt, use an attribute query to select features with POPU > 0 (845 tracts selected), and export it to a new feature layer citytract. This excludes a tract with no population, and thus avoids a zero denominator in computing homicide rates. Because of the rarity of the incidence, homicide rates are usually measured as homicides per 100,000 residents. On the attribute table of citytract, add a field HomiRate and calculate it as HomiRate = CT89_91 * 100000 / POPU, which is the homicide rate per 100,000 residents during 1989–91. In regression analysis (Part 2 of this case study in Section 8.7.2), the logarithmic transformation of homicide rates (instead of the raw homicide rate) is often used to measure the dependent variable (see Land et al., 1990, p. 937), and a value of 1 is added to the rates to avoid taking the logarithm of zero.* Add another field LnHomiRate to the attribute table of citytract, and calculate it as LnHomiRate = log(HomiRate + 1).

Step 3. Computing Getis–Ord General G and Moran's I: In ArcToolbox, choose Spatial Statistics Tools > Analyzing Patterns > High–Low Clustering (Getis–Ord General G) to activate a dialog window shown in Figure 8.4. Choose citytract as the Input Feature Class and HomiRate as the Input Field, and check the option Generate Report (other settings such as INVERSE_DISTANCE for Conceptualization of Spatial Relationships and EUCLIDEAN_DISTANCE for Distance Method are okay). Click OK to execute it. The result reports that the observed and expected General G's are 0.000079 and 0.000034, respectively. The Z score (=16.4) indicates a strong clustering of high homicide rates. One may repeat the analysis by selecting "CONTIGUITY_EDGES_ONLY" for Conceptualization

The choice of adding 1 (instead of 0.2, 0.5, or others) is arbitrary, and may bias the coefficient estimates. However, different additive constants have minimal consequence for significance testing as standard errors grow proportionally with the coefficients and thus leave the t values unchanged (Osgood, 2000, p. 36). In addition, adding 1 ensures that log(r + 1) = 0 for r = 0 (zero homicide).


of Spatial Relationships, and obtain a similar result suggesting a statistically significant clustering of high homicide rates (the observed and expected General G's are 0.010098 and 0.005452, respectively, with a Z score of 8.6). Similarly, execute another analysis by choosing the tool "Spatial Autocorrelation (Morans I)." The result suggests an even stronger clustering of similar (either high–high or low–low) homicide rates (the observed and expected Moran's I's are 0.193234 and −0.001185, respectively, with a Z score of 24.8 when using the Euclidean distances for conceptualization of spatial relationships).

Step 4. Mapping local Moran's Ii and local Gi∗: In ArcToolbox, choose Spatial Statistics Tools > Mapping Clusters > Cluster and Outlier Analysis (Anselin Local Moran's I). In the dialog window, define the Input Feature Class, Input Field, Conceptualization of Spatial Relationships, and Distance Method similar to those in step 3, and name the Output Feature Class HomiR_LISA. Figure 8.5 shows the automatically mapped result identifying five types of tracts: most of the tracts belong to the type "not significant," there are two major areas of "high–high cluster," only two tracts in the north are "high–low outlier," about a dozen tracts of "low–high outlier" are interwoven with the high–high clusters, and the type "low–low cluster" is absent. The attribute table of HomiR_LISA contains four new fields: LMiIndex for the local Moran's Ii, LMiZScore for the corresponding Z score, LMiPValue for the P value, and COType taking the values of NULL (blank), HH, HL, LH, and LL for the five types of spatial autocorrelation as explained.

Repeat the analysis using the tool "Hot Spot Analysis (Getis–Ord Gi∗)." This time, experiment with another setting for "Conceptualization of Spatial Relationships" such as FIXED_DISTANCE_BAND, and save the output feature as HomiR_Gi. As shown in Figure 8.6, it clearly differentiates those hot spots (clusters of high homicide rates) versus cold spots (clusters of low homicide rates) and otherwise not significant tracts. Among the hot spots or the cold spots, it is further divided into three types with different levels of confidence (90%, 95%, and 99%). Note that the two hot spots of high homicide rates are largely consistent with the two major high–high clusters identified by the local Moran's I. The attribute table of HomiR_Gi contains three new fields: GiZScore for the Z score, GiPValue for the P value, and GiBin taking the values of −3, −2, −1, 0, 1, 2, and 3 for 99% confidence cold spot, 95% confidence cold spot, 90% confidence cold spot, not significant, 90% confidence hot spot, 95% confidence hot spot, and 99% confidence hot spot, respectively. It does not report the exact Gi∗ values, whose statistical significances as indicated in the field GiBin are derived from the corresponding Z scores.

8.7.2 Part 2: Regression Analysis of Homicide Patterns

This part implements several regression models of homicide patterns in Chicago. Three global models include the OLS model, the spatial lag model and the spatial error model, and the local one is the GWR. The OLS and GWR models are implemented in ArcGIS, and the spatial lag and spatial error models are implemented in GeoDa. For reasons explained previously, the logarithm of the homicide rate is used as the dependent variable in all regression models.


FIGURE 8.5 Clusters of homicide rates based on local Moran's Ii.

Step 5. Implementing the OLS regression in ArcGIS: In ArcToolbox, choose Spatial Statistics Tools > Modeling Spatial Relationships > Ordinary Least Squares. In the dialog window, choose citytract as the Input Feature Class and OBJECTID as the Unique ID Field, name the Output Feature Class Homi _ OLS, choose LnHomiRate as Dependent Variable, and check four fields (FACTOR1, FACTOR2, FACTOR3, and JA) as explanatory variables, name the Output Report File OLSRpt and click OK to run the model.
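As a cross-check of the ArcGIS OLS report, the same model can be fitted with statsmodels in a few lines. This is a minimal sketch that assumes the shapefile and field names used in step 5:

import geopandas as gpd
import statsmodels.api as sm

tracts = gpd.read_file("citytract.shp")
X = sm.add_constant(tracts[["FACTOR1", "FACTOR2", "FACTOR3", "JA"]])
ols = sm.OLS(tracts["LnHomiRate"], X).fit()
print(ols.summary())   # coefficients, t values, and R-squared comparable to Table 8.2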


FIGURE 8.6 Clusters of homicide rates based on Gi∗.

The result is summarized in Table 8.2. Factor 1 and factor 2 are both positive and significant, whereas factor 3 is not, indicating that more concentrated disadvantage and concentrated Latino immigration contributed to higher homicide rates. Poorer job accessibility is also related to higher homicide rates. See Wang (2005) for more discussion.


TABLE 8.2
OLS and Spatial Regressions of Homicide Rates in Chicago (n = 845 Census Tracts)

Independent Variables   OLS Model           Spatial Lag Model    Spatial Error Model
Intercept               6.1324 (10.87)*     4.5428 (7.53)*       5.8334 (8.98)*
Factor 1                1.2200 (15.43)*     0.9669 (10.93)*      1.1783 (12.91)*
Factor 2                0.4989 (7.41)*      0.4052 (6.02)*       0.4782 (6.02)*
Factor 3                −0.1230 (−1.83)     −0.0998 (−1.53)      −0.0862 (−1.09)
Job access              −2.9143 (−5.41)*    −2.2103 (−4.14)*     −2.6352 (−4.26)*
Spatial lag (ρ)                             0.2736 (5.86)*
Spatial error (λ)                                                0.2617 (4.80)*
R2                      0.395               0.424                0.415

Note: Values in parentheses indicate t values; * Significant at 0.001.
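For readers without GeoDa, the spatial lag and spatial error columns of Table 8.2 can be approximated with the maximum likelihood routines in PySAL's spreg package. This is a sketch under the assumption that queen contiguity weights mirror the GeoDa weights defined in step 6 below; the estimates will be close to, but not identical with, the published values.

import geopandas as gpd
from libpysal.weights import Queen
from spreg import ML_Lag, ML_Error

tracts = gpd.read_file("citytract.shp")
y = tracts[["LnHomiRate"]].values                        # n x 1 array
X = tracts[["FACTOR1", "FACTOR2", "FACTOR3", "JA"]].values

w = Queen.from_dataframe(tracts)
w.transform = "r"
names = ["FACTOR1", "FACTOR2", "FACTOR3", "JA"]

lag = ML_Lag(y, X, w, name_y="LnHomiRate", name_x=names)    # spatial lag (rho)
err = ML_Error(y, X, w, name_y="LnHomiRate", name_x=names)  # spatial error (lambda)
print(lag.summary)
print(err.summary)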

Step 6. Defining spatial weights in GeoDa: Download the software GeoDa and install it. Start GeoDa. Select File from the main menu bar > Open Shapefile, and choose citytract.shp. Also from the main menu bar, select Tools > Weights > Create to activate the dialog window for defining the spatial weights. In the dialog window (Figure 8.7), select OBJECTID as Weights File ID Variable, check Queen Contiguity under Contiguity Weight (and use the default Order of contiguity "1"), and click Create. Name the spatial weights file tractQueen1.GAL and create it.

FIGURE 8.7 GeoDa dialog window for defining spatial weights.

Step 7. Running the spatial lag and spatial error models in GeoDa: In GeoDa, choose Methods > Regression. In the Regression dialog window (Figure 8.8), (1) use the > (and ») buttons to choose LnHomiRate as the dependent variable and FACTOR1, FACTOR2, FACTOR3, and JA as the independent variables; (2) specify the spatial weights file tractQueen1.GAL created in step 6; and (3) choose Spatial Lag as the model and click Run. Repeat the run with Spatial Error chosen as the model. The estimates of both models are reported in Table 8.2 along with the OLS results.

Step 8. Implementing the GWR in ArcGIS: In ArcToolbox, choose Spatial Statistics Tools > Modeling Spatial Relationships > Geographically Weighted Regression. In the dialog window, choose citytract as the Input Features, select LnHomiRate as Dependent Variable, add

four variables (FACTOR1, FACTOR2, FACTOR3, and JA) as explanatory variables, name the Output Feature Class Homi_GWR, use default settings for Kernel type (“Fixed”) and Bandwidth method (“AICc”), and click OK to run the model. The reported results include R2 = 0.415, AICc = 3512.9, and Sigma = 1.9230. By default, ArcGIS maps the standard residuals from the GWR (i.e., field StdResid in the output feature Homi_GWR). As shown in Figure 8.9, it does not exhibit any particular pattern as expected and thus suggests a random pattern. Based on the attribute table of Homi_GWR, Figure 8.10a–d shows the varying coefficients of factor 1, factor 2, factor 3, and JA across the study area (using fields C1_FACTOR1, C2_FACTOR2, C3_FACTOR3, and C4_JA). Note that the coefficient ranges are positive for factor 1 (concentrated disadvantage) and factor 2 (concentrated Latino immigration) and (mostly) negative for job accessibility, which are consistent with the findings from the three global regression models. The coefficients for factor 3 are not significant. In addition, the positive effect of concentrated disadvantage on homicide rates is the strongest on the “south side” (where the area has long been plagued with socioeconomic disadvantages) and declines gradually toward north; and the positive effect of concentrated Latino immigration is the strongest in the northwest (with the highest concentration of Latino immigrants) and declines toward southeast.
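A scripted alternative to the ArcGIS GWR tool is the mgwr package. The sketch below uses a fixed kernel and an AICc bandwidth search, loosely analogous to the settings above, although the estimates will not match the ArcGIS output exactly:

import numpy as np
import geopandas as gpd
from mgwr.gwr import GWR
from mgwr.sel_bw import Sel_BW

tracts = gpd.read_file("citytract.shp")
coords = np.column_stack((tracts.centroid.x, tracts.centroid.y))
y = tracts[["LnHomiRate"]].values
X = tracts[["FACTOR1", "FACTOR2", "FACTOR3", "JA"]].values

bw = Sel_BW(coords, y, X, fixed=True).search(criterion="AICc")  # fixed bandwidth
results = GWR(coords, y, X, bw, fixed=True).fit()
print(results.aicc, results.R2)
# results.params holds one row of local coefficients per tract (intercept first),
# which can be joined back to the layer for mapping as in Figure 8.10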

FIGURE 8.8 GeoDa dialog window for regression.

The spatial pattern of the negative effect of job accessibility is roughly the opposite of that of concentrated disadvantage.

8.8 SUMMARY

The centrographic measures provide a basic set of descriptive measures of geographic pattern in terms of the mean center, degree of dispersion from the center, and possible orientation. Spatial cluster analysis detects nonrandomness of spatial patterns or existence of spatial autocorrelation. The spatial cluster method such as the nearest neighbor index is based on locations only, and identifies whether data points are clustered, dispersed or random. Other spatial cluster methods also focus on feature locations but features are distinctively classified as cases and controls; and the methods analyze whether events of cases within a radius exhibit a higher level of concentration than a random pattern would suggest. Spatial cluster analysis methods based on feature values examine whether objects in proximity or adjacency are related (similar or dissimilar) to each other. Applications of spatial cluster analysis are widely seen in crime and health-related studies. The techniques


FIGURE 8.9 Standard residuals in the GWR model.

can benefit other fields such as history and culture studies, as demonstrated in Case Study 8B. The existence of spatial autocorrelation necessitates the use of spatial regression in regression analysis. Global spatial regression models include the spatial lag model and the spatial error model; both need to be estimated by the maximum likelihood (ML) method. The local model is the geographically weighted regression (GWR)

FIGURE 8.10 Spatial variations of coefficients from the GWR model: (a) factor 1, (b) factor 2, (c) factor 3, (d) JA (job accessibility).

that yields regression coefficients and corresponding t statistics varying across a study area. The current version of ArcGIS (10.2) provides some popular spatial statistics tools including the centrographic measures, the nearest neighbor index, and spatial cluster analysis indices based on feature values, and has newly added the GWR tool. However, implementing the spatial cluster analysis based on binary point features


(cases vs. controls) or the spatial lag and spatial error regression model has to rely on some specialized software such as SaTScan and GeoDa.

APPENDIX 8: SPATIAL FILTERING METHODS FOR REGRESSION ANALYSIS

The spatial filtering methods by Getis (1995) and Griffith (2000) take a different approach to account for spatial autocorrelation in regression. The methods separate the spatial effects from the variables' total effects, and allow analysts to use conventional regression methods such as OLS to conduct the analysis (Getis and Griffith, 2002). Compared to the maximum likelihood spatial regression, the major advantage of spatial filtering methods is that the results uncover individual spatial and nonspatial component contributions, and are easy to interpret. Griffith's (2000) eigenfunction decomposition method involves intensive computation, and takes more steps to implement. This appendix discusses Getis's method. The basic idea in Getis's method is to partition each original variable (spatially autocorrelated) into a filtered nonspatial variable (spatially independent) and a residual spatial variable, and then feed the filtered variables into OLS regression. Based on the Gi statistic in Equation 8.7, the filtered observation x_i^* is defined as

x_i^* = \frac{W_i/(n-1)}{G_i} x_i,

where x_i is the original observation, W_i = \sum_j w_{ij} (averaged spatial weights for i ≠ j), n is the number of observations, and G_i is the local Gi statistic. Note that the numerator W_i/(n − 1) is the expected value for G_i. When there is no autocorrelation, x_i^* = x_i. The difference L_{x_i} = x_i − x_i^* represents the spatial component of the variable at i. Feeding the filtered variables (including the dependent and explanatory variables) into an OLS regression yields the spatially filtered regression model such as y^* = f(x_1^*, x_2^*, …), where y^* is the filtered dependent variable, and x_1^*, x_2^*, and others are the filtered explanatory variables. The final regression model includes both the filtered nonspatial component and the spatial component of each explanatory variable such as y = f(x_1^*, L_{x_1}, x_2^*, L_{x_2}, …), where y is the original dependent variable, and L_{x_1}, L_{x_2}, … are the corresponding spatial components of explanatory variables x_1, x_2, …. Like the Gi statistic, Getis's spatial filtering method is only applicable to variables with a natural origin and positive values, not those represented by standard normal variates, rates, or percentage change (Getis and Griffith, 2002, p. 132).
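The filtering formula can be applied in a few lines once the local Gi values are available. A minimal sketch with numpy and esda is given below; the distance-band threshold is a placeholder, and the variable must be positive-valued as noted above.

import numpy as np
import geopandas as gpd
from libpysal.weights import DistanceBand
from esda.getisord import G_Local

gdf = gpd.read_file("citytract.shp")
x = gdf["HomiRate"].values
n = len(x)

w = DistanceBand.from_dataframe(gdf, threshold=5000, binary=True)  # binary weights
gi = G_Local(x, w, transform="B", star=False)      # local G_i excluding i itself

W_i = np.array([w.cardinalities[k] for k in w.id_order], dtype=float)
x_star = (W_i / (n - 1)) / gi.Gs * x               # filtered (nonspatial) variable
L_x = x - x_star                                   # residual spatial component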

Section III Advanced Quantitative Methods and Applications

9 Regionalization Methods and Application in Analysis of Cancer Data

Analysis of rare events (e.g., cancer, AIDS, homicide) often suffers from the small population (numbers) problem, which can lead to unreliable rate estimates, sensitivity to missing data and other data errors (Wang and O’Brien, 2005), and data suppression in sparsely populated areas. The spatial smoothing techniques such as the floating catchment area method and the empirical Bayesian smoothing method, as discussed in Chapter 3, can be used to mitigate the problem. This chapter introduces a more advanced approach, namely, “regionalization,” to this issue. Regionalization is to group a large number of small units into a relatively small number of regions while optimizing a given objective function and satisfying certain constraints. The chapter begins with an illustration of the small population problem and a brief survey of various approaches in Section 9.1. Section 9.2 discusses two GIS-based approaches, the spatial order method and the Modified Scale–Space Clustering (MSSC) method. Section 9.3 explains the foundation of Regionalization with Dynamically Constrained Agglomerative Clustering and Partitioning (REDCAP) method, and Section 9.4 uses a case study of analyzing late-stage breast cancers in the Chicago region to illustrate its implementation. The chapter concludes in Section 9.5 with a brief summary.

9.1 SMALL POPULATION PROBLEM AND REGIONALIZATION

The small population problem is common for analysis of rare events. The following uses two fields, crime and health studies, to illustrate issues related to this problem and some attempts to mitigate it. In criminology, the study of homicide rates across geographic units and for demographically specific groups often entails analysis of aggregate homicide rates in small populations. Research has found inconsistency between the two main sources of homicide data in the US, that is, the Uniform Crime Reporting (UCR) and the National Vital Statistics System (NVSS) (Dobrin and Wiersema, 2000), which confirms reporting errors in either system. In addition, a sizeable and growing number of unsolved homicides have to be excluded from studies analyzing offender characteristics (Fox, 2000). For a small county with only hundreds of residents, data errors for one homicide could swing the homicide rate per 100,000 residents by hundreds of times (Chu et al., 2000), and raise the question of reliability of homicide rates. Several nongeographic strategies have been attempted by criminologists to mitigate the problem. For example, Morenoff and Sampson (1997) used homicide counts


instead of per capita rates or simply deleted outliers or unreliable estimates in areas with small population. Some used larger units of analysis (e.g., states, metropolitan areas, or large cities) or aggregated over more years to generate stable homicide rates. Aggregate crime rates from small populations violate two assumptions of ordinary least-squares (OLS) regressions, that is, homogeneity of error variance or homoscedasticity (because errors of prediction are larger for crime rates in smaller populations) and normal error distribution (because more crime rates of zero are observed as populations decrease). Land et al. (1996) and Osgood (2000) used Poisson regression to better capture the nonnormal error distribution pattern in regression analysis of homicide rates in small populations (see Appendix 9A).

In health studies, another issue, data suppression, is associated with the small population problem. Figure 9.1, generated from the State Cancer Profiles web site (statecancerprofiles.cancer.gov), shows age-adjusted death rates for female breast cancer in Illinois counties for 2003–2007. Rates for 37 out of 102 counties (i.e., 36.3%, mostly rural counties) are suppressed to "ensure confidentiality and stability of rate estimates" because counts were 15 cases or fewer. Cancer incidence in these counties cannot be analyzed, leaving large gaps in our understanding of geographic variation in cancer and its social and environmental determinants (Wang et al., 2012).

FIGURE 9.1 Female breast cancer death rates in Illinois for 2003–2007.

Many researchers in health-related fields have used some spatial analytical or geographic methods to address the issue. For example, spatial smoothing computes the average rates in a larger spatial window to obtain more reliable and stable rates (Talbot et al., 2000). Spatial smoothing methods include the floating catchment area method, kernel density estimation and empirical Bayes estimation as discussed in Chapter 3, and more recently locally weighted average (Shi et al., 2007) and adaptive


spatial filtering (Tiwari and Rushton, 2004; Beyer and Rushton, 2009), among others. Another method, hierarchical Bayesian modeling (HBM), commonly used in spatial epidemiology, uses a nonparametric Bayesian approach to detect clusters of high risk and low risk with the prior model assuming constant risk within a cluster (Knorr-Held, 2000; Knorr-Held and Rasser, 2000). This chapter focuses on the approach of constructing larger geographic areas from the ones with small population. The purpose is similar to that of aggregating over a longer period of time to achieve a greater degree of stability in homicide rates across areas. The technique has much common ground with the long tradition of regional classification (regionalization) in geography (Cliff et  al., 1975), which is particularly desirable for studies that require comparable analysis areas with a minimum base population. For instance, Black et al. (1996) developed the ISD method (after the Information & Statistics Division of the Health Service in Scotland, where it was devised) to group a large number of census enumeration districts (ED) in the UK into larger analysis units of approximately equal population size. Lam and Liu (1996) used the spatial order method to generate a national rural sampling frame for HIV/AIDS research, in which some rural counties with insufficient HIV-cases were merged to form larger sample areas. Both approaches emphasize spatial proximity, but neither considers within-area homogeneity of attribute. Haining et al. (1994) attempted to consolidate many EDs in the Sheffield Health Authority Metropolitan District in the UK to a manageable number of regions for health service delivery (commonly referred to as the “Sheffield method”). The Sheffield method started by merging adjacent EDs sharing similar deprivation index scores (i.e., complying with within-area attribute homogeneity), and then used several subjective rules and local knowledge to adjust the regions for spatial compactness (i.e., accounting for spatial proximity). The method attempted to balance the two criteria of attribute homogeneity and spatial proximity, a major challenge in regionalization analysis. In other words, only contiguous areas can be clustered together, and these areas must have similar attributes. If very different areas were grouped together, much of the geographic variation, which is of primary interest to spatial analysis, would be smoothed out in the regionalization process. There are a number of automated regionalization methods reported in the literature: the AZP (Openshaw and Rao, 1995; Openshaw, 1977; Cockings and Martin, 2005; Grady and Enander, 2009), MaxP (Duque et al., 2012), and MSSC (Mu and Wang, 2008). The AZP method starts with an initial random regionalization and then iteratively refines the solution by reassigning objects to neighboring regions to improve an objective function value (e.g., maximum homogeneity within derived regions), and therefore the regionalization result varies, depending on the initial randomization state. The MaxP groups a set of geographic areas into the maximum number of homogeneous regions such that the value of a spatially extensive regional attribute is above a predefined threshold value. The MSSC merges or melts adjacent and similar areas to form larger areas by following a process guided by an objective of minimizing loss of information in aggregation. 
None of these methods guarantees that newly formed areas have population above a threshold, an important factor considered in the REDCAP (Guo 2008; Guo and Wang 2011) and the mixed-level regionalization (MLR) method introduced in Appendix 9B.


TABLE 9.1
Approaches to the Small Population Problem

Approach | References with Examples | Comments
Use counts instead of per capita rates | Morenoff and Sampson (1997) | Not applicable for most studies that are concerned with rates relative to population sizes
Delete samples of small populations | Harrell and Gouvis (1994), Morenoff and Sampson (1997) | Omitted observations may contain valuable information and bias a study
Aggregate over more years or to a higher geographic level | Messner et al. (1999), most studies surveyed by Land et al. (1990) | Infeasible for analysis of variations within the time period or within the larger areas
Poisson regressions | Osgood (2000), Osgood and Chambers (2000) | Not applicable to nonregression studies
Spatial smoothing: floating catchment area, kernel density estimation, empirical Bayes estimation, locally weighted average, adaptive spatial filtering | Clayton and Kaldor (1987), Tiwari and Rushton (2004), Shi (2007) | Suitable for mapping trends, but smoothed rates are not true rates in analysis areas
Regionalization: Sheffield method, spatial order, ISD, AZP, MaxP, MSSC, REDCAP | Openshaw (1977), Haining et al. (1994), Black et al. (1996), Lam and Liu (1996), Guo (2008), Mu and Wang (2008), Duque et al. (2012) | Generative of reliable rates for statistical reports, mapping, exploratory spatial analysis, regression analysis, and others

Constructing geographic areas enables the analysis to be conducted at multiple geographic levels, and thus permits the test of the “modifiable areal unit problem” (MAUP) (Fotheringham and Wong, 1991). This issue is also tied to the uncertain geographic context problem (UGCoP) (e.g., in multilevel modeling), referring to sensitivity in research findings to different delineations of contextual units (Kwan, 2012). Furthermore, since similar areas are merged to form new regions, spatial autocorrelation is less of a concern in the new regions. Table 9.1 summarizes the approaches to the small population problem. In the next two sections, three methods (the spatial order, MSSC, and REDCAP) are presented with detailed discussion.

9.2 SPATIAL ORDER AND THE MODIFIED SCALE–SPACE CLUSTERING (MSSC) METHODS

The spatial order method uses space-filling curves to determine the nearness or spatial order of areas. The first algorithm was developed by the Italian mathematician Giuseppe Peano in 1890 and is thus also referred to as the Peano curve. Space-filling curves traverse space in a continuous and recursive manner to visit all areas, and assign a spatial order (from 0 to 1) to each area based on its relative position in a two-dimensional (2-D) space.

FIGURE 9.2 Example of assigning spatial order values to areas.

As shown in Figure 9.2, the spatial order of each area is calculated and labeled (on the left). The centroids of these areas in the 2-D space are then mapped onto the 1-D line underneath. The Peano curve connects the centroids from the lowest spatial order to the highest (on the right). On the basis of their spatial order values, the areas can be grouped by using a clustering method discussed in Section 7.3 in Chapter 7. The procedure is currently available in ArcToolbox > Data Management Tools > General > Sort (under "Spatial Sort Method," choose the option PEANO). It is based on one of the algorithms developed by Bartholdi and Platzman (1988). In general, areas that are close together have similar spatial-order values, and areas that are far apart have dissimilar spatial-order values. The method provides a first-cut measure of closeness. The spatial order method has been used to construct larger areas from small ones in several health studies (Lam and Liu, 1996; Wang and O'Brien, 2005).

The spatial order method considers only spatial proximity, but not within-area attribute homogeneity. The modified scale–space clustering (MSSC) method accounts for both factors. As we know, objects in the world appear in different ways depending upon the scale of observation. In the case of an image, the size of scale ranges from a single pixel to a whole image. There is no right scale for an object as any real-world object may be viewed at multiple scales. The operation of systematically simplifying an image at a finer scale and representing it at coarser levels of scale is termed scale–space smoothing. A major reason for scale–space smoothing is to suppress and remove unnecessary and disturbing details (Lindeberg, 1994, p. 10). There are various scale–space clustering algorithms (e.g., Wong, 1993; Wong and Posner, 1993). In essence, an image is composed of many pixels with different brightness. As the scale increases, smaller pixels are melted to form larger pixels. The melting process is guided by some objectives such as entropy maximization (i.e., minimizing loss of information).


Applying the scale–space clustering method in a socioeconomic context requires simplification of the algorithm. The method introduced here is modified and improved from the early versions of scale–space clustering methods such as the melting algorithm (Wong, 1993; Ciucu et al., 2003) and the blurring algorithm (Leung, Zhang and Xu, 2000; Wang, 2005). It was first reported in Mu and Wang (2008). The idea is to melt each spatial object (area) to its most similar neighbor. The similarity is measured by the attribute distance between two neighboring areas. An object i has t attributes standardized as (x_i1, …, x_it); and its adjacent objects j have attributes also standardized as (x_j1, …, x_jt). Similar to Equation 7.7 in the cluster analysis in Chapter 7, the attribute distance between i and j is defined as

D_{ij} = \sum_t (x_{it} - x_{jt})^2.    (9.1)

Among i’s neighboring objects (l = 1, 2, …, m), its most similar neighbor (k) is the one with the shortest attribute distance from it, termed minimum-distance criterion, such as Dik = min{Dil }. l

(9.2)

An object with the highest attribute value among its surrounding objects serves as a local nucleus, and searches outward for its most similar neighboring object for grouping. A branch of the outward searching process continues until it reaches an object with the lowest attribute value among its surrounding objects. The grouping process is much like our cognitive process of viewing a picture: the pattern or structure is captured by its brightest pixels, and the surrounding darker pixels serve as the background. In practice, for the purpose of operation, one needs to follow the direction from a local minimum to a local maximum, and group all the objects along the way. By merging surrounding areas (up to local minima) to the local maxima, a region is simplified with fewer areas while the structure is preserved. The algorithm is implemented in GIS in four major steps (a simplified Python sketch follows the list):

1. Establishing a link between each object and its most similar adjacent object. Using the minimum-distance criterion, object i is linked to object k if Dik satisfies Equation 9.2.
2. Determining the link's direction. The direction of the link between objects i and k is determined by their attribute values, represented by an aggregate attribute score (Q). In Mu and Wang (2008), Q is computed as the average of factor scores, weighted by their corresponding eigenvalues, representing proportions of variance captured by the factors. The direction is defined such that i → k if Qi < Qk; otherwise i ← k. Therefore, the directional link always points toward a higher aggregate score.
3. Identifying local minima and maxima. A local minimum is an object with all directional links pointing toward other objects, and thus has the lowest Q among surrounding objects. A local maximum is an object with all directional links pointing toward it, and thus has the highest Q among surrounding objects.
4. Grouping around local maxima. Beginning with a local minimum, search outward following link directions until a local maximum is reached. All objects between the local minimum and maximum are grouped into one cluster. Attributes in each cluster are updated as the averaged attributes of its composed objects (weighted by each object's population). The clustering process continues until all objects are grouped into a cluster.

As outlined previously, the spatial order method does not account for attribute similarity within a cluster (newly formed region), but it is fairly easy to incorporate a threshold (minimum) population for the new regions in the clustering process. The MSSC method considers the attribute similarity in the clustering process, but does not impose a criterion of minimum population. In some applications, both criteria are desirable. Appendix 9B presents the MLR method that integrates the two together.
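The first three steps can be illustrated with a short Python function. This is a simplified sketch rather than the full MSSC implementation: adjacency is assumed to be a dictionary of neighbor lists keyed by object IDs, attr a 2-D array of standardized attributes, and Q a 1-D array of aggregate attribute scores.

import numpy as np

def mssc_links(adjacency, attr, Q):
    """Steps 1-3: link each object to its most similar adjacent object,
    orient each link toward the higher aggregate score Q, and flag local
    minima and maxima."""
    links = {}
    for i, neighbors in adjacency.items():
        # attribute distance of Equation 9.1 to each neighbor
        dist = {j: float(np.sum((attr[i] - attr[j]) ** 2)) for j in neighbors}
        links[i] = min(dist, key=dist.get)          # minimum-distance criterion

    directed = {(i, k) if Q[i] < Q[k] else (k, i) for i, k in links.items()}
    tails = {i for i, _ in directed}                # objects with an outgoing link
    heads = {k for _, k in directed}                # objects with an incoming link
    local_max = heads - tails                       # all links point toward them
    local_min = tails - heads                       # all links point away from them
    return directed, local_min, local_max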

9.3 REDCAP METHOD

REDCAP refers to a family of methods, termed "regionalization with dynamically constrained agglomerative clustering and partitioning." REDCAP extends the single-linkage (SLK) clustering, average-linkage (ALK) clustering, complete-linkage (CLK) clustering, and the Ward hierarchical clustering methods (see Section 7.3) to enforce the spatial contiguity of clusters and obtain a set of regions while explicitly optimizing an overall homogeneity measure (Guo, 2008). In other words, it accounts for both spatial contiguity (only merging adjacent areas) and attribute homogeneity (only grouping similar areas). For example, the case study presented in Section 9.4 uses the CLK clustering, where the distance between two clusters is defined as the furthest pair (most dissimilar) of data points. In essence, the goal of REDCAP is to construct a set of homogeneous regions by aggregating contiguous small areas of similar attribute values (e.g., socioeconomic structure). To achieve this goal, REDCAP constructs a cluster hierarchy based on attribute similarities among small areas and then partitions the spatially contiguous cluster tree to explicitly optimize a homogeneity measure. The homogeneity measure is the total sum of squared deviations (SSD) (Everitt, 2002), as defined in Equation 9.3, where k is the number of regions, n_r is the number of small areas in region r, d is the number of variables considered, x_{ij} is a variable value, and \bar{x}_j is the regional mean for variable j. Each input data variable should be normalized and a weight can be assigned for each variable.

SSD = \sum_{r=1}^{k} \sum_{i=1}^{n_r} \sum_{j=1}^{d} (x_{ij} - \bar{x}_j)^2.    (9.3)
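For reference, the SSD of Equation 9.3 for any given assignment of areas to regions can be computed directly. A small numpy sketch is shown below, where X is the n × d matrix of normalized variables and labels gives each area's region:

import numpy as np

def total_ssd(X, labels):
    """Total sum of squared deviations from regional means (Equation 9.3)."""
    ssd = 0.0
    for r in np.unique(labels):
        Xr = X[labels == r]
        ssd += ((Xr - Xr.mean(axis=0)) ** 2).sum()
    return float(ssd)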

As illustrated in Figure 9.3, REDCAP is composed of two steps: (1) contiguity-constrained hierarchical clustering and (2) top–down tree partitioning. The shade of each polygon represents its attribute value, and similar shades represent similar


values. Two polygons are considered contiguous in space if they share a segment of boundary (i.e., based on the rook contiguity). In the first step, as shown in Figure 9.3a, REDCAP constructs a hierarchy of spatially contiguous clusters based on the attribute similarity under contiguity constraint. Two adjacent and most similar areas are grouped to form the first cluster; two adjacent and most similar clusters are grouped together to form a higher level cluster; and so on until the whole study area is one cluster. It is a clustering process very similar to the one explained in Section 7.3, and adds one new spatial constraint (i.e., only spatially contiguous polygons or clusters can be grouped). A clustering tree is generated to fully represent the cluster hierarchy (i.e., each cluster at any level is a sub-tree in the map). In the second step, as shown in Figure 9.3b, REDCAP partitions the tree to generate two regions by removing the best edge (i.e., 11–15 in Figure 9.3b) that optimizes the homogeneity measure (i.e., SSD) as defined in Equation 9.3. In other words, the two regions are created in a way that the total within-region homogeneity is maximized. The partitioning continues until the desired number of regions is reached. Note that the first step (i.e., contiguity-constrained clustering) is a bottom–up process, which builds a hierarchy of spatially contiguous clusters but does not directly optimize the objective function. The second step (i.e., tree partitioning) is a top–down approach that directly optimizes the objective function. The final regions most likely are not the same as the top clusters suggested in the cluster hierarchy. This is why the second step is necessary, which makes the REDCAP methods different from traditional contiguity-constrained hierarchical clustering.

FIGURE 9.3 Example illustrating REDCAP: (a) a spatially contiguous tree is built with a hierarchical clustering method, (b) partitioning the tree by removing the edge that optimizes the SSD measure.

REDCAP is

similar to the SKATER method (Assunção et al., 2006) in terms of the two-step framework but significantly outperforms the latter according to criteria such as total heterogeneity, region size balance, internal variation, and preservation of data distribution (Guo, 2008). As discussed previously, for the purpose of mitigating the small population problem, REDCAP is modified to accommodate one more constraint such as a minimum regional size (Wang et al., 2012). It can be a threshold region population and/or the number of incidents (whichever is the denominator in a rate estimate). Such a constraint is enforced in the second step, that is, tree partitioning. For each potential cut, if it cannot produce two regions that both satisfy the constraints, the cut will not be considered as a candidate cut. Then the best of all candidate cuts is chosen to partition a tree into two regions. If there is no candidate cut (i.e., no cut can produce regions that satisfy the constraints), then the region will not be partitioned further. If none of the current regions can be cut, the regionalization process stops. The method is deterministic. In other words, given the same criteria (definitions of attribute similarity and spatial contiguity, minimum region population and/or number of cancer cases), the method yields the same regions. The resulting regions are all large enough and have the highest homogeneity within each region.
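The constrained partitioning step can be illustrated with networkx. In this simplified sketch (not the REDCAP code), every edge of the contiguity-constrained spanning tree is a candidate cut, cuts that violate the minimum size constraint are skipped, and the feasible cut with the lowest resulting SSD is selected; X and pop are arrays indexed by the tree's node IDs.

import networkx as nx
import numpy as np

def best_constrained_cut(tree, X, pop, min_pop):
    """Return the tree edge whose removal minimizes total SSD of the two
    resulting regions, subject to both regions meeting min_pop; None if no
    feasible cut exists."""
    def ssd(nodes):
        Xr = X[list(nodes)]
        return ((Xr - Xr.mean(axis=0)) ** 2).sum()

    best_edge, best_ssd = None, np.inf
    for u, v in list(tree.edges):
        tree.remove_edge(u, v)                                # try the cut
        part1, part2 = (set(c) for c in nx.connected_components(tree))
        tree.add_edge(u, v)                                   # restore the tree
        if pop[list(part1)].sum() < min_pop or pop[list(part2)].sum() < min_pop:
            continue                                          # not a candidate cut
        total = ssd(part1) + ssd(part2)
        if total < best_ssd:
            best_edge, best_ssd = (u, v), total
    return best_edge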

9.4 CASE STUDY 9: CONSTRUCTING GEOGRAPHICAL AREAS FOR ANALYSIS OF LATE-STAGE BREAST CANCER RISKS IN THE CHICAGO REGION

The variations of breast cancer mortality rates from place to place reflect both underlying differences in breast cancer prevalence and differences in diagnosis and treatment that affect the risk of death. Patients with cancer who are diagnosed early have fewer complications and substantially higher rates of survival than those who are diagnosed late. For breast cancer, access to primary care and mammography screening is critically important for early detection (Wang et al., 2008). Access is strongly influenced by financial, sociocultural, and geographic barriers (or risk factors) (Wang, 2012). The small population problem is commonly encountered in cancer data analysis, and here in the analysis of late-stage breast cancer risks. This case study is developed from the research reported in Wang et al. (2012)* to illustrate the implementation and benefits of using the REDCAP methods. The construction of suitable geographic areas enables us to map reliable late-stage rates in new areas and conduct related exploratory spatial analysis to identify high-risk areas. Furthermore, OLS regression for the new areas also becomes less problematic. Data used in this project are provided under the study area folder Chicago, primarily the feature for 317 zip code areas in the Chicago metropolitan area in
*

The study area for this case study is only the Chicago metro area whereas the whole state of Illinois was used in Wang et  al. (2012). Also due to the data use agreement with the Illinois State Cancer Registry (ISCR), I modified both the breast cancer counts and late-stage breast cancer counts at the zip code level in preparing the data for the case study. An algorithm was used to introduce a randomized component to each variable while preserving the general pattern and total counts.


shapefile ChiZip under the subfolder ChiZip. Its attribute table includes the following fields at the zip code level:

1. Pop is population, and BrCancer and BrCancerLS are female breast cancer counts and late-stage female breast cancer counts. Table 9.2 shows some basic statistics for cancer counts and rates (information for the zip code areas on the top).
2. Fact1 and Fact2 are the factor scores, labeled "socioeconomic disadvantages" and "sociocultural barriers," respectively. These are results from a factor analysis by consolidating 10 variables from census in order to capture the demographic and socioeconomic characteristics of neighborhood context that affect people's accessibility to health care. A higher score of either factor corresponds to a more disadvantaged area. The two factors will be used for measuring attribute similarity in the regionalization process.
3. AccDoc is an index measuring spatial accessibility to primary care estimated by the two-step floating catchment area (2SFCA) method (see Section 5.2.2). A higher accessibility value indicates better access to primary care. T_Msite is travel time in minutes from a zip code area (represented by its population-weighted centroid) to the nearest mammography facility, which measures spatial accessibility to cancer screening services. These two variables along with the two factors will be used as explanatory variables in assessing late-stage cancer risks. For detailed discussion of the variable selection and measurements, see Wang et al. (2012).

TABLE 9.2
Descriptive Statistics for Female Breast Cancer by Zip Code and by Constructed New Areas in Chicago Metro Area in 2000

                            Total Cases   Late-Stage Cases   Late-Stage Rate
Zip code areas (n = 317)
  Minimum                   0             0                  0(a)
  Maximum                   75            24                 1(a)
  Mean                      20.63         6.08               0.3124(a)
  Standard deviation        16.61         5.20               0.1695(a)
New areas (n = 195)
  Minimum                   15            1                  0.0294
  Maximum                   76            28                 0.5417
  Mean                      33.54         9.89               0.2971
  Standard deviation        13.65         4.94               0.1019

(a) Applies only to 292 zip code areas with nonzero total cases (excluding 25 zip code areas with 0 total cases).


Step 1. Installing and launching REDCAP: REDCAP is a general toolkit for regionalization developed by Guo (http://www.spatialdatamining.org/software/redcap), from which the program can be downloaded. Unzip the program file. It is recommended to install and launch the program from the Windows command window (Start > All Programs > Accessories > Command Prompt). Navigate to your project folder where the program file redcap.jar resides. For example, in my case, I typed

cd c:\QuantGIS_V2\Chicago_Zip

Then launch the program by typing:

java -jar -Xmx1024m redcap.jar

In the Confirmation window, choose Yes to proceed. As indicated in the window, you may launch the program by simply double-clicking the redcap.jar file only if your data set is small (e.g., fewer than 1000 spatial objects).

Step 2. Creating spatial contiguity file in REDCAP: In the REDCAP window, under Views, leave the option Simplify Shapes off by default.* Under Contiguity, choose Create Rook contiguity > in the window Open Shape File, navigate to the shape file ChiZip.shp and click Open > in the window Save Contiguity File, click Save to accept the default settings to save the contiguity file ChiZip.shp.ctg under the project folder.†

Step 3. Loading data and configuring variables in REDCAP: In the REDCAP window, select File > Load Data > in the window Open Shape File, navigate to the shape file ChiZip.shp, click Open. Three windows are shown:
a. Control (listing variables from the attribute table under Data)
b. Map (showing the study area)
c. System Message
In the Control window, under Data, select the two variables Fact1 and Fact2 together by holding down the Ctrl key and click Submit Variables > now on the right side of the Control window, under "[Variable/Normalizer]*Weight," set corresponding weights for the above variables by replacing the two values "1.0" by 4.6576 and 2.3679,‡ and leave the default setting No Smoother on > Click OK (see Figure 9.4 for reference).

Step 4. Selecting and running a regionalization algorithm in REDCAP: In the Algorithm window,

* If your shapefile size is large, it may contain unnecessary geographic details for the boundary. Simplifying the shapes helps speed up the process.
† If your study area is composed of multiple regions (say, separated by water bodies such as rivers), as is the case here, the command window displays a message: "The contiguity graph has 7 component(s) … bridging the 7 components with 6 forced connections."
‡ The two values are the eigenvalues corresponding to the two factors from the factor analysis, which represent the variances (or information) captured by the factors and thus are used here as weights.

FIGURE 9.4 Interface windows in REDCAP.

a. For "Regionalization method," the default "Full-Order-ALK" is okay.
b. For "Maximum of regions," type "317" (in this case, it is not relevant; I suggest using "317," i.e., the number of zip code areas in the study area).
c. For "#1 Control population," choose "BrCancer," that is, breast cancer counts in zip code areas; and for "Minimum population per region" underneath, type "15," the threshold number of cancer cases in the newly formed regions.
d. For "#2 Control population," and "Minimum population per region" and "Logical operator of CSTR#1 and CSTR#2" underneath it, keep the defaults (no variable for the 2nd control population is used in this case study).
Click Run > In the window "Choose levels," move to the bottom and choose "195," click Add to select it and then "OK" at the bottom to run the program. We choose 195 regions because that is the first round of clustering that meets the requirement of a minimum of 15 cases in the new regions. The System Message window shows that the regionalization algorithm is finished (and 195 regions are generated). Similar messages are shown in the command window. The Map window shows the newly constructed regions.


Back to the Algorithm window, select "Save CSV" to save the regionalization result as a CSV file > navigate to the project folder and name the file (e.g., Reg_min15.csv). The text file contains two fields: ObjectID identifies unique zip code areas inheriting from the field FID of the shapefile ChiZip, and regions195 indicates the region each zip code area is clustered to. This concludes the analysis in REDCAP.

Step 5. Mapping late-stage breast cancer rates in zip code areas in ArcGIS: In ArcMap, add the shapefile ChiZip.shp > Open its attribute table, and right click the field BrCancer > Statistics. The result shows that many zip code areas have cancer cases fewer than 15. Select zip code areas with "BrCancer > 0" (292 out of 317 are selected, 25 with zero cases are excluded), and export the selected features to a new shapefile BrLSRate_Zip. Add a field BrLSRate (late-stage breast cancer rate) to the attribute table of BrLSRate_Zip, and compute it as "= BrCancerLS/BrCancer." Figure 9.5 shows the late-stage breast cancer rates across the zip code areas including 25 zip code areas with missing late-stage rates because of zero cases there. Among the 292 zip code areas with valid late-stage rates, Figure 9.6a shows the full range of late-stage rates including 15 zip code areas with a rate of 0.0 and 5 with a rate of 1.0 (also see Table 9.2). For the reasons discussed earlier, direct mapping of late-stage breast cancer rates in zip code areas as in Figure 9.5 displays a highly fragmented geographic pattern with a high variability. Moreover, the areas with missing rates affect the polygon contiguity of those with valid rates. Therefore, exploratory spatial data analysis including spatial autocorrelation or hot-spot analysis is infeasible for zip code area data due to unreliable rates in areas with low cancer counts and the fragmented spatial pattern. This highlights the need for constructing larger and comparable areas to permit such analysis.

Step 6. Regression analysis on late-stage breast cancer rates in zip code areas in ArcGIS: In ArcToolbox, choose Spatial Statistics Tools > Modeling Spatial Relationships > Ordinary Least Squares (also see step 5 in Section 8.7.2). In the dialog window, choose (1) Input Feature Class: BrLSRate_Zip, (2) Unique ID Field: OID_, (3) Dependent variable: BrLSRate, (4) Explanatory Variables: Fact1, Fact2, AccDoc, T_Msite, and click OK to run the regression. The result is reported in Table 9.3.

The following is optional. Due to the prevalence of low cancer counts in the data, Poisson regression is more suitable than the OLS model for analysis of late-stage cancer risks (see Appendix 9A). An SAS program poisson_zip.sas is available under the study area folder Chicago to implement the Poisson regression, and the result is also presented in Table 9.3. The Poisson regression result is largely consistent with that of OLS, but yields higher statistical significance levels for Fact1, Fact2, and AccDoc.

Step 7. Examining the regionalization result in ArcGIS: In ArcGIS, join the text file Reg_min15.csv (common key ObjectID) to the shapefile ChiZip.shp (common key FID). Right click the layer ChiZip > Properties > Symbology > Categories > Unique values > Select "regions195" for Value Field, and click "Add All Values" at the bottom left corner to use unique colors to represent different regions. The map may look like the one shown in REDCAP (i.e., Figure 9.4). Clearly, some

FIGURE 9.5 Late-stage breast cancer rates in zip code areas in the Chicago region in 2000.

zip code areas, particularly those on the west edge (mostly small rural zip code areas), share the same region codes and form the clusters. Step 8. Constructing new areas (regions) and aggregating data from zip code areas in ArcGIS: Both the independent variables (Fact1, Fact2, AccDoc, and T_MSite) and dependent variable (BrLSRate) need to be aggregated from zip code areas to the new clusters (identified by the field region195). For the former (Fact1, Fact2, AccDoc and T_MSite), each is obtained by calculating the weighted averages within a cluster using population (Pop) as weights; and for the

FIGURE 9.6 Distribution of late-stage breast cancer rates in the Chicago region in 2000: (a) 292 zip code areas, (b) 195 new areas.

latter (BrLSRate), it is derived by calculating the total late-stage case counts and the total case counts in a cluster and then the rate between the two. Taking Fact1 as an example, the formula for its weighted average is

x_w = \frac{\sum_i w_i x_i}{\sum_i w_i},    (9.4)

where w_i and x_i are population and factor 1 score in zip code area i within a cluster (region), respectively. In ArcGIS, this is achieved by (1) creating a new field (say, F1XP) in the zip code area layer and calculating it as F1XP = Fact1*Pop, (2) summing up F1XP and Pop within each cluster to obtain the numerator and denominator in Equation 9.4, respectively, and (3) taking the ratio of the two to yield the weighted average for Fact1.

TABLE 9.3
Regression Results for Late-Stage Breast Cancer Risks in the Chicago Region in 2000

                                             | OLS, Zip Code Areas (n = 292) | OLS, New Areas (n = 195) | Poisson, Zip Code Areas (n = 292)
Intercept                                    | 0.3848***  | 0.3691***  | −0.9819***
Fact 1 (socioeconomic disadvantages)         | 0.0333*    | 0.0426***  | 0.1255***
Fact 2 (sociocultural barriers)              | 0.0333*    | 0.0464***  | 0.1475***
AccDoc (spatial access to primary care)      | −0.0289**  | −0.0281*** | −0.0982***
T_Msite (travel time to nearest mammography) | 0.0018     | −0.0026    | 0.0093
Goodness-of-fit                              | R2 = 0.05  | R2 = 0.19  | Akaike's Information Criterion = 1239.1

Note: For OLS regression, the dependent variable is late-stage breast cancer rate; for Poisson regression, the dependent variable is late-stage breast cancer counts and the offset variable is total breast cancer counts; *Significant at 0.05; **Significant at 0.01; ***Significant at 0.001.


Specifically, add new fields F1XP, F2XP, AccDocXP, and T_MSiteXP to the attribute table of shapefile ChiZip, and calculate each of them as:

F1XP = Fact1*Pop
F2XP = Fact2*Pop
AccDocXP = AccDoc*Pop
T_MSiteXP = T_MSite*Pop

In ArcToolbox, choose Data Management Tools > Generalization > Dissolve, then (1) Input Features: ChiZip, (2) Output Feature Class: Reg_min15, (3) Dissolve_Field(s): Reg_min15.csv.regions195, (4) Statistics Fields: choose BrCancer, and then under Statistic Type, select SUM; repeat for the other six fields: BrCancerLS, Pop, F1XP, F2XP, AccDocXP, T_MSiteXP. Click OK to execute it. The screen shot looks like Figure 9.7. The resulting shapefile Reg_min15 contains the newly constructed regions (areas). Examining its attribute table, most field names are truncated. Following the order of the statistics fields input above, we can identify that the summed-up values of BrCancer, BrCancerLS, Pop, F1XP, F2XP, AccDocXP, and T_MSiteXP are now labeled SUM_ChiZip, SUM_ChiZ_1, SUM_ChiZ_2, SUM_ChiZ_3, SUM_ChiZ_4, SUM_ChiZ_5, and SUM_ChiZ_6, respectively. Therefore, the aggregated values for the dependent and independent variables are obtained by adding the following fields to the attribute table of Reg_min15 and calculating them as

BrLSRate = SUM_ChiZ_1/SUM_ChiZip
Fact1 = SUM_ChiZ_3/SUM_ChiZ_2
Fact2 = SUM_ChiZ_4/SUM_ChiZ_2
AccDoc = SUM_ChiZ_5/SUM_ChiZ_2
T_Msite = SUM_ChiZ_6/SUM_ChiZ_2
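The same aggregation (the Dissolve with SUM statistics plus the ratio fields) can be scripted with geopandas and pandas. The sketch below assumes the REDCAP output Reg_min15.csv and the field names above (the exact field capitalization should match the attribute table), and that ObjectID matches the shapefile's row order (FID):

import geopandas as gpd
import pandas as pd

zips = gpd.read_file("ChiZip.shp")
labels = pd.read_csv("Reg_min15.csv")                      # ObjectID, regions195
zips = zips.merge(labels, left_index=True, right_on="ObjectID")

# population-weighted numerators, equivalent to the F1XP-style fields above
for f in ["Fact1", "Fact2", "AccDoc", "T_MSite"]:
    zips[f + "XP"] = zips[f] * zips["Pop"]

keep = ["geometry", "regions195", "Pop", "BrCancer", "BrCancerLS",
        "Fact1XP", "Fact2XP", "AccDocXP", "T_MSiteXP"]
regions = zips[keep].dissolve(by="regions195", aggfunc="sum")

regions["BrLSRate"] = regions["BrCancerLS"] / regions["BrCancer"]
for f in ["Fact1", "Fact2", "AccDoc", "T_MSite"]:
    regions[f] = regions[f + "XP"] / regions["Pop"]        # Equation 9.4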

Step 9. Mapping late-stage breast cancer rates in the new areas in ArcGIS: In ArcMap, map late-stage breast cancer rates in the new areas (using the field BrLSRate of shapefile Reg_min15), as shown in Figure 9.8. Its distribution has a narrower range of 0.03–0.54 (Table 9.2) and resembles a normal distribution (Figure 9.6b) more closely than that of late-stage rates in zip code areas.

Step 10. Hot-spot analysis of late-stage breast cancer rates in the new areas in ArcGIS: Similar to step 4 in Section 8.7.1, conduct the hot-spot analysis of late-stage breast cancer rates in the new areas, as shown in Figure 9.9. Three cold spots are observed about 30 km west and northwest of downtown Chicago, and several hot spots are farther to the west, southwest, and northwest.

Step 11. Regression analysis of late-stage breast cancer rates in the new areas in ArcGIS: Similar to step 6, run the OLS regression based on the new areas. The result is also presented in Table 9.3. In comparison to the OLS model on zip code areas, the goodness of fit of the model improves significantly with higher statistical significance levels for independent variables Fact1, Fact2, and AccDoc. This is similar to the effect of using the Poisson regression on the zip code areas. In other words, the Poisson model indeed is able to achieve an effect similar to constructing larger analysis areas by mitigating the small population problem.


FIGURE 9.7 Screen shot for “Dissolve” in data aggregation.

Since the new areas now have a minimum case count of 15, there is no need for Poisson regression.

9.5 SUMMARY

In geographic areas with few events (e.g., cancer, AIDS, homicide), rate estimates are often unreliable because of high sensitivity to data errors associated with small numbers. Researchers have proposed various approaches to mitigate the problem.

FIGURE 9.8 Late-stage breast cancer rates in newly defined areas in Chicago in 2000.

Applications are particularly rich in criminology and health studies. Among various methods, geographic approaches seek to construct larger geographic areas so that more stable rates may be obtained. The spatial order method is fairly primitive, and does not consider whether areas grouped together are homogeneous in attributes. The modified scale–space clustering (MSSC) method accounts for attribute homogeneity while grouping adjacent geographic areas together. It is guided by some objectives such as entropy maximization (i.e., minimizing loss of information). One shortcoming of the MSSC method is that it does not guarantee that the derived regions have population (or total event counts) above a threshold, a desirable property in many health-related studies.

FIGURE 9.9 Hot and cold spots of late-stage breast cancer rates in newly defined areas in the Chicago region in 2000.

Like the MSSC and other modern regionalization methods automated in GIS, the REDCAP method accounts for both spatial contiguity (only merging adjacent areas) and attribute homogeneity (only grouping similar areas). One additional advantage of REDCAP is its ability to accommodate the constraint of a minimum population size in the new regions. This may be attributable to its two-step process: (1) bottom–up hierarchical clustering and (2) top–down tree partitioning. The latter is modified


to enforce the constraint of a threshold region population. A case study on analysis of late-stage cancer risks is developed to illustrate the method. Appendix 9B introduces another GIS-based regionalization method that also accounts for the threshold population criterion.

APPENDIX 9A: POISSON-BASED REGRESSION ANALYSIS

This appendix is based on the method used by Osgood (2000). Assuming that the timing of the events is random and independent, the Poisson distribution characterizes the probability of observing any discrete number (0, 1, 2, …) of events for an underlying mean count. When the mean count is low (e.g., in a small population), the Poisson distribution is skewed toward low counts. In other words, only these low counts have meaningful probabilities of occurrence. When the mean count is high, the Poisson distribution approaches the normal distribution, and a wide range of counts have meaningful probabilities of occurrence. The basic Poisson regression model is given by

\ln(\lambda_i) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k,    (A9.1)

where λ_i is the mean (expected) number of events for case i, x's are explanatory variables, and β's are regression coefficients. Note that the left-hand side in Equation A9.1 is the logarithmic transformation of the dependent variable. The probability of an observed outcome y_i follows the Poisson distribution, given the mean count λ_i, such that

\Pr(Y_i = y_i) = \frac{e^{-\lambda_i} \lambda_i^{y_i}}{y_i!}.    (A9.2)

Equation A9.2 indicates that the expected distribution of counts depends on the fitted mean count λ_i. In many studies, it is the rates and not the counts of events that are of most interest to analysts. Denoting the population size for case i as n_i, the corresponding rate is λ_i/n_i. The regression model for rates is written as

\ln\left(\frac{\lambda_i}{n_i}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k,

that is,

\ln(\lambda_i) = \ln(n_i) + \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k.    (A9.3)

Equation A9.3 adds the population size ni (with a fixed coefficient of 1) to the basic Poisson regression model in Equation A9.1, and transforms the model of analyzing counts to a regression model of analyzing rates. The model is the Poisson-based


regression that is standardized for the size of base population, and solutions can be found in many statistical packages such as SAS (PROC GENMOD) by defining the base population as the offset variable. Note that the variance of the Poisson distribution is the mean count λ, and thus its standard deviation is SD_\lambda = \sqrt{\lambda}. The mean count of events λ equals the underlying per capita rate r multiplied by the population size n, that is, λ = rn. When a variable is divided by a constant, its standard deviation is also divided by the constant. Therefore, the standard deviation of rate r is

SD_r = SD_\lambda / n = \sqrt{\lambda}/n = \sqrt{rn}/n = \sqrt{r/n}.    (A9.4)

Equation A9.4 shows that the standard deviation of the per capita rate r is inversely related to the population size n, that is, the problem of heterogeneity of error variance discussed in Section 9.1. The Poisson-based regression explicitly addresses the issue by acknowledging the greater precision of rates in larger populations. For implementing the Poisson regression in SAS, see the sample program poisson_zip.sas available under the data folder Chicago.
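The book's implementation uses SAS; as an illustration only, an equivalent offset formulation in Python (via the statsmodels package) is sketched below, assuming a hypothetical table cancer.csv with an event count, a population at risk, and two predictors x1 and x2 (the file and column names are made up for the example).

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical input: one row per area, with an event count, a population
# at risk, and explanatory variables (column names are illustrative).
df = pd.read_csv("cancer.csv")

X = sm.add_constant(df[["x1", "x2"]])   # beta_0 + beta_1*x1 + beta_2*x2
offset = np.log(df["pop"])              # ln(n_i) enters with a fixed coefficient of 1

# Poisson regression of counts with ln(population) as the offset,
# equivalent to modeling the rate lambda_i / n_i (Equation A9.3).
model = sm.GLM(df["count"], X, family=sm.families.Poisson(), offset=offset)
result = model.fit()
print(result.summary())
```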

APPENDIX 9B: TOOLKIT OF THE MIXED-LEVEL REGIONALIZATION METHOD*

The mixed-level regionalization (MLR) method decomposes areas of large population (to gain more spatial variability) and merges areas of small population (to mask privacy of data) in order to obtain regions of comparable population. The method is built on the modified Peano curve algorithm (MPC) and the modified scale–space clustering (MSSC) by accounting for spatial connectivity and compactness, attributive homogeneity, and exogenous criteria such as minimum (and approximately equal) population and/or disease counts. For instance, for rural counties with a small population (or a low cancer count), it is desirable to group counties to form regions of similar size; and for urban counties with a large population (or a high cancer count), it is necessary to segment each into multiple regions, also of similar size, where each region is composed of lower-level areas (e.g., census tracts). The method attempts to generate regions that are recognizable by preserving county boundaries as much as possible. One conceivable application of the method is the design and delineation of cancer data release regions that require a minimum population of 20,000 and minimum cancer counts greater than 15 for privacy concerns. The resulting regions are mixed-level, with some tract regions (sub-county level), some single-county regions, and some multi-county regions. To implement the MLR method, an ArcGIS toolkit, "Mixed-Level Regionalization Tools," has been developed; it contains one tool, "Mixed-Level

Suggested citation for using this tool: Mu, L. and F. Wang. 2015. Appendix 9B: A toolkit of the mixedlevel regionalization method. In Quantitative Methods and Socioeconomic Applications in GIS (2nd ed.). Boca Raton, FL: Taylor & Francis. pp. 213–215.


Regionalization” (available under the data folder Chicago). The user interface of the tool is shown in Figure A9.1 and the user inputs of the tool are summarized in Table A9.1. The following illustrates briefly the major steps for using the tool. 1. Specify input polygon shapefile or feature class that has multi-level geographic identification, for instance, tract and county (label 1 in Figure A9.1). 2. Choose variables from the input to be used for regionalization such as


FIGURE A9.1 User interface of the MLR method.



TABLE A9.1 Items to Be Defined in the MLR Toolkit Interface

Label on Figure A9.1 | Type of Input or Relationship | Description
1  | Input feature class      | The input polygon shapefile or feature class
2  | User-named field         | A new field for the spatial order
3  | User-named field         | A new field for the attributive order
4  | User-named field         | A new field for the normalized spatial order
5  | User-named field         | A new field of integrated order (both spatial and attribute)
6  | Existing field           | Fields to be used for attribute order calculation
7  | User-specified value     | Weights of the above attributes
8  | User-specified value     | The percentage of spatial consideration (0–100%)
9  | Existing field           | Fields to be used as constraints in regionalization
10 | User-specified value     | Weights of the above constraints
11 | User-named field         | A new field to record the cluster membership at one level
12 | User-named field         | A new field to record whether a unit is isolated
13 | Existing field           | The field to identify the upper geographic level in MLR
14 | User-named field         | A new field to record the types of mixed-level clusters
15 | Existing field           | The field used for weighted aggregation
16 | User-named field         | A new field to record the mixed-level cluster ID
17 | User-named feature class | A new feature class or shapefile to be created for MLR results
18 | Pull-down field list     | The list of fields generated from the input
19 | Pull-down field list     | The list of fields generated from the input
20 | Pull-down field list     | The list of fields generated from the input
21 | Pull-down field list     | The list of fields generated from the input
22 | Pairwise                 | A list of weights corresponding to a list of attributes
23 | Pairwise                 | A list of capacities corresponding to a list of constraints

a. Attributes and weights to be considered to achieve homogeneity (labels 18, 6, 7, and 22 in Figure A9.1)
b. Constraints to be included such as minimum population and minimum cancer count (labels 19, 9, 10, and 23 in Figure A9.1)
c. Upper geographic boundaries to be preserved (labels 20 and 13 in Figure A9.1)
d. Aggregation weights (labels 21 and 15 in Figure A9.1)
3. Name new variables for intermediate or final results (labels 2, 3, 4, 5, 8, 11, 12, 14, and 16 in Figure A9.1; also refer to Table A9.1 for details).
4. Specify the output filename (label 17 in Figure A9.1).

For a detailed discussion of the method, see Mu et al. (forthcoming).

10

System of Linear Equations and Application of Garin–Lowry Model in Simulating Urban Population and Employment Patterns

This chapter introduces the method for solving a system of linear equations (SLE). The technique is used in many applications including the popular input–output analysis (see Appendix 10A for a brief introduction). The method is fundamental in numerical analysis (NA), and is often used as a building block in other NA tasks such as solving a system of nonlinear equations and the eigenvalue problem. Appendix 10B shows how the solution of a system of nonlinear equations utilizes the solution method of SLE. Here, the SLE is illustrated in the Garin–Lowry model, a model widely used by urban planners and geographers for analyzing urban land use structure. A program written in Python implements the Garin–Lowry model and makes the model calibration easy to replicate with a user friendly interface. A case study using a hypothetical city shows how the distributions of population and employment interact with each other and how the patterns can be affected by various transportation networks. The GIS usage in the case study involves the computation of a travel time matrix and other data preparation tasks.

10.1 SYSTEM OF LINEAR EQUATIONS

A system of n linear equations with n unknowns x1, x2, …, xn is written as

a11x1 + a12x2 + … + a1nxn = b1
a21x1 + a22x2 + … + a2nxn = b2
⋮
an1x1 + an2x2 + … + annxn = bn.


In the matrix form, it is

[ a11  a12  …  a1n ] [ x1 ]   [ b1 ]
[ a21  a22  …  a2n ] [ x2 ] = [ b2 ]
[  ⋮    ⋮        ⋮ ] [ ⋮  ]   [ ⋮  ]
[ an1  an2  …  ann ] [ xn ]   [ bn ]

or simply Ax = b.   (10.1)

If the matrix A has a diagonal structure, system (10.1) becomes

[ a11   0   …   0  ] [ x1 ]   [ b1 ]
[  0   a22  …   0  ] [ x2 ] = [ b2 ]
[  ⋮    ⋮        ⋮ ] [ ⋮  ]   [ ⋮  ]
[  0    0   …  ann ] [ xn ]   [ bn ].

The solution is simple, and it is xi = bi/aii. If aii = 0 and bi = 0, xi can be any real number; and if aii = 0 and bi ≠ 0, there is no solution for the system. There are two other simple systems with easy solutions. If the matrix A has a lower triangular structure (i.e., all elements above the main diagonal are 0), system (10.1) becomes

[ a11   0   …   0  ] [ x1 ]   [ b1 ]
[ a21  a22  …   0  ] [ x2 ] = [ b2 ]
[  ⋮    ⋮        ⋮ ] [ ⋮  ]   [ ⋮  ]
[ an1  an2  …  ann ] [ xn ]   [ bn ].

Assuming aii ≠ 0 for all i, the forward substitution algorithm is used to solve the system by obtaining x1 from the first equation, substituting x1 in the second equation to obtain x2, and so on. Similarly, if the matrix A has an upper triangular structure (i.e., all elements below the main diagonal are 0), system (10.1) becomes

[ a11  a12  …  a1n ] [ x1 ]   [ b1 ]
[  0   a22  …  a2n ] [ x2 ] = [ b2 ]
[  ⋮    ⋮        ⋮ ] [ ⋮  ]   [ ⋮  ]
[  0    0   …  ann ] [ xn ]   [ bn ].


The back substitution algorithm is used to solve the system. By converting system (10.1) into the simple systems as discussed above, one may obtain the solution for a general system of linear equations. Say, if the matrix A can be factored into the product of a lower triangular matrix L and an upper triangular matrix U such as A = LU, system (10.1) can be solved in two stages: 1. Lz = b, solve for z 2. Ux = z, solve for x The first one can be solved by the forward substitution algorithm, and the second one by the back substitution algorithm. Among various algorithms for deriving the LU-factorization (or LU-decomposition) of A, one called Gaussian elimination with scaled row pivoting is used widely as an effective method. The algorithm consists of two steps: a factorization (or forward elimination) phase and a solution (involving updating and back substitution) phase (Kincaid and Cheney, 1991, p. 145). Computation routines for the algorithm of Gaussian elimination with scaled row pivoting can be found in various computer languages such as FORTRAN (Press et al., 1992a; Wang, 2006, pp. 234–241), C (Press et al., 1992b), and C++ (Press et al., 2002). One may also use commercial software MATLAB (www.mathworks.com) or Mathematica (www.wolfram.com) for the task of solving a system of linear equations.
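As an illustration of this two-stage LU solution, a minimal Python sketch using SciPy is given below; note that SciPy's routine uses partial pivoting rather than the scaled row pivoting described above, and the matrix shown is an arbitrary example.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

# A small system Ax = b solved via LU factorization:
# stage 1 factors A; stage 2 performs the forward and back substitutions.
A = np.array([[2.0, 1.0, 1.0],
              [4.0, -6.0, 0.0],
              [-2.0, 7.0, 2.0]])
b = np.array([5.0, -2.0, 9.0])

lu, piv = lu_factor(A)      # factorization (forward elimination) phase
x = lu_solve((lu, piv), b)  # solution (updating and back substitution) phase

print(x)
print(np.allclose(A @ x, b))   # the solution satisfies Ax = b
```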

10.2 GARIN–LOWRY MODEL

10.2.1 Basic versus Nonbasic Economic Activities

There has been an interesting debate on the relation between population and employment distributions in a city. Does population follow employment (i.e., workers find residences near their workplaces to save commuting)? Or vice versa (i.e., businesses locate near residents for recruiting workforce or providing services)? The Garin–Lowry model (Lowry, 1964; Garin, 1966) argues that population and employment distributions interact with each other and are interdependent. However, different types of employment play different roles. The distribution of basic employment is independent of the population distribution pattern and may be considered as exogenous. Service (nonbasic) employment follows population. On the other side, the population distribution is determined by the distribution patterns of both basic and service employment (see Figure 10.1 for illustration). The interactions between employment and population decline with distances, which are defined by a transportation

FIGURE 10.1 Interaction between population and employment distributions in a city.


network. Unlike the urban economic model built on the assumption of monocentric employment (see Section 6.2), the Garin–Lowry model has the flexibility of simulating a population distribution pattern corresponding to any given basic employment pattern. It can be used to examine the impact of basic employment distribution on population as well as that of transportation network. The binary division of employment into basic and service employment is based on the concept of basic and nonbasic activities. A local economy (a city or a region) can be divided into two sectors: basic sector and nonbasic sector. The basic sector refers to goods or services that are produced within the area but sold outside of the area. It is the export or surplus that is independent of the local economy. The nonbasic sector refers to goods or services that are produced within the area and also sold within the area. It is local or dependent, and serves the local economy. By extension, basic employment refers to workers in the basic sector, and service employment refers to those in the nonbasic sector. This concept is closely related to the division of primary and secondary activities. According to Jacobs (1961, pp. 161–162), primary activities are “those which, in themselves, bring people to a specific place because they are anchorages,” including manufactures and nodal activities at the metropolitan and regional levels; and secondary activities are “enterprises that grow in response to the presence of primary uses, to serve the people the primary uses draw.” To some extent, primary and secondary activities correspond to basic and nonbasic activities, respectively. These two activities may also have distinctive location preference as exemplified in correlations with various centrality measures introduced in Appendix 6B (Porta et al., 2012). The concept of basic and nonbasic activities is useful for several reasons (Wheeler et  al., 1998, p. 140). It identifies the basic activities as being the most important to a city’s viability. Expansion or recession of the basic sector leads to economic repercussions throughout the city and affects the nonbasic sector. City and regional planners forecast the overall economic growth based on anticipated or predicted changes in the basic activities. A common approach to determine employment in basic and nonbasic sectors is the minimum requirements approach by Ullman and Dacey (1962). The method examines many cities of approximately the same population size, and computes the percentage of workers in a particular industry for each of the cities. If the lowest percentage represents the minimum requirements for that industry in a city of a given population size range, that portion of the employment is engaged in the nonbasic or city-serving activities. Any portion beyond the minimum requirements is then classified as basic activity. Classifications of basic and nonbasic sectors can also be made by analyzing export data (Stabler and St. Louis, 1990).

10.2.2 Model's Formulation

In the Garin–Lowry model, an urban area is composed of n tracts. The population in any tract j is affected by employment (including both the basic and service employment) in all n tracts; and the service employment in any tract i is determined by population in all n tracts. The degree of interaction declines with distance measured by a gravity kernel. Given a basic employment pattern and a distance matrix, the model computes the population and service employment at various locations.


First, the service employment in any tract i, Si, is generated by the population in all tracts j (j = 1, 2, …, n), Pj, through a gravity kernel tij, with

Si = e ∑j=1..n (Pj tij) = e ∑j=1..n [ Pj ( dij^−α / ∑l=1..n dlj^−α ) ],   (10.2)

where e is the service employment/population ratio (a simple scalar uniform across all tracts), dij (or dlj) the distance between tracts i (or l) and j, and α the distance friction coefficient characterizing shopping (resident-to-service) behavior. The gravity kernel tij represents the proportion of service employment in tract i owing to the influence of population in tract j, out of its impacts on all tracts l. In other words, the service employment at i is a result of summed influences of population at all tracts j (j = 1, 2, …, n), each of which is only a fraction of its influences on all tracts l (l = 1, 2, …, n).

Second, the population in any tract j, Pj, is determined by the employment in all tracts i (i = 1, 2, …, n), Ei, through a gravity kernel gij, with

Pj = h ∑i=1..n (Ei gij) = h ∑i=1..n [ (Bi + Si) ( dij^−β / ∑k=1..n dik^−β ) ],   (10.3)

where h is the population/employment ratio (also a scalar uniform across all tracts), and β the distance friction coefficient characterizing commuting (resident-to-workplace) behavior. Note that employment Ei includes both service employment Si and basic employment Bi, that is, Ei = Si + Bi. Similarly, the gravity kernel gij represents the proportion of population in tract j owing to the influence of employment in tract i, out of its impacts on all tracts k (k = 1, 2, …, n). Let P, S, and B be the vectors defined by the elements Pj, Si, and Bi, respectively, and G and T the matrices defined by gij (with the constant h) and tij (with the constant e), respectively. Equations 10.2 and 10.3 become: S = TP,

(10.4)

P = GS + GB.

(10.5)

Combining (10.4) and (10.5) and rearranging, we have (I − GT)P = GB,

(10.6)

where I is the n × n identity matrix. Equation 10.6, in matrix form, is a system of linear equations with the population vector P unknown. Four parameters are given: the distance friction coefficients α and β, the population/employment ratio h, and the service employment/population ratio e; the distance matrix d is derived from a road network; and the basic employment B is predefined. Plugging the solution P back into Equation 10.4 yields the service employment vector S. For a more detailed discussion of the model, see Batty (1983). The following subsection offers a simple example to illustrate the model.
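Before turning to that example, the matrix solution of Equations 10.2 through 10.6 can be sketched compactly in Python with NumPy. The function below is only an illustration (the names d, B, and garin_lowry are ours, not the book's Python toolkit), assuming d is an n × n matrix of travel distances with positive intrazonal values on the diagonal and B is the basic employment vector.

```python
import numpy as np

def garin_lowry(d, B, e=0.3, h=2.0, alpha=1.0, beta=1.0):
    """Solve the Garin-Lowry model for population P and service employment S."""
    w_a = d ** (-alpha)                    # d_ij^-alpha; intrazonal distances must be > 0
    w_b = d ** (-beta)
    T = e * (w_a / w_a.sum(axis=0))        # T[i, j] = e * d_ij^-a / sum_l d_lj^-a
    G = (h * (w_b / w_b.sum(axis=1, keepdims=True))).T   # G[j, i] = h * d_ij^-b / sum_k d_ik^-b
    I = np.eye(d.shape[0])
    P = np.linalg.solve(I - G @ T, G @ B)  # Equation 10.6: (I - GT) P = GB
    S = T @ P                              # Equation 10.4: S = TP
    return P, S
```

Called with the five-tract distance matrix of Section 10.2.3 and B = [1, 0, 0, 0, 0], this sketch should approximately reproduce the values derived there (P1 ≈ 1.75, P2 ≈ 0.81, S1 ≈ 0.41, S2 ≈ 0.27).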


10.2.3 An Illustrative Example

See Figure 10.2 for an urban area with five (n = 5) equal-area square tracts. The dashed lines are roads connecting them. Only two tracts (say, tract 1 and 2, shaded in the figure) need to be differentiated, and carry different population and employment. Others (3, 4, and 5) are symmetric of tract 2. Assume all basic employment is concentrated in tract 1 and normalized as 1, that is, B1 = 1, B2 = B3 = B4 = B5 = 0. This normalization implies that population and employment are relative, since we are only interested in their variation over space. The distance between tracts 1 and 2 is a unit 1, and the distance within a tract is defined as 0.25 (i.e., d11 = d22 = d33 = … = 0.25). Note that the distance is the travel distance through the transportation network (e.g., d23 = d21 + d13 = 1 + 1 = 2). For illustration, define constants e = 0.3, h = 2.0, α = 1.0, and β = 1.0. From Equation 10.2, after taking advantage of the symmetric property (i.e., tracts 2, 3, 4, and 5 are equivalent in locations relative to tract 1), we have

S1 = 0.3 [ d11^−1/(d11^−1 + d21^−1 + d31^−1 + d41^−1 + d51^−1) P1 + d12^−1/(d12^−1 + d22^−1 + d32^−1 + d42^−1 + d52^−1) P2 × 4 ].

That is,

S1 = 0.1500P1 + 0.1846P2,

(10.7)

where the distance terms have been substituted by their values. Similarly,

S2 = 0.3 [ d21^−1/(d11^−1 + d21^−1 + d31^−1 + d41^−1 + d51^−1) P1 + d22^−1/(d12^−1 + d22^−1 + d32^−1 + d42^−1 + d52^−1) P2 + d23^−1/(d13^−1 + d23^−1 + d33^−1 + d43^−1 + d53^−1) P3 × 2 + d24^−1/(d14^−1 + d24^−1 + d34^−1 + d44^−1 + d54^−1) P4 ],


FIGURE 10.2 A simple city for illustration.


where tracts 3 and 5 are equivalent in locations relative to tract 2. Noting that P4 = P3 = P2, the above equation is simplified as S2 = 0.0375P1 + 0.2538P2.

(10.8)

From Equation 10.3, we have P1 = S1 + 1.2308S2 + 1,

(10.9)

P2 = 0.2500S1 + 1.6923S2 + 0.25.

(10.10)

Solving the system of linear equations (Equations 10.7 through 10.10), we obtain: P1 = 1.7472, P2 = 0.8136; S1 = 0.4123, S2 = 0.2720. Both the population and service employment are higher in the central tract than others.
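As a check on these numbers, Equations 10.7 through 10.10 can be rearranged into the form Ax = b of Section 10.1 and solved numerically; a short NumPy sketch (ordering the unknowns as P1, P2, S1, S2) follows.

```python
import numpy as np

# Rearranged Equations 10.7-10.10 with unknowns ordered [P1, P2, S1, S2]:
#   -0.1500*P1 - 0.1846*P2 + S1            = 0
#   -0.0375*P1 - 0.2538*P2       + S2      = 0
#    P1                  - S1 - 1.2308*S2  = 1
#          P2 - 0.25*S1 - 1.6923*S2        = 0.25
A = np.array([[-0.1500, -0.1846, 1.0, 0.0],
              [-0.0375, -0.2538, 0.0, 1.0],
              [1.0, 0.0, -1.0, -1.2308],
              [0.0, 1.0, -0.25, -1.6923]])
b = np.array([0.0, 0.0, 1.0, 0.25])

P1, P2, S1, S2 = np.linalg.solve(A, b)
print(P1, P2, S1, S2)   # approximately 1.747, 0.814, 0.412, 0.272
```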

10.3 CASE STUDY 10: SIMULATING POPULATION AND SERVICE EMPLOYMENT DISTRIBUTIONS IN A HYPOTHETICAL CITY

The hypothetical city is here assumed to be partitioned by a transportation network made of 10 circular rings and 15 radial roads (see Figure 10.3). Areas around the central business district (CBD) form a unique downtown tract, and thus the city has 1 + (9 × 15) = 136 tracts. The hypothetical city does not have any geographic coordinate system or unit for distance measurement as we are only interested in the spatial structure of the city. The case study is built upon the work reported in Wang (1998), but revised extensively for clarity of illustration. The project primarily uses the file geodatabase SimuCity.gdb under the data folder SimuCity, which contains:

1. A polygon feature tract (136 tracts)
2. A point feature trtpt (136 tract centroids)
3. A feature dataset road for the road network

Others include two distance matrices, ODTime.dbf and ODTime1.dbf (results from steps 1–3, but provided here for convenience), and the toolkit file Garin-Lowry.tbx. Features tract and trtpt contain similar attribute fields: area and perimeter for each tract; BEMP_CBD and BEMP_Unif for predefined basic employment in the basic case and the uniform distribution case (see steps 4 and 5); POP_CBD, POP_Unif, POP_A2B2, and POP_Belt for population to be calibrated in the four scenarios (the basic case, the uniform distribution case, the case with α = β = 2.0 in step 6, and the case with the beltway in step 7); and SEMP_CBD, SEMP_Unif, SEMP_A2B2, and SEMP_Belt for service employment to be calibrated in the corresponding four scenarios. Prior to the construction of its corresponding network dataset, the road network feature dataset road contains a single feature road, which has two important attribute fields, length and length1. Field length is the length of each road segment and defines the standard impedance values in the network travel time computation. In this case, travel speed is assumed to be uniform on all roads, and travel distance is equivalent to travel time for measuring network impedance. Another field


FIGURE 10.3 Spatial structure of a hypothetical city (with no scale or orientation).

length1 is defined as 1/2.5 of length for the 7th ring road and the same as length for others, and will be used to define the new impedance values for examining the impact of a suburban beltway in step 7. That is to say, the travel speed on the beltway in the 7th ring is assumed to be 2.5 times (e.g., 75 mph) of the speed on others (e.g., 30 mph), and thus its travel time is 1/2.5 of others. Step 1. Computing network travel time matrices in the basic case: If necessary, refer to detailed instructions in Section 2.4.2 for computing network travel time. The following provides a brief guideline. In ArcCatalog, right-click the feature dataset road > New > Network dataset. Name the new network dataset road_ND, follow the instruction in step 8 of Section 2.4.2 to build the network dataset (make sure to select Length when “Specify the attributes for the network dataset”*). Skip (close) the warning message for errors in the “Network Dataset Build Report” (since the spatial data in this study does not *

But select Length1 when computing the travel time matrix ODTime1.dbf in the case with a beltway.


have any coordinates or projection). The newly created network dataset road_ND and feature class road_ND_Junctions are both placed under the feature dataset road. In ArcMap, use the O-D cost matrix tool in the Network Analyst module to compute the travel time matrix ODTime.dbf in the basic case (see steps 9–12 in Section 2.4.2). Note that the point feature trtpt is loaded to define both origins and destinations, and the result is contained in the feature Lines with 136 × 136 = 18,496 records (its field Total _ Length is the total network distance between two tracts). Step 2. Amending network travel time matrices by accounting for intrazonal travel time in the basic case: One has to account for the intrazonal travel distance not only to avoid zero-distance terms in calibration of gravity model but also to capture more realistic travel impedance (Wang, 2003). In this case, the average within-tract travel distance is approximated as 1/4 of the tract perimeters.* This can be also considered as the segment that one residing in a tract has to travel in order to reach the nearest junction on the road network (see p. 44 in Chapter 2). Therefore, the total network travel distance between two tracts is composed of the aforementioned network distance segment and both intratract travel distances at the origin and destination tracts.† Save the network distance (time) matrix in dBASE file ODTime.dbf, where the fields OriginID, Destinatio, and NetwkTime represent origin tract ID, destination tract ID and amended distance (time) between them. Step 3. Computing network travel time matrices with a suburban beltway: Repeat steps 1–2 by using length1 instead of length as the travel impedance values when creating a new network dataset (say, road_1), and output the travel times to a similar dBASE file ODTime1.dbf, where a different field name NetwkTime1 is used to represent the time between tracts in the case with a suburban beltway. In this case, travel time is no longer equivalent to travel distance because the speed on the 7th ring road is faster than others. Both files ODTime.dbf and ODTime1.dbf are provided under the folder SimuCity for convenience. Step 4. Simulating distributions of population and service employment in the basic case: The basic case, as in the monocentric model, assumes that all basic employment (say, 100) is concentrated at the CBD. In addition, the basic case assumes that α = 1.0 and β = 1.0 for the two distance friction coefficients in the gravity kernels. The values of h and e in the model are set equal to 2.0 and 0.3, respectively, based on data from the Statistical Abstract of the United States (Bureau of Census, 1993). If PT, BT, and ST are the total population, and total basic, and total service employments, respectively, we have: ST = ePT and PT = hET = h(BT + ST), and thus: PT = (h/(1 − he))BT. As BT is normalized to 100, it follows that: PT = 500 and ST = 150. Keeping h, e, and BT constant throughout the analysis implies that the total population and employment (basic and service) remain constant. Our focus is on the *



Another alternative proposed in the literature is to approximate the intrazonal distance as the radius of a circle that has an identical area with that of each zone (Frost et al., 1998). In implementation, join the features trtpt and tract to Lines (note that the feature ids are identical in trtpt and tract) and export the attribute table of Lines to a dBASE file ODTime. Add a new field NetwTime to ODTime and calculate it as [Total_Leng] + 0.25*[PERIMETER] + 0.25*[PE RIMETE_1] (note that some field names are truncated).



effects of exogenous variations in the spatial distribution of basic employment and in the values of the travel friction parameters α and β, and on the impact of building a suburban beltway. Use the toolkit Garin-Lowry.tbx to implement the Garin–Lowry model (see Appendix 10C). Since the distance (travel time) table is defined in steps 1–2, the tool “Garin–Lowry model (w External distance table)” is used here (and other steps throughout this case study). Input the items in the interface as shown in Figure A10.1. Note that the basic employment pattern is defined in the field BEMP_CBD with its value = 100 in the CBD tract and 0 elsewhere, and it saves the results, that is, the number of population and service employment, in the predefined fields POP_CBD and SEMP_CBD, respectively. Since the values are identical for tracts on the same ring, 10 tracts from different rings along the same direction are selected.* The results of simulated population and service employment are reported in the first column under Population and the first column under Service Employment, respectively, in Table 10.1. Figure 10.4 and Figure 10.5 show the population and service employment patterns, respectively. Step 5. Examining the impact of basic employment pattern: To examine the impact of basic employment pattern, this project simulates the distributions of population and service employment given a uniform distribution of basic employment. In this case, all tracts have the same amount of basic employment, that is, 100/136 = 0.7353, which has been predefined in the field in BEMP_Unif. In the interface as shown in Figure A10.1, change the inputs for “Basic Employment Field,” “Service Employment Field (Output),” and “Population Field (Output)” to BEMP_Unif, SEMP_Unif, and POP_ Unif, respectively. The results are reported under “Uniform Basic Employment” in Table 10.1, and similarly in Figures 10.4 and 10.5. Note that both the population and service employment are declining from the city center even when the basic employment is uniform across space. That is to say, the declining patterns with distances from the CBD are largely due to the location advantage (better accessibility) near the CBD shaped by the transportation network instead of job concentration in the CBD. The job concentration in CBD in the basic case does enhance the effect, that is, both the population and service employment exhibit steeper slopes in the basic case (a monocentric pattern) than this case (uniform basic employment distribution). In general, service employment follows population, and their patterns are consistent with each other. One may design other basic employment patterns (e.g., polycentric) and examine how the population and service employment respond to the changes. See Guldmann and Wang (1998) for more scenarios of different basic employment patterns. Step 6. Examining the impact of travel friction coefficient: Keep all parameters in the basic case unchanged except the two travel friction coefficients α and β. Compared to the basic case where α = 1 and β = 1, this new case uses α = 2 and β = 2. Input most items the same as shown in Figure A10.1, but set “Service Employment Field (Output)” as SEMP_A2B2, “Population Field (Output)” as POP_A2B2, ALPHA as 2 *

For convenience, a field tract0_id is created on the feature tract (or trtpt) with default value 0, but 1–10 for 10 tracts along one sector. One may extract the population and service employment patterns in this sector to prepare Table 10.1.

TABLE 10.1 Simulated Scenarios of Population and Service Employment Distributions

Population

Location | Basic Case(a) | Uniform Basic Employment | α, β = 2 | With a Suburban Beltway
1  | 9.2023 | 5.3349 | 22.4994 | 9.0211
2  | 6.3483 | 5.0806 | 9.7767  | 6.1889
3  | 5.1342 | 4.6355 | 6.1368  | 4.9930
4  | 3.9463 | 4.1184 | 3.6171  | 3.8529
5  | 3.8227 | 3.9960 | 3.3542  | 3.7318
6  | 3.2860 | 3.5643 | 2.3907  | 3.2494
7  | 2.8938 | 3.2303 | 1.8372  | 2.9197
8  | 2.5789 | 2.9374 | 1.4520  | 2.9394
9  | 2.3168 | 2.6737 | 1.1679  | 2.4016
10 | 2.0924 | 2.4315 | 0.9487  | 2.1532

Service Employment

Location | Basic Case | Uniform Basic Employment | α, β = 2 | With a Suburban Beltway
1  | 1.7368 | 1.6368 | 3.2811 | 1.6700
2  | 1.6170 | 1.5540 | 2.5465 | 1.5536
3  | 1.4418 | 1.4091 | 1.8242 | 1.3839
4  | 1.2338 | 1.2384 | 1.2292 | 1.1946
5  | 1.1958 | 1.2008 | 1.1366 | 1.1576
6  | 1.0496 | 1.0632 | 0.8209 | 1.0345
7  | 0.9392 | 0.9578 | 0.6403 | 0.9515
8  | 0.8459 | 0.8666 | 0.5111 | 0.9831
9  | 0.7648 | 0.7858 | 0.4135 | 0.8037
10 | 0.6930 | 0.7130 | 0.3367 | 0.7209

(a) All basic employment is concentrated at the central business district; α, β = 1; travel speed is uniform on all roads.




FIGURE 10.4 Population distributions in various scenarios.

and BETA as 2. The results are reported under "α, β = 2" in Table 10.1, and similarly in Figures 10.4 and 10.5. Note the steeper slopes for both population and service employment in this case with larger α and β. The travel friction parameters indicate how much people's travel behavior (including both commuting to workplace and shopping) is affected by travel distance (time). As transportation technologies as well as road networks improve over time, these two parameters have generally declined. In other words, the new case with α = 2 and β = 2 may correspond to a city in earlier years. That explains the flattening population density gradient over time, an important observation in the study of population density patterns (see Chapter 6).

Step 7. Examining the impact of transportation network: Finally, we examine the impact of transportation network, in this particular case, the building of a suburban


FIGURE 10.5 Service employment distributions in various scenarios.



beltway. Assume that the 7th ring road is a faster beltway: if the average travel speed on other urban roads is 30 miles per hour (mph), the speed on the beltway is 75 mph. Step 3 has already generated a different travel time data set ODTime1.dbf based on the new assumption. Input most items the same as shown in Figure A10.1, but set "Service Employment Field (Output)" as SEMP_Belt, "Population Field (Output)" as POP_Belt, "Distance Matrix Table" as ODTime1.dbf, and "Distance Field" as NetwkTime1. The results are reported under "With a Suburban Beltway" in Table 10.1, and similarly in Figures 10.4 and 10.5. The distribution patterns of population and service employment are similar to those in the basic case, but both are slightly flatter (note the lower values near the CBD and higher values near the 10th ring). In other words, building the suburban beltway narrows the gap in location advantage between the CBD and suburbia, and thus leads to flatter population and service employment patterns. More importantly, the presence of a suburban beltway leads to a local peak in population and service employment around it (note the higher values at the 8th ring in Table 10.1). That explains why the construction of suburban beltways in most large metropolitan areas in the US helped accelerate the suburbanization process as well as the transformation of urban structure from monocentricity to polycentricity. See related discussion in Chapter 6 and Wang (1998).

10.4 DISCUSSION AND SUMMARY

The concept of basic and nonbasic activities emphasizes different roles in the economy played by basic and nonbasic sectors. The Garin–Lowry model uses the concept to characterize the interactions between employment and population distributions within a city. In the model, basic (export-oriented) employment serves as the exogenous factor whereas service (locally oriented) employment depends on the population distribution pattern; on the other side, the distribution pattern of population is also determined by that of employment (including both basic and nonbasic employment). The interactions decline with travel distances or times as measured by gravity kernels. Based on this, the model is constructed as a system of linear equations. Given a basic employment pattern and a road network (the latter defines the matrix of travel distances or times), the model solves for the distributions of population and service employment. Applying the Garin–Lowry model in analyzing real-world cities requires the division of employment into basic and nonbasic sectors. Various methods (e.g., the minimum requirements method) have been proposed to separate basic employment from the total employment. However, the division is unclear in many cases as most economic activities in a city serve both the city itself (nonbasic sector) and beyond (basic sector). The case study uses a hypothetical city to illustrate the impacts of basic employment patterns, travel friction coefficients, and transportation networks. The case study helps us understand the change of urban structure under various scenarios and explain many empirical observations in urban density studies. For example, suburbanization of basic employment in an urban area leads to dispersion of population as well as service employment; but the dispersion does not change the general pattern of higher concentrations of both population and employment toward the CBD. As the improvements in transportation technologies and road


networks enable people to travel farther in less time, the traditional accessibility gap between the central city and suburbs is reduced, leading to a more gradual decline in population toward the edge of an urban area. This explains the flattening density gradient over time as reported in many empirical studies on urban density functions. Suburban beltways were "originally intended primarily as a means of facilitating intercity travel by providing metropolitan bypass," but "it quickly became apparent that there were major unintended consequences for intracity traffic" (Taaffe et al., 1996, p. 178). The simulation in the case with a suburban beltway suggests a flatter population distribution pattern, and noticeably a suburban density peak near the beltway even when all basic employment is assumed to be located in the CBD. The model may be used to examine more issues in the study of urban structure. For example, comparing population patterns across cities with different numbers of ring roads sheds light on the issue of whether large cities exhibit flatter density gradients than smaller cities (McDonald, 1989, p. 380). Solving the model for a city with more radial or circular ring roads helps us understand the impact of road density. Simulating cities with different road networks (e.g., a grid system, a semicircular city) illustrates the impact of road network structure. In addition to the challenges of decomposing basic and nonbasic employment and defining a realistic distance decay function for the gravity kernels, one major limitation of the Garin–Lowry model is its high level of abstraction in characterizing urban land uses. In the model, a city is essentially composed of three types of land uses: population reflecting residential land use, and basic and nonbasic employment capturing a combination of industrial, commercial, and other uses where jobs are located. This simple conceptualization of urban land uses is also adopted in the urban traffic simulation model in Section 12.4 of Chapter 12. However, urban land uses are far more diverse and the interactions between them are much more dynamic than a gravity kernel. Appendix 10D introduces a cellular automata (CA) model as an example of advanced urban land use modeling.

APPENDIX 10A: INPUT–OUTPUT MODEL

The input–output model is widely used in economic planning at various levels of government (Hewings, 1985). In the model, the output from any sector is also the input for all sectors (including the sector itself); and the inputs to one sector are provided by the outputs of all sectors (including itself). The key assumption in the model is that the input–output coefficients connecting all sectors characterize the technologies for a time period, and remain unchanged over the period. The model is often used to examine how a change in production in one sector of the economy affects all other sectors, or how the productions of all sectors need to be adjusted in order to meet any changes in demands in the market. We begin with a simple economy of two industrial sectors to illustrate the model. Consider an economy with two sectors and their production levels: X1 for auto, and X2 for iron and steel. The total amount of auto has three components: (1) for each unit of output X1, a11 is used as input (and thus a total amount of a11X1) in the auto industry itself; (2) for each unit of output X2, a12 is used as input (and thus a total amount of a12X2) in the iron and steel industry; and (3) in addition to inputs that are consumed


within industries, d1 serves the final demand to consumers. Similarly, X2 has three components: a21X1 as the total input of iron and steel for the auto industry, a22X2 as the total input for the iron and steel industry itself, and d2 for the final demand for iron and steel in the market. It is summarized as

X1 = a11X1 + a12X2 + d1,
X2 = a21X1 + a22X2 + d2,

where the aij's are the input–output coefficients. In matrix form, IX = AX + D, where I is an identity matrix. Rearranging the equation yields

(I − A)X = D,

which is a system of linear equations. Given any final demands in the future D and the input–output coefficients A, we can solve for the productions of all industrial sectors X.
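As a quick numerical illustration, the two-sector system can be solved with the same linear-equation machinery as in Section 10.1; the coefficients and final demands below are made up solely for the example.

```python
import numpy as np

# Hypothetical input-output coefficients a_ij and final demands d_i.
A = np.array([[0.2, 0.1],     # auto used per unit of auto and of iron/steel output
              [0.3, 0.4]])    # iron/steel used per unit of auto and of iron/steel output
D = np.array([100.0, 200.0])  # final demand for auto and for iron/steel

# (I - A) X = D  ->  required production levels X1, X2
X = np.linalg.solve(np.eye(2) - A, D)
print(X)   # total output needed from each sector to satisfy the final demand
```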

APPENDIX 10B: SOLVING A SYSTEM OF NONLINEAR EQUATIONS

We begin with the solution of a single nonlinear equation by Newton's method. Say, f is a nonlinear function whose zeros are to be determined numerically. Let r be a real solution, and let x be an approximation to r. Keeping only the linear term in the Taylor expansion, we have

0 = f(r) = f(x + h) ≈ f(x) + hf′(x),

(A10.1)

where h is a small increment such as h = r − x. Therefore, h ≈ −f(x)/f′(x). If x is an approximation to r, x − f(x)/f′(x) should be a better approximation to r. Newton's method begins with an initial value x0 and uses iterations to gradually improve the estimate until the function reaches an error criterion. The iteration is defined as

xn+1 = xn − f(xn)/f′(xn).

The initial value assigned (x0) is critical for the success of using Newton’s method. It must be “sufficiently close” to the real solution (r) (Kincaid and Cheney, 1991, p. 65). Also, it only applies to a function whose first-order derivative has a definite form. The method can be extended to solve a system of nonlinear equations.
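Before moving to systems of equations, a minimal Python sketch of the scalar iteration is shown below; the function, its derivative, and the starting value are arbitrary examples.

```python
def newton(f, fprime, x0, tol=1e-10, max_iter=50):
    """Scalar Newton's method: iterate x <- x - f(x)/f'(x) until convergence."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:        # error criterion on the update size
            return x
    raise RuntimeError("Newton's method did not converge")

# Example: root of f(x) = x**2 - 2, starting "sufficiently close" at x0 = 1.5
root = newton(lambda x: x**2 - 2, lambda x: 2 * x, 1.5)
print(root)   # approximately 1.4142135...
```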


Consider a system of two equations with two variables:

f1(x1, x2) = 0,
f2(x1, x2) = 0.

(A10.2)

Similarly to Equation A10.1, using the Taylor expansion, we have

0 = f1(x1 + h1, x2 + h2) ≈ f1(x1, x2) + h1 ∂f1/∂x1 + h2 ∂f1/∂x2,
0 = f2(x1 + h1, x2 + h2) ≈ f2(x1, x2) + h1 ∂f2/∂x1 + h2 ∂f2/∂x2.

(A10.3)

The system of linear equations (Equations A10.3) provides the basis for determining h1 and h2. The coefficient matrix is the Jacobian matrix of f1 and f2:

J = [ ∂f1/∂x1  ∂f1/∂x2 ]
    [ ∂f2/∂x1  ∂f2/∂x2 ].

Therefore, Newton's method for system (A10.2) is

x1,n+1 = x1,n + h1,n,
x2,n+1 = x2,n + h2,n,

where the increments h1,n and h2,n are solutions to the rearranged system of linear Equations A10.3:

h1,n ∂f1/∂x1 + h2,n ∂f1/∂x2 = −f1(x1,n, x2,n),
h1,n ∂f2/∂x1 + h2,n ∂f2/∂x2 = −f2(x1,n, x2,n),

or

J [ h1,n ]  = − [ f1(x1,n, x2,n) ]
  [ h2,n ]      [ f2(x1,n, x2,n) ].

(A10.4)

Solving the system of linear Equations A10.4 uses the method discussed in Section 10.1. Solution of a larger system of nonlinear equations follows the same strategy, and only the Jacobian matrix is expanded. For instance, for a system of three equations, the Jacobian matrix is

J = [ ∂f1/∂x1  ∂f1/∂x2  ∂f1/∂x3 ]
    [ ∂f2/∂x1  ∂f2/∂x2  ∂f2/∂x3 ]
    [ ∂f3/∂x1  ∂f3/∂x2  ∂f3/∂x3 ].
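As an illustration of the full procedure, a small Python sketch of Newton's method for a two-equation system is given below; each iteration solves the linear system in Equation A10.4 with NumPy, and the example system and starting point are made up for the illustration.

```python
import numpy as np

def newton_system(F, J, x0, tol=1e-10, max_iter=50):
    """Newton's method for F(x) = 0: solve J(x) h = -F(x), then update x <- x + h."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        h = np.linalg.solve(J(x), -F(x))   # the linear system of Equation A10.4
        x += h
        if np.linalg.norm(h) < tol:
            return x
    raise RuntimeError("did not converge")

# Example system: x1^2 + x2^2 = 4 and x1*x2 = 1
F = lambda x: np.array([x[0]**2 + x[1]**2 - 4, x[0] * x[1] - 1])
J = lambda x: np.array([[2 * x[0], 2 * x[1]], [x[1], x[0]]])
print(newton_system(F, J, [2.0, 0.3]))   # converges to roughly (1.932, 0.518)
```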


APPENDIX 10C: TOOLKIT FOR CALIBRATING THE GARIN–LOWRY MODEL*

The toolkit Garin-Lowry.tbx under the data folder SimuCity implements the Garin–Lowry model. It includes two tools: the first tool, "Garin–Lowry model (Euclidean distance)," automatically generates a Euclidean distance matrix, and the other, "Garin–Lowry model (w external distance table)," reads a distance matrix (as in the case study in Section 10.3). Both read a basic employment pattern defined in either a polygon feature (e.g., tracts) or a point feature (e.g., tract centroids), and output the population and service employment patterns in the predefined field names (with empty values prior to the tool's use) in the polygon (point) feature. The interfaces for the two tools differ only slightly. In the first tool, the program needs the perimeter values in order to account for the intrazonal distance (approximated as 1/4 of a zone's perimeter, similar to the strategy adopted in step 2 in Section 10.3) in calibrating the Euclidean distance matrix. In the second tool, the program reads a predefined distance table, where the origin and destination IDs need to be consistent with the IDs of the polygon (point) feature, as shown in Figure A10.1. Since one of the major interests in the case study in Section 10.3 is to examine the impact of the road network, the travel time matrices are computed prior to the implementation of the Garin–Lowry model, and are fed to the second tool.

APPENDIX 10D: CELLULAR AUTOMATA (CA) FOR URBAN LAND USE MODELING†

A fundamental notion of urban modeling is that patterns arising at the aggregate level are based on the operation at the microlevel or agent level. The across-scale linkage is almost impossible to be addressed by confirmatory modeling, since many unexpected outcomes might occur due to the stochastic feature of the interactions across agents and scales. Most operational urban models in the literature and practice are built around the linkage between land use and transport (as is the case of the Garin–Lowry model in this chapter), while urban econometric models mainly simulate the connections among a variety of industrial sectors. These models attempt to be static and comprehensive, but they have to sacrifice spatiotemporal details, which eventually limits their applicability in the real world. The ability to connect all dots of spatial statics in a simulation framework is essential and beneficial. Three types of urban models, namely cellular automata (CA), agent-based models (ABMs), and microsimulation, have recently attracted most attention by using a bottom–up perspective. Among them, CA is perhaps the most popular model that has been applied empirically (Batty, 2009). CA simulates land use changes through transition rules, which are usually carried out upon immediate neighboring cells or



Suggested citation: Zhu H. and F. Wang. 2015. Appendix 10C: Toolkit for calibrating the Garin–Lowry model. In Quantitative Methods and Socioeconomic Applications in GIS (2nd ed.). Boca Raton, FL: Taylor & Francis. p. 233. Suggested citation: Ye X. and F. Wang. 2015. Appendix 10D: A cellular automata (CA) urban land use model. In Quantitative Methods and Socioeconomic Applications in GIS (2nd ed.). Boca Raton, FL: Taylor & Francis. p. 233–235.

234

Quantitative Methods and Socioeconomic Applications in GIS

FIGURE A10.1 Interface of the Garin–Lowry model tool (with external distance table).

the defined neighborhood of spatial influence (Batty, 2007). Raster data, particularly those from remote sensing, can readily work with such cell-based models as input data layers. This appendix illustrates the basic structure of CA, and explains how the technique helps us understand urban land use dynamics. CA was originally invented by Von Neumann (1951) to provide a formal framework for investigating the self-reproducing features of biological systems. A key advantage of CA is its simplicity in the space–time context. A well-known application of a CA is the Game of Life, where a cell will continue its life only if two or three neighboring cells are alive (otherwise die due to overcrowding or loneliness). It illustrates how macrolevel order can arise from microlevel interactions. Traditional CA is composed of five fundamental components (White and Engelen, 2000): 1. A space represented as a collection of homogeneous cells 2. A set of possible cell states (e.g., on or off, alive or dead) 3. A definition of the neighborhood of the cell (e.g., the focal cell having eight surrounding cells in a Moore neighborhood, but only four neighboring cells in a von Neumann neighborhood, similar to the queen and rook contiguity definitions in Appendix 1)

Garin–Lowry Model in Simulating Urban Population and Employment Patterns 235

4. A set of deterministic or stochastic transition rules being applied to all cells on the grid 5. A sequence of time steps. In a basic CA model on a regular two-dimensional grid, the states of the cells are updated synchronously at each discrete time step according to the transition rules. The transition rules define the state of a cell at a subsequent time as a function of the cell’s current state and the state of its neighboring cells. For example, for land use modeling, the rules are a set of land-use conversion probability measures. Through dynamic models of local interactions among cells, complex global structures emerge. CA modeling of urban dynamics has gained popularity over the last three decades due to its simplicity, flexibility, and capacity of incorporating GIS and remote sensing data (e.g., White and Engelen, 1993; Batty and Xie, 2005; Clarke and Gaydos, 1998; Li and Yeh, 2000; Wu, 2002; Feng and Liu, 2012). Major achievements have been made in the configuration of cells and cell states, identification of transition rules, assessment of simulation accuracies, software development, and applications. One recent model is SIMLANDER, standing for SIMulation of LAND use changE using R (R is an open-source software environment) (Hewitt et al., 2014) See Section 11.2.3 for an example of using R.

11

Linear Programming and Applications in Examining Wasteful Commuting and Allocating Healthcare Providers

This chapter introduces linear programming (LP), an important optimization technique in socioeconomic analysis and planning. LP seeks to maximize or minimize an objective function, subject to a set of constraints. Both the objective and the constraints are expressed in linear functions. It would certainly take more than one chapter to cover all issues in LP, and many graduate programs in planning, engineering, or other fields use a whole course or more to teach LP. This chapter discusses the basic concepts of LP, and illustrates how LP problems are solved in ArcGIS and other popular software (R or SAS). Section 11.1 reviews the formulation of LP and the simplex method. In Section 11.2, the method is applied to examine the issue of wasteful commuting. Commuting is an important research topic in urban studies for its theoretical linkage to urban structure and land use as well as its implications in public policy. Themes in the recent literature of commuting have moved beyond the issue of wasteful commuting, and cover a diverse set of issues such as relation between commuting and urban land use, explanation of intraurban commuting, and implications of commuting patterns in spatial mismatch and job access. However, strong research interests in commuting are, to some degree, attributable to the issue of wasteful commuting raised by Hamilton (1982) (see Appendix 11A). A case study in Columbus, Ohio is used to illustrate the method of measuring wasteful commuting by LP, and a program in the open-source R is developed to solve the LP in the case study. Appendix 11B presents the implementation of LP in SAS. Section 11.3 introduces the integer linear programming (ILP), in which some of the decision variables in a linear programming problem take on only integer values. Some classic location–allocation problems such as the p-median problem, the location set covering problem (LSCP), and the maximum covering location problem (MCLP) are used to illustrate the formulation of ILP problems. Applications of these location–allocation problems can be widely seen in both private and public sectors. Section 11.4 uses an example of allocating healthcare providers in Baton Rouge, Louisiana to illustrate the implementation of a location–allocation problem 237

238

Quantitative Methods and Socioeconomic Applications in GIS

in ArcGIS. Appendix 11C discusses a new location–allocation problem with an objective of minimizing accessibility disparities. The chapter concludes with a brief summary in Section 11.5.

11.1 LINEAR PROGRAMMING AND THE SIMPLEX ALGORITHM 11.1.1

lp standard ForM

The linear programming problem in the standard form can be described as follows: n n Find the maximum of c j x j subject to the constraints aij x j ≤ bi for all





j =1

j =1

i ∈ {1, 2, …, m} and x j ≥ 0 for all j ∈{1, 2, …, n}. n The function c j x j is the objective function, and a solution xj (j ∈{1, 2, …, n}) j =1 is also called the optimal feasible point. In the matrix form, the problem is stated as: Let c ∈ Rn , b ∈ Rm , and A ∈ Rm × n . Find the maximum of cT x subject to the constraints Ax ʺ b and x ≥ 0. Since the problem is fully determined by the data A, b, and c, it is referred to as Problem (A, b, c).* Other problems not in the standard form can be converted into it by the following transformations (Kincaid and Cheney 1991, p. 648):



1. Minimizing cTx is equivalent to maximizing −cTx. 2. A constraint 3. A −



constraint



n j =1



j =1

aij x j ≥ bi is equivalent to −





n j =1

aij x j ≤ −bi .

aij x j = bi

is

to



n

| aij x j |≤ bi

is equivalent to



n

n j =1

equivalent

j =1

aij x j ≤ bi ,

aij x j ≤ −bi .

4. A constraint −

n

n j =1



n j =1

j =1

aij x j ≤ bi ,

aij x j ≤ bi .

5. If a variable xj can also be negative, it is replaced by the difference of two variables such as x j = u j − v j .

11.1.2

siMplex algorithM

The simplex algorithm (Dantzig, 1948) is widely used for solving linear programming problems. By skipping the theorems and proofs, we move directly to illustrate the method in an example. Consider a linear programming problem in the standard form: Maximize: z = 4 x1 + 5 x2 *

The dual problem to the linear programming problem is (−AT, −c, −b). See Wu and Coppins (1981) for economic interpretations of a primary and a dual linear programming problem.

Examining Wasteful Commuting and Allocating Healthcare Providers

239

Subject to: 2 x1 + x2 ≤ 12 −4 x1 + 5 x2 ≤ 20 x1 + 3 x2 ≤ 15 x1 ≥ 0, x2 ≥ 0 The simplex method begins with introducing slack variables u ≥ 0 so that the constraints Ax ≤ b can be converted into an equation form Ax + u = b. For the above problem, three slack variables (x3 ≥ 0, x4 ≥ 0, x5 ≥ 0 ) are introduced. The problem is rewritten as Maximize: z = 4 x1 + 5 x2 + 0 x3 + 0 x4 + 0 x5 Subject to: 2 x1 + x2 + x3 + 0 x4 + 0 x5 = 12 −4 x1 + 5 x2 + 0 x3 + x4 + 0 x5 = 20 x1 + 3 x2 + 0 x3 + 0 x4 + x5 = 15 x1 ≥ 0, x2 ≥ 0, x3 ≥ 0, x4 ≥ 0, x5 ≥ 0 The simplex method is often accomplished by exhibiting the data in a tableau form such as 4 2 −4 1

5 1 5 3

0 1 0 0

0 0 1 0

0 0 12 0 20 1 15

The top row contains coefficients in the objective function cTx. The next m rows represent the constraints that are re-expressed as a system of linear equations. We leave the element at the top right corner blank as the solution to the problem zmax is yet to be determined. The tableau is of a general form cT A

0 I b

If the problem has a solution, it is found at a finite stage in the algorithm. If the problem does not have a solution (i.e., an unbounded problem), the fact is discovered in the course of the algorithm. The tableau is modified in successive steps according to certain rules until the solution (or no solution for an unbounded problem) is found.

240

Quantitative Methods and Socioeconomic Applications in GIS

By assigning 0 to the original variables x1 and x2, the initial solution (x1 = 0, x2 = 0, x3 = 12, x4 = 20, and x5 = 15) certainly satisfies the constraints of equations. The variables xj that are zero are designated nonbasic variables; and the remaining ones, usually nonzero, are designated basic variables. The tableau has n components of nonbasic variables and m components of basic variables, corresponding to the numbers of original and slack variables, respectively. In the example, x1 and x2 are the nonbasic variables (n = 2), and x3, x4, and x5 are the basic variables (m = 3). In the matrix that defines the constraints, each basic variable occurs in only one row; and the objective function must be expressed only in terms of nonbasic variables. In each step of the algorithm, we attempt to increase the objective function by converting a nonbasic variable into a basic variable. This is done through Gaussian elimination steps since elementary row operations on the system of equations do not alter the set of solutions. The following summarizes the work on any given tableau: 1. Select the variable xs whose coefficient in the objective function is the largest positive number, that is, cs = max{ci > 0}. This variable becomes the new basic variable. 2. Divide each bi by the coefficient of the new basic variable in that row, aij; and among those with ais > 0 (for any i), select the minimum bi/aij as the pivot element and assign it to the new basic variable, that is, xs = bk / akj = min{bi / aij }. If all ais are ≤0, the problem has no solution. 3. Using the pivot element aks, create 0’s in column s with Gaussian elimination steps (i.e., keeping the kth row with the pivot element, and subtracting it from other rows). 4. If all coefficients in the objective function (the top row) are ≤0, the current x is the solution. We now apply the procedures to the example. In step 1, x2 becomes the new basic variable because 5 (i.e., its coefficient in the objective function) is the largest positive coefficient. In step 2, a22 is identified as the pivot element (highlighted in underscore in the following tableau) because 20/5 is the minimum among {12/1, 20/5, 15/3}, and x2 = 20/5 = 4. 0.8 2 −0.8 0.3333

1 1 1 1

0 0 0 1 0 0 12 0 0.2 0 4 0 0 0.3333 5

In step 3, Gaussian eliminations yield a new tableau

   1.6      0    0   −0.2    0        |
   2.8      0    1   −0.2    0        |  8
  −0.8      1    0    0.2    0        |  4
   1.1333   0    0   −0.2    0.3333   |  1


According to step 4, the process continues as c1 = 1.6 > 0. Similarly, x1 is the new basic variable, the element in the x1 column of the third constraint row is the pivot element, x1 = 1/1.1333 = 0.8824, and the resulting new tableau is

  0    0      0        0.0515   −0.2941   |
  0    0      0.3571   0.1050   −0.2941   |  1.9748
  0    1.25   0        0.0735    0.2941   |  5.8824
  1    0      0       −0.1765    0.2941   |  0.8824

The process continues, and generates a new tableau such as

  0        0    −3.4   0   −2.9143   |
  0        0     3.4   1   −2.8      |  18.8
  0       17    −3.4   0    6.8      |  61.2
  5.6667   0     3.4   0   −1.1333   |  23.8

By now, all coefficients in the objective function (the top row) are ≤ 0, and the solution is x1 = 23.8/5.6667 = 4.2 and x2 = 61.2/17 = 3.6. The maximum value of the objective function is zmax = 4 × 4.2 + 5 × 3.6 = 34.8. Many software packages (some free) are available for solving LP problems. In the following case study, we select the open-source R to illustrate the LP implementation. See Appendix 11B for its implementation in SAS.
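For instance, the example above can be replicated with the lpSolve package used later in this chapter. The following minimal sketch (added here for illustration, not part of the case study data) reproduces the solution x1 = 4.2, x2 = 3.6 and zmax = 34.8:

library(lpSolve)
obj <- c(4, 5)                                    # objective coefficients of x1 and x2
con <- matrix(c( 2, 1,
                -4, 5,
                 1, 3), nrow = 3, byrow = TRUE)   # constraint coefficients
sol <- lp("max", obj, con, rep("<=", 3), c(12, 20, 15))
sol$solution   # 4.2 3.6
sol$objval     # 34.8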

11.2 CASE STUDY 11A: MEASURING WASTEFUL COMMUTING IN COLUMBUS, OHIO

11.2.1 Issue of Wasteful Commuting and Model Formulation

The issue of wasteful commuting was first raised by Hamilton (1982). Assuming that residents can freely swap houses, the planning problem is to minimize total commuting given the locations of houses and jobs. Hamilton used the exponential function to capture the distribution patterns of both residents and employment. Since employment is more centralized than population (i.e., the employment density function has a steeper gradient than the population density function), the solution to the problem is that commuters always travel toward the CBD and stop at the nearest employer. See Appendix 11A for details. Hamilton found 87% wasteful commuting in 14 cities in the US. White (1988) proposed a simple LP model to measure wasteful commuting. White's study yielded very little wasteful commuting, likely attributable to the large areal units she used (Small and Song, 1992). Using smaller units, Small and Song (1992) applied White's LP approach to Los Angeles and found about 66% wasteful commuting, less than that in Hamilton's model but still substantial. The impact of areal units on the measurement of wasteful commuting is further elaborated in Horner and Murray (2002). The following formulation follows White's LP approach.


Given the number of resident workers Pi at location i (i = 1, 2, …, n) and the number of jobs Ej at location j (j = 1, 2, …, m), the minimum commute is the solution to the following linear programming problem:

Minimize:    ∑(i=1 to n) ∑(j=1 to m) cij xij

Subject to:  ∑(j=1 to m) xij ≤ Pi   for all i (i = 1, 2, …, n),

             ∑(i=1 to n) xij ≤ Ej   for all j (j = 1, 2, …, m),

             xij ≥ 0   for all i (i = 1, 2, …, n) and all j (j = 1, 2, …, m),

where cij is the commute distance (time) from residential location i to job site j, and xij is the number of commuters on that route. The objective function is the total amount of commute in the city. The first constraint defines that the total commuters from each residential location to various job locations cannot exceed the number of resident workers there. The second constraint defines that the total commuters from various residential locations to each job site cannot exceed the number of jobs there. In the urbanized areas of most U.S. metropolitan areas, it is most likely that the total number of jobs exceeds the total number of resident workers, that is, ∑(i=1 to n) Pi ≤ ∑(j=1 to m) Ej.



11.2.2 Data Preparation in ArcGIS

The following datasets are prepared and provided under the data folder Columbus:

1. Geodatabase Columbus.gdb includes an area feature urbtaz and its corresponding point feature urbtazpt with 991 TAZs (traffic analysis zones and their centroids, respectively), and a feature dataset roads containing a single feature roads for the road network.
2. The R program file WasteComm.R implements the LP approach.
3. ASCII files restaz.txt, emptaz.txt, and ODTime.csv are to be prepared by steps 1 and 3 in this subsection (provided in the data folder for convenience), and contain the number of resident workers in each TAZ, the number of jobs in each TAZ, and the travel time between them, respectively.

The spatial data are extracted from the TIGER files. Specifically, the feature urbtazpt is defined by combining all TAZs (traffic analysis zones) within the Columbus MSA, including seven counties (Franklin, Union, Delaware, Licking, Fairfield, Pickaway, and Madison) in 1990, and keeping only the urbanized portion. The same study area was used in Wang (2001b), which focused on explaining intraurban variation of commute (see Figure 11.1). The feature dataset roads is defined similarly by combining all roads in the study area, but covers a slightly larger area than the urbanized area to maintain network connectivity. In the attribute table of urbtazpt, the field emp is the number of employment (jobs) in a TAZ, and the field work is the number of resident workers. In addition, the field popu is the population in a TAZ for reference. These attributes are extracted from the 1990 Census Transportation Planning Package (CTPP) (http://www.fhwa.dot.gov/planning/census_issues/ctpp/) Urban Element data sets for the Columbus MSA. Specifically, the information for resident workers and population is based on Part 1 (by place of residence), and the information for jobs is based on Part 2 (by place of work). The fields work and emp in urbtazpt define the numbers of resident workers and employment in each TAZ, respectively, and thus the variables Pi (i = 1, 2, …, n) and Ej (j = 1, 2, …, m) in the LP problem stated in Section 11.2.1.

FIGURE 11.1 TAZs with employment and resident workers in Columbus, Ohio.

In order to minimize the computation load, we need to restrict the origins and destinations to those TAZs with nonzero resident worker and nonzero employment counts, respectively, and compute the O-D travel time cij only between those TAZs. Section 2.4.2 in Chapter 2 has discussed the procedures for computing the O-D travel time matrix, and this case study provides another opportunity to practice the technique. This study uses a simple approach for measuring travel time by assuming a uniform speed on the same level of roads.

For a more advanced method, see Wang (2003). One may skip the steps and go directly to the wasteful commuting analysis in Section 11.2.3, as the resulting files are also provided as stated previously.

Step 1. Extracting locations of nonzero employment and nonzero resident workers: In ArcMap, open the layer urbtazpt, (1) select the TAZs with work > 0 and export the selected features to a new feature class restaz (812 resident worker locations), and (2) select the TAZs with emp > 0 and export the selected features to a new feature class emptaz (931 employment locations). These two feature classes will be used as the origins and destinations in the O-D time matrix. Also in ArcMap, open the attribute table of feature restaz, and export it to a text file restaz.txt. Similarly, open the attribute table of feature emptaz, and export it to a text file emptaz.txt. Both text files are comma separated with headings and seven variables (OBJECTID, AREA, TAZ, POPU, WORK, EMP, and Length); restaz.txt has 812 records and emptaz.txt has 931 records excluding the headings. As shown in Figure 11.1, some TAZs in the study area had no employment or no resident workers. These two text files (provided in the data folder) will be used in the wasteful commuting computation in Section 11.2.3.

Step 2. Defining impedance for the road network: In ArcMap, open the attribute table of feature class roads, add a field Speed and assign a default value of 25 (mph) for all road segments, then update the values for higher-level roads according to the corresponding CFCC codes (Luo and Wang, 2003) by using a series of Select By Attributes (on the CFCC ranges beginning with "A11", "A21", and "A31") and Field Calculator operations, and derive each segment's travel time in minutes from its length and speed.

Step 3. Computing the O-D travel time matrix: In ArcCatalog, right-click the feature dataset roads > New > Network dataset. Name the new network dataset roads_ND, and follow the instruction in step 8 of Section 2.4.2 to build the network dataset (make sure to select Minutes when "Specify the attributes for the network dataset"). Close the warning message for errors in the "Network Dataset Build Report." In ArcMap, use the O-D cost matrix tool in the Network Analyst module to compute the network travel time matrix (table) by using restaz as the origins and emptaz as the destinations (see steps 9–12 in Section 2.4.2). Export the result to a table ODTime under geodatabase Columbus.gdb. Similar to step 2 in Section 10.3, intrazonal time here is approximated as 1/4 of a TAZ's perimeter divided by a constant speed of 670.56 meters/minute (i.e., 25 mph) in this study. The total travel time between two TAZs is composed of the aforementioned network time and the intrazonal time at both the origin and destination TAZs. Join the attribute table of feature restaz to the table ODTime (based on the common fields OBJECTID and OriginID), and also join the attribute table of feature emptaz to the table ODTime (based on the common


fields OBJECTID and DestinationID). By doing so, the perimeter values (in field Length) for both the origin and destination TAZs are attached to the table ODTime. Add a new field NetwTime to the table ODTime and calculate it as [Total_Minutes] + 0.25*([Length] + [Length_1])/670.56. This amends the network travel time by adding the intrazonal times. Export it to a comma-separated text file odtime.txt to preserve the expanded table. Simplify the text file odtime.txt by keeping only three columns of values with headings such as OBJECTID_1 (ids for the origins or resident workers), OBJECTID_2 (ids for the destinations or employment), and NetwTime (travel time in minutes). The new simplified O-D travel time file is named ODTime.csv with 812 × 931 = 755,972 records (also provided in the data folder).
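As a quick check of this amendment (with hypothetical numbers, not from the Columbus data), a network time of 10 minutes between two TAZs whose perimeters are 4,000 and 6,000 meters would be adjusted as follows:

Total_Minutes <- 10
Length_o <- 4000     # perimeter of the origin TAZ (meters)
Length_d <- 6000     # perimeter of the destination TAZ (meters)
NetwTime <- Total_Minutes + 0.25 * (Length_o + Length_d) / 670.56
NetwTime             # about 13.7 minutes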

11.2.3 Measuring Wasteful Commuting in an R Program

The LP problem for measuring wasteful commuting is defined by the spatial distributions of resident workers and employment and the travel time between them, which are prepared in Section 11.2.2. As stated previously, these files (restaz.txt, emptaz.txt, and ODTime.csv) are also provided directly under the data folder Columbus for your convenience. The following illustrates the implementation in an R program.

Step 4. Downloading and installing R: R is a free statistical computing package, and runs on several operating systems such as Windows, Mac OS, and UNIX. It can be accessed and downloaded from its mirrors (http://cran.r-project.org/mirrors.html) based on a location near you. For example, we choose to download the version "R for Windows" from the mirror in University of California, Berkeley, CA, USA (http://cran.cnr.Berkeley.edu). Under the list of subdirectories (e.g., base, contrib, Rtools), choose the base subdirectory if it is your first time installing R. Download the current (dated March 21, 2014) version R 3.0.3 for Windows and install it.

Step 5. Downloading and installing the LP package in R: Launch R. From the main menu of R as shown in Figure 11.2, select Packages > Install package(s) > In the CRAN mirror window, similar to step 4, choose a location close to you (in our case, choose "USA (CA1)") for illustration. Under the list of Packages, choose lpSolve to install this package. From the main menu, select Packages > Load package > In the window "Select one", select lpSolve and OK to load it.*

Step 6. Running the program for measuring wasteful commuting: After loading the package lpSolve, from the main menu, select File > Open script. Navigate to the program file WasteComm.R and open it. Edit the locations of three input files (ODTime.csv, restaz.txt and emptaz.txt) and one output file (min_com.csv), and save the script. From the main menu, select Edit > Run All to run the whole program. Alternatively, one may choose the option "Run line or selection" to run the script line by line. In the R Console window, the result is saved in a text file min_com.csv containing the OBJECTIDs of origin and destination TAZs, and the travel time and number

* To use a downloaded package, one needs to load it into a project every time before running it.

FIGURE 11.2 Interface of R.

of commuters on the route. The R Console window also shows that the objective function (total minimum commute time) is 3,805,721 minutes, that is, an average of 7.65 minutes per resident worker (with a total of 497,588 commuters). The actual mean commute time was 20.40 minutes in the study area in 1990 (Wang, 2001b, p.173), implying 62.5% wasteful commuting, significantly below 87% reported by Hamilton (1982). Further examining the “optimal” commuting patterns in min_com.csv reveals that many of the trips have the same origin and destination TAZs, and therefore accounting for intrazonal commute time is significant in reducing the resulting wasteful commuting. Studies on the 1990 CTPP Part 3 (journey-to-work) for the study area show that among the 346 TAZs with nonzero intrazonal commuters, the average intrazonal drive-alone time was actually 16.3 minutes. Wang (2003) also revealed significant intrazonal commute time in Cleveland, Ohio (11.3 minutes). That is to say, the intrazonal time estimated in step 3 is much less than that reported in surveys. In addition, ArcGIS also tends to underestimate travel time (e.g., in comparison to time retrieved from Google Maps, as shown in Appendix 2B). In summary, a significant portion of the so-called wasteful commuting may be attributable to inaccurate travel time estimation.
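The program file WasteComm.R provided with the data implements this step. For readers who want to see how such an LP is typically set up with lpSolve, the following is a minimal sketch only (it is not the provided WasteComm.R, and the actual program may be organized differently). The resident-worker constraint is written here as an equality so that every worker is assigned to a job; if both constraints were left as inequalities, the all-zero assignment would trivially minimize the objective. With 812 × 931 decision variables the full problem is large, so the sketch conveys the structure rather than a tuned implementation.

library(lpSolve)
P  <- read.csv("restaz.txt")$WORK        # resident workers by origin TAZ
E  <- read.csv("emptaz.txt")$EMP         # jobs by destination TAZ
od <- read.csv("ODTime.csv")             # columns OBJECTID_1, OBJECTID_2, NetwTime
# reshape the three-column O-D list into an n x m travel time matrix
# (assumes records and OBJECTIDs are sorted consistently)
time.mat <- as.matrix(xtabs(NetwTime ~ OBJECTID_1 + OBJECTID_2, data = od))
sol <- lp.transport(cost.mat = time.mat, direction = "min",
                    row.signs = rep("=",  nrow(time.mat)), row.rhs = P,
                    col.signs = rep("<=", ncol(time.mat)), col.rhs = E)
sol$objval / sum(P)    # minimum (required) commute per resident worker, in minutes
sol$solution           # optimal number of commuters on each O-D route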

11.3 INTEGER PROGRAMMING AND LOCATION–ALLOCATION PROBLEMS

11.3.1 General Forms and Solutions for Integer Programming

If some of the decision variables in a linear programming problem are restricted to only integer values, the problem is referred to as an integer programming problem. If all decision variables are integers, it is an integer linear programming (ILP) problem. Similar to the LP standard form, it is written as

Maximize:    ∑(j=1 to n) cj xj

Subject to:  ∑(j=1 to n) aij xj ≤ bi   for all i ∈ {1, 2, …, m}, and

             integers xj ≥ 0   for all j ∈ {1, 2, …, n}.

If some of the decision variables are integers and others are regular nonnegative numbers, it is a mixed-integer linear programming (MILP) problem. It is written as

Maximize:    ∑(j=1 to n) cj xj + ∑(k=1 to p) dk yk

Subject to:  ∑(j=1 to n) aij xj + ∑(k=1 to p) gik yk ≤ bi   for all i ∈ {1, 2, …, m},

             integers xj ≥ 0 for all j ∈ {1, 2, …, n} and yk ≥ 0 for all k ∈ {1, 2, …, p}.

One may think that an easy approach to an ILP or MILP problem is to solve the problem as a regular LP problem and round the solution. In many situations, the rounded solution is not necessarily optimal (Wu and Coppins, 1981, p. 399). Solving the ILP or MILP requires special approaches such as the cutting planes method or the branch and bound method, and the latter is more popular. The following summarizes a general branch and bound algorithm:

1. Find a feasible solution fL as the lower bound on the maximum value of the objective function.


2. Select one of the remaining subsets, and separate it into two or more new subsets of solutions.
3. For each subset, compute an upper bound fU on the maximum value of the objective function over all completions.
4. A subset is eliminated (fathomed) if (i) fU < fL, or (ii) its solution is not feasible, or (iii) the best feasible solution in this subset has been found, in which case fL is replaced with this new value.
5. Stop if there are no remaining subsets. Otherwise, go to step (2).

When the integer decision variables are restricted to value 0 or 1, the problem is said to be a 0-1 (binary) programming problem. The 0-1 programming problem has wide applications in operations research and management science, particularly in location–allocation problems.
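As an illustration of why rounding can fail (added here for clarity, not part of the case studies), consider imposing integer constraints on the example LP from Section 11.1. Its LP relaxation yields x1 = 4.2 and x2 = 3.6; rounding to (4, 4) violates the third constraint, and rounding down to (4, 3) gives z = 31, whereas the true integer optimum found by branch and bound is x1 = 3, x2 = 4 with z = 32. The lpSolve package can verify this:

library(lpSolve)
obj <- c(4, 5)
con <- matrix(c( 2, 1,
                -4, 5,
                 1, 3), nrow = 3, byrow = TRUE)
rhs <- c(12, 20, 15)
ilp <- lp("max", obj, con, rep("<=", 3), rhs, all.int = TRUE)
ilp$solution   # 3 4
ilp$objval     # 32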

11.3.2 Location–Allocation Problems

We use three classic location–allocation problems to illustrate the formulation of ILP problems. The first is the p-median problem (ReVelle and Swain, 1970). The objective is to locate a given number of facilities among a set of candidate facility sites so that the total travel distance or time to serve the demands assigned to the facilities is minimized. The p-median model formulation is

Minimize:    Z = ∑(i=1 to n) ∑(j=1 to m) ai dij xij

Subject to:  xij ≤ xjj   for all i, j, i ≠ j (each demand assignment is restricted to what has been located),

             ∑(j=1 to m) xij = 1   for all i (each demand must be assigned to a facility),

             ∑(j=1 to m) xjj = p   (exactly p facilities are located),

             xij = 1, 0   for all i, j (a demand area is either assigned to a facility or not),

where i indexes demand areas (i = 1, 2, …, n), j indexes candidate facility sites (j = 1, 2, …, m), p is the number of facilities to be located, ai is the amount of demand at area i, dij is the distance or time between demand i and facility j, and xij is 1 if demand i is assigned to facility j or 0 otherwise. One may add a constraint that each demand site must be served by a facility within a critical distance or time (dij ≤ d0). This formulation is known as the p-median


problem with a maximum distance constraint (Khumawala, 1973; Hillsman and Rushton, 1975).

The second is the location set covering problem (LSCP) that minimizes the number of facilities needed to cover all demand (Toregas and ReVelle, 1972). The model formulation is

Minimize:    Z = ∑(j=1 to m) xj

Subject to:  ∑(j ∈ Ni) xj ≥ 1   for all i (a demand area must be within the critical distance or time of at least one open facility site),

             xj = 1, 0   for all j (a candidate facility is either open or closed),

where Ni is the set of facilities for which the distance or time between demand i and facility j is less than the critical distance or time d0, that is, dij ≤ d0; xj is 1 if a facility is open at candidate site j or 0 otherwise; and i, j, m, and n are the same as in the above p-median model formulation.

The third is the maximum covering location problem (MCLP) that maximizes the demand covered within a desired distance or time threshold by locating p facilities (Church and ReVelle, 1974). The model formulation is

Minimize:    Z = ∑(i=1 to n) ai yi

Subject to:  ∑(j ∈ Ni) xj + yi ≥ 1   for all i (a demand area must be within the critical distance or time of at least one open facility site or it is not covered),

             ∑(j=1 to m) xj = p   (exactly p facilities are located),

             xj = 1, 0   for all j (a candidate facility is either open or closed),

             yi = 1, 0   for all i (a demand area is either not covered or covered),

where i, j, m, n, and p are the same as in the p-median model formulation, and Ni and xj are the same as in the above LSCP model formulation. Note that yi is 1 if a demand area i is not covered or 0 otherwise; thus the objective function is structured to minimize the amount of demand not covered, equivalent to maximizing the amount covered. Similarly, one may add an additional constraint to the original MCLP that requires an uncovered demand point within


a mandatory closeness constraint (the second and a larger distance threshold). The revised model is known as the MCLP with mandatory closeness constraints (Church and ReVelle, 1974). The above problems may be solved in the Network Analyst module in ArcGIS. Specifically, under “New Location–Allocation”, “Minimize Impedance” solves the p-median problem, “Minimize Facilities” solves the LSCP, and “Maximum Capacitated Coverage” solves the MCLP. Table 11.1 summarizes the models and corresponding location–allocation problem types in ArcGIS. For details, use the ArcGIS help on the topic “Location–allocation analysis.” As demonstrated in this section, a location–allocation task is often formulated as an optimization problem composed of an objective function (or multiple objectives) and a set of constraints with decision variables to solve. Appendix 11C presents a case with an objective of maximum equal accessibility. The current location–allocation models available in ArcGIS are limited in capacities and do not have the flexibility of adding complex constraints. Many applications call for the formulation of more advanced models that require the use of R, SAS, or more specialized optimization packages such as LINGO.
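For small problems, these binary formulations can also be solved directly in R with the lpSolve package introduced earlier. The sketch below sets up an LSCP with a hypothetical coverage matrix N (four demand areas by three candidate sites, where N[i, j] = 1 if candidate j is within the critical distance or time d0 of demand area i); the data are illustrative only.

library(lpSolve)
N <- matrix(c(1, 0, 0,
              1, 1, 0,
              0, 1, 1,
              0, 0, 1), nrow = 4, byrow = TRUE)   # hypothetical coverage matrix
sol <- lp("min", objective.in = rep(1, 3), const.mat = N,
          const.dir = rep(">=", 4), const.rhs = rep(1, 4), all.bin = TRUE)
sol$solution   # 1 0 1: open candidate sites 1 and 3
sol$objval     # 2 facilities cover all four demand areas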

TABLE 11.1 Location–Allocation Models

Model | Objective | Constraints | Problem Type in ArcGIS Location–Allocation Analysis
I. p-median problem | Minimize total distance (time) | Locate p facilities; all demands are covered | Minimize impedance
II. p-median with a maximum distance constraint | Minimize total distance (time) | Demand must be within a specified distance (time) of its assigned facility (a) | Minimize impedance (by setting an impedance cutoff)
III. Location set covering problem (LSCP) | Minimize the number of facilities | All demands are covered | Minimize facilities
IV. Maximum covering location problem (MCLP) | Maximize coverage | Locate p facilities; demand is covered if it is within a specified distance (time) of a facility | Maximize capacitated coverage (by setting an impedance cutoff)
V. MCLP with mandatory closeness constraints | Maximize coverage | Demand not covered must be within a second (larger) distance (time) of a facility (b) | Maximize capacitated coverage (by setting two impedance cutoffs)

(a) Additional to constraints for model I. (b) Additional to constraints for model IV.

11.4 CASE STUDY 11B: ALLOCATING HEALTHCARE PROVIDERS IN BATON ROUGE, LOUISIANA

This is a hypothetical project developed to illustrate the implementation of location–allocation models in ArcGIS. Say, the Department of Health and Hospitals, State of Louisiana, plans to set up five temporary clinics for administering free flu shots for residents in East Baton Rouge Parish. Sites for the five clinics will be selected from the 26 hospitals in the parish. For the purpose of illustration, the objective in this project is to minimize the total travel time for all clients, and thus it is a p-median problem. Other location–allocation models can be implemented similarly in ArcGIS.* The following data sets for the study area are provided:

1. Under the data folder "BatonRouge", geodatabase BR.gdb containing a point feature BR_Hosp for all (26) hospitals, a point feature BRBlkPt for all (7896) census block centroids, and a polygon feature BRTrtUtm for all (92) census tracts (both from Case Study 1).
2. Under the data folder "Louisiana", geodatabase LAState.gdb containing a feature dataset LA_MainRd for the major road network in Louisiana (from Case Study 2).

The planning problem is to locate five clinics among the 26 hospitals in order to serve the residents in the most efficient way (i.e., minimizing total travel time). The Minimize Impedance (p-median) model in ArcGIS will be used to find the optimal solution. In practice, facilities are hospitals, demand points are represented by census tract centroids,† and impedance is measured as the network travel time. For demand point properties, the field "Weight" is the relative weighting of a demand point and is defined by the total population in each census tract.

Step 1. Preparing the demand point layer: The area size of census tracts varies a great deal, and is much larger in a rural area than in the central city. The geographic centroid of a census tract, particularly in a rural or suburban area, may not be a good representation of its location. A better approach is to use the population-weighted centroid (Luo and Wang, 2003). It is accomplished by using the mean center tool (see step 1 in Section 4.3.1). In ArcMap, use ArcToolbox > Spatial Statistics Tools > Measuring Geographic Distributions > Mean Center. In the Mean Center dialog window, select BRBlkPt for Input Feature Class, name the Output Feature Class BRTrt_WgtPt, choose




The following location–allocation models are available in ArcGIS 10.2: Minimize Impedance (p-median), Maximize Coverage, Maximize Capacitated Coverage, Minimize Facilities, Maximize Attendance, Maximize Market Share, and Target Market Share. For more details, see the topic “Location–allocation analysis” in ArcGIS 10.2 Help. One may use the census blocks instead of tracts to define the demand for better accuracy. This case study chooses census tracts for illustration of the method and saving computation time.

252

Quantitative Methods and Socioeconomic Applications in GIS

POP10 under Weight Field and TRTID10 under Case Field,* and click OK. The resulting layer BRTrt_WgtPt has 91 points, which do not include the Baton Rouge Airport tract (Tract 9800) with no population, and thus one fewer than the number of tracts in BRTrtUtm. Join BRTrtUtm to BRTrt_WgtPt based on the common keys (GEOID10 in BRTrtUtm and TRTID10 in BRTrt_WgtPt) to attach the population information to the weighted centroids layer, and export the data to a new feature class BRTrtPt under geodatabase BR.gdb to preserve the joined result. The layer BRTrtPt with population-weighted tract centroids defines the demand points, where the field DP0010001 is the total population for a tract. Step 2. Activating the Location–Allocation tool in Network Analyst: In ArcMap, add the hospital feature BR_Hosp and all feature classes in the road network feature dataset LA_MainRd to the project. From the ArcMap main menu, choose Customize > Extensions > make sure that Network Analyst is checked; back to Customize > Toolbars > select Network Analyst to activate the module. On the Network Analyst toolbar, click the Network Analyst drop-down manual and choose New Location–Allocation. The network analysis layer is created in both Table Of Contents and Network Analyst windows. The location–allocation analysis layer has six classes: Facilities, Demand Points, Lines, Point Barriers, Line Barriers, and Polygon Barriers (similar to step 10 in Section 2.4.2). Step 3. Defining network analysis classes (Facilities, Demand Points, and others): In the Network Analyst window, right-click Facilities > Load Locations to load from BR_Hosp (other default settings are okay). Similarly, right-click Demand Points > Load Locations to load from BRTrtPt; under Location Analysis Properties, set Name to OBJECTID, Weight to DP0010001 (total population); under Location Position, check Use Geometry and set the Search Tolerance to 6500 meters†; click OK to load it. As a result, 26 facilities and 91 demand points are loaded (similar to step 11 in Section 2.4.2). Step 4. Solving the location–allocation model: In the Table Of Contents window, double-click the composite layer Location–Allocation to open its Layer Properties dialog box‡ > Under the Analysis Settings tab, choose Minutes (Minutes) for Impedance; Under the Advanced Settings tab, choose Minimize Impedance for Problem Type, input 5 for Facilities To Choose (other default settings are okay); click OK. On the Network Analyst toolbar, click the Solve button . The output properties of network analysis objects are updated to reflect the results. In particular, in the *


* The Case Field is used to "group features for separate mean center calculations". In other words, for each unique value of the case field (i.e., TRTID10 for census tract id), ArcGIS computes the mean center for all features sharing the same value of this field (i.e., multiple blocks within a tract). For your information, the field TRTID10 was not available from the original BRBlkPt layer, and was added by the author to facilitate the attribute join with the census tract layer. TRTID10 is defined as "Text with length 11", and the formula for its calculation is: TRTID10 = [STATEFP10] & [COUNTYFP10] & [TRACTCE10].
† This is to accommodate loading a tract in the northeast corner of the study area.
‡ Alternatively, in the Network Analyst window, one may click the Location–Allocation Properties button next to the Location–Allocation drop-down menu to open it.



FIGURE 11.3 Five selected hospitals in the p-median model.

attribute table of Facilities, the field FacilityType identifies which five hospitals are selected; and the fields DemandCount and DemandWeight represent the total number of demand points (census tracts) and total population served by each clinic, respectively. In the attribute table of Demand Points, the Field FacilityID shows which facility a tract is assigned to. The Line class shows the allocation assignment between a demand point (census tract) and a chosen hospital (clinic). The results are presented in Figure 11.3. Among the 26 hospitals in the study area, five are selected as the clinic sites for minimizing the total travel time for potential clients. Table 11.2 summarizes the information for the service areas of the five clinics.

TABLE 11.2 Service Areas for the Clinics

Name | No. of Tracts | Total Population
Baton Rouge VA Outpatient Clinic | 30 | 147,409
Greater Baton Rouge Surgical Hospital | 25 | 108,628
Lane Rehabilitation Center | 7 | 40,084
Promise Specialty Hospital of Baton Rouge | 10 | 62,947
Woman's Hospital | 19 | 81,103


11.5 SUMMARY

Applications of linear programming (LP) are widely seen in operational research, engineering, socioeconomic planning, location analysis, and others. This chapter first discusses the simplex algorithm for solving LP. The wasteful commuting issue is used as a case study to illustrate the LP implementation in R. The solution to minimizing total commute in a city is the optimal (required or minimum) commute. In the optimal city, many journey-to-work trips are intrazonal (i.e., residence and workplace are in the same zone), and yield very little required commute and thus lead to a high percentage of wasteful commuting when compared to actual commuting. Computation of intrazonal travel distance or time is particularly problematic as the lengths of intrazonal trips vary. Unless the journey-to-work data are down to street addresses, it is difficult to measure commute distances or times accurately. The interest in examining wasteful commuting, to some degree, stimulates the studies on commuting and issues related to it. One area with important implications in public policy is the issue of spatial mismatch and social justice (see Kain, 2004 for a review). The location–allocation models are just a set of examples of integer linear programming (ILP) applications in location analysis. The second case study in this chapter illustrates how one of the models (p-median problem) is applied for site selections of healthcare providers, and how the problem is solved by using a location–allocation tool in ArcGIS. The convenience of using ArcGIS to solve location– allocation models is evident as the inputs and outputs are in GIS format. However, its current capacity remains limited as it has yet to incorporate complex constraints. Solving more advanced models requires programming in R, SAS, or other specialized software packages.

APPENDIX 11A: HAMILTON'S MODEL ON WASTEFUL COMMUTING

Economists often make assumptions in order to simplify a model with manageable complexity while capturing the most important essence of real-world issues. Similar to the monocentric urban economic model, Hamilton (1982) made some assumptions for urban structure. One also needs to note two limitations for Hamilton in the early 1980s: the lack of intraurban employment distribution data at a fine geographic resolution and the GIS technology at its developmental stage. First, consider the commuting pattern in a monocentric city where all employment is concentrated at the CBD (or the city center). Assume that population is distributed according to a density function P(x), where x is the distance from the city center. The concentric ring at distance x has an area size 2πxdx, and thus population 2πxP(x)dx, who travels a distance x to the CBD. Therefore, the total distance D traveled by a total population N in the city is the aggregation over the whole urban circle with a radius R:

D = ∫(0 to R) x (2πxP(x)) dx = 2π ∫(0 to R) x² P(x) dx.


Therefore, the average commute distance per person A is R

A=

D 2π 2 = x P(x )dx. N N



(A11.1)

0

Now assume that the employment distribution is decentralized across the whole city according to a function E(x). Hamilton believed that this decentralized employment pattern was more realistic than the monocentric one. Similarly to Equation A11.1, the average distance of employment from the CBD is R

B=

2π 2 x E (x )dx, J



(A11.2)

0

where J is the total number of employment in the city. Assuming that residents can freely swap houses in order to minimize commute, the planning problem here is to minimize total commuting given the locations of houses and jobs. Note that employment is usually more centralized than population. The solution to the problem is that commuters always travel toward the CBD and stop at the nearest employer. Compared to a monocentric city, “displacement of a job from the CBD can save the worker a commute equal to the distance between the job and the CBD” (Hamilton, 1982, p. 1040). Therefore, “optimal” commute or required commute or minimum commute per person is the difference between the average distance of population from the CBD (A) and the average distance of employment from the CBD (B): C = A−B =

R

R





2πP0 2πE0 2 − tx 2 − rx x e dx − x e dx, N J 0

(A11.3)

0

where both the population and employment density functions are assumed to be exponential, that is, P( x ) = P0 e − tx and E (x ) = E0 e − rx , respectively. Solving Equation A11.3 yields C =−

2 πP0 −2 − tR 2 2 πE0 −2 − rR 2 R e + + R e − . tN t rJ r

Hamilton studied 14 American cities of various sizes, and found that the required commute accounts only for 13% of the actual commute, and the remaining 87% is wasteful. He further calibrated a model in which households choose their homes and job sites at random, and found that the random commute distances are only 25% over actual commuting distances, much closer than the optimal commute! There are many reasons that people commute more than the required commute predicted by Hamilton’s model. Some are recognized by Hamilton himself, such as bias in the model’s estimations (residential and employment density functions,


radial road network)* and the assumptions made. For example, residents do not necessarily move close to their workplaces when they change their jobs because of relocation costs and other concerns. Relocation costs are likely to be higher for home owners than renters, and thus home ownership may affect commute. There are also families with more than one worker. Unless the jobs of all workers in the family are at the same location, it is impossible to optimize commute trips for each income earner. More importantly, residents choose their homes for factors not related to their job sites such as accessibility to other activities (shopping, services, recreation and entertainment, etc.), quality of schools and public services, neighborhood safety, and others. Some of these factors are considered in research on explaining intraurban variation of actual commuting (Shen, 2000; Wang, 2001b).

APPENDIX 11B: CODING LINEAR PROGRAMMING IN SAS

SAS, already used in Chapters 6 and 7, is a powerful package and is particularly convenient for coding large matrices. The LP procedure in SAS solves linear programming, integer programming, and mixed-integer programming problems. The definition of an LP problem in SAS takes two formats: a dense and a sparse format. The sparse input format will be illustrated here because of its flexibility of coding large matrices. The sparse input format uses the COEF, COL, TYPE, and ROW statements to identify variables in the problem dataset, or simply uses SAS internal variable names, _COEF_, _COL_, _TYPE_, and _ROW_, respectively. The following SAS program solves the example illustrated in Section 11.1:

Data;
Input _row_ $1-6 _coef_ 8-9 _type_ $11-13 _col_ $15-19;
Cards;
Object .  max .     /*define type for Obj_func */
Object  4 .   x1    /*coefficient for 1st variable in Obj_func */
Object  5 .   x2    /*coefficient for 2nd variable in Obj_func */
Const1 12 le  _RHS_ /*Type & RHS value in 1st constraint */
Const1  2 .   x1    /*coefficient for 1st var in 1st constraint */
Const1  1 .   x2    /*coefficient for 2nd var in 1st constraint */
Const2 20 le  _RHS_ /*Type & RHS value in 2nd constraint */
Const2 -4 .   x1    /*coefficient for 1st var in 2nd constraint */
Const2  5 .   x2    /*coefficient for 2nd var in 2nd constraint */
Const3 15 le  _RHS_ /*Type & RHS value in 3rd constraint */
Const3  1 .   x1    /*coefficient for 1st var in 3rd constraint */
Const3  3 .   x2    /*coefficient for 2nd var in 3rd constraint */
;
Proc lp sparsedata; /*run the LP procedure */
Run;

*

Hamilton did not differentiate residents (population in general, including dependents) and resident workers (those actually in the labor force who commute). By doing so, it was assumed that the labor participation ratio was 100% and uniform across an urban area.


The main part in the above SAS program is to code the objective function and constraints in the LP problem. Each takes multiple records defined by four variables:

i. _row_ labels whether it is the objective function or which constraint.
ii. _coef_ defines the coefficient of a variable (or as a missing value in the first record for coding the objective function type MAX or MIN, or as the value on the right-hand side in the first record for coding each constraint).
iii. _type_ takes the value "MIN" or "MAX" in the first record for coding the objective function, or the value "LE", "LT", "EQ", "GE," or "GT" in the first record for coding each constraint, or a missing value in others.*
iv. _col_ is the variable name (or a missing value in the first record for coding the objective function, or "_RHS_" in the first record for each constraint).

As shown in the above sample program, the first record defines the type for the objective function or the right-hand side value in a constraint. Therefore, the number of records for defining the objective function or each constraint is generally one more than the number of variables. Coding the wasteful commuting case study in SAS involves more coding complexity, mainly the DO routines. The LP.SAS program included in the data folder Columbus implements the linear programming problem with comments explaining critical steps. After the data of employment, resident workers, and commute distances are read into SAS, the program first defines the constraint for employment, then the constraints for resident workers, and finally the objective function. The result is written to an external file min_com.txt containing the origin TAZ code, destination TAZ code, number of commuters and commute time on each route.

APPENDIX 11C: PROGRAMMING APPROACH TO MINIMAL DISPARITY IN ACCESSIBILITY Section 11.3 has outlined various location–allocation models formulated as optimization problems. Unlike those problems with objectives such as minimal travel impedance, maximal service coverage and minimal number of facilities, this appendix introduces a planning problem with a new objective that emphasizes equity. Take health and health care for example, equity may be defined as equal access to health care, equal utilization of healthcare service or equal (equitable) health outcomes among others (Culyer and Wagstaff, 1993). Most agree that equal access is the most appropriate principle of equity from a public health policy perspective (Oliver and Mossialos, 2004). In Wang and Tang (2013), maximal equal accessibility of services is formulated as an objective of minimizing inequality in accessibility of services. Recall that spatial accessibility is measured by the generalized two-step floating catchment area (G2SFCA) method in Chapter 5 such as *

_TYPE_ also takes the value “INTEGER” or “BINARY” to identify variables being integer-constrained or (0, 1)-constrained, respectively.


⎡ ⎛ Ai = ⎢ S j f (tij ) / ⎜ ⎝ j =1 ⎢ ⎣ n



m

∑ D f (t k

kj

k =1

⎞⎤ )⎟ ⎥, ⎠ ⎥⎦

(A11.4)

where Ai is the accessibility at location i, Sj is the capacity of supply at location j, Dk is the amount of demand at location k, t is the travel time between them, and n and m are the total number of supply locations and demand locations, respectively. Also recall that the average accessibility (weighted by demand) across all demand sites is equal to the ratio of total supply and total demand in the study area. Denoting the total supply as S (= S1 + S2 + … + Sn) and total demand as D (= D1 + D2 + … + Dm), we have the weighted average accessibility a as a constant such as m

a=

∑ ( D /D ) A i

i

= S /D.

(A11.5)

i =1

The objective function is formulated as minimizing the variance (i.e., least squares) of accessibility index Ai across all demand locations, written as: m

min

∑ D ( A − a) . i

i

2

(A11.6)

i =1

Various scenarios may be considered for designing the decision variables. In Wang and Tang (2013), the amount of supply at all locations is to be adjusted (redistributed) while holding the total supply fixed. A more realistic scenario is to distribute a fixed amount of additional supply among the existing supply facilities. In both cases, the supply capacities in facilities (Sj) are decision variables to solve, and the aforementioned design scenarios become constraints. The problems fit the description of a quadratic programming (QP), where the objective function is a quadratic function subject to linear constraints. MATLAB (www.mathworks.com/products/ matlab/) can be used to solve them. The third scenario is to construct new facilities from a list of candidate sites. Using a new set of binary decision variables xj (= 1, 0) to represent a candidate site being selected as a supply facility or not, the optimization problem is a 0-1 integer programming problem similar to those discussed in Section 11.3. But it is a quadratic (not linear) integer programming problem, and requires the use of more specialized optimization packages such as LINGO (www. lindo.com/products/lingo/).

12

Monte Carlo Method and Its Application in Urban Traffic Simulation*

Monte Carlo simulation is a numerical analysis technique that uses repeated random sampling to obtain the distribution of an unknown probabilistic entity. It provides a powerful computational framework for spatial analysis, and has become increasingly popular with rising computing power. Some applications include data disaggregation, designing a statistical significance test, and detecting spatial patterns. This chapter demonstrates the Monte Carlo technique in a case study of simulating urban traffic flows. The commonly known Urban Transportation Modeling System (UTMS) or Urban Transportation Planning System (UTPS) typically uses the four-step travel demand model, composed of trip generation, trip distribution, mode choice, and trip assignment. Each step requires significant efforts in data collection, model calibration, and validation. Even with advanced transportation modeling packages, implementation of the whole process is often arduous or infeasible. A prototype program of traffic simulation modules for education (TSME) is developed to simulate traffic with data that are widely accessible such as the Census Transportation Planning Package (CTPP) and TIGER files from the Census. Specifically, the Monte Carlo method is used in two critical steps of the program: one on disaggregating area-based residential and employment data to individual trip origins (O) and destinations (D), and another on forming realistic O-D pairs. Validation of the model compares simulated traffic to data obtained at traffic monitoring stations from Baton Rouge, Louisiana, and the results are promising. This chapter is organized as follows. Section 12.1 provides a brief introduction to the Monte Carlo simulation technique and its applications in spatial analysis. Section 12.2 summarizes the four-step travel demand forecast model and some major software packages of transportation modeling. Section 12.3 discusses how the Monte Carlo method is used to implement two tasks involved in the traffic simulation model. Section 12.4 uses a case study in Baton Rouge to illustrate the major steps in the TSME program. Section 12.5 concludes the chapter with a brief summary.

*

Authors: Yujie Hu and Fahui Wang, Department of Geography and Anthropology, Louisiana State University, Baton Rouge, LA 70803.



12.1 MONTE CARLO SIMULATION METHOD

12.1.1 Introduction to Monte Carlo Simulation

Basically, a Monte Carlo method generates suitable random numbers of parameters or inputs to explore the behavior of a complex system or process. The random numbers generated follow a certain probability distribution function (PDF) that describes the occurrence probability of an event. Some common PDFs include the following: 1. A normal distribution is defined by a mean and a standard deviation, similar to the Gaussian function introduced in Section 2.3. Values in the middle, near the mean, are most likely to occur, and the probability declines symmetrically from the mean. 2. If the logarithm of a random variable is normally distributed, the variable’s distribution is lognormal. For a lognormal distribution, the variable takes only positive real values, and may be considered as the multiplicative product of several independent random variables that are positive. The left tail of a lognormal distribution is short and steep and approaches toward 0, and its right tail is long and flat and approaches toward infinity. 3. In a uniform distribution, all values have an equal chance of occurring. 4. A discrete distribution is composed of specific values, each of which occurs with a corresponding likelihood. For example, there are several turning choices at an intersection, and a field survey suggests a distribution of 20% turning left, 30% turning right, and 50% going straight. In a Monte Carlo simulation, a set of random values are generated according to a defined PDF. Each set of samples is called an iteration and recorded. This process is repeated a large number of times. The larger the number of iteration times simulated, and the better the simulated samples conform to the predefined PDF. Therefore, the power of the Monte Carlo method relies on a large number of simulations. By doing so, Monte Carlo simulation provides a comprehensive view of what may happen and the probability associated with each scenario. Monte Carlo simulation is a means of statistical evaluation of mathematical functions using random samples. Software such as Matlab and R package provides several random number generators corresponding to different PDFs (e.g., a normal distribution). Others use pseudorandom number sampling. For example, the inverse transformation generates sample numbers at random from a probability distribution defined by its cumulative distribution function (CDF), and is therefore also called inverse CDF transformation. Specifically, one can generate a continuous random variable X by (i) generating a uniform random variable U within (0, 1), and (ii) setting X = F−1(U) for transformation to solve X in terms of U. Similarly, random numbers following any other probability distributions could be obtained.
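As a minimal illustration of the inverse CDF transformation (a hypothetical example in R, using an exponential distribution with rate 0.5), drawing U on (0, 1) and applying X = F−1(U) = −ln(1 − U)/0.5 reproduces the target distribution:

set.seed(1)
u <- runif(10000)                            # step (i): uniform random numbers on (0, 1)
x <- -log(1 - u) / 0.5                       # step (ii): X = F^-1(U) for an exponential CDF
c(mean(x), mean(rexp(10000, rate = 0.5)))    # both close to the theoretical mean of 2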

12.1.2 Monte Carlo Applications in Spatial Analysis

The Monte Carlo simulation technique is widely used in spatial analysis. Here, we briefly discuss its applications in spatial data disaggregation and statistical testing.


Spatial data often come as aggregated data in various areal units (sometime large areas) for various reasons. One likely cause is the concern of geo-privacy as discussed in Section 9.1. Others include administrative convenience, integration of various data sources, and limited data storage space, etc. Several problems are associated with analysis of aggregated data such as 1. Modifiable areal unit problem (MAUP), that is, instability of research results when data of different areal units are used, as raised in Section 1.3.2 2. Ecological fallacy, when one attempts to infer individual behavior from data of aggregate areal units (Robinson, 1950). 3. Loss of spatial accuracy when representing areas by their centroids in distance measures. Therefore, it is desirable to disaggregate data in area units to individual points in some studies. For example, Watanatada and Ben-Akiva (1979) used the Monte Carlo technique to simulate representative individuals distributed in an urban area in order to estimate travel demand for policy analysis. Wegener (1985) designed a Monte Carlo based housing market model to analyze location decisions of industry and households, corresponding migration and travel patterns, and related public programs and policies. Poulter (1998) employed the method to assess uncertainty in environmental risk assessment and discussed some policy questions related to this sophisticated technique. Luo et al. (2010) used it to randomly disaggregate cancer cases from the zip code level to census blocks in proportion to the age–race composition of block population, and examined implications of spatial aggregation error in public health research. Gao et al. (2013) used it to simulate trips proportionally to mobile phone Erlang values and to predict traffic-flow distributions by accounting for the distance decay rule. Another popular application of Monte Carlo simulation in spatial analysis is to test statistical hypotheses using randomization tests. The tradition can be traced back to the seminal work by Fisher (1935). In essence, it returns test statistics by comparing observed data to random samples that are generated under a hypothesis being studied, and the size of the random samples depends on the significance level chosen for the test. A major advantage of Monte Carlo testing is that investigators could use flexible informative statistics rather than a fixed, known distribution theory. Besag and Diggle (1977) described some simple Monte Carlo significance tests in analysis of spatial data including point patterns, pattern similarity, and space–time interaction. Clifford et al. (1989) used a Monte Carlo simulation technique to assess statistical tests for the correlation coefficient or the covariance between two spatial processes in spatial autocorrelation. Anselin (1995) used Monte Carlo randomization in the design of statistical significance tests for global and local spatial autocorrelation indices. Shi (2009) proposed a Monte Carlo-based approach to test whether the spatial pattern of actual cancer incidences is statistically significant by computing p-values based on hundreds of randomized cancer distribution patterns. The case study in Section 12.4 utilizes Monte Carlo simulation for data disaggregation, one from area-based aggregated data to individual points and another from area-based flow data to individual O-D trips. Both are commonly encountered in spatial analysis.


12.2 TRAVEL DEMAND MODELING The history of travel demand modeling, especially in the US, has been dominated by the four-step model (McNally, 2008), composed of trip generation, trip distribution, mode choice, and trip assignment. The first comprehensive application of the four-step model was in the Chicago Area Transportation Study (Weiner, 1999). This section provides a brief overview of each step. Traffic originated from areas or going to the same units is collectively referred to as trip generation (Black, 2003, p. 151). The former is sometimes called trip production, and the latter trip attraction. Trip production is usually estimated by trip purposes (e.g., work, school, shopping, etc.). Consider work trips produced by residential areas as an example. A multivariate regression model can be used to estimate the number of trips by several independent variables such as population, labor participation ratio, vehicle ownership, income and others at a zonal (e.g., subdivision, census tract, or traffic analysis zone) level. Trip attraction is often modeled by different types of land use (e.g., port, factory, hotel, hospital, school, office, store, service, park), and similarly with various regression models. In either trip production or attraction, there are hundreds of equations with coefficients tabulated for various sizes of metropolitan areas and different regions based on data collected over the years (ITS, 2012). On the basis of the results from trip generation, trip distribution distributes the total number of trips from an origin (or to a destination) to specific O-D pairs. In other words, trip generation estimates Oi and Dj, and trip distribution breaks them down to Tij so that Oi =

∑T,

Dj =

∑T.

j

i

ij

ij

Among various models, the gravity model discussed in Section 2.3 (Equation 2.5) remains the most popular approach for implementing trip distribution. Calibration of the parameters such as the distance friction and scale adjustment factors in the gravity model involves an iteration process as outlined in FHWA (1977). With the O-D traffic obtained from trip distribution, mode choice analysis splits the traffic among available modes such as drive-alone, carpool, public transit, bicycle and others. The logit model is commonly used to estimate probabilities of travelers choosing particular modes. For example, for i = 1, 2, …, m modes, the probability of choosing the ith mode, P(i), is written as P(i ) =

eV (i )



m j =1

,

eV ( j )

where V is the perceived utility of each mode, which is estimated by various attributes associated with a mode. Attributes influencing the mode choice include travel


time, travel cost, convenience, comfort, trip purpose, automobile availability, and reliability (Black, 2003, pp. 188–189). The logit model can be estimated by a logistic regression based on a survey of travelers. Finally, trip assignment predicts the paths (routes) for the trips by a particular mode. Section 2.2 discusses the algorithm for computing the shortest (minimum) route. The approach assumes that travel speed is constant (static) on each road segment, commonly referred to as “free-flow speed.” However, the travel speed usually decreases when the traffic flow increases beyond a capacity constraint (Khisty and Lall, 2003, p. 533). The traffic-flow-dependent travel time is represented by a function such as TQ = T0 [1 + α(Q /Qmax )β ], where TQ is the travel time at traffic flow Q (vehicles per hour), T0 is the free-flow travel time, Qmax is practical capacity, and α and β are parameters. The trip assignment by accounting for capacity restraints calls for a dynamic modeling approach by adjusting traffic assignments on various routes in response to the change of speeds dependent on traffic flow until equilibrium is reached. It also requires the road capacity data. Transportation modeling is conducted at different levels of detail from microscopic models for individual vehicles to mesoscopic models for aggregated behavior of vehicles, and to macroscopic models of traffic flow between areas (Kotusevski and Hawick, 2009). There are a large number of transportation modeling packages for completing some or all of the four steps of travel demand forecasting. The following is a brief overview that surveys some of the popular software: 1. SUMO (namely, Simulation of Urban Mobility) is an open source, portable, and microscopic traffic simulation package mainly developed by the Institute of Transportation Systems at the German Aerospace Center (http:// sumo-sim.org/) (Krajzewicz et al., 2012). It is capable of simulating individual vehicles with their own routes and moving behavior. 2. CORSIM (known as Corridor Simulation) is also a microscopic traffic simulation model that integrates the previously NETSIM (for surface street simulation of signal systems) and FRESIM (for freeway simulation), and is thus also referred to as TSIS (Traffic Software Integrated System). Developed by McTrans Center at the University of Florida (http://mctrans. ce.ufl.edu/featured/tsis/), CORSIM is applicable to road networks with a complete selection of control devices (e.g., stop/yield sign, traffic signal, ramp metering), and uses commonly employed vehicle and driver models to simulate traffic and traffic control systems. 3. TRANSIMS (Transportation Analysis and Simulation System) is an integrated set of tools for travel forecasts for transportation planning and emissions analysis on regional transportation systems (https://www.fhwa. dot.gov/planning/tmip/resources/transims/). Its development has been supported by the U.S. Department of Transportation and the Environmental

264

Quantitative Methods and Socioeconomic Applications in GIS

Protection Agency. Based on a cellular automata model, TRANSIMS considers synthetic populations and their activities during the simulation of individual travelers.
4. TransModeler is a microscopic traffic simulation package developed by Caliper (http://www.caliper.com/transmodeler). It simulates multimodal networks in a large area in great detail, and visualizes traffic flow dynamics, traffic signal and ITS operations, and overall network performance in a two-dimensional or three-dimensional GIS environment. The latest development of TransModeler enables dynamic modeling of route choices based upon historical or simulated time-dependent travel times. Results from TransModeler can easily be integrated with TransCAD, another popular travel demand forecasting package developed by Caliper.

Traffic simulation is a major task in transportation planning and analysis. The aforementioned programs require extensive data preparation and input (e.g., detailed land use data and road network data with turning lanes and traffic signals), have built-in algorithms unknown to users, and incur a significant learning curve for users. The next two sections introduce a prototype model with four traffic simulation modules for education purposes (termed "TSME"). With minimal data requirements (using publicly accessible data on road network and land use), the model illustrates the major steps of travel demand forecasting (excluding the step of "mode choice") in distinct modules. The simulation result is validated against observed traffic data, and the validation also helps inform the adjustment of the trip simulation algorithm.
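To make the preceding formulas concrete, the following is a minimal Python sketch (not part of the TSME program or any package surveyed above) of the logit mode split and the traffic-flow-dependent travel time function; the utility values, flows, and the parameters α = 0.15 and β = 4 are illustrative assumptions only.

```python
import math

def logit_shares(utilities):
    """Multinomial logit: P(i) = exp[V(i)] / sum over j of exp[V(j)]."""
    exp_v = [math.exp(v) for v in utilities]
    total = sum(exp_v)
    return [e / total for e in exp_v]

def congested_time(t0, q, q_max, alpha=0.15, beta=4.0):
    """Traffic-flow-dependent travel time TQ = T0 * [1 + alpha * (Q / Qmax) ** beta]."""
    return t0 * (1.0 + alpha * (q / q_max) ** beta)

if __name__ == "__main__":
    # Hypothetical utilities for drive-alone, carpool, and public transit.
    shares = logit_shares([-0.5, -1.2, -2.0])
    print([round(s, 3) for s in shares])               # the three mode shares sum to 1
    # Free-flow time of 10 min at a flow of 1800 veh/h on a 1500 veh/h capacity road.
    print(round(congested_time(10.0, 1800, 1500), 2))  # 13.11 min with the default alpha and beta
```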

12.3

EXAMPLES OF MONTE CARLO–BASED SPATIAL SIMULATION

This section uses two tasks encountered in our traffic simulation program to illustrate the implementation of Monte Carlo simulation in spatial data analysis. The first task is to disaggregate areal data randomly to individual points so that the density pattern of simulated points reflects the pattern at the zonal level. The second is to randomly connect the points between zones so that the volume of resulting connections is consistent with a predefined (observed) interzonal flow pattern. While the examples are extracted from transportation modeling, both have potential for broad applications. Data disaggregation from areas to points, illustrated in the first example, is a common task in spatial analysis, as reviewed in Section 12.1.2. The second example provides a convenient tool for disaggregating interzonal flow data to flows between points, which could be used in analysis beyond traffic simulation (e.g., population or wildlife migration, disease spread, journey to crime, etc.).

The first example is related to the second module of the TSME program on simulating trip origins and destinations as points from resident worker and employment data in census tracts. The spatial boundary of each census tract is defined by the recorded X and Y coordinates of all vertices on its boundary. The number of simulated trip origins is equal (or proportional) to the number of resident workers in a tract, and the number of trip destinations is equal (or proportional) to the number of jobs there.


For simplicity, the spatial distribution of resident workers or jobs is assumed to be uniform within a tract. The following illustrates the implementation process based on Monte Carlo simulation:

1. Calculating the spatial extent of each census tract. Based on the spatial extent of a tract (e.g., recorded in the text file trt.txt by using the ArcGIS tool "Write Features To Text File," as illustrated in the case study in Section 12.4), we derive the maximums and minimums for its X and Y coordinates, that is, Xmax, Xmin, Ymax, and Ymin.
2. Generating X and Y coordinates in the corresponding ranges. Use Monte Carlo simulation to generate a random number Xi within the range [Xmin, Xmax] and another, Yi, within [Ymin, Ymax], each following a uniform distribution. Specifically, the "Random()" function is used with the system time as the generating seed.
3. Determining whether a point is located within the census tract. Based on an algorithm built upon Taylor's (1989) ray method, detect whether the point (Xi, Yi) is located inside or outside of the census tract. If inside, it is retained as a trip origin (destination); if outside, it is discarded.

A program written in C# in Visual Studio 2010 automates the above process, available under the tab "Monte Carlo Simulation of O's & D's" in the TSME program interface. Figure 12.1a and b shows 5000 simulated origins (resident workers) and 5000 simulated destinations (jobs) in four tracts, respectively. Clearly, the pattern of origins differs from that of destinations.

FIGURE 12.1 Monte Carlo simulations of (a) resident workers, and (b) jobs.

The second example simulates trips by pairing origins and destinations given the zonal-level flow patterns (e.g., defined in the table trip_mtx.txt in the case study in Section 12.4). It utilizes the results from the first example, that is, one set of origins and another of destinations (e.g., tables O_all.txt and D_all.txt in the case study in Section 12.4). The following illustrates the process based on Monte Carlo simulation (a code sketch of both examples follows these steps):

1. Randomly choosing an origin point in a zone. Say a zone m contains Om origins. Use the Monte Carlo method to randomly choose an origin, denoted as Oi, where i ∈ [1, Om].
2. Randomly choosing a destination point in another zone. Similarly, randomly choose a destination from a zone with Dn destinations, denoted as Dj, where j ∈ [1, Dn].
3. Forming O-D trips and counting frequency. Link Oi and Dj to form a trip. Accumulate the count of trips from the origin zone to the destination zone, denoted by Fmn.
4. Capping the number of O-D trips. Continue iterating the above three steps until Fmn reaches the predefined zonal-level flow Fmn0.

The process is automated in a C# program, available under the tab "Monte Carlo Simulation of Trips" in the TSME program interface.
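The following is a minimal Python sketch of both examples, not the book's C# implementation: the tract polygon, zone labels, and flow cap are hypothetical, the ray-casting test is a simple stand-in for the Taylor (1989) method cited above, and, for clarity, the trip pairing fills each zone pair's cap in turn rather than drawing zone pairs at random.

```python
import random

def point_in_polygon(x, y, polygon):
    """Ray-casting test: a point is inside if a ray toward +x crosses the boundary an odd number of times."""
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        if (y1 > y) != (y2 > y):                          # edge straddles the ray's y level
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def simulate_points(polygon, count):
    """Example 1: rejection sampling of uniformly distributed points inside a tract."""
    xs, ys = zip(*polygon)
    xmin, xmax, ymin, ymax = min(xs), max(xs), min(ys), max(ys)
    points = []
    while len(points) < count:
        x = random.uniform(xmin, xmax)                    # random X within [Xmin, Xmax]
        y = random.uniform(ymin, ymax)                    # random Y within [Ymin, Ymax]
        if point_in_polygon(x, y, polygon):               # keep only points inside the tract
            points.append((x, y))
    return points

def simulate_trips(origins, destinations, flow):
    """Example 2: pair random origins and destinations until each zonal flow cap Fmn0 is met."""
    trips = []
    for (m, n), cap in flow.items():
        for _ in range(cap):
            o = random.choice(origins[m])                 # random origin point in zone m
            d = random.choice(destinations[n])            # random destination point in zone n
            trips.append((o, d))
    return trips

if __name__ == "__main__":
    tract = [(0, 0), (4, 0), (4, 3), (0, 3)]              # hypothetical tract boundary vertices
    origins = {"A": simulate_points(tract, 50)}
    destinations = {"B": simulate_points(tract, 40)}
    od_trips = simulate_trips(origins, destinations, {("A", "B"): 30})
    print(len(od_trips))                                  # 30 simulated O-D trips
```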

12.4

CASE STUDY 12: MONTE CARLO–BASED TRAFFIC SIMULATION IN BATON ROUGE, LOUISIANA

This case study is developed to illustrate the applications of Monte Carlo simulation in spatial analysis as discussed in Section 12.3. In addition, for the reasons explained in Section 12.2, it is desirable to have a simple traffic simulation program that utilizes commonly available data sources and estimates the spatial pattern of urban traffic with reasonable accuracy. Such a program needs to be designed in a way that helps us understand the major tasks and challenges of travel demand modeling, and, if feasible, it may also enhance our understanding of some major GIS techniques, the very purpose of this book. It is with these objectives in mind that we take on the challenge of developing a prototype traffic simulation program for educational purposes. It is composed of four modules that roughly correspond to the four-step travel demand forecasting model, except for the absence of the "mode choice" step (given our focus on personal vehicle traffic and also for simplicity) and the addition of an "interzonal trip estimation" module (considered an extended task of data preparation). We also adopt the modular approach in anticipation that some users may use a module for applications beyond traffic simulation, as discussed in Section 12.3.

12.4.1

Data Preparation and Program Overview

The study area for this case study is East Baton Rouge Parish (EBRP) of Louisiana (a parish in Louisiana is a county-equivalent geopolitical unit), highlighted by the shaded area in Figure 12.2. The study area represents the urbanized core of the Baton Rouge Metropolitan Area (Wang et al., 2011), with annual average daily traffic (AADT) data recorded by traffic monitoring stations. While the focus is on this core area, we also include eight neighboring parishes in the metropolitan area in the traffic simulation in order to account for internal–external, external–internal, and through traffic. Validation of simulation results is confined to the core area (EBRP), where the observed traffic data are available. Hereafter, the study area is simply referred to as Baton Rouge. The following datasets are prepared and provided for this case study under the data folder BatonRouge:

FIGURE 12.2 Traffic monitoring stations and adjacent areas in Baton Rouge.

1. Geodatabase BRMSA.gdb includes (a) an area feature trt and its corresponding point feature trt_pt for 151 census tracts and their centroids, respectively (each containing fields res_worker and employment for the number of resident workers and jobs, respectively; a field trtID is also added to the attribute tables to assign an index value of 1–151 to the tracts), (b) a feature dataset rd containing all feature classes associated with the road network dataset, including the edges and junctions, and (c) a point feature station for 816 traffic count stations (its field ADT representing the annual average daily traffic (AADT) in 2005).
2. For convenience, the geodatabase BRMSA.gdb also includes other features that are to be produced in the project. For example, station_rd is the point feature generated by a spatial join between features station and rd (from step 8).
3. An executable program TSME.exe has four modules for the corresponding tasks in traffic simulation (depending on the system, one may be asked to download and install the .NET framework).
4. A toolkit "Features To Text File.tbx" and the corresponding folder Scripts include a deprecated ArcGIS tool extracted from the "Samples Toolbox" (http://resources.arcgis.com/gallery/file/geoprocessing/details?entryID=F25C5576-1422-2418-A060-04188EBD33A9). It will be used in step 4.
5. Several ASCII files are also to be generated in the project and are provided here for convenience: the file pop_emp.txt for the number of resident workers and employment in census tracts (step 1), OD_time.txt for the travel time between them (step 2), trip_mtx.txt for the

estimated number of interzonal trips (step 3), trt.txt for the spatial extent of the census tract boundaries (step 4), O_all.txt and D_all.txt for trip origins and destinations, respectively (step 5), OD_indiv.txt for individual trips (step 6), station_rd.txt for the closest road segment associated with each traffic count station (also step 8), and traffic_simu.txt for observed and simulated traffic at traffic monitoring stations (step 9). Note that the file route.txt from step 8 is not included due to its large size.

The spatial data for census tracts and the road network are based on the 2010 TIGER files (http://www.census.gov/geo/maps-data/data/tiger-line.html). Information on the number of resident workers and jobs in census tracts is extracted from the 2006–2010 five-year Census Transportation Planning Package (CTPP) for the study area (http://ctpp.transportation.org/Pages/5-Year-Data.aspx). Similar to the datasets used in Case Study 11A in Section 11.2.2, the information for resident workers and population is based on Part 1, and the information for jobs is based on Part 2. The traffic count data for 816 monitoring stations in the central parish (EBRP) are downloaded from the Louisiana Department of Transportation and Development (DOTD) website (http://www.dotd.la.gov/highways/tatv/default.asp). The most recent traffic count data, from 2005, are used for validation of our simulation result.

Figure 12.3 outlines the workflow of the TSME program. Based on the number of resident workers and employment (jobs) in census tracts, Module 1 uses a gravity model to estimate the zone-level traffic between tracts. As illustrated in Section 12.3, Module 2 uses the Monte Carlo method to randomly simulate individual trip origins (proportional to the distribution of resident workers) and destinations (proportional to the distribution of employment). Again based on the Monte Carlo method, Module 3 connects the origins and destinations randomly and caps the volume of O-D trips between each pair of zones in proportion to the estimated interzonal traffic from Module 1. Module 4 calibrates the shortest routes for all O-D trips, measures the simulated traffic through each monitoring station, and compares it to the observed traffic for validation. The validation result may be fed back to Module 1, and the user can adjust the parameters in the gravity model and repeat the process multiple times until the best fit between simulated and observed traffic data is achieved (a minimal sketch of this feedback loop is given below). The "optimal" parameters in the gravity model thus capture the human mobility, measured by the distance decay rule, specific to the study area. This is considered a much improved approach to estimating the distance decay function and its associated parameters, compared to simply relying on the zone-level traffic data as discussed in Section 2.3.

In essence, Module 2 corresponds to the task of "trip generation" in the four-step model, Module 3 implements the task of "trip distribution," and part of Module 4 does "trip assignment." Module 1 of the program is equivalent to the zone-level trip distribution that guides Module 3 to simulate more realistic individual trips. For the reasons stated previously, the program does not include the task of "mode choice." Sections 12.4.2 through 12.4.5 provide a step-by-step walkthrough of the TSME program.
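The calibration feedback just described amounts to a simple search over candidate parameter values. The following is a minimal Python sketch of that loop, assuming two hypothetical callables that stand in for rerunning Modules 1–4 and for scoring the fit against observed traffic; the candidate β values are illustrative only.

```python
def calibrate_beta(run_simulation, fit_score, betas=(0.5, 1.0, 1.5, 2.0)):
    """Feedback loop of Figure 12.3: rerun the simulation for each candidate distance
    friction coefficient and keep the beta that best fits the observed traffic.

    run_simulation(beta) -> simulated station traffic (hypothetical callable)
    fit_score(simulated) -> goodness of fit, larger is better (hypothetical callable)
    """
    best_beta, best_score = None, float("-inf")
    for beta in betas:
        score = fit_score(run_simulation(beta))
        if score > best_score:
            best_beta, best_score = beta, score
    return best_beta, best_score
```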

FIGURE 12.3 Workflow of the TSME.

12.4.2

Module 1: Interzonal Trip Estimation

The gravity model is perhaps the most widely used method for estimating interzonal traffic, similar to Equation 2.4 introduced in Section 2.3:

Tij = a Oi Dj dij^(−β),

where Tij is the number of trips between zones (here, tracts) i and j, Oi is the size of origin i, Dj is the size of destination j, a is a scaling factor, dij is the distance (here, travel time) between them, and β is the distance friction coefficient. Here, the conventional power function is chosen as the distance decay function; other functions discussed in Section 2.3 may also be considered. This case study uses the number of resident workers to represent the size of an origin, and the number of jobs to represent that of a destination. This does not imply


that the simulated traffic is commuting trips per se, but it perhaps closely resembles home-based trips, which account for by far the majority (70–75%) of urban trips by individuals (Schultz and Allen, 1996). Resident workers (W) represent not just work trips but also serve as a good proxy for the population in general that produces other trips. Employment (E) captures all other nonresidential land uses (industrial, commercial, schools, and other businesses) that attract trips. Specifically, the daily traffic from zone i to j is modeled as

Tij = a(WiEj + EiWj) dij^(−β),

where the term WiEj accounts for trips originating from homes and the term EiWj for trips ending at homes. Admittedly, this approach reflects a high level of abstraction of urban trips with a rich set of diverse purposes, but it is a necessary simplification given the data limitations in many transportation planning projects.

Step 1. Preparing the attribute table of census tracts in ArcGIS: In ArcGIS, export the attribute table of the feature trt to a text file pop_emp.txt. One may simplify the file by keeping only fields such as trtID, res_worker, employment, and shape_length.

Step 2. Computing the O-D time matrix between census tracts in ArcGIS: The zone-level trip estimation assumes that a trip begins and ends at the zonal centroids. Follow the network analysis steps (e.g., steps 9–12 in Section 2.4.2) to compute the O-D time between tract centroids. Note that the feature trt_pt is loaded as both origins and destinations; choose the field trtID for both Sort Field and Name Field in each case, and set the Search Tolerance as 7500 m. Table 12.1 lists estimated computing time for major tasks, including this one (time is negligible for tasks not listed). Similar to step 2 in Section 10.3, intrazonal time here is approximated as one-fourth of a tract's perimeter divided by a speed (670.56 m/min, i.e., 25 mph, in this study). The total travel time between two tracts is composed of the aforementioned network time and the intratract times at both the origin and destination tracts. As discussed in Section 10.3, there are other alternatives for modeling intrazonal travel time. Some attempt to couple the estimates with a zone's area size, some with a zone's perimeter (as adopted here), and others adjust by its shape. All are imperfect and call for a method that simulates the locations of individual origins and destinations, which is the approach adopted in the next three modules of TSME. Export the O-D time matrix table to a (comma-separated) text file OD_time.txt with 151 × 151 = 22,801 records.

Step 3. Estimating interzonal traffic in TSME: Double-click TSME.exe to launch the program. Select the tab "Interzonal Trip Estimation" to run the first module. As shown in Figure 12.4, (1) import the travel time impedance matrix file OD_time.txt prepared in step 2, and import the population and employment file pop_emp.txt prepared in step 1 (both files are also provided under the data folder); (2) specify the output trip matrix file (e.g., trip_mtx.txt by default); (3) under "Parameter settings," select corresponding fields associated with input files (here automatically populated by the program once the files are loaded, as


TABLE 12.1
Major Tasks and Estimated Computation Time in Traffic Simulation

Step Index | Task | Data Size | Time (min)
2 | Computing O-D time matrix between census tracts in ArcGIS | 151 × 151 = 22,801 records |
5 | Simulating origins and destinations in TSME | 500,000 origins/destinations |
6 | Simulating individual O-D trips in TSME | 75,000 O-D trips |
7 | Computing shortest paths in ArcGIS: loading stops | 2 × 75,000 = 150,000 locations |
7 | Computing shortest paths in ArcGIS: solving routes | 75,000 routes |
8 | Executing the tool "Copy Traversed Source Features" | |
8 | Exporting attribute table of Edges in ArcGIS | |
9 | Counting simulated traffic at stations in TSME | |
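As a complement to the Module 1 walkthrough above, the following is a minimal Python sketch (not the TSME implementation) of the interzonal trip estimation Tij = a(WiEj + EiWj) dij^(−β), including the perimeter-based intrazonal time approximation; all input values and the parameters a = 1 and β = 1.5 are hypothetical.

```python
def intrazonal_time(perimeter_m, speed_m_per_min=670.56):
    """Intrazonal time: one-fourth of a tract's perimeter divided by speed (25 mph)."""
    return perimeter_m / 4.0 / speed_m_per_min

def interzonal_trips(workers, jobs, network_time, perimeters, a=1.0, beta=1.5):
    """Zone-level trips T_ij = a * (W_i*E_j + E_i*W_j) * d_ij**(-beta),
    where d_ij is the network time plus the intrazonal times at both ends."""
    n = len(workers)
    trips = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue                                   # interzonal flows only
            d_ij = (network_time[i][j]
                    + intrazonal_time(perimeters[i])
                    + intrazonal_time(perimeters[j]))
            trips[i][j] = a * (workers[i] * jobs[j] + jobs[i] * workers[j]) * d_ij ** (-beta)
    return trips

if __name__ == "__main__":
    workers = [1200, 800, 1500]                            # hypothetical resident workers per tract
    jobs = [900, 2000, 600]                                # hypothetical jobs per tract
    perimeters = [8000.0, 6500.0, 9200.0]                  # hypothetical tract perimeters (m)
    network_time = [[0.0, 12.5, 20.1],                     # hypothetical centroid-to-centroid times (min)
                    [12.5, 0.0, 9.8],
                    [20.1, 9.8, 0.0]]
    t = interzonal_trips(workers, jobs, network_time, perimeters)
    print(round(t[0][1], 1))                               # estimated flow from tract 1 to tract 2
```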

choose X for X Field and Y for Y Field, and click OK to create a layer "OD_indiv.txt Events." It shows that most trip origins and destinations are located in the central parish (EBRP). Activate the New Route tool in the Network Analyst module. Similar to the O-D Cost Matrix tool used in step 2, we need to define the Stops by loading the origins and destinations for all trips. In the Network Analyst window, right-click the layer Stops (0) > Load Locations > select OD_indiv.txt Events as the Load From feature; under Location Analysis Properties, specify the field for RouteName as Route_ID; under Location Position, choose Use Geometry and set the Search Tolerance to 7500 m, similar to step 2. Click OK to load the points (see Table 12.1 for estimated computation time). After loading the 2 × 75,000 = 150,000 locations, click Solve to compute the shortest paths for the 75,000 O-D trips (also see Table 12.1 for estimated computation time). The result is saved in the layer Routes.

Step 8. Recovering route connections for each shortest path in ArcGIS: Information on the specific route connections for each path is needed to detect the simulated traffic flow through a station, but such information is not preserved in step 7. An ArcGIS tool is used to recover the data. In ArcToolbox, select Network Analysis Tools > Analysis > Copy Traversed Source Features. In the dialog box, choose


Routes (produced in step 7) for Input Network Analysis Layer, select the geodatabase BRMSA.gdb as the Output Location, and accept the default names for the Edge and Junction feature classes and the Turns table. Click OK to run the tool (see Table 12.1 for estimated computation time). Only the Edges feature class is provided in the data folder to save space. Open the attribute table of Edges, and export it to a text file route.txt (see Table 12.1 for estimated computation time). Use a spatial join to identify the nearest road (feature rd) for each traffic monitoring station (feature station), and name the output feature class station_rd (refer to step 13 in Section 1.3.2). Export the attribute table of station_rd to a text file station_rd.txt (the one provided under the data folder is simplified to only the fields STATION, ADT1, and OBJECTID_1, representing the traffic monitoring station ID, the annual average daily traffic, and the closest road ID, respectively).

Step 9. Counting simulated traffic at stations in TSME: On the TSME program main menu, switch to the last tab "Trip Assignment & Validation." As shown in Figure 12.7, the major steps for trip assignment are already implemented in steps 7–8 in ArcGIS. For model validation, import the text file identifying the nearest road IDs for traffic monitoring stations (station_rd.txt) and the shortest route file (route.txt), both obtained from step 8; specify the output file (e.g., traffic_simu.txt by default); the default parameter settings are as shown in Figure 12.7. Click the button under the Execution panel to run the module. See the reported time for running this task in Table 12.1.
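The counting in step 9 boils down to tallying, for each station's nearest road segment, how many simulated routes traverse that segment. The following is a minimal Python sketch of that logic, not the TSME code; the layouts of the two hypothetical inputs (traversed edges and the station-to-road lookup) are assumptions.

```python
from collections import Counter

def count_station_traffic(route_edges, station_roads):
    """Count simulated traffic at each station.

    route_edges: list of (route_id, road_id) pairs, one per traversed edge
    station_roads: dict of station_id -> nearest road_id
    Returns a dict of station_id -> number of simulated routes passing the station's road.
    """
    # Number of distinct routes traversing each road segment (a route counts once per segment).
    edge_volume = Counter(road_id for _, road_id in set(route_edges))
    return {station: edge_volume.get(road_id, 0)
            for station, road_id in station_roads.items()}

if __name__ == "__main__":
    # Hypothetical traversed edges (route_id, road_id) and station-to-road lookup.
    edges = [(1, 101), (1, 102), (2, 102), (3, 102), (3, 205)]
    stations = {"S1": 102, "S2": 205, "S3": 999}
    print(count_station_traffic(edges, stations))   # {'S1': 3, 'S2': 1, 'S3': 0}
```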

FIGURE 12.7 TSME interface for the trip assignment and validation module.

FIGURE 12.8 Observed versus simulated traffic (fitted line: y = 0.0316x + 19.742, R2 = 0.4322).

The resulting table traffic_simu.txt includes the ID, observed traffic, and simulated traffic at the 816 stations. The model's emphasis is on the spatial variability of traffic in a study area instead of the total traffic volume. Figure 12.8 shows R2 = 0.4322 (i.e., a correlation coefficient of 0.66). Considering that the model uses limited data with a much simplified structure, the result is encouraging. One may repeat the analysis by experimenting with different parameters (e.g., the β value and scale factor in the gravity model) in Module 1, using a different number of origins (destinations) in Module 2, or imposing a different number of O-D trips to simulate in Module 3, and examine whether the result can be improved.
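The fit statistic reported above can be recomputed from traffic_simu.txt outside of TSME. The following is a minimal Python sketch, assuming (hypothetically) a comma-separated file with a header row whose last two columns are the observed and simulated traffic.

```python
import csv
import math

def r_squared(observed, simulated):
    """Square of the Pearson correlation between observed and simulated traffic."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_s = sum(simulated) / n
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(observed, simulated))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    r = cov / math.sqrt(var_o * var_s)
    return r * r

if __name__ == "__main__":
    observed, simulated = [], []
    with open("traffic_simu.txt", newline="") as f:
        reader = csv.reader(f)
        next(reader)                              # skip the header row (assumed)
        for row in reader:
            observed.append(float(row[-2]))       # observed traffic column (assumed)
            simulated.append(float(row[-1]))      # simulated traffic column (assumed)
    print(round(r_squared(observed, simulated), 4))
```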

12.5

SUMMARY

This chapter introduces the Monte Carlo simulation technique, a numerical technique with increasing applications in spatial analysis. A Monte Carlo method generates random numbers following a certain probability distribution function. In our case study, we illustrate its applications in spatial data disaggregation: one disaggregating areal data to individual points within the areas, and another disaggregating interzonal flow data to individual flows between points. Both uniform and discrete probability distributions are used in our simulation examples. The application of the Monte Carlo technique is demonstrated in a case study of simulating urban traffic flows. A prototype program of traffic simulation modules for education (TSME) is developed to implement the case study. Several reasons motivated the development of such a program despite the many advanced and sophisticated transportation modeling packages available: (1) the need for a simple traffic simulation program that utilizes commonly available data sources, (2) the benefit of a program that helps users understand the major steps and tasks in traffic simulation and reveals the human mobility behind the algorithms, and (3) the potential of a program whose modules may also be useful for some general spatial


analysis tasks beyond transportation modeling. The TSME program generates some promising results. In addition to several routine GIS tasks that have been introduced in previous chapters, the case study in this chapter illustrates several new tools. The tool "Write Features To Text File" in step 4, though deprecated, saves the area boundary information to a text file and may be useful for GIS programmers. Step 7 illustrates a way to define routes by the coordinates of origins and destinations linked through their common route IDs, and then to use the ArcGIS Network Analyst to compute a series of routes. The method is useful for other purposes beyond traffic simulation. Another new tool, "Copy Traversed Source Features," is used in step 8 to recover route connections for shortest paths.

References Abu-Lughod, J. 1969. Testing the theory of social area analysis: The ecology of Cairo, Egypt. American Sociological Review 34, 198–212. Agnew, R. 1985. A revised strain theory of delinquency. Social Forces 64, 151–167. Alonso, W. 1964. Location and Land Use. Cambridge, MA: Harvard University. Alperovich, G. 1982. Density gradient and the identification of CBD. Urban Studies 19, 313–320. Anderson, J. E. 1985. The changing structure of a city: Temporal changes in cubic spline urban density patterns. Journal of Regional Science 25, 413–425. Anselin, L. 1988. Spatial Econometrics: Methods and Models. Dordrecht, Netherlands: Kluwer. Anselin, L. 1995. Local indicators of spatial association—LISA. Geographical Analysis 27, 93–115. Anselin, L. and A. Bera. 1998. Spatial dependence in linear regression models with an introduction to spatial econometrics. In A. Ullah and D. E. Giles (Eds.), Handbook of Applied Economic Statistics. New York: Marcel Dekker, pp. 237–289. Applebaum, W. 1966. Methods for determining store trade areas, market penetration and potential sales. Journal of Marketing Research 3, 127–141. Applebaum, W. 1968. The analog method for estimating potential store sales. In C. Kornblau (Ed.), Guide to Store Location Research. Reading, MA: Addison-Wesley. Assunçăo, R. M., M. C. Neves, G. Cămara, and C. D. C. Freitas. 2006. Efficient regionalization techniques for socio-economic geographical units using minimum spanning trees. International Journal of Geographical Information Science 20, 797–811. Bailey, T. C. and A. C. Gatrell. 1995. Interactive Spatial Data Analysis. Harlow, England: Longman Scientific and Technical. Baller R., L. Anselin, S. Messner, G. Deane, and D. Hawkins. 2001. Structural covariates of U.S. county homicide rates: Incorporating spatial effects. Criminology 39, 561–590. Barkley, D. L., M. S. Henry, and S. Bao. 1996. Identifying “spread” versus “backwash” effects in regional economic areas: A density functions approach. Land Economics 72, 336–357. Bartholdi, J. J., III and L. K. Platzman. 1988. Heuristics based on spacefilling curves for combinatorial problems in Euclidean space. Management Science 34, 291–305. Batty, M. 1983. Linear urban models. Papers of the Regional Science Association 52, 141–158. Batty, M. 2007. Cities and Complexity: Understanding Cities with Cellular Automata, AgentBased Models, and Fractals. Cambridge, MA: MIT Press. Batty, M. 2009. Urban modeling. In N. Thrift and R. Kitchin (Eds.), International Encyclopedia of Human Geography. Oxford, UK: Elsevier, pp. 51–58. Batty, M. and Y. Xie. 1994a. Modeling inside GIS: Part I: Model structures, exploratory data analysis and aggregation. International Journal of Geographical Information Systems 8, 291–307. Batty, M. and Y. Xie. 1994b. Modeling inside GIS: Part II: Selecting and calibrating urban models using arc-info. International Journal of Geographical Information Systems 8, 451–470. Batty, M. and Y. Xie. 2005. Urban growth using cellular automata models. In D. J. Maguire, M. Batty, and M. F. Goodchild (Eds.), GIS, Spatial Analysis and Modelling. Redlands, CA: ESRI Press, pp. 151–172. Becker, G. S. 1968. Crime and punishment: An economic approach. Journal of Political Economy 76, 169–217.


Beckmann, M. J. 1971. On Thünen revisited: A neoclassical land use model. Swedish Journal of Economics 74, 1–7. Bellair, P. E. and V. J. Roscigno. 2000. Local labor-market opportunity and adolescent delinquency. Social Forces 78, 1509–1538. Berman, B. and J. R. Evans. 2001. Retail Management: A Strategic Approach (8th ed.). Upper Saddle River, NJ: Prentice-Hall. Berry, B. J. L. 1967. The Geography of Market Centers and Retail Distribution. Englewood Cliffs, NJ: Prentice-Hall. Berry, B. J. L. 1972. City Classification Handbook, Methods, and Applications. New York: Wiley-Interscience. Berry, B. J. L. and H. Kim. 1993. Challenges to the monocentric model. Geographical Analysis 25, 1–4. Berry, B. J. L. and P. H. Rees. 1969. The factorial ecology of Calcutta. American Journal of Sociology 74, 445–491. Besag, J. and P. J. Diggle. 1977. Simple Monte Carlo tests for spatial pattern. Applied Statistics 26, 327–333. Besag, J. and J. Newell. 1991. The detection of clusters in rare diseases. Journal of the Royal Statistical Society Series A 15, 4143–4155. Beyer, K. M. and G. Rushton. 2009. Mapping cancer for community engagement. Preventing Chronic Disease 6, A03. Black, R. J., L. Sharp, and J. D. Urquhart. 1996. Analysing the spatial distribution of disease using a method of constructing geographical areas of approximately equal population size. In P. E. Alexander and P. Boyle (Eds.), Methods for Investigating Localized Clustering of Disease. Lyon, France: International Agency for Research on Cancer, pp. 28–39. Black, W. 2003. Transportation: A Geographical Analysis. New York: Guilford. Block, R. and C. R. Block. 1995. Space, place and crime: Hot spot areas and hot places of liquor-related crime. In J. E. Eck and D. Weisburd (Eds.), Crime Places in Crime Theory. Newark: Criminal Justice Press. Block, C. R., R. L. Block, and the Illinois Criminal Justice Information Authority (ICJIA). 1998. Homicides in Chicago, 1965–1995 [Computer file]. 4th ICPSR version. Chicago, IL: ICJIA [producer]. Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributor]. Brown, B. B. 1980. Perspectives on social stress. In H. Selye (Ed.), Selye’s Guide to Stress Research. vol. 1. Reinhold, New York: Van Nostrand, pp. 21–45. Bureau of the Census. 1993. Statistical Abstract of the United States (113 ed.). Washington DC: US Department of Commerce. Burgess, E. 1925. The growth of the city. In R. Park, E. Burgess, and R. Mackenzie (Eds.), The City. Chicago: University of Chicago Press, pp. 47–62. Cadwallader, M. 1975. A behavioral model of consumer spatial decision making. Economic Geography 51, 339–349. Cadwallader, M. 1981. Towards a cognitive gravity model: The case of consumer spatial behavior. Regional Studies 15, 275–284. Cadwallader, M. 1996. Urban Geography: An Analytical Approach. Upper Saddle River, NJ: Prentice-Hall. Caliper Corporation. 2013. TransCAD Transportation Planning Software. www.caliper.com/ tcovu.htm (last accessed 2-26-2013). Casetti, E. 1993. Spatial analysis: Perspectives and prospects. Urban Geography 14, 526–537. Cervero R. 1989. Jobs-housing balance and regional mobility. Journal of the American Planning Association 55, 136–150. Chang, K.-T. 2004. Introduction to Geographic Information Systems (2nd ed.). New York: McGraw-Hill.


Chiricos, T. G. 1987. Rates of crime and unemployment: An analysis of aggregate research evidence. Social Problems 34, 187–211. Christaller, W. 1966. In C. W. Baskin (trans.), Central Places in Southern Germany. Englewood Cliffs, NJ: Prentice-Hall. Chu, R., C. Rivera, and C. Loftin. 2000. Herding and homicide: An examination of the NisbettReaves hypothesis. Social Forces 78, 971–987. Church, R. L. and C. S. ReVelle. 1974. The maximum covering location problem. Papers of the Regional Science Association 32, 101–118. Ciucu, M., P. Heas, M. Datcu, and J. C. Tilton. 2003. Scale space exploration for mining image information content. In O. R. Zaiane, S. Simoff, and C. Djeraba (Eds.), Mining Multimedia and Complex Data. Berlin; New York: Springer, pp. 118–133. Clark, C. 1951. Urban population densities. Journal of the Royal Statistical Society 114, 490–494. Clarke, K. C. and L. J. Gaydos. 1998. Loose-coupling a cellular automaton model and GIS: Long-term urban growth prediction for San Francisco and Washington/Baltimore. International Journal of Geographical Information Science 12, 699–714. Clayton, D. and J. Kaldor. 1987. Empirical Bayes estimates of age-standardized relative risks for use in disease mapping. Biometrics 43, 671–681. Cliff, A., P. Haggett, J. Ord, K. Bassett, and R. Davis. 1975. Elements of Spatial Structure. Cambridge, UK: Cambridge University Press. Cliff, A. D. and J. K. Ord. 1973. Spatial Autocorrelation. London: Pion. Clifford, P., S. Richardson, and D. Hémon. 1989. Assessing the significance of the correlation between two spatial processes. Biometrics 45, 123–134. Cockings, S. and D. Martin. 2005. Zone design for environment and health studies using preaggregated data. Social Science and Medicine 60, 2729–2742. Cornish, D. B. and R. V. Clarke (Eds.). 1986. The Reasoning Criminal: Rational Choice Perspectives on Offending. New York: Springer-Verlag. Colwell, P. F. 1982. Central place theory and the simple economic foundations of the gravity model. Journal of Regional Science 22, 541–546. Cressie, N. 1992. Smoothing regional maps using empirical Bayes predictors. Geographical Analysis 24, 75–95. Cromley, E. and S. McLafferty. 2002. GIS and Public Health. New York: Guilford Press. Culyer, A. J. and A. Wagstaff. 1993. Equity and equality in health and health care. Journal of Health Economics 12, 431–457. Cuzick, J. and R. Edwards. 1990. Spatial clustering for inhomogeneous populations. Journal of the Royal Statistical Society Series B 52, 73–104. Dai, D. 2010. Black residential segregation, disparities in spatial access to health care facilities, and late-stage breast cancer diagnosis in metropolitan Detroit. Health and Place 16, 1038–1052. Dantzig, G. B. 1948. Programming in a Linear Structure. Washington, DC: U.S. Air Force, Comptroller’s Office. Davies, R. L. 1973. Evaluation of retail store attributes and sales performance. European Journal of Marketing 7, 89–102. Davies, W. and D. Herbert. 1993. Communities within cities: An urban geography. London: Belhaven. De Vries J. J., P. Nijkamp, and P. Rietveld. 2009. Exponential or power distance-decay for commuting? An alternative specification. Environment and Planning A 41, 461–480. Delamater P. L., J. P. Messina, S. C. Grady, V. Winkler Prins, and A. M. Shortridge. 2013. Do more hospital beds lead to higher hospitalization rates? A spatial examination of Roemer’s Law. PLoS ONE 8(2), e54900. Diggle, P. J. and A. D. Chetwynd. 1991. Second-order analysis of spatial clustering for inhomogeneous populations. 
Biometrics 47, 1155–1163.


Dijkstra, E. W. 1959. A note on two problems in connection with graphs. Numerische Mathematik 1, 269–271. Dobrin, A. and B. Wiersema. 2000. Measuring line-of-duty homicides by law enforcement officers in the United States, 1976–1996. Paper presented at the 52nd Annual Meeting of the American Society of Criminology, November 15–18, San Francisco, CA. Duque, J. C., L. Anselin, and S. J. Rey. 2012. The Max-P-regions problem. Journal of Regional Science 52, 397–419. Everitt, B. S. 2002. The Cambridge Dictionary of Statistics. Cambridge University Press. Everitt, B. S., S. Landau, and M. Leese. 2001. Cluster Analysis (4th ed.). London: Arnold. FHWA (Federal Highway Administration), U.S. Department of Transportation. 1977. An Introduction to Urban Travel Demand Forecasting—A Self-Instructional Text. http://ntl. bts.gov/DOCS/UT.html (last accessed 4-23-2014). FHWA (Federal Highway Administration), U.S. Department of Transportation. 2013. Traffic Analysis Tools Program: Corridor Simulation (CORSIM/TSIS). http://ops.fhwa.dot. gov/trafficanalysistools/corsim.htm (last accessed 4-23-2013). Feng, Y. and Y. Liu. 2012. A heuristic cellular automata approach for modelling urban landuse change based on simulated annealing. International Journal of Geographical Information Science 27, 449–66. Fisher, R. A. 1935. The Design of Experiments. Edinburgh: Oliver and Boyd. Fisch, O. 1991. A structural approach to the form of the population density function. Geographical Analysis 23, 261–275. Flowerdew, R. and M. Green. 1992. Development in areal interpolation methods and GIS. The Annals of Regional Science 26, 67–78. Foltêtea, J. C., K. Berthierb, and J. F. Cossonb. 2008. Cost distance defined by a topological function of landscape. Ecological Modelling 210, 104–114. Forstall, R. L. and R. P. Greene. 1998. Defining job concentrations: The Los Angeles case. Urban Geography 18, 705–739. Fotheringham, A. S., C. Brunsdon, and M. Charlton. 2000. Quantitative Geography: Perspectives on Spatial Data Analysis. London: Sage. Fotheringham, A. S., C. Brunsdon, and M. Charlton. 2002. Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. West Sussex: Wiley. Fotheringham A. S. and M. E. O’Kelly. 1989. Spatial Interaction Models: Formulations and Applications. London: Kluwer Academic. Fotheringham, A. S. and D. W. S. Wong. 1991. The modifiable areal unit problem in multivariate statistical analysis. Environment and Planning A 23, 1025–1044. Fotheringham A. S. and B. Zhan. 1996. A comparison of three exploratory methods for cluster detection in spatial point patterns. Geographical Analysis 28, 200–218. Fox, J. A. 2000. Uniform Crime Reports [United States]: Supplementary Homicide Reports, 1976–1998 [Computer file]. ICPSR version. Boston, MA: Northeastern University, College of Criminal Justice [producer]. Ann Arbor, MI: Inter-University Consortium for Political and Social Research [distributor]. Franke, R. 1982. Smooth interpolation of scattered data by local thin plate splines. Computers and Mathematics with Applications 8, 273–281. Frankena, M. W. 1978. A bias in estimating urban population density functions. Journal of Urban Economics 5, 35–45. Freeman, L. 1979. Centrality in social networks: Conceptual clarification. Social Networks 1, 215–239. Frost, M., B. Linneker, and N. Spence. 1998. Excess or wasteful commuting in a selection of British cities. Transportation Research Part A: Policy and Practice 32, 529–538. Gaile, G. L. 1980. The spread-backwash concept. Regional Studies 14, 15–25. GAO. 
1995. Health Care Shortage Areas: Designation Not a Useful Tool for Directing Resources to the Underserved. Washington, DC: GAO/HEHS-95-2000, General Accounting Office.


Gao S., Y. Wang, Y. Gao, and Y. Liu. 2013. Understanding urban traffic-flow characteristics: A rethinking of betweenness centrality. Environment and Planning B 40, 135–153. Garin, R. A. 1966. A matrix formulation of the Lowry model for intrametropolitan activity allocation. Journal of the American Institute of Planners 32, 361–364. Geary, R. 1954. The contiguity ratio and statistical mapping. The Incorporated Statistician 5, 115–145. Getis, A. 1995. Spatial filtering in a regression framework: Experiments on regional inequality, government expenditure, and urban crime. In L. Anselin and R. J. G. M. Florax (Eds.), New Directions in Spatial Econometrics. Berlin: Springer, pp. 172–188. Getis, A. and D. A. Griffith. 2002. Comparative spatial filtering in regression analysis. Geographical Analysis 34, 130–140. Getis, A. and J. K. Ord. 1992. The analysis of spatial association by use of distance statistics. Geographical Analysis 24, 189–206. Ghosh, A. and S. McLafferty. 1987. Location Strategies for Retail and Service Firms. Lexington, MA: D.C. Heath. Giuliano G. and K. A. Small. 1991. Subcenters in the Los Angeles region. Regional Science and Urban Economics 21, 163–182. Giuliano G. and K. A. Small. 1993. Is the journey to work explained by urban structure? Urban Studies 30, 1485–1500. Goodchild, M. F., L. Anselin, and U. Deichmann. 1993. A framework for the interpolation of socioeconomic data. Environment and Planning A 25, 383–397. Goodchild, M. F. and N. S.-N. Lam. 1980. Areal interpolation: A variant of the traditional spatial problem. Geoprocessing 1, 297–331. Gordon, P., H. Richardson, and H. Wong. 1986. The distribution of population and employment in a polycentric city: The case of Los Angeles. Environment and Planning A 18, 161–173. Grady, S. C. and H. Enander. 2009. Geographic analysis of low birth weight and infant mortality in Michigan using automated zoning methodology. International Journal of Health Geographics 8(10). Greene, D. L. and J. Barnbrock. 1978. A note on problems in estimating exponential urban density models. Journal of Urban Economics 5, 285–290. Griffith, D. A. 1981. Modelling urban population density in a multi-centered city. Journal of Urban Economics 9, 298–310. Griffith, D. A. 2000. A linear regression solution to the spatial autocorrelation problem. Journal of Geographical Systems 2, 141–156. Griffith, D. A. and C. G. Amrhein. 1997. Multivariate Statistical Analysis for Geographers. Upper Saddle River, NJ: Prentice-Hall. Grimson, R. C. and R. D. Rose. 1991. A versatile test for clustering and a proximity analysis of neurons. Methods of Information in Medicine 30, 299–303. Gu, C., F. Wang, and G. Liu. 2005. The structure of social space in Beijing in 1998: A socialist city in transition. Urban Geography 26, 167–192. Guagliardo, M. F. 2004. Spatial accessibility of primary care: Concepts, methods and challenges. International Journal of Health Geography 3, 3. Guldmann, J. M. and F. Wang. 1998. Population and employment density functions revisited: A spatial interaction approach. Papers in Regional Science 77, 189–211. Guo, D. 2008. Regionalization with dynamically constrained agglomerative clustering and partitioning (REDCAP). International Journal of Geographical Information Science 22, 801–823. Guo, D. and H. Wang. 2011. Automatic region building for spatial analysis. Transactions in GIS 15(s1), 29–45. Haining, R., S. Wises, and M. Blake. 1994. Constructing regions for small area analysis: Material deprivation and colorectal cancer. 
Journal of Public Health Medicine 16, 429–438.


Hamilton, B. 1982. Wasteful commuting. Journal of Political Economy 90, 1035–1053. Hamilton, L. C. 1992. Regression with Graphics. Belmont, CA: Duxbury. Hansen, W. G. 1959. How accessibility shapes land use. Journal of the American Institute of Planners 25, 73–76. Harrell, A. and C. Gouvis. 1994. Predicting neighborhood risk of crime: Report to the National Institute of Justice. Washington, DC: The Urban Institute. Harris, C. D. and E. L. Ullman. 1945. The nature of cities. The Annals of the American Academy of Political and Social Science 242, 7–17. Hartschorn, T. A. 1992. Interpreting the City: An Urban Geography. New York: Wiley. Heikkila, E. P., P. Gordon, J. Kim, R. Peiser, H. Richardson, and D. Dale-Johnson. 1989. What happened to the CBD-distance gradient?: Land values in a polycentric city. Environment and Planning A 21, 221–232. Hewings, G. 1985. Regional Input-Output Analysis. Beverly Hills, CA: Sage. Hewitt, R., J. Díaz Pacheco, and Moya Gómez, B. 2014. A cellular automata land use model for the R software environment (weblog), available at http://simlander.wordpress.com (last accessed 5-1-2014). Hillier, B. 1996. Space Is the Machine: A Configurational Theory of Architecture. Cambridge, UK: Cambridge University Press. Hillier, B. and J. Hanson. 1984. The Social Logic of Space. Cambridge, UK: Cambridge University Press. Hillier, B., A. Penn, J. Hanson et al. 1993. Natural movement or, configuration and attraction in urban pedestrian movement. Environment and Planning B 20, 29–66. Hillsman, E. and G. Rushton. 1975. The p-median problem with maximum distance constraints. Geographical Analysis 7, 85–89. Hirschi, T. 1969. Causes of Delinquency. Berkeley, CA: University of California Press. Horner, M. W. and A. T. Murray. 2002. Excess commuting and the modifiable areal unit problem. Urban Studies 39, 131–139. Hoyt, H. 1939. The Structure and Growth of Residential Neighborhoods in American Cities. Washington, DC: USGPO. Huff, D. L. 1963. A probabilistic analysis of shopping center trade areas. Land Economics 39, 81–90. Huff, D. L. 2000. Don’t misuse the Huff model in GIS. Business Geographies 8, 12. Huff, D. L. 2003. Parameter estimation in the Huff model. ArcUser 2003 (Oct.-Dec.), 34–36. Immergluck, D. 1998. Job proximity and the urban employment problem: Do suitable nearby jobs improve neighborhood employment rates? Urban Studies 35, 7–23. ITS. 2002. Trip Generation Manual (9th ed.). Washington DC: Institute of Transportation Engineers. Jacobs, J. 1961. The Death and Life of Great American Cities. New York: Random House. Jacquez, G. M. 1998. GIS as an enabling technology. In A. C. Gatrell and M. Loytonen (Eds.), GIS and Health. London: Taylor & Francis, 17–28. Jiang, B. and C. Claramunt. 2004. Topological analysis of urban street networks. Environment and Planning B 31, 151–162. Jin, F., F. Wang, and Y. Liu. 2004. Geographic patterns of air passenger transport in China 1980–98: Imprints of economic growth, regional inequality and network development. Professional Geographer 56, 471–487. Joseph, A. E. and P. R. Bantock. 1982. Measuring potential physical accessibility to general practitioners in rural areas: A method and case study. Social Science and Medicine 16, 85–90. Joseph, A. E. and D. R. Phillips. 1984. Accessibility and Utilization—Geographical Perspectives on Health Care Delivery. New York: Harper & Row. Joseph, M., L. Wang, and F. Wang. 2012. Using Landsat imagery and census data for urban population density modeling in Port-au-Prince, Haiti. 
GIScience and Remote Sensing 49, 228–250.


Kain, J. F. 2004. A pioneer’s perspective on the spatial mismatch literature. Urban Studies 41, 7–32. Khan, A. A. 1992. An integrated approach to measuring potential spatial access to health care services. Socio-Economic Planning Science 26, 275–287. Khisty, C. J. and B. K. Lall. 2003. Transportation Engineering: An Introduction (3rd ed.). Upper Saddle River, NJ: Prentice-Hall. Khumawala, B. M. 1973. An efficient algorithm for the p-median problem with maximum distance constraints. Geographical Analysis 5, 309–321. Kincaid, D. and Cheney, W. 1991. Numerical Analysis: Mathematics of Scientific Computing. Belmont, CA: Brooks/Cole Publishing Co. Knorr-Held, L. 2000. Bayesian modelling of inseparable space-time variation in disease risk. Statistics in Medicine 19, 2555–2567. Knorr-Held, L. and G. Rasser. 2000. Bayesian detection of clusters and discontinuities in disease maps. Biometrics 56, 13–21. Knox, P. 1987. Urban Social Geography: An Introduction. (2nd ed.). New York: Longman. Kotusevski, G. and K. A. Hawick. 2009. A review of traffic simulation software. Computational Science Technical Note CSTN-095. Massey University, Auckland, New Zealand. Krajzewicz, D., J. Erdmann, M. Behrisch, and L. Bieker. 2012. Recent development and applications of SUMO—Simulation of urban mobility. International Journal on Advances in Systems and Measurements 5, 128–138. Krige, D. 1966. Two-dimensional weighted moving average surfaces for ore evaluation. Journal of South African Institute of Mining and Metallurgy 66, 13–38. Kuby, M., S. Tierney, T. Roberts, and C. Upchurch. 2005. A Comparison of Geographic Information Systems, Complex Networks, and Other Models for Analyzing Transportation Network Topologies. NASA/CR-2005-213522. Kulldorff, M. 1997. A spatial scan statistic. Communications in Statistics: Theory and Methods 26, 1481–1496. Kulldorff, M. 1998. Statistical methods for spatial epidemiology: Tests for randomness. In A. C. Gatrell and M. Loytonen (Eds.), GIS and Health. Taylor & Francis, London, pp. 49–62. Kwan, M.-P. 2012. The uncertain geographic context problem. Annals of the Association of American Geographers 102, 958–968. Ladd, H. F. and W. Wheaton. 1991. Causes and consequences of the changing urban form: Introduction. Regional Science and Urban Economics 21, 157–162. Lam, N. S.-N. and K. Liu. 1996. Use of space-filling curves in generating a national rural sampling frame for HIV-AIDS research. Professional Geographer 48, 321–332. Land, K. C., P. L. McCall, and L. E. Cohen. 1990. Structural covariates of homicide rates: Are there any in variances across time and social space? American Journal of Sociology 95, 922–963. Land, K. C., P. L. McCall, and D. S. Nagin. 1996. A comparison of Poisson, negative binomial, and semiparametric mixed Poisson regression models: With empirical applications to criminal careers data. Sociological Methods and Research 24, 387–442. Langford, I. H. 1994. Using empirical Bayes estimates in the geographical analysis of disease risk. Area 26, 142–149. Lee, R. C. 1991. Current approaches to shortage area designation. Journal of Rural Health 7, 437–450. Leung, Y., C. L. Mei, and W. X. Zhang. 2000. Statistical tests for spatial nonstationarity based on the geographically weighted regression model. Environment and Planning A 32, 9–32. Leung, Y., J. S. Zhang, and Z. B. Xu. 2000. Clustering by scale-space filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 1396–1410. Levitt, S. D. 2001. 
Alternative strategies for identifying the link between unemployment and crime. Journal of Quantitative Criminology 17, 377–390.


Li, X. and A. G.-O. Yeh. 2000. Modelling sustainable urban development by the integration of constrained cellular automata and GIS. International Journal of Geographical Information Science 14, 131–152. Liberman, N., Y. Trope, and E. Stephan. 2007. Psychological distance. In A. W. Kruglanski and E. T. Higgins (Eds.), Social Psychology: Handbook of Basic Principles. (2nd Ed.), New York: Guilford Publications, pp. 353–384. Lindeberg, T. 1994. Scale-Space Theory in Computer Vision. Dordrecht, the Netherlands: Kluwer Academic. Liu, Y., F. Wang, C. Kang, Y. Gao, and Y. Lu. 2014. Analyzing relatedness by toponym cooccurrences on web pages. Transactions in GIS. 18, 89–107. Liu, Y., F. Wang, Y. Xiao, and S. Gao. 2012. Urban land uses and traffic “source-sink areas”: Evidence from GPS-enabled taxi data in Shanghai. Landscape and Urban Planning 106, 73–87. Lösch, A. 1954. W. H. Woglom, and W. F. Stolper (trans.), Economics of Location. New Haven, CT: Yale University. Lowry, I. S. 1964. A Model of Metropolis. Santa Monica, CA: Rand Corporation. Luo W. and Y. Qi. 2009. An enhanced two-step floating catchment area (E2SFCA) method for measuring spatial accessibility to primary care physicians. Health and Place 15, 1100–1107. Luo, W. and F. Wang. 2003. Measures of spatial accessibility to healthcare in a GIS environment: Synthesis and a case study in Chicago region. Environment and Planning B: Planning and Design 30, 865–884. Luo, J.-C., C.-H. Zhou, Y. Leung, J.-S. Zhang, and Y.-F. Huang. 2002. Scale-space theory based regionalization for spatial cells. Acta Geographica Sinica 57, 167–173 (in Chinese). Luo, L., S. McLafferty, and F. Wang. 2010. Analyzing spatial aggregation error in statistical models of late-stage cancer risk: A Monte Carlo simulation approach. International Journal of Health Geographics 9(51). Marshall, R. J. 1991. Mapping disease and mortality rates using empirical Bayes estimators. Applied Statistics 40, 283–294. McDonald, J. F. 1989. Econometric studies of urban population density: A survey. Journal of Urban Economics 26, 361–385. McDonald, J. F. and P. Prather. 1994. Suburban employment centers: The case of Chicago. Urban Studies 31, 201–218. McGrail, M. R. and J. S. Humphreys. 2009a. Measuring spatial accessibility to primary care in rural areas: Improving the effectiveness of the two-step floating catchment area method. Applied Geography 29, 533–541. McGrail, M. R. and J. S. Humphreys. 2009b. A new index of access to primary care services in rural areas. Australian and New Zealand Journal of Public Health 33, 418–423. McNally, M. G. 2008. The Four Step Model, in D. A. Hensher and K. J. Button (eds.) Transport Modelling, 2nd Edition, Pergamon, Elsevier Science. Messner, S. F., L. Anselin, R. D. Baller, D. F. Hawkins, G. Deane, and S. E. Tolnay. 1999. The spatial patterning of county homicide rates: An application of exploratory spatial data analysis. Journal of Quantitative Criminology 15, 423–450. Mills, E. S. 1972. Studies in the Structure of the Urban Economy. Baltimore: Johns Hopkins University. Mills, E. S. and J. P. Tan. 1980. A comparison of urban population density functions in developed and developing countries. Urban Studies 17, 313–321. Mitchell, A. 2005. The ESRI Guide to GIS Analysis. Volume 2: Spatial Measurements and Statistics. Redlands, CA: ESRI Press. Moran, P. A. P. 1950. Notes on continuous stochastic phenomena. Biometrika 37, 17–23. Morenoff, J. D. and R. J. Sampson. 1997. Violent crime and the spatial dynamics of neighborhood transition: Chicago, 1970–1990. 
Social Forces 76, 31–64.


Mu, L. and F. Wang. 2008. A scale-space clustering method: Mitigating the effect of scale in the analysis of zone-based data. Annals of the Association of American Geographers 98, 85–101. Mu, L., F. Wang, V. W. Chen and X. Wu. forthcoming. A place-oriented, mixed-level regionalization method for constructing geographic areas in health data dissemination and analysis. Annals of the Association of American Geographers. Muth, R. 1969. Cities and Housing. Chicago: University of Chicago. Nakanishi, M. and L. G. Cooper. 1974. Parameter estimates for multiplicative competitive interaction models—Least square approach. Journal of Marketing Research 11, 303–311. Newling, B. 1969. The spatial variation of urban population densities. Geographical Review 59, 242–252. Niedercorn, J. H. and B. V. Bechdolt Jr. 1969. An economic derivation of the “gravity law” of spatial interaction. Journal of Regional Science 9, 273–282. Oden, N., G. Jacquez, and R. Grimson. 1996. Realistic power simulations compare point- and area-based disease cluster tests. Statistics in Medicine 15, 783–806. Oliver, A. and E. Mossialos. 2004. Equity of access to health care: Outlining the foundations for action. Journal of Epidemiology and Community Health 58, 655–658. Olsen, L. M. and J. D. Lord. 1979. Market area characteristics and branch performance. Journal of Bank Research 10, 102–110. O’Kelly, M. E., W. Song, and G. Shen. 1995. New estimates of gravitational attraction by linear programming. Geographical Analysis 27, 271–85. Onega, T., E. J. Duell, X. Shi, D. Wang, E. Demidenko, and D. Goodman. 2008. Geographic access to cancer care in the U. S. Cancer 112, 909–18. Openshaw, S. 1977. A geographical solution to scale and aggregation problems in regionbuilding, partitioning, and spatial modelling. Transactions of the Institute of British Geographers NS 2, 459–472. Openshaw, S. 1984. Concepts and Techniques in Modern Geography, Number 38. The Modifiable Areal Unit Problem. Norwich: Geo Books. Openshaw, S., M. Charlton, C. Mymer, and A. W. Craft. 1987. A Mark 1 geographical analysis machine for the automated analysis of point data sets. International Journal of Geographical Information Systems 1, 335–358. Openshaw, S. and L. Rao. 1995. Algorithms for reengineering 1991 census geography. Environment and Planning A 27 (3), 425–446. Osgood, D. W. 2000. Poisson-based regression analysis of aggregate crime rates. Journal of Quantitative Criminology 16, 21–43. Osgood, D. W. and J. M. Chambers. 2000. Social disorganization outside the metropolis: An analysis of rural youth violence. Criminology 38, 81–115. Parr, J. B. 1985. A population density approach to regional spatial structure. Urban Studies 22, 289–303. Parr, J. B. and G. J. O’Neill. 1989. Aspects of the lognormal function in the analysis of regional population distribution. Environment and Planning A 21, 961–973. Parr, J. B., G. J. O’Neill, and A. G. M. Nairn. 1988. Metropolitan density functions: A further exploration. Regional Science and Urban Economics 18, 463–478. Peng, Z. 1997. The jobs-housing balance and urban commuting. Urban Studies 34, 1215–1235. Porta, S., V. Latora, F. Wang et al. 2012. Street centrality and location of economic activities in Barcelona. Urban Studies 49, 1471–1488. Poulter, S. R. 1998. Monte Carlo simulation in environmental risk assessment: Science, policy and legal issues. Risk: Health, Safety and Environment 9, 7–26. Press, W. H. et al. 1992a. Numerical Recipes in FORTRAN: The Art of Scientific Computing (2nd ed.). 
Cambridge, UK: Cambridge University Press.

288

References

Press, W. H. et al. 1992b. Numerical Recipes in C: The Art of Scientific Computing (2nd ed.). Cambridge, UK: Cambridge University Press. Press, W. H. et al. 2002. Numerical Recipes in C++: The Art of Scientific Computing (2nd ed.). Cambridge, UK: Cambridge University Press. Price, M. 2004. Mastering ArcGIS. New York: McGraw-Hill. Rees, P. 1970. Concepts of social space: Toward an urban social geography. In B. Berry and F. Horton (Eds.), Geographic Perspectives on Urban System. Englewood Cliffs, NJ: Prentice-Hall, pp. 306–394. Reilly, W. J. 1931. The Law of Retail Gravitation. New York: Knickerbocker. Rengert, G. F., A. R. Piquero, and P. R. Jones. 1999. Distance decay reexamined. Criminology 37, 427–445. ReVelle, C. S. and R. Swain. 1970. Central facilities location. Geographical Analysis 2, 30–34. Robinson, W. S. 1950. Ecological correlations and the behavior of individuals. American Sociological Review 15, 351–357. Rogers, D. S. and H. Green. 1978. A new perspective in forecasting store sales: Applying statistical models and techniques in the analog approach. Geographical Review 69, 449–458. Rogerson, P. A. 1999. The detection of clusters using a spatial version of the chi-square goodness-of-fit statistic. Geographical Analysis 31, 130–147. Rose, H. M. and P. D. McClain. 1990. Race, Place, and Risk: Black Homicide in Urban America. Albany, New York: SUNY Press. Rushton, G. and P. Lolonis. 1996. Exploratory spatial analysis of birth defect rates in an urban population. Statistics in Medicine 7, 717–726. Schroeder, J. P. 2007. Target-density weighting interpolation and uncertainty evaluation for temporal analysis of census data. Geographical Analysis 39, 311–335. Schultz, G. W. and W. G. Allen, Jr. 1996. Improved modeling of non-home-based trips. Transportation Research Record 1556, 22–26. Scott, D. and M. Horner. 2008. Examining the role of urban form in shaping people’s accessibility to opportunities: An exploratory spatial data analysis. Journal of Transport and Land Use 1, 89–119. Sevtsuk, A., M. Mekonnen, and R. Kalvo. 2013. Urban Network Analysis: A Toolbox v1.01 for ArcGIS. Available at http://cityform.mit.edu/projects/urban-network-analysis.html. Shen, Q. 1994. An application of GIS to the measurement of spatial autocorrelation. Computer, Environment and Urban Systems 18, 167–191. Shen, Q. 1998. Location characteristics of inner-city neighborhoods and employment accessibility of low-income workers. Environment and Planning B: Planning and Design 25, 345–365. Shen, Q. 2000. Spatial and social dimensions of commuting. Journal of the American Planning Association 66, 68–82. Sherratt, G. 1960. A model for general urban growth. In C. W. Churchman and M. Verhulst (Eds.), Management Sciences: Models and Techniques. Oxford: Pergamon Press. Shevky, E. and W. Bell. 1955. Social Area Analysis. Stanford, CA: Stanford University. Shevky, E. and M. Williams. 1949. The Social Areas of Los Angeles. Los Angeles: University of California. Shi, X. 2009. A geocomputational process for characterizing the spatial pattern of lung cancer incidence in New Hampshire. Annals of the Association of American Geographers 99, 521–533. Shi, X., J. Alford-Teaster, T. Onega, and D. Wang. 2012. Spatial access and local demand for major cancer care facilities in the United States. Annals of the Association of American Geographers 102, 1125–1134.

References

289

Shi, X., E. Duell, E. Demidenko, T. Onega, B. Wilson and D. Hoftiezer. 2007. A polygonbased locally-weighted-average method for smoothing disease rates of small units. Epidemiology 18, 523–528. Silverman, B. W. 1986. Density Estimation for Statistics and Data Analysis. London: Chapman & Hall. Small, K. A. and S. Song. 1992. “Wasteful” commuting: A resolution. Journal of Political Economy 100, 888–898. Small, K. A. and S. Song. 1994. Population and employment densities: Structure and change. Journal of Urban Economics 36, 292–313. Smirnov, O. and L. Anselin. 2001. Fast maximum likelihood estimation of very large spatial autoregressive models: A characteristic polynomial approach. Computational Statistic and Data Analysis 35, 301–319. Stabler, J. and L. St. Louis. 1990. Embodied inputs and the classification of basic and nonbasic activity: Implications for economic base and regional growth analysis. Environment and Planning A 22, 1667–1675. Taaffe, E. J., H. L. Gauthier, and M. E. O’Kelly. 1996. Geography of Transportation (2nd ed.). Upper Saddle River, NJ: Prentice-Hall. Talbot, T. O., M. Kulldorff, S. P. Forand, and V. B. Haley. 2000. Evaluation of spatial filters to create smoothed maps of health data. Statistics in Medicine 19, 2399–2408. Talen, E. 2001. School, community, and spatial equity: An empirical investigation of access to elementary schools in West Virginia. Annals of the Association of American Geographers 91, 465–486. Talen, E. and L. Anselin. 1998. Assessing spatial equity: An evaluation of measures of accessibility to public playgrounds. Environment and Planning A 30, 595–613. Taneja, S. 1999. Technology moves in. Chain Store Age 75 (5), 136. Tanner, J. 1961. Factors affecting the amount of travel, Road Research Technical Paper No. 51, London: HMSO. Taylor, G. E. 1989. Addendum to Saalfield (1987). International Journal of Geographical Information Systems 3, 192–193. Tiwari, C. and Rushton, G. 2004. Using spatially adaptive filters to map late stage colorectal cancer incidence in Iowa. In P. Fisher (Ed.), Developments in Spatial Data Handling. New York: Springer-Verlag, pp. 665–676. Tobler, W. R. 1970. A computer movie simulating urban growth in the Detroit region. Economic Geography 46, 234–240. Toregas, C. and C. S. ReVelle. 1972. Optimal location under time or distance constraints. Papers of the Regional Science Association 28, 133–143. Ullman, E. L. and M. Dacey. 1962. The minimum requirements approach to the urban economic base. Proceedings of the IGU Symposium in Urban Geography. Lund: Lund Studies in Geography, pp. 121–143. Von Neumann, J. 1951. The general and logical theory of automata. In Lloyd A. Jeffress (ed.), Cerebral Mechanisms in Behavior: The Hixon Symposium. New York: John-Wiley & Sons, London: Chapman & Hall. 1–41. Von Thünen, J. H. 1966. In C. M. Wartenberg (trans.) and P. Hall (Ed.), Von Thünen’s Isolated State. Oxford: Pergamon. Wang, F. 1998. Urban population distribution with various road networks: A simulation approach. Environment and Planning B: Planning and Design 25, 265–278. Wang, F. 2000. Modeling commuting patterns in Chicago in a GIS environment: A job accessibility perspective. Professional Geographer 52, 120–133. Wang, F. 2001a. Regional density functions and growth patterns in major plains of China 1982–90. Papers in Regional Science 80, 231–240. Wang, F. 2001b. Explaining intraurban variations of commuting by job accessibility and workers’ characteristics. Environment and Planning B: Planning and Design 28, 169–182.

290

References

Wang, F. 2003. Job proximity and accessibility for workers of various wage groups. Urban Geography 24, 253–271. Wang, F. 2005. Job access and homicide patterns in Chicago: An analysis at multiple geographic levels based on scale-space theory. Journal of Quantitative Criminology 21, 195–217. Wang, F. 2006. Quantitative Methods and Applications in GIS. Boca Raton, FL: CRC Press. Wang, F. 2007. Job access in disadvantaged neighborhoods in Cleveland, 1980-2000: implications for spatial mismatch and association with crime patterns. Cityscape: A Journal of Policy Development and Research 9: 95–121. Wang, F. 2012. Measurement, optimization, and impact of healthcare accessibility: A methodological review. Annals of the Association of American Geographers 102, 1104–1112. Wang, F., A. Antipova, and S. Porta. 2011. Street centrality and land use intensity in Baton Rouge, Louisiana. Journal of Transport Geography 19, 285–293. Wang, F. and J. M. Guldmann. 1996. Simulating urban population density with a gravity-based model. Socio-Economic Planning Sciences 30, 245–256. Wang, F. and J. M. Guldmann. 1997. A spatial equilibrium model for region size, urbanization ratio, and rural structure. Environment and Planning A 29, 929–941. Wang, F. and Y. Zhou. 1999. Modeling urban population densities in Beijing 1982–90: Suburbanisation and its causes. Urban Studies 36, 271–287. Wang, F. and Y. Meng. 1999. Analyzing urban population change patterns in Shenyang, China 1982–90: Density function and spatial association approaches. Geographic Information Sciences 5, 121–130. Wang, F. and W. W. Minor. 2002. Where the jobs are: Employment access and crime patterns in Cleveland. Annals of the Association of American Geographers 92, 435–450. Wang, F. and V. O’Brien. 2005. Constructing geographic areas for analysis of homicide in small populations: Testing the herding-culture-of-honor proposition. In F. Wang (Ed.), GIS and Crime Analysis. Hershey, PA: Idea Group Publishing, pp. 83–100. Wang, F. and Y. Xu. 2011. Estimating O-D matrix of travel time by Google Maps API: Implementation, advantages and implications. Annals of GIS 17, 199–209. Wang, F. and Q. Tang. 2013. Planning toward equal accessibility to services: A quadratic programming approach. Environment and Planning B 40, 195–212. Wang, F., D. Guo, and S. McLafferty. 2012. Constructing geographic areas for cancer data analysis: A case study on late-stage breast cancer risk in Illinois. Applied Geography 35, 1–11. Wang, F. and W. Luo. 2005. Assessing spatial and nonspatial factors in healthcare access in Illinois: Towards an integrated approach to defining health professional shortage areas. Health and Place 11, 131–146. Wang, F., S. McLafferty, V. Escamilla, and L. Luo. 2008. Late-stage breast cancer diagnosis and healthcare access in Illinois. Professional Geographer 60, 54–69. Wang, F., C. Fu, and X. Shi. 2014. Planning towards maximum equality in accessibility to NCI cancer centers in the US. In P. Kanaroglou, E. Delmelle, D. Ghosh, and A. Paez (Eds.), Spatial Analysis in Health Geography. Farnham, Surrey, UK: Ashgate. Watanatada, T. and M. Ben-Akiva. 1979. Forecasting urban travel demand for quick policy analysis with disaggregate choice models: A Monte Carlo simulation approach. Transportation Research Part A 13, 241–248. Webber, M. J. 1973. Equilibrium of location in an isolated state. Environment and Planning A 5, 751–759. Wegener, M. 1985. The Dortmund housing market model: A Monte Carlo simulation of a regional housing market. 
Lecture Notes in Economics and Mathematical Systems 239, 144–191. Weibull, J. W. 1976. An axiomatic approach to the measurement of accessibility. Regional Science and Urban Economics 6, 357–379.

References

291

Weisburd, D. and L. Green. 1995. Policing drug hot spots: The Jersey City drug market analysis experiment. Justice Quarterly 12, 711–735. Weisbrod, G. E., R. J. Parcells, and C. Kern. 1984. A disaggregate model for predicting shopping area market attraction. Journal of Marketing 60, 65–83. Weiner, E. 1999. Urban Transportation Planning in the United States: An Historical Overview. Westport, CT: Praeger. Wheeler, J. O. et al. 1998. Economic Geography (3rd ed.). New York: Wiley. White, M. J. 1988. Urban commuting journeys are not “wasteful.” Journal of Political Economy 96, 1097–1110. White, R. and G. Engelen. 1993. Cellular automata and fractal urban form: A cellular modelling approach to the evolution of urban land-use patterns. Environment and Planning A 25, 1175–1199. White, R. and G. Engelen. 2000. High-resolution integrated modelling of the spatial dynamics of urban and regional systems. Computers, Environment and Urban Systems 24, 383–400. Whittemore, A. S., N. Friend, B. W. Brown, and E. A. Holly. 1987. A test to detect clusters of disease. Biometrika 74, 631–635. Williams, S. and F. Wang. Forthcoming. Disparities in accessibility of public high schools in metropolitan Baton Rouge, Louisiana 1990–2010. Urban Geography. doi 10.1080/ 02723638.2014.936668. Wilson, A. G. 1967. Statistical theory of spatial trip distribution models. Transportation Research 1, 253–269. Wilson, A. G. 1969. The use of entropy maximizing models in the theory of trip distribution, mode split and route split. Journal of Transport Economics and Policy 3, 108–126. Wilson, A. G. 1974. Urban and Regional Models in Geography and Planning. London: Wiley. Wilson, A. G. 1975. Some new forms of spatial interaction models: A review. Transportation Research 9, 167–179. Wong, Y.-F. 1993. Clustering data by melting. Neural Computation 5, 89–104. Wong, Y.-F. and E. C. Posner. 1993. A new clustering algorithm applicable to multispectral and polarimetric SAR images. IEEE Transactions on Geoscience and Remote Sensing 31, 634–644. Wong, D. W. S. and J. Lee. 2005. Statistical Analysis and Modeling of Geographic Information. Hoboken, NJ: Wiley. Wu, F. 2002. Calibration of stochastic cellular automata: The application to rural-urban land conversions. International Journal of Geographical Information Science 16, 795–818. Wu, N. and R. Coppins. 1981. Linear Programming and Extensions. New York: McGraw-Hill. Xiao, Y., F. Wang, Y. Liu, and J. Wang. 2013. Reconstructing city attractions from air passenger flow data in China 2001–2008: A PSO approach. Professional Geographer 65, 265–282. Xie, Y. 1995. The overlaid network algorithms for areal interpolation problem. Computer, Environment and Urban Systems 19, 287–306. Zhang X., H. Lu, and Holt J. B. 2011. Modeling spatial accessibility to parks: A national study. International Journal of Health Geographics 10, 31. Zheng, X.-P. 1991. Metropolitan spatial structure and its determinants: A case study of Tokyo. Urban Studies 28, 87–104. Zipf, G. K. 1949. Human Behavior and the Principle of Least Effort. Cambridge, MA: Addison-Wesley.

GEOGRAPHY

Quantitative Methods and Socio-Economic Applications in GIS, Second Edition

“This is a thoroughly revised edition of a highly useful book on quantitative and GIS methods. … provides useful discussion of important methodological issues such as spatial autocorrelation and the modifiable areal unit problem. … an excellent book for students and social scientists interested in various social and economic phenomena with spatial manifestations.” —Mei-Po Kwan, University of Illinois at Urbana-Champaign

“… an excellent book with clear, meaningful, and well-designed examples. It’s one of the most useful books on my shelf!” —Sara McLafferty, University of Illinois at Urbana-Champaign

See What’s New in the Second Edition: • All project instructions are in ArcGIS 10.2 using geodatabase datasets • New chapters on regionalization methods and Monte Carlo simulation • Popular tasks automated as a convenient toolkit: Huff Model, 2SFCA accessibility measure, regionalization, Garin–Lowry model, and Monte Carlo–based spatial simulation • Advanced tasks now implemented in user-friendly programs or ArcGIS: centrality indices, wasteful commuting measure, p-median problem, and traffic simulation The second edition of a bestseller, Quantitative Methods and Socio-Economic Applications in GIS (previously titled Quantitative Methods and Applications in GIS) details applications of quantitative methods in social science, planning, and public policy with a focus on spatial perspectives. The book integrates GIS and quantitative (computational) methods and demonstrates them in various policy-relevant socioeconomic applications with step-by-step instructions and datasets. The book demonstrates the diversity of issues where GIS can be used to enhance the studies related to socio-economic issues and public policy.


an informa business
www.crcpress.com

6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487
711 Third Avenue, New York, NY 10017
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN, UK
