
HANDBOOK OF SPATIAL ANALYSIS IN THE SOCIAL SCIENCES

This Handbook is dedicated to all the spatial thinkers of today and tomorrow, and especially the emerging spatial thinkers in our own families, Laney and Abraham.

Handbook of Spatial Analysis in the Social Sciences

Edited by

Sergio J. Rey Professor of Public Policy and Director of the Center for Geospatial Sciences, University of California, Riverside, USA

Rachel S. Franklin Professor of Geographical Analysis in the Centre for Urban and Regional Development Studies (CURDS) and the School of Geography, Politics and Sociology, Newcastle University, UK

Cheltenham, UK • Northampton, MA, USA

© Sergio J. Rey and Rachel S. Franklin 2022

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical or photocopying, recording, or otherwise without the prior permission of the publisher.

Published by
Edward Elgar Publishing Limited
The Lypiatts
15 Lansdown Road
Cheltenham
Glos GL50 2JA
UK

Edward Elgar Publishing, Inc.
William Pratt House
9 Dewey Court
Northampton
Massachusetts 01060
USA

A catalogue record for this book is available from the British Library

Library of Congress Control Number: 2022944498

This book is available electronically in the Geography, Planning and Tourism subject collection
http://dx.doi.org/10.4337/9781789903942

ISBN 978 1 78990 393 5 (cased)
ISBN 978 1 78990 394 2 (eBook)

Typeset by Westchester Publishing Services


Contents

List of contributors  viii
Acknowledgements  x
Introduction: Spatial analysis and the social sciences in a rapidly changing landscape (Sergio J. Rey and Rachel S. Franklin)  xi

PART 1  THEORY, FRAMEWORKS AND FOUNDATIONS
1. GIScience through the looking glass (Barbara P. Buttenfield)  2
2. Locating spatial data in the social sciences (Jonathan Reades)  16
3. Analytical environments (Roger Bivand)  36
4. Complexity (Li An)  64
5. Linking spatial patterns to processes (Colin Robertson and Jed Long)  85

PART 2  METHODS
6. Spatial econometrics (Luc Anselin)  101
7. Local modeling in a regression framework (Mehak Sachdeva, Taylor Oshan and A. Stewart Fotheringham)  123
8. Simulating geographical systems using cellular automata and agent-based models (Alison Heppenstall, Andrew Crooks, Ed Manley and Nick Malleson)  142
9. Microsimulation (Nik Lomax)  158
10. Multilevel models (Richard Harris)  173
11. Context-dependent movement analysis (Somayeh Dodge)  187
12. Spatial interaction modeling (Taylor Oshan)  208
13. Spatial optimization (Alan T. Murray)  223
14. Cluster identification (Edward Helderop and Tony H. Grubesic)  245
15. Spatial point patterns (Stuart Sweeney and Sophia Arabadjis)  262
16. Spatial dynamics (Wei Kang)  277
17. GeoAI in social science (Wenwen Li)  291
18. Exploratory spatial data analysis (Ran Wei)  305
19. Geovisualization and geovisual analysis (Alasdair Rae)  322
20. Immersive virtual reality and spatial analysis (Trevor M. Harris)  336
21. Spatiotemporal data mining (Arun Sharma, Zhe Jiang and Shashi Shekhar)  352

PART 3  APPLICATIONS
22. Neighborhood change (Elizabeth Delmelle)  370
23. The spatial analysis of gentrification: Formalizing geography in models of a multidimensional urban process (Elijah Knaap)  384
24. Social networks in space (Clio Andris and Dipto Sarkar)  400
25. Analysing the dynamics of inter-regional inequality: The case of Canada (Sébastien Breau)  416
26. Spatial approaches to energy poverty (Caitlin Robinson)  434
27. The shape of bias: Understanding the relationship between compactness and bias in U.S. elections (Levi John Wolf)  451
28. Space and New Urbanism (Emily Talen)  470
29. Space for wellbeing (Victoria Houlden)  481
30. Urban analytics: History, trajectory and critique (Geoff Boeing, Michael Batty, Shan Jiang and Lisa Schweitzer)  503

PART 4  EMERGING CHALLENGES AND ISSUES
31. Reproducibility and replicability in spatial science (Michael F. Goodchild)  518
32. An image library: The potential of imagery in (quantitative) social sciences (Daniel Arribas-Bel, Francisco Rowe, Meixu Chen and Sam Comber)  528
33. Uncertainty (David C. Folch)  544

Index  559

Contributors

Li An, San Diego State University
Clio Andris, Georgia Institute of Technology
Luc Anselin, University of Chicago
Sophia Arabadjis, University of California, Santa Barbara
Daniel Arribas-Bel, University of Liverpool
Michael Batty, University College London
Roger Bivand, Norwegian School of Economics
Geoff Boeing, University of Southern California
Sébastien Breau, McGill University
Barbara P. Buttenfield, University of Colorado
Meixu Chen, University of Liverpool
Sam Comber, University of Liverpool
Andrew Crooks, State University of New York, Buffalo
Elizabeth Delmelle, University of Pennsylvania
Somayeh Dodge, University of California, Santa Barbara
David C. Folch, Northern Arizona University
A. Stewart Fotheringham, Arizona State University
Rachel S. Franklin, Newcastle University
Michael F. Goodchild, University of California, Santa Barbara
Tony H. Grubesic, University of Texas
Richard Harris, University of Bristol
Trevor M. Harris, West Virginia University
Edward Helderop, University of Texas
Alison Heppenstall, University of Glasgow
Victoria Houlden, University of Leeds
Shan Jiang, Tufts University
Zhe Jiang, University of Alabama
Wei Kang, University of North Texas
Elijah Knaap, University of California, Riverside
Wenwen Li, Arizona State University
Nik Lomax, University of Leeds
Jed Long, Western University
Nick Malleson, University of Leeds
Ed Manley, University of Leeds
Alan T. Murray, University of California, Santa Barbara
Taylor Oshan, University of Maryland
Alasdair Rae, Automatic Knowledge
Jonathan Reades, University College London
Sergio J. Rey, University of California, Riverside
Colin Robertson, Wilfrid Laurier University
Caitlin Robinson, University of Bristol
Francisco Rowe, University of Liverpool
Mehak Sachdeva, Arizona State University
Dipto Sarkar, Carleton University
Lisa Schweitzer, University of Southern California
Arun Sharma, University of Minnesota
Shashi Shekhar, University of Minnesota
Stuart Sweeney, University of California, Santa Barbara
Emily Talen, University of Chicago
Ran Wei, University of California, Riverside
Levi John Wolf, University of Bristol

Acknowledgements

Assembling and editing a Handbook of this scale is a colossal piece of work, requiring the support and effort of a range of people. In our case, we are especially grateful to Preeti Juturu and Esau Casimiro Vieyra at the University of California, Riverside, for their assiduous attention to the myriad details that make a volume like this come together: collating figures and organising files, proofreading and checking for internal consistency across chapters. We could not have done it without them. We also, quite literally, could not have accomplished this project without the generosity and hard work of the very many authors who contributed to the chapters in this Handbook. We are also indebted to the referees for their expert evaluations of a large body of work. We are reminded just how lucky we are in our broader spatial science communities. Lastly, we acknowledge the undeniable influence of the pandemic, without which we would most certainly have drawn this project to a close much sooner and with considerably less stress.


Introduction: Spatial analysis and the social sciences in a rapidly changing landscape
Sergio J. Rey and Rachel S. Franklin

INTRODUCTION

It is scarcely novel anymore to note that we live in exciting times, where spatial methods and data are concerned. The field is, in fact, moving so quickly that only the very brave (or foolhardy) would dare to undertake the compilation of a Handbook purporting to cover the breadth and depth of spatial analysis in the social sciences. And yet, that is just what we have done. This is not so much a reflection of our bravery—or, one hopes, our foolhardiness—but rather an indication of our conviction that there is a time and place for wide-reaching compendia that take stock of the state of the art and the disciplinary ‘direction of travel.’ For while it is true that the spatial sciences are evolving quickly, so too are the social sciences, both under the long shadow of grand challenges such as climate change and social inequality that will require a combination of skills and expertise if we are to have any hope of understanding and remediating them.

The ‘spatial turn’ in the social sciences is well documented (see, for example, Logan, 2012 or Richardson et al., 2013). Whether sociology, economics, public health or history, the application of spatial thinking and methods is now relatively mainstream. While this is at least partly attributable to the increased influence of Geographic Information Systems (GIS) and Science (GISc) and growing data availability, we also live—and research—at a time when connections, proximity and place are recognized as important lenses for understanding a wide range of social phenomena. Concurrently, there is clearly also a ‘computational turn’ underway in the social sciences. Although, epistemologically, the social sciences remain a big tent collective, across the quantitative social sciences a revolution is taking place, with creative development, utilization and exploration of computational methods and novel data applied to a host of research realms. This is evidenced by the emergence of sub-fields such as ‘computational social science,’ ‘urban science,’ ‘urban analytics’ and ‘spatial demography.’

Naturally, there is dialogue between the spatial and the social sciences; the two ‘turns’ do not exist in isolation from each other. Perhaps, however, it is not always obvious how to enter the conversation, especially for those new to either discipline. Quantitative geographers and spatial scientists may be immersed in the theoretical and methodological state of the art but lack understanding of the types of enduring questions and approaches employed across a diverse range of social sciences. Social scientists, for their part, may be convinced of the importance of a spatial analytic perspective, but desire further insight into the ‘what,’ ‘how’ and ‘why’ of spatial methods and tools. It is this niche that this Handbook seeks to fill, outlining a foundation and a structure that encompasses the breadth of spatial analytics research in the social sciences, but also providing conversational entry points for those new to the topic: ample coverage of concepts, methods, applications and big-picture issues. The anticipated audience for the Handbook is commensurately broad, aimed at researchers at all career stages from across the social sciences, including geography and geography-adjacent fields.

To facilitate engagement with the Handbook, we have organized the material into four sections. The first, ‘Theory, Frameworks and Foundations,’ provides important background and conceptual underpinnings for spatial analytics research. The second focuses on methods, covering a wide range of spatial analysis approaches and methods. Next, the third section shifts perspective to emphasize ‘Applications’—how spatial analysis methods are used to answer a host of social science research questions. Lastly, the fourth section takes up ‘Emerging Challenges and Issues,’ signposting to readers some of the important issues emerging not only in spatial analysis research but across all computational domains.

The chapters in this Handbook—all 33 of them—are designed to stand alone, for those in our audience interested in a specific set of methods, for example, or simply a sole application. We view this Handbook as an important teaching resource and have designed it in such a way that both instructors and students can ‘dip’ into materials without having to digest the full volume in its entirety. Chapters provide straightforward introductions to topics and are ideally suited to seminar readings or for those new to an area and seeking a comprehensive overview. However, those readers seeking authoritative coverage of spatial analysis in the social sciences will not be disappointed. Individual chapters are cross-referenced with each other and each chapter is accompanied by a freestanding set of related references. In this way, the Handbook has something to offer for everyone.

The remainder of this chapter provides an outline of the Handbook’s organization and contents and is a good place to start for those curious to understand what we have included and why. We are exhaustive here in our description of foundational chapters and applications, with the assumption that these are the areas in which readers may not be certain what they are looking for—the unknown unknowns, so to speak. Where ‘Methods’ chapters are concerned, we have attempted to group chapters according to their role in the research and analysis cycle: as in other quantitative fields, some methods are best used to explore data and drive hypothesis development, while other methods are better for testing hypotheses or gaining insight into individual or group behaviors.

THEORY, FRAMEWORKS AND FOUNDATIONS

Underlying the adoption of spatial analysis in the social sciences are core theories and frameworks that have been developed in the field of GIScience. Indeed, these foundations are what distinguish GIScience from GISystems. The five chapters in the first part of this Handbook present these key frameworks and articulate the linkages between these theoretical foundations and the broader social sciences.

In ‘GIScience through the looking glass,’ Barbara P. Buttenfield points out that while the field of GIScience is young relative to the social sciences, important earlier themes in geography have led to the creation of the field of GIScience. Buttenfield provides a definition of GIScience and motivation for social scientists to engage with the field. She identifies recent developments and emerging trends at the intersection of GIScience and the social sciences, where advances in spatial analysis have been adopted in the broader social sciences. The interplay between GIScience and the social sciences does not only involve the application of methods of the former to the latter. Buttenfield describes multiple ways that GIScience is evolving as research in the areas of environmental justice and social vulnerabilities from natural hazards and disease outbreaks has required the development of innovative spatial analytical and data frameworks.

The social sciences, like all sciences, find themselves grappling with the challenges of big data. Big spatial data pose a set of unique concerns on top of these challenges that requires attention from social scientists. In ‘Locating spatial data in the social sciences,’ Jonathan Reades provides an accessible introduction to the issues that social scientists will encounter when working with spatial data. Using an exemplar of modern big data, a tweet, Reades leads the reader through the different ways that space can be represented in social media data. A common thread in this discussion is the importance of thinking critically about data and data collection and their connection to particular research questions in this big data era. In particular, the shift from hand-generated data sets in the pre-big data realm, to leveraging the power of computational thinking and code in this modern era, offers pathways to empower the social scientist engaging with new types of spatial data.

The rise of spatial analysis in the social sciences has been facilitated by the development of analytical environments. As Roger Bivand argues in the so-titled chapter, these developments encompass much more than the computing environment employed by researchers engaging with spatial data analysis. In fact, Bivand positions such computing environments as subsets of analytical environments. The latter includes the critical community that surrounds spatial data analysis packages. From the perspective of a scholar who has been instrumental in the development of key analytical packages in the R language, Bivand traces the evolution of computing environments for spatial analysis in the social sciences from the pre-GIS era to the modern data science era. Stressing the importance of scripting environments in R and Python during this evolution, Bivand demonstrates the modern workflows in developing interactive visualizations, and the recent technical changes surrounding the treatment of coordinate reference systems and datums. Community in the form of user-led requests for enhancements of packages and the interaction between package developers and users are often overlooked in scholarly writing about open-source spatial analysis packages, and the chapter offers an important corrective to this omission. The chapter closes with a prospectus that stresses the importance of continued collaboration between developers of spatial analysis packages from different environments (R and Python).

The modeling of individual-level social processes when studying social systems remains a daunting challenge to the social sciences. In ‘Complexity,’ Li An argues that this challenge reflects the structure of social systems as heterogeneous subsystems and/or autonomous entities connected through nonlinear relationships with multiple feedback loops. These complex systems are characterized by the prominent features of path-dependence, self-organization, contingency, multifinality and equifinality. An provides an overview of the main methodological challenges involved in addressing complex systems from a geographical perspective.
Stressing the use of agent-based models (ABMs), An provides a case study employing ABMs to study the complex human-environment system of the Fanjingshan National Nature Reserve in China, which was established to conserve the endangered snub-nosed monkey. The modeling effort considers the interaction of multiple agents, including farmers, households and government actors, in this context. With a particular emphasis on land use and habitat, An describes the structure, implementation, validation and use of the model to consider different interventions for conservation. The chapter provides valuable guidance on best practices for the use of ABMs in this context.
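To make the agent-based pattern An describes more concrete, the following is a minimal, hypothetical sketch in Python: heterogeneous household agents follow a simple harvesting rule on a shared environment that regrows each step, producing the kind of feedback loop discussed above. It illustrates the general modeling style only and is not a reproduction of the Fanjingshan model; all names, rules and parameter values are invented.

```python
# A deliberately tiny agent-based model: household agents extract fuelwood from
# a shared forest grid that regrows each step, illustrating the feedback loops
# and heterogeneity described above. All names, rules and parameter values are
# hypothetical; this is not the Fanjingshan model discussed in the chapter.
import random

random.seed(42)

GRID = 10                                        # forest is a GRID x GRID lattice of cells
forest = [[1.0] * GRID for _ in range(GRID)]     # 1.0 = fully forested cell

class Household:
    def __init__(self):
        self.x, self.y = random.randrange(GRID), random.randrange(GRID)
        self.demand = random.uniform(0.05, 0.20)  # heterogeneous fuelwood needs

    def step(self):
        # Simple decision rule: harvest locally if possible, otherwise relocate.
        cell = forest[self.y][self.x]
        if cell >= self.demand:
            forest[self.y][self.x] = cell - self.demand
        else:
            self.x, self.y = random.randrange(GRID), random.randrange(GRID)

households = [Household() for _ in range(60)]

for t in range(50):
    for h in households:
        h.step()
    # Environment feedback: each cell regrows a little, capped at full cover.
    for row in forest:
        for j, v in enumerate(row):
            row[j] = min(1.0, v + 0.02)

mean_cover = sum(sum(row) for row in forest) / GRID ** 2
print(f"mean habitat cover after 50 steps: {mean_cover:.2f}")
```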

The relationship between spatial patterns and their underlying spatial processes is one of the thorny issues confronting the practice of the spatial sciences. Colin Robertson and Jed Long point out that this is because making inferences about spatial processes from spatial pattern is hard. In their chapter ‘Linking spatial patterns to processes,’ Robertson and Long discuss the role of pattern analysis and provide an overview of the general approaches that are used to link pattern to process. They then bridge an important gap by viewing the pattern-process question through the lens of causal inference. Here, the role of counterfactuals is highlighted, as is the heavy reliance of spatial scientists on simulation as a mechanism for generating counterfactuals in hypothesis testing. Particular challenges that arise from spatial analysis in this context relate to issues of measurement error and scale. The counterfactual approach is illustrated using the famous data set from John Snow’s map of the London cholera epidemic in 1854. They conclude that the three areas of Pearl’s (2009) causal inference framework, namely, prediction, experimentation and counterfactuals, provide a useful structure for spatial scientists.
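As a concrete, if highly simplified, illustration of simulation-generated counterfactuals for pattern-process inference, the sketch below compares an observed clustering statistic against a reference distribution simulated under complete spatial randomness. The point coordinates are synthetic stand-ins invented for this example, not Snow's cholera data, and the statistic (mean nearest-neighbour distance) is just one of many that could play this role.

```python
# Monte Carlo sketch of the simulation-as-counterfactual idea: compare an observed
# mean nearest-neighbour distance against distances simulated under complete
# spatial randomness (CSR). The point coordinates are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def mean_nn_distance(points):
    """Mean distance from each point to its nearest neighbour."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).mean()

# 'Observed' pattern: two clusters, mimicking cases concentrated around sources.
observed = np.vstack([
    rng.normal(0.25, 0.03, size=(40, 2)),
    rng.normal(0.75, 0.03, size=(40, 2)),
])
obs_stat = mean_nn_distance(observed)

# Counterfactual: 999 CSR realisations with the same number of points.
sims = np.array([
    mean_nn_distance(rng.uniform(0, 1, size=observed.shape)) for _ in range(999)
])

# Pseudo p-value: how often is random at least as clustered as the observed pattern?
p = (1 + np.sum(sims <= obs_stat)) / (1 + len(sims))
print(f"observed mean NN distance = {obs_stat:.4f}, pseudo p-value = {p:.3f}")
```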

METHODS OF SPATIAL ANALYSIS

The 16 chapters in the second part of this Handbook span the full range of spatial methods, both traditional and newly emerging, and can be very loosely classified into methods for exploring and describing spatial data, spatial modeling approaches and a residual category that encompasses a diverse set of computational methods.

In ‘Spatial econometrics,’ the first chapter of this section, Luc Anselin opens with the standard regression model employed across the social sciences, introducing the concept of spatial effects. This chapter covers the fundamentals of spatial variables and key foundational spatial models, as well as model estimation and specification tests. Spatial heterogeneity is also touched on, a subject which is expanded upon in Mehak Sachdeva, Taylor Oshan and A. Stewart Fotheringham’s chapter on ‘Local modeling in a regression framework.’ In this chapter, the regression modeling framework is maintained, but the focus shifts from a global to a local perspective, showing how Geographically Weighted Regression (GWR) and Multiscale GWR can be employed to handle spatial heterogeneity and to provide insights into spatially varying relationships between outcomes and explanatory variables. An additional modeling perspective is provided by Taylor Oshan in ‘Spatial interaction modeling.’ Here, the objects of interest are aggregate flows and interactions between sets of entities—migration or trade flows, for example. Oshan lays out the basics of spatial interaction and illustrates the applicability of spatial interaction models to a variety of social science domains. Richard Harris’s chapter, ‘Multilevel models,’ in contrast, turns its attention to the role of spatial context (or neighborhood effects) and how to estimate the effect of context within a regression framework.

Another suite of spatial methods exists to help researchers understand, describe and characterize their data. In ‘Exploratory spatial data analysis,’ Ran Wei shows how visualization—especially choropleth mapping—and descriptive spatial statistics for measuring spatial autocorrelation can be employed to detect patterns in data and to understand underlying spatial structures. Wei Kang also covers exploratory techniques in ‘Spatial dynamics,’ but focuses on methods that capture changes, whether spatial or temporal. Alasdair Rae adopts a complementary approach in his chapter on ‘Geovisualization and geovisual analysis,’ discussing the role of geovisualization in social science research and providing historical context and a worked example based on United States commuting flows. In ‘Spatial point patterns,’ Stuart Sweeney and Sophia Arabadjis cover a different family of spatial statistics: those used to characterize point pattern processes. Through a worked example, Sweeney and Arabadjis explain underlying assumptions, summary measures used to characterize point patterns, and models. They close with a useful discussion of applications in the social sciences. Somayeh Dodge gives an authoritative overview of movement analytics in her chapter, ‘Context-dependent movement analysis,’ including definitions, data characteristics, common measures and the role of context.

Another way that data are summarized or characterized is through clustering methods that group observations based on similarities, whether in attributes or location. In ‘Cluster identification,’ Edward Helderop and Tony H. Grubesic present both aspatial and spatial techniques and show how clustering can help researchers to understand commonalities and patterns in their data. Their chapter pairs well with Arun Sharma, Zhe Jiang and Shashi Shekhar’s chapter on ‘Spatiotemporal data mining,’ which focuses on pattern identification in the large spatiotemporal data often encountered in epidemiology or climate research, for example.

Within the very broad category of computational spatial methods, several approaches are covered in this section. In his chapter, ‘Spatial optimization,’ Alan T. Murray provides an introduction to the development and uses of spatial optimization methods, before turning to key definitions and a thorough discussion of decision variables, objectives, constraints and solution methods. Alison Heppenstall, Andrew Crooks, Ed Manley and Nick Malleson take on the topic of bottom-up system simulation in their chapter, ‘Simulating geographical systems using cellular automata and agent-based models.’ Their chapter covers the fundamentals of both cellular automata and agent-based modeling approaches and includes examples of applications, as well as potential limitations of such approaches. ‘Microsimulation,’ by Nik Lomax, addresses methods for creating synthetic data that simulate the individual characteristics of a population of interest. He discusses the range of potential social science applications for microsimulated data and reviews several of the main methodological approaches.

Novel spatial analytic methods are constantly appearing on the horizon. Two chapters in this section engage with innovations taking place in artificial intelligence (AI) and in virtual reality. In ‘GeoAI in social science,’ Wenwen Li offers an overview of AI methods at the intersection of spatial analysis and the social sciences, highlighting the particularities of AI when dealing with the sorts of big data commonly employed in social science research (for example, social media data). Trevor M. Harris addresses immersive virtual reality and its potential for visualization and spatial analysis in his chapter, ‘Immersive virtual reality and spatial analysis.’ This chapter complements other ‘Methods’ chapters that discuss exploratory spatial data analysis (for example, chapters by Kang, Wei and Rae), but focuses on the role of immersion and its implications for visualization techniques, computing platforms and the spatial sciences more broadly.
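Because spatial autocorrelation statistics recur throughout the exploratory methods surveyed above, a minimal worked example may help readers new to the idea. The sketch below computes global Moran's I on a synthetic gridded surface with simple rook-contiguity weights; the data are invented, and in practice dedicated libraries (for example, PySAL in Python) would normally be used, with weights built from real polygon or lattice geometries.

```python
# Minimal illustration of global spatial autocorrelation (Moran's I) on a
# synthetic raster with rook-contiguity weights. Values and weights are invented.
import numpy as np

rng = np.random.default_rng(1)

n_side = 12
# A smooth north-south trend plus noise, so positive autocorrelation is expected.
values = np.add.outer(np.linspace(0, 1, n_side), np.zeros(n_side))
values = values + rng.normal(0, 0.1, size=values.shape)

def morans_i(z, n_side):
    """Global Moran's I for an n_side x n_side grid with rook contiguity."""
    z = z - z.mean()
    num, wsum = 0.0, 0.0
    for i in range(n_side):
        for j in range(n_side):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < n_side and 0 <= nj < n_side:
                    num += z[i, j] * z[ni, nj]   # w_ij = 1 for rook neighbours
                    wsum += 1.0
    n = z.size
    return (n / wsum) * (num / (z ** 2).sum())

print(f"Moran's I on the trended surface: {morans_i(values, n_side):.3f}")
shuffled = rng.permutation(values.ravel()).reshape(values.shape)
print(f"Moran's I on a shuffled surface:  {morans_i(shuffled, n_side):.3f}")
```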

APPLICATIONS OF SPATIAL ANALYSIS

The methods and frameworks of spatial analysis discussed in the previous two sections of this Handbook have seen widespread application across the social sciences. The nine chapters in this third part of the Handbook represent a set of best practices of such applications, ranging from chapters that examine urban science issues, including neighborhood change, gentrification and New Urbanism, to chapters examining the intersection of the social and spatial dimensions of networks and the analysis of different forms of spatial disparities.

Urban neighborhoods play a prominent role in social science research. Serving as the observational unit, neighborhoods have been adopted to study a wide range of substantive questions surrounding urban poverty, inequality, educational disparities, segregation and criminology, among many others. The vast majority of these studies take the definition of neighborhoods as given by administrative data and adopt static or cross-sectional views of neighborhoods to explore variation in outcomes across different neighborhood types. In ‘Neighborhood change,’ Elizabeth Delmelle introduces methods that can be used to understand and map processes of neighborhood change. This work adds an important dynamic perspective on urban social science, one which views neighborhoods as a multidimensional bundle of spatially based characteristics. Delmelle provides an overview of the methods used to delineate neighborhoods and to define discrete typologies of neighborhoods. The dynamics of neighborhoods transitioning between different classes in this typology are studied using Markov chains, which provide a macro-level characterization of the dynamics. Sequence alignment techniques are next described as methods to map individual trajectories of neighborhood change, which provide a microscale complement to the Markov analysis. Delmelle closes the chapter with an overview of recent advances, including the use of self-organizing maps to study neighborhood change.

One of the most prominent forms of neighborhood change is gentrification. In ‘The spatial analysis of gentrification: Formalizing geography in models of a multidimensional urban process,’ Elijah Knaap first situates gentrification in the broader corpus of social theory on neighborhood change. Stressing the importance of the interplay between social behavior and spatial structure in these theories, Knaap then frames existing empirical approaches to the analysis of gentrification along several dimensions. These include models that specify gentrification as a discrete or continuous outcome, as well as whether the modeling framework is micro, such as agent-based, microsimulation or cellular automata approaches, or macro, adopting Markov or sequence-based strategies, in its orientation. Knaap highlights the methodological challenges that arise from the spatial nature of gentrification. He closes the chapter by identifying promising future research directions that will be afforded by new sensor-based sources of data on neighborhood change and advances in computational modeling.

Social networks and their analysis have a long history in the social sciences, in particular sociology, dating back to early conceptualizations by Simmel (1908). Today, social network analysis is a central paradigm in sociology. Despite this long history, it is only very recently that the interplay of the social and spatial has been considered in network analysis. In ‘Social networks in space,’ Clio Andris and Dipto Sarkar consider the concept of a spatial social network (SSN), which they define as a set of geolocated nodes and edges between nodes that represent some type of social flow or interaction. The authors highlight important complementarities between the social and spatial dimensions of these networks.
They argue that classic studies of innovation diffusion in agriculture that focused on factors such as the age and ethnicity of farmers, as well as organizational ties in developing models to predict sharing of seeds, could have been enhanced by considering the distance between farmers in the diffusion process. In a similar vein, spatially explicit studies of contagious disease spread that focus on neighborhood and spatial conditions yet ignore social ties may result in misguided interventions that target only environmental channels rather than personal networks. These examples highlight the need to consider spatial social networks as a framework when examining the role of spatial context versus individual social ties. Andris and Sarkar provide an overview of best practices for working with SSNs and identify a number of outstanding research questions. Chief among these is the observation that null models for random individuals have long been used to generate random graphs against which actual social networks may be compared; extending such frameworks to spatial social networks, so as to incorporate explicit geographic information, remains an outstanding challenge for the field.

The topic of personal income inequality has received enormous attention in the last decade across the social sciences and public policy circles. However, it is only recently that regional perspectives on rising nationwide inequality have started to gain prominence. In ‘Analysing the dynamics of inter-regional inequality: The case of Canada,’ Sébastien Breau employs recently developed methods of exploratory space-time data analysis to examine the evolution of disparities across and within regions in Canada over the 1981–2016 period. Breau addresses the challenges that changing geographical boundaries at the municipality level pose for the study of longitudinal dynamics, and constructs a harmonized set of boundaries drawing on microdata from census long-form samples. Applying dynamic local indicators of spatial association with these data and regions, Breau finds a mixture of patterns of income convergence with a spatial twist, with regions such as New Brunswick, Quebec and British Columbia moving downwards, together with their geographical neighbors, in the income distribution, while other regions such as Newfoundland and Labrador move upwards from the bottom of the income distribution. Alongside these convergence patterns, Breau finds regions which are moving in opposite directions from their geographical neighbors—leading to spatial polarization. Finally, Breau examines the under-explored relationship between interregional and intraregional inequality and finds a general negative association between the growth in interregional inequality and growth in personal inequality in each region, although caution is raised about the fragility of this relationship for small changes in specification.

Energy poverty is attracting increased attention across the social sciences and policy circles. In ‘Spatial approaches to energy poverty,’ Caitlin Robinson provides an important geographical lens on this area of deprivation. The spatial structure of the network infrastructures over which energy is distributed is a vital concern in the analysis of energy poverty. Neighborhoods in peripheral areas may be off-grid due to uneven development patterns, placing residents at greater risk of energy poverty. The complexity of the infrastructure network that arises from multiple distribution companies can preclude a comprehensive understanding of the network and our ability to design interventions aimed at poverty elimination. Robinson frames the spatial analysis of energy poverty along two strands: the symptoms of energy poverty and its drivers. The former are multidimensional, reflecting empirical measures such as temperature, physical health, mental health, energy consumption, income and level of education. While these have attracted much attention in the poverty literature, the drivers of energy poverty are arguably less well understood.
Robinson illustrates the application of different approaches to mapping and exploring both the symptoms and drivers of energy poverty in different parts of the United Kingdom. She closes the chapter with an agenda for future research on spatial energy poverty.

In ‘The shape of bias: Understanding the relationship between compactness and bias in U.S. elections,’ Levi John Wolf examines the critical issue of gerrymandering: the deliberate manipulation of spatial boundaries to provide political advantage to one party. Wolf highlights the inherent challenges in, and approaches to, the analysis of gerrymandering. The role of spatial analysis in developing measures that can be used for gerrymandering forensics or for the design of districting plans is summarized. One major challenge is that the relationship between the shape of individual districts and the advantage in votes involves two different scales. Partisan advantage is seen in the state-wide vote and can be examined by comparing the votes won versus seats won relationship, whereas manipulation of boundaries occurs at the local district scale. This joint scale-subject divide makes it difficult to attribute state-wide advantage to individual districts. Wolf suggests the use of local spatial statistics as a way to address this challenge. Coupling these analytics with the commonly employed shape-based measure of boundaries, Wolf illustrates their utility with a case study of U.S. congressional voting patterns.

New Urbanism is a normative proposal for how cities should be organized. Its core principles are that cities should be walkable, mixed in use, compact, socially diverse and transit-served. In ‘Space and New Urbanism,’ Emily Talen examines the explicitly spatial dimensions of these ideals. Stressing the neighborhood as the basic unit of human settlements, Talen explores how New Urbanists have drawn upon the spatial constraints of walkable neighborhoods to develop urban design principles. Transect planning based on a rural-to-urban cross-section of a region affords a lens to examine how the level and intensity of urban character ranges within a city. Moreover, the core value of social diversity is seen as inherently spatial, as its measurement is contingent upon the spatial constraints of the walkable neighborhood. At the same time, the interplay between the design of the built environment and the success of social and land use diversity is a key concern for New Urbanists. At their core, the spatial ideals of New Urbanism are central to its mission.

The health effects of place have been a driving concern for the field of spatial analysis since its origins. In ‘Space for wellbeing,’ Victoria Houlden unpacks this relationship and expands the focus to consider the concept of wellbeing, which is more than the absence of illness. Houlden draws a distinction between the compositional and contextual characteristics of place that can impact an individual’s mental and physical wellbeing. Compositional factors consist of the demographic and socioeconomic structure of the population residing in a location, while contextual characteristics pertain to the built environment and the provision of services offered by a place. Noting that stark disparities in measures of wellbeing can be seen across spatial scales, from the local neighborhood to across cities and internationally, Houlden asks why such disparities manifest and what the roles of space are in these patterns. The methodological challenges health geographers face in addressing these questions are examined. By understanding the relationships between place and health, Houlden points towards the possibility of a more just environment to reduce health disparities and to enhance health and wellbeing for all.
Geoff Boeing, Michael Batty, Shan Jiang and Lisa Schweitzer consider the intersection of modern data science and machine learning with the long-standing use of quantitative analysis of city patterns and processes. In ‘Urban analytics: History, trajectory and critique,’ they reflect on the evolution of urban analytics as a scholarly and professional discipline. Tracing the origins of urban analytics back to classical physics and the development and encapsulation of main ideas from social physics, location theory and multivariate analysis in the twentieth century, the authors then describe the evolution of urban models, noting the shift from an early emphasis on deductive methods towards induction. The representation of spatial structure and interaction through network analytics is described, and urban applications of modern machine learning together with spatiotemporal big data are explored. The double-edged sword nature of many of the key urban analytics is highlighted, as, for example, the same contact tracing that has been critical in responding to the recent COVID-19 pandemic can also be used to ‘out’ vulnerable individuals and eliminate safe spaces. An overarching concern with the consequences of urban analytics and urban data for representativeness, privacy and equity informs this holistic review.

EMERGING CHALLENGES AND ISSUES

Above all else, the spatial and social sciences are a community and a culture. As methods and applications evolve, so too do community and culture—including how (spatial and social) science is performed, who performs it and how contributions are recognized. To round out the Handbook, this final section shines a spotlight on three elements of what is a large and complex topic, focusing on the challenges of uncertainty, replication and reproducibility, and opportunities in spatial data, especially imagery.

In ‘Uncertainty,’ David C. Folch addresses an important topic in spatial analysis and the social sciences: the fact that data, methods and theory all contribute to a certain amount of uncertainty in social science research, and this uncertainty is, to a large extent, unavoidable. Folch adopts the perspective that transparency in the communication of uncertainty is of the utmost importance in his discussion of types of uncertainty, its measurement, potential mitigations and communication of findings.

Transparency is one motivation behind the increasing weight given to reproducibility and replication of research in both the spatial and social sciences. Michael F. Goodchild takes on this topic in ‘Reproducibility and replicability in spatial science,’ opening with a broad discussion of the ‘reproducibility crisis’ in science, and then focusing more specifically on the implications for spatial analysis research. He discusses the range of ways in which reproducibility and replication are rendered more difficult in modern (often collaborative) research, and then offers a number of avenues for improvement, both technical and practical.

If there is one thing this Handbook establishes, it is the central importance of spatial data. In ‘An image library: The potential of imagery in (quantitative) social sciences,’ Daniel Arribas-Bel, Francisco Rowe, Meixu Chen and Sam Comber address the vast potential of imagery—whether from satellites, drones or cameras—for conducting social science research. Along with a thorough discussion of types and applications of imagery data, this chapter also highlights the importance of so many areas discussed in other parts of the Handbook: computing power, machine learning and AI, visualization—and all at the intersection of spatial analysis and the social sciences.

CONCLUSIONS

There is not much to add that is not stated already, here or elsewhere, within the Handbook. In surveying the landscape of spatial analysis in the social sciences—delineating boundaries and documenting the territory—we have attempted to be inclusive and wide ranging, while also recognizing that the field is expanding quickly, with new methods and applications frequently appearing, alongside emerging sub-fields and perspectives. This means that, inevitably, our Handbook may contain omissions, but we suggest that this is a feature, or reflection, of the dynamism in the field (and not a bug!) and should be taken as an indication of the creative and intellectual activity of the growing community of spatial methodologists and empiricists, within both the social and spatial sciences.

REFERENCES

Logan, J.R., 2012. ‘Making a place for space: Spatial thinking in social science,’ Annual Review of Sociology, 38, pp. 507–524.
Pearl, J., 2009. Causality. Cambridge University Press.
Richardson, D.B., Volkow, N.D., Kwan, M.P., Kaplan, R.M., Goodchild, M.F. and Croyle, R.T., 2013. ‘Spatial turn in health research,’ Science, 339(6126), pp. 1390–1392.
Simmel, G., 1908. ‘Das Geheimnis und die geheime Gesellschaft,’ Soziologie. Untersuchungen über die Formen der Vergesellschaftung, pp. 256–304.

PART 1 THEORY, FRAMEWORKS AND FOUNDATIONS

1.  GIScience through the looking glass
Barbara P. Buttenfield

1. CONTEXT: WHAT IS GISCIENCE AND WHY SHOULD IT MATTER TO SOCIAL SCIENTISTS?

The discipline of Geographic Information Science (GIScience) is relatively young, having emerged in the latter half of the twentieth century. Initially referred to as Geographic Information Systems (GIS), it encompassed the computer systems providing tools to understand physical and social environments. The term ‘GIScience’ was proposed by Goodchild (1990), who argued that the GIS discipline had matured to the point of posing ‘. . . a legitimate set of scientific questions, the extent to which these can be expressed and the extent to which they are generic, rather than specific to particular fields of application or particular contexts’ (Goodchild 1992, 32). Application of the ‘science’ label to the GIS discipline mandated methods and capabilities to solve problems that are sensitive to geographic information. An example may clarify. The traveling salesman problem and its numerous applications in economic geography absolutely pre-date the earliest advent of GIS or GIScience and require cumulation of transport distances and costs along a well-defined network (Dijkstra 1959). However, the implementation of shortest path algorithms across a less well-defined network, such as water crossing digital terrain, cannot be accomplished without also representing details about the nature of the terrain (Burrough et al. 2015).

Answers to the question of why geographic information requires special methods for analysis, or ‘why spatial data is special,’ also pre-date the wide adoption of GIS. Early explanations highlighted three properties, including spatial dependence and spatial heterogeneity (Anselin 1989) and scale dependence (Openshaw 1977; Richardson 1961). Attention to these properties continues into present-day discussions of ‘big data’ in social science (for example, Kitchin 2013; Lovelace et al. 2016; Miller and Goodchild 2015). Big data transfer easily to the context of using GIS for human geography applications and include four commonly cited elements. Data volume continues to increase in social demography and public health analyses. Data variety refers to the diverse data sources and forms including, for example, social media as a widely adopted source for crime analysis. The velocity of data collection and acquisition increases at a rate once thought unimaginable, as seen with automated sensors embedded in appliances and studied as the Internet of Things (IoT), with online transactions for shopping, banking and teleconferencing, with shorter data collection cycles from satellites and unmanned aerial vehicles, also called drones, and with streaming data feeds and video. Data veracity includes issues of data precision, accuracy, completeness, reliability and confidence.

As social scientists embrace novel data sources, types and formats, GIScience continues to adopt new methods to support data curation and analytics, and to develop strategies that can improve abilities to search for patterns, trends and outliers. The remainder of this chapter utilizes a dual lens to explain where such strategies came from, and how GIScientists work to improve them for geospatial analysis. Social science applications provide the emphasis throughout the discussion. As will become evident, the power of GIScience in supporting social science boils down to the acuity to manage and to integrate ancillary data from multiple sources, multiple spatial and temporal resolutions and in multiple forms. The chapter finishes with challenges that remain unresolved.

2. LOOKING BACKWARD: PAST TRENDS AND ACHIEVEMENTS

2.1  Disciplinary Foundations

The disciplinary roots of GIScience emerged from two disciplines. Landscape architecture promoted manual map overlay to illustrate spatial correspondence of ancillary variables with a landscape characteristic, and to infer associative or causal relationships. Early map overlay applications dating back at least to the early twentieth century focused on population assessment and town planning (Steinitz et al. 1976), combining ancillary variables such as land use, vegetative cover, terrain slope and aspect by mylar transparency and tracing. McHarg (1969) expanded on the method to document relative suitability or impact of proposed environmental modifications or land use changes on specific land parcels.

A second disciplinary root for GIScience was cartography, not only for its obvious contribution of map display but also for the dasymetric method, which refined the spatial resolution of aggregated data through the addition of continuous data and other variables aggregated to different (often smaller) mapping units. Dasymetric mapping originated with manual techniques, with the earliest publications documented in the 1800s (McCleary 1969) and 1900s (Semenov-Tian-Shansky 1928) to estimate population by partitioning enumerated data into smaller mapping units by spatial masking and areal interpolation. The distinction between landscape architecture’s overlay and cartography’s dasymetry centers on quantification. Wright (1936) presented a numeric formula to compute what he referred to as ‘fractional parts of densities’ to report the Cape Cod population at sub-township levels. Recent methods build upon early work, incorporating penalized maximum entropy to monitor error propagation as ancillary data are introduced (Nagle et al. 2013). Dasymetric small area estimates have been demonstrated to minimize variation within zones while demarcating higher variation at zonal boundaries (Mennis 2009), as well as preserving pycnophylactic properties (Tobler 1979) overall. Rapid adoption in recent years reflects advances in automation and developments in GIS interpolation tools (Petrov 2012), along with an increasing variety of ancillary data to inform results, including land cover, road density, address matching, parcel information and remotely sensed imagery (Leyk et al. 2013; Zandbergen and Ignizio 2010).
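To give a sense of the arithmetic involved, the following is a deliberately small sketch of the areal-weighting logic behind dasymetric refinement, in the spirit of Wright's 'fractional parts of densities' though not his exact formulation. The tract population, land-cover areas and density weights are all hypothetical; real workflows derive them from GIS overlay with ancillary layers such as land cover or parcels.

```python
# Toy sketch of dasymetric reallocation: a tract's enumerated population is
# redistributed to the land-cover classes it contains, weighted by assumed
# relative population densities. All figures and weights are hypothetical.
tract_population = 10_000

# Area (km^2) of each ancillary land-cover class within the tract, e.g. from overlay.
areas = {"residential": 4.0, "commercial": 1.0, "water": 2.0, "forest": 3.0}

# Assumed relative density weights per class (uninhabited classes get zero).
density_weights = {"residential": 1.0, "commercial": 0.15, "water": 0.0, "forest": 0.02}

# Share of population assigned to each class is proportional to area x weight.
scores = {c: areas[c] * density_weights[c] for c in areas}
total = sum(scores.values())

estimates = {c: tract_population * s / total for c, s in scores.items()}

for land_class, pop in estimates.items():
    density = pop / areas[land_class]
    print(f"{land_class:>12}: {pop:8.1f} people  ({density:7.1f} per km^2)")

# Volume is preserved in the simple sense that the estimates sum back to the
# tract total, while concentrating population in the inhabited classes.
assert abs(sum(estimates.values()) - tract_population) < 1e-6
```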

2.2  Advances in GIScience Tools

These two disciplinary foundations complemented software and technology advances that afforded social scientists novel opportunities. The rise of personal computers in the 1980s initiated a democratization of GIS technology whose benefits had previously been accessible to large federal agencies such as the U.S. Census Bureau (Cooke 1992), to federally funded research projects (Tomlinson 1984) or to emerging commercial ventures (Dangermond 1982). Smaller and faster processors and graphical interfaces facilitated asking and answering questions with maps. Software such as SYMAP, ASPEX, CALFORM and other products in Harvard’s suite of mapping packages (Fisher 1982) allowed educators and researchers to generate their own thematic maps of socioeconomic data and to teach these skills to others. The rise of the NSFNET in 1990 and Berners-Lee’s 1991 World Wide Web linked hypertext files into an integrated system accessible across the information network (Couldry 2012), which in turn facilitated electronic data dissemination and citizen access, thus encouraging wider spatial literacy (Abbate 1999; NRC 2006).

Software developments to support geographic modeling and analysis continued with Tomlin’s (1990) Map Algebra, which formalized GIS operations as a new type of matrix manipulation, opening the way for GIS users to program operations not included or anticipated in the original command sets of commercial GIS software. Demands expanded within the geographic and environmental science communities to incorporate spatial statistical analysis tools into GIS products, such as clustering, nearest-neighbor analysis, spatial dependence metrics, geostatistical interpolation, heat maps and geographically weighted regression, to name a few. As commercial packages began to offer patches to programming languages, researchers created new tools to analyze spatial patterns while relying on the GIS environment to handle mechanics such as map projection and cartographic display, data import and format conversion, and database management.
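As a minimal illustration of the map algebra idea, in which raster layers are combined cell by cell ('local' operations) or summarised over neighbourhoods ('focal' operations), the sketch below uses plain NumPy arrays as stand-ins for rasters. The layers and the suitability rule are invented for the example and do not reproduce any particular GIS package's operator syntax.

```python
# Map algebra as array manipulation: a 'local' operation combines rasters cell by
# cell, and a 'focal' operation summarises each cell's 3x3 neighbourhood. Plain
# NumPy is used as a stand-in for a GIS package's map algebra operators; the
# rasters and the suitability rule are invented for illustration.
import numpy as np

rng = np.random.default_rng(7)

slope = rng.uniform(0, 30, size=(6, 6))          # per-cell slope in degrees
land_use = rng.integers(0, 3, size=(6, 6))       # 0 = open, 1 = farmland, 2 = built-up

# Local operation: suitable where slope is gentle AND the cell is not built-up.
suitable = (slope < 15) & (land_use != 2)

# Focal operation: mean suitability in each cell's 3x3 neighbourhood (edges padded).
padded = np.pad(suitable.astype(float), 1, mode="edge")
windows = np.lib.stride_tricks.sliding_window_view(padded, (3, 3))
focal_mean = windows.mean(axis=(2, 3))

print("suitable cells:", int(suitable.sum()), "of", suitable.size)
print("focal mean suitability:\n", np.round(focal_mean, 2))
```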
2.3  External Developments

At roughly the same time (1990s), three external developments helped to strengthen the foundations of GIS support for all GIS users. The first was decryption of GPS signals for civilian use, which facilitated real-time data entry for users doing fieldwork. Within a decade of civilian access, Abler (1993, 134) noted that GIS offered a complementary function, namely, to provide a spatial context for GPS data through overlay with other variables, offering an example of maintaining personal safety when navigating with GPS through an unfamiliar neighborhood. He also proposed that coupling GPS with GIS could transform geographic analysis from mapmaking to data production, for example by timestamping and geo-referencing ethnographic field interviews as they were collected, or by recording urban greenspace at the level of individual trees.

The second development established geospatial data standards, which helped to make geospatial data more interoperable, easier to import into statistical and mapping packages and thus more readily usable among social science communities. With production of increasing volumes of data and increasing needs to discover, analyze and share that information, efforts began in several countries to establish data standards and spatial data infrastructures to facilitate data sharing among local, regional, national and international governments. The United States took a lead in early efforts, building a collaboration among 16 (now more than 30) federal agencies that led to the formal acceptance in 1993 of the Spatial Data Transfer Standard (SDTS) (NRC 2012). Recently, the federal government switched to the Geospatial Interoperable Reference Architecture (GIRA 2015), a more flexible standard designed for enterprise architecture and larger geospatial databases. International standards organizations such as the International Organization for Standardization (ISO, https://www.iso.org) and the Open Geospatial Consortium (OGC, https://www.ogc.org) began to harmonize standards efforts internationally and promote open-source data and software tools.

The third development emanated from the 1993 solicitation from the National Science Foundation, National Aeronautics and Space Administration and Defense Advanced Research Projects Agency for information scientists and computer scientists to develop and implement online digital libraries (https://www.nsf.gov/news/news_summ.jsp?cntn_id=103048), which created strategies for cataloging, searching and curating the ever-expanding volumes of text, tabular, video and geospatial data in a networked working environment. Six projects were funded in 1994 and 36 additional projects in 1999, to support online repositories of video news clips, full-text documents, maps and satellite and medical imagery. These three external developments made spatial data easier to georeference, to exchange and to manage.

2.4  Emergence of Critical GIS in Social Geography

Methodological advances in GIScience have impacted spatial analysis in the social sciences in important ways. It is, however, also meaningful to consider how concepts emerging from social science have impacted GIScience. One of the most dramatic relates to the emergence of the tenets of Critical GIS, described by social geographers concerned about the inability of conventional GIS to accommodate non-western, non-Cartesian representations of geographic spaces. Critics argued that GIScience was exclusionary, elitist and biased toward surveillance and control rather than empowering individuals and communities to participate in their own social welfare. For a chronology of the diverse and evolving critical perspectives, see for example Pickles (1995), Crampton and Krygier (2005), O’Sullivan (2006) or Harvey (2018).

The view of GIScience theory and technology as instruments of power, rather than as a supporting foundation for community involvement in spatial decisions and policy, gained traction in response to the National Science Foundation’s award in 1988 of a multi-year project catalyzing geospatial analysis research, education and outreach (Abler 1987). The National Center for Geographic Information and Analysis (NCGIA) was criticized in early years for neglecting to incorporate a critical theory perspective in its research agenda. ‘Critics felt GIS failed to accommodate less rational, more intuitive analyses of geographical issues, and that its methodology, by definitions, excluded a range of inquiry. GIS scholars, meanwhile, saw the value of their techniques being denigrated without really realizing why’ (Schuurman 2000, 577). NCGIA leadership invited critical social scientists to organize a research agenda for a 3-year research initiative on ‘GIS and society, the social implications of how people, space and environment are represented in GIS’ (Harris and Weiner 1996; Sheppard 1995). These debates led to improved cooperation and collaboration between GIScientists and critical social scientists, as both sides of the debate were better informed about each other’s intended efforts (Pickles 1999). Public Participation GIS emerged during the same decade and provided further opportunities for cooperation to democratize GIS technology and tools, and marked an important recognition that GIScience and critical social science could benefit from such cooperation (Craig et al. 2002; Elwood and Leitner 1998; Ghose and Elwood 2003; Sieber 2006). Indigenous communities began to explore ways to articulate and archive their various and diverse communal landscape knowledge sets (Laituri 2011) to resolve conflicts about land claims and resource management.

3. LOOKING FORWARD: RECENT DEVELOPMENTS AND EMERGING METHODS

GIScience and social science analysis continue to complement one another, and one can see advances offering new opportunities to both disciplines. The advances described below reflect those developments most likely to impact and advance spatial analysis in the social sciences.

As a cautionary note, technology tends to advance in many branches of science more rapidly than theory or ethics of practice, and this aspect will be covered in the chapter section on ongoing challenges.

3.1  Shift from Static Mapping to Dynamic Geovisualization

The earliest move towards dynamic mapping in GIScience can be seen with the rise of hypermedia in the mid-1990s, leading the way for interactivity and eventually for animation, available at first in graphics packages, then GIS platforms and then in programming libraries. Present methods offer more sophisticated functionality that enables a shift from illustrating form to process, and from display to analysis, along with a new label of geovisualization. Early examples in regional planning and alternative futures rendered differing landscapes that could result from different planning agendas (Baker et al. 2004). Integration of immersive VRML (Virtual Reality Modeling Language) within GIS frameworks introduced photorealistic three-dimensional views of urbanized areas as well as of satellite imagery draped over digital terrain. As processing speed and power improved, interactive visual landscape exploration became readily accessible on local (desktop) computing platforms (Huang et al. 2001). Network-based field trips to dangerous or inaccessible places offered research and educational benefits (Dykes et al. 1999). Integrated sensor networks permitted remote exploration and pattern detection to analyze and address urban patterns of crime (Collins et al. 2000) and the architectural design of interior spaces such as hospitals (Klippel and Winter 2005). Augmented reality provides another geospatial technology by overlaying actual geographic landscapes with computer-generated signage or objects that synchronize in near real time with the viewer’s position and orientation (Ebling and Cáceres 2010). Augmentation can be tailored to user needs and priorities, such as navigation assistance for visually challenged individuals (Loomis et al. 2007). While many current applications are focused on wayfinding or local services (for example, www.metroparisiphone.com/index_en.html), recent work connecting augmented reality with location-based social networks affords analytic capabilities for studying daily social interactions or implementing disaster response (Liu and Fuhrmann 2018). Taken together, these new methods are transforming maps from an information repository (a window into the database) into geovisualization as an information refinery wherein the visual display becomes a mechanism for exploration and analysis rather than simply an illustration of results.

3.2  Open-Source Data and Software and Reproducible Science

The commercialization of statistical packages and spatial databases as well as of remote sensing and GIS software platforms in the face of rising demand led to pricing structures that marginalized many prospective user communities. The programming and scientific communities responded, and two flavors of open-source software tools emerged. One followed a top-down process, whereby a relatively closed team develops and polishes a product prior to release. The other releases software in development early and often, encouraging input and code modifications that may not have been initially envisioned (Rey 2008).
Both make their code openly available, and both work with user communities to modify and improve versions, with the distinction being a centralized as opposed to evolutionary strategy for product design.

Statistical packages such as R (https://www.r-project.org), GeoDa (http://geodacenter.github.io/) and PySAL (https://pysal.org/) provide a suite of geocomputation tools to the spatial analysis community. GRASS and QGIS provide open-source GIS environments with embedded command sets as well as patches to open-source programming languages such as C or Python. One imaging platform in wide use is NASA WorldWind (https://worldwind.arc.nasa.gov/), which exemplifies the top-down model that Rey described. OpenStreetMap (OSM, https://www.openstreetmap.org/#map=4/38.01/-95.84) is an example of the second type. It is a database of road networks, transportation hubs and local services built by an open community of global contributors. OSM gained particular visibility following the 2010 earthquake in Haiti, when over 500 users added map information in sufficient detail that, in the course of just a few weeks, it became the most current and reliable information source for disaster response by major relief agencies (Heinzelman and Waters 2010). Reproducible science has been a natural extension of the open-source movement, lobbying for dissemination of methods including code and data so that others can replicate results of published scientific work. The intention is to improve research efficiency and robustness as well as to ensure credibility (Munafò et al. 2017). The principles of transparency and repeatability have been adopted quickly and widely by researchers, institutions and funding agencies. An additional benefit is providing an avenue for spatial analysis methods to be explored using already analyzed data by other social scientists, who can learn the advantages and limitations of the methods prior to adopting them in their own work.

3.3  The 4th Paradigm, Machine Learning and Spatial Data Science

As data volumes have continued to increase, the need to take full advantage of explicit and latent patterns in the data has grown more pressing, in some cases overtaking traditional approaches that formalize theory and subsequently try to validate it with empirical results. Expanding the three scientific paradigms of experimental science, theory building and computational science, Gray (2007) postulated a Fourth Paradigm of data-intensive science and data-driven discovery. One approach reflecting this paradigm shift, machine learning, uses raw and pre-processed data to train the parameters of a classification or regression model. In social science, an example might be training a model on socioeconomic covariates such as education, unemployment, concentration of populations of color and access to healthcare in order to study infant mortality rates. Adoption of machine learning methods provides important benefits to spatial analysis in the social sciences to elicit patterns in data when social and environmental variables display subtle or complex interactions. Another benefit is to perform reliable data reduction. Some machine learning methods (for example, random forests) can offer insights as to the specific contributions of input variables to refine predictions (Bishop 2006). A third benefit is that training can improve parameter specification. One might pose the question of whether machine learning will advance to the point where decisions among alternative modeling strategies and parameter optimization can become autonomous, alleviating the need for manual intervention. This possibility was raised by the Fourth Paradigm community a decade ago (Hey et al. 2009) and may be coming to fruition today, and GIScience is playing an important role.
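As a rough illustration of that workflow (not drawn from the chapter, and with entirely hypothetical file and column names), a random forest can be fit to tract-level covariates and then queried for the relative contribution of each input:

# Hypothetical sketch: a random forest relating socioeconomic covariates to an
# infant mortality rate, then inspecting which inputs drive the predictions.
# The data file and column names are illustrative, not from any real source.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("tracts.csv")  # hypothetical table of census-tract attributes
X = df[["pct_no_hs_diploma", "unemployment_rate",
        "pct_people_of_color", "clinics_per_10k"]]
y = df["infant_mortality_rate"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_train, y_train)

print("Held-out R^2:", rf.score(X_test, y_test))
# Feature importances hint at each covariate's contribution to the predictions
for name, importance in sorted(zip(X.columns, rf.feature_importances_),
                               key=lambda pair: -pair[1]):
    print(f"{name}: {importance:.3f}")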
GIScience offers capabilities to examine spatial distributions and spatial relationships among social and environmental covariates in the presence or absence of a variable in question, which can assist in selection of modeling parameters.
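A minimal sketch of that kind of exploratory check, using the open-source PySAL stack mentioned above (the file and column names here are invented), would be to compute a spatial autocorrelation statistic for a covariate before it enters a model:

# Hypothetical sketch using the PySAL stack to ask whether a covariate is
# spatially clustered before it is fed into a model; file and column names
# are illustrative only.
import geopandas as gpd
from libpysal.weights import Queen
from esda.moran import Moran

gdf = gpd.read_file("tracts.gpkg")      # hypothetical polygon layer
w = Queen.from_dataframe(gdf)           # contiguity-based spatial weights
w.transform = "r"                       # row-standardise the weights

mi = Moran(gdf["unemployment_rate"], w)
print(f"Moran's I = {mi.I:.3f}, pseudo p-value = {mi.p_sim:.3f}")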

Statistical tools such as Empirical Bayesian Kriging automatically optimize variogram parameters by generating equiprobable realizations on data subsets. Geostatistical Kriging Simulation generates multiple realizations of interpolated surfaces, optionally conditioned on ancillary information. The objective is to compensate for local variability lost by the tendency to smooth local averages in a single manifestation (Pilz and Spöck 2008). These and similar types of automated parameter and model-selection tools are available in commercial and open-source GIS software. Two arguments against the use of machine learning are, first, that it often generates model results whose logic is difficult to follow or that do not relate to disciplinary theory (Gahegan 2020) and, second, that one must acquire sufficient data from diverse independent and identically distributed sources to build up a training data set whose characteristics have a demonstrated capacity to represent a full suite of possible results (Goodfellow et al. 2016). The first argument holds true in many natural and social science applications. The second may be a more constricting limitation for social science, and particularly for problems that rely on primary data sources, which tend to be more costly and time- or labor-intensive to collect, thus reducing the availability of robust training data sets. For applications relying upon secondary data sources, such as in digital humanities, census demography and coarse-scale public health, training data sets can be generated with careful attention to pre-processing. With or without training, data science offers new methods for analysis that accommodate much larger data volumes, letting the data speak for themselves (Gould 1981) to elicit underlying structure. The emergence of spatial data science follows directly in developing data-intensive methods for spatial and spatiotemporal analyses. Anselin (2020) describes spatial data science as focusing explicitly on the importance of ‘Where.’ Instead of using geographic coordinates simply as dummy variables or as external ancillary data, spatial data science treats location and spatial relations as integral data characteristics to inform analysis. Spatially explicit modeling, context-aware spatial indexing, spatial and temporal dependence and interaction – all have played important roles in the adoption of spatial data science and data-driven discovery.

3.4  Dynamic Modeling

The term dynamic modeling refers to the capability to analyze process rather than form. Dynamic models tend to integrate very large social data sets (big data such as hourly streaming traffic counts, or national health statistics by county) with environmental variables. Their processing methods incorporate simulation, machine learning and automatic feature extraction, exemplifying the tenets of the Fourth Paradigm, and reflecting Goodchild and Haining’s (2004) predictions about the convergence of GIScience with spatial analysis. Current dynamic modeling research focuses upon object movement, such as monitoring migratory human and wildlife communities in the face of changing environmental conditions (Dodge 2021), or on change detection of the form and structure of fixed-position processes such as urban growth and expansion or the economic impacts of probable land use change (Deal and Shunk 2004).
The use of remote sensors on fixed or moving platforms (satellites, drones or cars) can capture and stream local conditions such as hourly traffic counts, individual travel paths, weather and energy consumption, and integrate these with social variables such as changing commuter behaviors and trajectories in reaction to traffic volumes (Xiong et al. 2018). Dynamic models couple with machine learning to examine the health impacts of particulate exposure through the duration of large wildfires and to differentiate these impacts by age, gender, socioeconomic status and income (Reid et al. 2016).


4. ONGOING CHALLENGES FOR GIS IN SOCIAL SCIENCE

The big questions currently facing spatial analysis in social science revolve around climate change and its impacts on economic, environmental and social dimensions of sustainability. These might be seen to break out into areas such as population growth, economic globalization, public health, sustainable development, smart cities, crime and food safety, although this list is not exhaustive. The complementary challenge for GIScience is to ensure ‘. . . the effective and efficient use of dynamic and disaggregated data for decision-making, delivery of services, citizen empowerment, entrepreneurship, competitiveness and innovation . . .’ (Walter 2020, 9) in support of sustainable societies locally, regionally and globally. GIScience has a documented acuity for archiving, integrating and managing increasingly massive amounts of data from multiple sources, and can take a major role in supporting the needs of social science analysis and informed decision-making. But challenges remain, and these can be consolidated into a few general categories.

4.1  Managing Unstructured Geospatial Data

The problem associated with handling data such as social media, newsfeeds, streaming video and real-time sonic data is a lack of systematic formatting that makes them difficult to work with or, in many cases, to catalog (Kitchin 2013, 263). Real-time streaming data may alleviate some social issues, such as modulating traffic flows or regulating electricity usage. Still, the unique processing demands and data types that are being exploited require tools and structures that currently are not available in most commercial statistics and GIS environments. The demand for innovative tools has increased reliance on open-source processing and adaptable interfaces, public-domain code libraries and data frameworks that may lack interoperability. This in turn limits the use of such tools to proficient programmers, often blocking access for social scientists who lack strong coding skills. Related to these data issues is the impact of relying upon them as primary information sources. Miller (2020) argues that the wide availability of affordable sensors, cyber telecommunications and the IoT allows faster machine-based reactions to changing conditions, forming a foundation for smart cities technology (Batty 2013). He cautions that unintended consequences can accrue, however, such as the loss of human decision-makers in policy formation and the possible prioritization of autonomous systems using resources in unsustainable ways. The challenge for GIScientists focuses on preserving the usability of the tools they create, and on providing understandable documentation describing data collection, curation and management.

4.2  Uncertainty Assessment and Monitoring

In GIScience, uncertainty is an umbrella term for doubt (Fisher 1982), encompassing accuracy, error, reliability, risk, confidence, incompleteness, even vagueness and ambiguity. In social science the term more commonly refers to accuracy and error. Couclelis (2003) expands the concept beyond imperfect or poorly used information to also consider cognitive elements and the interplay between the GIS platform(s) and the people who interpret the products of GIS analyses. She says that uncertainty is inherent in every knowledge production process, including spatial modeling, but is not a problem in and of itself. The problem instead lies

with inadequate representation and communication of uncertainty. One ongoing challenge for GIScience lies with this insufficiency, in spite of numerous publications reporting various metrics at database, file and item levels (for example, Shi 2010). A second challenge continues to plague the discipline, namely that uncertainty is often accounted for following rather than during analysis, more of an afterthought than an acknowledgment of the inherent essence described by Couclelis. This becomes an ever-more pressing issue as GIS technologies advance, and as methods such as simulation, optimization and machine learning become more sophisticated and more widely relied upon. While methods are available to monitor model drift during cellular automata procedures (Yeh and Li 2006) or other types of iterative analyses (Zhang and Goodchild 2002), such methods are applied at the end of the iteration, not as an embedded control. Pappenberger and Beven (2006, 5) suggest that one reason for the avoidance of uncertainty analysis in modeling generally is a lack of formalized guidance about equally plausible methods that might be sensitive to model complexity or to sources of uncertainty. They add that ‘Uncertainty estimation is currently considered to be an added component to an analysis, at extra cost. It should, however, be an intrinsic and expected part of any modeling exercise.’

4.3  Information Privacy and Ethical Use

The emergence of Location Based Services (LBS) and geo-surveillance technologies raises difficult questions about invasive impacts on individual and community privacy. On one hand, information about people’s location and behaviors can protect personal safety and property, as well as assisting emergency response and, most recently, COVID-19 contact tracing. On the other hand, primarily because of cell phones, ‘. . . millions of people are now walking around with a gizmo in their pocket that not only knows where they are but also plugs into the Internet to share that info [sic], merge it with online databases, and find out what – and who – is in the immediate vicinity’ (Honan 2009, 72). Businesses rely on LBS for marketing and advertising their services and location. Still, there is a compelling need to protect respondent privacy in surveys and particularly in collecting census demography. Traditional methods of disclosure avoidance (for example, withholding selective data or swapping observations between proximal enumeration units) not only reduce the accuracy of summary statistics (Kounadi and Resch 2018) but can remain effective only by not publishing the protocols (Hawes 2020). A strategy adopted for the 2020 Census is differential privacy (Abowd 2018), defined as the practice ‘. . . to obscure the presence or absence of any individual, or small group of individuals, while at the same time preserving statistical utility’ (Dwork 2014, 302). Lane et al. (2014) counter the argument, asking for clarification within the scientific community about the ‘rules of engagement’ for differential privacy in statistical analysis. Streaming data collection using fixed or mobile sensors raises additional concerns about individual anonymity and security, especially given GIS acuities to integrate raw data with ancillary data (Lee and Kwan 2018).
Dodge (2021, 15) cautions: ‘As mobile data collection has become prevalent, the question is how data science can help meaningful knowledge discovery while protecting the privacy and security of tracked individuals in the context of both humans or endangered species.’ The challenge to enrich analytical approaches with proper geomasking and aggregation to ensure protection of private information forms a promising area for collaborative research between spatial analysts and GIScientists.
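To make these two protective ideas concrete, the sketch below shows a ‘donut’ geomask, which displaces a point by a random distance within an annulus, and a Laplace mechanism, which adds noise calibrated to a privacy budget before a count is released. Both are minimal illustrations with made-up parameters rather than production-ready protections.

# Hypothetical sketch of two disclosure-limitation ideas discussed above:
# (1) 'donut' geomasking, which displaces a point by a random distance between
# a minimum and maximum radius, and (2) Laplace noise calibrated for
# differential privacy on a simple count. Parameters are illustrative only.
import math
import random

def donut_geomask(x, y, r_min=100.0, r_max=500.0):
    """Displace a projected (metre-based) coordinate within an annulus."""
    angle = random.uniform(0.0, 2.0 * math.pi)
    distance = random.uniform(r_min, r_max)
    return x + distance * math.cos(angle), y + distance * math.sin(angle)

def dp_count(true_count, epsilon=1.0, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    u = random.uniform(-0.5, 0.5)  # inverse-CDF sampling of a Laplace variate
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

print(donut_geomask(531000.0, 181000.0))  # masked easting/northing (metres)
print(dp_count(42))                       # noisy version of a count of 42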

O’Sullivan (2006, 786) points out that this has been a longstanding ethical issue: ‘Concerns about the intrusion of GIS and geodemographic analysis into individual private lives were an important component of the original critiques of GIS (see Goss 1995a; 1995b), and such concerns only become more acute as detailed individual data become mappable.’ Klinkenberg (2007) takes a different approach, arguing that location-based technologies are not inherently harmful to society, and in fact can help in addressing societal issues ranging from environmental conservation to access to food and social services. The challenge for GIScientists is to educate people that the use of geospatial technology (rather than the technology itself) is the root cause of concerns and fear. And of course, it is necessary to use information ethically. The ethical use of geospatial information involves not only GIScientists, social scientists and others who analyze the data, but also those who curate and steward the data, namely data librarians and archivists. Blatt (2012) lists several elements of ethical conduct in this regard, including quality assurance and reporting, data validation practice, intellectual property rights and licensing, information liability and disclaimers on fitness for use, and metadata creation and maintenance. She raises a number of scenarios to highlight, for example, responsibility for data validation as opposed to adding disclaimers to the metadata; protecting patient anonymity when transferring public health research data to a semi-permanent archive (as required by universities and funding agencies); or using email or file transfer protocols to share protected data, such as cadastral information that contains individuals’ names and addresses, with collaborators. Ethical questions are currently arising in news feeds about a private company’s use of social media to collect and sell facial recognition data on daily mobility patterns (Hill 2020). The information is reportedly used by law enforcement and credit card fraud offices who admit to adopting it without understanding details of how it works or who built the database of 3 billion images in the first place. The company (Clearview AI) has declined to provide a list of its buyers. In Hill’s article, the New York Times claims to have analyzed the algorithms and discovered augmented reality functionality that displays name and address identification of each facial image stored in the company database, along with geo-referenced coordinates of where each image was captured. A full discussion on the weaponization of social media is beyond the scope of this chapter, but it remains one of the most serious ongoing challenges to location-based information in geospatial data science today.

5. SUMMARY AND PROSPECTS

GIScience at its core integrates data from multiple sources, scales and semantics. Data synthesis is its primary strength and a core functionality. And the fact that social systems are not isolated from natural systems but coupled to varying degrees both complicates and enriches spatial analysis. GIScience supports the perspective that accommodating these tight and loose couplings can and will continue to advance understanding of societal dynamics, thus supporting the foundations of spatial analysis in the social sciences. Spatial analysis of environmental justice and social vulnerabilities arising from natural hazards, disease outbreaks and reduced sources of economic assistance can also advance thinking in GIScience by calling for the development of innovative spatial and spatiotemporal analytic capabilities and data frameworks. The need for research on interacting societal and natural system dynamics is reflected in a national transdisciplinary initiative referred to as Convergence Research (NSF 2020), whose objective is to address complex societal needs by combining disciplinary expertise with

innovation to bring about new forms of discovery and innovation (https://beta.nsf.gov/funding/learn/research-types/learn-about-convergence-research). This initiative presents further encouragement for cross-pollination between GIScience and spatial analysis in the social sciences. In GIScience, convergence can be seen to leverage the strengths of geospatial analysis, cyber-infrastructure and spatial data science to generate the new methods needed to address the complex suite of questions facing social scientists. ‘Convergence in knowledge, technology, and society is the accelerating, transformative interaction among seemingly distinct scientific disciplines, technologies, and communities to achieve mutual compatibility, synergism, and integration, and through this process to create added value for societal benefit’ (Roco et al. 2014). The Convergence platforms of human activity are defined to include nanotechnology, biotechnology, information technology and cognitive science, environmental systems, human-scale activities and societal-scale activities. Convergence research is intended to break down silos between research communities through direct exchange of ideas and methods focused on specific application domains. NSF has issued the first set of Convergence awards, supporting workshops, summer institutes, and Research Coordination Networks, including an award for ‘Social Science Insights for 21st Century Data Science Education’ to develop curriculum content linking data science with societal problems. Clearly, the interplay between GIScience, spatial analysis and social science is not only easy to recognize within the geography discipline, but one can also see glimpses beyond the discipline of how such interactions can benefit society and the larger scientific community in the future.

REFERENCES Abbate, J. (1999). Inventing the internet. Cambridge, MA: MIT Press. Abler, R.F. (1987). The National Science Foundation Center for Geographic Information and Analysis. International Journal of Geographical Information Systems 1, 303–326. Abler, R.F. (1993). Everything in its place: GPS, GIS and geography in the 1990s. Professional Geographer 45(2), 131–139. https://doi.org/10.1111/j.0033-0124.1993.00131.x Abowd, J.M. (2018). The U.S. Census Bureau adopts differential privacy. KDD ’18, Proceedings 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (p. 2867). https://doi. org/10.1145/3219819.3226070 Anselin, L. (1989). What is special about spatial data? Alternative perspectives on spatial data analysis. Presented at the Spring 1989 Symposium on Spatial Statistics, Past, Present and Future, Department of Geography, Syracuse University, Syracuse, NY. Retrieved 27 January 2022 from https://pdfs.semanticscholar.org/881e/6ed3c91a26996 5ff4724f3ccd6c1fd78bd25.pdf Anselin, L. (2020). Spatial data science. In D. Richardson, N. Castree, M.F. Goodchild, A. Kobayashi, L. Weidong, and R.A. Marston (Eds.), The international encyclopedia of geography: people, the earth, environment and technology (pp. 1–6). New York: John Wiley & Sons. Retrieved 27 January 2022 from https://doi.org/10.1002/9781118786352. wbieg2015 Baker, J.P., Hulse, D.W., Gregory, S.V., White, D., Van Sickle, J., Berger, P.A., Dole, D., and Schumaker, N.H. (2004). Alternative futures for the Willamette River basin, Oregon. Ecological Applications 14(2), 313–324. Batty, M. (2013). The new science of cities. Cambridge, MA: MIT Press. Bishop, C.M. (2006). Pattern recognition and machine learning. New York: Springer Science and Business Media. Blatt, A.J. (2012). Ethics and privacy issues in the use of GIS. Journal of Map and Geography Libraries 8(1), 80–84. Retrieved 27 January 2022 from https://doi.org/10.1080/15420353.2011.627109 Burrough, P.A., McDonnell, R. A., and Lloyd, C.D. (2015). Principles of geographical information systems. 3rd ed. Oxford, UK: Oxford University Press. Collins, R.T., Lipton, A.J., Kanade, T., Fujiyoshi, H., Duggins, D., Tsin, Y., Tolliver, D., Enomoto, N., Hasegawa, O., Burt., P., and Wixson, L. (2000). A system for video surveillance and monitoring. Pittsburgh, PA: Carnegie Mellon University, Report CMU-Ri_TR-00-12. Retrieved 27 January 2022 from https://www.ri.cmu.edu/pub_files/pub2/ collins_robert_2000_1/collins_robert_2000_1.pdf

GIScience through the looking glass  13 Cooke, D.F. (1992). Topology and TIGER: The Census Bureau’s contribution. In T.W. Foresman (Ed.), The history of GIS (geographic information systems) (pp. 47–57). Upper Saddle River, NJ: Prentice Hall. Couclelis, H. (2003). The certainty of uncertainty: GIS and the limits of geographical knowledge. Geographical Analysis 7(2), 165−175. https://doi.org/10.1111/1467-9671.00138 Couldry, N. (2012). Media, society, world: Social theory and digital media practice. London: Polity Press. Craig, W.J., Harris, T.M., and Weiner, D. (Eds.) (2002). Community participation and geographic information systems. London and New York: Taylor and Francis. Crampton, J.W., and Krygier, J. (2005). An introduction to critical cartography. ACME: An International E-Journal for Critical Geographies 4(1), 11–33. Dangermond, J. (1982). The process of designing an urban geographic information system: The case of Anchorage. Computers, Environment and Urban Systems 7, 301–313. Deal, B., and Shunk, D. (2004). Spatial dynamic modeling and urban land use transformation: A simulation approach to assessing the costs of urban sprawl. Ecological Economics 51(1–2), 79–95. Retrieved 27 January 2022 from https://doi.org/10.1016/j.ecolecon.2004.04.008 Dijkstra, E.W. (1959). A note on two problems in connexion with graphs. Numerische Mathematik 1, 269–271. Dodge, S. (2021). A data science framework for movement (50th Anniversary Special Issue). Geographical Analysis 53, 1–21. Retrieved 27 January 2022 from https://doi.org/10.1111/gean.12212 Dwork, C. (2014). Differential privacy: A cryptographic approach to private data analysis. In J. Lane, V. Stodden, S. Bender, and H. Nissenbaum (Eds.), Privacy, big data and the public good: Frameworks for engagement (Chapter 14, pp. 296–322). London and New York: Cambridge University Press. Dykes, J., Moore, K., and Wood, J. (1999). Virtual environments for student fieldwork using networked components. International Journal of Geographical Information Science 13, 397–416. Ebling, M.R., and Cáceres, R. (2010). Gaming and augmented reality come to location-based services. IEEE Pervasive Computing 991, 5–6. Retrieved 27 January 2022 from https://doi.org/10.1109/MPRV.2010.5 Elwood, S., and Leitner, H. (1998). GIS and community-based planning: Exploring the diversity of neighborhood perspectives and needs. Cartography and Geographic Information Systems 25(2), 77–88. Fisher, H.T. (1982). Mapping information: The graphic display of quantitative information. Cambridge, MA: Abt Books. Gahegan, M. (2020). Fourth paradigm GIScience? Prospects for automated discovery and explanation from data. International Journal of Geographical Information Science 34(1), 1–21. Retrieved 27 January 2022 from https:// doi.org/10.1080/13658816.2019.1652304 Ghose, R.H., and Elwood, S. (2003). Public participation GIS and local political context: Propositions and research directions. URISA Journal 15, 17–24. GIRA (2015, April). Geospatial interoperability reference architecture (GIRA): Increased information sharing through geospatial interoperability (p. 212). Washington, D.C. https://www.hsdl.org/?view&did=786490 Goodchild, M.F. (1990). Keynote address: Spatial information science. Proceedings Fourth International Symposium on Spatial Data Handling (vol. 1, pp. 13–14). Zurich, Switzerland. Goodchild, M.F. (1992). Geographical information science. International Journal of Geographical Information Systems 6(1), 31–45. Goodchild, M.F., and Haining, R.P. (2004). 
GIS and spatial data analysis: Converging perspectives. Papers in Regional Science 83(1), 363–385. Retrieved 27 January 2022 from https://doi.org/10.1007/s10110-003-0190-y Goodfellow, I., Bengio, Y., and Courville A. (2016). Deep learning. Cambridge, MA: MIT Press. Goss, J. (1995a). Marketing the new marketing: The strategic discourse of geodemographic information systems. In J. Pickles (Ed.), Ground truth (pp. 130–170). New York: Guilford Press. Goss, J. (1995b). We know where you are and we know where you live: The instrumental rationality of geodemographic information systems. Economic Geography 71, 171–198. Gould, P. (1981). Letting the data speak for themselves. Annals of the Association of American Geographers 71(2), 166–176. Retrieved 27 January 2022 from http://dx.doi.org/10.1111/j.1467-8306.1981.tb01346.x Gray, J. (2007, January 11). e-Science: A transformed scientific method. Transcript of a presentation to NRC Computer Science and Telecommunications Board. Retrieved 27 January 2022 from http://research.microsoft.com/en-us/um/ people/gray/JimGrayTalks.htm Harris, T.M., and Weiner, D. (1996). GIS and society: the social implications of how people, space and environment are represented. In GIS. Report of the initiative 19 specialist meeting Santa Barbara (pp. 96–97). CA: National Center for Geographic Information and Analysis Technical Report. Retrieved 27 January 2022 from http://ncgia. ucsb.edu/technical-reports/PDF/96-7.pdf Harvey, F. (2018). Critical GIS: Distinguishing critical theory from critical thinking. Canadian Geographer 62(1), 35–39. Retrieved 27 January 2022 from https://doi.org/10.1111/cag.12440 Hawes, M.B. (2020). Implementing differential privacy: Seven lessons from the 2020 United States Census. Harvard Data Science Review. Retrieved 27 January 2022 from https://doi.org/10.1162/99608f92.353c6f99 Heinzelman, J., and Waters, C. (2010). Crowdsourcing crisis information in disaster-affected Haiti (p. 16). Washington, D.C.: United States Institute of Peace. Retrieved 27 January 2022 from www.jstor.org/stable/resrep12220

14  Handbook of spatial analysis in the social sciences Hey, T., Tansley, S., and Tolle, K. (2009). The fourth paradigm: Data-intensive scientific discovery. Redmond, WA: Microsoft Research. Hill, K. (2020, January 18). The secretive company than might end privacy as we know it. New York Times Technology Section. Retrieved 27 January 2022 from https://www.nytimes.com/2020/01/18/technology/clearview-privacyfacial-recognition.html?searchResultPosition=1 Honan, M. (2009). I am here. Wired 17(2), 70–75. Huang, B., Jiang, B., and Li, H. (2001). An integration of GIS, virtual reality and the Internet for visualization, analysis and exploration of spatial data. International Journal of Geographical Information Science 15(5), 439–456. https:// doi.org/10.1080/13658810110046574 Kitchin, R. (2013). Big data and human geography: Opportunities, challenges and risks. Dialogues in Human Geography 3(3), 262–267. Retrieved 27 January 2022 from https://doi.org/10.1177/2043820613513388 Klinkenberg, B. (2007). Geospatial technologies and the geographies of hope and fear. Annals, Association of American Geographers 97(2), 350–360. Retrieved 27 January 2022 from https://www.tandfonline.com/doi/ abs/10.1111/j.1467-8306.2007.00541.x Klippel, A., and Winter, S. (2005). Structural salience of landmarks for route directions. Proceedings of the International Conference of Spatial Information Theory (COSIT, LNCS 3693, pp. 347–362), Elliottville, New York. Kounadi, O., and Resch, B. (2018). A geoprivacy by design guideline for research campaigns that use participatory sensing data. Journal of Empirical Research on Human Research Ethics 13, 203–222. Laituri, M. (2011). Indigenous peoples’ issues and indigenous uses of GIS. In T.L. Nyerges, H. Couclelis, and R.B. McMaster (Eds.), The SAGE handbook of GIS and society (Chapter 11, 202–221). London: SAGE. Retrieved 27 January 2022 from https://dx.doi.org/10.4135/9781446201046.n11 Lane, J., Stodden V., Bender, S., and Nissenbaum, H. (2014). Privacy, big data and the public good: Frameworks for engagement. London and New York: Cambridge University Press. Lee, K., and Kwan, M.P. (2018). Automatic physical activity and in-vehicle status classification based on GPS and accelerometer data: A hierarchical classification approach using machine learning techniques. Transactions in GIS 22(6), 1–28. Retrieved 27 January 2022 from https://doi.wiley.com/10.1111/tgis.12485 Leyk, S., Buttenfield, B.P., Nagle, N.N., and Stum, A.K. (2013). Establishing relationships between parcel data and land cover for demographic small area estimation. Cartography and Geographic Information Science 40(4), 305–315. Retrieved 21 June 2022 from https://doi.org/10.1080/15230406.2013.782682 Liu, C., and Fuhrmann, S. (2018). Enriching the GIScience research agenda: Fusing augmented reality and location-based social networks. Transactions in GIS 22(3), 775–788. Retrieved 27 January 2022 from https://doi. org/10.1111/tgis.12345 Loomis, J.M., Golledge, R.G., Klatzky, R.L., and Marston, J.R. (2007). Assisting wayfinding in visually impaired travelers. In G.L. Allen (Ed.), Applied spatial cognition: From research to cognitive technology. Mahwah, NH: Lawrence Erlbaum. Lovelace, R., Birkin, M., Cross, P., and Clarke, M. (2016). From big noise to big data: Toward the verification of large data sets for understanding regional retail flows. Geographical Analysis 48, 59–81. Retrieved 27 January 2022 from https://doi.org/10.1111/gean.12081 McCleary, G.F. (1969). The dasymetric method in thematic cartography. 
PhD dissertation, Department of Geography, University of Wisconsin-Madison. McHarg, I.L. (1969). Design with nature. New York, Doubleday. Mennis, J. (2009). Dasymetric mapping for estimating population in small areas. Geography Compass 3(2), 727–745. Miller, H.J. (2020). GIScience, fast and slow – Why faster geographic information is not always smarter. Progress in Human Geography 44, 129–138. Retrieved 27 January 2022 from https://doi.org/10.1177/0309132518799596 Miller, H.J., and Goodchild, M.F. (2015). Data-driven geography. GeoJournal 80, 449–461. Retrieved 27 January 2022 from https://doi.org/10.1007/s10708-014-9602-6 Munafò, M.R., Nosek, B.A., Bishop, D.V.M., Button, K.S., Chambers, C.D., Percie du Sert, N., and Ioannidis, J.P.A. (2017). A manifesto for reproducible science. Nature Human Behavior 1(21). Retrieved 27 January 2022 from https://doi.org/10.1038/s41562-016-0021 Nagle, N.N., Buttenfield, B.P., Leyk, S., and Spielman, S. (2013). Dasymetric modeling and uncertainty. Annals of the Association of American Geographers 104(1), 80–95. Retrieved 27 January 2022 from https://doi.org/10.108 0/00045608.2013.843439 NRC (National Research Council) (2006). Learning to think spatially (p. 333). Washington, D.C.: The National Academies Press. NRC (National Research Council) (2012). Advancing strategic science: A spatial data infrastucture roadmap for the US Geological Survey (p. 115). Washington, D.C.: The National Academies Press. NSF (National Science Foundation) (2020). Convergence research at NSF. Retrieved 14 June 2020 from https:// www.nsf.gov/od/oia/convergence/index.jsp (see also NSF Program solicitation Growing convergence research NSF 19–551. Retrieved 27 January 2022 from https://www.nsf.gov/pubs/2019/nsf19551/nsf19551.htm) Openshaw, S. (1977). A geographical solution to scale and aggregation problems in region-building, partitioning and spatial modelling. Transactions of the Institute of British Geographers 2, 55–69.

GIScience through the looking glass  15 O’Sullivan, D. (2006). Geographic information science: Critical GIS. Progress in Human Geography 30(6), 783–791. Retrieved 27 January 2022 from https://doi.org/10.1177/0309132506071528 Pappenberger, F., and Beven, K.J. (2006). Ignorance is bliss: Or seven reasons not to use uncertainty analysis. Water Resource. Research 42, W05302, 8. Retrieved 27 January 2022 from https://doi.org/10.1029/2005WR004820 Petrov, A. (2012). One hundred years of fasymetric mapping: Back to the origin. The Cartographic Journal 49, 256–264. Retrieved 27 January 2022 from https://doi.org/10.1179/1743277412Y.0000000001 Pickles, J. (1995). Ground truth: The social implications of geographic information systems. New York: Guilford Press. Pickles, J. (1999). Arguments, debates and dialogues: The GIS social theory debate and the concern for alternatives. In P.A. Longley, M.F. Goodchild, D.J. Maguire, and D.W. Rhind (Eds.), Geographical information systems: Principles, techniques, management and applications (vol. 1, pp. 49–60). New York: Wiley. Pilz, J., and Spöck, G. (2008). Why do we need and how should we implement Bayesian kriging methods. Stochastic Environmental Research and Risk Assessment 22, 621–632. Retrieved 27 January 2022 from https://doi. org/10.1007/s00477-007-0165-7 Reid, C.E., Jerrett, M., Tager, I.B., Petersen, M.L., Mann, J.K., and Balmes, J.R. (2016). Differential respiratory health effects from the 2008 northern California wildfires: A spatiotemporal approach. Environmental Research 150, 227–235. Retrieved 27 January 2022 from https://doi.org/10.1016/j.envres.2016.06.012 Rey, S. (2008). Show me the code: Spatial analysis and open source (Munich Personal RePEc Archive Paper #9260, p. 20). Retrieved 14 June 2020 from https://mpra.ub.uni-muenchen.de/9260/ Richardson, L.F. (1961). The problem of contiguity: An appendix to the statistics of deadly quarrels. General Systems Yearbook 6, 139–187. Roco, M.C., Bainbridge, W.S., Tonn, B., and Whitesides, G. (Eds.) (2014). Convergence of knowledge, technology and society. New York: Springer. Schuurman, N. (2000). Trouble in the heartland: GIS and its critics in the 1990s. Progress in Human Geography 24(4), 569–590. Retrieved 27 January 2022 from https://doi.org/10.1191/030913200100189111 Sementov-Tian-Shansky, B. (1928). Russia territory and population: A perspective on the 1926 census. Geographical Review 18(4), 616–640. Sheppard, E. (1995). Sleeping with the enemy, or keeping the conversation going? Environment and Planning A 27, 1026–1028. Shi, W. (2010). Principles of modeling uncertainties in spatial data and spatial analyses. Boca Raton, FL: CRC Press. Sieber, R. (2006). Public participation geographic information systems: A literature review and framework. Annals of the Association of American Geographers 96(3), 491–507. Steinitz, C., Parker, P., and Jordan, L. (1976). Hand-drawn overlays: Their history and prospective uses. Landscape Architecture 66(5), 443–455. Tobler, W.R. (1979). Smooth pycnophylactic interpolation for geographical regions. Journal of the American Statistical Association 74, 519–530. Tomlin, C.D. (1990). Geographic information systems and cartographic modeling. Englewood Cliffs, NJ: Prentice-Hall. Tomlinson, R.F. (1984). Keynote address: Geographical information systems – A new frontier. Proceedings International Symposium on Spatial Data Handling (vol. 1, pp. 2–3). Zurich, Switzerland. Walter, C. (2020). Future trends in geospatial information management: The five to ten year vision. 3rd ed. 
Technical Paper for the United Nations Committee of Experts on Global Geospatial Information Management (UN-GGIM, p. 68). London: Ordinance Survey of Great Britain. Retrieved 27 January 2022 from https://ggim.un.org/documents/DRAFT_Future_Trends_report_3rd_edition.pdf Wright, J.K. (1936). A method of mapping densities of population. The Geographical Review 26, 103–110. Xiong, C., Zhang, L., Stewart, K., Fan, J.C., Lee, M.H., and Zhou, W.Y. (2018). Analyzing travelers’ response to different active traffic management (ATM) technologies (p. 37). University of Maryland Transportation Institute Report to Maryland Department of Transportation State Highway Administration MD-18-SP709B4K. Retrieved 27 January 2022 from https://www.roads.maryland.gov/OPR_Research/MD-18_SHA-UM-4-40_ATM_Report.pdf Yeh, A.G-O., and Li, X. (2006). Errors and uncertainties in urban cellular automata. Computers, Environment and Urban Systems 30, 10–28. Zandbergen, P.A., and Ignizio, D.A. (2010). Comparison of dasymetric mapping techniques for small-area population estimates. Cartography and Geographic Information Science 37(3), 199–214. Zhang, J. and Goodchild, M.F. (2002). Uncertainty in geographical information. London: Taylor and Francis.

2.  Locating spatial data in the social sciences
Jonathan Reades

1. INTRODUCTION

The past twenty years have seen a transformation of the social sciences: data has gone from being hard to collect at scale to being seemingly ‘open and everywhere’ (Arribas-Bel, 2014). The past decade has been a good time to be a computationally capable researcher—as the movement of computer scientists and physicists into the social sciences has demonstrated (O’Sullivan and Manson, 2015)—but a challenging time to be a ‘classically trained’ social scientist trying to make sense of this brave new world of ‘big data’ and machine learning. Moreover, a lot of this data has also been, as Arribas-Bel (2014) also noted, ‘accidental’ in the sense that it was never really designed to support robust (spatial) analysis. Many of these new forms of data are behavioural—generated as a by-product of human activity such as phone calls or travel-card use (for example, Reades et al., 2016)—meaning that they are rarely, if ever, straightforward to collect, organise and interpret; but they also promise to help us bridge what has long been a critical gap in the social sciences: What do people actually do when social scientists aren’t watching them? However, the starting point for such work is often just working out how to deal with the data in the first place: How do we extract and store it, project and map it, and interpret or analyse it? These issues are rarely, if ever, covered in a bog-standard ‘statistics for social scientists’ class where the data sets are often pre-cleaned, pre-packaged and pre-interpreted without any reference to geography or to the kinds of problems that you might encounter as a student undertaking independent research for the first time . . . or the fifth. So this is actually a chapter about questions, not answers. I hope to show you that there is no ‘right’ way to tackle geospatial data analytics but, rather, a series of choices that need to be articulated, made and documented based on the interactions between the research question and the data. Through the lens of a tweet, we’ll consider the various ways that ‘location’ can be extracted from the sorts of data that social scientists use in order to tackle the kinds of questions that social scientists ask. We will begin by looking at the basic classes of spatial data—points, lines and polygons—and consider some of the most common challenges that a social scientist might expect to encounter in handling coordinate data. We then turn to locational data embedded in free-form text since, although it is often overlooked, it is likely to be encountered in archival work as well as when trying to perform ‘straightforward’ (that is, not straightforward at all!) address matching. These point us towards more subtle questions around relationships—across time as well as space—between observations. A section on temporal analysis and the deceptively simple problem of how to define a ‘neighbourhood’ rounds out the fundamental conceptual challenges facing the novice analyst and positions us to think about other, more practical, challenges. So the fourth section considers some of the nitty-gritty of how we access and disseminate spatial data so that we can both

build on the work of others and share our own work in ways that support others. Finally, there is a gentle introduction to the kinds of analytical challenges that spatial data present when we fail to consider the how of data generation and the why of statistical analysis. By the end of this chapter, I hope that you will have had a gentle (if rapid) introduction to key concepts and terminology in spatially enabled social science, have developed an understanding of the range of contexts in which spatial data can be located (pardon the pun) and have developed a basic appreciation of the analytical challenges posed by such data.

2. WHERE’S WALLY?

So, with these ideas in mind, let’s start with a seemingly straightforward question: How many ‘bits’ of spatial data are contained in the tweet below?

‘Love this photo of Central Park #I♥NYC’ — @jreades; Location: Walthamstow, GB; Geo-tag: 43.653°N, 79.383°W; Tweeted at 01:30 (GMT-5).

Why don’t you take a minute to write some down? I make it that there are at least four, and we’ll tackle each in turn to see how they shed light on the issues I’ve outlined above. A more philosophical, and less data-focussed, version of this approach can be found in Crampton et al. (2013).
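As a purely hypothetical sketch (this is an illustration, not the actual Twitter API payload), the tweet can be thought of as a small record whose fields each carry a candidate piece of spatial information:

# Hypothetical representation of the tweet; field names are invented for
# illustration and do not reflect the real Twitter API.
tweet = {
    "text": "Love this photo of Central Park #I♥NYC",  # place name in free text
    "user_location": "Walthamstow, GB",                 # self-reported profile field
    "geo": (43.653, -79.383),                           # device geo-tag (lat, lon)
    "created_at": "01:30 GMT-5",                        # timing hints at a time zone
}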

3. WORKING WITH COORDINATES

The most obvious piece of spatial information in the tweet is the coordinate pair from the geo-tag: 43.653°N, 79.383°W. Being supplied with explicitly spatial data often seems ideal because it gives the impression that no further thinking is required: this is where the event happened, end of story. In fact, the decimal latitude and longitude are for a point in Toronto, Canada—near City Hall and Nathan Phillips Square, if you must know—which featured nowhere in either the text of the tweet or its other metadata (that is, the user-specified location). We’ve not even begun to think about the content of the tweet and already we need to decide if a tweet about New York is relevant if it is sent from Toronto! So, while it is true that Latitude, Longitude and Elevation can theoretically be used to uniquely locate any point or event on the planet, as soon as humans get involved things get a lot more complicated. We have a nasty habit of assuming that spatial data from computers is somehow ‘true’: when we see latitude and longitude to four, six or eight decimal places it seems . . . accurate. We are then minded to accept these very precise numbers as ‘truth’ when, in reality, eight decimal places would imply that our GPS was accurate to within 1 mm! It isn’t, not by a long shot, but that sense of ‘the (big) data is always right’ is presumably why people still follow their GPS into the river despite the many warning signs en route. So this is the simplest possible representation of this data—a point in space—and we are still forced to make choices about ‘meaning’! One plausible choice that a researcher could make is to focus solely on tweets that are made within the five boroughs of New York City; we could then use this to determine whether there are hotspots of tweeting activity and try to connect these to aspects of New York life . . . But note that eminently sensible decisions lead

to this tweet being dropped from the analysis, even though it is clearly about Central Park in New York City.

3.1  Get to the Point: Types of Spatial Data

The point is only one of the three basic building blocks of spatial data; the second is the line. A line is composed of two (or more) points in a sequence: if you were drawing a line in a notebook then there would always be a ‘first’ point and a ‘second’ point, even if your drawing has no meaningful direction. It is the same in a spatial data set. So another option would be to treat this tweet as one point in a sequence: we could link up all of the points where @jreades tweeted as a set of lines (each consisting of a pair of points). Doing so, we could create a travel history for @jreades and get a sense of how he, or she, moves about, regardless of whether or not they live in New York. And a polygon—which is a sequence of lines where the end point of the last line is the same as the starting point of the first one—could be drawn around all the points where @jreades tweeted (the technical term would be ‘Convex Hull’) to give a sense of their ‘territory’ or ‘range’. So far, so straightforward, even if these might seem like fairly poor means of capturing the many rich ways in which humans experience and represent space. However, each of these classes—points, lines or polygons—may also be single- or multi-part: a single-part geometry is one where each feature has its own attributes, and a multi-part geometry is one where several features that we see as separate ‘things’ on the map all share the same attribute. The polygon delimiting New York City’s boundaries is a multi-part geometry because Manhattan and Staten Island are not contiguous with the other three boroughs. It is the same for London because North and South London are separated by the Thames. Rather less obviously, you might think that a road is a good candidate for a single-part feature because we know from driving along roads that they are continuous ‘things’. . . However, Britain’s A11 highway is a multi-part feature because some stretches have been upgraded to a motorway (and renamed the M11) while other stretches have not! Furthermore, a highway that is best represented as a line at one scale (zoomed out) might be better represented as a set of lines, or even polygons, at another (zoomed in) because it is actually two carriageways with multiple lanes separated by a meridian. Social scientists will come face to face with representational problems when working with geospatial data because the real world is much messier than the computer’s tidy world of points, lines and polygons: London’s Paddington Station is actually made up of two Tube stations linked by a rail station. So, if we looked at smart card data from Transport for London and just totalled up tap-ins and tap-outs at Paddington Station, then we might seriously overestimate the number of people using this interchange: some passengers will be double counted when all they are doing is transferring between lines! How your data is generated and how you choose to represent it matters, so it is important to think about what you’re trying to achieve when working with geospatial data: Do you need to know where racially motivated crimes happen within a city, or only that they happen within a city?
The difference between those two questions might structure everything about your work: what methods you use, what types of statistical analysis you can perform and what kinds of findings you can present. Thinking more deeply about the role that space plays in your work is critical to undertaking good social science research, and there are complex feedback effects between the types of data collected and the types of knowledge generated.
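A minimal sketch using the shapely library shows what these building blocks, including the convex hull mentioned above, might look like in code; the coordinates and the study-area polygon are invented for illustration.

# Minimal sketch of points, a line and polygons using shapely; all coordinates
# here are invented stand-ins for a user's geo-tagged tweets.
from shapely.geometry import Point, LineString, MultiPoint, Polygon

tweets = [(-79.383, 43.653), (-79.380, 43.650), (-79.395, 43.661)]  # (lon, lat)

points = [Point(xy) for xy in tweets]        # each tweet as a point
trajectory = LineString(tweets)              # the tweets joined in sequence
territory = MultiPoint(tweets).convex_hull   # a polygon 'range' around them

# A crude, made-up bounding polygon standing in for a study area such as a city
study_area = Polygon([(-79.6, 43.5), (-79.1, 43.5), (-79.1, 43.9), (-79.6, 43.9)])

print(all(p.within(study_area) for p in points))  # do all tweets fall inside?
print(trajectory.length, territory.area)          # note: the units here are degrees!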

3.2  You’re Projecting: Going From Round to Flat

The latitude–longitude coordinates in @jreades’ tweet were most likely generated from the phone’s GPS, in which case they were ‘recorded’ using the top-notch WGS84 standard: the World Geodetic System reference (1984), which uses data about the Earth’s gravitational and magnetic fields to locate you anywhere on the planet with astonishing accuracy. But, rather confusingly for the novice researcher, you don’t make maps in WGS84, and that’s because the world is round and your map is flat. Conceptually, it boils down to this: take a ball and then try to wrap a piece of paper around the ball without leaving any creases or folds. You might be able to wrap the paper around part of the ball without a crease but you certainly won’t be able to wrap it the whole way around. There are many different ways to wrap the paper around the ball: we can start in different places so that the creases appear in different areas; we can try to minimise the total number of folds but have some big ones at the edges; or we could have lots of very, very small folds so that there are folds everywhere, but no big distortions of the paper anywhere. The more of the Earth that we want to show on a map, the more approximate the mapping (the more ‘creases’ we need). Figures 2.1a and 2.1b depict two world maps: both are accurate, but it should also be obvious that they are both also, in some sense, wrong. So a world map is everywhere inaccurate to some extent, and because of this many countries developed their own projections: these are more accurate within a country (fewer creases in one area) at the price of being much less accurate everywhere else. For instance, in Great Britain we typically use the British National Grid (EPSG:27700), but since America is a rather larger country there are multiple projections because New York City (New York Long Island; EPSG:2263) is a long way from San Francisco (San Francisco CS13; EPSG:7132) or Honolulu (Hawaii zone 3; EPSG:6633). You’ll notice that, after each example, I listed an EPSG number: EPSG stands for the European Petroleum Survey Group, and every widely recognised projection will have its own unique EPSG number so that two mapmakers can be confident that they are handling their data in the same way. The origins of spatial data are strongly associated with natural resource management and extraction: perhaps you can imagine why these firms would be interested in accurate geospatial data? Figure 2.2 depicts an example of the world ‘mapped’ in QGIS (2021) using the British National Grid (BNG); notice the distortion of North and South America, and the way that parts of Asia are wrapping around the edges of the map to create large blocks of gray. The BNG projection becomes less accurate as you move away from Great Britain: the mathematical transformations that ensure the accuracy of data in the vicinity of the British Isles start to break down. In addition, it is worth noting that BNG uses metres even though speed limits and distances are usually specified in miles.

Source: Data CC BY-SA 3.0, Sandvik (2009)

Figure 2.1a  Lambert azimuthal equal area projection

Source: Data CC BY-SA 3.0, Sandvik (2009)

Figure 2.1b  Van der Grinten projection

Source:  Data CC BY-SA 3.0, Sandvik (2009)

Figure 2.2  The World in British National Grid

So, if you tell your Geographic Information Systems (GIS) application that the spatial data you've just loaded is in one projection or coordinate system when it is actually recorded in another, then your data can end up in the wrong part of the world, or even not on the map at all! Worse, projections aren't just about the part of the world you want to map, they are also about the units used to record the data: some countries record locations using metres and some use miles . . . and some, like Great Britain, use both! So you can also easily encounter issues of scale as well as location: you think something's in kilometres, but it is actually in degrees or miles! Loading spatial data and seeing nothing at all on the map is the fastest way to end up thinking that geo-data and spatial analysis aren't worth the hassle, but if you can bear to deal with—and debug—problems with projections, then it is often smooth(er) sailing ahead!
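To make the distinction between assigning and transforming a coordinate reference system concrete, here is a minimal sketch using Python's geopandas library; the coordinates, user column and choice of EPSG codes are illustrative rather than taken from the chapter:

```python
import geopandas as gpd
from shapely.geometry import Point

# A point captured by a phone's GPS: longitude/latitude recorded in WGS84 (EPSG:4326).
tweet = gpd.GeoDataFrame(
    {"user": ["@jreades"]},
    geometry=[Point(-73.97, 40.78)],   # illustrative coordinates near Central Park
    crs="EPSG:4326",                   # *assign* the CRS the data was recorded in
)

# *Reproject* to the local New York Long Island projection (EPSG:2263) for mapping.
tweet_ny = tweet.to_crs(epsg=2263)
print(tweet_ny.geometry.iloc[0])       # coordinates are now in feet, not degrees

# The classic mistake: relabelling the CRS without transforming the coordinates.
# This silently leaves the data in the wrong part of the world (or off the map).
# wrong = tweet.set_crs(epsg=2263, allow_override=True)
```

The same assign-versus-transform distinction appears in desktop GIS (QGIS separates 'Assign projection' from 'Reproject layer') and in R's sf package (st_set_crs() versus st_transform()).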

4. WORKING WITH TEXT(S)

In addition to the coordinate data embedded in the geo-tag, we also had two other types of locational data in @jreades' tweet: (1) the Location specified by the user in their Twitter profile; and (2) the reference to Central Park in the body of the tweet. I distinguish between these two pieces of text because, in the first case, we know from the field name ('Location') that the text should contain some kind of spatial identifier, whereas, in the second case, we have no way to be sure (without reading the tweet) that any geographical information will be found. For a single tweet this isn't an issue, but if we're trying to find locations in a book or in a collection of tweets amassed over a period of months, then we will need to enlist the help of a computer.

Source:  CC-BY-SA, Fischer (2010)

Figure 2.3  Locals and tourists: London

In the map of London shown in Figure 2.3, Eric Fischer has contrasted the geo-tagged coordinates of photos posted to Flickr with the home location specified by the users who uploaded them: blue for locals, red for tourists and yellow for undetermined. (The reader is advised to look at the color figures in the e-chapter for more clarity.) But note that, unlike coordinate data—which is unambiguous, even if easy to misinterpret—textual data on a 'home location' could be missing (yellow), incomplete (yellow), mis-spelt (yellow), in a 'slang' form (yellow) or even indeterminate, because there are multiple Delhis (there's one in Canada) or Berlins (there used to be one in Canada)!

4.1  Look It Up! The Value of Gazetteers

Knowing which bit of text contains locational information simplifies things enormously because we can be a lot 'stupider' in our approach: although the data may vary in its intelligibility and accuracy, for the type of work that Fischer did above the fact remains that there are only 200-odd countries in the world (depending on what counts as a country!). For a computer, checking short snippets of text against a small-ish dictionary of valid country names and abbreviations is easy: it can probably do tens of thousands of lookups every second when there are only a few hundred entries to check (a minimal sketch of such a lookup follows Table 2.1)! Manually assembling a short list of valid country names is fairly easy, but what about tens of thousands of place names? For social scientists working with text, the term 'gazetteer' will become a familiar one. Gazetteers are often produced by crowd-sourcing data on place names, and they typically provide one or more lookup 'names' (sometimes in multiple languages or multiple spellings) against which a piece of text can be matched in order to place it not only on a map, but also within some kind of formal ontology: a town, a national forest and other commonly used classes. This would likely be the best way for the computer to work out that Central Park is a location in New York City, and that Walthamstow is an area within the London Borough of Waltham Forest.

With unstructured data—such as a book or document—gazetteers can be an essential part of the mapping process, and they are particularly useful when dealing with historical data incorporating places, or spellings, that no longer exist (Walthamstow began life as 'Wilcumestowe', the Place of Welcome). Gazetteers allow us to put historical places on contemporary maps, but they also allow us to put these places in a spatial and temporal context—via linked entities or entries—that incorporates useful details such as an historical parish or other administrative unit. For instance, the entry for Skara Brae in Scotland looks something like this (see Table 2.1):

Table 2.1  The gazetteer for Scotland

  Name                       Skara Brae
  Type                       Archaeological Site
  Built                      3100 BC
  Text of Entry Updated      05-FEB-2016
  Latitude                   59.0492ºN
  Longitude                  3.3456ºW
  National Grid Reference    HY 229 188
  Source                     Clarke, David (2000) Skara Brae. Historic Scotland, Edinburgh
  Linked Entries             St Peter's Church, Orkney, Sandwick, Douby
  Features                   Orkney Mainland; Skaill House; Skaill, Bay of

Source:  Gittings (n.d.)
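As a concrete illustration of the dictionary-style lookup described above, here is a minimal sketch in Python; the tiny gazetteer and the sample profile strings are invented for illustration, and a real gazetteer (GeoNames, for instance) would hold millions of entries:

```python
# A toy gazetteer: canonical names plus common variants and abbreviations.
GAZETTEER = {
    "united kingdom": "United Kingdom",
    "uk": "United Kingdom",
    "great britain": "United Kingdom",
    "london": "United Kingdom",
    "united states": "United States",
    "usa": "United States",
    "new york": "United States",
    "nyc": "United States",
}

def lookup(location_field: str) -> str:
    """Match a free-text 'Location' field against the gazetteer."""
    text = location_field.strip().lower()
    if text in GAZETTEER:                          # exact match first
        return GAZETTEER[text]
    for token in text.replace(",", " ").split():   # crude token-level fallback
        if token in GAZETTEER:
            return GAZETTEER[token]
    return "undetermined"                          # missing, mis-spelt, slang or ambiguous

profiles = ["London, UK", "nyc baby!", "Mars", ""]
print([lookup(p) for p in profiles])
# ['United Kingdom', 'United States', 'undetermined', 'undetermined']
```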

Each of the items in Table 2.1 can, in turn, be linked to other entries in the gazetteer for additional information, allowing us to convert plain-old text into something much richer and more structured; however, see Gittings (2009) for both a perspective on the value of (geo)text and a fuller discussion of the challenges entailed in regularising something as 'simple' as a place name. Aside from the impossibility of recording information about everything, everywhere, gazetteers have tended—historically, at least—to represent most features as points, since that made them easy to download as flat text files. This limitation is no longer relevant, but many gazetteers are still presented in this way, and we are (again) back into questions of the appropriate representation of spatial features in data.

Locational data can be found embedded in all forms of structured and unstructured text: webpages, letters in archives, the institutions associated with Ph.D. theses and so on. The location might be an actual address, a town or village, an institution or crossroads, but if we can distinguish these spatial 'terms' from the rest of a document then we can begin to treat them as geographical data and start to ask spatial questions! One area where I expect to see significant change in the near future is in the growth of geographically aware Natural Language Processing libraries that are able to extract spatial information directly from source texts without relying quite so heavily on what are basically enormous dictionaries.

4.2  Answers on the Back of a Postcard: Using Address Data

Gazetteers might seem clunky if you've got access to address data! However, the latter presents other challenges to the social scientist, particularly where historical data is concerned. Some time ago I was approached by a political economist wanting to map Soviet-era factory data that researchers had painstakingly collected and systematised with a view to better understanding how the communist production system worked. The challenges here were manifold: since the Soviet Union no longer exists, many of the places no longer do either—city names have been changed, streets no longer refer to heroes of the Marxist−Leninist revolution and the factories have long since ceased operation—and there is no contemporary source of data against which to validate these locations. Worse, the addresses had been inconsistently recorded: abbreviations, incomplete postcodes, multiple addresses for a single factory, multiple factories (clearly in different cities) stored with a single address field . . . and that is before you consider the inevitable data entry errors resulting from the fact that it was input by human beings.

None of this is to criticise the enormous and admirable amount of work that went into collecting such data; rather, I wish to highlight the challenges that can be expected: to imagine that 'address matching' is a straightforward process that can be 'tacked on' later is to set yourself up for frustration, and possibly failure. Should we abandon this effort? Absolutely not. Should we be rather more conservative in our estimates about the amount of time and energy that will be consumed by 'getting the data into shape'? Absolutely. A more subtle consideration is that this kind of data is often only useful if you have access to the appropriate reference information.
And access to such reference data can be a big if: in the United Kingdom, for instance, when the government privatised the Royal Mail it also, rather carelessly, privatised the Postal Address File that is the canonical source of all address-related information. Suddenly, thanks to weak thinking about data governance, the real boundaries of postcodes as well as the exact location of Harry Potter’s 4 Privet Drive, Surrey (actually 12 Picket Post Close, Berkshire), were a source of additional income and under the control of a for-profit entity.

Rather unsurprisingly, the company is not particularly interested in making such commercially valuable data readily available to the world. As an academic, I have a licence to access the postcode boundary files, but I cannot publish any data that I derive using those boundaries without falling foul of the licensing terms. The effect of this is to 'pollute' the analytical pipeline in a way that makes the re-use of address data for other purposes profoundly problematic. Bizarrely, it is actually safer for me to work with the less accurate postcode centroids (the inferred middle of each postcode polygon) since they are not covered by the same onerous licensing terms.

The decision on whether to use privileged data as part of an analysis is nearly always one to be taken on a case-by-case basis: if others cannot replicate your findings or make use of your code/analysis, then this represents an impoverishment of the wider research landscape (including private, public and third-sector work). You may even—as I've experienced—end up on the wrong side of your own research: some time ago I desperately wanted to revisit some earlier work that I had done with the more sophisticated techniques that I now understand because I think that there is a lot more that I could do. But I can't, because as part of my condition of access, the data had to be destroyed once the initial work was done and the company that supplied it doesn't retain data for years. This kind of constraint may be worth accepting when, say, the accuracy or timeliness of results is paramount. However, as a result of my own experiences I have, personally, increasingly prioritised the openness of my work and my data on the basis that this is the best way for others to learn, to critique and to enhance. My point is not that there are no justifiable reasons for making other choices in this regard—some of my earliest work with telecoms data simply could not have been done had openness been required (Reades et al., 2007; Reades and Smith, 2014)—but that consideration should be given to how the data might be used now . . . and in the future.

Address-type data can also present challenges to confidentiality: Prof. Latanya Sweeney managed not only to identify the Governor of Massachusetts in an 'anonymous' health data set using nothing more than his date of birth and zip code, but has gone on to show that publicly accessible profiles in the Personal Genome Project can be de-anonymised more than 80 per cent of the time (Berkeley School of Information, n.d.; Sweeney et al., 2013)! Maintaining the privacy of sensitive records requires enormous care, and failure to understand how easily purportedly anonymous individuals can be re-identified is a major reason why social scientists can find their discussions with institutions or corporations around data access going nowhere fast. Researchers should seek to obtain the lowest resolution that still supports their analytical objectives: in spite of what I say below about the Modifiable Areal Unit Problem, linking individual data only to standard Census-type geographies and asking for attributes such as age to be grouped into bands is the only way to moderate this risk.
4.3  Linked In: Joining Spatial and Non-Spatial Data

The process of looking up information in other databases or data sets is a kind of data linkage: we join two data sets together by matching information from one data set to information in another in order to gain access to its (geospatial) features. When making a map, many of the data sources (including the Census) upon which social scientists depend come in a 'tabular' form, such as an Excel or CSV (Comma-Separated Values) file, which we must join to another data set containing spatial information. So how do you join your spatial and non-spatial data files, and what do you need to look out for?

Table 2.2  Census data table

  2011 Super Output Area – Lower Layer   Mnemonic    All usual residents   …
  Camden 001A                            E01000907   1430                  …
  Camden 001B                            E01000908   1581                  …
  …                                      …           …                     …
  Camden 028D                            E01000919   2014                  …

Source:  Office for National Statistics licensed under the Open Government License v.1.0

National statistics and mapping agencies are used to this problem, so they typically provide data in a standardised format that is designed to be easy to link up. Table 2.2 displays a tiny extract of data from the 2011 UK Census provided by the Office for National Statistics. The two fields of interest here are the '2011 Super Output Area – Lower Layer' and 'Mnemonic': these both refer to the same spatial feature—a Lower Layer Super Output Area (LSOA, for short), which is similar to a Census tract in the United States—and then attach a number of attributes about that feature (for example the number of usual residents, the number of women, the number of people aged 16–24 . . .). These two fields are unique by design because they are the key to joining your data set to a geo-data file.

Why have two fields? The Mnemonic is a fixed-width code (that is, it is always nine characters long), so it is efficient to store and it is guaranteed to be unique. The human-readable LSOA name is easier to understand, but the field must be as long as the longest Local Authority name in the United Kingdom plus the four-digit identifier that ensures uniqueness. This added flexibility not only takes up more storage space on a computer (though not a lot for most modern computers), but it is also slower to join and makes it harder to tell if you've made a small mistake that might have led to mis-matches. If in doubt, it is best to go with the encoded form.

In order to make a map we need to join this data file to a spatial data file containing points, lines or polygons. Table 2.3 shows, in rather abbreviated form, what the matching rows from one type of spatial data file might look like.

Table 2.3  List of features

Selected JSON features for LSOA geometry:

  { "type": "Feature",
    "properties": { "fid": 1, "lsoacd": "E01000907", … },
    "geometry": { "type": "MultiPolygon",
      "coordinates": [ [ [ [ 528920.9, 186917.1 ], [ 528935.1, 186831.6 ],
                           [ 528940.3, 186801.8 ], …, [ 528920.9, 186917.1 ] ] ] ] } },
  { "type": "Feature",
    "properties": { "fid": 891, "lsoacd": "E01000908", … },
    "geometry": { "type": "MultiPolygon",
      "coordinates": [ [ [ [ 528559.0, 186904.2 ], [ 528561.9, 186861.9 ],
                           …, [ 528559.0, 186904.2 ] ] ] ] } },
  { "type": "Feature",
    "properties": { "fid": 902, "lsoacd": "E01000919",
                    "LSOA11CD": "E01000919", "LSOA11NM": "Camden 028D", … },
    "geometry": { "type": "MultiPolygon",
      "coordinates": [ [ [ [ 530052.3, 181554.1 ], [ 530108.0, 181450.0 ],
                           [ 530136.0, 181456.0 ], …, [ 530052.3, 181554.1 ] ] ] ] } },

Source:  Contains Ordnance Survey data © Crown copyright and database right 2021

Notice how the coordinates in the example are shown as pairs (for example [528559.0, 186904.2])? These are the x and y of each point (also known as a vertex) in the polygon. The last pair of coordinates in each polygon is the same as the first pair because the polygon must be closed. You join records in each data set together by telling the computer how to make matches between the two files; in this case it is the Mnemonic column in the data file and the lsoacd field in the spatial data file. Your GIS application or coding libraries will (hopefully) understand how to extract possible matching names from the format above and list the non-geometry fields as options for linking the two files.

On occasion, when working with a GIS application such as ArcPro or QGIS, or with code in Python and R, you may be asked whether a join is 1:1 or 1:n. While daunting, this question can be better understood as: could 'things' in one data file have multiple matches to 'things' in the other data file? As a heuristic, statistical and other formal data releases by government agencies tend to involve 1:1 links because the data has been cleaned and organised for you, while 'messy' data collected via scraping or ad hoc Freedom-of-Information (FoI) requests—as well as data involving mis-matched scales—tends to be 1:n because the relationship is not straightforward. So if you have Census data and a Census spatial data file then the answer will nearly always be 1:1 because the files are designed that way: for each Census zone there should be one, and only one, match between the data set and the spatial data set. In practice, with 1:1 joins there is exactly zero or one matching row for each row in the spatial data file. If more than one match is found, then only the first match is kept and all other matches are (silently) discarded. But if, for instance, you wanted to look at Airbnb listings in a densely populated urban Census area, then you'd expect there to be many matches between listings and a Census area, so that would be a 1:n join. So 1:n joins are normally encountered when you are aggregating or grouping data together, such as when we want to calculate the total number of people living in a borough from the number of people living in each LSOA or Census tract. If you don't see what you were expecting (and it is not a projection issue) then the specification of the join is another likely culprit.

Spatial joins—also known as 'joins by location'—are a special case of 1:n joins because we don't use a pair of columns in the data files to make the match, we use actual locations instead: we might take the number of crimes (recorded as points) and use a spatial join to count the crimes within each LSOA (recorded as a polygon). You will normally also need to tell the computer whether and how to calculate derived variables based on the matches it makes: for example, sum the number of crimes together. It is also common to join polygons to polygons or lines to polygons, but beware: what should your GIS do if a line crosses two or more polygons? Even if it is 1 mm of a 2 km line that crosses over, as far as the GIS is concerned that is enough to give you a 1:n join! For this reason, spatial joins are hard work for computers: it is much faster and more accurate to match on fields than on geometries.
If you have to do a (big) spatial join then it is common practice to take the centroids—the 'centres' of a line or polygon—of one data set before performing the join: with individual points you're much more likely to end up with them falling inside only one polygon. Although there are clearly times when this is not the right strategy, because calculating points in polygons is fast there are many situations where this is the best way to join data sets from different—or differently scaled—geographies.
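To make the two kinds of join concrete, here is a minimal sketch using pandas and geopandas; the file names are placeholders, and the column names (Mnemonic, lsoacd) follow the examples above:

```python
import pandas as pd
import geopandas as gpd

# --- Attribute join: match on fields (fast and unambiguous) ----------------
census = pd.read_csv("census_extract.csv")       # 'Mnemonic', 'All usual residents', ...
lsoas = gpd.read_file("lsoa_boundaries.gpkg")    # 'lsoacd' plus polygon geometries

joined = lsoas.merge(
    census,
    left_on="lsoacd",
    right_on="Mnemonic",
    how="left",
    validate="one_to_one",   # raise an error if the join is unexpectedly 1:n
)

# --- Spatial join: match on location (slower, and often 1:n by nature) -----
crimes = gpd.read_file("crimes.geojson").to_crs(lsoas.crs)   # points, same CRS as polygons

# Older geopandas versions use op="within" instead of predicate="within".
crimes_in_lsoa = gpd.sjoin(crimes, lsoas, how="inner", predicate="within")
crime_counts = crimes_in_lsoa.groupby("lsoacd").size().rename("n_crimes")

# --- The centroid trick for polygon-to-polygon joins ------------------------
# Represent each small polygon by a single interior point, then do a fast
# point-in-polygon join rather than an expensive polygon overlay.
lsoa_points = lsoas.copy()
lsoa_points["geometry"] = lsoa_points.representative_point()
# (.centroid is the conventional choice; representative_point() is guaranteed
#  to fall inside the polygon, which centroids of oddly shaped areas are not.)
```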


5. MAPS ARE MANIFOLDS: INCORPORATING TIME

Finally, although it is not obviously a piece of geospatial information, 01:30 (GMT-5) is nonetheless inherently spatial as well. Most obviously, if we slice up the original data by date and time we can start to ask questions like: Where and what do people tweet after midnight? Where do they tweet after noon on Saturday, or on Tuesday? Leaving to one side questions about who is tweeting at 1:30 in the morning, the potential to segment behaviours—tweeting, travel, complaints, crimes—in time and space with increasing granularity is why geospatial analysts are in such demand: they can make sense of these spatio-temporal patterns and, with luck, their socio-economic context.

But there's also a deeper question that can be brought to light by asking: How far is it from Manhattan, New York to Hoboken, New Jersey? Assuming that you know where these two places are, then chances are that you instinctively came up with an answer that was measured in minutes, not miles or kilometres. Travel time is a perfectly legitimate way to measure distance, but it already contains several built-in assumptions: Were you intending to travel by train or by car? At rush hour on Monday, or early on Saturday? And did you mean from the middle of Manhattan to the middle of Hoboken, or from water's edge to water's edge? In fact, if you are measuring from administrative boundary to administrative boundary then the distance between these two areas is zero, because Manhattan and Hoboken touch in the middle of the Hudson River!

Many of us instinctively frame and understand a question about space in terms of the experience of a journey and not the number of units of distance traversed. An extreme illustration of this effect would be the so-called 'postcode gangs': for some young people in cities like London or New York, leaving your hyper-local area can mean taking your life in your hands. Distance for a gang member is not remotely linear: a job opportunity a few blocks away, if it is on another gang's 'turf', might as well be on Mars. This issue should clue us in to the idea that geospatial analysis is as much about choosing the best representation of space from amongst many possible representations as it is about measuring it precisely afterwards.

Riffing on the statistician George Box (Box, 1979), we could say that 'all models of space are wrong, but some are useful'. So if the question is 'how far is it from 51st and 2nd to that bar in Hoboken?' then 'It will take zero minutes to get there because the two counties are adjacent' isn't a very useful representation, because our question isn't about administrative units, it is about travelling between points on a transportation network. But if we're interested in the impacts of Manhattan's real estate market on its neighbours, then the fact that Hoboken is 'adjacent' is profoundly relevant. Similarly, if the question is 'how far is it from the gang's home turf to the job in Pret-À-Manger?', then 'it is just 750 m' is also not a very useful representation because it ignores the way that distance is experienced by a former gang member.

5.1  There Goes the Neighbourhood: Thinking About Near and Far

This inevitably brings up the issue of how to define 'near' and 'far'. Usually, near is defined as something 'falling within the neighbourhood of the feature of interest', but that is hardly a robust definition.
A slightly better definition would be the distance—using the most appropriate representation—over which we expect some kind of interaction to occur. The epidemiologist, public health researcher, gentrification researcher or political scientist will all define this interaction distance in different ways: it could be a housing market or an area with a distinct demographic profile (see, for example, the chapters by Delmelle, chapter 22, and Knapp, chapter 23, this volume), an electoral ward or an area experiencing a similar shift in voting intentions (for example, chapter 27 by Wolf, this volume), the area served by a particular doctor's practice or hospital, and so on. For example, if you are interested in commuting then your model might define a neighbourhood using mean (or median) commuting distance. If you are interested in public health then you might pick a radius around a point pollution source such as a factory or incinerator. Walkable neighbourhoods calculated for children, adults and the elderly will look very different, as might those calculated for New Yorkers and Houstonians. In short, we are focussing on effects, not units, and your definition might draw on a review of the literature, a hypothesis to be tested or a model of the process being studied.

Representing the neighbourhood in quantitative form can be a challenge because we need to calculate whether each and every location in the data set is near (within the neighbourhood) or far away (outside of the neighbourhood) from every other location in the data set. This is a matrix calculation, and it becomes very expensive when our data set is large. For instance, if distances between locations are symmetric (meaning that we can assume it is as far from B-to-A as it is from A-to-B) then for 1000 data points there are 'just' 499,500 distances to calculate. You might be able to think of cases where distances are not symmetric, but the key idea is that the quantitative definition of a 'neighbourhood'—which is very closely tied to the concept of a cluster (see chapter 14 by Helderop and Grubesic, this volume)—should never be confused with the 'fuzzy' term we use in conversation with one another.
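As a minimal sketch of what that matrix calculation looks like in practice, the snippet below builds a crude distance-band definition of 'near' for randomly generated points; the coordinates and the 2 km threshold are invented for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Projected coordinates in metres (illustrative values, not real data).
rng = np.random.default_rng(42)
coords = rng.uniform(0, 10_000, size=(1_000, 2))

# All pairwise distances: n*(n-1)/2 values, i.e. 499,500 for 1,000 points.
d = pdist(coords)
print(d.size)                      # 499500

# A crude 'neighbourhood': every pair of points within 2 km of one another.
W = squareform(d) <= 2_000         # boolean n-by-n neighbour matrix
np.fill_diagonal(W, False)         # a point is not its own neighbour
print(W.sum(axis=1)[:5])           # neighbour counts for the first five points
```

Dedicated libraries (libpysal's weights module, for example) build the same kind of neighbour structure far more efficiently, without materialising the full matrix in memory.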

6. WORKING WITH PRE-PACKS: DEALING WITH DATA

I have deliberately avoided talking about file formats and their gnarly details since (a) they're a bit boring, and so (b) they tend to be off-putting to readers. However, at some point we need to do this because it really does matter, and leaving it all for the discussion at the end of the chapter looks a lot like ending on a low, not a high. Pre-packaged data prepared for use with a GIS or programming environment most commonly comes in one of three formats: Shapefile, GeoPackage and GeoJSON.

6.1  I Hate Shapefiles (But They Are Hard to Avoid)

Shapefiles are a bit like the .docx document or .jpeg photo formats: they've been around for ever, so nearly everyone can read and write them, but nearly every professional hates doing so. For a start, Shapefiles aren't singular files! A Shapefile is actually composed of at least three separate files: the geometries in .shp, the shape index in .shx and the attributes in .dbf; however, you might also have a projection (.prj), a spatial index (.sbx) and metadata (.xml)! So if you want to share or back up a Shapefile then you need to collect all of those files into a Zip archive, and leaving even one of them out can leave you with terminally corrupted data. In addition, each Shapefile can contain only one data type: points, lines or polygons. So, although the fact that many platforms can work with Shapefiles makes it seem a good default choice, today there are much better options.
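Here is a minimal sketch of the multi-file problem, using geopandas; the layer and output file names are placeholders, and the single-file GeoPackage alternative is discussed next:

```python
import glob
import geopandas as gpd

gdf = gpd.read_file("lsoa_boundaries.gpkg")   # any polygon layer will do

# Writing 'one' Shapefile actually produces a small family of sidecar files...
gdf.to_file("lsoas.shp")
print(sorted(glob.glob("lsoas.*")))
# e.g. ['lsoas.cpg', 'lsoas.dbf', 'lsoas.prj', 'lsoas.shp', 'lsoas.shx']

# ...whereas a GeoPackage keeps everything, including the CRS and (optionally)
# several differently typed layers, inside a single file.
gdf.to_file("study_area.gpkg", layer="lsoas", driver="GPKG")
```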

The 'new kid on the block' is the GeoPackage: as the name suggests, it packages up all of your data in a format that looks to a computer like a single file. This makes it easier to share, back up and download, but there's a much bigger advantage: in addition to embedding the projection information, the GeoPackage can also include multiple layers with different types of geo-data, and some applications can also use it to store style information. So not only can you share your entire analysis (for example, the polygon showing the boundary of your study area and the points that fell within it and the aerial photography that covers it), but if you spent a lot of time making your results look good then in some cases those who download your data get that benefit as well!

The GeoPackage is a powerful file format because it is basically a compressed database, but this can be overkill and it is also not very 'user friendly' since it requires specialised software to read and write it. So a practical alternative to the GeoPackage is the GeoJSON file: this is effectively structured text (as you would have seen in the example above) intended to be easy for computers to parse without specialised software. GeoJSON can often be displayed directly in a web browser, and 'simple' web applications can still allow dynamic interaction with multiple layers, including panning and zooming, popups and custom iconography. What GeoJSON does not do well is scale: the other file formats support very large data sets, but GeoJSON cannot. So, once again, there is no 'right answer', only—as the English would say—'horses for courses': if you need to share your data widely and it is relatively limited in scale, then GeoJSON would be an excellent choice. If you want simplicity and elegance, then GeoPackage should be your format. And if you want to ensure that every possible user is supported, then there's still a role for the venerable Shapefile. These strengths and weaknesses clearly interact with those of the applications that make use of them: the limitations inherent in web browsers mean that the constraints of scale and complexity are encountered far sooner than with dedicated GIS applications, and those too will begin to fail long before you've 'maxxed out' the capacities of dedicated command line or database software such as GDAL and Postgres/PostGIS.

6.2  To Space and Beyond: New Sources of (Good) Data

The use of satellite and remotely sensed data has not, historically, been part of the social scientist's toolkit, but these sources of data are becoming more useful as both the resolution of the imagery, and the power of the computers to which we have access, improve (see the chapter by Arribas-Bel et al., 2022). Indeed, when we consider that systems such as Landsat now provide coverage dating back over 30 years, one can, with care, begin to construct big-picture histories of urbanisation and development even in areas where any number of factors—ability to collect, corruption and conflict or simply poor quality—might have undermined more traditional sources of data on people and places.

A second exciting source of data for places that, historically, were missing from our maps is OpenStreetMap (OSM), its humanitarian 'arm' (HOTOSM) and the allied 'Missing Maps' programme which emerged in the aftermath of the devastating 2010 earthquake in Haiti. This
This subsequently led to a number of initiatives to radically improve coverage of Africa and neglected parts of Asia for both research and aid relief purposes (see Figure 2.4). In general, I feel that this is a ‘good’ outcome, although a critically reflective social scientist would also recognise that the act of mapping nonetheless has a tendency to reproduce existing power relations: it is, for instance, much easier to find strip clubs and bars than baby-change stations or rape crisis clinics in OSM (Stephens, 2013; and see also Elwood and Leszczynski, 2018).


Source:  CC-BY Missing Maps, Sandvik (2009)

Figure 2.4  Missing Maps contributions in Myanmar

A related challenge with crowd-sourced and volunteered geographic information is that, in the absence of extensive validation and a strong ontology, human classification tends to vary in accuracy and consistency. So having mappers in one country assign land uses to features observed from space in another can produce wildly inaccurate assessments of what is being studied. In the case of HOTOSM/Missing Maps, there is an explicit hierarchy—mappers and validators—and this is further followed up (ideally) by trained staff on the ground who are able to assign locally relevant place names and feature attributes. However, since OSM does not really enforce a strong ontology or try to map these classes across languages, it can be challenging to compare the distributions of many types of objects. None of this is to minimise the achievement or utility of OSM: it may well be the only data source supporting open, replicable and cross-border mapping, and in many cases it may have more information/detail than a standard map. However, neither should OSM be confused with the output of a national mapping agency, and so it is, once again, a case of horses for courses. But lest you think that you can avoid these issues by relying on data from a national organisation: the use of older, marginally less accurate geo-data by the U.S. Treasury, and its slight misalignment with the most up-to-date geo-data provided by the Census, allowed a CEO worth billions to claim a major tax break intended for the poorest in America (Ernsthausen and Elliott, 2019)!
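For readers who want to see what OSM holds for a particular place, the sketch below uses the osmnx package; the place name and tag are purely illustrative, and older osmnx releases expose this function as geometries_from_place rather than features_from_place:

```python
import osmnx as ox

# Download every feature tagged amenity=pub within a named boundary.
# Coverage and tagging consistency vary from place to place, as discussed above.
pubs = ox.features_from_place("Walthamstow, London, England", tags={"amenity": "pub"})

print(len(pubs))     # how many such features OSM currently holds
print(pubs.head())   # a GeoDataFrame, ready for mapping or for spatial joins
```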

7. DISCUSSION

Linking back to our original tweet one last time: For analytical purposes should we attribute this tweet to the account's address (London), the tweet's geo-tag (Toronto) or to the place referenced in the tweet itself (New York)? The answer is, of course, it all depends on what we want to know! This is the vital contribution that social scientists can make to such projects: although we have much less control than we once did with purposive surveys over how data from platforms like Twitter are collected, we are used to thinking critically about data and data collection, and about the ways in which it can (or cannot) be applied to a particular research question.

So focussing on Central Park as a landmark tells us something about the features that tourists and 'natives' associate with New York City. But the sequencing of geo-tagged tweets by accounts gives insight into mobility patterns and tourism, potentially at a global scale (for example, Girardin et al., 2008). Or we could consider whether the account address is useful for designing overseas marketing campaigns ('the Brits like Central Park, the French like the Empire State Building'). None of these is a simple mapping between activity, location and purpose, but all are potentially rich and entirely legitimate topics for social scientists—whether academic or otherwise gainfully employed—to tackle.

7.1  The Perils of Spatial Data

Working with geospatial data will introduce the social scientist to a variety of analytical and representational challenges that can only be briefly touched on here. However, perhaps the most important idea the reader can retain from this foundational chapter is the importance of thinking about how spatial data is generated and recorded before using it in your research. Good research requires consideration of how the system produces data, who produces it and why, and what the analysis seeks to achieve: the social sciences are good at the who, why and what, but often much less good at the how. Geospatially and computationally empowered social scientists can bridge this gap, bringing good reflective practices to an area where, all too often, data is treated simplistically as 'truth'.

A few years ago I was involved in pitching an analytical platform to an international telecoms provider: we argued that, using techniques taken from planning, we could model the spatial distribution of their users with much greater precision. This would allow the company to improve their service offering by highlighting underserved areas and enhancing location-based services such as messaging about transit delays. The firm's data scientists responded: 'We don't need this, we have all the data already and don't need a model.' To them, the data spewed out of their mobile phone network was reality, and questions of bias, resolution and representativity were irrelevant!

The problems of this mindset were driven home while talking to a researcher at a different phone company who'd decided to run some tests on how his own phone connected to their network, because he'd discovered that none of the models in use had actually been validated! The analyst discovered that the pattern of connections was unlike anything they'd expected, with cells overlapping far more than anyone had realised and the behaviour of the network differing for fast- and slow-moving phones (for example, while walking vs. while on the train). These details matter because high-profile publications in journals such as Science and Nature make liberal use of CDRs (Call Data Records), which are the logged billable 'events' triggered by a mobile phone user. But if it turns out that our understanding of how this system works is wrong—or, at least, incomplete—then there are potentially significant implications for the kinds of conclusions that can be drawn from such data.
The issue is not that mobile phone data is useless; it is that it is likely to be inappropriate for some purposes. Since relatively few researchers have privileged access to such data, this intersects with wider questions of verification and replication: Do the conclusions hold for other networks with different consumer profiles? For other countries? Even for other types of network (there is more than one!)?

7.2  The (Statistical) Perils of Spatial Data

Variations on this issue also crop up in statistics when dealing with spatial data: one of the most basic assumptions of frequentist statistics—that observations are independent of one another—is violated as soon as we start factoring in the role of space. The simplest way to think about it is: Where are you likely to find wealthy or poor people? Near other wealthy or poor people, respectively! In practical terms, if we think of statistics as a way of working out whether a pattern we're observing in our data is random or statistically significant, then using non-spatial statistics can lead us to see patterns where none exist because our confidence thresholds are wrong (we should expect to find clusters of wealth and deprivation!).

A more subtle issue—and one about which we can, in practical terms, often do very little—is the Modifiable Areal Unit Problem (for example, Fotheringham and Wong, 1991): as social scientists, we're often tempted to use administrative boundaries to help structure our analysis of, say, crime or road accidents. In Figure 2.5, the shaded area is obviously a hotspot for some kind of activity (the hexagons), but the events are evenly split between Wards 1 and 2.

Source:  Contains Ordnance Survey data © Crown copyright and database right 2021

Figure 2.5  The Modifiable Areal Unit Problem

Suddenly, events are 'dispersed' across two different, larger units of analysis! The same holds for the events recorded as stars on the map: they are clearly driven by a process connected to main roads, but a borough-level analysis would show only that Enfield had more of them than its neighbours. A similar issue also happens when we try to disaggregate data: not only is it largely (though not quite always) impossible, but it often leads to the Ecological Fallacy of inferring that, because an area is on average well off, everyone in that area is well off. Similarly, although well-off people tend to vote for conservative/right political parties, well-off areas tend to vote for liberal/left parties.

7.3  The Power of Computational Thinking

Finally, it should be clear that, over the course of this chapter, we have moved into realms where hand coding/correction is just about impossible, and that approaching these problems requires code. Indeed, many of the other chapters in this Handbook, such as the introduction to GIScience (Buttenfield) and Analytical Environments (Bivand), are as much about the power of code to help us tackle challenging spatial problems as they are about the concepts presented; this is because the way that we approach these ideas is by writing code to perform them. In fact, many spatial statistics are best presented as algorithms, not equations, for reasons that will become clear over the many subsequent chapters (for an excellent illustration of this, see Xiao, 2016). However, you can always return to the fundamental fact that geospatial data, like geospatial analyses, are produced by people and by systems created by people: there is nothing 'innate' or 'correct' or 'true' in such data—or the algorithms with which we analyse them—and you should feel free to choose simpler data and algorithms if you understand them and are confident that they are appropriate. What the rest of the book does is help you to understand how various approaches can be appropriate in particular contexts and to particular problems.

Engaging critically with new concepts is integral to what we, as social scientists, already do every day, and it is how we learn to ask the questions that cut to the heart of contemporary social and policy issues. Moreover, asking questions is critical to good (social) science: I hope that you come away from this chapter not daunted by the range of concepts and challenges that seem to lie in wait, but empowered to ask questions of others. How are these events generated and logged? Why is this the right representation of the process? What kinds of mistakes or assumptions might we be making if we assume that n = all? For a more in-depth treatment of these issues in a non-spatial context, the highly engaging and very accessible Data Feminism (D'Ignazio and Klein, 2020) would be an excellent starting point, and see also the challenges and opportunities being raised in Part 4 of this book (see chapter 33 by Folch).
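As a tiny illustration of the 'statistics as algorithms' point above, here is a minimal sketch of Moran's I computed directly from its definition with numpy; the attribute values and binary contiguity matrix are invented for illustration, and a real analysis would use a dedicated library such as esda (Python) or spdep (R), which also provide permutation-based inference:

```python
import numpy as np

# Attribute values for five areal units, and a binary contiguity matrix
# recording which units neighbour which (both invented for illustration).
y = np.array([10.0, 12.0, 11.0, 30.0, 28.0])
W = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

# Moran's I = (n / S0) * (z' W z) / (z' z), where z holds deviations from the
# mean and S0 is the sum of all the weights.
z = y - y.mean()
S0 = W.sum()
I = (len(y) / S0) * (z @ W @ z) / (z @ z)
print(round(I, 3))   # positive values suggest that similar values cluster together
```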

REFERENCES

Arribas-Bel, D. (2014). Accidental, open and everywhere: Emerging data sources for the understanding of cities. Applied Geography, 49, 45–53.
Arribas-Bel, D., Rowe, F., Chen, M., and Comber, S. (2022). An image library: The potential of imagery in (quantitative) social sciences. In S. Rey and R. Franklin (Eds.), Handbook of Spatial Analysis in the Social Sciences. Edward Elgar.
Berkeley School of Information. (n.d.). Keeping secrets: Anonymous data isn't always anonymous. Retrieved from https://ischoolonline.berkeley.edu/blog/anonymous-data
Bivand, R. (2022). Analytical environments. In S. Rey and R. Franklin (Eds.), Handbook of Spatial Analysis in the Social Sciences. Edward Elgar.
Box, G. E. P. (1979). Robustness in the strategy of scientific model building. Research Triangle Park, NC. Retrieved from http://www.dtic.mil/docs/citations/ADA070213
Buttenfield, B. (2022). GIScience through the looking glass. In S. Rey and R. Franklin (Eds.), Handbook of Spatial Analysis in the Social Sciences. Edward Elgar.
Crampton, J. W., Graham, M., Poorthuis, A., Shelton, T., Stephens, M., Wilson, M. W., and Zook, M. (2013). Beyond the geotag: Situating 'big data' and leveraging the potential of the geoweb. Cartography and Geographic Information Science, 40(2), 130–139.
Delmelle, E. (2022). Neighborhood change. In S. Rey and R. Franklin (Eds.), Handbook of Spatial Analysis in the Social Sciences. Edward Elgar.
D'Ignazio, C., and Klein, L. F. (2020). Data feminism. MIT Press.
Elwood, S., and Leszczynski, A. (2018). Feminist digital geographies. Gender, Place & Culture, 25(5), 629–644. Retrieved from https://doi.org/10.1080/0966369X.2018.1465396
Ernsthausen, J., and Elliott, J. (2019). One Trump tax cut was meant to help the poor. A billionaire ended up winning big. ProPublica. Retrieved from https://www.propublica.org/article/trump-inc-podcast-one-trump-tax-cut-meant-to-help-the-poor-a-billionaire-ended-up-winning-big
Fischer, E. (2010). Locals and Tourists: London. CC-BY-SA. Retrieved from https://www.flickr.com/photos/walkingsf/4671589629/in/album-72157624209158632/
Folch, D. (2022). Uncertainty. In S. Rey and R. Franklin (Eds.), Handbook of Spatial Analysis in the Social Sciences. Edward Elgar.
Fotheringham, A. S., and Wong, D. W. (1991). The modifiable areal unit problem in multivariate statistical analysis. Environment and Planning A, 23(7), 1025–1044.
Girardin, F., Calabrese, F., Dal Fiore, F., Ratti, C., and Blat, J. (2008). Digital footprinting: Uncovering tourists with user-generated content. IEEE Pervasive Computing, 7(4), 36–43.
Gittings, B. M. (2009). Reflections on forty years of geographical information in Scotland: Standardisation, integration and representation. Scottish Geographical Journal, 125(1), 78–94. Retrieved from https://doi.org/10.1080/14702540902873881
Gittings, B. M. (n.d.). Skara Brae. Retrieved 2020, from https://www.scottish-places.info/features/featuredetails1186.html
Helderop, E., and Grubesic, T. (2022). Cluster identification. In S. Rey and R. Franklin (Eds.), Handbook of Spatial Analysis in the Social Sciences. Edward Elgar.
Knapp, E. (2022). The spatial analysis of gentrification: Formalizing geography in models of a multidimensional urban process. In S. Rey and R. Franklin (Eds.), Handbook of Spatial Analysis in the Social Sciences. Edward Elgar.
Missing Maps. (n.d.). One map Myanmar and Phandeeyar. Retrieved from https://www.missingmaps.org/blog/2018/09/24/mapping-in-myanmar/
O'Sullivan, D., and Manson, S. M. (2015). Do physicists have geography envy? And what can geographers learn from it? Annals of the Association of American Geographers, 105(4), 704–722.
QGIS.org (2021). QGIS Geographic Information System. QGIS Association. Retrieved from http://www.qgis.org
Reades, J., and Smith, D. (2014). Mapping the 'Space of Flows': The geography of global business telecommunications and employment specialisation in the London Mega-City Region. Regional Studies, 48(1), 105–126.
Reades, J., Calabrese, F., Sevtsuk, A., and Ratti, C. (2007). Cellular census: Explorations in urban data collection. IEEE Pervasive Computing, 6(3), 30–38.
Reades, J., Zhong, C., Manley, E., Milton, R., and Batty, M. (2016). Finding pearls in London's oysters. Built Environment, 42(3), 365–381.
Sandvik, B. (2009). World Borders Data Set, CC BY-SA 3.0. Retrieved from https://thematicmapping.org/downloads/world_borders.php
Stephens, M. (2013). Gender and the geoweb: Divisions in the production of user-generated cartographic information. GeoJournal, 78(6), 981–996.
Sweeney, L., Abu, A., and Winn, J. (2013). Identifying participants in the Personal Genome Project by name (a re-identification experiment). Retrieved from http://arxiv.org/abs/1304.7605
Wolf, L. J. (2022). The shape of bias: Understanding the relationship between compactness and bias in US elections. In S. Rey and R. Franklin (Eds.), Handbook of Spatial Analysis in the Social Sciences. Edward Elgar.
Xiao, N. (2016). GIS algorithms: Theory and applications for geographic information science and technology. Research Methods. SAGE.

3.  Analytical environments
Roger Bivand

1. INTRODUCTION

In conducting applied research, social scientists use a range of toolboxes containing methods. The methods and their allocation to toolboxes evolve over time, building on contributions from social science disciplines and from disciplines beyond the social sciences. An analytical environment in such a setting is associated with one or more toolboxes and may relate to one or more sub-disciplines or disciplines. Use of toolboxes may engender collaboration or communication between users and developers of methods, with overlaps between the roles of applied researchers viewing themselves as just users, users expressing needs to developers and developers focussed chiefly on innovation in methods, among others. One could also view an analytical environment as a layered community, in which users and developers interact. Interaction may be asymmetrical, with developers (or software vendors) conditioning the choices proposed to users. In scientific research, these choices may be further ordered by reviewers of work for publication, and by thesis committees. It is not at all impossible for domain communities to adopt differing specifications of the conceptualizations they are using, leading to the choice of different analytical environments based on the relative distances between nodes in research communities.

This chapter will consider the background to selected analytical environments for spatial analysis in the social sciences, choosing to use the R and Python languages as examples. Attention will be concentrated on empirical quantitative analysis rather than, for example, deterministic and simulation-based modelling, or qualitative methods. The ability to map data and the results of fitting models to data will be used to draw together similar kinds of analytical environments. Stress will be placed on the roles played by the shared open-source infrastructure underpinnings in the R and Python communities. This extends to the ways in which communities associated with these analytical environments view the mutual relationships between software components.

1.1  Antecedents

Duncan et al. (1961) discussed in some detail the challenges facing social scientists analysing areal data. The increasing use of areal data by social scientists from the 1950s, such as census and population data, raised numerous questions. These included some topics that are still central, and others which are now more neglected. They also highlighted issues arising in the development of appropriate methods of analysis for areal data, but apologized for their concern with methodology. They acknowledged that most scholars occupy themselves with substantive research questions, conceptual and applied, but few develop and implement appropriate methods. In this context, they pointed to the potential benefits of learning from other disciplines with disparate subject matter.

The dust jacket of the book starts with the depressing assertion:

    Specialists adopting techniques derived for the solution of other sorts of problems than those specifically concerning them share an unhappy lot. Such are those faced with the analysis of the quantitative data of the social sciences having areal significance.

While spatial data encompass more than areal data, such as aggregate census counts within arbitrarily given boundaries, it remains important to develop appropriate methods, and to propose and describe how they should be used in research practice. In the 1950s, some tabulation could be done by machine, but most calculation in the social sciences was done manually. During the 1960s, psychometricians and some social scientists, as well as their statistician colleagues and co-authors, began to be able to access computers, but most methods needed to be implemented as software using available languages such as Fortran and Algol using batch processing, or BASIC, which provided a degree of interactivity when using a teletype terminal. Once SPSS and SAS became available in 1968 and 1972 respectively for batch processing on mainframe computers, many university quantitative social scientists were relieved of the burden of building software from scratch, and analytical environments developed around the proprietary systems available. If the required method was not available, researchers still needed to be able to compile from source, but as compilers were always provided with mainframe computers, this was only an additional step, provided source code was available. Note that it was not uncommon at this time for authors to offer access to listings of source code in papers and books (Bivand, 2009; Cliff and Ord, 1969).

The entry of minicomputers and microcomputers during the 1970s presented challenges to the batch-processing mainframe model of computing in the social sciences. It took time for licensing and software publication business models to catch up with new hardware opportunities. However, libraries of Fortran subroutines for linear algebra (LINPACK, EISPACK) were released to the public domain, initially on magnetic tape, later from FTP archives. Researchers still needed to compile from source, linking to these libraries. A profusion of development environments arose, like MATLAB, which was written to give interactive access to linear algebra functionalities (MATrix LABoratory).

The birth of spreadsheets on the earliest microcomputers before the IBM PC offered an important and direct way for social scientists to handle tabular data. Their importance lay in the immediacy of interaction between the analyst and the data, subsequently mirrored in graphical interface design for many statistical software systems. This perception of the user interface has now been influential for almost forty years despite its weaknesses. A major weakness is that, unless backed by a transactional database or other system, there is no record of changes made to data or instructions using the graphical user interface (GUI). Consequently, and unlike the batch processing of the previous two decades, it can be very difficult to reconstruct steps taken in the course of analysis.

In the decades between 1990 and 2010, with the widespread use of personal computers and licensed commercial software, a division of labour was strengthened, separating developers and users. In the case of economics and econometrics, Renfro (2009) writes:

    One of the consequences of this specialization has been to introduce an element of user dependence, now grown to a sufficiently great degree that for many economists whichever set of econometric operations can be performed by their choice of program or programs has for them in effect become the universal set (p. 24).

This situation arose over time, and certainly did not characterize the research context when quantitative social science came into being. At that time, graduate students simply regarded learning to program, often in Fortran, as an essential part of their preparation as researchers. This meant that researchers were much 'closer' to their tools, and could adapt them to suit their research needs. Until recently, fewer appear to have felt confident as programmers, despite the fact that modern high-level languages such as Matlab, Python or R are easy to learn and very flexible, and many statistical software applications offer scripting languages (such as SAS, Stata, SPSS). In addition, Python and R can be used without licence costs and installed cross-platform. Happily, this is changing, including collaboration between R and Python communities (Turner, 2020) to increase diversity and to ensure that all are welcome in rapidly evolving communities.

In this chapter, a number of pictures will be drawn, hoping to indicate in a non-prescriptive manner the motivations behind and the functioning of the chosen environments for handling and analysing spatial data. First, we will consider the handling of spatial data as a prerequisite for analysis, before progressing to environments for analysing spatial data.

2. ENVIRONMENTS FOR HANDLING SPATIAL DATA

Bending towards the consideration of autobiography in Holt-Jensen (2019), I have chosen to discuss environments that have informed and shaped my own work (see also Johnston, 2019; Meeteren, 2019). I learned to program as a PhD student at the London School of Economics (LSE) in the early 1970s, because I needed to handle and analyse data for my thesis in ways that were not otherwise possible using software available at the University of London Computing Centre.

2.1  London, Early 1970s

Few social science departments had facilities for technical drawing when books and articles were prepared using typewriters. Human geographers often did have access to map libraries, from which the bases for thematic cartography could be borrowed for tracing. Many articles and books were richly furnished with figures constructed by hand, sometimes by researchers themselves, but often by skilled technical staff working from sketches provided by researchers. While tabular computations could be conducted and line-printer output could be entered into typewritten manuscripts, figures were a much greater problem. Figure 3.1 (left panel) shows a typical example of a hand-drafted map showing county boundaries, county letter codes and contiguous neighbour counts used extensively in Cliff and Ord (1973) for testing for spatial autocorrelation. When the figure was prepared, very few researchers, irrespective of discipline, had access to hardware permitting the input of boundary positions as data, or to hardware for outputting hard-copy maps.

I was fortunate to participate in the doctoral programme at the LSE Department of Geography in 1972–75. We had access to a Wang minicomputer (probably a 2200 model with CRT (cathode-ray tube), using Basic for programming), connected to a large digitizer. This let those of us willing to invest considerable time construct boundaries by recording chosen boundary points. The points were then listed in line sets, and line sets as closed polygon boundary sets; other examples of work using a Wang minicomputer with a digitizer may be found (Jeremíasson, 1976).

Figure 3.1  Left panel: counties of the Irish Republic (Cliff and Ord, 1973, p. 54); right panel: population density in London, 1971 (Shepherd et al., 1974, p. 15)

Figure 3.1 (right panel) is typical of the choropleth map output produced with boundary input digitized using the Wang minicomputer (Shepherd et al., 1974). Margaret Jeffery and Hazel O’Hare wrote software called ‘Chormap’ to combine the digital polygon boundaries with matching census data, to be run in batches on the University of London Computing Centre CDC 7600, with output on a CalComp 1670 microfilm recorder on 35 mm monochrome film. The association of input and output hardware, together with uniquely skilled technical assistance, made it possible for the very few researchers with access to these resources to plot publishable choropleth maps much more efficiently than hand drafting had permitted. Digitized polygon boundaries also permitted the extraction of lists of neighbouring Norwegian census tracts, used in my thesis to calculate Moran’s I statistics using self-written software (Bivand, 1975). This analytical environment was people-based: a community of graduate students, technical staff and junior lecturers who needed to create choropleth maps from census data and carry out aspatial analyses, for example using SPSS. Any non-standard analyses meant creating software oneself in Basic or Fortran; some shared code was available, and advice from colleagues was very helpful.

2.2  Geographical Information Systems and Handling Spatial Data

The subsequent emergence of geographical information systems (GIS) software, and the adoption of standards for file formats for transferring spatial data, made it easier for social scientists to work where no digitizers were available. The US Census Bureau introduced the Topologically Integrated Geographic Encoding and Referencing (TIGER) format (Marx, 1986), and others adopted ad hoc standards of their own. Earlier GIS ran only on larger, multi-user computer systems with command-line interfaces. During the 1990s, Esri™ (Environmental Systems Research Institute) introduced ArcView, and MapInfo Corporation introduced MapInfo; these were programs with graphical user interfaces for personal computers. They used the Esri Shapefile and MapInfo TAB file formats respectively; both formats used multiple files for a single data collection, with dBase DBF files holding attribute data.

Typically, digitizer output was specified in planar units, often in the native units of the digitizer itself, or in decimal inches or millimetres. At this stage, coordinate reference systems (CRS) were not seen as essential. This treatment of CRS as unimportant extended to the specification of the ESRI Shapefile (ESRI, 1998), which was only subsequently and optionally supplemented with a *.prj file containing an ESRI-specific Well Known Text representation (an illustrative sketch is given at the end of this subsection).

University geography departments were offered academic licences for teaching and research; use of GIS was tied to PC labs, and often to dedicated staff who could keep the software operating. Certainly, in the 1990s and 2000s the dominant analytical environment was GIS-based, but while take-up in the environmental sciences was strong, this was seldom the case in the social sciences. The adoption of GIS in human geography led to accusations that GIS was positivist in its essence and expression (Schuurman, 2000). Consequently, while other empirical and quantitative social sciences found that the ability to handle spatial data in the new common file formats enhanced the range of tasks they could accomplish, human geographers were challenged.
The ‘science wars’ that began thirty years ago continue (Thatcher et al., 2016), at least in human geography. This has meant that few human geographers receive any training in handling spatial data as graduate students or earlier, leaving them at a disadvantage when such tools might prove useful.
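To make the *.prj point concrete: such a file holds a single line of ESRI-flavoured Well Known Text describing the coordinate reference system. The sketch below is an approximate, illustrative rendering of a *.prj file for unprojected WGS84 data (reproduced from memory, with line breaks added for readability, so it should not be treated as authoritative):

GEOGCS["GCS_WGS_1984",
    DATUM["D_WGS_1984",
        SPHEROID["WGS_1984",6378137.0,298.257223563]],
    PRIMEM["Greenwich",0.0],
    UNIT["Degree",0.0174532925199433]]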

2.3  Moving Beyond GIS

It is not, however, the case that the ability to handle spatial data, especially aggregate data, has gone unused in the social sciences, despite the ‘science wars’. One example is the use of R as an environment for handling and visualizing data, some of which is spatial (Cheshire and Uberti, 2014). One of the authors related in a blog post (https://jcheshire.com/visualisation/r-visualisations-design/) that:

The majority of graphics we produced for London: The Information Capital required R code in some shape or form. This was used to do anything from simplifying millions of GPS tracks, to creating bubble charts or simply drawing a load of straight lines. We had to produce a graphic every three days to hit the publication deadline, so without the efficiencies of copying and pasting old R code, or the flexibility to do almost any kind of plot, the book would not have been possible.

A further pair of examples is provided by two atlases that mostly use cartograms to visualize many aggregate variables, avoiding the excessive figure area otherwise occupied by large, sparsely populated areal units (Ballas et al., 2014, 2017). In these cases, the authors do not state which software environment was used for handling the spatial data, beyond naming the cartogram method (Gastner and Newman, 2004), but they provide guidance in an earlier article (Dorling and Ballas, 2011); a scripted sketch of such a workflow is given below. These examples demonstrate very active use of spatial data in the social sciences, using scripts and other software components, but not as such directed towards fostering analytical environments. They do, however, show that GIS had become too limiting for the creative needs of these projects, leading to movement beyond expecting software to be provided ready for use by GIS vendors.
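As an indication of how such cartogram construction can be scripted rather than carried out in a desktop GIS, the following minimal sketch uses the contributed cartogram package in R. This is a sketch under stated assumptions: cartogram_cont() implements a rubber-sheet continuous-area algorithm (after Dougenik and colleagues), not the Gastner and Newman (2004) diffusion method used for the atlases, and the file name ‘zones.gpkg’ and population column ‘pop’ are hypothetical.

library(sf)
library(cartogram)
# Hypothetical input layer and attribute; replace with real data
zones <- st_read("zones.gpkg", quiet = TRUE)
# Continuous cartogram construction requires projected (planar) coordinates
zones <- st_transform(zones, 27700)
# Distort polygon areas in proportion to the 'pop' column (rubber-sheet method)
zones_carto <- cartogram_cont(zones, weight = "pop", itermax = 10)
plot(st_geometry(zones_carto))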

3. CONTEMPORARY SCRIPTING ENVIRONMENTS FOR HANDLING SPATIAL DATA

Scripting is not just interactive computing, entering successive commands at a command prompt. It presumes the active use of the language ‘behind’ the prompt, and that command sequences can be submitted from a file, a script, by analogy with performance. The script contains the sequence of steps needed to reach the intended conclusion. Scripting and the use of ‘little languages’ had been seen as an alternative to the graphical user interfaces that increasingly dominated GIS from the early 1990s (Bivand, 1996, 1997, 1998). Scripts would also play an important role in advancing reproducibility in analysing spatial data (Brunsdon and Comber, 2021). Although neither R nor Python is a ‘little language’ anymore, their on-ramps are less forbidding and much better supported by communities and training than previously. Both are also largely open-source, and both support the integration of external open-source geospatial software into scripting environments.

R (R Core Team, 2021) and Python (Van Rossum and Drake, 2009) permit scripts to be written using basic language functionality. This functionality, however, does not extend to the handling of spatial data. The R analytical environment includes the Comprehensive R Archive Network (CRAN), which serves user-contributed packages of software with documentation. These have included packages for spatial data handling and analysis since the early 2000s (Bivand, 2021), including packages providing scripting access to key open-source geospatial software libraries.

These libraries, and some of the interface and other code in R packages, are written in languages requiring compilation, which may complicate distribution to users. CRAN provides users with ‘binary’ contributed packages for Windows and MacOS, permitting users to install such packages and their dependencies. R contributed packages may be installed and updated in a running R session, but should not be updated if already used in that session. Because CRAN contributed package binaries are available for Windows and MacOS, and contain their required external library components by static linking, the computing environment is closely controlled and almost certainly coherent (because served packages are tested against each other on multiple platforms continuously).

Python packages are provided in a very similar way using the Python Package Index. Unlike R packages, Python packages should not be installed in a running Python session, but prior to its commencement. Typically, pip has been used to manage contributed packages, and more recently complete conda computing environments have also become available. Specified computing environments permit users to carry out chosen tasks without the need to install chains of individual software components, in much the same way as the decision by CRAN to link Windows and MacOS packages accessing external libraries statically. External libraries do evolve, and if a new version changes an interface component for whatever reason, then software such as Python or R packages using that library but linking dynamically would need re-compilation. Further opportunities for customizing and controlling computing environments are offered by containers such as Docker, or by using Binder to create computing environments for remote users.

While computing environments are a strict subset of analytical environments, they matter a great deal. Community activity on mailing lists and question-and-answer fora often provides example scripts resolving problems. These solutions are, however, dependent on the versions of the software components being used: the packages themselves, the other packages they use and the external software libraries that they interface. Open-source packages for analysing spatial data share many components in their use of open-source external libraries, and this leaves most of them vulnerable to changes (Bivand, 2014). Such changes do occur frequently, recently in PROJ (Evenden et al., 2022), GEOS and GDAL (Rouault et al., 2022). These changes lead to churn in releases of R and Python packages, as small numbers of maintainers struggle to adapt to upstream changes while assuring the stability of downstream packages and scripted workflows.

Examples of upstream changes in PROJ and GDAL feeding through into R and Python packages concern the ways in which CRS are represented. In the 1990s, the specification of the CRS of spatial data was downplayed; for example, .prj files containing a text version of the CRS were optional. Correct CRS specification affected visualization in web-mapping applications from the mid-2000s, and was always essential for the integration of data across multiple data sources. In such cases of upstream change, a change in the computing environment may engender changes in the wider analytical environment, where users are obliged to absorb technical details which might previously have seemed unimportant.

3.1  R and Contributed Packages

# Enable PROJ network access so that transformation grids can be fetched on demand
Sys.setenv(PROJ_NETWORK="ON")
library(sf)
## Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
packageVersion("sf")
## [1] '1.0.5'
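Because results can depend on the versions of the external libraries a given build of sf is linked against, it can be useful to record them explicitly at the top of a script. A minimal sketch follows, using a helper exported by sf; the exact set of components reported varies between sf and library versions.

# Report the versions of the external geospatial libraries (GEOS, GDAL, PROJ, ...)
# that this installation of sf was built and linked against
sf_extSoftVersion()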

The sf package (Pebesma, 2018, 2020), introduced in late 2016, is a modern replacement for the sp package (Pebesma and Bivand, 2005, 2021); see Bivand (2021) for a fuller account. In Bivand et al. (2013) and other books presenting spatial data handling workflows, sp is used for both vector and raster data, supplemented by raster (Hijmans, 2022). The rgdal package (Bivand et al., 2021a) then provided functionalities offered by the external PROJ and GDAL libraries for reading and writing spatial data files; the rgeos package (Bivand and Rundel, 2021) interfaced GEOS predicates and topological operations. The sf package was first written as a replacement for vector data handling in sp, rgdal and rgeos, but now also supports raster data handling by interfacing GDAL for the stars package (Pebesma, 2021). Changing from sp-based workflows to sf-based workflows is proceeding, but necessarily impacts the broader analytical environment. It has been important to ensure that sp-based workflows continue to function as far as possible. rgdal and rgeos will be retired by or before the end of 2023 to simplify maintenance of the R-spatial package ecosystem.

We will use a dataset from the 2011 British Census for Output Areas in the London Borough of Camden, used in Bivand and Wong (2018) and described in detail there. It is stored in a GeoPackage format file, which overcomes many of the shortcomings of the legacy ESRI Shapefile format. It is read into an “sf” object named camden using the GDAL GPKG driver:
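A minimal sketch of that reading step follows. st_read() is the standard sf reader; the file name shown is an assumption for illustration, since the actual path is not given above.

# Read the Camden Output Area layer from a GeoPackage into an "sf" object;
# the file name "camden.gpkg" is assumed here for illustration
camden <- st_read("camden.gpkg", quiet = TRUE)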