Handbook on Measuring Governance (Elgar Handbooks in Public Administration and Management) 1802200630, 9781802200638


English · 330 [331] pages · 2024


Table of contents:
Front Matter
Copyright
Contents
Figures
Tables
Contributors
Introduction to the Handbook on Measuring Governance
PART I HISTORICAL DEVELOPMENT OF MEASURING GOVERNANCE
1. State formation and statistics
2. Quantification and global governance
3. New Public Management, performance measurement, and measuring for governance
4. The constitutive effects of measuring governance
PART II THEORETICAL APPROACHES TO MEASURING GOVERNANCE
5. Theoretical approaches to measuring governance: public administration
6. Measuring governance: a political science perspective
7. The sociology of measurement
8. Governmentality and the measuring of governance
PART III METHODS AND METHODOLOGIES FOR MEASURING GOVERNANCE
9. Approaches and methods for measuring governance: comparing major supranational institutions
10. Measuring the quality of collaborative governance processes
11. A framework for measuring the effects of policy processes on health system strengthening
12. Measuring micro-foundations of governance: a behavioral perspective
13. Criteria-based measurement of collaborative innovation and its impact on public problem solving and value creation
14. Using collaborative performance summits to help both researchers and governance actors make sense of governance measures
PART IV FIELDS OF MEASURING GOVERNANCE
15. Measuring active labour market policies
16. Governance in public health care: measurement (in)completeness
17. Made to measure: how central banks deliver performances of their worth and why unconventional monetary policy is reversing the burden of proof
18. We treasure what we measure: global development cooperation and the Sustainable Development Goals
19. Measuring democracy: capturing waves of democratization and autocratization
Index


HANDBOOK ON MEASURING GOVERNANCE

ELGAR HANDBOOKS IN PUBLIC ADMINISTRATION AND MANAGEMENT

This series provides a comprehensive overview of recent research in all matters relating to public administration and management, serving as a definitive guide to the field. Covering a wide range of research areas including national and international methods of public administration, theories of public administration and management, and technological developments in public administration and management, the series produces influential works of lasting significance. Each Handbook will consist of original contributions by preeminent authors, selected by an esteemed editor internationally recognized as a leading scholar within the field. Taking an international approach, these Handbooks serve as an essential reference point for all students of public administration and management, emphasizing both the expansion of current debates, and an indication of the likely research agendas for the future. For a full list of Edward Elgar published titles, including the titles in this series, visit our website at www.e-elgar.com.

Handbook on Measuring Governance Edited by

Peter Triantafillou Professor of Public Administration and Politics, Department of Social Sciences and Business, Roskilde University, Denmark

Jenny M. Lewis Professor of Public Policy, School of Social and Political Sciences, University of Melbourne, Australia

ELGAR HANDBOOKS IN PUBLIC ADMINISTRATION AND MANAGEMENT

Cheltenham, UK • Northampton, MA, USA

© Peter Triantafillou and Jenny M. Lewis 2024

Cover image: Marek Piwnicki on Unsplash.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical or photocopying, recording, or otherwise without the prior permission of the publisher.

Published by
Edward Elgar Publishing Limited
The Lypiatts
15 Lansdown Road
Cheltenham
Glos GL50 2JA
UK

Edward Elgar Publishing, Inc.
William Pratt House
9 Dewey Court
Northampton
Massachusetts 01060
USA

A catalogue record for this book is available from the British Library

Library of Congress Control Number: 2023949642

This book is available electronically in the Political Science and Public Policy subject collection http://dx.doi.org/10.4337/9781802200645

ISBN 978 1 80220 063 8 (cased) ISBN 978 1 80220 064 5 (eBook)


Contents

List of figures
List of tables
List of contributors

Introduction to the Handbook on Measuring Governance (Peter Triantafillou and Jenny M. Lewis)

PART I HISTORICAL DEVELOPMENT OF MEASURING GOVERNANCE
1. State formation and statistics (Cosmo Howard)
2. Quantification and global governance (Isabel Rocha de Siqueira)
3. New Public Management, performance measurement, and measuring for governance (Jenny M. Lewis)
4. The constitutive effects of measuring governance (Peter Dahler-Larsen)

PART II THEORETICAL APPROACHES TO MEASURING GOVERNANCE
5. Theoretical approaches to measuring governance: public administration (Sorin Dan)
6. Measuring governance: a political science perspective (B. Guy Peters)
7. The sociology of measurement (Radhika Gorur)
8. Governmentality and the measuring of governance (Peter Triantafillou)

PART III METHODS AND METHODOLOGIES FOR MEASURING GOVERNANCE
9. Approaches and methods for measuring governance: comparing major supranational institutions (Andrea Bonomi Savignon, Lorenzo Costumato and Fabiana Scalabrini)
10. Measuring the quality of collaborative governance processes (Joop Koppenjan)
11. A framework for measuring the effects of policy processes on health system strengthening (Fabiana da Cunha Saddi, Stephen Peckham, Peter Lloyd-Sherlock and Germano Araujo Coelho)
12. Measuring micro-foundations of governance: a behavioral perspective (Sjors Overman, Emma Ropes and Wouter Vandenabeele)
13. Criteria-based measurement of collaborative innovation and its impact on public problem solving and value creation (Jacob Torfing, Andreas Hagedorn Krogh and Anders Ejrnæs)
14. Using collaborative performance summits to help both researchers and governance actors make sense of governance measures (Scott Douglas)

PART IV FIELDS OF MEASURING GOVERNANCE
15. Measuring active labour market policies (Niklas Andreas Andersen, Flemming Larsen and Dorte Caswell)
16. Governance in public health care: measurement (in)completeness (Margit Malmmose)
17. Made to measure: how central banks deliver performances of their worth and why unconventional monetary policy is reversing the burden of proof (Timo Walter)
18. We treasure what we measure: global development cooperation and the Sustainable Development Goals (Katja Freistein)
19. Measuring democracy: capturing waves of democratization and autocratization (Marianne Kneuer)

Index

Figures

9.1 The interconnected dimensions of sustainable development goals
9.2 Data flow of SDG data collection
9.3 Analysis of each database in terms of span of measurement, data sources, and methodological robustness (shown as circle size)
10.1 Measures for assessing the quality of collaborative governance processes
11.1 The Policy Integration and Performance Framework (PIPF)
11.2 The Policy Integration and Performance Framework (PIPF): sub-dimensions and questions
13.1 Suggested causal relations between collaboration, innovation and ability to impact
14.1 The roles of researchers during the various stages of a summit

Tables

1.1 Elements of historical statistical regimes
3.1 Governance modes and key success criterion
3.2 The chain of performance measurement with aspects and actors
3.3 The purposes of performance measurement
5.1 The NWS, NPM and NPG: main concepts and characteristics
5.2 The NWS, NPM and NPG: argument, assumptions and implications for governance measuring
6.1 Sources of governance failure
9.1 Impact of GaaG areas of interest on what GaaG measures
9.2 Overview
9.3 SDGs with number of targets and indicators
12.1 Overview of theoretical and operational criteria for assessing measurement quality
13.1 The additive index measuring the breadth of collaboration
13.2 Correlation matrix for the relationship between collaboration, innovation and ability to impact
19.1 Regime types and subtypes in selected democracy indices
19.2 Classification of Russia by selected indices for the time period 2008–22

Contributors

Niklas Andreas Andersen (PhD) is Assistant Professor in the Department of Politics and Society at Aalborg University, Denmark. His area of research is situated at the intersection between public administration and public policy, where he studies how techniques for evaluating and governing street-level organizations are contributing to changes at both the level of policymaking and implementation. He furthermore focuses on the institutionalization of evidence-based knowledge in processes of designing and implementing new welfare policies.

Andrea Bonomi Savignon is a Senior Lecturer in Public Management and Social Innovation at the University of Rome Tor Vergata, Italy, where he heads the Strategy Committee for the Research Center on Digital Administration (CRAD). A Co-director of the MIMAP Executive Master in Public Administration, Andrea is past Chair of the EURAM SIG on Public and Non Profit Management and a Co-chair of the EGPA Study Group on Public Network Policy and Management. His main research interests focus on performance management and co-production at the intersections between public, private and non-profit organizations.

Dorte Caswell is a Professor at Aalborg University, Denmark. Her research focuses on understanding social work practice in the context of welfare to work, including implications for the most vulnerable clients. At present she is co-leader of the research centre CUBB, which, in cooperation with municipalities in Denmark, tries to develop co-creation and user involvement in the employment services through system innovation. Dorte has published mainly within the areas of social work and social/employment policy and has strong international networks.

Germano Araujo Coelho is a PhD researcher in Political Science at the University of Brasília, Brazil, and a member of the research group on democracy and inequalities (Demodê). He holds a Master's degree in Political Science from the Federal University of Goias (2019) and a degree in International Relations from the University of Brasília (2011). Germano was technical advisor for municipal planning at the Goias Federation of Municipalities (2014–18), and a researcher in a British Academy Newton Advanced Fellowship project.

Lorenzo Costumato is a Postdoctoral Research Fellow in Public Management and Governance in the Department of Management and Law at the University of Rome Tor Vergata, Italy. His main research interests are related to processes of innovation in public organizations, especially those linked to collaboration and networks for the creation of public value, multi-level governance, digital transformation and management of European funds.

Peter Dahler-Larsen (PhD and dr. scient. pol.) is a Professor in the Department of Political Science at the University of Copenhagen, Denmark, where he teaches evaluation. His international publications include The Evaluation Society (Stanford University Press, 2013), Quality: From Plato to Performance (Palgrave, 2019) and Casualties of Causality (Palgrave, 2022).

Sorin Dan is an Assistant Professor in Public Management at the University of Vaasa, Finland, where he teaches and conducts research on public sector reform and innovation. Dan is the co-author of Digital Talent Management: Insights from the Information Technology and Communication Industry (2021) and the author of The Coordination of European Public Hospital Systems: Interests, Cultures and Resistance (2017), both published by Palgrave Macmillan. He has offered consultancy services to the OECD and the European Commission and has authored over 30 publications.

Scott Douglas is Associate Professor of Public Administration at the Utrecht University School of Governance, the Netherlands. His research focuses on the performance management of collaborations, working closely with public sector organizations tackling issues such as radicalization, educational inequality and domestic violence.

Anders Ejrnæs is a Professor (MSO) in Social Sciences in the Department of Social Sciences and Business at Roskilde University, Denmark. He is a member of the steering committee of Danish Research Data for the Social Sciences (DRDS) and a member of The Coordinating Body for Register Research (KOR). Anders is a scientific expert on survey research and comparative analysis. His research interests include trust in public institutions, fear of crime, comparative analysis of political engagement and Euroscepticism.

Katja Freistein is a Senior Researcher and research group leader in the Centre for Global Cooperation Research at the University of Duisburg-Essen, Germany. Her work has focused on international organisations, development and global inequalities. She is also interested in discourse theory and narrative approaches. Her work has been published in journals such as Third World Quarterly, International Political Sociology, and Review of International Studies, and she is co-editor of the book Imagining Pathways for Global Cooperation (Edward Elgar Publishing, 2022).

Radhika Gorur is Associate Professor of Education at Deakin University, Australia. Her research spans education policy and reform, global aid and development in education, data infrastructures and data cultures, accountability and governance, large-scale comparisons and the sociology of measurement. Radhika is a founding director of the Laboratory of International Assessment Studies, Convenor of the Deakin Science and Society Network and Editor of the journal Discourse: Studies in the Cultural Politics of Education.

Cosmo Howard is an Associate Professor in the School of Government and International Relations and the Centre for Governance and Public Policy at Griffith University, Australia. His research focuses on the politics of expertise, policy responses to inequality and comparative public administration. Cosmo is the author of Government Statistical Agencies and the Politics of Credibility (Cambridge University Press, 2021).

Marianne Kneuer is a Full Professor of Comparative Politics and Director of the Institute of Political Science at the TU Dresden, Germany. Her main research areas are democracy studies and democratization, as well as digital politics. Her latest book (together with Thomas Demmelhuber) deals with Authoritarian Gravity Centers: A Cross-Regional Study of Authoritarian Promotion and Diffusion (Routledge, 2020). Marianne served as President of the International Political Science Association (IPSA) from 2019 to 2021.

Joop Koppenjan is Professor Emeritus in Public Administration at the Erasmus University Rotterdam, the Netherlands. His research interests include governance networks, collaborative governance, public private partnerships and public values.
He has (co-)authored various (contributions to) books and numerous articles in peer-reviewed journals. Together with Erik Hans Klijn he published the monograph Governance Networks in the Public Sector (Routledge, 2016).

Andreas Hagedorn Krogh is Assistant Professor in Public Governance and Organisation in the Institute for Leadership and Organisation at the Royal Danish Defence College, Denmark. His research interests include collaborative governance, collaborative innovation, network management, co-creation of societal security, crisis management and crime prevention. Andreas recently published Public Governance in Denmark (Emerald Press, 2022).

Flemming Larsen is a Professor at Aalborg University, Denmark. His research focuses on labour market and social policy, both from a political science and public administration perspective. At present he is co-leader of the research centre CUBB, which, in cooperation with local welfare agencies, tries to develop co-creation and user involvement in the employment services through system innovation. Flemming has participated in several international research networks and projects and has published widely internationally.

Jenny M. Lewis is Professor of Public Policy in the School of Social and Political Sciences and Director, Scholarly and Social Research Impact for Chancellery Research and Enterprise, at the University of Melbourne, Australia. Jenny is a Fellow of the Academy of Social Sciences Australia, and the immediate past President of the International Research Society for Public Management. She was an Australian Research Council Future Fellow for 2013–16, and is an expert on policymaking, policy design and public sector innovation.

Peter Lloyd-Sherlock is Professor of Global Gerontology at the University of Northumbria, UK. His main area of research looks at social protection, health and the wellbeing of older people in developing countries. Peter is also interested in the economic and social effects of non-communicable diseases, such as stroke, heart disease and Alzheimer's Disease. He has a more general interest in social policy, particularly in Latin America.

Margit Malmmose is Head of Finances (CFO) at University College Northern Jutland, Denmark. She was previously an Associate Professor in Management Accounting at Aarhus University, Denmark. Her research interests centre on management accounting themes in healthcare, such as general performance measurement systems, historical developments and more technical accounting themes. Margit engages largely with qualitative research, where she has conducted several case studies and comparative studies on the role of managers in hospitals, reform developments and the political influence on hospital management. Moreover, Margit has done research on hospitals' available accounting figures, including cost accounts.

Sjors Overman is Assistant Professor of Public Governance at Utrecht University School of Governance, the Netherlands, and Managing Director of the Netherlands Institute of Governance. He works at the intersection of public governance and behavioural sciences to study and teach felt accountability, emotions and measurement. Sjors has developed and validated multiple survey scales. In particular, his work concentrates on public service providers, regulatory authorities and cultural institutions.

Stephen Peckham is Director of the NIHR Policy Research Unit in Health and Social Care Systems and Commissioning and Professor of Health Policy at the London School of Hygiene and Tropical Medicine, UK. He is Co-Director of the Institute for Health, Social Care and Wellbeing at the University of Kent. He has over 30 years of policy analysis and health services research experience.

B. Guy Peters is Maurice Falk Professor of Government at the University of Pittsburgh, USA, and founding President of the International Public Policy Association. He holds a PhD degree from Michigan State University and has honorary doctorates from four European universities. He is currently editor of the International Review of Public Policy. His most recent books include Administrative Traditions: Understanding the Roots of Contemporary Administrative Behavior (Oxford University Press, 2022) and Democratic Backsliding and Public Administration (Cambridge University Press, 2022).

Isabel Rocha de Siqueira is an Assistant Professor and Deputy Director in the Institute of International Relations (IRI), PUC-Rio, and researcher at the BRICS Policy Center, Brazil. She holds a PhD in International Relations from the Department of War Studies, King's College London, UK. In 2019, 2022 and 2023 she was awarded the title of Young Female Scientist of our State (JCNE) from FAPERJ. Recent publications include reports commissioned by g7+ and UNOSSC, edited books by Editora PUC-Rio, articles in Globalizations and Policy & Society, and an authored book by Open Book Publishers.

Emma Ropes is a PhD candidate in the Department of Public Administration and Sociology at Erasmus University Rotterdam, the Netherlands. She obtained a Research Master's in Public Administration and Organizational Science. Emma's PhD research focuses on the emotional response of citizens to encounters with regulatory agencies, and its consequences for citizen judgement and behaviour. Her main research interests are behavioural public administration, citizen-state interaction and emotions.

Fabiana da Cunha Saddi is currently a Visiting Professor at the University of Brasília in Brazil. She is a former Senior Research Associate at the School of International Development at the University of East Anglia, a former British Academy Newton Advanced Fellow and Visiting Fellow at the University of Kent. She has been researching primary health and social policy in Brazil and other low- and middle-income countries, employing the public policy and health systems literatures. She is interested in policy processes, system strengthening, and qualitative and mixed methods.

Fabiana Scalabrini is a PhD candidate in Management – on the track in public management and governance – in the Department of Management and Law at the University of Rome Tor Vergata, Italy. Her PhD thesis revolves around the factors and drivers behind the transition from e-government to digital transformation in the public sector. Fabiana lectures for MSc and Executive Programmes at Tor Vergata.

Jacob Torfing is Professor of Politics and Institutions in the Department of Social Sciences and Business at Roskilde University, Denmark, Professor 2 at Nord University, Norway, and Director of the Roskilde School of Governance. His research interests include collaborative governance, public innovation, co-creation and facilitative leadership. He has recently published Co-creation for Sustainability (Emerald Press, 2022) and Rethinking Public Governance (Edward Elgar Publishing, 2023).

Peter Triantafillou is Professor of Public Administration and Politics in the Department of Social Sciences and Business at Roskilde University, Denmark. His interests are modern power-knowledge regimes in the fields of employment policy, public health politics, and performance auditing and measuring in the public sector.

Wouter Vandenabeele is an Associate Professor of Human Resources Management at Utrecht University School of Governance, the Netherlands, and Visiting Professor at the Public Governance Institute at KU Leuven University, Belgium. He obtained a PhD in Social Sciences at KU Leuven University (2008). His main research interests are organizational and institutional behaviour in the public sector, including questions of (public service) motivation, performance and other relevant constructs. Wouter also studies the application of evidence-based management in the public sector.

Timo Walter is a Lecturer and Co-director of the Centre of International History and Political Studies of Globalization at the University of Lausanne, Switzerland. He obtained his PhD in International Relations from the Graduate Institute of International and Development Studies in Geneva. Timo's research grapples primarily with understanding the origins and driving forces of the financialization of capitalism, and has been published in outlets such as Socio-Economic Review and the European Journal of International Relations.

Introduction to the Handbook on Measuring Governance
Peter Triantafillou and Jenny M. Lewis

Measuring governance has become an increasingly important feature of modern societies. International organizations (Oman & Arndt, 2010), government departments, private companies, social service deliverers, non-governmental organizations (NGOs) and many other bodies are expected to prove their worth by measuring their activities and results. The democratic quality of political regimes, the level of good governance, and human development status have been objects of intense scrutiny via measurement since the end of the Cold War (Kaufmann et al., 2007; Norris, 2011; Wahlberg & Rose, 2015). This measurement has come in many different forms, such as government statistics, key performance indicators, user satisfaction surveys, rankings and rating systems, and programme evaluations. This drive for measuring has taken place in parallel with the emergence of new forms of governing both public and private matters. The term 'governance' is today a catch-all phrase for a bewildering array of political, administrative, economic, social and cultural phenomena whereby some actors or organizations try to govern individuals, groups or even populations with a view to improving the public good. The term is used at times to denote voluntary or at least collaborative (horizontal) interactions based on deliberation and compromise, which are contrasted with more state-centric, top-down (vertical) government. However, there is little agreement on this definition. As the measuring of governance has become a fundamental feature of both the public and private sectors at local, national and international levels, it is timely to take stock of the state of the art. The aim of this handbook is to map out the historical developments of and relations between governance and measurement, trace the theoretical conceptions and disciplines concerned with analyzing governance measurement, showcase key methodological approaches to the measuring of governance, and summarize what we know about measuring governance from research in specific fields of practice.

WHAT IS GOVERNANCE AND MEASUREMENT?

We understand governance as the more or less institutionalized attempts to direct the behaviour or conduct of individuals and organizations, through mechanisms that are considered legitimate in liberal democratic terms, with a view to improving the public good. Torfing et al. offered the following rather generic definition: 'the process of steering society and economy through collective action and in accordance with common goals' (Torfing et al., 2012, p. 2). This broad definition encompasses most governing and steering processes that seek to generate processes and outcomes for the public good and that are decided through collective processes enabling substantive citizen influence.

Governance then includes, for example, laws and political reforms issued by governments elected according to liberal democratic procedures. It also includes decisions taken and resources allocated by networks, groups or organizations (public and/or private) that entail changing the conduct of other actors in order to produce public value, such as governance networks (Klijn, 2008). Conversely, governance does not include the management and power exercised by for-profit companies in order to produce or sell their products as private goods – often referred to as corporate governance. However, private firms delivering public services under contract to governments are involved in governance as it is understood here. Charitable activities conducted by private organizations, such as the Bill & Melinda Gates Foundation, may very well aim to improve the common good. Yet, the goals of charity are not based on collective decision-making processes and therefore do not qualify as governance, based on the definition we have used in this volume.

We understand measurement as the act of assessing the character, quality, performance or results of a governance activity according to a unit that may be more or less standardized. Measurement of governance includes:

● Effective evaluation of public programmes or services – whether delivered by private contractors or public ones
● Productivity analysis of an organization, a programme or a specific service
● Benchmarking of public services or national policies
● Democracy surveys of political regimes, e.g. freedom indexes
● Competitiveness surveys of national economies
● Rankings of nations in specific public policy sectors, e.g. level of innovativeness, health care coverage, social services comprehensiveness, the quality of school education.

The Emergence and Development of Governance Measuring

Measuring has been instrumental to political rule for millennia. We need only think of the population censuses developed during the Roman empire. With the emergence of large territorial states and secular political rationalities, such as cameralism and raison d'état, statistics took on a new importance. When seventeenth-century British political economist William Petty coined the term Political Arithmetic, he was aspiring to develop a new art of state governance based not only on population censuses but on a statistical knowledge of persons, resources and territory (Mykkänen, 1994). In the following centuries, the production of statistics adhering to national and later international standards has been crucial to the political steering capacities of modern states in the West. The continued importance of statistics for modern state-building and government policies is well documented, notably by French scholars (Amosse, 2022; Desrosières, 1998).

The development of industrial capitalism and the urbanization of Western Europe during the nineteenth century called for new scientific understandings of these phenomena. Apart from French and British modalities of political economy (Tribe, 2008), the new society saw the rise of the discipline of sociology. The French philosopher Auguste Comte and his social physics may be regarded as the origin of sociology (Comte, 1877), but it is with the later works of Emile Durkheim that statistical measures became crucial to understanding society and how it might be governed.
Between Comte and Durkheim, Western societies had experienced a statistical surge encapsulated in the establishment of national statistical agencies. By the same token, Western societies also experienced a probabilistic revolution that essentially implied a shift from a deterministic understanding of causality to a probabilistic one (Hacking, 1990). This statistical revolution included the invention of the normal distribution, which soon came to be applied to all kinds of societal phenomena, starting with educational examinations (Hoskin & Macve, 1994) and from there moving into macroeconomics (Tooze, 2001, pp. 29–30) and the study of social phenomena like crime and unemployment. Durkheim would regard crime and suicide not in individual and moral terms (as a faulty or sinful character) but as social-statistical facts (Durkheim, 1968). His studies suggested that every society has a certain normal rate of crime and of suicide. Accordingly, governance interventions are only needed in situations of anomie, that is, when the rates of crime, suicide or unemployment reach levels that exceed the statistical normal.
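This Durkheimian reasoning can be restated in simple statistical terms: a rate calls for intervention only when it deviates markedly from its own historical distribution. The following sketch is purely illustrative; the rates are invented, and the two-standard-deviation threshold is a modern convention rather than anything Durkheim specified:

```python
from statistics import mean, stdev

# Invented yearly suicide rates per 100,000 inhabitants; the final year spikes.
rates = [11.2, 10.8, 11.5, 10.9, 11.1, 11.4, 10.7, 11.0, 11.3, 14.2]

baseline = rates[:-1]              # the historically "normal" rates
m, s = mean(baseline), stdev(baseline)
threshold = m + 2 * s              # two standard deviations above the mean

latest = rates[-1]
if latest > threshold:
    print(f"{latest} exceeds the statistical normal ({m:.1f} +/- {2 * s:.1f}): anomie")
else:
    print(f"{latest} lies within the statistical normal: no intervention needed")
```

On this logic, governing through statistics is reactive: it is not the absolute level of crime or suicide that triggers intervention, but a measured deviation from the norm.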
In Britain, shipowner and self-made social researcher Charles Booth took a slightly different tack by focusing explicitly on problems of poverty among the labouring classes. Like Durkheim, Booth was inspired by Comte's empirical social physics. However, rather than regarding poverty as a statistical fact only, he took statistical measures as absolutely fundamental to understanding and, ultimately, to addressing and alleviating poverty (Van Dooren et al., 2010, p. 39).

So far, we have mainly looked at how the measuring of societal phenomena, such as the economy and the population, has enabled and informed new ways of governing. However, measuring targets not only what takes place outside the realm of governance but also governance itself. The history of the measuring of governance may be much briefer than that of the measuring of society, yet it is by no means straightforward. The emergence of cost-based accounting may be one of the first really important uses of measuring in governance. The use of accounting to manage commercial activities goes back at least to the affluent trading families of Renaissance Northern Italy. However, these practices had little if any use in public governance. With the emergence of industrial capitalism, the big private corporations in the US and later Western Europe started to use cost-accounting from the end of the nineteenth century (Chandler, 1977; Fleischman & Tyson, 1993; Hoskin, 1998). This measurement innovation essentially allowed specialization and a division of labour between distinct departments and branches of manufacturing production and, more importantly, the measurement of profitability and the delegation of managerial responsibilities between discrete entities within the corporation. It was partly the emergence of these large and highly effective capitalist corporations that inspired Max Weber to claim, around the turn of the century, that the most effective form of political rule rested on a bureaucratic organization characterized by clear rule-based lines of authority and functionally divided branches staffed by experts (Weber, 1978, pp. 956–74). Statistics and numbers in general were crucial to all these regulatory ideals and, ultimately, to ensuring the legitimacy of and trust in the bureaucracy as a specific mode of rule (Porter, 1995).

While cost-accounting may not have played the same important role in public administrations as it did in private corporations, we do see the spread of performance budgeting systems during the 1960s. Programmes like the Planning Programming Budgeting System (PPBS) and Management-By-Objectives (MBO) were found in the US, the UK, France and several other European countries (Van Dooren et al., 2010, p. 41). In the context of growing social welfare programmes and expenditures, the overall idea was to create a holistic approach to public management: a systematic approach by which the costs of all the activities deemed necessary to achieve a given end were weighed up against the benefits (Schick, 1966). The new budgeting systems reflected a trust in rational planning enabled by experts with access to nearly unlimited evidence to unravel causal links between means and ends, and to gauge the costs and benefits of each of the activities relevant to achieving politically given ends. Not surprisingly, this planning hubris turned out to produce disappointing results (Wildavsky, 1969), and by the late 1970s these kinds of programmes were largely abandoned (Van Dooren et al., 2010, p. 41).

The advent of New Public Management reforms in the 1980s came with a surge of new performance management and measurement systems (Hood, 1991). The explosion of performance measurement was part of a wider trend of making the activities of both public and private organizations auditable to a hitherto unseen extent (Power, 1997). In an ideological climate dominated by neoliberal thinking, in which markets were by default seen as efficient and public organizations as the opposite, New Public Management (NPM) reforms led to public organizations sometimes being privatized, but always being subjected to performance management schemes. These included contracting out, informal performance contracts, performance bonuses, user choice, user satisfaction surveys, benchmarking and many other schemes. All of these tried to create competition – between public organizations for users – in an environment where this could not be expected to happen spontaneously. Interestingly, many of these reforms were marketed as deregulation and reducing bureaucracy, whereas they were in reality predicated on re-regulation, namely a wide range of regulations seeking to create new competitive systems for public services. These competitive systems in turn relied on the development of a comprehensive apparatus of measures and indicators, and of systems to collect and interpret data. Thus, public service activities had to be reformed in ways that made them not only more competitive but also measurable (Power, 1996).

During the 1990s, NPM reforms spread rapidly throughout most Organisation for Economic Co-operation and Development (OECD) countries, not least thanks to the OECD's public management group (PUMA), established in 1990 (Hadjisky, 2016). Still, the NPM reforms were implemented rather differently across OECD countries (Pollitt & Bouckaert, 2011). While the Reagan administration talked a lot about the atrocities of big government and introduced major tax cuts, it was only with the Clinton administration and its Government Performance and Results Act of 1993 (GPRA) that NPM reforms gathered pace in the US. The GPRA called for the systematic adoption of performance measurement across agencies, covering both strategic management (planning) every three years and annual evaluation of results (Kravchuk & Schack, 1996). In the early 2000s, the GPRA was partly supplanted by the Program Assessment Rating Tool launched by the Bush Junior administration, which directly linked performance assessments to budget decisions (Moynihan, 2008, pp. 125–36). The last two decades have also seen an increase in the use of programme evaluation information with a view to improving performance management programmes (Kroll & Moynihan, 2017).

As in the US, the spread of NPM in the UK only took place a decade or so after the advent of the Thatcher-led Conservative government in 1979.
The development of performance management and measurement reforms did not follow any blueprint and therefore took off in several rather different and often uncoordinated directions. After an initial phase emphasizing public budget cuts and privatization, performance measurement and ranking systems were more systematically introduced from the late 1980s (Pollitt & Bouckaert, 2016, p. 337). The National Audit Office and, not least, the Audit Commission were instrumental to the spread of auditing and measurement systems. Tony Blair's and, later, Gordon Brown's New Labour governments further expanded the apparatus of centrally orchestrated performance measurement systems in order to keep track of the expanded social service budgets of both central and local governments (Pollitt & Bouckaert, 2016, p. 338). Interestingly, the advent of the Liberal-Conservative coalition government in 2010 seemed to eclipse much of the performance measurement and targeting craze, not least signified by the closure of the Audit Commission in 2012 (Tonkiss & Skelcher, 2015). The government presented this reduction of performance measurement systems as cutting unnecessary expenditure and handing power back from civil servants to the people; others suggested that the closure of the Audit Commission would erode the capacity to keep track of local government malpractice and poor performance (Timmins & Gash, 2014, p. 19).

In contrast to the UK and the US, NPM reforms in New Zealand came with a relatively clear theoretical orientation that was turned into a comparatively coherent action plan in the early 1990s (Boston et al., 1996). With inspiration from transaction cost theory and public choice theory, extensive performance management and measurement programmes were launched up until around 2000. This was followed by a period of increasing criticism of these reforms and a partial rolling back of some of them. Together with other NPM reforms, performance measurement also gradually invaded the public sector of most if not all European countries, though more slowly and often less pervasively than in the Anglophone countries (Pollitt & Bouckaert, 2016).

Performance measurement of governance was not only a domestic issue; it also played a very important role in international organizations and governance from the early 1990s (Davis et al., 2012). With the 1990s Washington Consensus, which propagated free markets, macroeconomic stability and free currency exchange as the way to economic prosperity for developing countries, loans from the World Bank and the International Monetary Fund would above all depend upon satisfying indices of privatization, deregulation, (lacking) currency regulation, etc. (Vestergaard, 2009). Measuring also played an important role in international governance focused on the world's wealthy countries, where economic competitiveness was, from the 1990s onwards, seen as absolutely crucial to maintaining economic growth (Cerny, 2010). This quest for international competitiveness was underpinned by a range of indices and measures, such as the annual global competitiveness report developed in 2000 by Harvard Business School professor Michael Porter, Harvard University professor Jeffrey Sachs and the Swiss World Economic Forum (Porter et al., 2000).

Returning to the domestic level, the 2000s saw increasing academic attention to interactive forms of governance. Under headings such as network governance (Marcussen & Torfing, 2007), collaborative governance (Ansell & Gash, 2007) and new public governance (Osborne, 2006), some studies pointed to an empirical tendency towards more decentralized and networked forms of governing (Bevir, 2010; Rhodes, 1997). Several scholars argued that there is a societal need for more interactive forms of governing that cast citizens as active and resourceful co-producers of policies and services (Bovaird, 2005; Sørensen & Torfing, 2005). Partially based on a critique of NPM and its performance measurement craze, collaborative governance scholarship regarded public value production as a service interactively produced between the public sector and the citizens, not as a discrete output that can be easily measured (Osborne et al., 2013).
Many collaborative governance scholars recognize the need for accountability and some kind of performance measuring (Emerson & Nabatchi, 2015; Klijn & Edelenbos, 2013), but this task is inherently difficult, as governance often implies that both the means and even the goals may change in the process (Lewis & Triantafillou, 2012). Moreover, the often informal character of collaborative activities, which enables them to be agile and effective, makes it difficult for the actors to provide systematic and valid data on and measures of their activities and the outcomes of these (Torfing et al., 2012, pp. 71–84).

THEORIES AND METHODS OF GOVERNANCE MEASURING

Disciplines and Theories

The measuring of governance has been studied from within a wide range of social scientific disciplines and theories. This brief introduction cannot do justice to the theoretical diversity of the topic, but we may still point to some influential approaches emanating from organizational theory, public administration, political science, sociology, economics, governmentality studies and anthropology.

Organizational management theory has been particularly important to the scientific advances on how to measure governance. In the 1980s, US management scholars Brinton Milward and Keith Provan embarked on a series of studies that developed consistent models to gauge the effectiveness of policy and governance networks (Milward & Provan, 1998; Provan & Milward, 2001; Rainey & Milward, 1983). Later on, Dutch organizational scholars Patrik Kenis and Jörg Raab engaged in similar studies, developing models for measuring the effectiveness of governance networks and identifying the key factors influencing their effectiveness (Kenis et al., 2009; Provan & Kenis, 2007; Raab et al., 2015). Organizational and accounting studies have emphasized the performative nature of performance measurements and their structuring effects on governance regimes at organizational, national and even international levels (Mehrpouya & Samiolo, 2016).

Public administration is perhaps the discipline that has contributed most extensively to the study of governance and how it can be measured. The emergence of public administration research on governance is particularly clear in countries like Australia (Considine et al., 2015), the US (Moynihan et al., 2012), Denmark (Sørensen & Torfing, 2007) and, not least, the Netherlands (Klijn & Koppenjan, 2014). Notwithstanding the immense variation among these studies, public administration scholars tend to focus on public organizations, including internal collaboration and collaboration with private organizations and citizens. Some public administration scholars have looked at ways of measuring governance (Brandsma & Schillemans, 2013; Torfing et al., 2020); others have critically addressed the unintended effects of measurement regimes (Hood, 2006; Lewis, 2016; Radnor, 2008).

It is much less common to take a political systems perspective on governance and its measuring. Still, we do find important political science studies addressing how governance fits into and possibly modifies state functioning and structures (Bevir, 2010; Pierre, 2000; Pierre & Peters, 2000). Likewise, the debate on, and not least the attempts to actually measure, governance at the level of the political system are surprisingly scarce (Torfing et al., 2012, pp. 71–84). In a more critical vein, political scientists have examined the political struggles around the decisions on who, how and what to measure (Lewis, 2015).

We already saw above that statistics played a crucial role in Durkheimian sociology and that sociologists have contributed fundamentally to developing social indicators and statistical surveys in order to grasp and govern social problems (Land & Michalos, 2018). More recently, sociologists have been preoccupied with the social conditions and various implications of quantification, not least the use of measuring in governance (Diaz-Bone & Didier, 2016; Mennicken & Espeland, 2019). They have studied, for instance, the social processes of commensurability that enabled the implementation of the PISA surveys in OECD member states (Gorur, 2014).

With its quest for mathematical modelling of markets and the statistical assessment of the ups and downs of national economies, economics may be regarded as the royal discipline of the governance-measurement nexus. Just think of the significance of the invention of the gross domestic product for the exercise of modern fiscal and monetary policies. Moreover, economists have explicitly addressed the potentials and pitfalls in using performance measures and indicators to gauge government policy performance (Williams & Siddique, 2008). Yet, much of the discipline's authority and its claim to neutrality rest with a formalization that abstracts both from the political battles underpinning the making of its statistical artefacts and from the political implications of its formalized models (Fioramonti, 2013). Accordingly, it is mainly up to economic sociology and to some extent political economy approaches to address the often rather arbitrary and politically loaded relationships between governance and measurement. For instance, the French school of the economics of convention has focused on the contingent nature of the conventions guiding the construction of statistical indicators that inform much government policymaking (Thévenot, 2001, 2015). This includes how statistical data are enabling new European Union (EU) strategies seeking to shape the member states' economic and employment policies (Salais, 2006). Finally, while the measuring of governance has generally been slow to be taken up by political economy scholars (Mügge, 2020), we do find studies of how financial indices and measuring contribute to disciplining the economic policies of developing countries (Vestergaard, 2009) and, more recently, of how performance measurement may augment the state capacity of poorer countries (Asadullah & Savoia, 2018).

A number of Foucauldian-inspired studies have addressed the power-knowledge relations implied in governance measuring. Often the focus is on the ways in which various technologies of performance measurement interact with rationalities of government, notably neoliberalism. For example, we find studies of how performance indicators and benchmarking technologies have shaped the work of international organizations and global governance (Merry et al., 2015), the EU's employment policy (Triantafillou, 2008), university policy (Ørberg & Wright, 2009; Triantafillou, 2015), public health policy (Triantafillou, 2020) and social security (Henman, 2021; Henman & Adler, 2003). These studies above all point to the many ways in which measurement regimes enable the exercise of power at a distance over or through individuals and organizations.

Finally, anthropologists have examined the role of statistical knowledge and measures in British colonial rule (Cohn, 1996) and in the governmentalization and control of developing countries (Escobar, 1995; Ferguson, 1990). More recently, anthropological scholars have focused on the intricacies of measuring and governance in industrialized countries. Some have attempted to quantify the culture of organizations (Fletcher & Jones, 1992).
However, the quest to audit and measure organizational culture or performance involves not merely technical instruments; it also serves to inculcate new organizational cultures, affecting organizational identities and the often tacit codes of employee conduct (Shore, 2008). We also find an increasing number of ethnographic studies of the ways in which audit and performance codes and rituals shape organizational legitimacy (Power, 2003).

Methods and Methodologies

The methods and methodologies involved in the measuring of governance are mostly well known and used in many other contexts. For instance, methods like randomized controlled studies, various experimental designs, and various types of case studies that are widely used in the social sciences are also frequently applied to measure the effects of specific governance programmes or projects (Clark et al., 2021). Moreover, the well-known techniques of process evaluation and process tracing have also been used to assess and understand the causal efficacy of governance arrangements (Rauschmayer et al., 2009; Schmitt & Beach, 2015). Finally, benchmarking has for many decades played a crucial role in the manufacturing and financial sectors as a way to gauge relative performance. Such methods later came into prominence in the measuring of public administrations and governance too (Arrowsmith et al., 2004; Triantafillou, 2007).

Yet, we also find many distinct methodological contributions within governance research. For instance, the methods for measuring the effectiveness of governance organizations and structures have seen substantial improvement thanks to governance scholars (Provan & Milward, 2001; Voorn et al., 2020). Moreover, governance scholars have developed new methods for the participatory measuring of organizational learning and results (Armitage et al., 2018; Wal et al., 2014). In a more critical vein, we find a large number of studies analyzing the unintended effects of NPM-inspired performance measurement regimes (Bouckaert & Van Dooren, 2015; Hood, 2006; Radnor, 2008). Finally, inspired by constructivist social science approaches, some scholars have looked into the constitutive or performative effects of performance auditing and measurement regimes (Arnaboldi & Lapsley, 2008; Dahler-Larsen, 2014; Power, 1996).
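Many of the measures discussed in this introduction, from freedom indexes to competitiveness rankings, rest on the same basic arithmetic: heterogeneous indicators are normalized to a common scale and then aggregated with weights. The sketch below illustrates that arithmetic only; the countries, indicator values and weights are all invented, and real indices use far more elaborate aggregation models:

```python
# Toy composite governance index: min-max normalization of each indicator,
# then a weighted sum per country. All names, values and weights are invented.
raw = {
    # country: (participation, government effectiveness, rule of law)
    "Alpha": (7.2, 6.1, 6.8),
    "Beta":  (5.4, 7.9, 7.1),
    "Gamma": (8.1, 5.2, 5.9),
}
weights = (0.4, 0.3, 0.3)  # the weighting itself is a political choice

def min_max(column):
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

# Normalize indicator by indicator (column-wise) so that units become comparable.
normalized = [min_max(col) for col in zip(*raw.values())]

# Weighted sum per country, relying on the dict's insertion order.
scores = {
    country: sum(w * normalized[i][j] for i, w in enumerate(weights))
    for j, country in enumerate(raw)
}

for rank, (country, score) in enumerate(sorted(scores.items(), key=lambda kv: -kv[1]), start=1):
    print(f"{rank}. {country}: {score:.2f}")
```

The design choices in even this toy version, which indicators to include, how to normalize them and how to weight them, are precisely the politically contested decisions that several chapters in this volume examine.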

THE ORGANIZATION OF THIS HANDBOOK

Part I: Historical Development of Measuring Governance

Part I focuses on the historical development of the measuring of governance. Its four chapters deal with the role of measuring in state formation, global governance and NPM, and with its constitutive effects on ways of governing political and social phenomena.

Part II: Theoretical Approaches to Measuring Governance

This part maps key theoretical approaches to measuring governance. The four chapters in this part examine how public administration theory, political science, sociology and governmentality studies conceive and analyze the measuring of governance.

Part III: Methods and Methodologies for Measuring Governance

Part III consists of six chapters that account for a variety of methodological approaches and methods for studying the measuring of governance. Some chapters describe and discuss methods for examining the processes, quality, effects and micro-foundations of measuring governance. Other chapters zoom in on methods for measuring the governance of supranational institutions, and on the ways in which stakeholders make sense of different types of measurement data.

Part IV: Fields of Measuring Governance

The final part examines how the measuring of governance is taking place in distinct policy fields. Its five chapters show how governance is measured within the labour market, public health care, central banking, aid and development, and democracy assessment.

REFERENCES

Amosse, T. (2022). Homo statisticus: A history of France's general public statistical infrastructure on population since 1950. In A. Mennicken & R. Salais (Eds.), The new politics of numbers (pp. 169–96). Palgrave Macmillan.
Ansell, C., & Gash, A. (2007). Collaborative governance in theory and practice. Journal of Public Administration Research and Theory, 18(4), 543–71.
Armitage, D., Dzyundzyak, A., Baird, J., Bodin, Ö., Plummer, R., & Schultz, L. (2018). An approach to assess learning conditions, effects and outcomes in environmental governance. Environmental Policy and Governance, 28(1), 3–14.
Arnaboldi, M., & Lapsley, I. (2008). Making management auditable: The implementation of best value in local government. Abacus, 44(1), 22–47. https://doi.org/10.1111/j.1467-6281.2007.00247.x
Arrowsmith, J., Sisson, K., & Marginson, P. (2004). What can 'benchmarking' offer the open method of co-ordination? Journal of European Public Policy, 11(2), 311–28.
Asadullah, M.N., & Savoia, A. (2018). Poverty reduction during 1990–2013: Did Millennium Development Goals adoption and state capacity matter? World Development, 105, 70–82.
Bevir, M. (2010). Democratic governance. Princeton University Press.
Boston, J., Martin, J., Pallot, J., & Walsh, P. (1996). Public management: The New Zealand model. Oxford University Press.
Bouckaert, G., & Van Dooren, W. (2015). Performance measurement and management in public sector organizations. In T. Bovaird & E. Loeffler (Eds.), Public management and governance (pp. 174–87). Routledge.
Bovaird, T. (2005). Public governance: Balancing stakeholder power in a network society. International Review of Administrative Sciences, 71(2), 217–28.
Brandsma, G.J., & Schillemans, T. (2013). The accountability cube: Measuring accountability. Journal of Public Administration Research and Theory, 23(4), 953–75. https://doi.org/10.1093/jopart/mus034
Cerny, P.G. (2010). The competition state today: From raison d'État to raison du Monde. Policy Studies, 31(1), 5–21.
Chandler, A. (1977). The visible hand: The managerial revolution in American business. Harvard University Press.
Clark, T., Foster, L., Sloan, L., & Bryman, A. (2021). Bryman's social research methods. Oxford University Press.
Cohn, B.S. (1996). Colonialism and its forms of knowledge: The British in India. Princeton University Press.
Comte, A. (1877). The system of positive polity. Longmans, Green & Co.
Considine, M., Lewis, J.M., O'Sullivan, S., & Sol, E. (2015). Getting welfare to work: Street-level governance in Australia, the UK, and the Netherlands. Oxford University Press.
Dahler-Larsen, P. (2014). Constitutive effects of performance indicators: Getting beyond unintended consequences. Public Management Review, 16(7), 969–86.
Davis, K.E., Fisher, A., Kingsbury, B., & Merry, S.E. (Eds.) (2012). Governance by indicators: Global power through classification and rankings. Oxford University Press.

Desrosières, A. (1998). The politics of large numbers: A history of statistical reasoning. Harvard University Press.
Diaz-Bone, R., & Didier, E. (2016). Introduction: The sociology of quantification – perspectives on an emerging field in the social sciences. Historical Social Research, 41, 7–26.
Durkheim, E. (1968). Suicide: A study in sociology. Routledge & Kegan Paul.
Emerson, K., & Nabatchi, T. (2015). Evaluating the productivity of collaborative governance regimes: A performance matrix. Public Performance and Management Review, 38(4). https://doi.org/10.1080/15309576.2015.1031016
Escobar, A. (1995). Encountering development: The making and unmaking of the Third World. Princeton University Press.
Ferguson, J. (1990). The anti-politics machine: 'Development', depoliticization, and bureaucratic power in Lesotho. Cambridge University Press.
Fioramonti, L. (2013). Gross domestic problem: The politics behind the world's most powerful number. Zed Books.
Fleischman, R.K., & Tyson, T. (1993). Cost accounting during the industrial revolution: The present state of historical knowledge. Economic History Review, 46(3), 503–17.
Fletcher, B., & Jones, F. (1992). Measuring organizational culture: The cultural audit. Managerial Auditing Journal, 7(6), 30–36.
Gorur, R. (2014). Towards a sociology of measurement in education policy. European Educational Research Journal, 13(1), 58–72.
Hacking, I. (1990). The taming of chance. Cambridge University Press.
Hadjisky, M. (2016). Diffusion or interaction? The New Public Management at the OECD-PUMA and GOV. Paper presented at the ECPR General Conference, Prague, September 2016.
Henman, P. (2021). Governing by algorithms and algorithmic governmentality: Towards machinic judgement. In M. Schuilenburg & R. Peeters (Eds.), The algorithmic society: Technology, power and knowledge (pp. 19–34). Routledge.
Henman, P., & Adler, M. (2003). Information technology and the governance of social security. Critical Social Policy, 23(2), 139–64.
Hood, C. (1991). A public management for all seasons? Public Administration, 69(1), 3–19.
Hood, C. (2006). Gaming in the target world: The targets approach to managing British public services. Public Administration Review, 66(4), 515–21.
Hoskin, K. (1998). Inverting understandings of 'the economic'. In A. McKinlay & K. Starkey (Eds.), Foucault, management and organization theory (pp. 93–110). Sage.
Hoskin, K., & Macve, R. (1994). Writing, examining, disciplining: The genesis of accounting's modern power. In A. Hopwood & P. Miller (Eds.), Accounting as a social and institutional practice (pp. 67–97). Cambridge University Press.
Kaufmann, D., Kraay, A., & Mastruzzi, M. (2007). Governance matters VI: Aggregate and individual governance indicators, 1996–2006. Policy Research Working Paper.
Kenis, P., Janowicz-Panjaitan, M., & Cambré, B. (2009). Temporary organizations: Prevalence, logic and effectiveness. Edward Elgar.
Klijn, E.-H. (2008). Governance and governance networks in Europe: An assessment of ten years of research on the theme. Public Management Review, 10(4), 505–25.
Klijn, E.-H., & Edelenbos, J. (2013). The influence of democratic legitimacy on outcomes in governance networks. Administration & Society, 45(6), 627–50.
Klijn, E.-H., & Koppenjan, J.F.M. (2014). Accountable networks. In M. Bovens, R.E. Goodin, & T. Schillemans (Eds.), The Oxford handbook of public accountability (pp. 242–57). Oxford University Press.
Kravchuk, R.S., & Schack, R.W. (1996). Designing effective performance-measurement systems under the Government Performance and Results Act of 1993. Public Administration Review, 56(4), 348–58.
Kroll, A., & Moynihan, D.P. (2017). The design and practice of integrating evidence: Connecting performance management with program evaluation. Public Administration Review. https://doi.org/10.1111/puar.12865
Land, K.C., & Michalos, A.C. (2018). Fifty years after the social indicators movement: Has the promise been fulfilled? Social Indicators Research, 136(1), 835–68.

Lewis, J.M. (2015). The politics and consequences of performance measurement. Policy and Society, 34(1), 1–12.
Lewis, J.M. (2016). The paradox of health care performance measurement and management. In E. Ferlie, K. Montgomery, & A.R. Pedersen (Eds.), The Oxford handbook of health care management (pp. 375–92). Oxford University Press.
Lewis, J.M., & Triantafillou, P. (2012). From performance measurement to learning: A new source of government overload? International Review of Administrative Sciences, 78(4), 597–614.
Marcussen, M., & Torfing, J. (2007). Democratic network governance in Europe. Palgrave Macmillan.
Mehrpouya, A., & Samiolo, R. (2016). Performance measurement in global governance: Ranking and the politics of variability. Accounting, Organizations and Society, 55, 12–31.
Mennicken, A., & Espeland, W. (2019). What's new with numbers? Sociological approaches to the study of quantification. Annual Review of Sociology, 45, 223–45.
Merry, S.E., Davis, K.E., & Kingsbury, B. (2015). The quiet power of indicators: Measuring governance, corruption, and rule of law. Cambridge University Press.
Milward, H.B., & Provan, K.G. (1998). Measuring network structure. Public Administration, 76(2), 387–407.
Moynihan, D.P. (2008). The dynamics of performance management: Constructing information and reform. Georgetown University Press.
Moynihan, D.P., Fernandez, S., Kim, S., LeRoux, K.M., Piotrowski, S.J., Wright, B.E., & Yang, K. (2012). Performance regimes amidst governance complexity. Journal of Public Administration Research and Theory, 21, 141–55.
Mügge, D. (2020). Economic statistics as political artefacts. Review of International Political Economy. https://doi.org/10.1080/09692290.2020.1828141.
Mykkänen, J. (1994). 'To methodize and regulate them': William Petty's governmental science of statistics. History of the Human Sciences, 7(3), 65–88.
Norris, P. (2011). Measuring governance. In M. Bevir (Ed.), The Sage handbook of governance (pp. 179–99). Sage.
Oman, C.P., & Arndt, C. (2010). Measuring governance. Policy Brief no. 39. OECD Development Centre.
Ørberg, J.W., & Wright, S. (2009). Paradoxes of the self: Self-owning universities in a society of control. In E. Sørensen & P. Triantafillou (Eds.), The politics of self-governance (pp. 117–35). Ashgate.
Osborne, S.P. (2006). The new public governance? Public Management Review, 8(3), 377–87.
Osborne, S.P., Radnor, Z., & Nasi, G. (2013). A new theory for public service management? Toward a (public) service-dominant approach. The American Review of Public Administration, 43(2), 135–58.
Pierre, J. (2000). Introduction: Understanding governance. In J. Pierre (Ed.), Debating governance: Authority, steering, and democracy (pp. 1–10). Oxford University Press.
Pierre, J., & Peters, B.G. (2000). Governance, politics and the state. Macmillan.
Pollitt, C., & Bouckaert, G. (2011). Public management reform: A comparative analysis (3rd ed.). Oxford University Press.
Pollitt, C., & Bouckaert, G. (2016). Public management reform: A comparative analysis (4th ed.). Oxford University Press.
Porter, M.E., Cornelius, P.K., Levinson, M., & Sachs, J.D. (2000). The global competitiveness report. Oxford University Press.
Porter, T.M. (1995). Trust in numbers: The pursuit of objectivity in science and public life. Princeton University Press.
Power, M. (1996). Making things auditable. Accounting, Organizations and Society, 21(2–3), 289–315.
Power, M. (1997). The audit society: Rituals of verification. Oxford University Press.
Power, M. (2003). Auditing and the production of legitimacy. Accounting, Organizations and Society, 28(4), 379–94. https://doi.org/10.1016/S0361-3682(01)00047-2.
Provan, K.G., & Kenis, P. (2007). Modes of network governance: Structure, management, and effectiveness. Journal of Public Administration Research and Theory, 18(2), 229–52. https://doi.org/10.1093/jopart/mum015.
Provan, K.G., & Milward, H.B. (2001). Do networks really work? A framework for evaluating public-sector organizational networks. Public Administration Review, 61(4), 414–23.

Raab, J., Mannak, R.S., & Cambré, B. (2015). Combining structure, governance, and context: A configurational approach to network effectiveness. Journal of Public Administration Research and Theory, 25(2), 479–511.
Radnor, Z. (2008). Hitting the target and missing the point? Developing an understanding of organizational gaming. In W. Van Dooren & S. Van de Walle (Eds.), Performance information in the public sector: How it is used (pp. 94–105). Palgrave Macmillan.
Rainey, H.G., & Milward, H.B. (1983). Public organizations: Policy networks and environments. In R.H. Hall & R.E. Quinn (Eds.), Organizational theory and public policy (pp. 133–46). Sage.
Rauschmayer, F., Berghöfer, A., Omann, I., & Zikos, D. (2009). Examining processes or/and outcomes? Evaluation concepts in European governance of natural resources. Environmental Policy and Governance, 19(3), 159–73.
Rhodes, R.A.W. (1997). Understanding governance: Policy networks, governance, reflexivity and accountability. Open University Press.
Salais, R. (2006). Reforming the European Social Model and the politics of indicators: From the unemployment rate to the employment rate in the European Employment Strategy. In M. Jepsen & A. Serrano (Eds.), Unwrapping the European Social Model (pp. 189–212). The Policy Press.
Schick, A. (1966). The road to PPB: The stages of budget reform. Public Administration Review, 26, 243–58.
Schmitt, J., & Beach, D. (2015). The contribution of process tracing to theory-based evaluations of complex aid instruments. Evaluation, 21(4), 429–47.
Shore, C. (2008). Audit culture and illiberal governance: Universities and the politics of accountability. Anthropological Theory, 8(3), 278–98.
Sørensen, E., & Torfing, J. (2005). The democratic anchorage of governance networks. Scandinavian Political Studies, 28(3), 195–218.
Sørensen, E., & Torfing, J. (2007). Theories of democratic network governance. Palgrave Macmillan.
Thévenot, L. (2001). Organized complexity: Conventions of coordination and the composition of economic arrangements. European Journal of Social Theory, 4(4), 405–25.
Thévenot, L. (2015). Certifying the world: Power infrastructures and practices in economies of conventional forms. In P. Aspers & N. Dodd (Eds.), Re-imagining economic sociology (pp. 195–223). Oxford University Press. http://www.idhes.cnrs.fr/wp-content/uploads/2015/10/TH_Noors5-copie.pdf. Last accessed: 15 April 2023.
Timmins, N., & Gash, T. (2014). Dying to improve: The demise of the Audit Commission and other improvement agencies. https://www.instituteforgovernment.org.uk/sites/default/files/publications/Dying to Improve - web.pdf. Last accessed: 12 April 2023.
Tonkiss, K., & Skelcher, C. (2015). Abolishing the Audit Commission: Framing, discourse coalitions and administrative reform. Local Government Studies, 41(6), 861–80.
Tooze, J.A. (2001). Statistics and the German state, 1900–1945: The making of modern economic knowledge. Cambridge University Press.
Torfing, J., Peters, B.G., Pierre, J., & Sørensen, E. (2012). Interactive governance: Advancing the paradigm. Oxford University Press.
Torfing, J., Krogh, A.H., & Ejrnæs, A. (2020). Measuring and assessing the effects of collaborative innovation in crime prevention. Policy & Politics, 48(3), 397–423.
Triantafillou, P. (2007). Benchmarking in the public sector: A critical conceptual framework. Public Administration, 85(3), 829–46.
Triantafillou, P. (2008). Normalizing active employment policies in the European Union: The Danish case. European Societies, 10(5), 689–710.
Triantafillou, P. (2015). Doing things with numbers: The Danish national audit office and the governing of university teaching. Policy and Society, 34(1). https://doi.org/10.1016/j.polsoc.2015.03.002.
Triantafillou, P. (2020). Accounting for value-based management of hospital services: Challenging neoliberal government from within? Public Money and Management. https://doi.org/10.1080/09540962.2020.1748878.
Tribe, K. (2008). Continental political economy: From the physiocrats to the marginal revolution. In T.M. Porter & D. Ross (Eds.), The Cambridge history of science (pp. 154–70). Cambridge University Press.
Van Dooren, W., Bouckaert, G., & Halligan, J. (2010). Performance management in the public sector. Routledge.

Vestergaard, J. (2009). Discipline in the global economy? International finance and the end of liberalism. Routledge.
Voorn, B., Genugten, M. van, & Thiel, S. van. (2020). Performance of municipally owned corporations: Determinants and mechanisms. Annals of Public and Cooperative Economics, 91, 191–212.
Wahlberg, A., & Rose, N. (2015). The governmentalization of living: Calculating global health. Economy and Society, 44(1), 60–90.
Wal, M. van der, Kraker, J. De, Offermans, A., Kroeze, C., Kirschner, P.A., & Ittersum, M. van. (2014). Measuring social learning in participatory approaches to natural resource management. Environmental Policy and Governance, 24(1), 1–15.
Weber, M. (1978). Economy and society (Vols. 1 & 2). University of California Press.
Wildavsky, A. (1969). Rescuing policy analysis from PPBS. Public Administration Review, 29, 189–202.
Williams, A., & Siddique, A. (2008). The use (and abuse) of governance indicators in economics: A review. Economics of Governance, 9, 131–75.

PART I HISTORICAL DEVELOPMENT OF MEASURING GOVERNANCE

1. State formation and statistics
Cosmo Howard

INTRODUCTION

Modern government is impossible without statistics. Administrative bodies maintain enormous datasets for the purposes of routine management of public programmes. Legislatures and the media often demand access to these figures as a vital information source for holding governments accountable for their actions and inactions. Some public agencies exist solely to publish 'official statistics', the definitive indicators of national life. Many countries conduct a population census, a key shared national experience that helps communities understand themselves and distinguish their societies from others. Meanwhile, political leaders continuously invoke statistics in their public statements and debates with opponents. Yet, while statistics are taken for granted today as a core instrument of governing, they were not always central to rulers' exercise of authority. This chapter offers an overview of how the relationship between statistics and the state developed and evolved. It engages with a small body of literature, spanning several disciplines including political science, public policy, sociology, history and philosophy, which addresses the relationship between statistics and the emergence and evolution of the modern state. The chapter presents a synthesis of this literature by firstly exploring a conundrum: statistics are central to virtually every practice of governing, yet statisticians insist their work is apolitical. Next, the chapter presents a novel framework for understanding the relationship between statistics and the state, which shows that the relationship is shaped by four dimensions: prevailing ideas about how to count and govern; dominant social, economic and political interests; the structure and behaviour of state institutions; and questions of national identity. Following this, it addresses a case study of the role of statistics in state formation and re-formation in Australia, to illustrate how ideas, interests, institutions and identities have interacted in complex ways throughout that country's history to influence the evolving relationship between statistics and government.

A PARADOXICAL RELATIONSHIP

Statistics are key to the formation of the modern state. Here it is useful to define the terms statistics and state formation. The word 'statistics' derives from the same etymological roots as 'state', reflecting the fact that statistics originally referred to facts or knowledge about the state, that could be both quantitative and qualitative (Starr, 1987). Statistics took on their modern meaning and use, that is, quantitative data derived from observations of large populations, in the nineteenth century, as they shed their direct association with the study of the state and came to be associated with increasingly professionalised and quantitative natural and social sciences (Gigerenzer & Swijtink, 1989). Meanwhile, state formation can be defined using Max Weber's concept of the state as a form of human community that successfully

claims a monopoly on legitimate physical force within a given territory, typically by ceding authority to a political-administrative organisation that is usually termed government. Thus, in the modern state the government has sovereign authority over an area and its inhabitants. Given this definition, the focus of this chapter is on the period since the seventeenth century, when the idea and practice of modern sovereignty first emerged. In this chapter, state formation is defined as a dynamic and always unfinished business of asserting sovereignty. While it may be tempting to conceive of state formation as a single event or specific period, such as the birth of a nation, state formation is better understood as a chain of ongoing struggles that interconnect ideas, interests, institutions and identities. Focusing on state formation as a one-off process neglects the fact that states are 'reformed' more or less continuously, although periods of reform are often interspersed with long periods of stability and equilibrium, and reforms can occur at different levels and have differing impacts on the operations of government and the state's relations with civil society (Hall, 1993). While statistics are widely acknowledged to be necessary tools of the modern state, modernity also features a powerful discourse of statistical objectivity. The language, concepts and values associated with this discourse – neutrality, science, professionalism, expertise – have differed across time and space, but the basic claim that statistical methods and data can record and represent events in ways that are insulated from subjective judgement and societal interests is several centuries old (Hacking, 1990). Furthermore, and despite (or perhaps because of) the rise of post-modernism and post-truth discourses, this claim of objectivity has been promoted and extended through international regimes of norms in recent decades. A growing body of international statistical standards encourages harmonisation of definitions and indicators across states to facilitate trade and investment, often drawing resistance from local communities (Higgins & Larner, 2010). Also, international agencies, including the United Nations, International Monetary Fund, World Bank, European Union, and credit ratings bodies, all promulgate statements of principles insisting that official statistics must be apolitical, with threats to 'name and shame' or lower the credit ratings of any government seeking to use statistics for political ends (Howard, 2021). Statistics are core to what Christopher Hood (1983) calls the 'nodality' function of government, referring to the state's unique position as a hub or intersection in the flows of information throughout society. The state's capacity for nodality depends on its statistical information being public, something that has not always pertained historically. It was the rise of liberal, democratic and nationalist ideas in the nineteenth century that encouraged a shift towards publication of government statistics, in part to provide citizens with a 'mirror' to solidify shared national identities, and also to provide an information resource to guide the free market and academic research (Hacking, 1990). But nodality also depends on the other core functions of the state – what Hood calls authority (laws and regulations), treasure (expenditure and taxation) and organisation (public administrative bodies).
The next section explores dimensions of the relationship between statistics and state formation and shows how the modern state’s nodality function is integrated and interdependent with other state functions.

DIMENSIONS OF THE RELATIONSHIP

Research on the relationship between state formation and statistics has generated a complex and highly differentiated picture of interactions between data gathering and government.

Numerous case studies have detailed the nuances of these relationships in individual countries, as well as subtle and dramatic shifts at different points in time. There is no simple and universal chronology of statistics and state formation. Nor is it possible to capture all this nuance in a book chapter. In what follows, I present a general summary of research on statistics and state formation under four thematic headings: ideas, interests, institutions and identities.

Ideas

Statistics and state formation have influenced each other partly through the medium of ideas, or how actors make sense of their worlds. The causal relationship between statistical ideas and governing ideas has always been bi-directional. Ideas developed through early statistical experimentation and practice have been important for the operation of the sovereign state, but equally, the sovereign state and its political and policy discourses have made possible and encouraged certain statistical practices. The earliest examples of large-scale data collection by the state preceded the modern disciplines and techniques of statistics by two centuries, and were spurred by prevailing ideas about good government, as well as the dominant policy problems that occupied the minds of elites and statesmen. From the sixteenth to eighteenth centuries, population was a fixation for rulers who embraced cameralist and mercantilist ideas of government. In these governing doctrines, population was a source of national wealth, and the role of government was to increase the population and its productivity (Porter, 1986). At the same time, population, especially in urban environments, presented a potential threat of disorder, both in political and public health terms, so it needed to be closely monitored and policed, with an assumption that the morality of the population was connected to its vitality and productivity. Early statistical collections in France and Germany were motivated by governments' desires for information on trade, agricultural outputs and their populations for the purposes of planning and introducing measures to boost output and protect local industries from outside competition. According to Ian Hacking (1990), this administrative effort to collect data had an unintended effect, spurring an 'avalanche of numbers' that provided material for groups of 'statistical amateurs' and mathematicians working within administrations to start to observe patterns and tendencies at the level of the society, where previously they had focused attention on the individual and the family:

[Statistics] gradually reveal[ed] that the population possesses its own regularities: its death rate, its incidence of disease, its regularities of accidents. Statistics also shows that the population also involves specific, aggregate effects and that these phenomena are irreducible to those of the family: major epidemics, endemic expansions, the spiral of labour and wealth. Statistics also shows that through its movements, its customs, and its activity, population has specific economic effects. (Burchell et al., 2009, p. 104)

This view of the population provided the conceptual underpinnings of a new kind of government, in which the sovereign ruler could acquire knowledge of the individual without the intercession of the head of the family or the local nobility, and then convert this knowledge into social aggregates. In this way, the ‘biopolitics’ of the early modern states on the European continent, focused on the management of growing populations to secure their economic productivity, depended on statistical ideas of population, which in turn developed out of the early bureaucratic efforts to collect detailed information about people and commerce.

Mercantilism and its repressive 'police state' gave way, at different times in different places, to liberal notions of freedom and democracy. Yet again statistical ideas played an important role in shaping this new form of governance. Censuses were important for apportioning representation in electoral systems, and became a site of political conflict as a result (see below). Furthermore, the paradoxical question of how to govern 'free' people in liberalism was also partly addressed via statistical ideas, including emerging notions of chance, averages and normality. By the nineteenth century, statistical thought was coming to focus on probabilities rather than certainties; the world was not determined, but also not random; it was subject to laws of chance (Hacking, 1990). Statistics revealed patterns of probability, allowing the state to shift its policy focus from policing social discipline to managing collective risks to the economy and social wellbeing. Furthermore, statistics produced a portrait of 'l'homme moyen' – the average man/person – and in so doing also helped build an idea of the atypical or aberrant citizen (Porter, 1986). This fed the development of a new form of government preoccupied with governing deviations from the mean or norm, especially at the 'bottom' of society. Liberal government could use statistical knowledge to uphold freedom by focusing on the correction of a few deviant subjects, rather than repressively policing the whole population. Liberal ideas also evolved. While early nineteenth-century liberalism emphasised the importance of freedom and the efficacy of the free market, later the rising awareness of power differentials and growing urban poverty, itself a product of the statistical enquiries of social reformers, spurred a new, more social form of liberalism that advocated intervention to protect and support the poorest. The economic instability of the early twentieth century also provoked a re-evaluation of the effectiveness of free markets, and the rise of new economic ideas about the need to consider the economy at a 'macro' level as a series of aggregates, rather than interactions of individuals coordinated by an invisible hand (Desrosières, 2002). Paralleling earlier ideas of population trends, these ideas, developed by John Maynard Keynes and others, were built on new aggregate statistical concepts – national accounts, gross national product, unemployment and aggregate demand. New data collections in many countries, which measured aggregate phenomena such as unemployment for the first time (Walters, 2000), gave governments tools to engage in 'discretionary' fiscal and monetary policies, calibrating governments' stimulus or contraction of the economy to the prevailing macroeconomic conditions. Furthermore, more recent neoliberal ideas, which champion a retreat of government from discretionary macroeconomic management, also rely heavily on statistical measures. The rise of 'depoliticised' economic management, epitomised by independent central banks, depended for its effectiveness and legitimacy on inflation statistics produced at arm's length from the government of the day (cf. McNamara, 2002). Neoliberal ideas also shifted governments' policy focus away from managing risks at the level of broad social categories, towards a greater emphasis on individual behaviour and responsibility, and the ways in which state intervention can undermine incentives.
Several types of statistical analysis, including longitudinal and life course studies, have been essential to supporting neoliberal discourses about how Keynesian welfare states created patterns of intergenerational dependency, and have been used to justify highly targeted activation programmes, as well as the residualisation of welfare systems (Dean, 2010).

Interests

The discussion so far has treated the state as a unitary actor with a coherent agenda for governing. In reality, the state is an arena where different societal interests vie to influence public policy. Statistics provide a mechanism for interests to influence public policy, in several ways. Historically, groups have used statistics produced outside the state to pressure government to act. Woolf (1989) shows how elites in the UK in the eighteenth and nineteenth centuries used private surveys and financial data to translate their economic interests and social concerns into policy influence. Another powerful channel of influence is via attempts by societal interests to shape statistics produced within government. From a pluralist theoretical standpoint, the state can be seen as a kind of 'cash register' for interests, where different societal interests influence public policy in proportion to their societal prevalence, and their influence shifts over time. On this view, government statistical collections reflect the prevailing interests and concerns of civil societies at any given time (Rose, 1991). In practice, not all interests carry equal weight, and this extends to the production of official statistics (Slattery, 1986). This is illustrated in accounts of official statistics in the nineteenth century in the USA and UK, where industries were able to mobilise state support for the collection of data that favoured their interests in terms of regulation and subsidies (Desrosières, 2002; Hacking, 1990). Differences in the status and roles of government statistical agencies are discussed further under the heading of institutions below. Another key interest shaping the role of statistics within the state is political parties. However, the impact of parties and party ideologies on official statistics is complex. Some research suggests parties of the left seek to expand government statistical collections and are more willing to grant and maintain autonomy for government statisticians (Howard, 2021). High-profile cases of political interference in, and cutbacks to, official statistical collections, such as the Thatcher government in the UK (Tant, 1995), the Reagan, Bush and Trump administrations in the USA (Alonso & Starr, 1987; Howard, 2021) and the Harper government in Canada (Howard, 2022), show Conservative governments are sometimes hostile to statistical collections. Part of their objection is the administrative cost, and they also raise concerns about respondent burden and invasion of privacy. Another important factor is that statistics can be used to justify expanded policy measures and hold governments accountable for failing to solve policy problems. Yet governments on the right also seek to collect statistics, and under neoliberalism, this can take the form of surveillance of poor and marginalised populations and algorithmic tools designed to individualise risk profiles (Henman & Dean, 2010). Donald Trump's controversial and ultimately unsuccessful effort to include a question on citizenship on the 2020 census is an example of a right-wing administration's desire for more data, as is the Australian government's use of its country's statistical agency to carry out a de facto plebiscite on same-sex marriage (discussed further below) (Howard, 2021).
Institutions

In the contemporary age of 'big data', when everyone is continuously generating statistical information through their constant use of digital technologies, it can be tempting to think that data produce themselves, and that citizens are willing to volunteer the most intimate details to be captured and repurposed into data by companies and governments. This impression is inaccurate today, and was even more so in earlier eras. Statistics do not produce themselves; they

are produced through social relations, in which institutions play a key role. We can unpack this role by addressing institutions at two levels: underlying governance frameworks of nation-states; and the special bureaus responsible for producing statistics. The underlying institutional features of states shape the role of statistics in government. These include whether a state is unitary or federal. In a federal state, the question of where to institutionally locate different aspects of the official statistical production process needs to be addressed. Different federations have different arrangements – Germany, for example, has a devolved system while Australia, Canada and the USA have national systems (although these rely in different ways on cooperation with sub-national governments). Institutional factors also include fundamental constitutional relationships between the different branches. In the USA's separation of powers system, where individual members of Congress scrutinise spending bills and pursue their own interests and agendas, statisticians answer directly to members of the legislature and often serve them directly; in Canada, the statistical agency can be directed in limited ways by the executive, but not by individual members of Parliament (Howard, 2021). State cultures, or the underlying beliefs about the proper relations between the state and citizenry, also shape statistics. 'Public interest' state cultures tend to encourage policy responsiveness, while the public law-based traditions of continental Europe encourage more continuity in statistical measures (cf. Pierre, 1995). Finally, administrative traditions also shape the relationship between statistics and the state. A country with a prevailing 'administrative bargain' that emphasises independent 'technical trusteeship' in its public services will tend to give its statisticians more autonomy than one based on public service loyalty to the elected government (cf. Hood & Lodge, 2006). Another critical institutional dimension concerns the dedicated statistical agencies states maintain to collect, store, analyse and publish official statistics. These can be found in every modern state, though their form and function vary significantly (Edmunds, 2005; Howard, 2021; Starr, 1987). A full account of this variation is impossible here, but the key dimensions of difference can be highlighted. One can distinguish between centralised and decentralised statistical systems (UNSD, 2003). Centralised statistical systems concentrate authority and/or capacity to collect and disseminate statistics in an individual national bureau. Examples include Australia, Canada and Ireland. Decentralised systems spread responsibility for official statistics among multiple units, sometimes closely linked to policy areas. The USA is an example of a decentralised statistical system. Some commentators argue that centralised systems tend to be more independent and less responsive to policy makers (Martin, 1981), but this is an overgeneralisation, as central agencies can be formally subject to political control (as is the case in Canada, although limits on this control were strengthened in 2017), while decentralised statistical agencies can be insulated from control (as has been the case in Sweden since 1994). Some countries have 'hybrid' statistical institutions, with a national statistical agency operating alongside decentralised units (Sweden and the UK are examples).
The reasons these different arrangements emerged are complex and historical (Alonso & Starr, 1987; Desrosières, 2002; Howard, 2021). Statistical institutions are not fixed. State cultures and structures undergo reform. Neoliberalism has had a powerful impact on public administration, through reform programmes such as the New Public Management (NPM), which has had complex implications for official statistics (Howard, 2021). On the one hand, NPM proposes opening state functions to competition and making state agencies more responsive to the government of the day. This has tended to place statistical agencies under increased political and budgetary pressure to

deliver data tailored to the specific agendas of the government of the day (Howard, 2019). At the same time, NPM also emphasises the depoliticisation of day-to-day administration, and the need for an arm's length relationship, especially where the credibility of economic policy making is at stake. This helps to explain why several statistical agencies have experienced increased formal authority in recent decades (Howard, 2021).

Identities

The modern state derives its legitimacy from a claim that its territories and institutions encompass a community with a shared identity – a group with common experiences, beliefs, values and/or aspirations. Statistics can support the state's claim to reflect and defend the identity of its people in several ways. Where people have different political values but share and express a patriotic commitment to democratic procedures and public institutions such as a constitution, statistics can act as a baseline for the authority of the system's procedures. In the USA, the constitution mandates the conduct of a decennial census to determine electoral apportionment in the House of Representatives. This 'nominal' approach to conducting the census, where individuals are counted and consulted directly, rather than relying on the local clergy or aristocracy as informational gatekeepers, was a radical idea until late in the nineteenth century, and was important to cementing democratic identities (Curtis, 2002). A population can also develop a national identity, comprised of history, culture, language and/or religion. The fusion of nation and state, where the state and its territory are seen as a 'container' for the national society (Beck, 2000), is also aided by modern statistics. Nations, as Benedict Anderson (2006) has noted, are constructs and derive their power from simplification, abstraction and exclusion. To create nation-states, some of the complexity and messiness of cross-cutting cultural affiliations, histories and practices must be erased. Anderson suggests that the census is a key tool for achieving this erasure. It was used by governments and colonial administrators to create a 'total classificatory grid', suggesting an exhaustive portrait of the nation, neatly sorting people into class, ethnic and religious containers, while labelling those who did not fit within the dominant framing of national identity as 'others'. In some cases, statistical institutions reinforced particular framings of national identity by directly excluding minority populations, including Indigenous people, from censuses (Leibler & Breslau, 2005). Some authors have constructed typologies to capture different periods of statistical governance through the era of the modern state (Alenda-Demoutiez, 2022). French statistical historian Alain Desrosières (2011) proposed five eras and identified statistical collections that were dominant in each period. However, this classification is focused on political economy, and while it addresses the role of statistical and economic ideas, it does not provide details about the changing institutional bases of official statistics, nor about the kinds of social identities connected to state formation and re-formation throughout this period. Beaud and Prévost (2010) provide another typology which offers more promise for addressing these dimensions.
They suggest an historical trajectory from the 'proto-statistical' regimes before the mid-nineteenth century, to the statistical nationalisation of the late nineteenth and early twentieth centuries, followed by a period of Keynesian 'statistical macro-management', and finally, a period of neoliberal globalisation. Importantly, each of these periods addresses the institutional arrangements within official statistics. Beaud and Prévost's (2010) typology can be joined with our themes of ideas, interests, institutions and identities discussed above, to generate a four by four matrix (Table 1.1).

Table 1.1  Elements of historical statistical regimes

Proto-statistical
Ideas/instruments: Political arithmetic, mercantilism, Malthusian. Census, trade and commerce statistics.
Interests: Monarchs, administrative and military elites.
Institutions: Ministries, ad hoc statistical offices, some national statistical agencies.
Identities: Traditional social estates, the family.

Statistical nationalisation
Ideas/instruments: Nationalism, liberalism, laissez-faire, democracy.
Interests: Policy makers, philanthropists and social reformers.
Institutions: National statistical agencies, line and central ministries.
Identities: Nationalism, the family.

Statistical macro-management
Ideas/instruments: Fordism, Keynesianism, social security and welfare, national accounting.
Interests: Policy makers, labour and domestic capital.
Institutions: National statistical agencies, line and central ministries.
Identities: Social class, nationalism.

Neoliberal globalisation
Ideas/instruments: Liquid modernity, reflexive modernity, individualisation.
Interests: Policy makers, global capital, new social movements.
Institutions: National statistical agencies, line and central ministries, corporate and third sector.
Identities: Globalism, multiculturalism, entrepreneurialism, intersectionality.

In the next section of the chapter, this framework is applied to the Australian case.

STATE FORMATION AND STATISTICS IN AUSTRALIA

This section of the chapter presents a case study of the role of statistics in state formation in Australia. Australia has, at the time of writing, a self-described 'centralised' statistical system, with the Australian Bureau of Statistics (ABS), an autonomous agency of the federal government, designated as the country's national statistical office. While the ABS enjoys cross-partisan support in Australia, with high levels of public trust and an enviable reputation for statistical excellence, the history of official statistics in the country illustrates the complex power dynamics between statistics and governing outlined in the previous section. Consistent with the preceding discussion, this section does not treat Australian state formation as a single event in time, but rather as a continuous process that features a series of 'critical junctures', where the state is formed and reformed to reshape both its mode of operation and relationship to civil society. Australia has not received much attention in the literature on statistics and state formation (but see Howard, 2019, 2021; Howard & Bakvis, 2015). A distinctive feature of Australian official statistics is that, because of the relatively recent formation of the country as a federation in 1901, Australia skipped Beaud and Prévost's 'proto-statistical' phase, a point discussed further below. Therefore, this case study focuses on three key periods in Australian statistical history and state formation: statistical nationalisation, statistical macro-management and neoliberal globalisation. In each of these periods, the discussion addresses the themes of ideas, interests, institutions and identities.


STATISTICAL NATIONALISATION

Although calls for joining the separate British colonies that occupied parts of the Australian mainland and Tasmania into a single country started in the middle of the nineteenth century, the discussion and debates surrounding federation began in earnest in the late nineteenth century. At this time, several political ideas prevailed in Australia that would influence the role of statistics in state formation. Social liberalism captured the imaginations of key actors, including early prime ministers, political parties, administrators and judges. This ideology suggested the state should be actively involved in protecting the least well-off, ensuring minimum wages sufficient to support a family (Sawer, 2003). These ideas fed the development of generous minimum wages for workers in select industries, as well as pioneering efforts to provide pensions to seniors. Infamously, this period also saw the emergence of the 'White Australia Policy', an explicitly racist immigration regime built on ideas of racial purity and white supremacy. Racism also drove the exclusion of Aboriginal and Torres Strait Islander peoples from the census count. This collection of policies – innovative welfare, progressive minimum wages, racist and paternalistic regulation – reflected a view that the state should act as a 'vast public utility' to secure a better standard of living for the white, Anglo-Saxon majority (cf. Hancock, 1931). While there was consensus about the need for state intervention and the importance of racial policies, Australia's period of statistical nationalisation also featured a contest between political interests in several dimensions. One was regional: different parts of the country sought to gain advantages out of federation for themselves in terms of policies that would aid their economies. To this end, some colonies favoured free trade, while others were in favour of protectionism of domestic production. These interests spilled over into statistics, where the colonies each wanted statistical measures that supported their economic interests (ABS, 2005). This meant that although there was in-principle agreement that census taking and statistics should be a federal power in the new country, the Constitution did not make it an exclusive national power, leaving the question of how to distribute statistical work in the federation to future parliaments. Another key interest was the labour movement. Labour shortages, economic downturns and violent crackdowns on strikes in the closing decades of the nineteenth century spurred the formation of Labor parties, who saw that direct industrial action had failed and were determined to take advantage of Australia's early embrace of the universal franchise to win government. This parliamentary socialist strategy paid off with Australians electing the first Labor governments in the world, first in the Colony of Queensland, then at the national level in 1904. A succession of Labor, Liberal and Conservative governments then had to contend with rising union power and the issue of inflation, which threatened the nascent federation with political turmoil and economic stagnation. These interests came to be managed in an institutional bargain, the so-called 'Australian Settlement', which provided workers with guaranteed minimum wages and enforced a system of court-based arbitration of wage disputes (Sawer, 2003).
Statistics were central to this settlement, because the courts, confronted with conflicting demands from employers and labour, relied on impartial inflation data in order to secure the legitimacy of arbitration decisions (Howard, 2021). As such, one of the most politically important tasks of the new statistical agency, the Commonwealth Bureau of Census and Statistics (CBCS), established in 1905, was to produce inflation statistics in a way that was seen as beyond politics, in a context where

competing class interests were looking for reasons to challenge inflation numbers in their favour. Meanwhile, Australia had also inherited a complex institutional legacy of official statistics from the colonial period. Each colony had its own statistical office, and most were well regarded as competent and independent enumerators (ABS, 2005). At the same time, these offices tended to use different definitions of certain measures, partly due to expedience and partly because different measures helped their governments' claims for a better deal in the federation. Following federation, the new states (each of which inherited the administrations of their colonial precursors) were mostly reluctant to transfer their statistical powers to the new federal statistical agency. The federal government decided not to force the issue, and to leave the challenge of unifying statistics to the statistical offices themselves (Howard, 2021). In terms of identity, statistics in this early period of Australian state formation were promoted as a means to attract the right kind of migrant to Australia amidst a strong focus on populating the country to address labour shortages. From 1908 the statistical agency produced a 'glossy annual publication with plenty of images and an ornate design' to attract British migrants to Australia (ABS, 2005, p. 32). Official statistics also excluded or distorted demographic facts that contradicted the image of a white egalitarian utopia. Indigenous peoples were excluded from the population census until 1967, although some were counted through the derogatory category of 'half-castes' between 1933 and 1966. Chief statisticians wanted to include Indigenous Australians, but Parliament did not, as this would change the allocation of seats in federal parliament, which they opposed as Indigenous people could not vote at the time (ABS, 2005, p. 116). Yet this meant that the states, which had exclusive responsibility for Indigenous people under the constitution until 1967, had inadequate demographic data on this group. Meanwhile, census respondents who were judged European (white) in appearance were asked about their race, but for those assessed not to be European, the census collector would impose a racial category. This was to preserve the integrity of the White Australia Policy from non-white imposters (NSW Census Report 1891: 188–5 in Horn, 1987). Furthermore, all people of 'European' racial origin were categorised together whether born in Australia or not (Arcioni, 2012). In addition to problems with racial classifications, income and wealth were not measured in the census until 1933, while state statistics on incomes were also incomplete, such that statisticians and economists must estimate the income distribution during this period (Panza & Williamson, 2021). Australian government statistics of the early twentieth century thus served to promote an Australian identity that excluded non-whites and occluded inconvenient realities of income and wealth inequality.

STATISTICAL MACRO-MANAGEMENT

While Australia was a pioneer of wage regulation and some aspects of social welfare provision, its leaders did not embrace Keynesian ideas of macroeconomic management until after World War II. Yet, Australian economists and statisticians were enthusiastic early adopters of national income accounting (Whitwell, 1986). The national statistical bureau took up macroeconomic accounting to fill gaps left by state statistics and distinguish itself and its capabilities from state statistical agencies (ABS, 2005, p. 34). By the start of the 1930s the national statistical agency had such economic prestige that its head, the Commonwealth Statistician, was appointed the Chief Economic Adviser to the government (ABS, 2005, p. 33). After a brief tenure of Labor

governments in the immediate postwar years, Australian politics was then dominated for two decades by Conservative governments committed to states' rights and opposed to social policy expansion at the national level. As a result, federal data collection on social issues was limited, and the national agency conducted few sample surveys. This became a political problem with the election of the Whitlam Labor government in 1972, which had an ambitious social agenda, but complained that existing statistics were not fit for the task (ABS, 2005, p. 128; Committee on Integration of Data Systems, 1974). The Whitlam government initiated a review headed by political science academic and former public servant Leslie Crisp (Committee on Integration of Data Systems, 1974) that led to an important institutional reform, discussed further below. In terms of interests, the regional differences and disputes over who was to collect statistics gradually faded in the first half of the twentieth century, and state statistical offices progressively joined the national statistical agency of their own accord. Statistical frictions between regional interests were then replaced by a new set of tensions at the national level. Different line ministries, each with their own client industries that they wished to support, jealously guarded their statistical collections against consolidation and harmonisation. The postwar era saw the extension of economic protectionism under the slogan of 'protection all-round', where workers and employers both benefited from import tariffs and quotas. As the Whitlam government began the process of moving Australia away from its reliance on protectionism, it was again stifled by a lack of reliable statistics, since many of the relevant numbers were produced in ministries that had a vested interest in downplaying the cost of protectionism and exaggerating its benefits:

They'd try to get some statistics out that would show something about the costs of protection and what it meant in terms of employment and so on, and they'd find the Department of Industry and Commerce putting forward the views of manufacturers who would have a different set of statistics … so there would be interminable arguments about the statistics. (John Miller in ABS, 2005, p. 24)

As a result of these concerns, Whitlam's inquiry into the statistical system was also asked to look at how the supply of data could be made more consistent, comprehensive and efficient (Committee on Integration of Data Systems, 1974). The review was also tasked with addressing the institutional position of the CBCS within the government. The agency was not formally autonomous, but rather a division of the Department of the Treasury. Whitlam's government clashed with the Treasury, the former feeling the latter was not supportive of the government's expansive social agenda (Wanna et al., 2000, p. 74). Having the national statistical agency located in the Treasury was therefore seen as part of the problem since the agency would not have an incentive to produce social statistics to support policies its parent department did not want. While several influential voices called for the CBCS to be moved from the Treasury into another portfolio (Wanna et al., 2000, p. 93), the Crisp review concluded that Treasury was the best place for it given the Treasurer did not have sectoral interests and was very powerful in Cabinet. Yet Crisp acknowledged the perception of Treasury's excessive influence and recommended making the agency formally independent. At the same time, Crisp recommended addressing the proliferation of alternative data sources by legally empowering the ABS to be the provider of all statistics nationally, to ensure coordination, as well as economies of scale, an important consideration in the late 1970s given the reliance on manual processing and the then high cost of computing technology in statistics.

The Whitlam government introduced legislation to implement these recommendations, and after a protracted debate in both houses of Parliament, the Australian Bureau of Statistics Act was passed in 1975. It reflected the Whitlam government's fondness for arm's length statutory authorities as an institutional model, making the national statistical agency an autonomous body, with a head appointed for a fixed term who could not be terminated by the government of the day. The new legislation also stipulated that the ABS was responsible for all official statistics in Australia, in line with the notion that greater centralisation of statistics was a necessary and desirable development. The postwar period also saw the beginnings of significant shifts in Australian identity. The last official elements of the White Australia Policy were removed in the mid-1960s, and in the 1970s multiculturalism emerged as a new policy with bipartisan support. In a landmark constitutional referendum in 1967, Australians voted overwhelmingly to include Aboriginal and Torres Strait Islander people in the census, and to give the national government the power to make laws for their benefit. 'Race' was removed from the census in 1966, replaced with questions about 'origin' throughout the 1970s, and then 'ancestry' starting in 1986 (Horn, 1987). As Horn (1987) notes, the new ancestry question was no longer geared to separating white Australians from non-whites, but now served the positive identification of different ethnicities in order to promote a multicultural identity and guide public support for different ethnic groups. The 1970s also saw the emergence of social surveys that shaped Australians' understanding of their economic conditions and the image of Australia as an egalitarian paradise. Australia experienced its postwar 'rediscovery of poverty' during the Whitlam government, which created an official inquiry into poverty and welfare. The early surveys conducted by the new bureau were directly connected to this inquiry (ABS, 2005, pp. 120–21). While Australians had historically been opposed to questions about income, wealth and household expenditure, the ABS ran its first household expenditure survey in 1975 (ABS, 2005, p. 121). These new surveys of income, wealth and poverty brought to policy makers' attention the problems of poverty in Australia, and helped to drive a new era of social policy expansion in areas such as single parent payments, education subsidies and universal public health insurance (Edwards et al., 2001).

NEOLIBERAL GLOBALISATION

Australia, like many countries, experienced a significant shift in dominant policy ideas with the end of the long boom in the early 1970s. From the mid-1970s governments progressively shifted Australia's policy framework to favour competition in the domestic economy, on the international stage, and in the operation of public services. At the same time, Australia's 'third way' Labor governments of Bob Hawke and Paul Keating sought to pick up the social policy mantle dropped after the brief Whitlam period. They demanded new statistics in health and social welfare, and used central agencies to pressure the ABS to develop a stronger 'user' orientation, consistent with prevailing NPM ideas about administrative responsiveness, as well as imposing annual 'efficiency dividends' (budget reductions) on the agency. Under the Conservative government of John Howard, emerging neoliberal discourses of welfare reform promoted ideas about intergenerational welfare dependency, and generated policy interest in innovative longitudinal and 'life course' statistical research from the USA.

Thus, whereas existing scholarship on the neoliberal era has tended to portray it as a period of statistical cutbacks and blackouts (see, for example, Alonso & Starr, 1987; Ramp & Harrison, 2012; Tant, 1995), the Australian picture is more complex. This complexity derives from several sources. Firstly, neoliberal ideas were not the only influential policy discourse of the period. Interests and advocacy coalitions also shaped statistics. The Hawke and Keating governments, influenced by the work of committed 'femocrats', sought to advance women's equality and funded the creation of statistics such as domestic violence data (ABS, 2005, p. 133). The Howard government initiated longitudinal research to understand causes of long-term poverty and Indigenous disadvantage. Institutionally, the neoliberal period saw a proliferation of new statistical programmes and agencies (Howard, 2019). While the reforms of the 1970s sought greater centralisation and consolidation of official statistics, by the 1980s the emphasis had shifted to NPM ideals of competition, user charging and value for money. In this environment, the Australian national statistical agency was sometimes criticised for being slow and/or inflexible in meeting the demands of policy departments (Australian Public Service Commission, 2013; Howard, 2019). This reflects a dilemma for statistical agencies as they engage with the state in terms of how much independence to insist upon, and the limits this places on responsiveness. When the ABS experienced an embarrassing website shutdown on census night in 2016 due to a distributed denial-of-service attack, the agency was subjected to considerable criticism. The agency has received an increase in funding to modernise its digital systems, while its leadership has committed to working more closely with other government departments to understand and meet their needs, and also to draw on 'greater external expertise' in ABS operations (Kalisch, 2016). The ABS has played a complex role in the politics of national identity in the neoliberal period. Reporting on census results increasingly tells a story of a diverse Australia where the national image includes more than Anglo-Saxon men. In this new era focused on diversity, intersectionality and individualism, statistical concepts like averages are used to celebrate difference rather than enforce conformity. When the ABS released results for the 2021 census, the agency's media campaign made much of the fact that the 'average Australian' is female, aged 30–39 years and living in a capital city, and also stressed the now considerable portion of Australians born overseas. The statistical meaningfulness of the 'average Australian' has been challenged (Goot, 2022), but these narratives show an agency repurposing nineteenth-century statistical themes to tell a late modern story of intersectionality and cultural diversity. The ABS has also had to manage controversies surrounding gender and sexual identity. Its efforts to survey women about their experiences of domestic violence in the early 1990s provoked an internal backlash, with staff worried the agency was being captured by a feminist agenda. More recently, when a Conservative government used the ABS to carry out a de facto national vote on same-sex marriage, activists objected, and the move faced a High Court challenge and condemnation from constitutional specialists (Howard, 2019).
The ABS leadership, however, embraced the opportunity and delivered the results on time and significantly under budget. The absence of a question on sexual orientation on the 2021 census led to further accusations of government ideological interference (Karp, 2019). These cases illustrate the delicate and difficult balancing act statistical agencies must perform to maintain an image of responsiveness and neutrality in an age of heightened sensitivity to questions of identity and practices of statistical labelling.


CONCLUSION

This chapter has summarised work on the relationship between statistics and state formation. It emphasised the complexity and dynamism of this relationship. Statistics and the modern state developed side by side, each feeding and responding to changes in the other. The chapter has illustrated four dimensions in which this relationship plays out: the realm of ideas, where evolving concepts of statistical reasoning interact with changing discourses of governing; interests, where actors in civil society and the state use statistics to obtain benefits and pursue agendas; institutions, referring to how state structures shape the state's collection, publication and use of statistics; and identities, or the role of statistics in creating and altering shared images of local and national communities. None of these dimensions are static, and the chapter has outlined a typology of different periods and the different statistical practices that were prevalent in each. To understand the political implications of statistics, one should not restrict the study of statistical politics to the period of original state creation; instead, statistics are implicated in the ongoing reformulation of government administration and state-society relations.

This chapter has also stressed the importance of studying statistics and state formation comparatively. In addition to shifts over time, there are important differences between jurisdictions in the role statistics play in government. The chapter has summarised the core dimensions of these differences and presented a case study of the evolution of government statistics in Australia. That case shows that typologies of statistical eras can help to understand broad trends in statistical governance, but there are idiosyncratic factors that shape individual countries' statistical trajectories. It has also helped to explain the conundrum, discussed at the beginning of the chapter, that statistics are simultaneously central to governing, yet also presented as apolitical. Throughout the history of the Australian state, key political institutions and policy frameworks have depended for their operation and legitimacy on statistics that are seen to be independent of partisan politics and vested interests. Maintaining an impression of statistical independence while also meeting the competing and evolving informational needs of policy makers and the broader community is a perennial challenge for the statistical institutions of the modern state.

REFERENCES

ABS. (2005). Informing a nation: The evolution of the Australian Bureau of Statistics. Australian Bureau of Statistics.
Alenda-Demoutiez, J. (2022). Statistical conventions and the forms of the state: A story of South African statistics. New Political Economy, 27(3), 532–45.
Alonso, W., & Starr, P. (1987). The politics of numbers. Russell Sage Foundation.
Anderson, B. (2006). Imagined communities: Reflections on the origin and spread of nationalism. Verso Books.
Arcioni, E. (2012). Excluding Indigenous Australians from 'the people': A reconsideration of sections 25 and 127 of the Constitution. Federal Law Review, 40(3), 287–315.
Australian Public Service Commission. (2013). Capability review: Australian Bureau of Statistics. https://www.apsc.gov.au/sites/default/files/2021-06/ABS-Capability-Review.pdf. Accessed 15/9/2022.
Beaud, J.-P., & Prévost, J.-G. (2010). L'histoire de la statistique canadienne dans une perspective internationale et panaméricaine. In N. Senra & A. De Paiva Rio Camargo (Eds.), Estatísticas nas Américas: Por uma agenda de estudos históricos comparados (pp. 37–66). IBGE.

Beck, U. (2000). The cosmopolitan perspective: Sociology of the second age of modernity. The British Journal of Sociology, 51(1), 79–105.
Burchell, G., Foucault, M., Senellart, M., Ewald, F., & Fontana, A. (2009). Security, territory, population: Lectures at the Collège de France 1977–1978 (Vol. 4). Macmillan.
Committee on Integration of Data Systems. (1974). Report of the Committee on Integration of Data Systems. Australian Government Publishing Service.
Curtis, B. (2002). The politics of population: State formation, statistics, and the census of Canada, 1840–1875. University of Toronto Press.
Dean, M. (2010). Governmentality: Power and rule in modern society. Sage.
Desrosières, A. (2002). The politics of large numbers: A history of statistical reasoning. Harvard University Press.
Desrosières, A. (2011). Words and numbers: For a sociology of the statistical argument. Apuntes de Investigación del CECYP, 19, 75–101.
Edmunds, R. (2005). Models of statistical systems. OECD.
Edwards, M., Howard, C., & Miller, R. (2001). Social policy, public policy: From problem to practice. Allen & Unwin.
Gigerenzer, G., & Swijtink, Z. (1989). The empire of chance: How probability changed science and everyday life (Vol. 12). Cambridge University Press.
Goot, M. (2022). The ABS's notion of the average Australian makes little sense. Here's why. https://theconversation.com/the-abss-notion-of-the-average-australian-makes-little-sense-heres-why-186296. Accessed 15/9/2022.
Hacking, I. (1990). The taming of chance. Cambridge University Press.
Hall, P.A. (1993). Policy paradigms, social learning, and the state: The case of economic policymaking in Britain. Comparative Politics, 25(3), 275–96.
Hancock, W.K. (1931). Australia. Ernest Benn.
Henman, P., & Dean, M. (2010). E-government and the production of standardized individuality. In V. Higgins & W. Larner (Eds.), Calculating the social: Standards and the reconfiguration of governing (pp. 77–93). Palgrave Macmillan.
Higgins, V., & Larner, W. (2010). Standards and standardization as a social scientific problem. In V. Higgins & W. Larner (Eds.), Calculating the social: Standards and the reconfiguration of governing (pp. 1–17). Palgrave Macmillan.
Hood, C. (1983). Tools of government. Macmillan.
Hood, C., & Lodge, M. (2006). The politics of public service bargains: Reward, competency, loyalty and blame. Oxford University Press.
Horn, R. (1987). Ethnic origin in the Australian census. Journal of the Australian Population Association, 4(1), 1–12.
Howard, C. (2019). The politics of numbers: Explaining recent challenges at the Australian Bureau of Statistics. Australian Journal of Political Science, 1–17. doi:10.1080/10361146.2018.1531110.
Howard, C. (2021). Government statistical agencies and the politics of credibility. Cambridge University Press.
Howard, C. (2022). How leaders of arm's length agencies respond to external threats: A strategic-performative analysis. Administration & Society, 54(3), 366–94.
Howard, C., & Bakvis, H. (2015). Conceptualizing interagency coordination as metagovernance: Complexity, dynamism and learning in Australian and British statistical administration. International Journal of Public Administration, 39(6), 417–28. doi:10.1080/01900692.2015.1018427.
Kalisch, D.W. (2016). Census 2016: Lessons learned – improving cyber security culture and practice. https://www.abs.gov.au/websitedbs/d3310114.nsf/home/Australian Statistician - Speeches Census 2016 Lessons Learned. Accessed 15/9/2022.
Karp, P. (2019). ABS said census questions on gender and sexual orientation risked public backlash. The Guardian. https://www.theguardian.com/australia-news/2019/dec/03/abs-said-census-questions-on-gender-and-sexual-orientation-risked-public-backlash. Accessed 15/9/2022.
Leibler, A., & Breslau, D. (2005). The uncounted: Citizenship and exclusion in the Israeli census of 1948. Ethnic and Racial Studies, 28(5), 880–902.
Martin, M.E. (1981). Statistical practice in bureaucracies. Journal of the American Statistical Association, 76, 1–8.

McNamara, K. (2002). Rational fictions: Central bank independence and the social logic of delegation. West European Politics, 25(1), 47–76.
Panza, L., & Williamson, J.G. (2021). Always egalitarian? Australian earnings inequality 1870–1910. Australian Economic History Review, 61(2), 228–46.
Pierre, J. (1995). Comparative public administration: The state of the art. In J. Pierre (Ed.), Bureaucracy in the modern state: An introduction to comparative public administration (pp. 161–84). Edward Elgar.
Porter, T.M. (1986). The rise of statistical thinking, 1820–1900. Princeton University Press.
Ramp, W., & Harrison, T.W. (2012). Libertarian populism, neoliberal rationality, and the mandatory long-form census: Implications for sociology. Canadian Journal of Sociology, 37(3), 273–94.
Rose, N. (1991). Governing by numbers: Figuring out democracy. Accounting, Organizations and Society, 16(7), 673–92.
Sawer, M. (2003). The ethical state? Social liberalism in Australia. Melbourne University Publishing.
Slattery, M. (1986). Official statistics. Tavistock.
Starr, P. (1987). The sociology of official statistics. In P. Starr & W. Alonso (Eds.), The politics of numbers (pp. 7–57). Sage.
Tant, A. (1995). The politics of official statistics. Government and Opposition, 30(2), 254–66.
UNSD. (2003). The handbook of statistical organization, third edition: The operation and organization of a statistical agency. United Nations Statistics Division.
Walters, W. (2000). Unemployment and government: Genealogies of the social. Cambridge University Press.
Wanna, J., Forster, J., & Kelly, J. (2000). Managing public expenditure in Australia. Allen & Unwin.
Whitwell, G. (1986). The Treasury line. Allen & Unwin.
Woolf, S. (1989). Statistics and the modern state. Comparative Studies in Society and History, 31(3), 588–604.

2. Quantification and global governance

Isabel Rocha de Siqueira

In 2014, at the request of the United Nations (UN) Secretary-General, the Independent Expert Advisory Group on a Data Revolution for Sustainable Development (IEAG) published a report entitled A World That Counts: Mobilising the Data Revolution for Sustainable Development, as part of the organization's effort to incentivize implementation and monitoring of the 2030 Agenda, approved in 2015.1 The document claimed that advances in technology and data production offered 'unprecedented possibilities for informing and transforming society' (IEAG, 2014, p. 2). Considering that there are 17 Sustainable Development Goals (SDGs) covering a wide range of themes, along with 169 targets, it was clear that policymakers and researchers would require all the information the world could provide. More important, however, was the sentiment expressed throughout UN papers and other related documents that '[n]ever again should it be possible to say "we didn't know". No one should be invisible' (ibid.). Indeed, the motto of the 2030 Agenda is 'leave no one behind', and with that impetus came a vast amount of investment in data production, especially based on the diversification of data sources.

If, at one point, quantified data became the means by which governments came to know and govern their populations (Desrosières, 1998), in the past few decades global governance has come with a vast infrastructure of techniques, technologies, and social and political incentives that has made it possible, at least in principle, to govern any issue and any community from a distance. Today, in the context of the 2030 Agenda, a global agenda, we see a proliferation of data producers, and the very concept of data is the object of debates over knowledge and the role of science, especially in times of denialisms (Cesarino, 2021; Sismondo, 2017).

For the purpose of setting the scene, it is important to clarify that in referring to data in this chapter, I am focusing on quantified data, while aware that not all data are quantified. In addition, I propose we think of data as a selective extraction, that is, 'data harvested through measurement are always a selection from the total sum of all possible data available', for which reason 'data are inherently partial, selective and representative, and the distinguishing criteria used in their capture has consequence' (Kitchin, 2014, p. 29). The quantified data being produced for global governance today have many sources, which, in turn, can be classified in diverse ways. Administrative data are those produced by governments everywhere; alternative data sources are those beyond government, though they can also include administrative data when these are used for purposes distinct from those originally intended by governments. Alternative data sources also include, for instance, geospatial data, citizen-generated data and privately held data (GIZ and Global Partnership for Sustainable Development Data, 2020, p. 5). Producers of alternative data therefore range from information and communication technologies, such as mobile phones, to local communities and private companies. In brief, governments are not alone in producing data about their populations.

Governance from a distance has been greatly facilitated by this 'avalanche of numbers' (Hacking, 1990). If '[t]he sheer complexity of global governance and the distance between the humans involved means that human relations must be mediated' (Hansen and Porter, 2017, p. 34), the proliferation of data sources has allowed for more mediations that can cross borders. The general reasoning is that policymaking can be made better by the production of appropriate and timely information. Moreover, if this is to be done at a global scale, then technologies and techniques are needed that can keep pace with the speed and complexity of the current global flows of people, commodities and ideas. Thus, the rationale that more data are always needed, as seen in the UN report above, is one that has gained strength, paired with a certain technological determinism (Hassan and Sutherland, 2017). However, although data can be of great help for policy, the production and analysis of data have costs and require capacities that are not without limits. Accordingly, many authors have advanced critiques of what is perceived as another frontier of global inequality (Taylor and Broeders, 2015; Thatcher et al., 2016).

The remainder of this chapter will look first into the rationalities that enable the global proliferation of quantifying practices. Next, some key instruments that have reinforced a reliance on quantification are analyzed. These are tools that encourage a global homogenization of certain processes in policy. The following section will then discuss how global agendas and new technologies have impacted this scenario. Finally, the chapter concludes with a brief overview of challenges and political dilemmas ahead that should be taken into account as the world becomes irrevocably datafied.

HOW WE CAME TO RELY ON QUANTIFICATION IN GLOBAL GOVERNANCE

I analyze three notions that seem to give quantified data an increasing role in global governance: their aura of perfectibility; their alleged capacity to offer predictions; and their capacity to speak in terms of degrees of certainty.

There is an important sense in which a historical reliance on statistics has fed some of the key reasonings that nowadays support our reliance on data in general. As noted above, not all data are quantified, but the legitimacy acquired by statistics in the history of governments (Hacking, 1990) is a key part of the history of modern Western science, which means that the objectivity and neutrality most people became accustomed to attributing to information in the form of numbers are standards through which all kinds of knowledge are judged, even if critical scholars have by now accumulated a rich corpus of critique of that view (see, for instance, Collins, 2002). The very definition of data I mobilized above incorporates that critical stance: data are always partial, in all senses of the word. Nevertheless, the trick in such debates has become less about pointing out errors and outright manipulations, which are easy targets, and fair ones, that most researchers will rush to address. The Doing Business Report is proof of that: the series was discontinued in 2021 by the World Bank due to allegations of serious bias and interference. In the announcement, the Bank said that '[t]rust in the research of the World Bank Group is vital. World Bank Group research informs the actions of policymakers, helps countries make better-informed decisions, and allows stakeholders to measure economic and social improvements more accurately.' It also promised that the Bank 'will be working on a new approach to assessing the business and investment climate'.2

As scholars debate the truth in numbers and, therefore, the role they should play in governance, the central point is that quantification offers precisely the possibility the World Bank made use of, which is the promise of perfectibility. By definition, numbers can be reworked, formulae can be fixed and new sources of quantified data can always be provided by new technologies. With that, as I have argued elsewhere, numbers are not and do not need to be stable in conventional terms; they are rarely abandoned as an exercise, because 'they are always on the move' (Rocha de Siqueira, 2017a, p. 7). I am referring to Desrosières's (2009) work on numbers as both real and conventional, in the sense that they offer both a measurable diagnosis and a normative prescription. Conventions, for instance, are reworked by 'expert bodies' all the time – that is the reason for the existence of such bodies. The IEAG, cited at the beginning of this chapter, is responsible for revising methodologies for data production in the context of the 2030 Agenda. It took the body five years to move certain indicators from Tier III (no established methodology) to Tier I (established methodology and frequent data collection) (United Nations, 2021). Revisions such as these are always taking place. That is one reason why the understanding of numbers as both real and conventional is replicated by many sociologists of quantification (Bowker and Star, 2000; Lampland, 2009; Lampland and Star, 2009). Among its repercussions, we need to take into account that numbers have a crucial epistemological role in delimiting how we understand the thing that should be governed. They also have an ontological role, in effect enabling the construction of that which is the target of intervention, by directing behaviour and expectations, for instance.

Thus, the reliance on perfectibility facilitates governance to the extent that mistakes are themselves diagnosed, or even anticipated, by quantification: statistics have historically relied on the ability to offer degrees of certainty, indicating beforehand what the margins of error are, for example. It is not only that quantification fixes itself, by promoting new practices or adopting new technologies; one of its main strengths is the alleged capacity to establish just how much we can rely on the numbers in the first place. For all purposes, however, if numbers can be disputed and revised (and we live in times when this has happened more and more often, not necessarily based on scientific reasoning), this has generally no effect on people's trust in quantification itself.

As the sources of data have multiplied, one extremely interesting case is that of citizen-generated data (CGD). In global governance, CGD has been increasingly acknowledged as a vital complement to official statistics, often providing data precisely on invisibilized and marginalized groups (GIZ and Global Partnership for Sustainable Development Data, 2020; Jameson et al., 2018; Sacco and Marques, 2019). CGD helps disaggregate and localize data, both of which are of enormous importance when it comes to intervening upon complex social issues, such as gender violence. When it comes to reasoning in terms of degrees of certainty, one of the three attributes of data I want to highlight, CGD is an interesting case. Some authors have argued that CGD should be valued more for its capacity to open conversations and place certain issues in view than for its accuracy or other attributes usually expected of good data. Discussing the role of CGD in the environmental debate, Gabrys et al. argue that

[c]itizen data might fall outside of the usual practices of legitimation and validation that characterise scientific data (which also has its own processes for determining if data is good enough). However, it could be just good enough to initiate conversations with environmental regulators, to make claims about polluting processes, or to argue for more resources to be invested in regulatory-standard monitoring infrastructure. (2016, p. 2)
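What the quotation calls scientific data's own 'processes for determining if data is good enough' can be stated compactly for classical survey statistics. As a minimal textbook illustration (not drawn from the chapter's sources), the 95 per cent margin of error for a proportion estimated from a simple random sample of size $n$ is

\[
\mathrm{MOE}_{95} \approx 1.96\,\sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}},
\]

so a survey of 1,000 respondents with $\hat{p} = 0.5$ reports its own possible incorrectness in advance: roughly plus or minus 3.1 percentage points. It is this habit of pre-calculated fallibility that citizen-generated data typically lacks.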

We are speaking here of two different understandings of 'good enough data': one whose possible incorrectness is itself scientifically calculated beforehand, and another whose insufficiency is not measured but is known. There is much room to dispute that specific argument about CGD (see Cázarez-Grageda et al., 2020; Gabrys et al., 2016), but the point here is to signal the pervasiveness of this 'good enough' or 'certainty by degrees' reasoning, one that is perhaps paradoxically aligned with the authority of numbers. That is, in governance, even if potentially or expressly incorrect, it is better to have numbers than not to have them at all (Ferguson, 1996).3

This is of special relevance today, since, in great part, global governance nowadays counts on an immense volume of data: Big Data, for example, has brought reliance on quantification to the point where researchers even argue we do not need theory any more, because we have enough correlations (Anderson, 2008). 'Correlation and regression enabled previously separate objects to hold together. They constructed a new type of spaces of equivalence and compatibility' (Desrosières, 1998, p. 283). With enough data about all issues, communities and places in the world, allegedly, we can connect enough dots to achieve a certain predictability about most phenomena. In fact, it has become commonplace to say we are now datafied beings in a datafied world, and if datafication is the phenomenon by which all of this information, often freely given, is 'systematized, analyzed and made instrumental so that predictive action can be generated' (Mayer-Schönberger and Cukier, 2014, p. 78), then we have an ever more predictable world, or so the reasoning leads us to believe. However, by definition, a high probability does not indicate causality, but only that certain things are highly likely to happen simultaneously. The rest is unknown, but this unknown is modelled and the possible errors are measured, so that chance can be tamed (Hacking, 1990). Elsewhere, I have called this reasoning a metaphysics of correlation: it does not promise knowledge but, in fact, offers 'an authoritative "scientific" acknowledgment of ignorance' (Rocha de Siqueira, 2017b, p. 61). The latter is key to understanding the role several quantifying initiatives have been playing in global governance.

An example comes to mind that has had important historical unfoldings. In the 1990s, when the US Central Intelligence Agency (CIA) sponsored the State Failure Task Force, tasked with conducting empirical research on the 'correlates of state failure' from the mid-1950s on, the goal was to avoid surprises – unknowns – of the kind represented by the end of the Cold War (Goldstone, 2008). By the time of its second report, in 1999, the team was working with three sets of indicators, on infant mortality, trade openness and democracy, considered together the 'most efficient discrimination between "failure cases" and stable states' (Esty et al., 1999, p. 51; Goldstone et al., 2010). The programme was completely reformulated after the 9/11 attacks and has had a long life, with what one can imagine is a vast amount of data compiled on political instability, democracy, terrorism and crises in general.4

It is not new that data are used to predict phenomena with potential global impacts, but digital technologies have undoubtedly increased capacities for data collection and analysis. Whether these in fact amount to the death of theory in a world that abounds with correlations is a matter for much debate.
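The limits of that correlational abundance are easy to demonstrate with a toy computation. The sketch below (illustrative only; the variable names and data are invented, not taken from the Task Force or any source cited here) shows two unrelated series that correlate almost perfectly simply because both trend upward:

```python
import numpy as np

rng = np.random.default_rng(42)
years = np.arange(1990, 2020)

# Two invented indicators that both drift upward over time,
# with independent random noise and no causal connection.
mobile_subscriptions = 2.0 * (years - 1990) + rng.normal(0, 3, years.size)
reported_instability = 1.5 * (years - 1990) + rng.normal(0, 3, years.size)

# Pearson correlation of the raw series: typically well above 0.9,
# purely because both series share an upward trend.
raw_r = np.corrcoef(mobile_subscriptions, reported_instability)[0, 1]

# Correlation of year-on-year changes, which removes the shared
# trend: typically near zero, exposing the absence of any relationship.
detrended_r = np.corrcoef(np.diff(mobile_subscriptions),
                          np.diff(reported_instability))[0, 1]

print(f"raw correlation:       {raw_r:.2f}")
print(f"detrended correlation: {detrended_r:.2f}")
```

The high raw correlation is an artefact of a shared trend and vanishes once the trend is removed. This is the gap that the 'metaphysics of correlation' papers over: an abundance of correlations alongside a scarcity of causal knowledge.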


ON INSTRUMENTS AND TOOLS FOR QUANTIFICATION AT A GLOBAL SCALE

Global governance today relies to a great extent on technologies of quantification, but some of them have been around for a long time in different forms. This is the case of rankings and templates in the development field, for instance. Classifying countries is an old practice, usually conducted in order to facilitate decision-making regarding access to funds, for example, or in order to predict global economic and political impacts, as seen above. For decades, the main indicator used to measure state performance has been Gross Domestic Product (GDP), which became synonymous with economic development or even with development writ large. Nevertheless, this notion has been increasingly criticized, particularly after the financial crisis of 2008/9.

It was in this context of the global financial crisis that the then French President Nicolas Sarkozy created the famous Stiglitz-Sen-Fitoussi Commission to scrutinize mainstream approaches to development and economic growth. The Commission was formed by the renowned economists Joseph Stiglitz, Jean-Paul Fitoussi and Nobel laureate Amartya Sen, and it initiated its work even before the financial crisis, but much of its contribution gained repercussions after the economic effects of the crisis were felt worldwide.5 The main report, published in 2009, stated that the Commission's aim has been

to identify the limits of GDP as an indicator of economic performance and social progress, including the problems with its measurement; to consider what additional information might be required for the production of more relevant indicators of social progress; to assess the feasibility of alternative measurement tools, and to discuss how to present the statistical information in an appropriate way. (Stiglitz et al., 2009, p. 7)

The Commission suggested a series of changes to how policymakers and researchers should measure economic development and social progress. The reasoning was clearly put: 'What we measure affects what we do; and if our measurements are flawed, decisions may be distorted' (ibid.). Therefore, the document proposed a focus on wellbeing, to start with, and made a series of technical and political recommendations. Later on, some of the authors involved, together with others, founded the Social Progress Imperative (SPI), which aimed to measure wellbeing in terms of social progress across three dimensions that take into account people's perceptions, quality of life in different areas, and environmental sustainability.6 The SPI produces rankings based on basic human needs, foundations of wellbeing and opportunity, seeking to measure 'what matters for people in a community'. Although it resembles other initiatives called by some authors 'Beyond GDP', like the Human Development Index (HDI) (which was also co-created by Amartya Sen), the SPI is more oriented towards elements other than GDP, looking at 'standard of living' instead of GDP per capita, for instance. In comparison to the HDI, the SPI tends to show greater divergence from GDP measures (Malay, 2019). The SPI aims to focus on community and a holistic view of societies, although it is important to note that it is governed by a body of private companies and foundations.7

All this may seem intuitive now for those concerned with the world and its communities, but when it comes to instruments and standards for global governance, a historical obsession with GDP has meant and still means that most organizations and studies prioritize a very narrow understanding of development that is not merely economic but, even in that economic sense, strictly focused on income and consumption (Latouche, 2009). For many years, developing and conflict-affected states have disputed this narrow definition of development, because it creates stigmas, reduces political leverage and constrains access to funding and investments (Alexander, 2010; Bhuta, 2021).

The World Bank's Country Policy and Institutional Assessment (CPIA) is a case in point. The measurement practised within this framework is also largely based on an economicist view of country performance, and although it was not created to rank countries, it has been effectively doing so. That is, after all, the nature of standards, especially when they come with powerful strings attached. The CPIA is the basis for the allocation of resources by the International Development Association (IDA), the branch of the World Bank that provides concessional resources to the poorer countries. 'The CPIA consists of 16 criteria grouped in four equally weighted clusters: Economic Management, Structural Policies, Policies for Social Inclusion and Equity, and Public Sector Management and Institutions (see Box below). For each of the 16 criteria, countries are rated on a scale of 1 (low) to 6 (high).'8 The framework has changed many times since 2006, when it was first disclosed. The CPIA country rating (IRAI) is fed into a formula that supports IDA resource allocation, providing a country performance rating (CPR). The CPIA has also led to the formulation of a list of fragile and conflict-affected situations (FCS), released annually:

The list functions primarily as a tool to help the WBG adapt its approaches, policies, and instruments in difficult and complex environments. The WBG also uses it for monitoring and accountability around its support for the most vulnerable and marginalized communities.9

Countries whose rating falls below a 3.2 threshold are classified as FCS, which provides access to specific forms of concessional funding,10 although it also leads to less welcome political implications. Indeed, terminology matters in diplomatic and political circles in general, so an inadvertent effect of the CPIA, as an instrument of global governance, has been to effectively create a subclass of countries, that of so-called fragile or conflict-affected countries (Rocha de Siqueira, 2017b). The g7+ group of conflict-affected states has made this point forcefully:

[D]ifficulties around data collection in fragile states mean donors often rely on out of date statistics. Misrepresentations can result, which fail to provide an accurate picture of the progress that states are making. There is also an issue of creating overly ambitious international targets and goals for fragile states that do not take into account the low base from which fragile states are starting, and thus 'set countries up to fail' against these measures. Finally, indicators determined by international actors do not draw on the true experts on fragility – the citizens of fragile states themselves. (g7+, 2013, p. 2)
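The classification mechanics just described are simple enough to sketch in a few lines. The following illustration uses invented country scores; the four cluster names and the equal weighting come from the World Bank FAQ quoted above, and the 3.2 threshold from the FCS classification, while the fuller IDA allocation formula (producing the CPR) is not reproduced here:

```python
# Illustrative sketch of the CPIA-style rating logic described above.
# The country scores are invented for illustration only.
CLUSTERS = (
    "Economic Management",
    "Structural Policies",
    "Policies for Social Inclusion and Equity",
    "Public Sector Management and Institutions",
)
FCS_THRESHOLD = 3.2  # ratings below this are listed as fragile/conflict-affected

def overall_rating(cluster_scores: dict[str, float]) -> float:
    """Average of the four equally weighted cluster scores (each 1-6)."""
    return sum(cluster_scores[c] for c in CLUSTERS) / len(CLUSTERS)

countries = {
    "Country A": dict(zip(CLUSTERS, (4.0, 3.8, 3.6, 3.4))),
    "Country B": dict(zip(CLUSTERS, (3.0, 3.2, 2.8, 3.0))),
}

for name, scores in countries.items():
    rating = overall_rating(scores)
    status = "FCS" if rating < FCS_THRESHOLD else "non-FCS"
    print(f"{name}: rating {rating:.2f} -> {status}")
# Country A: rating 3.70 -> non-FCS
# Country B: rating 3.00 -> FCS
```

The sketch makes visible how a continuous rating is converted into a binary label, which is precisely the operation through which the 'subclass of countries' discussed above is produced.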

Some instruments are much less visible in this macro perspective and are, nonetheless, just as pervasive a form of governing from a distance. Let us take the case of the (in)famous logical framework, a template that was born in a very specific context, created for the US Agency for International Development and in use by the World Bank since 1997, 'when it became a standard attachment to the Project Appraisal Document for investment operations' (World Bank, 2005, p. 1). The logframe – as it is called – became a proxy for the Measuring for Results (MfR) rationale, an administrative and business rationale adopted by most organizations everywhere, especially in the 1990s and into the 2000s, in large part due to the World Bank's influence in circulating the template and its attached procedures as manifested forms of expertise.

The logframe is a tool that has the power to communicate the essential elements of a complex project clearly, and succinctly throughout the project cycle. It is used to develop the overall design of a project, to improve project implementation, monitoring, and to strengthen periodic project evaluation. In essence, the logframe is a 'cause&effect' model of project interventions to create desired impacts for the beneficiaries. (World Bank, 2005, p. 13)

Therefore, some key elements we need to consider here are how the logframe powerfully represents two central characteristics of the MfR rationale: a mode of governance based on intervention-by-project, and the streamlining of this reasoning through potent visualization tools, such as the matrix in logframes. The centrality of projects in most organizations feeds, and is fed by, the MfR rationale. The idea is that the smaller the unit of action, the better for monitoring's sake; that is, it is easier to follow the logic of a well-contained form of intervention that presents itself in incremental stages, to attest to the end of each stage and then to measure the results at every step. Similarly, the very existence of the logframe and the MfR has led to projects becoming the homogeneous form of intervention everywhere (see Natsios, 2010): submitting proposals for funding means submitting ideas in the form of projects, and a logframe is usually expected as well.

The logframe adopted by the World Bank and circulated since 1997 is often a 16-box matrix, where 'each box contains specific and unique types of information'. The four columns describe the causal logic of the project, following from objectives to outputs, passing by activities and indicators (World Bank, 2005, p. 14); a schematic sketch of such a matrix is given after the quotation below. The main issue for global governance is that this model became the rule that every organization, from the smallest to the largest, needs to follow in one form or another, so that thinking about complex social problems has been modelled to simplify matters into a series of correlations that, nonetheless, imply some causation. The MfR, after all, is concerned with producing and measuring results, so the logframe represents a way of focusing on results by providing a causal narrative that links objectives, activities and results, even if no expert will ever say causality can be proven – there are only correlations of high probability. There is a grammar and a vocabulary that have also become attached to the MfR and are themselves tools in global governance. In a fictitious letter to development donors, Everjoice Win eloquently explains what this entails:

And yet, every organization needs funding and submitting projects is everyone’s routine. That is how successful certain methods and tools, or what Lins Ribeiro calls ‘micro models’ (2013, p. 137), have been in measuring global governance, but most importantly, through that measuring, effectively governing from a distance. They are models in the concrete sense of the word, preformatted cognitive maps to (re)produce materialities, lifestyles, production and circulation schemes, environments, and social, economic, political and cultural processes. They prefigure reality and impose a predefined order on it. As the last links of the dissemination chain, they have a hands-on approach. Micromodels have authoritative and authoritarian qualities and they fully depend on projectism. Once micromodels are defined and start to be implemented, all processes have to be adapted and performed according to their mandates. Incapacity to act accordingly and results that differ from what was anticipated are considered to be evidence of failures and incompetence. Those responsible for the disparities must explain themselves and may be found guilty and dismissed for not following the rules. (Lins Ribeiro, 2013, p. 137)

Therefore, in general, the consequences of 'projectism' (ibid.) have been of considerable political weight: organizations have little predictability over their budgets, have their human and material resources spread thin in order to fill in forms in compliance with different donors, and, most importantly, positive social change has been oversimplified to fit boxes, while long-term dynamics are sidelined for the sake of short-term activities to which clearer impacts can be attributed.

IMPORTANT HISTORICAL TRANSFORMATIONS IN MEASURING GLOBAL GOVERNANCE

The reasoning, methods and tools employed in measuring global governance have been profoundly strengthened by some recent events and phenomena that have reinforced the view that 'more data is always better'. Digitality is one such phenomenon, and it has had an important impact upon global governance, as it has made it possible, for instance, to expand the MfR reasoning to new frontiers of data collection and analysis, something partly seen in the previous sections. We have only to look at what the COVID-19 pandemic has represented in these terms so far, as discussed below. I also want to focus on the power leveraged by global agendas, of which none is perhaps more strikingly complex and all-encompassing than the 2030 Agenda. That way, this section provides both an analysis of instruments (digitality) and of goals or themes (the SDGs), looking into how different levels of analysis can provide lenses through which to observe important historical transformations. Moreover, as discussed ahead, the means and the goals feed into each other: the more aligned a theme is with a quantifying and results-oriented reasoning, the more salient it becomes, and with that come increasing expectations to produce data for monitoring and evaluation against targets that can be compared globally. I want to take the 2030 Agenda as an example of an incredibly potent 'epistemic infrastructure' (Tichenor et al., 2022) that, along with its antecessor, the Millennium Development Goals (MDGs), has incentivized quantification at a global level.

Most data produced nowadays rely on digitality, that is, not only on digitalization but on its being part of a bigger picture, 'a culture that is formed and expressed, for reasons of capitalist "efficiency", upon an ever-increasing acceleration and automation … That is to say, the culture of digitality is less a humanly created culture than it is a computer-created and network-created one' (Hassan and Sutherland, 2017, p. 120). It is because of digitality that it is possible to have such a thing as Big Data. Speaking of digitality means that digital media cannot be dissociated from the capitalist trajectory of their development and need to be understood as a form of capitalist mediation (ibid., p. 4). If we are now swimming in numbers, the reasoning for producing them dates back centuries, but it was the increase in economic interest in data (statistical and other) – which came to be seen as the 'new oil' – that made data mining a commercial phenomenon without borders. Zuboff (2019) suggests that extracting digital data that can be commercialized for propaganda purposes ushered in a type of capitalism in which the appropriation of labour, land and wealth, a hallmark of industrial capitalism, was supplanted by the appropriation of 'private experience for translation into fungible commodities' (p. 11). At first, these data were like a by-product of the users' activity. With increasing pressure from investors, they became 'aggressively hunted, acquired and accumulated … largely through unilateral operations aimed to evade individual conscience and thus circumvent individual decision rights' (ibid., p. 13). This digital architecture has recently expanded with the incentives to 'build back better' 'after' the COVID-19 pandemic. By the end of the first year of the pandemic, in 2020, investments in digitalization had already surpassed 3 per cent of GDP in South Korea, China and the European Union. The goals have been to invest in the digitalization of industries and services but also to increase the use of digital platforms (Tang and Begazo, 2020).

The experience of the COVID-19 pandemic is a key case for exploring the repercussions of datafication and digitality in global governance. It has reinforced the reasoning that more data is always better. Many initiatives came up with ways of producing open data on the circulation of the virus, and these have been understood as public goods. It is hardly disputable that data on the virus have been of extreme importance, so that projects seeking to contribute to the pool have been welcomed internationally. Meanwhile, critics have raised the alarm that many technologies for data collection and analysis are being circulated without appropriate tests and that the surveillance they are setting in motion has not been democratically approved, due to the exceptional demands of the pandemic, and will have lasting effects (Kitchin, 2020).11 One example has global proportions: the #data4covid19 platform mapped public, private and mixed initiatives of data production. Between January 2020 and January 2021, of the 225 projects listed, 69 were created by private companies; 56 by public-private partnerships; and 42 by academia alone or along with other partners.12 The platform is led by GovLab, which received funds from Luminate for this specific project. Luminate, in turn, is relatively new, created in 2018, and has as one of its founders Pierre Omidyar, creator of eBay and a figure who has headed a philanthropy group, the Omidyar Group, for some years. In just two years of existence, Luminate supported '296 organizations in 17 countries with more than $378 million in funding'. One of Luminate's three priority areas is precisely open data, privacy and Artificial Intelligence (AI).13

It is important to emphasize again that the digital architecture strengthened by the latest developments related to the pandemic, and digitality more broadly, had been expanding for some years. Whereas the pandemic led a recent wave of investments in digitality and data, there have been other 'thematic-led' initiatives that have certainly contributed to advancing these phenomena in incredible leaps. In this sense, nothing has contributed more to the idea that more data is always better, and that data are essential for public policies and global governance, than the UN development agendas, development being a historically 'projectist' and results-oriented field (Ferguson, 1996). It was the MDGs, inaugurated in 2000, that first set quantified development targets globally, encompassing a diversity of social themes, and since 2015, with the approval of the 17 SDGs, this impetus has increased exponentially. 'Calling for a "data revolution", the UN Sustainable Development Goals (SDGs) seek to promote progress in matters related to planet, people, prosperity, peace and partnerships (the "5Ps") by mobilizing an all-encompassing datafying system that heavily relies on quantification' (Ramalho and Rocha de Siqueira, 2022, p. 1).
The 2030 Agenda has borrowed and extrapolated from old reasonings and well-established Western modern ways of intervening in the world:

These are the simplification of complex social phenomena in statistics; the counting and accounting way of doing policy through numbers (or numerical data); and the measurement and comparison of institutional performance, which are all part of a global governance by indicators that have been developing throughout the past decades. (ibid., p. 2; see also Davis et al., 2012)

The 2030 Agenda has 169 targets and 231 unique indicators, so the system is based on flows of information, resources and practices travelling in all directions. As such a vast and complex endeavour, the SDGs do not just represent global goals but actually shape them, playing the real and the conventional roles discussed by Desrosières (2009). The 2030 Agenda does that by leading to the implementation of vocabulary, grammar, techniques, technologies, processes and methods, from local administrations in municipalities to the major organizations of the world. In that way, the SDGs inaugurated a point of no return: there is now a well-established industry of global proportions that is dedicated to constantly pushing the borders of data production further. This is incredibly important when it comes to responsibly making visible the problems of certain communities, especially minorities or marginalized groups, as more data might also mean more disaggregated data; but producing more data should not be an end in itself, nor be attributed absolute value. Otherwise, that is, if the assumptions of data production are not themselves constantly questioned as part of governance thinking and practice, data become only another form of commoditization that, by nature, tends to reinforce inequalities instead of helping to intervene in them.

When data are the end game, even what is called citizen science – of which CGD is an example – can be used to extract value rather than to promote positive change in people's lives. Mirowski (2017) is a loud critic of such initiatives, suggesting that the 'precariat' has little power to dictate agendas or 'the uses to which knowledge will be put', so that citizen science, according to him, can often serve the purposes of advancing the private interests that fund such projects to a large extent. The issue, thus, is to address the ethical and political dilemmas associated with quantification and with governing through data not merely by buying into the 'challenges of production' – how to produce more data – but by actually asking what data are being produced and for whom, bringing responsibility to the fore.

CONCLUDING REMARKS AND WAYS AHEAD

This chapter has briefly explored how a quantifying reasoning established itself as an authoritative form of knowledge production in global governance; what main forms it has taken in terms of methods and tools; and what the impacts of recent events and phenomena have been upon the role quantification plays in global governance. Quantification relies on the capacity to present errors as a matter of perfectibility, which also means quantifying practices usually offer certainty in terms of degrees, adapting diagnoses and solutions to the data available. In addition, quantification sells the ability to predict, something that the vast amount of quantified data now available worldwide has taken to new proportions. These are some of the features that have historically led quantification to be presented as a highly authoritative form of knowledge, yet the chapter has also shown how data are always partial and selective, to the extent that they capture only part of the phenomena and depend on a series of prior decisions that are always human and political.

Some key examples were used to describe how this reasoning has taken shape in global governance, especially through what can be generally called a Measuring for Results (MfR) rationale. The World Bank has been a key actor in that regard, both through its more macro forms of governing from a distance, such as its rankings and formulae for allocating resources, and through more micro initiatives, such as templates that have become practically mandatory and end up shaping our very way of reasoning about problems in the world. The logframe discussed above is a prime case in point.

Certain important events and phenomena, such as the rise of digitality as a culture, the COVID-19 pandemic and the UN development agendas, have impacted what were already powerful reasonings and tools. Digitality has taken data collection and analysis to previously unimagined frontiers and is at the centre of a complex and powerful global economic and political dynamic. In addition, as the world becomes datafied, global agendas, such as the 2030 Agenda, have intensified even further the reliance on data, especially statistics. At this point, it becomes important to acknowledge the relevance of quantification without failing to question the way it is often conducted or the role it has taken in societies. It is in this regard that I would like to conclude by pointing out three general dilemmas that I believe should be of concern to scholars of global governance, in the realms of theory, inequalities and governance itself.

(a) Theoretically, there is much we still need to unpack about the epistemological, ethical and political impacts of having become datafied beings in a datafied world. In global governance, this means developing theoretical insights about the ways quantification has become a lingua franca that often threatens other forms of articulation. Insights from data feminism (D'Ignazio and Klein, 2020) and data citizenship (Ruppert et al., 2017) come to mind as important contributions in this regard.

(b) Similarly, it is important to be aware of, and to face, the new forms of inequality that are being enabled by quantification and datafication. Discussions of data colonialism and dispossession, for instance, problematize the way communities in the global South are being made intelligible to interventions by way of data collection, and how private interests often enter such communities to collect data without proper normative frameworks (Couldry and Mejias, 2019).

(c) Finally, it is key to understand how global governance has been challenged by the appearance of new actors. In terms of quantification, bottom-up, participatory methodologies in data production have led to some important debates on polycentric governance that still need to be expanded (Aguerre et al., 2024).

These are pressing challenges that need to be mapped and investigated, offering important new avenues of research.

NOTES

1. See https://sdgs.un.org/goals. Accessed: 1 September 2023.
2. See https://www.worldbank.org/en/news/statement/2021/09/16/world-bank-group-to-discontinue-doing-business-report. Accessed: 1 September 2023.
3. Interestingly, James Ferguson advanced that argument early in the 1990s, in his renowned book comparing development to an anti-politics machine. Already then, even without all the technological possibilities we have today, he diagnosed a mode of international governance from a distance that relied heavily on numbers, including when these were not available: 'the fact that there are no statistics available is no excuse for not presenting statistics, and even made-up numbers are better than none at all' (1996, p. 41).
4. The programme has since been called the Political Instability Task Force. See http://www.systemicpeace.org/inscr/PITFProbSetCodebook2017.pdf. Accessed: 1 September 2023.

5. See OECD, 'Beyond GDP: Measuring what counts for economic and social performance', at https://www.oecd-ilibrary.org/sites/9789264307292-en/index.html?itemId=/content/publication/9789264307292-en. Accessed: 1 September 2023.
6. See https://www.socialprogress.org/. Accessed: 20 October 2022.
7. See https://www.socialprogress.org/framework-0. Accessed: 20 October 2022.
8. See https://thedocs.worldbank.org/en/doc/b8464ff32b31e488bd3aec5437c3cc92-0290032021/original/CPIAFAQ2020.pdf. Accessed: 20 October 2022.
9. See https://www.worldbank.org/en/topic/fragilityconflictviolence/brief/harmonized-list-of-fragile-situations. Accessed: 20 October 2022.
10. The sum available is determined in the IDA Replenishments, which are exercises that take place every three years and count the contributions of the richest IDA country members. See https://ida.worldbank.org/en/replenishments. Accessed: 20 October 2022.
11. See also https://www.hrw.org/news/2020/04/02/governments-should-respect-rights-covid-19-surveillance; https://www.amnesty.org/en/latest/news/2020/04/covid-19-surveillance-threat-to-your-rights/. Accessed: 20 October 2022.
12. See #data4covid19 at https://data4covid19.org. Accessed: 20 October 2022.
13. See https://luminategroup.com/data-and-digital-rights/en. Accessed: 20 October 2022.

REFERENCES

Aguerre, C., Campbell-Verduyn, M., & Scholte, J.A. (2024). Global digital data governance: Polycentric perspectives (forthcoming). Routledge.
Alexander, N. (2010). The Country Policy and Institutional Assessment (CPIA) and allocation of IDA resources: Suggestions for improvements to benefit African countries. Heinrich Boell Foundation.
Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired Magazine, 23 June.
Bhuta, N. (2021). Governmentalizing sovereignty: Indexes of state fragility and the calculability of political order. In B. Kingsbury, S.E. Merry, & K.E. Davis (Eds.), Indicators as technologies of global governance. Oxford University Press.
Bowker, G.C., & Star, S.L. (2000). Sorting things out: Classification and its consequences. The MIT Press.
Cázarez-Grageda, K., Schmidt, J., & Ranjan, R. (2020). Reusing citizen-generated data for official reporting: A quality framework for national statistical office-civil society organisation engagement. PARIS21 Working Paper.
Cesarino, L. (2021). Pós-verdade e a crise do sistema de peritos: uma explicação cibernética. Ilha – Revista de Antropologia, 23(1).
Collins, P.H. (2002). Black feminist thought: Knowledge, consciousness, and the politics of empowerment. Routledge.
Couldry, N., & Mejias, U.A. (2019). Data colonialism: Rethinking big data's relation to the contemporary subject. Television & New Media, 20(4), 336–49. https://doi.org/10.1177/1527476418796632.
D'Ignazio, C., & Klein, L.F. (2020). Data feminism. The MIT Press.
Davis, K., Fisher, A., Kingsbury, B., & Merry, S.E. (2012). Governance by indicators: Global power through classification and rankings. Oxford University Press.
Desrosières, A. (1998). The politics of large numbers: A history of statistical reasoning. Harvard University Press.
Desrosières, A. (2009). How to be real and conventional: A discussion of the quality criteria of official statistics. Minerva, 47, 307–22.
Esty, D.C., Goldstone, J.A., Gurr, T.R., et al. (1999). State Failure Task Force report: Phase II findings. https://www.wilsoncenter.org/sites/default/files/media/documents/event/Phase2.pdf. Accessed: 1 September 2023.
Ferguson, J. (1996). The anti-politics machine: Development, depoliticization and bureaucratic power in Lesotho (3rd ed.). University of Minnesota Press.
g7+ (2013). Note on the fragility spectrum.

Gabrys, J., Pritchard, H., & Barratt, B. (2016). Just good enough data: Figuring data citizenships through air pollution sensing and data stories. Big Data & Society, 3(2), 1–14.
GIZ and Global Partnership for Sustainable Development Data (2020). The 2030 Agenda's data challenge: Approaches to alternative and digital data collection and use. Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) GmbH.
Goldstone, J.A. (2008). The Political Instability (State Failure) Task Force: Perspective and prospects as seen by an academic consultant. Personal memoranda.
Goldstone, J., Bates, R.H., Epstein, D.L., et al. (2010). A global model for forecasting political instability. American Journal of Political Science, 54(1), 190–208.
Hacking, I. (1990). The taming of chance. Cambridge University Press.
Hansen, H.K., & Porter, T. (2017). What do big data do in global governance? Global Governance: A Review of Multilateralism and International Organizations, 23(1), 31–42. https://doi.org/10.1163/19426720-02301004.
Hassan, R., & Sutherland, T. (2017). Philosophy of media: A short history of ideas and innovations from Socrates to social media. Routledge.
Independent Expert Advisory Group on a Data Revolution for Sustainable Development (IEAG) (2014). A world that counts: Mobilising the data revolution for sustainable development. United Nations.
Jameson, S., Lämmerhirt, D., & Prasetyo, E. (2018). Acting locally, monitoring globally? How to link citizen-generated data to SDG monitoring. http://dx.doi.org/10.2139/ssrn.3229753.
Kitchin, R. (2014). The data revolution: Big data, open data, data infrastructures and their consequences. Sage.
Kitchin, R. (2020). The data revolution: Big data, open data, data infrastructures and their consequences. Sage.
Lampland, M. (2009). False numbers as formalizing practices. Centre for the Philosophy of Natural and Social Science, London School of Economics.
Lampland, M., & Star, S.L. (2009). Standards and their stories: How quantifying, classifying, and formalizing practices shape everyday life. Cornell University Press.
Latouche, S. (2009). Farewell to growth. Polity Press.
Malay, O.E. (2019). Do beyond GDP indicators initiated by powerful stakeholders have a transformative potential? Ecological Economics, 162, 100–107.
Mayer-Schönberger, V., & Cukier, K. (2014). Big data: A revolution that will transform how we live, work, and think. Harper Business.
Mirowski, P. (2017). Against citizen science. Aeon, 20 November. https://aeon.co/essays/is-grassroots-citizen-science-a-front-for-big-business. Accessed: 1 September 2023.
Natsios, A. (2010). The clash of the counter-bureaucracy and development. Center for Global Development. https://www.cgdev.org/publication/clash-counter-bureaucracy-and-development. Accessed: 1 September 2023.
Porter, T.M. (1995). Trust in numbers: The pursuit of objectivity in science and public life. Princeton University Press.
Ramalho, L., & Rocha de Siqueira, I. (2022). Participatory methodologies and caring about numbers in the 2030 Sustainable Development Goals Agenda. Policy and Society, 41(4), 486–97.
Ribeiro, G.L. (2013). Global flows of development models. Anthropological Forum, 23(2), 121–41. doi:10.1080/00664677.2013.767183.
Rocha de Siqueira, I. (2017a). Managing state fragility: Conflict, quantification and power. Routledge.
Rocha de Siqueira, I. (2017b). Development by trial and error: The authority of good enough numbers. International Political Sociology, 11(2), 166–84.
Ruppert, E., Isin, E., & Bigo, D. (2017). Data politics. Big Data & Society, 4(2), 1–7.
Sacco, C., & Marques, J. (2019). O IBGE na produção do data_labe e o debate sobre dados no Brasil. Revista Brasileira de Geografia, 64(1), 109–21.
Sismondo, S. (2017). Post-truth? Social Studies of Science, 47(1), 3–6.
Stiglitz, J., Sen, A., & Fitoussi, J.-P. (2009). Report by the Commission on the Measurement of Economic Performance and Social Progress. https://ec.europa.eu/eurostat/documents/8131721/8131772/Stiglitz-Sen-Fitoussi-Commission-report.pdf. Accessed: 1 September 2023.

44  Handbook on measuring governance Tang, J., & Begazo, T. (2020) Digital stimulus packages: Lessons learned and what’s next, Digital Development – World Bank Blogs. https://​blogs​.worldbank​.org/​digital​-development/​digital​-stimulus​ -packages​-lessons​-learned​-and​-whats​-next. Accessed: 1 September 2023. Taylor, L., & Broeders, D. (2015) In the name of development: Power, profit and the datafication of the global South. Geoforum, 64, 229–37. Thatcher, J., O’Sullivan, D., & Mahmoudi, D. (2016). Data colonialism through accumulation by dispossession: New metaphors for daily data. Environment and Planning D: Society and Space, 34(6), 1–17. Tichenor, M., Merry, S.E., Grek, S., & Bandola-Gill, J. (2022). Global public policy in a quantified world: Sustainable Development Goals as epistemic infrastructures, Policy and Society, 41(4), 431–44. United Nations (2021) Progress towards the Sustainable Development Goals: Report of the Secretary-General – Statistical Annex. https://​digitallibrary​.un​.org/​record/​1627573​?ln​=​en. Win, E. (2004). If it doesn’t fit on the blue square it’s out!’ An open letter to my donor friend. In L. Groves & R. Hinton (Eds.), Inclusive aid changing power and relationships in international development (pp. 123–7). Earthscan. World Bank (2005). The logframe handbook: A logical framework approach to project style cycle management, Washington, DC. Zuboff, S. (2019). Surveillance capitalism and the challenge of collective action. New Labor Forum, 28(1), 10–29.

3. New Public Management, performance measurement, and measuring for governance
Jenny M. Lewis

INTRODUCTION
It is difficult to imagine a time when people and organizations were not spending time tracking whether or not they were 'measuring up' to some ideal performance based on averages, benchmarks or targets. The rise of measurement and quantification of society has been documented by sociologists (e.g. Power, 1997), while public administration scholars tend to focus on the rise of performance measurement in line with what is often referred to as New Public Management (NPM) or, more broadly, neoliberalism. The discussion might be framed as the rise of the performance movement (Radin, 2006), the audit explosion (Power, 1997), or the growth of administrative accountability (Flinders, 2001). Regardless of how it is framed, performance measurement is now a major industry in the public sector.

The global economic disturbances of the 1970s helped spread a belief that governments were bloated and inefficient, producing pathologies such as fiscal crisis and government 'over-reach' (Pollitt and Bouckaert, 2011). The remedy to these perceived problems was to make government agencies behave more like private firms, by saving money and increasing their efficiency, and obliging public bureaucracies to be more responsive to citizens' desires (Boston et al., 1996; Pollitt and Bouckaert, 2011). NPM and the ideas behind it (elaborated in the next section) mean that publicly funded services have been subjected to ever-greater levels of scrutiny through the lens of performance, leading to claims that we are now governed by indicators (Davis et al., 2010), or are living in the age of the performance state (Henman, 2016). National governments have created frameworks for performance measurement, as have individual departments within nations. International organizations have also spread the word about the need for such measurement (see e.g. OECD, 2005).

Governance is defined in this chapter as the system by which governments control and operate state-society interactions, and the mechanisms by which they are held to account. So, what is the relationship between governance and performance measurement? In the broadest terms, performance measurement's purpose is linked to the better management of organizations and the improvement of outcomes. It is a tool imbued with power, which has the potential to be (and quite often is) wielded by government bureaucracies as a means of control (Lewis, 2015). Performance measurement, then, is a tool wielded by public administration organizations to fulfil their governance functions. In this chapter, it is framed and examined as such. From this starting point, the political ideas and rationalities underpinning NPM and performance measurement are described in the next section. Then, the key instruments, methods, and tools of measuring are explored via an examination of performance measurement's many purposes reported in the literature. This is followed by an empirical analysis of measurement's purposes, based on interviews with 34 senior public administrators in Australia, Canada, and the UK in the health and higher education policy sectors. This analysis supports the conclusion that an important focus of performance measurement for these actors is using it as a tool for governance.

IDEAS AND RATIONALITIES: NPM AND PERFORMANCE MEASUREMENT
Public administration scholars generally agree that performance measurement in the public sector increased markedly with the emphasis on public management reform in many countries (Pollitt and Bouckaert, 2011). Reflecting the dominant mood of the 1980s, NPM, with its focus on planning, targets, outputs, and tighter oversight and control of the achievements of public sector organizations, came into play alongside a focus on saving money, reducing the time and effort expended, and cutting waste. In his famous description of NPM, Christopher Hood (1991) argues that frugality and the reduction of waste became its single-minded focus. Performance measurement remains a central plank of NPM reforms, supported by a belief that previous problems can be avoided with more measurement and more management. An emphasis on demonstrating the effective use of taxpayers' money and the meeting of specific goals set by politicians (Pollitt and Bouckaert, 2011) led to an explosion of performance measures allegedly gauging the outcome or at least the output (as compared to input) of public services (Power, 1997). To be sure, measuring performance is a core component of the NPM toolkit, but it is more than that. Beryl Radin (2006) argues that performance measurement has become ubiquitous due to a growing unwillingness of citizens to accept that institutions are performing as they should. They are concerned about the expenditure of public sector funds, sceptical about the allocation of limited resources, and worried by programme decisions they do not agree with and the unresponsiveness of organizations to their concerns.

NPM, as noted earlier, owes an intellectual debt to neoliberal ideas that are inherently sceptical of any (central) government's cognitive and governmental capacities (Triantafillou, 2017). The privatization and endorsement of quasi-markets to make public services more efficient (thereby saving money) were particularly favoured in Anglophone nations. The post-war consensus established in many of these countries involved the provision of public services for citizens, and the mechanism for support involved a public bureaucracy delivering services using rules and standard operating procedures. This public bureaucracy was subject to direct governmental budget and procedural oversight. By the late 1980s, distrust of the permanent bureaucracy by the political class had infected all Westminster systems and was also evident in the US (Campbell and Halligan, 1992). Bureaucracy had become a universally pejorative term (Barzelay, 1992) and reformers looked to the private sector and the market for inspiration. Hence, performance measurement's widespread implementation is aligned with concerns about the wise use of public funds, and it sits easily with the rationalities of public accountability and transparency (Power, 1997) and the growth of citizen and consumer-rights activism (Radin, 2006).

In a tools of government approach (Hood, 1983; Hood and Margetts, 2007), there is a distinction between detectors and effectors – the instruments used for taking in information and those used for making an impact on the world. This fits with the notion of collecting performance information and then using it in attempts to direct efforts in a specific manner. Examining 'the instruments that government uses at its interface with the world outside' (Hood and Margetts, 2007, p. 11) draws attention to how these instruments are used to govern.

Here, performance measurement is imagined as a set of instruments, measures, and tools. It has the potential to be used for many different purposes, and the range of purposes seen by those who generate and oversee the application of performance measurement to subordinate public sector organizations is crucial. Further, an instrument is 'a device that is both technical and social, that organises specific social relations between the state and those it is addressed to, according to the representations and meanings it carries' (Lascoumes and Le Gales, 2007, p. 4). Performance measurement structures political relations as competitive, and confers technical legitimacy on those who are responsible for deciding that measurement is needed, what the purpose of this measurement is, and what range of measures will be collected. Analysing the range of purposes attributed to measuring performance uncovers what those with power (in this case senior administrators) are attempting to achieve. This conceptualization of policy instruments has the important consequence of orienting relations through the application of 'devices that combine technical (measuring, calculating, the rule of law, procedure) and social components (representation, symbol)' (Kassim and Le Gales, 2010, p. 5). This sits comfortably with a political-realistic view of performance measurement, with an emphasis on administrative power (Lewis, 2015): performance measurement, and the set of instruments, tools, and methods associated with it, is a means for governing.

NPM as a mode of governing has a different source of rationality, form of control, primary virtue, service delivery focus, and key success criterion compared with other modes of governance. Table 3.1 presents four modes of governance: traditional Weberian bureaucracy; NPM (divided into corporate and market sub-types); and network. NPM is divided to represent its earlier and later variants, which reflect (respectively) the influence of management and economics (Considine and Lewis, 1999; Lewis et al., 2021). Based on these ideal-types, the focus of NPM (corporate and market) can be seen as jointly driven by management and competition, centred on meeting targets and beating the competition (outputs). In contrast, the bureaucratic mode of governing rests on law, with measurement focusing on rules and procedures (inputs), while the network mode emphasizes co-production and using networks of relationships to get results (processes).

Table 3.1  Governance modes and key success criterion

Mode | Source of Rationality | Form of Control | Primary Virtue | Service Delivery Focus | Key Success Criterion
Bureaucratic | Law | Rules | Reliability | Universal Treatments | Knowing the rules and official procedures
Corporate | Management | Plans | Goal-driven | Target Groups | Meeting the targets set by management
Market | Competition | Contracts | Cost-driven | Price | Competing successfully with other service providers
Network | Culture | Co-production | Flexibility | Clients | Having the best possible set of contacts outside the organization

Source: Adapted from Considine and Lewis (1999), Lewis et al. (2021).

Some definitions of accountability are strongly aligned with the above view of policy instruments, which comes from political sociology. Accountability is relational at its core and is an enforcement mechanism. It is record keeping that gives rise to 'story-telling in a context of social (power) relations within which enforcement of standards and the fulfilment of obligations is a reasonable expectation' (Bovens et al., 2014, p. 3). NPM has tended to focus accountability towards administrative (rather than political or judicial) concerns. Accounting systems were joined by a set of non-financial reporting systems as performance became a key organizational value. As a result, accountability came to be defined as demonstrating one's performance (van de Walle and Cornelissen, 2014). Performance measurement increased markedly with an emphasis on public management reform, and it is now more extensive, more intensive, and more external in its focus (Pollitt and Bouckaert, 2011). Further, the measurement of performance is undertaken by managers, it is linked to some management purpose, and it is management that is most likely to actually use performance measures in some way. Hence, performance management is 'the use of performance indicators and management prescriptions, designed to improve such measured performance, to achieve public service performance objectives' (Cutler, 2011, p. 129). Much has been written about this presumed link – see, for example, Moynihan (2008), Taylor (2009), and Hammerschmid et al. (2013). The point here is not this use (or non-use) of measurement by management, but to illustrate that it is managerial in its construction and its intentions. Hence, it is more closely related to the work of administrators than to politicians. This is not to say that the effects of performance measurement are confined to administrative issues – on the contrary, they have widespread impacts. But it is the administrators who decide on and oversee performance measurement.

In the 'chain of performance measurement', performance measurement is conceived of as a social structure which arises from the interaction of institutional rules and individual responses to these rules (Lewis, 2015). Combining this chain with a consideration of the different aspects and the actors involved at these different stages is helpful in defining the scope of inquiry in this chapter. The aspects can be conceptualized as falling into policy and decision-making; management and technical; and effects. The actors can be divided into measurement deciders and steerers, the measurers, the measured, and the measures (assuming measures have agency of their own). The links in the chain and their relationships to these aspects and actors are depicted in Table 3.2. Of course, none of this is as clear-cut in practice as the table suggests.

The empirical component of this chapter is centred on the questions associated with the first few links in the chain: it is a study targeted at the measure deciders and steerers. This focus was chosen because they are highly engaged in performance measurement (Pollitt, 2013). In Pollitt's (2013) terms, they are the top officials and technocrats.

Table 3.2  The chain of performance measurement with aspects and actors

Links in the Chain | Aspect | Actors
Context; Policy & strategy; Criteria; Rules | Policy and decision making | The measurement deciders and steerers of the application of measures
Understandings; Actions; Outputs | Management and technical | The measurers interact with the measures and the measured
Consequences | Effects | Effects arise from the interaction between the measurers, the measured, and the measures, and are fed back to the measure deciders and steerers

Source: Adapted from Lewis (2015).

Top officials need to understand performance measurement because of its potential to act as a control mechanism over operational staff and organizations, and technocrats are the designers and improvers of the measurement systems, so this is core to their jobs. They are located in central government departments and authorities, rather than in organizations that directly deliver public services, and are involved in the creation and dissemination of measures, rather than in directly measuring performance. They are a crucial part of the institutional context of performance steering (who has rights and other instruments to steer public organizations and programmes) in performance regimes (Talbot, 2008).

KEY INSTRUMENTS, METHODS, TOOLS: THE PURPOSES OF PERFORMANCE MEASUREMENT
Why measure performance? This question opens up a vast array of reasons for why it is established and how it is used in practice. Van Dooren's (2006) comprehensive list of 44 potential uses provides an instructive overview of practices – ranging from types of budgeting to communicating with the public and strategic priority setting, to name just a few – but it is too specific for the central focus here. The question of why performance measurement is used is much harder to establish, and relies on attempts to observe underlying purposes. Despite the explicit purposes of transparency and improvement, the implicit purpose of performance measurement can be any number of things. Since performance measurement is linked to notions of control, management, and consequences, it provides the potential for some actors to enhance their power. Performance indicators can be used to monitor the strategic or operational performance of an organization, to control the lower levels of an organization, to manage street-level bureaucrats, or to appraise performance (Carter et al., 1992). In this chapter the focus is on key elements of governance – control, steering, power, and accountability.

The purposes of performance assessment schemes listed by Pollitt (1987) include: to clarify an organization's objectives; to evaluate the final outcomes from activities; to indicate areas of potential cost savings (efficiencies); to raise questions about the organization of resources; to use in allocating staff incentive schemes; to determine the most cost-effective level of service for a given target; to indicate standards and monitor their fulfilment; to indicate how activities contribute towards a policy goal; to enable consumers to make informed choices; and to provide staff with feedback so they can improve their practice. This list reflects a focus on measurement and policy questions, which differs from much of the literature, which is aimed more at management and managers.

Indeed, one of the most widely cited lists of reasons for measuring performance is Behn's (2003) comprehensive overview. His focus is on the purposes that public managers (rather than government organizations) are attempting to achieve by measuring performance. Surveying and synthesizing information from numerous sources, Behn lists eight purposes, which are to evaluate, to control, to budget, to motivate, to promote, to celebrate, to learn, and to improve. He sees the first seven as subordinate to the ultimate purpose of performance measurement, which is to improve (the eighth purpose). It is worth spending some words on explaining these different purposes. It is also apparent that many of these are closely aligned with NPM (evaluation, control, budget, motivate), while others (promote, celebrate, learn, improve) align more closely with a learning and improvement perspective (for more on this distinction see Bovens et al., 2008; Lewis and Triantafillou, 2012). As Behn records, for evaluation there are many mechanisms, and for control and budgeting, quite a few. But for the purposes of learning and improving, more open-ended analyses are required. A learning (as opposed to controlling) perspective requires a shift away from performance measurement for investigating failures and punishing those responsible, towards quality, learning from failure, and avoiding the allocation of blame (see Bovens et al., 2008).

Evaluation, the first of Behn's purposes, is central to measuring performance – such measurement can and will be used to evaluate an agency's performance. The second purpose is control. Performance measurement is used by managers to ensure that their subordinates are doing what they should be doing. Budgeting is the third purpose listed by Behn (2003). Performance measurement can be used to help make micro budget allocations, within the macro political priorities established by elected officials. Public managers can also use it to motivate people to perform better, through the use of goals and targets to focus their thinking and provide a sense of accomplishment. Performance measurement can also be used to promote the good work of agencies and convince politicians, journalists, and citizens that they are performing well. Related to this is the purpose of celebrating success. Doing this gives people a sense of relevance and self-worth and motivates future efforts. The seventh purpose that Behn lists is to learn what is working and what is not working. Beyond evaluation, the purpose of performance measurement is to learn why certain things work or do not work. In Behn's list, the ultimate purpose is to make improvements. Performance measurement is not an end in itself, but must be used to discover what should be done differently in order to improve performance (Behn, 2003).

Levels of control and units of analysis are important here. Generally, performance measures are not chosen by the senior public administrators leading the public organizations that are being measured. The measures are instead imposed on them by central government departments and agencies. Journalists and stakeholders as well as citizens also have an impact on the purposes, the measures, and the standards chosen. Performance measurement is clearly a means for managing and directing the work of individuals or organizations. The subtext of the performance movement is a concern about control (Radin, 2006): it is used as a mechanism of political discipline by superordinate governments to change the behaviour of lower-level governments in the US, the UK, and the European Union (Bertelli and John, 2010).

Moving beyond Behn's list, with his focus on managers and their (internal) management of the performance of their staff, accountability becomes crucial. As has been noted already, accountability itself has moved more towards a focus on performance. Accountability incorporates external expectations and control through legal and political accountability (Bovens et al., 2008). But, as Van Dooren et al. (2010) argue, performance information can be used to find out what works and why, to control and motivate, and to give account (to communicate with the outside world about performance). Hence, performance measurement extends beyond internal assurance and management practices into the broader notion of (external) accountability.
Another reason for measuring performance lies in the accountability-related concerns of investigating failures and corruption, stopping abuses of power, and sanctioning bad behaviour (van de Walle and Cornelissen, 2014).

it is about changing the power balance between the measurer and the measured. The second consideration is that the purpose of measurement might be neither learning nor improvement, but measurement to demonstrate that it is being done. The reason to measure performance is then to provide assurance that measurement is seen as important and is being undertaken. It might even be merely a ceremonial device to assure the intended audience (international/regional bodies, other levels of government, politicians, superior public bureaucracies, funders, citizens) that all is well (Collier, 2008).

Table 3.3 outlines a classification of purposes based on the preceding review of the literature. This list of possible purposes spans Van Dooren et al.'s (2010) three broad purposes of performance measurement: evaluation (finding out what works and why); management (control and motivate); and accountability (hold to account). In Table 3.3, the purposes most directly related to governance (controlling subordinates, purposively steering to achieve policy goals, exercising power, holding accountable, and signifying the importance of measurement) are in italics.

Table 3.3  The purposes of performance measurement

Evaluate: Establish level of performance (Behn, 2003; Pollitt, 1987)
*Control*: Ensure subordinates are doing what is required (Behn, 2003; Radin, 2006)
Budget: Prioritize funding (Behn, 2003; Pollitt, 1987; Radin, 2006)
Motivate: Use goals and targets to set desired direction (Behn, 2003; Pollitt, 1987; Talbot, 2005)
Promote: Convince others of good performance (Behn, 2003; Collier, 2008)
Celebrate: Recognize worthy accomplishments (Behn, 2003; Talbot, 2005)
Learn: Discover why some things are working or not working (Behn, 2003; Pollitt, 1987; Talbot, 2005)
Improve: Establish what should be done differently to improve performance (Behn, 2003; Pollitt, 1987; Talbot, 2005)
*Steer*: Use measurement to achieve desired governmental/policy goals (Bertelli and John, 2010; Pollitt, 1987; Radin, 2006)
*Dominate*: Get, keep or enhance power over others (Pollitt, 1987; Radin, 2006; van Dooren et al., 2010)
Punish: Sanction bad behaviour (Radin, 2006; van de Walle and Cornelissen, 2014)
*Account*: To hold accountable (Bovens et al., 2008; Pollitt, 1987; van Dooren et al., 2010)
*Signify*: Demonstrate that measurement is important and is being done (Collier, 2008; van Dooren et al., 2010)

EXAMINING PERFORMANCE MEASUREMENT'S PURPOSES IN THREE NATIONS AND TWO POLICY SECTORS
This section empirically examines the purposes of performance measurement for the health and higher education policy sectors in Australia, Canada, and the UK. These three Anglophone nations have all engaged in NPM and performance measurement within similar political institutions, but also with some important differences. The UK is a unitary state – although devolution to Scotland and Wales has occurred to some extent. It is institutionally the most centralized of these three nations. Australia and Canada are both federations of states/territories/provinces, but with disparate versions of federalism. Australia has a more strongly coordinated version of federalism, with the Commonwealth's largely exclusive power over direct taxation making it a powerful policy actor (Painter, 2000). Canada has a highly competitive and decentralized political structure and, as a result, less centrally coordinated public policies (Tomblin, 2000). The two policy sectors were chosen because they are exemplars where performance measurement is practised widely. Given the financial cost and social importance of the health sector, it was expected that measurement would be more directly used for governing health than higher education, in both Australia and the UK.

These national differences are reflected in their approaches to (national) performance measurement. The UK was an early adopter of NPM reforms and has applied a raft of first managerial and then market-based reforms. Australia was not far behind on these trajectories. Performance measurement at the national level is widespread in these core NPM states. In Australia, it is routinely carried out at the state/territory level and then 'fed up' to the national level. Canada embraced the initial phase of NPM (corporate management) but engaged less in the competitive practices and contracting out that followed on from this in the two other nations. Performance measurement occurs at the provincial and territory level, but this is not tightly coordinated at the national level.

Health care and higher education are both areas where there has been a substantial focus on performance measurement in the UK. There are numerous national performance frameworks for the National Health Service (NHS). For example, the NHS performance framework is self-described as 'a performance management tool … designed to strengthen existing performance management arrangements … it improves the transparency and consistency of the process of identifying and addressing underperformance across the country' (Department of Health, 2012, p. 11). The Report on Government Services produced by the Australian Productivity Commission examines several broad service areas (including health care but not higher education) and compares all of the Australian states and territories. Its general performance framework consists of equity, effectiveness, and efficiency as the three key components (SCRGSP, 2013). The National Health Performance Authority has a Performance and Accountability Framework and reports on 48 indicators for Primary Health Network areas and the performance of Local Hospital Networks and hospitals (NHPA, 2016).

There are no national performance frameworks or sets of performance measures in Canada for either the health or higher education sector. The Canadian Institute for Health Information (CIHI) provides essential information on Canada's health systems and the health of Canadians.
It was created in 1994 in response to the desire of (federal and provincial) governments for 'a nationally coordinated approach to gathering and analysing their respective financial and administrative data' (Marchildon, 2013, p. 34). CIHI provides 'comparable and actionable data and information that are used to accelerate improvements in health care, health system performance and population health across Canada' (CIHI, 2016, p. 5). Its latest strategic plan lists four performance themes – patient experience; quality and safety; outcomes; and value for money. CIHI's data comes from the provinces and territories, and from academic research projects.

Higher education measures are generally centred on teaching, research and knowledge exchange, and impact. It is simplest to focus on research, as both Australia and the UK have national research assessment systems that are broadly comparable. They rest on assessments that include publication quantity and quality as important components. Importantly, unlike other countries (see Whitley and Gläser, 2007), they both have funding allocations tied to these research performance measures – the UK to a much greater degree than Australia. As such, these national systems are amongst the most interventionist variants of research assessment around the world. Canada, on the other hand, has not introduced a national-level system of measurement for research performance. A number of provinces have introduced performance indicators: indicators related to enrolments and graduations (Alberta and Ontario); a range of indicators (British Columbia); institution-specific indicators (Saskatchewan); and student outcomes, resources and labour market analysis (the Maritimes). Only two provinces (Alberta and Ontario) have tied these to funding (OCUFA, 2006).

In his work on performance regimes, Talbot (2008) mapped out the institutions involved: central ministries; line ministries; legislatures; audit, inspection and regulatory agencies; judicial and quasi-judicial bodies; professional associations; users and user bodies; other public agencies; and agencies themselves. Three of these were included in the empirical work here: line ministries; audit, inspection and regulatory agencies; and (in the case of higher education) professional associations. The first two of these represent the public bureaucracies that are central in deciding upon and steering the performance measurement of subordinate organizations. The associations included in the higher education case represent groups of organizations who are regularly consulted by government departments.

The interviewees were senior administrators who represent the measure deciders and steerers. As noted earlier, they are highly engaged in performance measurement because of its potential to act as a control mechanism over operational staff and organizations, and because it is central to their jobs (Pollitt, 2013). Interviews were conducted during 2014–15. In government departments, the interviewees were generally at director level, with some line responsibility for performance measurement. In all other agencies, they were CEOs or equivalent. A total of 34 people were interviewed, split evenly across the two sectors and three nations (see the Appendix). The interviews were recorded and transcribed, and the transcripts were read with the aim of discovering the range of performance measurement's purposes. The number of different organizations involved in measuring performance – particularly in the health sector – is very large. A small number of interviews cannot possibly cover the entire gamut of perspectives but can be considered as representing the views of some of the individuals and organizations involved.

The Purposes of Measurement
The aim of this analysis was to provide an insight into senior administrators' understandings of the purposes of measurement. The 13 purposes identified from the literature (see Table 3.3) were searched for in the transcripts, using the words in the table and a range of synonyms. The five purposes shown in italics in the table were regarded as the most important for governance – to control, steer, dominate, account, and signify.

A clear pattern that emerged from this analysis was that governance (represented by the five purposes identified in Table 3.3) appeared frequently. Perhaps this is not surprising given that these interviewees were doing high-level jobs in central bureaucracies and agencies, and that controlling, steering, and holding others to account were central to their roles. The only instance where this was not the case was the health policy sector in Canada, reflecting Canada's lack of national-level systems of measurement, as well as its strong emphasis on distributed sub-national data gathering which is not connected to performance measurement. All comments by the interviewees that expressed a stated purpose of why measurement was being done were examined for themes. These are presented in the next two sections using quotes from the interviews, for each of the two policy sectors.

Higher Education
There were several mentions that the purpose of measurement for higher education was to support the knowledge economy and enhance global competitiveness. This is captured in the following quotes, which provide a clear sense of control and steering the sector:

you actually have a vision for Australia to actually change our economy to be a technologically driven country rather than a resources driven country. (Australia)

So our job is to essentially try to … move the competitiveness of Canadian researchers globally. (Canada)

it's about the fact that we need to have a high skills economy. (UK)

For Australia, there was discussion of measurement as a means to influence the behaviour of universities – providing a clear intention of control and the use of power:

that's what the metrics are about doing; changing behaviour. (Australia)

And a focus on waste and meting out punishment:

So every time budget process comes around, Finance … will say, 'Well there's all this wastage. We need to clean up the wastage. We'll take a cut and we'll, you know, administer you a dose of medicine and then we'll also require you to better target' … 'Just don't spend as much on all those basket-weavers.'

In the UK the focus was much more squarely on accountability and regulation:

this is actually to do with notions of accountability, and accountability running into governance … how do they know how things are going? They don't unless they've got a measure. And so we are driven towards measurements.

NPM, performance measurement, and measuring for governance  55 essentially we regulate on the basis that ‘if you want our money you have to do this’.

The Canadian example (with no direct link between measurement and funding at the national level) describes a different view of measurement that is more about investment and celebration:

… to foster world class research through our investments; to make sure that we attract and retain top talent; to ensure that the investments we make allow for the training of the next generation …

performance measurement and reporting, you know, shouldn't – it shouldn't be a burden, it should be a celebration [laughs] … The less it's about accountability, and the more it's about celebrating the results of, you know, the hard work that people have done, and the importance of it, the better …

there are no explicit performance targets set, and they're – I think, unlike the UK and Australia, we haven't gone so far at the federal research level in terms of allocating funding on such a – quite a controlled basis.

In summary, the emphasis in Australia was on using measurement to steer, change behaviour, and punish poor performance. For the UK, accountability and regulation featured in the interviewees' comments, along with a strong sense of steering the higher education sector with measurement. Canada had a different emphasis that reflected more of a learning approach to the use of measurement. Interviewees in all three nations expressed their ambitions of using performance measures to steer their nations' economies to make them more globally competitive. This last purpose suggests a high-level governance direction, while the references to changing behaviour, holding to account, punishment, and control all suggest NPM ideas.

Health
For the health policy sector, the national context came through very strongly, as did the linking of performance measurement to governance and policy. For Australia, health policy at the federal government level had a strong focus on steering the sub-national governments. The Australian interviewees spoke of performance measurement in the health sector as being driven by concerns about transparency and a desire to uncover variation at and below state/territory level:

I don't think there was a robust enough consideration of: What's the objectives of the different elements of the health sector? What are we trying to achieve? It was more about we need some more transparency around how money's spent and what is happening at the local level, and that's coming to some sort of compromise position on what areas that transparency might be focused on.

and

So one of the key purposes behind setting up that was we, I guess, uncover the variation in performance that sits below the state level.

Another purpose mentioned was the explicit use of performance measurement to reform health systems and change behaviour:

… in the lead up to that there was a lot of criticism about performance, funding for performance and performance based, all that sort of stuff … but actually what the evidence was showing was actually it can work quite well to address fundamental issues within health systems … I think we all recognise it can be a tool to drive certain behaviours in the system, hopefully not in a perverse way.

The UK interviewees spoke about transparency and accountability at the regional commissioning level:

… the indicators and the data that comes from those is then available to sort of drive improvement and I guess the theory is that NHS England … look at those indicators and they say: we know next year perhaps we need to focus on sort of that and also that those indicators provide something in terms of public transparency so that the commissioning group for Manchester or wherever can then be accountable to its local population.

They also discussed regulation and risk management, in the wake of a major hospital scandal that had led to a review (the Francis Report) and an overhaul of measurement in the NHS:

We're doing this because regulation was seen to have failed, quite frankly, … in Mid Staffordshire … the consequence of Robert Francis's report was he made a number of recommendations around changes that needed to be made in regulation …

… we look at financial risk, because one of our fundamental goals is to ensure continuity of services and within the NHS hospitals now, … they have to use that money to manage their affairs and pay their staff, pay their bills, fund their buildings and so on.

There was also a strong sense that national performance measures were being used for governance purposes by hospital boards:

… foundation trusts have to be well governed and we assess what we see as being the key aspects of their governance. And to do that we use national performance targets, … things we see as being touchstones of good governance, even though we're not performance managing, they become de facto performance management priorities for boards.

and

… one of the big conversations on the back of Robert Francis was about responsibility of leadership of boards and the governance of organisations and it really opened up a conversation which I don't think has been prominent in any way, shape or form about culture and the role of leadership in defining the culture about what's important around here …

The overarching Canadian story is one of good quality data being available, but not being used to direct policy or strategy or improve performance:

Really the function was to make sure that the data … was comparable and of high quality.

there's not a lot of objectives, inter-governmental cooperation and planning on the key aspects of health care. And particularly when it comes to measurement and reporting, so it's a very diluted, disaggregated sort of approach here …

the collection and dissemination of the data, not necessarily focused on interpreting it for future policy direction.

In summary, for Australia, the ability to compare performance between states and territories was crucial, while the structure of the NHS and changes made in the UK following a hospital scandal made risk management, transparency, and governance by boards a central concern.

And in Canada, the lack of any national-level system was described as responsible for the disconnect between good data and policy and strategy. As one interviewee remarked: 'there is no directive by governments collectively to do anything for Canada as a whole'. The fingerprints of NPM are clear in these discussions for Australia and the UK, where central steering of devolved health systems, using indicators, was an imperative. Measuring for governance was most clearly articulated in the UK.

CONCLUSION
The rise of performance measurement alongside the diffusion of NPM around the world, in many different types and forms, has been accompanied by an explosion in the purposes of performance measurement. This is confirmed by the literature, which continues to report more reasons for measuring performance, and is paralleled by an unabated interest among governments in using measurement to support and direct all manner of policy outcomes. Performance measurement is particularly recognized as a policy instrument for governing by those who decide on measures and use them for steering other organizations.

Based on the empirical analysis presented in this chapter, it seems that those who perform the roles of national-level measurement deciders and steerers see governance as an important purpose of performance measurement. As noted earlier, this is not surprising given the responsibilities of the interviewees. What is perhaps more revealing is that they were very comfortable attributing all kinds of governance purposes to the use of measurement – signalling their belief that performance measurement is generating information that can and should be used to govern. Measuring for governance has followed NPM's spread and expansion, and its focus on the need to measure in order to manage has now been (at least conceptually) stretched into the need to measure in order to govern.

This chapter highlights that the rise of NPM and the consequent explosion of performance measurement has created the space for the measurement of governance. The expansion of purposes for which performance measurement is seen to be legitimate now firmly encompasses governance as defined in this chapter: controlling, steering, exercising power, holding to account, and signifying importance. Measurement for governance is now an important part of the mix alongside more direct management tools and evaluation functions. The chapters in this handbook and the varied approaches taken to the measuring of governance are testimony to how expansive an idea measuring governance has become. An acceptance, at least conceptually, amongst these senior administrators that measuring governance is about purposefully utilizing indicators to govern suggests that – just as NPM spread the idea that measurement was needed to manage – we now need measurement to govern. The relentlessly expanding notion of performance measurement lends support to this idea: Why not govern by measurement? The technical ability to keep measuring more supports this ongoing broadening of purpose. Further, as accountability scholars have argued, NPM has redefined this field, focusing accountability towards administrative (rather than political or judicial) concerns, and redefining it as demonstrating performance.

If there is a growing focus on the use of 'measurement for governing' by public administrators, then some important questions follow. How will performance measurement continue to be applied and expanded in the service of control, steering, transparency, and accountability? What are the limits of measuring for governance? We should be concerned about the longer-term consequences of 'measurement for governing' for administrative and political power. As Du Gay and Lopdrup-Hjorth (2022) have argued, the unique attributes of public service and the state are threatened. They need to be defended against the idea that all value spheres are the same and, by extension, that all measurement can be used for any purpose. Public bureaucracy is a crucial cornerstone of constitutional rule, where the ethics of office and the importance of duty and responsibility, as distinct from predominantly individualistic ethics, are central. NPM and other modes of governance that move away from the traditional features of public bureaucracies, outlined earlier, risk undermining the public service and the role of the state. The expansion and conceptual stretching of performance measurement needs to be kept in check, and measurement for governing deserves to be applied with a great deal of caution.

REFERENCES
Barzelay, M. (1992). Breaking through bureaucracy: A new vision for managing in government. University of California Press.
Behn, R.D. (2003). Why measure performance? Different purposes require different measures. Public Administration Review, 63(5), 586–606.
Bertelli, A.M., & John, P. (2010). Performance measurement as a political discipline mechanism. University of Southern California Law School, Law and Economics Working Paper Series No. 112. Berkeley Electronic Press. http://law.bepress.com/usclwps-lewps/art112.
Boston, J., Martin, J., Pallot, J., & Walsh, P. (1996). Public management: The New Zealand model. Oxford University Press.
Bovens, M., Schillemans, T., & 't Hart, P. (2008). Does public accountability work? An assessment tool. Public Administration, 86(1), 225–42.
Bovens, M., Schillemans, T., & Goodin, R.E. (2014). Public accountability. In M. Bovens, R.E. Goodin, & T. Schillemans (Eds.), The Oxford handbook of public accountability. Oxford University Press, pp. 1–20.
Campbell, C., & Halligan, J. (1992). Political leadership in an age of constraint: The Australian experience. University of Pittsburgh Press.
Carter, N., Klein, R., & Day, P. (1992). How organisations measure success: The use of performance indicators in government. Routledge.
CIHI (Canadian Institute for Health Information) website (2016). https://www.cihi.ca/en/health-system-performance (accessed 24 May 2016).
Collier, P.M. (2008). Performativity, management and governance. In J. Hartley, C. Donaldson, C. Skelcher, & M. Wallace (Eds.), Managing to improve public services. Cambridge University Press, pp. 46–64.
Considine, M., & Lewis, J.M. (1999). Governance at ground level: The frontline bureaucrat in the age of markets and networks. Public Administration Review, 59(6), 467–80.
Cutler, T. (2011). Performance management in public services 'before' New Public Management: The case of NHS acute hospitals 1948–1962. Public Policy and Administration, 26(1), 129–47.
Davis, K.E., Kingsbury, B., & Merry, S.E. (2010). Indicators as a technology of global governance. New York University Public Law and Legal Theory Working Papers. Paper 191. http://lsr.nellco.org/nyu_plltwp/191 (accessed 30 May 2017).
Department of Health (2012). The NHS performance framework: Implementation guidance. https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/216251/dh_133507.pdf (accessed 2 March 2023).
du Gay, P., & Lopdrup-Hjorth, T. (2022). For public service: State, office and ethics. Routledge.
Flinders, M. (2001). The politics of accountability in the modern state. Ashgate.

NPM, performance measurement, and measuring for governance  59 Hammerschmid, G., van de Walle, S., & Stimac, V. (2013). Internal and external use of performance information in public organizations: Results from an international survey. Public Money and Management, 33(4), 261–8. Henman, P. (2016). Performing the state: The socio-political dimensions of performance measurement in policy and public services. Policy Studies, 37(6), 499–507. Hood, C. (1983). The tools of government. Macmillan. Hood, C. (1991). A public management for all seasons. Public Administration, 69, 3–19. Hood, C., & Margetts, H. (2007). The tools of government in the digital age. Palgrave Macmillan. Kassim, H., & Le Gales, P. (2010). Exploring governance in a multi-level polity: A policy instruments approach. West European Politics, 33(1), 1–21. Lascoumes, P., & Le Gales, P. (2007). Introduction: Understanding public policy through its instruments. Governance, 20(1), 1–21. Lewis, J.M. (2015). The politics and consequences of performance measurement. Policy and Society, 34(1), 1–12. Lewis, J.M., & Triantafillou, P. (2012). From performance measurement to learning: A new source of government overload? International Review of Administrative Sciences, 78(4), 597–614. Lewis, J.M., Nguyen, P., & Considine, M. (2021). Are policy tools and governance modes coupled? Analysing welfare-to-work reform at the frontline. Policy and Society, 40(3), 397–413. Marchildon, G.P. (2013). Canada: Health system review. Health systems in transition. WHO European Observatory on Health Systems and Policies. Moynihan, D.P. (2008). The dynamics of performance management: Constructing information and reform. Georgetown University Press. NHPA (National Health Performance Authority) website (2016). http://​www​.nhpa​.gov​.au/​internet/​nhpa/​ publishing​.nsf/​Content/​PAF (accessed 24 May 2016). OCUFA (Ontario Federation of University Faculty Associations) (2006). Performance indicator use in Canada, the US and abroad. OCUFA. OECD (Organisation for Economic Co-operation and Development) (2005). Modernising government: The way forward. OECD. Painter, M. (2000). When adversaries collaborate: Conditional co-operation in Australia’s arm’s length federal polity. In U. Wachendorfer-Schmidt (Ed.), Federalism and political performance. Routledge, pp. 130–45. Pollitt, C. (1987). The politics of performance assessment: Lessons for higher education? Studies in Higher Education, 12(1), 87–98. Pollitt, C. (2013). The logics of performance management. Evaluation, 19(4), 346–63. Pollitt, C., & Bouckaert, G. (2011), Public management reform: A comparative analysis (3rd ed.). Oxford University Press. Power, M. (1997). The audit society: Rituals of verification. Oxford University Press. Radin, B.A. (2006). Challenging the performance movement: Accountability, complexity and democratic values. Georgetown University Press. SCRGSP (Steering Committee for the Review of Government Service Provision) (2013). Report on government services 2013. Canberra: Productivity Commission. Talbot, C. (2005). Performance management. In E. Ferlie, L. Lynn, & C. Pollitt (Eds.), The Oxford handbook of public management. Oxford University Press, pp. 491–517. Talbot, C. (2008). Performance regimes: The institutional context of performance policies. International Journal of Public Administration, 31(14), 1569–91. Taylor, J. (2009). Strengthening the link between performance measurement and decision making. Public Administration, 87(4), 853–71. Tomblin, S. (2000). 
Federal constraints and regional integration in Canada. In U. Wachendorfer-Schmidt (Ed.), Federalism and political performance. Routledge, pp. 146–74. Triantafillou, P. (2017). Neoliberal power and public management reforms. Manchester University Press. van de Walle, S., & Cornelissen, F. (2014). Performance reporting. In M. Bovens, R.E. Goodin, & T. Schillemans (Eds.), The Oxford handbook of public accountability. Oxford University Press, pp. 441–55.

van Dooren, W. (2006). Performance measurement in the Flemish public sector: A supply and demand approach. Faculty of Social Sciences KU Leuven.
van Dooren, W., Bouckaert, G., & Halligan, J. (2010). Performance management in the public sector. Routledge.
Whitley, R., & Gläser, J. (Eds.) (2007). The changing governance of the sciences: The advent of research evaluation systems. Springer.


APPENDIX: LIST OF AGENCIES (AND NUMBERS OF PEOPLE) INTERVIEWED

Australia
● Department of Employment and Training – Higher Education Division (1)
● Department of Employment and Training – Research Funding and Policy Division (1)
● Tertiary Education Quality and Standards Agency (1)
● Group of 8 – current Chair (1)
● Universities Australia – current President (1)
● Department of Health and Ageing – Hospital Performance, Governance and Infrastructure Division (3)
● Department of Health and Ageing – Safety, Quality and Research Division (3)
● National Health Performance Authority (1)

Canada
● Association of Universities and Colleges Canada (1)
● Higher Education Quality Council of Ontario (1)
● Social Sciences and Humanities Research Council of Canada (2)
● Canada Foundation for Innovation (1)
● Ontario Ministry of Training, Colleges and Universities (2)
● Canadian Institute for Health Information (1)
● Health Council of Canada (2)
● Public Health Agency of Canada – Governance, Planning and Reporting Directorate (2)

UK
● National Institute for Health and Care Excellence (1)
● Monitor – financial regulator (2)
● Care Quality Commission (1)
● Healthcare Quality Improvement Partnership (1)
● Higher Education Funding Council of England (2)
● The Russell Group – current Chair (1)
● Universities UK – CEO (1)
● REF expert panelist (1)

4. The constitutive effects of measuring governance
Peter Dahler-Larsen

THE RELEVANCE OF CONSTITUTIVE EFFECTS
As early as 5000 years ago, the pharaoh in Egypt appointed ministers called viziers to collect taxes (paid in grain). We know about this practice because they kept records of the taxes collected. In 1854 in London, Doctor John Snow placed the incidences of cholera on a district map. His geographical statistics paved the way for modern epidemiology and for sanitary practices in modern states (Johnson, 2006).

States have built their ability to govern on measurement practices. The terms 'state' and 'statistics' share the same etymological roots. Modern states carry out intense measurement practices not only related to tax collection and health, but also in policy domains such as, but not limited to, poverty, unemployment, economics, demographics, corruption, sustainability, crime, education, well-being, and quality of life.

The term 'constitutive effects of measurement' denotes a perspective on how measurement operates and thereby how it contributes to the creation of new social realities. To constitute something means to lay a foundation, to establish, and to 'make up' something, be it actors such as the taxpayers or healthy citizens in the examples above, but also programmes, interventions, accountabilities, relations, and governance itself. The concept of constitutive effects helps us analyse how measurement itself operates as a form of governance: measurement as governance. At the same time, governance can be an object of measurement: measurement of governance. Constitutive effects can occur in both of these dimensions.

The term constitutive places the construction of reality through measurement at a more central theoretical place in analyses of governance than an alternative concept such as 'unintended effects'. What the concept of 'constitutive effects' does is to take what is otherwise depicted as a 'side effect', an 'overflow', or even a morally innocent epiphenomenon, and instead place these effects in the midst of the very idea of governance, as well as in medias res concerning the political and democratic implications of measurement as/of governance. The relevance and importance of attention to these implications are intensified as measurement as/of governance relates to improving the public good (as mentioned in the introduction to this book).

Measurement makes it possible to 'govern at a distance' (Rose & Miller, 1992). The consequences of measurement as/of governance are differentiated over time and displaced across social and geographical spaces. The architects behind measurement regimes are often not in a position to observe the effects of these systems, and there may be a lack of mechanisms available to sensitize the architects themselves to these effects, not to mention to hold the architects accountable. The isolation of measurement architects from the social consequences of measurement may be one of the mechanisms which secure a particular form of governance through measurement. An interesting democratic challenge in contemporary society is therefore how to invent institutional mechanisms to handle feedback about the constitutive effects of measurement as/of governance. Can the actors in the evaluation society (Dahler-Larsen, 2012) share their observations of complex constitutive effects in such a way that they discover a sense of common destiny that forms the basis for collective action (Dewey, 1927)?

The body of this chapter consists of five sections. First, the term 'constitutive effects' is unpacked and justified in comparison to what might be a more commonly used term, 'unintended effects'. The second section looks at mechanisms and processes explaining how and why constitutive effects occur. The third section adds the dimension of time in terms of dynamics and histories of measurement as/of governance, as well as the notion of anticipatory governance. The fourth section discusses normativity and resources as sources of instability in measurement regimes. The final section is devoted to three small case studies of diverse constitutive effects. The examples illustrate particular socio-historical styles of governance as well as different normative perspectives on these effects.

CONSTITUTIVE RATHER THAN UNINTENDED

It is often assumed that measurement, for example in the form of performance indicators, has unintended effects (Courtney, Needell, & Wulczyn, 2004; Espeland & Sauder, 2007; Smith, 1995; Weingart, 2005). Although there may be some (but not total) empirical overlap between unintended and constitutive effects, it is theoretically important to distinguish between the two concepts. The term ‘unintended’ continues to hinge on the intentions behind measurement schemes (Dahler-Larsen, 2013). More often than not, the term ‘intention’ assumes an individualistic and rationalistic perspective which lacks attention to the broader myths, scripts, and institutional norms circulating in particular eras and in particular organizational fields. In empirical analyses, the exact identification of ‘intentions’ also poses problems. Whose intentions count? At what point in time? What if the most important intentions are not officially declared? In complex governance regimes, many kinds of actors may have different intentions evolving over time. Intentions may be difficult to capture both conceptually and empirically.

Consider an example. A ranking list of schools based on performance criteria helps create a competitive relation between schools. Is this intended or not? The official purpose of a ranking list may be to ‘offer transparency’ to parents in relation to school choice, to ‘allow schools to learn from the best examples’, and ultimately to ‘improve education all over the country’. How do we know, however, that more competition between schools is not ‘intended’, too? A strong advocate of neoliberalism, for example, would enjoy seeing more competition. If a given system of school ranking is the result of a political compromise between neoliberals and social democrats, where the former want more competition among schools and the latter do not, then whose intentions count? If an effect can be both intended and unintended, how useful is it to maintain the distinction between the two?

It is too often taken for granted that an analyst of measurement as/of governance can safely identify with an imagined unitary and benevolent designer of measurement regimes and assume that ‘we’ ‘know’ what the ‘intentions’ are. It is time to question this analytical identification with a particular social agent. Furthermore, by declaring an effect ‘unintended’, its social, moral, and political significance is diminished: ‘Oh, there was a bit of collateral damage, but nobody can be blamed, it all happened inadvertently.’ Instead, the term constitutive effects portrays the effects of measurement as/of governance as real phenomena with real consequences, without relying on ‘intentions’ as a standard of reference. The next section is devoted to how constitutive effects may be produced.

INSTRUMENTS, MECHANISMS, AND PROCESSES

The production of constitutive effects rests on social processes such as categorization, commensuration, temporalization, interpellation, and fixation.

Categorization is fundamental to all measurement (Porter, 1994). Counting all elements in a given category requires a definition of the elements belonging to that category. For example, counting the citizens in a country requires a definition of citizenship. A measurement of administrative burden requires a definition of administrative work as distinct from other kinds of work. A measurement of the quality of education assumes attention to structural and processual features that go beyond the classroom, but the definition of ‘education’ is not evident.

Commensuration means bringing different objects together on the same scale regardless of their different concrete and material properties. For example, it means measuring the quality of teaching in very different subjects using the same questionnaire, or transferring the productive value of different surgical operations onto the same monetary scale. Standardization and abstraction are conducive to commensuration. As a result, time and money are often used as all-encompassing universal measures into which almost anything else can be transformed. However, some abstract measures of quality can also facilitate commensuration of otherwise materially diverse objects of evaluation (Dahler-Larsen, 2019). A striking example here is indicators of quality of life.

Temporalization means dividing time so that different measures belong to different periods. For example, different evaluation objects such as building a school, growing sustainable crops, and raising a child have different corporeal, organic or material rhythms. Systematic and abstract evaluation may therefore create tensions given the materiality and specificity of a given task over time.

Interpellation means making particular social actors responsible for particular scores in a given measurement regime. Interpellation can build on actors perceived to ‘naturally’ exist already, such as students, whose responsibility for their performance is further sharpened through grades and tests. However, interpellation can also enrol actors who are required to define themselves anew in relation to a measurement as/of governance. An example might be the ‘green consumer’. Consider also an academy of music conventionally identified as a cultural institution of a specific kind, which now becomes subject to an accreditation system in higher education. The organization is thereby interpellated as an object of measurement. It must define itself anew as an auditable organizational subject which deserves a positive outcome of the accreditation process (Power, 1997). In practice, this means reforming the organizational design so that a special unit with newly hired staff takes responsibility for the accreditation process. This has to be integrated in the management of the organization, which, in turn, makes other parts of the organization responsible for their contribution through an elaborate set of documentation practices. Finally, both the process and the outcome of the accreditation process need to be communicated internally and externally. Handling these consequences of interpellation requires organizational and managerial capacities which are fundamentally different from merely being a music academy as such.

Fixation pins down an otherwise broad concept into an observable set of practices. Fixation is enhanced through an accountability bias, where actors focus on exactly that part of their behaviour that is measured and reported (Behn, 2001). For example, organizations can limit their cognitive attention through routines and procedures for the official collection of data. Organizations can establish incentive structures to support performance indicators. Local managers can multiply the effects of national indicators by adding further incentives at the local level. However, measurement can also create fixation through discursive, rhetorical, and symbolic manoeuvres, which fix the attention of organizational members in one direction rather than another.

Fixation may be supported by institutional lock-in (Osterloh & Frey, 2014), where institutional mechanisms support measurement regimes to such a degree that they become self-reinforcing. Certain calculative techniques define a particular set of political goals as more legitimate, more likely, and more measurable than others (Triantafillou, 2011). Employees who score well in measurement systems may be portrayed as professional, intelligent and ‘in the know’, whereas critics of these systems are indirectly blamed for incompetence and for not knowing how to score well. If the former are given high status while the latter are relegated to a peripheral social position or leave the organization or profession, institutional lock-in may be further reinforced. An additional reinforcement takes place when a measurement regime is evaluated and an increase in the score on the indicators defined by the same system is used as the criterion of success. The proof of the value of the system is found within the system itself. In an extreme situation, management takes place based only on what is reported in official performance and evaluation systems (Roberts, 2017).

Fixation can reflect very specific, pedestrian and perhaps trivial properties of a measurement system. For example, it becomes a goal in itself to ensure that a score is ‘among the top ten’, or ‘not below average’, or ‘not below category 3 on a scale from 1 to 5’, or that ‘at least X percent of outcomes do not fall in category Y’. Another criterion may just be ‘better than last year’. There is perhaps no good way of setting standards (Stake, 2004), but once standards are set, perhaps primarily driven by technical, operational or statistical motives, practical and social consequences may flow from them. Herein lies an important contribution to constitutive effects.

Where can constitutive effects be studied more specifically? For example, in the following domains: ‘content’, ‘timing’, ‘relations’, and ‘systemic interactions’. Under the headline of content, it can be studied how the substance of teaching, health care, administration, etc. changes as a result of measurement regimes. A classic example is teaching that focuses on what is being measured in tests. In the category of timing, it can be studied how measurement regimes prescribe periodizations, rhythms, and deadlines for activities. An example is an action plan with deadlines prescribing the activities required of an unemployed person.

When social actors respond to measurement as/of governance, their identities and relations also become affected. Regimes of measurement help define democratic citizens, healthy workers, reliable taxpayers, strong soldiers, and more. Some forms of evaluation co-construct users of services in particular ways, for example, when student satisfaction surveys assume that students are ‘consumers’ of education (Cheney, McMillan, & Schwartzman, 1997).

New reporting mechanisms install new accountability relations and power structures. In some cases of measurement as/of governance, the obligation of A to report to B about a particular phenomenon may be a more important source of constitutive effects than the description of the reported phenomenon itself. For example, installing a regime of low-validity tests in schools may establish a principle of political and organizational accountability in schools (in addition to the effects of this regime on pedagogical practices). In some situations, a particular actor is constituted as someone who delivers data about other actors (such as when an accreditation of higher education requires students to fill in questionnaires about their satisfaction with teachers and teaching). In its radical version, a reporting relationship between a person and the secret police may undermine that person’s relation to family members and loved ones.

Finally, constitutive effects may manifest themselves in terms of larger systemic effects across levels of analysis. For example, when international measurement as/of governance is introduced, national governments may be motivated to install their own internal performance management regimes in local governments and institutions. Another set of ripple effects may come from an indicator designed for informational purposes which later becomes connected to financial incentives. The new incentive structure may set in motion a number of behavioural patterns, which may have systemic effects on several types of actors and several layers of a governance system, including, of course, financial streams. In other words, constitutive effects can in principle be found in different domains and at different levels of analysis. The next section places the production of constitutive effects in a larger perspective of time, history, and dynamics.

DYNAMICS AND HISTORIES IN MEASUREMENT AS/OF GOVERNANCE

In the broadest possible macro perspective, a given form of measurement as/of governance is a reflection of the Zeitgeist of a given era, the need for steering, and the technological possibilities of the time. From early times, there have been registration and documentation practices related to the calculation of taxes and the provision of military service by citizens, two central elements in the formation of statehood and the idea of government (see Howard, Chapter 1, this volume). What we today call social research methods were to a large extent invented to respond to the needs of governments to create knowledge relevant for management and steering in such areas as health care, epidemiology, poverty, education, and more (Easthope, 1974). The knowledge needed was specific to a socio-historical situation, but it also helped create a socio-historical path forward for particular types of governance. The more extensive use of responsibilization of citizens and organizations through measurement of performance, which we see in contemporary times, is thus in many ways a continuation of earlier practices rather than an invention de novo. What is new may be the expansion of measurement to the global level (Rottenburg & Merry, 2015), the migration of measurement into new areas (such as philanthropy) (Brest, 2012), and the colonization of many areas of the life of individuals under performance-oriented and competition-oriented regimes. We may also see more sophisticated ways of seeking to control the future.

Measurement as/of governance in modern times can be meaningfully analyzed in terms of three ideal-typical styles of governance with distinct socio-historical characteristics (Dahler-Larsen, 2012). Modernity itself is characterized by a belief in rationality, linearity, and progress. A typically modern assumption is that progress in one domain is logically connected to progress in other domains (for example, advances in technology mean advances in health, and economic growth means advances in quality of life). Under modernity, many measurement regimes quite straightforwardly assume that more is better. A prominent example is GDP.

Under reflexive modernity, there is an increasing recognition that many of the problems in society are caused by modernity itself (Beck, 1992). Therefore, more is no longer necessarily better. Under reflexive modernity, there is an increasing attention to complexity and side effects. New measures of sustainability, equity, equality, etc. are brought into play. These measures focus more on broader social issues than on monetary measures alone, and more on balance than on growth. Some measurement regimes are based on the acknowledgement that central authority is incapable of controlling all social change. Instead, soft regulation focused on forms of self-governance, learning, reflexivity, and deliberation is promoted. A key motif in ‘reflexive government’ is an organization’s internalization of a degree of responsibility for its own side effects (Beck, 1992; Dean, 2009: 223). An example is the European Union’s (EU) directive 89/391, which states that all employers in the EU must carry out an assessment of safety and health at the workplace, but which specifies no methods, standards, results, or outcomes. The idea is that the employer, together with the employees, will design their own reflexive process in such a way that it leads to improvements in safety and health at the workplace (Dahler-Larsen, Sundby, & Boodhoo, 2020). The soft nature of this evaluation regime may be due to the unevenness and complexity of the regulative object at hand (‘safety’ and ‘health’). It may also reflect the low political priority of this policy area in comparison with other policy areas in the EU (Smismans, 2003).

A third style of measurement as/of governance can be termed ‘the audit society’ (Power, 1997). In this style of measurement, organizations must make themselves auditable through a number of documentation practices which are, in turn, the object of scrutiny for external auditors, inspection and accreditation agencies. The audit society produces a number of comprehensive evaluation machines with a focus on preemptive attention to risk factors before they lead to problems. This focus resonates with a view of the world as a dangerous and unpredictable place. The risks are many and diverse: terror, hacking, internet trolls, war, disasters. Preventive measures are needed. They must be codified and documented. A key theme in this style of measurement as/of governance is to secure documentation of the procedures needed to prevent dramatic and large-scale negative events. Interestingly enough, this mental model seems to migrate from areas of high politics (such as war, disaster, and risk management) into areas of low politics (such as accreditation of higher education and social work). Given its strong focus on documentation of procedures, the audit society aspires to suspend subjective judgement, but it lends itself to systematic bureaucratization, with formalism and reification lurking around the corner.

The audit society narrows the evaluative perspective by installing procedures which allegedly allow risks and side effects to be managed even before they occur. But the preoccupation with risk, disaster, terror, or simply bad quality or failure does not always permit the kind of open critical questioning which was the hallmark of reflexive modernization.

This tripartite scheme (modernity/reflexive modernity/audit society) is useful at the macro-level as an analytical framework of substantially different forms of measurement as/of governance, but empirical realities are immensely more complicated than these three schematic ideal-types. Different styles of measurement reflect broader socio-historical imaginaries (Castoriadis, 1997), but not necessarily without contradiction and controversy.

One of the more advanced attempts to exert control can be found in the category of anticipatory governance. One element in anticipatory governance is a claim to knowledge about the future. In some situations, this knowledge can be founded in a particular epistemic model, perhaps supplemented with a scientific projection (Robertson, 2022). It may also be found among local ‘wise men’ who claim to have particular insights (Dahler-Larsen, 2022). It may be critical for the success of a particular kind of anticipatory governance that this image of the future becomes sufficiently widely accepted. Another element in anticipatory governance is the preemptive installation, and subsequent monitoring, of initiatives and precautionary actions aimed at controlling the future. Anticipatory governance plays with these elements so that measurement as/of governance helps produce a self-fulfilling prophecy. Finally, anticipatory governance may confirm its own legitimacy by using an increased score on self-selected criteria as a measure of its own success. However, attempts to control the future through measurement as/of governance are often contested. The next section explains why.

CONTESTATION: NORMATIVITIES AND RESOURCES

When a measurement regime is installed, there may be attempts to stabilize the regime (for example, through fixation and institutional lock-in). It is a classical theme in studies of performance measurement that people change their behaviours as a result of being measured. But there can also be push-back from the local organizations and individuals under measurement (Rothstein, Huber, & Gaskell, 2006). For example, professionals under measurement allegedly engage in behaviours characterized as ‘gaming’ (Radnor, 2008). This term evokes an imagery in which a real, intended target counts as a normative yardstick, while shrewd and self-interested actors drive their activities away from that target. Some studies challenge this assumption. One study showed that doctors can play with diagnostic codes and other documentation practices to create space for further examinations when there is uncertainty about the status of a patient (Kerpershoek, Groenleer, & de Bruijn, 2016). These moves may be in the interests of the patients. Only if our analytical perspective tacitly assumes solidarity with the assumed rational and benevolent designer of the measurement system can we call these behaviours ‘unintended’ or ‘gaming’.

It is not under the control of the architects of a measurement regime whether a normative discussion emerges. Consider an example where average exam grades are used as indicators of school quality. Some argue that if a ranking list is made publicly available, free school choice will widen the differences between schools, since the most well-off parents will send their children to the highest-ranking schools. The best teachers will move in the same direction. Educational researchers may say that these moves are ‘unintended’ consequences of the ranking scheme. Furthermore, researchers in education or statistics will argue that raw measures of grades do not fairly and correctly measure school quality, since different schools recruit children from uneven social backgrounds. Statisticians then recommend that the raw exam scores be controlled for the socioeconomic and ethnic background of families in the neighbourhood. If that recommendation is taken up, a website will, in the holy name of transparency, also provide data about the social and ethnic composition of the families who send their children to that school. This information includes the proportion of single parents or the proportion of immigrants. Now, some parents protest against the publication of these indicators. They argue that just because you are a single parent or an immigrant does not mean that you cannot be a good parent. What others see as a statistical control factor, they see as unfortunate and unfair social labelling of particular groups of people. Again, constitutive effects unfold in the interaction between statistics and social life. They do so as reactions to reactions to reactions, each triggering its own set of normativities.

One of the potential normative controversies has to do with social inequality. Akselvoll and Dahler-Larsen (2021) describe how evaluations of school children communicated to parents through digital media create new inequalities between parents of different social classes, depending on the resources of the family and its adeptness with digital media. In other words, there may be far from perfect overlap, and sometimes contradictions, between the technical/practical aspects of a measurement system and the normative concerns it sets in motion (Triantafillou, 2011).

The resources spent on measurement as/of governance may also be an important issue leading to change in measurement regimes. Many of these resources may be officially invisible. For example, the time professionals spend on registration and documentation is often neither measured nor monetarized and may therefore be ‘nonexistent’ in the eyes of management. However, if these costs of measurement regimes are registered (by audit offices, evaluation researchers, or others), they may become an issue of debate. Professionals will argue that time spent on documentation is time not spent on clients, patients, citizens, etc. Perhaps more importantly, politicians can also be outspoken adversaries of the paperwork imposed on professionals in the public sector, even when this administrative burden is a result of policies enacted legislatively by the same politicians. An interesting empirical question is therefore the extent to which the costs of measurement are made visible, how, and to whom.

In an example from Denmark, a survey showed that the average teacher spends 17 hours per year on the production of pupil plans (EVA, 2020). A pupil plan is an evaluation report that describes the performance of the student and sets targets for the following months. If the approximately 50,000 teachers in Denmark work approximately 1,700 hours per year, then the production of pupil plans costs the equivalent of 500 full-time teachers. If eliminating the legally mandated pupil plans is impossible, maybe their costs can be reduced.
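To make the arithmetic explicit (a back-of-the-envelope calculation using the chapter’s own approximate figures, not precise administrative data):

\[
\frac{50{,}000 \;\text{teachers} \times 17 \;\text{hours/year}}{1{,}700 \;\text{hours per full-time position}} = \frac{850{,}000 \;\text{hours}}{1{,}700 \;\text{hours}} = 500 \;\text{full-time teachers}
\]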
Let’s say teachers were allowed to give oral instead of written feedback to students and their parents, thereby cutting the time spent on producing these reports in half. Politicians should know that if they accept a more light-footed version of these reports, they can have an additional 250 full-time teachers at no extra cost in comparison to the present situation. Whether or not knowledge about costs is available may affect political decisions about the design of documentation and measurement systems. A recent school reform in Denmark paved the way for a transformation of these comprehensive pupil plans into a more simplified evaluation tool with only ‘a few important observations’.

We have also seen some large-scale evaluation systems being transformed, reduced, or eliminated, for example, in hospitals and in national research evaluation.

In another recent example, students in higher education are expected to fill in a relatively large number of surveys describing the quality of their educational institution, its learning environment, and each course. These surveys are designed by different agencies and institutions and overlap in uncoordinated ways. It has become clear that many of these surveys among students suffer from very low response rates. This is a concern to the Ministry of Higher Education and Science, to the accreditation agency, and to local heads of studies. Apparently, students have limited time and attention and do not always find these surveys worthwhile. The official system considers incentive schemes and campaigns to increase the response rates. The automatic response seems to be to spend additional institutional resources, and to demand more of students’ time, to feed these data collection systems. To maintain itself, the existing system of measurement seems to require ever-more resources. Perhaps it in fact depletes the very resources it needs for its own legitimacy and survival. A new project has been launched in the Danish Accreditation Agency with the aim of ‘rationalizing’ or ‘rightsizing’ the whole configuration of student surveys.

The sum of these observations warns against building into our analytical apparatus the ideological assumption that measurement regimes automatically become ‘better’ or ‘worse’ as a function of history itself. The uncertainty following this dynamic complexity might, in some situations, weaken potential constitutive effects, since it may be difficult to predict which performance measures one will be exposed to in the future. On the other hand, uncertainty may also enhance constitutive effects precisely because actors are motivated to take preventive and precautionary measures just to be on the safe side.

The term constitutive effects may help us pay attention to how normativities and struggles over limited resources (including limited attention) can lead to modifications in measurement as/of governance over time. The trajectory of a given regime of measurement as/of governance is a function of a complex interplay between socio-historical forces at the macro-level and reactions at the micro-level, between social imaginaries and technical detail, and between the inherent logics incorporated in measurement systems and the dynamic social and normative contexts in which they operate. There is a broad field open for empirical studies of the many manifestations and dynamics of measurement as/of governance.

The remarks made so far also highlight the importance of the normative element in any analysis of constitutive effects. The normativity should not be buried under terms such as ‘unintended’. Instead, an analysis of constitutive effects may reveal explicitly how, why, from which perspective, and for whom the practical and social consequences of measurement as/of governance become a problem. The remainder of this chapter is devoted to three examples illustrating how such analyses have been carried out in different ways in different thematic areas. The examples are organized chronologically.


EXAMPLE 1: OECD’S ANTICIPATORY GOVERNANCE IN EDUCATION

Already in 1942, the governments of the allied countries met to discuss how education systems could be rebuilt once the war was over. UNESCO, founded in 1945, was based on keywords such as ‘the universal character’ of a ‘genuine culture of peace’ and ‘the intellectual and moral solidarity of mankind’.1 Today, UNESCO has 193 members. Since its early post-war days, UNESCO has promoted an ideal of a world society based on shared knowledge. The OECD was founded in 1961 and now has 38 member countries. According to its official website, the OECD is engaged in ‘improving economic performance’, ‘creating jobs’, and ‘fostering strong education’. It does so through ‘a unique forum and knowledge hub for data and analysis, exchange of experiences, best-practice sharing, and advice on public policies and international standard-setting’.2

In an interesting example, Robertson (2022) compares the attempts of UNESCO and the OECD to influence the future of education. Where UNESCO focused on inputs such as expenditures and human resources based on a humanistic philosophy, the OECD promoted a New Public Management (NPM)-inspired focus on outcomes such as those defined by test results. Over time, UNESCO was criticized for the allegedly poor quality of its statistics. The critique was articulated by another international organization (the World Bank) which, much like the OECD but unlike UNESCO, is also defined by a fundamentally economic outlook on the world (Robertson, 2022: 198). The OECD focused more on using scientific methods (measurement, forecasting, testing, planning) to support economic modernization in the form of neoliberal competition between countries. And the OECD measured outcomes, not inputs (Robertson, 2022).

The OECD became conspicuously influential in the way it defined education as a resource in a world characterized by international competition (Grek, 2009). Policy makers, teachers, and students themselves have become motivated to see education as a resource in the anticipatory management of the future of individuals, which is, in turn, defined in economic and competitive terms. This mentality cannot, of course, be causally attributed to the OECD alone, but it is difficult to ignore the stunning elective affinity between this mentality and the imaginaries in the OECD’s anticipatory governance.

Echoing the notions introduced earlier, we can say that the Programme for International Student Assessment (PISA) seeks to introduce a particular international ‘content’ (competencies, skills, etc.) that educational systems across the world are assumed to accept as general goals. In terms of ‘relations’, PISA also seeks to construct a competitive ranking across countries (in contradistinction to UNESCO’s notion of one world). In terms of ‘timing’, every round of PISA assessments seems to inspire a new circle of educational policy-making.

At the same time, the activities of the OECD, in particular the PISA assessments, have been intensely criticized and continue to be so. One set of problems relates to what has been described as the demotivating effects of the performance-oriented and competition-oriented view of teaching and learning, and thus of teachers and students. The phenomenological aspects of being socialized as a neoliberal performing subject are described as negative, and recruitment and retention of teachers are said to suffer. Students are said to suffer from psychological pressures, if not disorders. There may or may not be a demonstrable causal link between evaluative measures in schools and the resulting problems for teachers and students. The point is, however, that when the discussions of these problems among teachers and students are articulated and politicized, they are also made relevant as constitutive effects and therefore as critical points of feedback into the formation of school policy.

Another related point of contention in school politics is the tension between measurements belonging to different and only partly overlapping organizational fields. The OECD’s assessments (such as PISA) must be able to circulate in an international field, so these measures must appear free-floating in space, not referring to a particular nation or culture. The same is not true of school policies. Some nations adopt school policies with a focus on language, history, culture and perhaps religion. Correspondingly, measures of performance in these domains circulate only inside a given national educational field. At some point in time, critical researchers and critical teachers may realize that the constitutive effects produced by measurements in the international field do not resonate well with the constitutive effects produced by measurements in the national field. They may ask officials and policy makers how to reconcile what appear to be inconsistencies. The discovery of inconsistencies may delegitimize existing regimes of steering and governance and create more local wiggle room. In essence, PISA seeks to create ‘systemic effects’, but these effects are met with resistance and counter-currents along the way. Anticipatory governance is never perfect.

EXAMPLE 2: TRANSPARENCY IN GOVERNMENT

The Corruption Perception Index (CPI) compiled by Transparency International (TI) ranks countries by perceived levels of corruption. It was first published in 1995 (Baumann, 2020). TI claims to promote ‘social and economic justice, human rights, peace and security’.3 Broadly interpreted, the CPI is an indicator of good government and, more specifically, a measure of corruption-free government. However, corruption is inherently difficult to measure. When institutions detect cases of corruption, it may be because these institutions are effective, not because there is a lot of corruption. Massive amounts of corruption may thrive in darkness. Therefore, the least problematic measure of corruption may come not from direct measurement, but from reputational assessments of governments by independent sources, hence the Corruption Perception Index.

Absolute independence in measurement is rare. To calculate the CPI, TI borrows data from a range of other organizations. Many of these focus on banking, finances, business climate, and financial risks.4 In Baumann’s (2020) analysis, this means that the CPI is overwhelmingly informed by a business perspective, as if business interests and financial risks constituted the primary concerns in relation to good government. Why, asks Baumann (2020), do non-commercial non-governmental organizations (NGOs) also support the CPI? Because it is the least problematic proxy measure of good government available. It is not easy to get access to data about corruption, and only business organizations can afford to collect (relatively) good data. Exactly this emphasis on a financial view of the world may help explain why business-oriented corporations and organizations in fact trust the CPI and base their decisions on it. They may also bestow the CPI with perceived legitimacy and validity because people in their own network, who view the world in similar ways, do the same (Baumann, 2020). In turn, the transparency index will be coloured by the type of organization in a position to create such an index. In Baumann’s analysis, the world’s view of good government becomes co-terminous with a good climate for business.
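Although TI’s published methodology is more elaborate than anything shown here, the basic commensuration step described above, bringing heterogeneous source scores onto one scale before averaging them, can be sketched in a few lines. The following Python sketch is purely illustrative: the source names, scales, country names, and the minimum-source rule are assumptions made for this example, not TI’s actual sources or procedure.

```python
# Illustrative sketch of composite-index aggregation (NOT TI's actual procedure).
# Hypothetical sources and scales; invented country names; assumed inclusion rule.
SOURCES = {
    "risk_agency_a": {"scale": (0, 10), "scores": {"Aland": 8.1, "Bestan": 3.2}},
    "business_survey_b": {"scale": (1, 7), "scores": {"Aland": 6.0, "Bestan": 2.5}},
    "expert_panel_c": {"scale": (0, 100), "scores": {"Aland": 77.0, "Bestan": 35.0}},
}
MIN_SOURCES = 2  # assumed: a country needs at least two sources to be scored

def rescale(value, low, high):
    """Map a raw source score onto a common 0-100 scale (commensuration)."""
    return 100.0 * (value - low) / (high - low)

def composite_index(sources, min_sources=MIN_SOURCES):
    """Average each country's rescaled scores across the available sources."""
    per_country = {}
    for source in sources.values():
        low, high = source["scale"]
        for country, raw in source["scores"].items():
            per_country.setdefault(country, []).append(rescale(raw, low, high))
    return {
        country: sum(values) / len(values)
        for country, values in per_country.items()
        if len(values) >= min_sources
    }

print(composite_index(SOURCES))  # {'Aland': 80.44..., 'Bestan': 30.66...}
```

The sketch makes Baumann’s point tangible: whatever the composite ‘means’ is inherited from whichever organizations happen to supply the input scores.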

At the same time, the role of corporations in the maintenance of corruption is kept out of focus. Again, echoing the constitutive analytical categories introduced earlier, good government becomes defined in terms of a specific business-friendly conceptual ‘content’. Over time, the CPI may contribute to a ranking of countries whose ‘systemic effect’ is a privileged business perspective on governance, resulting from the use of business sources in the definition of governance.

EXAMPLE 3: UN’S SUSTAINABLE DEVELOPMENT GOALS

Another, perhaps even more conspicuous, attempt to define the future can be found in the UN’s Sustainable Development Goals (SDGs) adopted in 2015. There are 17 goals spread over policy areas such as health, education, climate, sanitation, infrastructure, and more. Potential constitutive effects coming out of the SDGs are too numerous to count. A few examples will suffice.

One interesting effect lies already in the selection of goals. Many observers notice how the SDGs have achieved a remarkable level of normative and rhetorical support from governments, organizations, institutions, and individuals. Politicians wear little badges symbolizing the goals. From a political science perspective, it is remarkable that justice and strong institutions count among the SDGs, but democracy does not. This omission may be explained by the fact that all the goals were filtered through the UN system, and not all member countries would accept democratization as a common goal. The goals would have to be accepted also by Russia, China, Syria, and many other countries. The institutional channels through which goals are set have an effect on the selection and formulation of goals and thus on their potential constitutive effects.

Are the SDGs representative of good and undebatable goals for all of humankind? Or are they simply political goals set by one political organization among others? In the first case, anyone who questions the quality of the goals, the omission of democracy, or the potentially problematic side effects of the SDGs will be a pariah. In the latter case, the SDGs and their consequences can be analyzed, discussed and challenged just like any other political proposal. This distinction is important. If the SDGs are unconditionally accepted as representing ‘the good’, then it is more likely that people will be blind to their potentially problematic constitutive effects.

Following the formal adoption of the goals, 231 indicators have been developed. An important aspect of these additional measurements has to do with how the goals are broken down into different levels of analysis such as countries and sectors. This is not trivial. Are the environmental effects of a given industry allocated to the locus of production or of consumption? If a nation hosts a fleet of ships and planes, are the effects of transportation attributed to that country, or to all countries, or is transportation not counted if it takes place at sea and in the air across national borders? Obviously, countries have an interest in breaking down the achievement of goals in particular ways rather than others. Reports highlighting country-specific scores on the SDGs are being produced. Unsurprisingly, they show that the global north has a better ‘performance’ on the SDGs than the rest of the world, but also that it has spill-over effects on low-income countries.5

The problems with operationalization do not go away as the SDGs are broken down into sectors, municipalities, companies, etc. In a Danish example, authorities, organizations, and foundations have come together in an attempt to define specific indicators as a sort of Danish version of the SDGs.6

The purpose is to inspire organizations, companies, and citizens to think of ways in which they can contribute to accomplishing some of the SDGs in their own part of the world. In a university class where the author of this chapter taught evaluation to graduate students in political science, the students were asked to identify potential constitutive effects of this Danish variant of the SDGs. The students identified the following problems.

First, the students found that many of the goals were in potential conflict with each other. An organization may contribute to one goal while perhaps sacrificing another (for example, if economic goals are in conflict with climate goals). This problem can lead to a very selective attribution of undeserved praise for some who score well on just one goal. The underlying idea is apparently to allow stakeholders to contribute where they can. But the Danish document consists merely of a list of indicators. It does not consider or problematize how to balance several goals.

Second, the students found that some of the operationalizations were remarkable and in blunt conflict with existing political goals. For example, in relation to goal number 1, the reduction of poverty, it is suggested to use ‘the proportion of the Danish population financially supported by the public sector’ (criterion 1.3.i). By implication, if the Danish government wants a high score on this (positive) criterion, it would have to fundamentally change existing policies, which usually aim at making as many citizens as possible able to sustain their own livelihood without public support. Conventionally, dependence on the public sector is seen as negative, or at least definitely not as a goal in itself. In the light of the SDGs, it is suddenly defined as a positive goal.

Third, some goals were operationalized in ways which would make it impossible to assess whether a change in the score is positive or negative. In relation to goal 5, gender equality, it is proposed to look at contacts with women’s crisis centres. It is unknown, however, whether it is good that women contact these centres because they trust that services will be available for them, or whether it is better that women do not experience any threats to their health and safety and therefore abstain from contact with these centres. Without any knowledge about the real need for such centres, it is difficult to make substantial sense of the number of contacts with these centres, and given the lack of a positive or negative anchor, any normative assessment of a change in the numbers may be free-floating.

An even more remarkable effect also occurs in relation to gender equality. In Denmark, an agency called Ligebehandlingsnævnet (‘The Council for Equal Treatment’) processes complaints from citizens about discrimination or unequal treatment, for example, in the labour market or in business or services. The proposed indicator of equal treatment is ‘the proportion of cases where the complaint is decided in favour of the plaintiff’, regardless of the substance of the case and regardless of who the plaintiff is. If all cases are decided in the positive, that would apparently indicate gender equality. Thus, if a new plaintiff files a counter-complaint to a case, the new counter-complaint will also have to be decided positively, substantially against the decision already made.
Using the terms introduced earlier, we can predict such intense problems with the commensuration of cases under the aegis of ‘The Council for Equal Treatment’ that it is likely that no fixation of this measure will take place. These examples only show what is commonly known in performance measurement: it is difficult to measure complicated things in simple ways without running into unfortunate or even absurd consequences.

To sum up this example, potential constitutive effects are uneven and have many roots. Some problems may relate to the institutional construction of goals, some to implementation, and some to very pedestrian problems with operationalization and measurement. It is perhaps promising for democracy that even the strong normative loading of the SDGs did not sedate critical reflection among the students, who took a critical look at the Danish version of the SDGs using ‘constitutive effects’ as a conceptual lens. In this example, many of the measures in the Danish version of the SDGs are, in terms of ‘content’, bluntly distorted and in direct conflict with existing political and normative ideals (such as fair trial). Therefore, the ‘systemic’ consequences, as well as the timing effects, may turn out to be very limited. On the other hand, in principle, measures with low validity (defined on the basis of conventional understandings) may also have constitutive effects (since they may redefine reality more strongly), so we should always remain empirically open in our studies of these effects. For example, it may happen that Denmark’s ‘isolated’ contribution to the SDGs will come to be illustrated in a very selective, if not illusory, way.

A CONCLUSION AND A REFLECTION ON THE CONCEPT OF CONSTITUTIVE EFFECTS

As we have seen, constitutive effects have many roots, and they unfold over time in different forms and shapes. A reflection about the pros and cons of the concept as such is called for. Compared to concepts such as ‘gaming’ or ‘unintended effects’, it is less judgmental and more open to various normative loadings. Whether a given constitutive effect is acceptable may depend on the views of different stakeholders and may be up for democratic discussion. The normative problem undergirding a given constitutive effect should neither be explained nor explained away with reference only to the subjective biases of the researcher or observer.

The many ways in which constitutive effects occur inspire an open and flexible use of the concept. We have suggested looking for constitutive effects under headlines such as ‘content’, ‘timing’, ‘relations’, and ‘systemic effects’. At the same time, the concept does not offer a very specific set of hypotheses about which effects occur under which controlled circumstances. Instead, it allows observers to be surprised. The term constitutive effects is best used as a sensitizing concept in empirical studies, pointing to how realities are created out of measurements. A study of constitutive effects is therefore different from a merely theoretical position which seeks to uncover the hidden, but given, ideological effects of a given measurement regime. At its best, the term constitutive effects may sensitize us to what we had not seen coming.

A final reflection: if we take this view of constitutive effects seriously, what are the implications for a more democratic form of governance? If the observations in this chapter are correct, then constitutive effects flow out of many aspects of measurement regimes (construction, implementation, operationalization, use, reactions to use, etc.). They also spread over time and place in complicated ways. For these reasons, it is difficult to achieve an overview of the constitutive effects produced by a given measurement as/of governance. They may be invisible to the architects of these measurement systems, as well as to many of their actors and participants. Each of them sees only, at best, a part of the picture. Because of these complexities, and the differences in time and place between the producers of measurement systems and the people who live with their constitutive effects, it is difficult to create the mechanisms for feedback and reflexivity needed to create a new responsibilization for these effects (Beck, 1997).

Many international organizations which produce measurement as/of governance do not have institutional features that allow for direct democratic participation and feedback. It is also difficult to translate feedback about constitutive effects into political action, because these effects are spread out among actors who may not see that they have anything in common. By splitting actors up into different segments (producers, objects, and individualized users of information), measurement regimes may make it more difficult to discern a sense of society and common destiny. Political processes may be too slow and too complex to catch up with the accelerated mechanisms in the digital, competitive, global world (Rosa, 2005). The political handling of these systems is also tempered by the fact that the contemporary individual in the modern world is already entangled in the production and use of a number of measures, indicators, and assessments not under the control of democratic authorities. Most of us use Tripadvisor and Google Scholar and many other measurement tools even if we question their validity. We are already implicated, and we know we are implicated. But knowing that we are implicated is not emancipatory in itself (Basevic, 2019).

Nevertheless, the concept of constitutive effects opens a new horizon for attention to and observation of measurement as/of governance. It may also focus attention on the institutional arrangements which make constitutive effects possible and open discussions about how to deal with these effects. Can we redesign institutions so that sensitivity to constitutive effects is increased? Can we create new feedback loops so that, for example, the OECD becomes more responsive to the schoolteachers and students who live with the ‘evaluation culture’ suggested by the OECD? What can we learn from experiments with deliberative forums which provide structured feedback into policy-making about large-scale measurement systems (such as national tests in schools) (Dahler-Larsen, 2023)? If we think of democracy not as a universal standard, but as a socio-historically situated practice, then one of the next important challenges is how to make the constitutive effects of measurement subject to a more reflexive and dynamic form of democratic governance (Rosanvallon, 2011). At the moment, the constitutive effects of measurement as/of governance seem to operate swiftly and dynamically, while the logics of democracy seem to be lagging behind.

NOTES

1. https://www.unesco.org/en/brief. Accessed 18 August 2023.
2. https://www.oecd.org/about/. Accessed 18 August 2023.
3. https://www.transparency.org/en/what-we-do. Accessed 18 August 2023.
4. https://images.transparencycdn.org/images/CPI2021_SourceDescriptionEN.pdf. Accessed 18 August 2023.
5. https://dashboards.sdgindex.org/chapters. Accessed 18 August 2023.
6. https://www.kl.dk/media/24970/vores-maal-rapport.pdf. Accessed 18 August 2023.


REFERENCES

Akselvoll, M., & Dahler-Larsen, P. (2021). Evaluation people and real people in home-school cooperation. In P. Dahler-Larsen (Ed.), A research agenda for evaluation (pp. 105–27). Edward Elgar.
Basevic, J. (2019). Knowing neoliberalism. Social Epistemology, 33(4), 380–92. doi:10.1080/02691728.2019.1638990.
Baumann, H. (2020). The corruption perception index and the political economy of governing at a distance. International Relations, 34(4), 504–23.
Beck, U. (1992). Risk society: Towards a new modernity. Sage.
Beck, U. (1997). Risikosamfundet: på vej mod en ny modernitet. Hans Reitzels Forlag.
Behn, R.D. (2001). Rethinking democratic accountability. The Brookings Institution.
Brest, P. (2012). A decade of outcome-oriented philanthropy. Stanford Social Innovation Review, 10(2), 42–7. doi:10.48558/K9H3-7Z08.
Castoriadis, C. (1997). The imaginary: Creation in the social-historical domain. In D.A. Curtis (Ed.), World in fragments: Writings on politics, society, psychoanalysis, and the imagination (pp. 3–18). Stanford University Press.
Cheney, G., McMillan, J.J., & Schwartzman, R. (1997). Should we buy the ‘student-as-consumer’ metaphor? The Montana Professor, 7(3), 8–11.
Courtney, M.E., Needell, B., & Wulczyn, F. (2004). Unintended consequences of the push for accountability: The case of national child welfare performance standards. Children and Youth Services Review, 26, 1141–54.
Dahler-Larsen, P. (2012). The evaluation society. Stanford University Press.
Dahler-Larsen, P. (2013). Constitutive effects of performance indicators – getting beyond unintended consequences. Public Management Review, 16(7), 969–86. http://www.tandfonline.com/doi/pdf/10.1080/14719037.2013.770058. Accessed 18 August 2023.
Dahler-Larsen, P. (2019). Quality: From Plato to performance. Palgrave.
Dahler-Larsen, P. (2022). Your brother’s gatekeeper: How effects of evaluation machineries are sometimes enhanced. In E. Forsberg, L. Geschwind, S. Levander, & W. Wermke (Eds.), Peer review in an era of evaluation (pp. 127–46). Palgrave.
Dahler-Larsen, P. (2023). Can we use deliberation to change evaluation systems? How an advisory group contributed to policy change. Evaluation, 29(2), 144–60. doi:10.1177/13563890231156955.
Dahler-Larsen, P., Sundby, A., & Boodhoo, A. (2020). Can occupational health and safety management systems address psychosocial risk factors? An empirical study. Safety Science, 130, 1–8. https://www.sciencedirect.com/science/article/abs/pii/S0925753520302757. Accessed 18 August 2023.
Dean, M. (2009). Governmentality: Power and rule in modern society. Sage.
Dewey, J. (1927). The public and its problems. Alan Swallow.
Easthope, G. (1974). A history of social research methods. Longman.
Espeland, W., & Sauder, M. (2007). Rankings and reactivity: How public measures recreate social worlds. American Journal of Sociology, 113(1), 1–40.
EVA. (2020). Elevplaner i Folkeskolen. https://www.eva.dk/grundskole/elevplaner-folkeskolen. Accessed 18 August 2023.
Grek, S. (2009). Governing by numbers: The PISA ‘effect’ in Europe. Journal of Education Policy, 24(1), 23–37. doi:10.1080/02680930802412669.
Johnson, S. (2006). The ghost map. Riverhead.
Kerpershoek, E., Groenleer, M., & de Bruijn, H. (2016). Unintended responses to performance management in Dutch hospital care: Bringing together the managerial and professional perspectives. Public Management Review, 18(3), 417–36. doi:10.1080/14719037.2014.985248.
Osterloh, M., & Frey, B. (2014). Academic rankings between the ‘republic of science’ and ‘new public management’. In A. Lanteri & J. Vromen (Eds.), The economics of economists: Institutional setting, individual incentives, and future prospects (pp. 77–103). Cambridge University Press. doi:10.1017/CBO9781139059145.005.
Porter, T.M. (1994). Making things quantitative. Science in Context, 7(3), 389–407.
Power, M. (1997). The audit society. Oxford University Press.
Radnor, Z. (2008). Hitting the target and missing the point? Developing an understanding of organizational gaming. In W. Van Dooren & S. Van de Walle (Eds.), Performance information in the public sector: How it is used (pp. 94–105). Palgrave Macmillan.
Roberts, J. (2017). Managing only with transparency: The strategic functions of ignorance. Critical Perspectives on Accounting, 55, 53–60.
Robertson, S. (2022). Guardians of the future: International organisations, anticipatory governance and education. Global Society, 36(2), 188–205.
Rosa, H. (2005). The speed of global flows and the pace of democratic politics. New Political Science, 27(4), 445–59. doi:10.1080/07393140500370907.
Rosanvallon, P. (2011). The metamorphoses of democratic legitimacy: Impartiality, reflexivity, proximity. Constellations, 18, 114–23.
Rose, N., & Miller, P. (1992). Political power beyond the state: Problematics of government. British Journal of Sociology, 43(2), 173–205.
Rothstein, H., Huber, M., & Gaskell, G. (2006). A theory of risk colonization: The spiralling regulatory logics of societal and institutional risk. Economy and Society, 35(1), 91–112.
Rottenburg, R.M., & Merry, S.E. (2015). A world of indicators: The making of governmental knowledge through quantification. In R.M. Rottenburg, S.E. Merry, S.-J. Park, & J. Mugler (Eds.), The world of indicators (pp. 1–33). Cambridge University Press.
Smismans, S. (2003). Towards a new community strategy on health and safety at work? Caught in the institutional web of soft procedures. International Journal of Comparative Labour Law and Industrial Relations, 19(1), 55–83.
Smith, P. (1995). On the unintended consequences of publishing performance data in the public sector. International Journal of Public Administration, 18(2&3), 277–310.
Stake, R.E. (2004). Standard-based and responsive evaluation. Sage.
Triantafillou, P. (2011). Metagovernance by numbers – technological lock-in of Australian and Danish employment policies? In J. Torfing & P. Triantafillou (Eds.), Interactive policymaking, metagovernance and democracy (pp. 149–66). ECPR Press.
Weingart, P. (2005). Impact of bibliometrics upon the science system: Inadvertent consequences? Scientometrics, 62(1), 117–31.

PART II THEORETICAL APPROACHES TO MEASURING GOVERNANCE

5. Theoretical approaches to measuring governance: public administration
Sorin Dan

INTRODUCTION

This chapter discusses several theoretical approaches to measuring governance found in the public administration (PA) literature. Considering the eclectic and interdisciplinary nature of the field of PA, and the many theories and models that have contributed to it over time (Ongaro & van Thiel, 2018; Peters & Pierre, 2014), the aim of this chapter is not to provide a comprehensive or chronological survey of the field’s many theoretical insights on and approaches to measuring governance. The goal is somewhat less ambitious; we prefer to err on the side of omission rather than on the side of generality. The chapter outlines and compares PA theory’s contributions to measuring governance with reference to three established trajectories1 of public administration and management reform, namely, the neo-Weberian state (NWS), the New Public Management (NPM) and the new public governance (NPG). There are several theoretical reasons why this is a sensible approach that can improve our understanding of measuring governance. First, the NWS, NPM and NPG, while not being the only models available, constitute three relatively established PA reform framings that are frequently used and compared in the PA literature (Bouckaert, 2022; Pollitt & Bouckaert, 2017; Torfing & Triantafillou, 2013; Torfing et al., 2020). Second, as we shall see in detail in this chapter, they cover a wide theoretical spectrum of PA theory, starting with the classical Weberian model, on which the NWS is built, all the way to the theories that fed into NPM and the NPG. In this way, by distilling their theoretical underpinnings, we can cover a relatively broad corpus of PA theory while being guided by and remaining focused on the concepts, assumptions and arguments of these three approaches. Third, as ideal types or visions of PA reform that have developed over time, they have also received considerable interest from PA practitioners, who may not label their reform efforts in any of these three ways but have turned their ideas into reform practices on the ground. Finally, yet importantly, all three models include several key assumptions and arguments about measuring governance. However, the texts that elaborate on the three models do not discuss in detail or delineate clearly the assumptions and arguments that relate specifically to measuring governance. For this reason, we set out to contribute to this theme by outlining each model’s main concepts, assumptions and arguments as they relate to governance measuring and by discussing their specific contribution to our current understanding of measuring governance.
While there are good reasons to discuss the contribution of PA theory to measuring governance in terms of the NWS, NPM and NPG, we acknowledge the limitations of this approach. In particular, in so doing we are bound to omit important PA insights on measurement that do not relate to any of the three models. Moreover, they are ideal types and visions of PA reform that exhibit important differences, yet they also share certain similarities resulting from mutual influences that have been developing over time due to processes of reform sedimentation and co-existence (Dan et al., 2022; Pollitt & Bouckaert, 2017; Torfing et al., 2020). Furthermore, the NWS, NPM and NPG are too often viewed as static, time-bound chronological trajectories. It is common to refer to developments in PA theory with reference to classical PA (of which the NWS is a current version), temporally leading to the NPM (as a reaction to the classical PA model), and then further on to public governance (of which the NPG is a notable representative). By discussing governance measuring in terms of ‘why measure’, ‘what to measure’ and ‘how to measure’ across the three models, this chapter seeks to address both the values, concepts, assumptions and arguments, on the one hand, and the practices, measures and tools of governance measuring, on the other hand.

THE NWS, NPM AND NPG: THEORETICAL ROOTS AND CONCEPTS

The Neo-Weberian State (NWS)

Articulated by Christopher Pollitt and Geert Bouckaert in the second edition of their influential book Public Management Reform (Pollitt & Bouckaert, 2004), and further polished in the subsequent two editions of the same book, the NWS is proposed as a synthesis of the best of several worlds – both the old and the new. It builds on the classical Weberian bureaucratic model of the state, but it recognizes its administrative deficiencies and proposes to rejuvenate it for our present times. The Weberian model of hierarchical and professional bureaucracy (Weber, 1947) represents the core of the NWS and, as a result, this approach derives its essence from established theoretical and historical roots. Max Weber’s ideal type model of efficient bureaucracy was characterized by Albrow (1970, pp. 43–5) as having:
● a clear hierarchy of offices with well-specified functions;
● a clear distinction between the resources of the organization and those of office holders;
● written documents as the basis of the administration, which is centred on the bureau as the central point of modern organization;
● technical, specialized spheres of competence;
● a career structure and promotion based on merit or seniority following evaluation from superiors;
● a unified control and disciplinary system applicable across the organization.
Observed with respect to PA reform trends in continental and Nordic Europe, the NWS is a commendable effort to preserve ‘a balance between state and society that ensures the legitimacy of administrative arrangements’ (Lynn, 2008, p. 23). Its aim is thus two-fold: to maintain and modernize the state to increase its capacity, acceptance, legitimacy and sustainability. A key question is how to maintain and modernize the state at the same time and, as we shall see later in this chapter, the why, what and how of governance measuring has important implications for this question, not only in the case of the NWS, but also for the other two theoretical approaches considered. In keeping with the classical Weberian model, the NWS aims to modernize the state by making it more professional, but at the same time it aspires to make it more efficient and, importantly, actively outward-focused, responsive and citizen-friendly (Pollitt & Bouckaert, 2017, p. 19). This involves reducing undesirable bureaucracy and lowering the administrative burden (Herd & Moynihan, 2018) by adopting a set of innovative mechanisms and instruments, including, for example, digital technologies for seamless and customized service delivery that would meet citizens where they are so that they become satisfied with the attitude and service delivery of state organizations and public service providers. Pollitt and Bouckaert (2017, p. 19) note that the NWS is ‘not a universal model, but one limited to particular kinds of state’. The NWS emphasizes the pivotal role of the state, rather than that of society or private actors, and is thus a state-centred model that fits well with the wealthy, welfare-state systems in Western and Nordic Europe where the state plays a comprehensive societal role and public spending is sizeable. Reflecting a state-centred or neo-Weberian theory of the state that cherishes both modernization and the preservation of professional value systems and the publicness of governance arrangements, the NWS draws upon the rich scholarship and theoretical developments in institutional theory, particularly the historical institutionalism stream of the new institutionalism (Hall & Taylor, 1996; Lynn, 2008; Pierson, 2004; Thelen, 1999). The main logic of action of historical institutionalism is path dependency, a mechanism that facilitates or constrains the range of policy options and administrative decision-making (Nakrošis et al., 2023; Peters, 1999; Van der Wal et al., 2021), which can be at odds with the modernizing, neo element of the NWS, leading to possible contradictions and tensions.

The New Public Management (NPM)

Two of the early influential scholarly treatments of NPM were Hood (1991), who coined the term, and Pollitt (1990, p. 1), who referred to NPM as ‘managerialism’, that is, the belief that ‘better management will prove an effective solvent for a wide range of economic and social ills’. While some scholars believe that NPM is only one manifestation of managerialism, both terms point to a similar range of ideas and approaches (Pollitt, 2016). Two main sets of ideas fed into NPM, leading to what Hood (1991, p. 5) called ‘a marriage of opposites’. A first stream (which we can refer to as an extreme version of NPM) stemmed from the new institutional economics and the work of public choice, transaction costs and principal-agent theorists, who led assaults on the established Western bureaucracies for being too large, overly procedural, inner-focused and slow to respond to external requests. A second, and more temperate, stream (soft NPM) consists of what Hood (1991, p. 5) called ‘business-type managerialism in the public sector’ (which is similar to Pollitt’s managerialism). This second stream built on the importance of distinct management expertise, which permeates the public sector and requires managerial discretion ‘to let the managers manage’ to improve organizational performance, public service quality and customer satisfaction (Aucoin, 2017; Hood, 1991). The emphasis in NPM is on market dynamics and cutting back on government spending, in the case of the extreme stream, and on professional management expertise and the role of managers and leaders in initiating and driving public sector change, in the temperate stream. This theoretical amalgamation created internal inconsistencies within NPM, which had an impact on the types and goals of reform measures that public sector reformers and managers pursued across the globe.
This also meant that in practice there was a variety of reform options to choose from, only some of which followed the original precepts of NPM. This led to the observation that almost every reform adopted in the public sector over the past decades was labelled as NPM, even without sharing its basic assumptions or while sharing only one or a limited number of them, rather than the whole NPM ‘package’. Despite this and the uneven application of NPM across time and space, there has been a tendency within PA scholarship to attribute to NPM much of what presently goes wrong within the public sector. For this reason, it is important to reiterate that Christopher Hood referred to seven different types of reform to denote the contents of NPM (Hood, 1991, p. 4).

The New Public Governance (NPG)

The NPG builds on the substantial cross-disciplinary network governance scholarship that emerged over decades of work on the role of networks as a governance mechanism complementing hierarchies and markets (Kickert et al., 1997; Klijn, 2008; Podolny & Page, 1998; Powell, 1990; Provan & Milward, 2001; Rhodes, 1997). Coined by Stephen Osborne (2010), the NPG emerged as a label reflecting a third wave of theoretical thinking on PA reform, as a reaction to NPM, in particular, but also to state-centred Weberian thinking (Pollitt & Bouckaert, 2017). The NPG mirrored a shift from the ‘intra-organizational’ focus of the private sector management of the NPM (Osborne, 2010, p. 4) towards governance processes that have a broader societal focus. The need for a society-centred model is justified with reference to the increased complexity of policy problems, which may be addressed by the active involvement of societal actors (public, private and non-profit) in policymaking and service delivery (Sørensen & Torfing, 2008). Osborne articulated the NPG with reference to public service delivery specifically and distinguished between service delivery within the open natural systems theory of the NPG, the open rational systems theory of the NPM and the closed system of the pre-NPM PA. NPG’s open natural system is tied to a ‘plural’ and ‘pluralist state’, which refer to ‘multiple interdependent actors’ contributing to service delivery and ‘multiple processes that inform the policymaking system’ (Osborne, 2010, p. 9). Unlike the NPM and the NWS, the NPG emphasizes inter-organizational relationships, partnership and network building and continual negotiation processes between the members of the networks and partnerships for consensus building (Klijn, 2008; Sørensen & Torfing, 2008; Torfing & Triantafillou, 2013). Because of this, it focuses on the ‘governance of processes’ (Osborne, 2010, p. 9), yet Osborne notes that its aspiration is to move beyond process to capture ‘service effectiveness and outcomes that rely upon the interaction between public service organizations with their environment’. Theoretically, several perspectives of a largely sociological and political science nature have influenced the NPG. They include sociological theories of organizations and institutions, for example, sociological institutionalism (Podolny & Page, 1998; Powell, 1990; Powell & DiMaggio, 1991), and political science and democratic theories of active citizenship, participation and empowerment (Hall & Taylor, 1996; Torfing & Triantafillou, 2013). Sørensen and Torfing (2008, p. 17) distinguish between four clusters of governance network theories, that is, interdependency theory, governmentality theory, governability theory and integration theory. Interdependency theory sees governance networks both as vehicles for and results of interest mediation between interdependent actors who have their own interests, resources and goals, but also have mutual resource dependencies.
According to governability theory, governance networks enable voluntary coordination between independent actors following game-based negotiation processes. Governmentality theory, by contrast, emphasizes the role of the state in facilitating and mobilizing the voluntary participation of societal actors in governance processes, and has a certain affinity with the research on meta-governance (Kooiman, 2003; Meuleman, 2008). Finally, integration theory views governance networks as a somewhat more institutionalized means and platform for voluntary exchange between stakeholders who are joined up by shared norms and values (Sørensen & Torfing, 2008). The emphasis in the NPG is on horizontal voluntary cooperation between stakeholders from across society who come together, for various reasons, or are mobilized for participation, and who bring their own information, knowledge and resources to jointly address complex policy problems. Thus, it differs in important ways from the ‘enlightened and professional hierarchies’ of the NWS and the market and management orientation of the NPM (Pollitt & Bouckaert, 2017, p. 23). This, however, does not exclude certain similarities and affinities that exist across the three perspectives (not to mention their practical application and integration in policy practice due to processes of sedimentation, layering and the simultaneous co-existence of reform ideas and practices). For example, the need for meta-governance in the NPG resonates with the state-centred model of the NWS. There is a need for professional management expertise, emphasized by NPM, in both the NWS and the NPG (e.g. for managing and measuring networks, performance, service quality, inputs, outputs as well as overall results). Moreover, business actors were elevated within NPM, but they certainly play a role also in the NWS (e.g. in public-private partnerships) and the NPG (e.g. for broadening participation to relevant societal actors). Thus, understanding the values, motivations and practices of private actors and the extent to which they differ from the publicness of public sector actors becomes important. By emphasizing certain values and characteristics, each perspective ignores or pays less attention to other valuable elements, creating the need for a more holistic synthesis. Table 5.1 summarizes the logic of action of the NWS, NPM and NPG.

Table 5.1  The NWS, NPM and NPG: main concepts and characteristics

NWS
Theoretical roots: State-centred theories of the state; institutional theory, especially historical institutionalism.
Nature of the state: Hierarchical and modernist Weberian bureaucracy.
Logic of action: Authority exercised through a modernized hierarchy of impartial and professional officials.
Emphasis: Policy creation and implementation.
Main values: Stability; predictability; continuity; probity; clear accountability lines; openness to modernization and innovation.
Roles for civil servants: Professional experts who implement laws and politicians’ decisions and follow established civil service norms and procedures.

NPM
Theoretical roots: New institutional economics, especially public choice, transaction costs and principal-agent theories; management studies.
Nature of the state: Managerial and entrepreneurial/reformist.
Logic of action: Management of organizational resources and performance through performance frameworks and incentives; quasi market-type mechanisms and/or managerial action.
Emphasis: Accountability for measurable results.
Main values: Customer orientation; competition; flexibility; instability/reforming; market behaviour; entrepreneurship.
Roles for civil servants: Managers working in government departments/agencies held to account; private sector consultants and entrepreneurs working for the government.

NPG
Theoretical roots: Sociological theories of organizations, especially sociological institutionalism; network theories; political science theories of democracy and stakeholder participation.
Nature of the state: Participatory and collaborative; bounded rationality of the state.
Logic of action: Negotiation of values, meaning and relationships within networks of interdependent stakeholders who provide information, knowledge and resources.
Emphasis: Openness to broad participation and contribution.
Main values: Voluntary cooperation; active citizenship; multiple accountabilities.
Roles for civil servants: Network managers/governors who seek, negotiate, lead partnerships and mobilize participation; boundary spanners who search for leverage and cross-sectoral synergies.

Source: Expanded based on Osborne (2010, p. 10); Pollitt and Bouckaert (2017, pp. 22, 120–27, 173); Torfing and Triantafillou (2013, p. 14).

THE NWS, NPM AND NPG: ARGUMENTS, ASSUMPTIONS AND IMPLICATIONS FOR GOVERNANCE MEASURING

The ‘Why’ of Measuring Governance

The question of why measure governance boils down to the nature of accountability envisaged in the three approaches. The measurement of governance processes and outcomes serves as a vehicle for account giving and for proving an organization’s or individual’s worth (Behn, 2001). While measuring governance is closely tied to public sector performance (Bevan & Hood, 2006; Hood, 2006; Ingraham et al., 2003; Van Dooren et al., 2015), and influenced by the NPM, it is not limited to either performance alone or to the understanding of performance under NPM. Yet measuring governance has received less emphasis in the NWS and NPG research so far, although these models have important implications for changes in the nature of accountability (Bouckaert, 2022; Klijn, 2008; Osborne, 2010; Rhodes, 1997; Sørensen & Torfing, 2008; Torfing & Triantafillou, 2013). If NPM emphasizes accountability for measurable results, the NWS centres on procedural accountability informed by professional standards and norms, while the NPG is characterized by ‘multiple forms of accountability based on a variety of standards attuned to organizational learning’ (Torfing & Triantafillou, 2013, p. 14).
The core argument of the NWS is to make government organizations more professional, efficient and responsive to citizens and businesses by modernizing the state apparatus (Pollitt & Bouckaert, 2017). Public sector organizations are to achieve this desideratum by using innovative methods and approaches, some of which may originate in the corporate world but preferably within the public sector itself, while preserving and fostering the publicness of a distinct public sector ethos (Bouckaert, 2022; Osborne, 2010). By contrast, the NPM seeks to make government more economical, efficient and effective by using private sector methods and dynamics and relying on managerial action (Hood, 1991). In the NPG, the core argument centres on making government organizations better informed, less exclusive and more legitimate by mobilizing and using information, knowledge and resources from interdependent stakeholders who may voluntarily engage in governance networks and collaborative arrangements and choose to contribute to policymaking and solving societal problems (Pollitt & Bouckaert, 2017; Sørensen & Torfing, 2008). Embedded in each model, however, is a set of assumptions about the nature of organizational and human behaviour, the drivers of performance and institutional change and the nature, meaning and formation of public value.
Within the NWS, the underlying assumption is that it is possible to both maintain and modernize the key features of the classical bureaucratic model. If authority is exercised through a modernized hierarchy of impartial and professional officials, the key question is how to modernize this established hierarchy so that public sector organizations and services become more responsive to societal developments and needs. In other words, what governance measures and tools should be used for modernization, who should use them and how should they be used? Professional roles and discretion play a key role in the NWS, affecting both the nature and extent of measuring tools that reflect professional standards and norms, yet the extent to which professionals are willing and able to act as both experts and managers efficiently and effectively is subject to interpretation (Alvehus, 2022). The professionalism of the NWS manager is by definition less managerial, as management experts feature less prominently in the NWS (or the NPG for that matter) than in the NPM. It can be described as a ‘particularly circumscribed kind of professionalism’ (Pollitt & Bouckaert, 2017, p. 176). This means that it is part of, and influenced by, the structures, processes and procedures of the bureaucratic model and the direct involvement of politicians. Politicians can add democratic accountability to professional action (Behn, 2001), but they can also intervene in it and politicize it, leading to negative effects on HRM practices, policy effectiveness and public legitimacy (Fuenzalida & Riccucci, 2019; Peters et al., 2022). Fundamental to the state-centred model of the NWS is also the assumption that the state has sufficient resources to fund and sustain a comprehensive role in society. This may be the case in Western and Nordic European welfare states (although demographic and financial prospects raise important financial sustainability questions even in these cases). However, it is difficult for the NWS model to take hold, over the short and medium term at least, in less affluent states that have lower levels of trust in government and insufficient capacity. This context specificity, which is also a feature of the NPM and NPG, challenges the transferability of ‘ready-made models’ to other contexts that differ in fundamental ways from the places where the models emerged.
Measuring is a key feature of the NPM, which can be credited for drawing attention to the importance of measuring governance.
However, there is no shortage of assumptions embedded in the NPM (Aucoin, 2017; Bezes, 2018; Hyndman & Lapsley, 2016). NPM’s emphasis on rationality and quantitative measures stems from the nature of the theories that fed into it. This is particularly the case for the ‘hard’ version of NPM that was influenced by new institutional economics, which is at odds with the ‘soft’ stream of management studies that allow for more horizontal, consensual approaches to management and measurement instead of command and control (Hood, 1991; Pollitt & Dan, 2013). Fundamental to the NPM’s view of measuring governance is the assumption that quantifying the inputs, activities, outputs and outcomes of governance is both possible and desirable. It uses a ‘production’, input-output model and applies it particularly to public service delivery (Pollitt & Dan, 2013), paying specific attention to minimizing the volume of inputs while at the same time maximizing the volume of outputs. It also assumes that the information gathered through measurement is accurate and reliable enough and, when used, acts as a valuable resource that helps managers and politicians to monitor, evaluate and improve the performance and accountability of public sector organizations. Moreover, while NPM promised flexibility and ‘lean’ management, there is evidence showing an increase in new forms of bureaucratic control (Bezes, 2018; Clarke & Newman, 1997). This is due to NPM’s emphasis on quantitative measurement and its elaborate performance measurement toolkit, especially when used in a command-and-control fashion, emphasizing accountability alone instead of both accountability and learning (Lewis & Triantafillou, 2012). NPM emphasized measurable outputs instead of broader and more qualitative societal and democratic outcomes (Christensen & Lægreid, 2022; Hood & Dixon, 2015; Pollitt & Dan, 2013). This calls into question the nature and meaning of performance emphasized in NPM, though the overall emphasis on measuring and achieving results is a welcome feature (if used properly) that can provide decision makers with required performance information.
The NPG challenges both the NWS and the NPM, which are seen as too statist, inflexible and organizationally focused (Osborne, 2010). The assumptions of both these approaches are too narrow and limited for the NPG, giving rise to a new approach that has direct implications for governance measuring. Theory and practice have documented a new set of assumptions embedded in the NPG. This view fundamentally assumes that various actors and organizations from across the different spheres of society and the economy are both willing and able to join forces, collaborate and share resources, often without a rational, self-interested motivation to address a specific problem. This also assumes that they have a stake in, or an important reason for, getting involved in solving that problem and that they are willing to do so even if other stakeholders may fail to do their part (Pollitt & Bouckaert, 2017). For the NPG to function in practice in an efficient and effective way (so that something ends up being done), consensus among various players either exists by default or can be cultivated. This, however, may be a difficult assumption to hold in practice, particularly in pluralist and complex societies (which are a feature of the NPG) where there are various, and oftentimes competing and entrenched, values and interests (Kapucu & Hu, 2020; Sørensen & Torfing, 2008). A further assumption relates to the roles and skills of governmental actors who, depending on the type of network arrangement, may need to take a network-governing role to seek, negotiate and lead partnerships and act as boundary spanners who search for leverage and cross-sectoral synergies (Pollitt & Bouckaert, 2017, p. 173). New skills, which few civil service systems may currently possess, need to be cultivated to this end. This also implies that government officials are willing to take on these roles and give up, at least in part, their established legal and official mandates.
Moreover, it is assumed that network meta-governors are able to manage the inherent power imbalances of network arrangements (Sørensen & Torfing, 2008; Torfing & Triantafillou, 2013).

The ‘What’ and ‘How’ of Measuring Governance

The NPG seeks to redefine the nature of performance that is typically used in NPM, which is seen as too concerned with instrumental elements and as overemphasizing measurable results (O’Flynn, 2007). The notion of public value is preferred instead to incorporate a variety of changing public sector values that may include measurable performance (economy, efficiency and effectiveness) but go beyond it to capture broader societal and democratic values (Faulkner & Kaufman, 2017; Moore, 1995; O’Flynn, 2007). In terms of what to measure about governance, the NPG is therefore significantly less concerned with ‘intra-organizational’ change and the performance of individual organizations. By contrast, it notes that governance measures need to capture inter-organizational relations and the effectiveness of networks and partnerships (Osborne, 2010). NPG thinking thus aspires towards a better understanding of how to design and employ measures that capture the public service system. Coupled with this, there is an interest in the NPG in value creation processes, which NPG scholarship claims to differ substantially from the performance orientation of NPM and the closed system of the classical public administration (Osborne, 2010, p. 10). From a service-oriented perspective, which is said to differ from both NPM’s production model and co-production theory, value is co-determined and co-created together with the service users (Osborne et al., 2021). This approach, however, goes beyond the NPG and is known as the public service logic, combining the publicness of public administration and management theory with the service orientation/service-dominant logic of service management and marketing theory (Osborne et al., 2015; Osborne et al., 2022). Interestingly enough, at least part of the theoretical origins of the public service logic come from business and management studies – a useful parallel with NPM. We can take this argument further by observing that this thinking is not completely different from that of NPM. While emphasizing explicit standards and measures of performance, NPM led to the development of customer-user satisfaction surveys and quality improvement schemes, such as Total Quality Management (TQM), the Common Assessment Framework (CAF) and SERVQUAL, that incorporated user feedback and perceptions of service quality (Lapuente & Van de Walle, 2020; Singh & Slack, 2022). These quality schemes were part of the NPM’s arsenal for opening up public service organizations to the dynamics of the market and the views and preferences of customer-users (Pollitt & Bouckaert, 2004). The customer-user orientation, though criticized for being misplaced (Diefenbach, 2009; Drechsler, 2005), aimed to make, and one may argue successfully so, public service organizations better attuned and responsive to the wishes of the service users (Dan & Pollitt, 2015; Lapuente & Van de Walle, 2020; Singh & Slack, 2022). Although the focus of the NPM has been on an instrumental view of performance, research that looked at reform trends over time has found that the measurement of public management reforms has also included additional themes, such as accountability, strategic growth and change (Johnsson et al., 2021). This calls into question the widespread view that the emphasis of public management reforms is on measuring costs and efficiency, and inputs and outputs, alone. Despite theoretical developments in public governance and public value theory, the notion and practice of measuring governance from an NPG perspective are still in their infancy (Pollitt & Bouckaert, 2017). The fluidity and flexibility of NPG arrangements make it difficult for policymakers and network organizers to operationalize and measure governance processes and outcomes (Osborne, 2010).
This is coupled with the inherent complexity of the meaning of public value and the various, multi-stakeholder influences that define it (Faulkner & Kaufman, 2017). This also calls into question the feasibility of using such a sophisticated and multi-faceted approach to measuring governance in practice, given data limitations and the resources, skills and meta-governing capacity required to implement and sustain such a measuring programme. Assuming that such a system can be put into place, there are questions related to the administrative burden that it would entail to build, run and maintain it. New forms of bureaucracy may develop that run against the very intent of the measuring system, that is, to promote flexibility, participation, collaboration, accountability and learning (Lewis & Triantafillou, 2012).
The modernizing ambition of the NWS requires measuring citizen satisfaction, service quality and trust in national, regional and local institutions, while controlling for escalating costs to pursue savings and efficiency gains through new tools and methods, such as public sector innovations and digital technologies (De Vries et al., 2015). From this perspective, the ‘what’ of governance measuring resembles to a good extent what is being measured in an NPM approach. While not clearly formulated in the NWS literature, professional guidelines, norms and quality standards (Alvehus, 2022; Noordegraaf, 2015) should play a prominent role in the type of indicators used, particularly with regard to highly professional services, such as education, healthcare, employment or social care. The more administrative state services, however, are likely to be less influenced by professional norms and, in the NWS model, are likely to be characterized by a high degree of formalism and proceduralism, reflecting Weberian principles. NPM theory emphasized a more dominant, command-and-control approach to measurement characterized by a detailed specification of performance indicators and targets used for the purpose of accountability. The NWS still uses performance measurement, possibly framed in a broader public value framework to reflect professional standards and norms, but the way in which it is used is expected to be different from the NPM approach (Pollitt & Bouckaert, 2017). The degree of target specification is more general and the purpose should include an emphasis on learning in addition to accountability (though performance measurement in NPM can also be used for learning and in a less top-down manner). Horizontal processes of negotiation and consensus-building between professionals, civil servants and politicians are likely to feed into the type and goals of measurement. A consensual process of negotiation between stakeholders and citizens characterizes public value measuring in the NPG, the emphasis being primarily on exchange, experience sharing and learning, and to a lesser extent on vertical and horizontal accountability. Table 5.2 summarizes the core arguments, assumptions and implications for governance measuring across the three theoretical approaches.

Table 5.2  The NWS, NPM and NPG: argument, assumptions and implications for governance measuring

NWS
Argument: To make government more professional, efficient and responsive by modernizing the state apparatus while maintaining a distinct public sector ethos.
Assumptions: Maintain + modernize; high levels of trust in government; sizeable taxation and public spending levels; circumscribed role for civil servants.
Implications for measuring: Measures that reflect professional standards, guidelines and norms and/or formal and procedural elements; consensual approach to creating and using measures, plus possible top-down, formal approach; moderately detailed specification of public value/performance measures/targets.

NPM
Argument: To make government more economical, efficient and effective by using private sector methods and managerial action.
Assumptions: Operationalization and measurement of performance is possible and desirable; performance information is relevant, accurate, reliable and useful; limited unintended consequences: cheating, gaming, new forms of bureaucracy.
Implications for measuring: Focus on organizational change and performance; instrumental nature of performance, plus customer-user orientation and satisfaction; emphasis on measurable inputs and outputs; detailed specification of measures/targets; command-and-control approach (hard version), possibly with a best-practice, learning orientation (soft version).

NPG
Argument: To make government better informed, less exclusive and more flexible and legitimate by mobilizing and obtaining input from interdependent actors.
Assumptions: Diverse actors are willing and able to share resources, participate and collaborate; consensus exists or can be reached and sustained efficiently and effectively; existing government skills, willingness and capacities for new roles: meta-governance, network managers and boundary spanners; possibility to operationalize, measure and practically use ecosystem/relational public value; limited negative consequences: process/outcome capture by powerful actors/groups, new forms of bureaucracy, cost-efficiency and sustainability.
Implications for measuring: Broad view of performance, reframed as multi-faceted public value; co-creation of public value based on multi-stakeholder input; emphasis on inter-organizational/ecosystem relations; public value measures derived from consensus and pluralistic views, used for experience sharing and learning.

Source: Expanded based on Pollitt and Bouckaert (2017, pp. 22, 120–27, 173); Torfing and Triantafillou (2013, p. 14).

THE CONTRIBUTION OF THE NWS, NPM AND NPG TO GOVERNANCE MEASURING

The NWS, NPM and NPG view governance measuring through different lenses. While fundamentally different, they share certain common goals, methods and tools, considering that none of the three visions represents a unique model (Pollitt & Bouckaert, 2017). Within public administration and management, the NPM, for better or for worse, elevated measuring like no other approach. Its contribution lies in the specificity and quantification of goals, inputs, implementation processes and outputs. It draws attention to the requirement to set up, monitor and evaluate explicit standards and measures of results as a method for improving performance and accountability (Hood, 1991). Though not exclusively, its emphasis is on what can be easily measured, that is, ‘output controls’ (Hood, 1991, p. 5), as well as on monetary inputs (savings). Efficiency and effectiveness, though important to the NPM, proved to be more difficult to measure and assess accurately (Hood & Dixon, 2015; Pollitt & Dan, 2013). In addition to effectiveness, other outcome measures, especially service quality and user satisfaction, featured prominently within NPM (Johnsson et al., 2021; Lapuente & Van de Walle, 2020; Singh & Slack, 2022). Moreover, a strength of NPM has been its focus on unpacking the public policy implementation process (Osborne, 2010, p. 5). It also contributed to documenting and measuring organizational change processes and the outcomes of the policy process, and to opening up the black box of policy implementation (Osborne, 2010, p. 5). Despite its relatively narrow approach to governance measuring, which the PA literature has extensively covered (Bevan & Hood, 2006; Christensen & Lægreid, 2022; Osborne et al., 2015; Pollitt & Bouckaert, 2017), the NPM reoriented organizational behaviour and action outwards (instead of inwards, as in the classical bureaucratic model) towards the end users of the policy process and service delivery. Though end users were viewed narrowly as customers or service users (instead of citizens more broadly), the NPM promoted the use of generic quality improvement schemes, such as TQM, that reflected and incorporated user feedback (Pollitt, 2003). This basic reorientation represented a major innovation when it emerged, and it is an important contribution of the NPM. It involved (and in some places it still does) a major shift in bureaucratic behaviour, and there are reasons to credit NPM for it. Depending on which version of NPM we consider, governance measuring under an NPM regime either followed a command-and-control approach centred on vertical accountability (in the case of the hard version) or a softer approach that aimed to improve organizational learning in addition to accountability.
In its attempt to modernize the state apparatus, the NWS draws attention to the important role of distinct public service values that include, but go beyond, instrumental measures (Bouckaert, 2022; Lynn, 2008; Pollitt & Bouckaert, 2017). It aspires to both maintain and foster this distinct ethos by listening to the voice of the professions. In an NWS perspective, it is not the public managers who should design and decide on the details of political visions, but professionals who are accustomed to traditional values, norms and standards and operate in culturally appropriate ways. Governance measuring, then, is influenced by this approach, and the indicators and measures that are used within it reflect professional guidelines and standards. This, however, does not mean that instrumental performance measures related to economies and efficiencies have no place within the NWS. Increasingly, as public budgets are under pressure, particularly in turbulent times, it has become evident that there is a need to find an acceptable balance between costs and professional value. Public sector cost-cutting is not an end in itself within the NWS, but practice has shown that it is often required. The approach to cost-cutting in the NWS is, however, more consensual than in NPM and must include the voice of the professions, who typically oppose it (Pollitt & Bouckaert, 2017). The NWS has also contributed to governance measuring by formalizing measures that reflect not only professional standards, but also the features of the classical public administration. This involves procedural aspects and due diligence, which may provide both stability and inflexibility to organizational arrangements and public service delivery. As the NWS approach to measuring is by definition less managerial than in NPM, it is expected, by comparison with NPM, that the governance measures and targets are more moderately specified and detailed in the NWS.
They are also used in a consensual, bottom-up way, but this does not exclude the command and control that may be required if no consensus can be reached, for example, under conditionality constraints to address fiscal imbalances that jeopardize the state’s ability to pay its debt. Thus, while the NWS is contributing to a professionalization of governance measuring, there is an inherent and persistent tension between expanding the role of the state and ensuring its sustainability.
The NPG provides us with another lens for governance measuring. The focus in the NPG is on process and dialogue and less on the results of governance. It values the active engagement of multiple stakeholders: the more actors are engaged the better, as the process of participation is assumed to contribute to democratic values and effective policymaking and service delivery. It thus seeks to mobilize private resources and ideas and promote active citizenship to improve the legitimacy and accountability of government action (Torfing & Triantafillou, 2013). On this basis, the NPG’s contribution to governance measuring is potentially manifold. It broadens the notion of performance and emphasizes a multi-dimensional conception of public value (Faulkner & Kaufman, 2017). Public value is not only professionally determined, though professions play a role, but based on the active say of citizens and service users who are supposedly co-determining and co-creating it (Osborne et al., 2021). Service effectiveness, in particular, relies on the interaction between service providers and their environment (especially the service users) (Osborne, 2010). In these ways, despite the central role that negotiation processes play within the NPG, the NPG aspires to move beyond process to contribute to outcome formation. Thus, there is potential for the NPG to contribute to outcome measures that reflect a complex view of public value (although this is inherently more difficult to do in practice compared to NPM and the NWS). Performance management takes on new forms to include the evaluation of networks, the co-production of public services, and multi-organizational and multi-faceted performance. The emphasis is on negotiation and consensus-seeking around shared targets and public value standards. The measures are less detailed and specific, emphasizing steering and collaboration instead of vertical coordination (Pollitt & Bouckaert, 2017). Another contribution to governance measuring stems from the NPG’s emphasis on shared problem-solving, capacity building for active citizenship and processes of self-regulation and state-citizen partnership. As the NPG highlights democratic accountability, there is a need to develop multiple forms and measures of accountability (Torfing & Triantafillou, 2013) that are useful and relevant, that is, detailed and specific enough yet inviting and non-controlling (Pollitt & Bouckaert, 2017). There is an inherent tension within the NPG between the need to develop multiple forms of accountability and the coordination capacity that is required to govern these complex and dynamic processes. There is also the risk of over-proceduralism, emanating from the process of building consensus, which may limit the ability of actors to take concrete action and achieve a certain result that goes beyond the process itself. In this sense, the NPG parts ways with the NPM’s focus on accountability for results.

CONCLUSION: THE STATUS TODAY AND THE PROSPECTS FOR PA THEORY ON GOVERNANCE MEASURING

In view of the interdisciplinary nature of PA theory and its significant interest in governance measuring that goes back to the origins of the discipline (Norris, 2011), a wealth of measuring concepts and approaches have emanated from PA. This chapter has not surveyed all of these ideas, but has instead outlined and compared three main approaches to measuring that have been particularly influential in both PA theory and practice, that is, the NWS, NPM and NPG. As visions of PA that structure reform programmes, they are ideal types that find only a partial application in practice (Pollitt & Bouckaert, 2017). Instead of a neat application, they are best thought of as sets of intertwined and/or competing concepts, practices, tools and measures that co-exist and are layered upon each other (Dan et al., 2022; Pollitt & Bouckaert, 2017; Torfing et al., 2020). Chronologically, PA has moved from traditional administration to NPM and the NPG, yet this trend obscures the sedimentation and simultaneous co-existence of these ideas. The NWS, for example, does not equal the classical Weberian model, but has instead been observed as a present-day trajectory of PA reform (Pollitt & Bouckaert, 2017). NPM ideas and practices are also relevant today (Bezes, 2018; Hyndman & Lapsley, 2016; Lapuente & Van de Walle, 2020). Thus, though the NPG and its many concepts and approaches have become increasingly popular in PA theory, it is an overstatement to claim that PA as a discipline currently ‘lives and breathes’ the NPG (or NPM and the NWS). Yet does the approach to measuring that the NPG proposes represent the future of governance measuring within PA? It is unlikely to be the only perspective; the search, layering and co-existence are expected to continue, as they have in the past. That said, the NPG offers valuable insights into measuring that capitalize on the characteristics of present-day societies and democracies: complexity, pluralism, uncertainty and turbulence, and a search for meaning and value. This leads to an interest in finding new ways of governing that reflect these characteristics, which we expect will continue to grow in the future. The key challenge for PA scholars is to make sense of the complexity of governance measuring in a way that advances both the discipline’s scholarship and its practical relevance and applicability, given today’s constraints of policymaking and service delivery.

NOTE

1. The public administration and management literature uses different terminology to refer to the NWS, NPM and NPG, such as visions, approaches, models, trajectories, agendas, movements, etc. Throughout this chapter we use such terms interchangeably.

REFERENCES

Albrow, M. (1970). Bureaucracy: Key concepts in political science series. Macmillan.
Alvehus, J. (2022). The logic of professionalism: Work and management in professional service organizations. The Policy Press.
Aucoin, P. (2017). New Public Management and new public governance: Finding the balance. In D. Siegel & K. Rasmussen (Eds.), Professionalism and public service: Essays in honour of Kenneth Kernaghan (pp. 16–33). Toronto University Press.
Behn, R.D. (2001). Rethinking democratic accountability. Brookings Institution Press.
Bevan, G., & Hood, C. (2006). What’s measured is what matters: Targets and gaming in the English public healthcare system. Public Administration, 84(3), 517–38.
Bezes, P. (2018). Exploring the legacies of New Public Management in Europe. In E. Ongaro & S. Van Thiel (Eds.), The Palgrave handbook of public administration and management in Europe (pp. 919–66). Palgrave Macmillan.
Bouckaert, G. (2022). The neo-Weberian state: From ideal-type model to reality? Institute for Innovation and Public Purpose, University College London. Working paper.
Christensen, T., & Lægreid, P. (2022). Taking stock: New Public Management (NPM) and post-NPM reforms – trends and challenges. In A. Ladner, F. Sager, & A. Bastiansen (Eds.), Handbook on the politics of public administration: Public management – a new paradigm (pp. 38–49). Edward Elgar.
Clarke, J., & Newman, J. (1997). The managerial state. Sage.
Dan, S., Špaček, D., & Lægreid, P. (2022). The New Public Management: Dead or still alive and co-existing? State of play at 40+. Special issue in Public Management Review.
De Vries, H., Bekkers, V., & Tummers, L. (2015). Innovation in the public sector: A systematic review and future research agenda. Public Administration, 94(1), 146–66.
Diefenbach, T. (2009). New public management in public sector organizations: The dark sides of managerialistic ‘enlightenment’. Public Administration, 87(4), 892–909.
Drechsler, W. (2005). The rise and demise of the new public management. Post-Autistic Economics Review, 33(14), 17–28.
Faulkner, N., & Kaufman, S. (2017). Avoiding theoretical stagnation: A systematic review and framework for measuring public value. Australian Journal of Public Administration, 77(1), 69–86.
Fuenzalida, J., & Riccucci, N.M. (2019). The effects of politicization on performance: The mediating role of HRM practices. Review of Public Personnel Administration, 39(4), 544–69.
Hall, P.A., & Taylor, R. (1996). Political science and the three new institutionalisms. Political Studies, 44, 936–57.
Herd, P., & Moynihan, D.P. (2018). Administrative burden: Policymaking by other means. Russell Sage Foundation.
Hood, C. (1991). A public management for all seasons? Public Administration, 69, 3–19.
Hood, C. (2006). Gaming in targetworld: The targets approach to managing British public services. Public Administration Review, 66(4), 515–21.
Hood, C., & Dixon, R. (2015). A government that worked better and cost less? Evaluating three decades of reform and change in UK central government. Oxford University Press.
Hyndman, N., & Lapsley, I. (2016). New Public Management: The story continues. Financial Accountability & Management, 32(4), 385–408.
Ingraham, P.W., Joyce, P.G., & Donahue, A.K. (2003). Government performance: Why management matters. Johns Hopkins University Press.
Johnsson, M.C., Pepper, M., Milani Price, O., & Richardson, L.P. (2021). ‘Measuring up’: A systematic literature review of performance measurement in Australia and New Zealand local government. Qualitative Research in Accounting & Management, 18(2), 195–227.
Kapucu, N., & Hu, Q. (2020). Network governance: Concepts, theories and applications. Routledge.
Kickert, W.J.M., Klijn, E.-H., & Koppenjan, J.F.M. (Eds.) (1997). Managing complex networks: Strategies for the public sector. Sage.
Klijn, E.-H. (2008). Governance and governance networks in Europe: An assessment of ten years of research on the theme. Public Management Review, 10(4), 505–25.
Kooiman, J. (2003). Governing as governance. Sage.
Lapuente, V., & Van de Walle, S. (2020). The effects of new public management on the quality of public services. Governance, 33(3), 461–75.
Lewis, J.M., & Triantafillou, P. (2012). From performance measurement to learning: A new source of government overload? International Review of Administrative Sciences, 78(4), 597–614.
Lynn Jr, L.E. (2008). What is a neo-Weberian state? Reflections on a concept and its implications. The NISPAcee Journal of Public Administration and Policy, 1(2), 17–30.
Meuleman, L. (2008). Public management and the metagovernance of hierarchies, networks and markets. Physica-Verlag.
Moore, M.H. (1995). Creating public value: Strategic management in government. Harvard University Press.
Nakrošis, V., Dan, S., & Goštautaitė, R. (2023). The role of EU funding in EU member states: Building administrative capacity to advance administrative reforms. International Journal of Public Sector Management, 36(1), 1–19.
Noordegraaf, M. (2015). Public management: Performance, professionalism and politics. Bloomsbury Academic.
Norris, P. (2011). Measuring governance. In M. Bevir (Ed.), The Sage handbook of governance (pp. 179–99). Sage.
O’Flynn, J. (2007). From New Public Management to public value: Paradigmatic change and managerial implications. Australian Journal of Public Administration, 66(3), 353–66.
Ongaro, E., & van Thiel, S. (Eds.) (2018). The Palgrave handbook of public administration and management in Europe. Palgrave Macmillan.
Osborne, S.P. (Ed.) (2010). The new public governance: Emerging perspectives on the theory and practice of public governance. Routledge.
Osborne, S.P., Radnor, Z., Kinder, T., & Vidal, I. (2015). The SERVICE framework: A public-service-dominant approach to sustainable public services. British Journal of Management, 26, 424–38.
Osborne, S.P., Nasi, G., & Powell, M. (2021). Beyond co-production: Value creation and public services. Public Administration, 99(4), 641–57.
Osborne, S.P., Powell, M., Cui, T., & Strokosch, K. (2022). Value creation in the public service ecosystem: An integrative framework. Public Administration Review, 82(4), 634–45.
Peters, B.G. (1999). Institutional theory in political science: The ‘new institutionalism’. Continuum.
Peters, B.G., & Pierre, J. (Eds.) (2014). The Sage handbook of public administration (2nd ed.). Sage.
Peters, B.G., Danaeefard, H., Torshab, A.A., Mostafazadeh, M., & Hashemi, M. (2022). Consequences of a politicized public service system: Perspectives of politicians, public servants, and political experts. Politics & Policy, 50, 33–58.
Pierson, P. (2004). Politics in time. Princeton University Press.
Podolny, J.M., & Page, K.L. (1998). Network forms of organization. Annual Review of Sociology, 24, 57–76.
Pollitt, C. (1990). Managerialism and the public services: The Anglo-American experience. Blackwell.
Pollitt, C. (2003). The essential public manager. Open University Press/McGraw Hill.
Pollitt, C. (2016). Managerialism redux? Financial Accountability & Management, 32(4), 429–47.
Pollitt, C., & Bouckaert, G. (2004). Public management reform: A comparative analysis (2nd ed.). Oxford University Press.
Pollitt, C., & Bouckaert, G. (2017). Public management reform: A comparative analysis – into the age of austerity (4th ed.). Oxford University Press.
Pollitt, C., & Dan, S. (2013). Searching for impacts in performance-oriented management reform: A review of the European literature. Public Performance & Management Review, 37(1), 7–32.
Powell, W.W. (1990). Neither market nor hierarchy: Network forms of organization. Research in Organizational Behavior, 12, 295–336.
Powell, W.W., & DiMaggio, P. (Eds.) (1991). The new institutionalism in organizational analysis. University of Chicago Press.
Provan, K.G., & Milward, H.B. (2001). Do networks really work? A framework for evaluating public-sector organizational networks. Public Administration Review, 61, 414–23.
Rhodes, R.A.W. (1997). Understanding governance: Policy networks, governance, reflexivity and accountability. Open University Press.
Singh, G., & Slack, N.J. (2022). New Public Management and customer perceptions of service quality – a mixed-methods study. International Journal of Public Administration, 45(3), 242–56.
Sørensen, E., & Torfing, J. (Eds.) (2008). Theories of democratic network governance. Palgrave Macmillan.
Thelen, K. (1999). Historical institutionalism and comparative politics. Annual Review of Political Science, 2, 369–404.
Torfing, J., & Triantafillou, P. (2013). What’s in a name? Grasping new public governance as a political-administrative system. International Review of Public Administration, 18(2), 9–25.
Torfing, J., Andersen, L.B., Greve, C., & Klausen, K.K. (2020). Public governance paradigms: Competing and co-existing. Edward Elgar.
Van der Wal, Z., Mussagulova, A., & Chen, C.-A. (2021). Path-dependent public servants: Comparing the influence of traditions on administrative behavior in developing Asia. Public Administration Review, 81, 308–20.
Van Dooren, W., Bouckaert, G., & Halligan, J. (2015). Performance management in the public sector (2nd ed.). Routledge.
Weber, M. (1947). The theory of social and economic organisation (A.M. Henderson & T. Parsons, Trans.). Free Press.

6. Measuring governance: a political science perspective
B. Guy Peters

Governance has become an important body of literature within political science (see, e.g., Ansell and Torfing, 2022; Pierre and Peters, 2020). This literature is important, but also somewhat internally contradictory, and presents several different conceptions of what governance is, and should be. The conventional model of governance in political science has been that governance is supplied through the State, and by governments. Even when contested by other theoretical approaches, this étatiste conception of governance remains important in political science, as well as in the "real world" of government itself (see Bell and Hindmoor, 2010; Capano et al., 2022, p. 9). Individuals elected to public office, and the civil servants who support them, continue to think they are in the governance business.

The antithesis of the étatiste conception of government in political science has been the idea of "governance without government" (Börzel and Risse, 2010; Rhodes, 1996). This attack on statist conceptions of governing became a major driver in the interest that the discipline has displayed in governance since the late 1990s. The argument here is that governments are bureaucratic, clumsy, slow, and, even in representative democracies, undemocratic. Network governance arrangements, centered on the activities of social actors, are argued to be more effective in governing, as well as offering a more representative and more continuous form of democracy (Sørensen and Torfing, 2007).

The "governance without government" approach has been more popular among academics in the smaller European democracies with a history of significant interest group involvement with government, for example, through corporatist arrangements. These political systems have tended to allow subnational governments and/or social actors a great deal of influence in shaping and implementing policy. That said, the approach gained its greatest prominence when British academics began to notice that Westminster and Whitehall did not, in fact, control everything that happened in the name of government (see Rhodes, 1996), as they had not for decades.

Given those internal contradictions in how governance is conceptualized, developing any means for measuring governance becomes more difficult, and appears to involve selecting one conception or another of governance. How do we steer between the Scylla of an excessive concern with the formalized State, and the Charybdis of loosely structured networks? Some scholars (e.g., Kooiman, 2003) have expanded governance beyond this dichotomy to simply mean activities directed at solving collective problems. While that is inclusive, its openness makes identifying actors and processes perhaps too open-ended, and can make measurement equally vague. The measurement of governance is further complicated by internal differences within the more statist conception of governance, with democratic and authoritarian systems having different practices, and conceptions, of governance (Demmers et al., 2004). That said, I will argue that developing a generic approach to governance is not impossible and, more than just possible, it is necessary if governance is to be a useful concept for political science. Without a possibility of measurement, the concept of governance can be little more than a descriptive term, rather than being a viable approach for understanding politics and governing.

In this chapter I will begin by examining the existing literature on measuring governance and discussing the strengths and weaknesses of the various approaches to measurement, and their links to conceptions of governance. I will then develop a generic conception of governance, and then consider how it can be measured. I will be arguing that we need to have such a generic conception of governance, and its associated measures, to move the study of governance forward and more into the mainstream of political science. Once I have developed that generic model, I can add various elements to it from the two contending approaches to be able to address the particular arguments contained within each.

In this discussion of measurement for governance, I will be arguing that any viable measure of governance must be useful for comparative research. In Sartori's (1970) terms, the measure must be able to travel, and to travel widely. The concern with the applicability of governance measures in a variety of settings may produce measures that are extremely general. Students of governance may therefore need to add adjectives to the main concept of governance in order to produce more precise measures, much as Collier and Levitsky (1997) had to do when confronted with the task of conceptualizing and measuring democracy. The first task, however, is being able to identify measures that can provide at least ordinal assessments of the governance of different political systems, regardless of the type of institutions involved in governing.

MEASURING GOVERNANCE: WHERE ARE WE NOW?

Among the several disjunctures in the governance literature, there is something of a contradiction between the approaches mentioned above and what has become the conventional means of measuring governance. The two dominant political science approaches mentioned above are very political, focusing on the ways in which decisions are made within political systems, and the extent to which participatory institutions such as networks have replaced (so it is argued) the formal institutions of the State as the locus of that decision-making. These political conceptions of governance contrast rather sharply with much of the existing literature on measuring governance.

Corruption and Human Rights

That existing literature on measuring governance, and especially on measuring good governance, is more concerned with the bureaucracy and its capacity to implement the rule of law in a fair and non-corrupt manner (Kaufmann et al., 2007). Much of this literature has been sponsored by the World Bank and other international organizations (see Oman and Arndt, 2010) which, after years of focusing almost entirely on market reforms in developing countries, had the revelation that markets will not function properly without effective states that can provide the legal framework within which the markets can function. Thus, much of the existing literature on measuring governance focuses on issues such as corruption, the rule of law, the ability of businesses to get licensed in an efficient manner, and the probity of legal institutions. In this way, the measurement of governance and of "good governance" have become very much the same (see Peters et al., 2022, chapter 6). Also, it should be noted that these measures of governance do focus on the role of the State in governing, and tend to ignore informal actors and processes.

Another set of measures of governance also focuses on good governance, but with rather different ideas about what constitutes good governance. The United Nations Development Programme has created a useful compilation of various indices of governance, with an emphasis on human rights and openness of government (UNDP, n.d.).1 These measures of governance, or again good governance, are not insignificant by any means, but they focus on just one aspect, or a few aspects, of the governance question. The public do not want to be governed by people who take bribes from contractors, or even other countries, and they do not want to have to pay bribes to public officials in order to obtain their necessary services from government. The public do want their human rights respected. But governance is about more, and therefore needs to be measured more broadly.

The Quality of Democracy

A second body of literature dealing with the measurement of governance focuses on the quality of democracy, and the extent to which human rights, transparency and other liberal democratic values are being upheld. This is much more a classical political science concern than the first measures of governance. Pippa Norris (2011) has supplied a useful summary of the state of that literature in the early 21st century, and also points to the possibilities of "democratic backsliding," and the erosion of many of those democratic values under populist regimes. Since that time, the V-Dem Project at the University of Gothenburg has gathered and analyzed a great deal of information about the changing natures of democracy.

This concern with democratic governance is especially appropriate for political science students of governance, given the emphasis of the discipline in most countries on democracy, and the quality of democracy. However, while we may value liberal democratic values very highly, they are perhaps also too narrow to provide a measure of governance per se. Much of the world is currently governed by non-democratic regimes, and any measure of governance that we would want to use for comparative purposes must take into account what happens in those regimes. We may want to measure the extent to which political systems do conform to democratic standards, as organizations such as Freedom House and DEMOS do, but that does not measure governance, nor indeed does it really measure democratic governance.

Governance Capacity

A third body of existing literature on measuring governance comes closer to a comprehensive measure. This literature focuses on the capacity to govern, and the effectiveness of governing. The capacity literature does of course measure just that – it considers whether a political system has the resources (human, financial, administrative, legal) to govern, and generally does not ask whether this potential is actually used (Hall, 2002). What the capacity literature does is to assess the capacity of the State "… to implement logistically political decisions throughout the realm" (Mann, 1984, p. 189). High-capacity systems, for example, the United States, do not always convert that capacity into effective governance action. American government has access to a large economy, a capable public bureaucracy, and general respect for the rule of law, and should be able to govern effectively. Some of the literature (more in public administration and public policy than in what has become "mainstream" political science) does attempt to measure just how effective governance is in producing desired outcomes (see, e.g., Grindle, 2011).2 Much of this literature has focused, quite rightly, on the capacities of public bureaucracies and their ability to implement public policies (Hanson and Sigman, 2021; Rauch and Evans, 2000). Greater consideration of this is located in the public administration chapter of this volume (see Chapter 5 by Sorin Dan).

There are several existing measurements of governance capacity. The most extensive of these is the Bertelsmann Stiftung's Sustainable Governance Indicators. That index does contain some measures of executive capacity, but also contains measures of the quality of policy choices of governments (from the viewpoint of sustainability) and the level of democracy. Further, it is largely based on expert surveys rather than on the actual performance of governments. Thus, this index goes some way in measuring capacity but does not address directly the actual performance of governments.

In summary, the existing literature on measuring governance, and the political science contributions to that literature, all provide a part of the picture of governance. However, they do not, I would argue, provide a sufficient means of understanding how governance functions, or of making comparisons of the level of governance being produced in different countries. The properties being measured are important, but they provide only a limited perspective on what governance is and how it should be measured.

A GENERIC CONCEPTION OF GOVERNANCE, AND ITS MEASUREMENT

Talking about measurement requires talking first about conceptualization (Peters, 2013). As already intimated, I will be arguing that political science should begin with a rather generic conceptualization of governance before attempting to deal with narrowly defined conceptions, such as democratic governance or network governance. If we can understand more clearly what it means to govern, then we can also understand more easily how to do so in a democratic fashion, or through social actors, or even in an authoritarian manner. Without the more general conception, we are left with a set of partial measures, many of which really do not address governance in any meaningful sense of the term.

The root word for governance, government, etc. is a Greek word meaning steering (kybernan, to steer). If we start with that notion of governance, then to govern means to steer the economy and society in certain directions, and the most fundamental measure of governance is the extent to which the targets of steering are achieved (see Pierre and Peters, 2016). That sounds simple, but in reality is a rather difficult measurement task. To measure governance in this seemingly straightforward manner connects governance with the public policy literature, especially that on success and failure (McConnell, 2010; Peters, 2015) and policy evaluation (Vedung, 2006).

The connection to the policy literature raises an even more fundamental point. As we attempt to measure governance it is important to remember that policy is the mechanism through which governance can be achieved. The policy instruments utilized to govern may vary from moral suasion through to the use of physical coercion (Bemelmans-Videc et al., 1998), but public policies are how governments or other actors can affect the economy and society. For governance to be acceptable to the public, the policy instruments selected may be important, along with the content of the programs being delivered.

I therefore agree strongly with Robert Rotberg (2014) that governance is about whether the actors responsible for governance – whether state or non-state actors – are capable of delivering the goods, and whether they actually do deliver the goods to the people. Those goods may be social programs, defense, economic development, or any number of other policies, but unless the study of governance focuses on what actually happens for and to citizens it is, as Rotberg points out, a set of vague impressions about processes and institutions. In short, as Addink (2019, p. 16) has argued:

Good governance is not only about the proper use of the government's powers in a transparent and participative way, it also requires a good and faithful exercise of power. In essence, it concerns the fulfilment of the three elementary tasks of government: to guarantee the security of persons and society; to manage an effective and accountable framework for the public sector; and to promote the economic and social aims of the country in accordance with the wishes of the population.

The simple way to measure governance therefore appears to be to just examine the quality of public services, and the well-being of citizens. We could simply measure, for example, the extent to which the UN's Sustainable Development Goals have been reached in various countries or subnational units, and that could be argued to tell us how good governance is (Meadowcroft, 2007). That might be a start, but for political scientists there are at least three other important considerations that must be addressed.

The first issue is whether the public are doing rather well simply because the country is affluent. For much of the post-war period in the United States many if not most people lived rather well simply because the economy performed well. Governance – other than permitting or encouraging the market to function well – was perhaps a small part of the equation.

Following from the first point, a second concern is the extent of inequality of the outcomes for the public. The average economic and social well-being of the public may be good, but that can mask significant levels of inequality in which the worst off do not live well at all. The danger with many of the measures of government performance is that they depend on averages, and those averages can hide perhaps as much as they reveal (the brief numerical sketch below illustrates the point). Further, for measures of governance capacity the variance within a country across policy domains may be greater than differences across countries, making simple comparisons far too simplistic (Gingerich, 2012).

The third concern, which gets more directly at measures of governance, is how much of a positive result from governance actions counts as a success? This is a common problem in evaluation research, and clearly also applies as we attempt to measure and evaluate governance. For example, "Obamacare" in the United States has enabled millions of previously uninsured people to get health insurance, but over 20 million Americans remain uninsured (Peters, 2023). Is this program a success, and is it good governance? Further, are there unintended consequences that overwhelm the benefits being created by the policy?

At this point in the discussion the central roles of politics and governing become clear, and their actions must be included directly in the measurement of governance. This need for direct inclusion of politics is true whether the State itself is attempting to govern or it is delegating responsibilities to other actors such as networks. This perspective will require some thinking about what the governing actors are actually doing, and what they intend to do, when making policy. Thus, governance is about setting (public) goals and then attempting to reach them. Working with such a definition of governance, good governance is having a high percentage of success in reaching those goals.
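To make the second concern concrete, consider a minimal sketch in Python (invented numbers; the Gini coefficient is a standard inequality measure, not one proposed in this chapter) showing how two societies with identical average outcomes can differ sharply in how those outcomes are distributed:

```python
# Illustrative only: invented outcome scores for two hypothetical societies.
outcomes_a = [50, 50, 50, 50]   # evenly shared well-being
outcomes_b = [5, 5, 5, 185]     # same total, concentrated at the top

def mean(values):
    return sum(values) / len(values)

def gini(values):
    """Gini coefficient via mean absolute difference: 0 = perfect equality."""
    n = len(values)
    mad = sum(abs(x - y) for x in values for y in values) / (n * n)
    return mad / (2 * mean(values))

print(mean(outcomes_a), mean(outcomes_b))                      # 50.0 50.0
print(round(gini(outcomes_a), 3), round(gini(outcomes_b), 3))  # 0.0 0.675
```

An average-based governance indicator would score these two societies identically; any distribution-sensitive measure would not.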

There is a good deal of qualitative evidence that good governance in this definition is difficult to attain, with a number of studies documenting the failures of governments to design and implement programs that work (Bovens and 't Hart, 2016; King and Crewe, 2014). It may be, however, that there has been more interest in the blunders of government than in its successes, just as the implementation literature has been built more on studying failures than successes when putting public programs into action (Pressman and Wildavsky, 1974). However, some scholars have indeed pointed out the successes of governments in making and implementing policies (Compton and 't Hart, 2019).

This definition of governance is generic and can be applied to any form of governance. That said, this definition almost immediately pushes us to differentiate democratic and undemocratic regimes. The most important factor requiring such differentiation is the source of the goals being pursued through governance actions. For democratic regimes, the assumption is that the goals that governance actors should pursue come from "the people." We know, of course, that one of the tasks of leadership in democratic governance is to shape public opinion, and to convince the public that their goals are the same as those of the leaders. But at some point leaders cannot stray too far from what the public wants, assuming that they and their political party want to be reelected.

The above notion of good governance can present some normative challenges. What if the governing actors in question pursue goals that do not correspond to widely accepted values concerning equity or human rights? This deviation from fundamental values may occur in democratic governments as well as authoritarian governments, for example, in the treatment of migrants. If the government actors are able to reach those unacceptable goals, is that still good governance? Even in fully democratic regimes political leaders may use their power to pursue their private goals. Such goals need not involve corruption per se or overt self-enrichment, but they still involve using public power to assist specific groups and pursue private ends, as famously discussed by Theodore Lowi (1969). This problem arises to some extent in the real world in the context of failed or weak states. Such states cannot deliver the goods and services desired by their publics; the warlords, or the cartels, or whoever else controls territory can, but they deliver abuses along with the services (Börzel and Risse, 2022). Can we say they are doing a good job of governing?

Goals: whose and how specific?

The first question about measuring governance is how do we establish the goals that are to be pursued through governance? Those goals can be thought of as existing at several levels within the political system. The first is what do the people want? Governance means, at least in democratic settings, that the goals being pursued by governments should reflect the desires of the people. Many of the complaints by populist movements over the past several decades have been that the elites in government are not responsive to the people, and therefore are not providing good governance. At least for the American states there is some systematic evidence that leaders are more responsive than usually assumed (Caughey and Warshaw, 2022), but that positive finding may not be duplicated in other settings.
Interestingly, although populists on the political right have claimed that government elites do not reflect the will of the people, it seems that some of the most egregious examples of responsiveness failures have come from those very right-wing governments. For example, in the United States the Republican party has been leading the campaign to ban abortions, but in at least one state where those elites were passing legislation to impose a ban, a referendum gave a resounding endorsement of abortion rights (Cassidy, 2022).

The first question to be asked about governance, again especially in democratic regimes, is how responsive are the goals being pursued to the wishes of the public? Responsiveness is a classic question in political science, and is usually connected with the platforms of political parties and the pronouncements of political candidates. The Manifestos Project (Klingemann et al., 2006) has provided a wealth of data about manifestos and speeches, and it tends to show a somewhat tenuous connection between the priorities of citizens (as expressed in public opinion polls) and those platforms (see, e.g., Allen and Bara, 2017).

The responsiveness question can also be applied to the social actors that are involved in collaborative governance. The assumption is that the leadership of these organizations do indeed represent the preferences of the members, but the classic view of Michels (1959) may still prevail, and the leaders may represent more their own preferences and ambitions, and may respond more to their desire to be successful in the deliberations than to the preferences of members. Thus, utilizing networks and other forms of collaborative governance may not necessarily mean that policy choices will correspond to preferences of even the mass membership of those organizations, much less the preferences of the public as a whole.

Thus, the first stage of responsiveness is the link between the policy preferences of the public and the goals expressed by political parties and political leaders. The goals for governing that come from the citizens, as well as those that come from the social actors involved in collaborative governance, have to be filtered through institutions, and that then becomes a second level of examining responsiveness. For the political parties involved in government, the second stage of this process is the linkage between the goals expressed during elections and the goals that are actually pursued by the government. Governments may have to campaign in certain ways if they want to be elected, but then may promptly forget what they campaigned on once in office. Some of that amnesia may be purposeful, and some of it may be just a function of the difficulties in governing and changing circumstances. Richard Rose (1976) found that governments elected in the United Kingdom tended to do about one-third of what they had promised, did nothing about another one-third, and did exactly the opposite for the final third.

Rose's example of the United Kingdom was for a government composed of a single party. Trying to pursue campaign promises in a coalition government is even more of a challenge. Coalition agreements often try to allow the member parties to pursue their top priorities – Green parties get the ministry of the environment in order to pursue their commitments, for example. This may not be possible for all parties, and further bargaining to form a coalition may lead to promises being abandoned, or whittled down to little more than rhetoric. Thus, if governance is about pursuing goals promised in a campaign, coalition governments and especially "rainbow coalitions" with multiple and ideologically diverse parties may be prone to failure (Kawecki, 2022). The potential saving grace for coalition governments is that they tend to be more prevalent in "consensus" political systems (Lijphart, 2012) in which there has been more general agreement on goals.
However, as more niche parties, especially those with radical nationalist programs, are elected into parliament, that consensus is declining, and the difficulties of democratic governance are increasing with it.

The institutional structures involved with collaborative governance pose their own challenges to having goals coming from the environment actually making it into practice. If we assume that, much like political parties entering a coalition, the actors involved in collaborative governance enter with their own goals, then the bargaining process involved may lead to some weakening, or abandonment, of goals in order to reach a viable consensus among the actors involved. If we further assume that the actors involved have a de facto, if not de jure, veto over decisions then the result may be governance through the lowest common denominator (Scharpf, 1988).

The third stage of governance in this generic conception is the implementation of the programs and preferences of leaders and their parties in government. As noted above, the general finding about implementation is that actually producing the intended results of policies is difficult. There are numerous barriers – political, institutional and social – that can prevent a policy from actually making the social and economic impact intended. Those implementation problems are endemic in governance, but may be especially vexing for governments in less developed political systems that face massively larger problems with often less professionalized implementation structures.

Although implementation failures may appear to be more bureaucratic and technical, we must also be cognizant of their role in the capacity of democratic governments to "deliver the goods" to the society. The populist rhetoric that has become more common in the 21st century tends to demean the "Deep State," meaning the bureaucracy, and fails to recognize the commitment of most bureaucrats to liberal democratic values and public service (Yesilkagit et al., 2022). And fundamentally, in governance terms, it is important that the actors responsible for governance are also responsible for how the policies are implemented.

What are the goals of government, and is the government of the day or the governing system functioning with a longer-term perspective?3 For political scientists, examining the behaviors of sitting governments is probably a more common, and appropriate, concern. We may want to know, for example, if coalition governments can govern better than single party governments, or if authoritarian governments are able to achieve their goals more readily than democratic governments. Further, political scientists may want to examine the barriers that government structures can present to effective governance (see Rose, 1976).

In summary, for governance in democratic regimes there are three distinct stages at which preferences of the public must be translated into action if there is to be effective governing. Failures may occur at any of the three stages of the process, and those failures may have different meanings for the measurement of governance. Table 6.1 presents those sources of governance failure in nominal terms, with the implicit argument that only if all three conditions are met can we say there has been successful democratic governance. That sets a high bar for governance success and, following arguments that perfect implementation should not be expected, that bar may be too high.

The ideas presented in Table 6.1 are general, and we can move away from the simple nominal measures by, for example, looking at success and failure in different policy areas. A governance system may, for example, do well in linking popular demands about the economy with policy outcomes, but may fail in other policy domains. We may therefore be able to characterize democratic political systems by not only what aspects of governance tend to fail, but also in what policy domains they fail.

Table 6.1  Sources of governance failure

                                          Goals Reached
                                  Yes                      No
  Public Approval     Yes   Popular Governance (A)   Failed Governance (B)
  of Goals            No    Private State (C)        Failed Governance (D)

Table 6.1 also presents a particular version of democratic responsiveness and democratic governance. There may be other versions of democracy that are less direct and rely on elites, and perhaps especially elites within the public bureaucracy, to set the policy agenda. Some scholars (see Page and Jenkins, 2005) have argued that much of the detail of policy in the United Kingdom comes from the bureaucracy. Perhaps even more clearly, étatiste style democracies with strong bureaucracies may attempt to pressure the public rather than being pressured by them. Measuring the first stage of the governance model above may become more difficult in those settings, but the other components of the model do remain viable.

Likewise, there may be more direct forms of democratic governance that permit the public to bypass formal institutions and to legislate directly, for example, through referendums or participatory budgeting exercises. In these cases, the principal source of governance failure would be at the implementation stage. That sort of failure might even be more probable than in the usual governance processes because the political and bureaucratic elites responsible for implementation may not agree with the tasks they have been handed through a popular referendum.

In summary, conceptualizing governance as steering, and conceptualizing democratic governance as steering in response to public preferences, leads me to a particular way of measuring governance. This perhaps focuses too much on failures of governance rather than on successes, but the intention is not to be negative about governance. Rather the purpose is to be able to identify successes and, at the same time, also identify the sources of failure. Understanding failure in this approach to measurement will have both theoretical and practical benefits for the advancement of governing. Further, this approach to measurement moves beyond corruption and measures of that sort to focus on more political conceptions of governance.
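For readers who prefer to see the nominal logic of Table 6.1 stated mechanically, a minimal sketch follows (in Python; the function name and the boolean coding of cases are illustrative assumptions, since the chapter deliberately leaves operationalization open):

```python
# A sketch of the fourfold typology in Table 6.1 (cells A-D).
# How a researcher actually codes "approval" and "attainment" is the hard
# measurement problem discussed in the text; booleans are a simplification.
def classify_governance(public_approval: bool, goals_reached: bool) -> str:
    if public_approval and goals_reached:
        return "Popular governance (A)"
    if public_approval:
        return "Failed governance (B)"
    if goals_reached:
        return "Private state (C)"
    return "Failed governance (D)"

# Goals reached without public approval of those goals falls in cell C.
print(classify_governance(public_approval=False, goals_reached=True))
```

As the text notes, a single country may generate outcomes in all four cells across its policy domains, so any such coding is more informative per policy area than per regime.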

GOVERNANCE IN AUTHORITARIAN REGIMES

The above discussion has focused on governance in democratic political systems. Governance in those settings means setting goals that are popular with, or at a minimum beneficial for, the public, and attempting to reach those goals. Governance in authoritarian regimes can be assessed in somewhat the same manner, with the difference being that the process of setting goals is somewhat easier. The more autocratic the regime is, the less leaders are likely to be concerned about responding to popular goals, and will instead focus on their own goals. Implementation in an authoritarian government may also be easier, given the capacity to mobilize the power of public sector institutions with minimal opposition.

That said, more astute leaders in non-democratic regimes will understand that they cannot stray too far from the preferences of the public, at least not for very long. Especially in an age of the Internet and social media, it is difficult for an autocratic leader to ignore the preferences of their publics (Reuter and Szakonyi, 2015). Further, autocratic leaders generally are conscious of the international visibility of their actions, and may want to appear to be benign despots, even as they remain despots.

The public responsiveness elements of governance central to democratic governance may be absent for authoritarian regimes, but the final component of the governance model – implementation – will still be present. Authoritarian leaders may attempt to implement their own goals and may fail, and hence be unsuccessful in governing. While we assume that the primary goal for democratic governance will be supplying public goods and services, we can also assume that the primary goal for autocratic leaders is maintaining their positions in government.

It is important to remember, however, that autocratic leaders may also have some public goals for governing. While most of their goals may be reflexive, that is, attempting to maintain their position in power, some may be about providing the healthcare, education, etc. that the public wants and needs. Further, pursuing those public goals may be a means of solidifying their position with the public, and preventing internal discontent.4 Somewhat paradoxically, if the autocratic rulers do choose to pursue those public goals, they may be more successful than democratic leaders, not being bothered with pesky details such as independent legislatures and courts.

GOVERNANCE IN FAILED STATES

I am assuming that the democratic and authoritarian governments discussed above are reasonably stable and effective political systems. They may be far from perfect, with numerous governance failures, but they are able to exercise authority over their territory for most policies for most of the time. Their citizens may not always be happy with their governments, but they do largely accept them, whether out of loyalty or fear. Not all countries are so fortunate, and there are a number of weak, and even failed, states that are not capable of providing governance through a formal government operating through the usual mechanisms.

For our purposes in this chapter, the most important thing about these failed states is that some governance is still produced. What is produced is not the same sort of governance that might be produced in other settings, even in authoritarian regimes, but it is governance nonetheless (see Börzel and Risse, 2022). Social groups such as clans, as well as warlords, or even businesses, can be the source of governance in these systems. Can we measure the governance provided in these systems in the same way as we might for more institutionalized governance systems?

One answer to the above question would be no. Governments (and their partners) that cannot control their territory and that are faced with rival sources of governance simply cannot be said to be governing. A more correct answer, I would argue, is yes and no. While the negative assessment of governance in these states makes an important point, I would still argue that governance is about setting collective goals and reaching them. The difficulty for failed states is that there may be multiple groups attempting to govern, and the control of territory within which to govern may be contested. Studying governance in failed states may involve examining the goals and the successes of multiple actors. The nominal government may have its goals but so too will various warlords, or clans, or cartels that are controlling some parts of the territory. In contrast to the democratic governments discussed above, I would expect relatively few of the goals of these actors to be about the delivery of public services; more would concern the survival of the groups in question, as well as their ambition of extending their domain.

It is also important to remember that some of these multiple sources of governance exist even in states that are not truly failed. There is a good deal of "shadow governance" (Peters, 2011) existing within developed democratic regimes. For example, in countries with substantial indigenous populations those groups may have rights to govern their own territories. Likewise, social actors such as churches or unions may provide public services such as education or job training. This list could be extended, but all these arrangements represent a willingness on the part of governments to delegate some parts of governing to other actors. What differentiates this delegated governance in more successful regimes from the fragmentation of failed states is the capacity existing within the more successful governments to exercise authority and rescind the delegation.

Therefore, when analyzing governance in democratic regimes, and attempting to measure that governance, we must take delegation into account. Delegation can affect not only the goals being pursued after a government is elected but also the implementation of programs seeking to attain those goals. Recognizing the importance of delegation in governing also requires recognition of multiple sources of goals within specific components of the country. When governance is delegated to tribal governments in the United States, or to Maori governments in New Zealand, the goals pursued may be different than those which might have been pursued by the national governments (Imai et al., 2009). These differences may require that governance be measured in a more differentiated manner.

SUBTYPES

For the autocratic leaders and their governance, I will argue that there are at least four subtypes of governance based on the success or failure of those leaders in reaching their goals. I am again assuming that the goals being pursued are primarily those of the leaders, whether individuals, cliques, or hegemonic parties. Although developed for private reasons, those goals may still have some public importance and public impact, and the public may have positive opinions about the policies being advocated (see Sinkkonen, 2021).

We can conceptualize governance possibilities in authoritarian regimes as shown in Table 6.1. Leaders in these regimes may pursue their own goals as well as some public goals, and may be successful in some areas and not successful in others. While this typology was developed for governance in authoritarian regimes, it may have more general applicability. Something which might be called "public governance" is possible in authoritarian regimes just as private governance is possible in democratic regimes, when leaders use their positions in government to pursue their own goals.

If we apply this typology to the real world of governance it seems that any one country could have some outcomes in all four boxes. Such an outcome would certainly help to demonstrate the complexity of attempting to measure governance, but it would not be satisfying if the researcher is attempting to characterize regimes, or leaders. Again, we should remember that when doing comparative research on policy and governance the differences across policy domains may be greater than those across political systems.

METHODOLOGY AND DATA: ACTUALLY DOING THE MEASUREMENT

The discussion to this point has raised a number of points about what measurement of governance should be, from the perspective of students of political science and government. The obvious question then is how can scholars translate that conceptual discussion into action, and produce research that does indeed measure governance in the ways outlined above (see Gisselquist, 2014). There has already been a good deal of qualitative and descriptive work done on the topic of governance. Most of this has been done using expert surveys, asking knowledgeable scholars or practitioners to rate governance in their own and/or other countries on a number of dimensions. These data, although valuable as an initial stage of research on governance, are impressionistic and, further, are often averages that may obfuscate wide differences among the evaluators involved (a point the brief sketch at the end of this section illustrates).

Further, governments in democratic regimes are to some extent performing assessments of their governance performance every time they go to the polls. At every election there will be claims that the incumbent government has failed to fulfill the promises they made in the previous election campaign, and therefore should be expelled from office. While the evidence of governance in these cases is clouded by a large number of intervening variables, it does provide some sense of how well a government (along with its allies in the private sector) is governing. In some ways, this political conception of governance as goal attainment helps to validate the academic conception that has been discussed, and voters may be empowered to assess how well governance is being performed.

If we move beyond these political "measures" of good governance, we can first think about the use of case studies as a means of understanding if, and how, governance is created. In this era of (excess) quantification in the social sciences, the humble case study may appear to be of minimal value, but that is far from the truth, especially if the researcher uses techniques such as process tracing (Beach and Pedersen, 2019) to explore the causal connections within the case more fully. Case studies may be used in the first instance to assess the extent to which policy goals were reached, much as might be true for evaluation research. The researcher will first need to identify the goal or goals being sought in a program and then find means of assessing the extent of attainment.
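The concern about averaged expert ratings can be illustrated with a minimal sketch (in Python, using invented ratings on a hypothetical 1–10 governance scale; no real index is being reproduced here):

```python
# Illustrative only: two invented panels of expert ratings for two countries.
from statistics import mean, stdev

panel_consensus = [6, 6, 7, 7, 6, 7]     # experts broadly agree
panel_divided = [2, 10, 3, 10, 4, 10]    # experts deeply disagree

for label, ratings in [("consensus", panel_consensus), ("divided", panel_divided)]:
    print(label, round(mean(ratings), 2), round(stdev(ratings), 2))
# Both panels average 6.5, but the standard deviations (about 0.5 versus 3.9)
# reveal disagreement that a single published country score would hide.
```

Reporting a dispersion measure alongside the mean is one simple way to keep such disagreement visible.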

CONCLUSIONS

From the perspective of political science, governance is about selecting collective goals and then making attempts to reach those goals. This perspective is valid whether the goals are selected through a democratic process or selected unilaterally by an authoritarian leader. While I may prefer the former method of goal selection on normative grounds, for measuring governance the differences among the regimes are not fundamental. Likewise, within democratic regimes, whether the goals are selected by some sort of participatory, network process or through representative democracy may be of limited importance.

The conceptualization of governance presented here, and the associated forms of measurement, emphasize the role of governance in solving collective problems. But it is also important to note the extent to which the same processes and mechanisms could be used to appropriate public power for private purposes. Thus, just as private actors may be used to serve public purposes so too can public actors pursue private goals. This can still be considered a form of governance, and we must always be sure to consider the goals being pursued in governance actions.

The ideas for measuring governance here are preliminary, and not fully developed. At the same time, they do attempt to examine governance as a more political activity, and more of a goal-seeking activity, than is often the case for measurements of governance. The measurement ideas proposed here are more difficult to implement than are some commonly used measures of governance, but I believe that they will provide a better picture of how that essential "product" of the political system can be fashioned.

NOTES

1. The collection includes a variety of other types of indicators, but clearly the emphasis is on human rights.
2. Grindle uses the term "good enough governance," but not in the way in which the World Bank does. For her, good governance is the ability of governments and their allies in the private sector to produce effective action. This is much closer to the approach to governance that I am advocating.
3. Most constitutions have statements of lofty goals, and sometimes more specific goals. For example, the preamble to the US Constitution states that the goals are "to provide for the common defense, promote the general welfare, and to secure the blessings of liberty to ourselves and our posterity."
4. Unfortunately, relatively few autocratic rulers have got this message, and they continue to pursue more selfish goals.

REFERENCES

Addink, G.H. (2019). Good governance: Concept and context. Oxford University Press.
Allen, N., & Bara, J. (2017). 'Public foreplay' or programmes for government? The content of 2015 party manifestos. Parliamentary Affairs, 70, 1–21.
Ansell, C., & Torfing, J. (2022). Handbook of theories of governance (2nd ed.). Edward Elgar.
Beach, D., & Pedersen, R.B. (2019). Process-tracing methods: Foundations and guidelines. University of Michigan Press.
Bell, S., & Hindmoor, A. (2010). Rethinking governance: The centrality of the state in modern society. Cambridge University Press.
Bemelmans-Videc, M.-L., Rist, R.C., & Vedung, E. (1998). Carrots, sticks and sermons: Policy instruments and their evaluation. Transaction.
Börzel, T., & Risse, T. (2010). Governance without a state: Can it work? Regulation and Governance, 4, 113–34.
Börzel, T., & Risse, T. (2022). Effective governance under anarchy: Institutions, legitimacy and social trust in areas of limited statehood. Cambridge University Press.
Bovens, M.A.P., & 't Hart, P. (2016). Revisiting the study of policy failures. Journal of European Public Policy, 23, 653–66.
Capano, G., Zito, A.R., Toth, F., & Rayner, J. (2022). Trajectories of governance: How states shaped policy sectors in the neoliberal age. Macmillan.
Caughey, D., & Warshaw, C. (2022). Dynamic democracy: Public opinion, elections and policymaking in the American states. University of Chicago Press.
Collier, D., & Levitsky, S. (1997). Democracy with adjectives: Conceptual innovation in comparative research. World Politics, 49, 430–51.
Compton, M.E., & 't Hart, P. (2019). Great policy successes. Oxford University Press.
Demmers, J., Jilberto, A.E.F., & Hogenboom, B. (2004). Good governance in the era of global neoliberalism: Conflict and depolitisation in Latin America, Eastern Europe, Asia, and Africa. Routledge.
Gingerich, D.W. (2012). Governance indicators and the level of analysis problem: Empirical findings from Latin America. British Journal of Political Science, 43, 505–40.
Gisselquist, R.M. (2014). Developing and evaluating governance indices: 10 questions. Policy Studies, 35, 513–31.
Grindle, M.S. (2011). Good enough governance revisited. Development Policy Review, 25, 533–74.
Hall, J.S. (2002). Reconsidering the connection between capacity and governance. Public Organization Review, 2, 23–43.
Hanson, J.K., & Sigman, R. (2021). Leviathan's latent dimensions: Measuring state capacity for comparative political research. Journal of Politics, 83, 1495–510.
Imai, S., McNeil, K., & Richardson, B.J. (2009). Indigenous people and the law: Comparative and critical perspectives. Bloomsbury.
Kaufmann, D., Kraay, A., & Mastruzzi, M. (2007). The worldwide governance indicators project: Answering critics. The World Bank.
Kawecki, D. (2022). End of consensus? Ideology, partisan identity and affective polarization in Finland 2003–19. Scandinavian Political Studies, 45, 478–503.
King, A., & Crewe, I. (2014). The blunders of governments. Oneworld.
Klingemann, H.-D., Volkens, A., Bara, J., Budge, I., & McDonald, M. (2006). Mapping policy preferences II: Estimates for parties, electors, and governments in Eastern Europe, European Union and OECD countries 1990–2003. Oxford University Press.
Kooiman, J. (2003). Governing as governance. Sage.
Lijphart, A. (2012). Patterns of democracy (2nd ed.). Yale University Press.
Lowi, T.J. (1969). The end of liberalism. W.W. Norton.
Mann, M. (1984). The autonomous power of the state: Its origins, mechanisms and results. European Journal of Sociology, 25, 185–213.
McConnell, A. (2010). Understanding policy success: Rethinking public policy. Macmillan.
Meadowcroft, J. (2007). Who's in charge here? Governance for sustainable development in a complex world. Journal of Environmental Policy & Planning, 9, 299–314.
Michels, R. (1959). Political parties. Dover Publications. Originally published 1915.
Norris, P. (2011). Measuring governance. In M. Bevir (Ed.), Handbook of governance (pp. 46–71). Sage.
Oman, C.P., & Arndt, C. (2010). Measuring governance. OECD Development Centre, Policy Brief, 39.
Page, E.C., & Jenkins, B. (2005). Policy bureaucracy: Government with a cast of thousands. Oxford University Press.
Peters, B.G. (2011). Governing in the shadows. Asia-Pacific Journal of Public Administration, 33, 1–16.
Peters, B.G. (2013). Strategies for comparative research in political science. Macmillan.
Peters, B.G. (2015). State failure, governance failure and policy failure: Exploring the linkages. Public Policy and Administration, 30, 261–76.
Peters, B.G. (2023). Health policy in the United States. Bristol University Press.
Peters, B.G., Pierre, J., Sørensen, E., & Torfing, J. (2022). A research agenda for collaborative governance. Edward Elgar.
Pierre, J., & Peters, B.G. (2016). Comparative governance: Rediscovering the functional dimensions of governing. Cambridge University Press.
Pierre, J., & Peters, B.G. (2020). Governance, politics and the state. Macmillan.
Pressman, J.L., & Wildavsky, A. (1974). Implementation. University of California Press.
Rauch, J., & Evans, P. (2000). Bureaucratic structure and bureaucratic performance in less developed countries. Journal of Public Economics, 75, 49–71.
Reuter, O.J., & Szakonyi, D. (2015). Online social media and political awareness in authoritarian regimes. British Journal of Political Science, 45, 29–51.
Rhodes, R.A.W. (1996). The new governance: Governance without government. Political Studies, 44, 652–77.
Rose, R. (1976). The problem of party government. Macmillan.
Rotberg, R.I. (2014). Good governance means performance and results. Governance, 27, 511–18.
Sartori, G. (1970). Concept misformation in comparative politics. American Political Science Review, 64, 1033–53.
Scharpf, F.W. (1988). The joint decision trap: Lessons from German federalism and European integration. Public Administration, 66, 239–78.
Sinkkonen, E. (2021). Dynamic dictators: Improving the research agenda on autocratization and authoritarian resilience. Democratization, 28, 1172–90.
Sørensen, E., & Torfing, J. (2007). Theories of democratic network governance. Macmillan.
UNDP (n.d.). Governance indicators: A users' guide. https://www.un.org/ruleoflaw/files/Governance%20Indicators_A%20Users%20Guide.pdf. Accessed February 15, 2023.
Vedung, E. (2006). Evaluation research. In B.G. Peters & J. Pierre (Eds.), Handbook of public policy (pp. 372–91). Sage.
Yesilkagit, K., Bauer, M.S., Peters, B.G., & Pierre, J. (2022). Guardians of democracy: Can civil servants prevent democratic backsliding? Paper presented at the General Sessions of the European Consortium for Political Research, Innsbruck, Austria.

7. The sociology of measurement
Radhika Gorur

INTRODUCTION

Classical Western sociology emerged in the eighteenth century in the context of the monumental social, cultural, economic and political transformations ushered in by the French Revolution (Nehring & Plummer, 2014). The industrial revolution that soon followed throughout Europe saw another massive set of changes, with the mechanisation of agriculture and the rise of cities with large populations of exploited factory workers, including children, working and living in squalid and dangerous conditions. Since then, society has continued to be transformed at an incredible pace, through global and regional conflicts and alliances; technological innovations, including social media and artificial intelligence (AI); globalisation; climate change; and the global pandemic, bringing in their wake a range of new issues of concern to sociology.

Contemporary datafied societies (Van Es & Schäfer, 2017) are suffused with data, with almost every aspect of social life rendered calculable and measurable (Mayer-Schönberger & Cukier, 2013). Vast troves of social data are now mined by a variety of actors to influence behaviour and to automate decision-making, impacting society in profound ways. These transformations have raised unprecedented challenges for sociologists, requiring new ways to think about society itself. For contemporary sociologists, this has meant an engagement with data and datafication, and the rise of such concepts as the networked society (Castells, 2004) and the inclusion of the more-than-human in studying how structure and agency interact to produce various forms of social interactions, power relations, knowledges, hierarchies, institutions, dominance, resistance, politics and governance (Braidotti, 2019).

The sociology of measurement is an emerging branch of sociology which has concerned itself with how data and measurement are produced as well as how they affect individuals, institutions and societies (Gorur, 2014). It derives its inspiration from Science and Technology Studies (STS) and the history and philosophy of science. Although governance has been critiqued (and thus 'assessed' or 'measured') by sociologists since the emergence of sociology as a field, the sociology of measurement has developed a particular focus on the social life of numbers, the ways in which numbers participate in governance, and the methodological and ontological politics of measurement.

To claim that there is a field called 'the sociology of measurement' may be an over-reach: there is no conference or journal attached to these terms, and no university courses to consolidate its status as a recognised field. I subsume under the term 'sociology of measurement' the sociology of numbers (Gorur, 2014), the sociology of quantification (Espeland & Stevens, 2008) and Critical Data Studies (Kitchin & Lauriault, 2018). Critical Data Studies (CDS), which tends to focus on big data and automated decision-making, has indeed begun to evolve as a recognised field of study that is taught in universities. But the sociology of measurement, which is generally interested in all forms of data and datafication, and in the empirical and ontological politics of measurement, can be regarded as the hinterland within which CDS has evolved.

The term 'sociology of measurement' was first used by Woolgar (1991) in a critique of citation measures as the basis for assessing the quality of academic work. The term then appears to have lain dormant in this hard-to-track-down paper until Derksen (2000) referred to it in a paper on how DNA evidence moved from initially being considered a dubious form of scientific evidence to becoming one of the most trusted forms of knowledge, capable of overturning court verdicts based on other forms of evidence.

The rise and rise of numbers has seen an explosion of accounts that could be classified under the sociology of measurement. Topics covered include how measurement has aided the neoliberal shift from 'government' to 'governance' (Bellamy & Palumbo, 2017); datafication, evidence-based policy and New Public Management (Parsons, 2002); indicators and international comparisons (Gorur, 2018a); the rise of the audit culture (Strathern, 2000; Shore & Wright, 2015); the politics and effects of accountability (Grek, Maroy, & Verger, 2020); the sociology of classification and standardisation (Gorur, 2018a; Landri, 2022); and, more generally, the epistemological and ontological politics of number practices. More recently, big data, automation, digital forms of surveillance and algorithmic governance have led to the emergence of new topics and concepts, such as data justice (Taylor, 2017) and data sovereignty (Hummel, Braun, Tretter, & Dabrock, 2021). More usually covered under the umbrella of CDS, these accounts are found in many fields including environmental science, health administration, global aid and development, organisational studies, media and culture, and law. Topics also range in scale, from quantification at an individual level (through wearable and other personalised technologies) to institutional, state and global levels (through large-scale surveys and polls and various forms of indicators and comparative accounts).

In this chapter, I provide an account of some of the key theories, concepts and contributions of the nascent field of the sociology of measurement, concluding with a reflection on the future prospects for the field. In the next section, I provide a brief historical overview of the field, tracing the roots of the sociology of measurement in classical Western sociology and its preoccupation with modernity.

MODERNITY, MEASUREMENT AND GOVERNANCE

Given its roots in political and economic upheaval during the time of the French Revolution, industrialisation and colonisation, modernity has remained a central concern for sociology (Nehring & Plummer, 2014). This includes, on the one hand, a study of how the powerful gained and remained in power, and, on the other hand, issues such as poverty and oppression. Governance and administration, the spread of Enlightenment ideas of rationality and progress, and the ways in which power, wealth and privilege came to be concentrated or redistributed during this period were all concerns with which sociologists began to engage. Nehring and Plummer (2014, p. 171) identify three signature features that make evident the vast canvas of modernity: 'A set of distinctive forms of social, economic, cultural and political organisations and institutions'; 'Underlying systems of norms, values, and beliefs'; and 'Modes of everyday life that have emerged from and are sustained by these forms of social organisation and norms, values, and beliefs.'

A key instrument of modernity is measurement. Sociology has concerned itself both with the way measurement is used in the service of governance and with the way it is used to render governance itself measurable. In the current era of outcomes, transparency and accountability,

these two phenomena appear to have merged in interesting ways, folding in on each other. Accountability is not just demanded of those in charge; it is also accepted as an obligation by those who govern to render themselves accountable. Such accountability is a double-edged sword. On the one hand, it exposes governing bodies to the media and the public, making them vulnerable. On the other hand, it strengthens their legitimacy and moral authority. By making themselves vulnerable to critique and consequences, those who govern appear to become stronger through a show of openness. This is at the heart of the current mantra of open government. As Porter (1996) has argued, numbers and instruments of measurement are invested with a sense of legitimacy, clarity and objectivity, leading to a trust that is less readily accorded to humans. Governments as well as citizens are thus wont to delegate trust to numbers rather than to humans, who are deemed likely to be biased, fickle and unreliable. Sociological critique can 'measure governance' in terms of how numbers are being used to govern, how number-based governance is affecting various groups, ideas and societies, and how numbers are being used by governments and managements to track and monitor themselves, gain credibility and render themselves accountable. The next section outlines key theories that have guided the study of the relationship between measurement and governance in sociology – with an emphasis on the sociology of measurement.

STATISTICS AND THE STATE

The relationship between the state and statistics, examined by key historians of numbers and scholars of policy and politics such as Desrosières (1998), Porter (1996) and Scott (1998), has provided a rich foundation for one aspect of sociological understandings of quantification and its effects on individuals, institutions and societies. These scholars have described how historical developments in the statistical sciences and the establishment of infrastructures for regular, standardised forms of data generation over the last 200 or so years have rendered the state visible, measurable and governable in unprecedented ways. As the statistical sciences became more sophisticated and ambitious, so did the state's ambitions to expand control and to regulate its population with ever more fine-grained data. Desrosières (1998) demonstrates the power of statistical notions such as the mean or the mode to set expectations and norms against which difference, deviation and inadequacy could be attributed. Although the mean or the mode might have been derived from a range of scores or widely distributed data, once such a value is set as a norm, it tends to become the desired way to be. These norms act as powerful phenomena which guide the behaviour of individuals and form the basis for laws and policies.
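This mechanism can be given a schematic form. The following is an illustrative formalisation, not drawn from Desrosières, of how a descriptive average turns into an evaluative norm. Given a set of scores $x_1, \dots, x_n$, statisticians compute

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad z_i = \frac{x_i - \bar{x}}{s}.$$

The mean $\bar{x}$ merely summarises the spread of the scores from which it is derived; yet once it is institutionalised as a norm, a negative standardised score $z_i$ reads as deviance or inadequacy, even though, by construction, a substantial share of any population must fall below the mean.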

If Desrosières and others described how the history of statistics impacted governance, Scott's (1998) Seeing Like a State focused on how measurement and standardisation enabled the state to be described in increasing detail, rendering it more governable. He describes the development of a particular science of the state, predicated on systematic resource management, long-term planning and better-regulated taxation, which in turn produced particular citizen-state relations. Centralised control and management were enabled by a synoptic view of the state which rendered populations more and more abstract and stylised. A remote form of governance sought to regulate and improve the human condition. However, he argues, based on case studies ranging from forestry management in Prussia to city planning in Paris and Chandigarh, that the abstract, reductionist data used in such regulation wrought untold hardships upon people, forcing them to seek ways to subvert or escape the state's efforts to regulate their everyday lives. This is because the synoptic view on which such policies and planning are based ignores local knowledge and experience which are relevant to the everyday lives of people. Governance based on reductionist and synoptic views can eventually produce states that mirror the measurements; as Jasanoff (2004) has argued, knowledge (in this instance statistical descriptions of the state) and the state are co-produced.

Another key aspect of numbers and measurement highlighted by historians and sociologists relates to the effects of standardisation and classification on individuals and societies. Hacking (2007) has demonstrated how the development of standardised categories for such phenomena as mental illness 'made up' people who could then be counted and governed as particular types of subjects. Bowker and Star (2000) have also shown how categorisation has ontological consequences – for example, a limited set of gender categories on a census form could render invisible gender-fluid, trans and other non-binary gender identifications – limiting the ability of people to identify in particular ways, or for advocacy and the creation of laws to protect or support categories that do not officially exist. Census exercises act not only to 'construct populations' but also to produce subjects who can identify themselves as members of that population (Ruppert, 2008). However, as Latour (2012) has noted, society has never been comprised of subjects who fit into clear categories – rather, we tend to be messy hybrids and to leak out of any categories designed to contain us.

Governing by Numbers

In the past four or so decades, neoliberal philosophies have shaped the economic and social agendas of political parties on both the right and the left, displacing the diversity of ideologies that was previously the signature of a political party. 'Evidence' – mostly large-scale statistical and comparative data – appears to have replaced political philosophy as the basis for policy. Ideologies were cast as the opposite of data and evidence, which in turn projected data as being apolitical and objective. Numbers brought with them a way to govern that was seen as common sense – facts were supposed to drive decisions in the era of New Public Management (NPM), which aimed to harness the interoperability of various databases to advance joined-up governance, the centralisation of authority and the decentralisation of responsibility, with states 'steering at a distance' through mechanisms of benchmarking, target-setting, measuring and monitoring (Ball & Youdell, 2008). Rewards and penalties were attached to performance measures to incentivise actors across government and in various institutions to be productive and achieve or exceed the set goals. Evidence-based practice was not just the way governments sought to operate – quantified data became ingrained in society as the most trusted type of knowledge, and society became infused with an audit culture (Power, 2000; Strathern, 2000). Scientific advances in statistical measurement and comparison, and technological advances in governance infrastructures, including new means of presenting data to citizens in accessible ways, enabled a high-modernist mode of governing (Scott, 1998) in which predictability, transparency and accountability became the hallmarks.
Citizens were expected to play a key role in the audit society – they were actively invited to engage in holding states and institutions accountable using the data made publicly available (Gorur, 2018b). Meanwhile, with the establishment of intergovernmental agencies in the interwar and post-war years, the focus on datafication became global and took on greater urgency (Gorur, 2018a).

Good numerical and comparative data were seen as imperative for proper planning, governing and monitoring (Smyth, 2008). Economic progress became a global security issue in the Cold War period, and the world, previously classified as communist and non-communist, came to be described as 'developed' or 'underdeveloped' (Gorur, 2018b). Modernistic and Enlightenment visions of making the world legible, governable and improvable were pursued with missionary zeal. There was a distinct focus on data for economic growth, for rebuilding nations torn by war, and for supporting newly independent nations emerging from decades – even centuries – of colonisation (Collins & Wiseman, 2012). International aid agencies supported the generation of globally comparable indicators and the development of statistical infrastructures within nations so that there could be regular monitoring of progress, enabling donors to determine where to focus their resources and to track the return on their investment. Comparative accounts began to be compiled on an increasingly wide range of phenomena by organisations such as the OECD, the World Bank and UNESCO. Consequently, the ambitions and desires of nations began to be conceptualised in similar terms, both by global agencies and by national bodies, although there was still considerable diversity in political and economic philosophies, in the structures and processes of governance, and in the relations between the state and its citizens. 'Seeing like a state' was extended to a global level, flattening out the variations between nations in their measures, despite vast inequities and variations in context (Gorur, 2016a). Modernisation became the mantra for development, through education, technology and economic growth. Global schemes such as Education for All, the Millennium Development Goals and the current Sustainable Development Goals epitomise this high-modernist global vision underpinned by universal indicators and measures.

Governing by Dashboard

If the nineteenth and twentieth centuries were gripped by this numeric imaginary, where measurements and comparisons led to the rise of 'governing by numbers' (Rose, 1991), we could say that the twenty-first century is gripped by the technocratic imaginary of 'governing by dashboard' (Bartlett & Tkacz, 2017). This shift was triggered by two key transformations. First, the capacity to generate data has grown exponentially, so that the abstraction and decontextualisation of statistical data can now be supplanted by or supplemented with big data – data that are specific, detailed and contextualised (Gorur, 2018b) – and easily and instantly accessible in simplified, understandable forms. This has major implications for the forms of knowledge that are valued, as well as for the forms of governance, surveillance and individualised control that are available to institutions and states. Second, the availability of vast amounts of digitised data and developments in automated decision-making have revolutionised possibilities for governance. Governance has come to be imagined as a hands-on steering of the state based on dynamic data that are updated in real time (Gorur & Arnold, 2021). In this imaginary, rituals of annual data collection in various spheres and a census count every decade seem archaic and ridiculous.
Instead, up-to-the-minute data are to be generated using various technologies such as satellite imaging, GPS tracking and real-time electronic updates on a variety of phenomena, enabling the government to shift gears, change policy and respond to situations constantly. No longer is governance a matter of five-year plans, nor is it dependent on any political manifesto or idealism. No longer is it about commissioning research, consulting experts and pondering over evidence.

Instead, it is imagined that knowledge and expertise can be folded into instruments such as dashboards that can then be efficiently and smartly handled by professional, tech-savvy policy makers to deliver precision policy. The ability to interact directly with the public through social media, without all the cumbersome preparation that goes into public rallies and press conferences, is also part of this imaginary, which enables quite different citizen-state relations. Citizens themselves are co-opted not only into engaging with and responding to the constant stream of data, but indeed into producing it: visualisation techniques afforded by digitisation make the data accessible to a wide range of citizens, and interactive interfaces allow citizens to provide suggestions and preferences.

Perhaps the most visible and interesting use of dashboards has been made in governing 'smart cities', where a stream of real-time data (on tube line status, the availability of bikes for hire, weather and air pollution data, and live local news and stock market updates) combine with large displays – iPad walls – that enable citizens to interact with the dashboard in passing, and allowed the Mayor of London to 'look over the capital digitally as well as physically'. (Bartlett & Tkacz, 2017, p. 12)

Such technologies as dashboards illustrate this changing relationship between data, the state and citizens (Kitchin, Lauriault, & McArdle, 2015). The enormous volume of data now available, the pace at which data can be generated, and the ways in which different groups can participate in generating, accessing, contributing to and acting upon data in real time have meant that contemporary governance in low-, middle- and high-income nations alike is strongly infused with a socio-technical imaginary (Beck, Jasanoff, Stirling, & Polzin, 2021). Ministers and bureaucrats now not only imagine using numbers to develop policy that would affect people and change behaviours; they imagine acting on dynamic numbers with immediate effect on the people and situations they seek to control. At the same time, citizens are expected to become 'informed publics' (Callon, Lascoumes, & Barthe, 2011) and help regulate and monitor those in charge by becoming aware of the data, engaging with the vast amount of information generated for public consumption, exercising choice, and voting with their feet.
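To make the dashboard logic concrete, consider a minimal sketch of the kind of reduction a governance dashboard performs. The feed names, window size and thresholds below are invented for illustration and are not drawn from any system discussed above:

```python
# Illustrative sketch only: feed names, window size and thresholds are invented.
from dataclasses import dataclass
from statistics import mean

@dataclass
class Feed:
    name: str
    readings: list[float]  # latest values from a hypothetical real-time source

def indicator_status(feed: Feed, threshold: float) -> str:
    """Reduce a stream of readings to a single at-a-glance status."""
    level = mean(feed.readings[-10:])  # rolling average over the most recent readings
    flag = "ALERT" if level > threshold else "ok"
    return f"{feed.name}: {flag} ({level:.1f})"

# A dashboard, in this reading, is a set of such reductions refreshed continuously.
feeds = [Feed("air_pollution", [48.0, 52.5, 56.1]), Feed("tube_delays", [2.0, 1.5, 1.0])]
thresholds = {"air_pollution": 50.0, "tube_delays": 5.0}
for f in feeds:
    print(indicator_status(f, thresholds[f.name]))
```

The sketch makes visible what the chapter's argument implies: every 'live' indicator rests on prior choices (which feeds, which averaging window, which thresholds) that disappear from view once the traffic-light display is in circulation.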

CONCEPTS, ASSUMPTIONS AND ARGUMENTS

The increasing use of quantification and measurement in policy and governance has received critical attention from a number of scholars interested in the effects of datafication on policies, governance and practices. Sociology's deep commitment to understanding the structures and architectures of inequity, its focus on power and agency, and its attention to such issues as gender, race and class are extended in the sociology of measurement to key issues relating to quantification. Understandings of how numbers are produced; who comes to be included and excluded in these productions; the contexts of the production of measures; the methodological politics of numbers; and the effects of measurement are all concerns emerging in this nascent field.

Reductionism

A key criticism of the reliance on numbers for governing and accountability is that numbers produce thin descriptions (Porter, 2012) of complex phenomena. Emphasised in these accounts is the reductionism entailed in quantification.

Scott (1998) argued that those who govern deploy 'tunnel vision' – focusing only on the issue of interest – thus forests are reduced to usable timber, nature is translated into 'natural resources', and humans into human capital. One issue with the dependence on quantified accounts is that not everything that counts can be counted; indeed, it has been argued that some of the most important aspects of any social phenomenon might be those that cannot be quantified (Nichols & Berliner, 2007). On the other hand, that which is counted comes to matter. For example, an exclusive focus on literacy and numeracy assessment scores as a measure of quality education can lead to a neglect of other subjects and other outcomes that education might produce. Even if everyone agrees that creativity or empathy or well-being are good outcomes of education, these aspects are difficult to measure in a standardised way. Assessing student performance, as well as teacher, school and system quality, on the basis of the more readily measurable literacy and numeracy measures, and developing accountability systems underpinned by these measures, has led to an undue emphasis on literacy and numeracy to the exclusion of other, less tangible and long-term outcomes. Systems, curricula and pedagogies have adapted to maximise the possibility of improving literacy and numeracy scores. Since numbers are extensively used to justify governance decisions and to evaluate and monitor governance, sociologists of numbers highlight the reductionism of numbers and the distortions that may be caused by making simplified accounts the basis for policies and governance (Merry, 2016). While acknowledging that all types of knowledge-making processes are necessarily partial, and that tunnel vision is necessary for governance, sociologists of measurement nonetheless draw attention to what is left out and to what effects such reductionism might have on particular populations and societies.

Representation and Objectivity

A key argument in the sociology of measurement is that numbers are not accurate, unbiased and objective representations of reality. Rather, they are the products of the contexts of their production and of the methodological choices employed in their production (Gorur, 2018a). The sociology of measurement seeks to empirically highlight the epistemological and methodological politics of quantification. Empirical studies seek to show how numbers are made up – not in the sense that they are deliberately falsified or fabricated to mislead. Rather, sociologists elaborate the processes by which numbers are produced, and in so doing they describe the many decision points, the judgements involved, the compromises made, and so on. They demonstrate that numbers are the contingent product of a particular set of contextual circumstances rather than universal facts (Merry, 2016). These descriptions of the production of numeric knowledge do not seek to expose numbers as false. Rather, they keep alive some of the important controversies that are suppressed once numbers begin to circulate widely and become embedded in various instruments of governance and practice (Gorur, 2016a). Despite their lack of accuracy and objectivity, numbers excite trust (Porter, 1996). The means of production of numbers is opaque to most users, including, often, those who use the numbers to govern and those who are affected by governance. As a result, numbers are seldom questioned, except in very superficial, 'fact-checking' kinds of ways.
They are more readily relied upon when deciding policy or determining governance strategies. Moreover, such is the trust in numbers (and by extension algorithms) that decision-making is increasingly being delegated to automated systems. The greater the trust in numbers and, increasingly, in decision-making algorithms, the less the discretion and judgement accorded to humans.

This is particularly problematic because algorithms are not capable of taking complex and contingent factors into account in ways that may be crucial to the decisions made. They incorporate various racial, cultural and gender biases, leading to calls to decolonise data (Quinless, 2022). These understandings of the methodological politics of quantification are useful in querying governance decisions and holding decision-makers accountable, as well as in challenging the justifications and narratives of accountability imposed by decision-makers upon themselves and others.

Performativity

The key concern of sociologists of measurement is, however, not that numbers are not 'objective' or 'accurate' representations of a reality out there – these sociologists acknowledge that all forms of knowledge are burdened with the same shortcomings. Rather, the focus is on the performativity or the ontological politics of numbers (Gorur, 2014); sociologists of measurement argue that numbers are Janus-faced (Latour, 1987). They are not some form of self-evident description of a phenomenon – rather, they appear to represent phenomena because phenomena and their representations are co-produced. In other words, measurement is not merely a representation of reality; it also produces the reality it represents (Knorr-Cetina, 1991). This means that it is important to investigate, empirically and through case studies, what kinds of realities measurements are producing, and to engage with the consequences of these realities. Such empirical investigations also produce new concepts and theories that can be used to understand the ontological politics of measurement. Because of the trust placed in numbers and their sweeping use in governance, measurements have effects. Incorporated into policies and governance, measurement begins to act on the world in consequential ways. A university whose rank has 'slipped' may find donor dollars shrink, leading to difficulty in supporting certain programmes, which in turn leads to a real decline in quality (Espeland & Stevens, 2008). In Australia, the triennial OECD rankings of school systems provoked sweeping changes to school education governance, including the introduction of a national curriculum, a national testing programme, and a public website on which a wide variety of data on every school in the nation, including performance on national tests, is made publicly available. These have real effects, both intended and unintended, on schools, students and families. Not only do numbers provoke decisions; the anticipation of measurement also brings about changes, leading institutions to try to game the system – recruiting more high-profile professors from other nations, for example, to look better on certain criteria in university rankings. Finally, the processes involved in making the world calculable in the first place involve selection and categorisation, which beget and reify some actors and phenomena while making others invisible (Gorur, 2014).

Assemblages and Infrastructures

Critical data sociologists understand measurements as effects of heterogeneous arrangements in which a number of ideas and practices coalesce to produce a relatively stable measure. In other words, measurements are cobbled together through a series of arguments, trials and compromises, through processes that look nothing like mathematical precision or self-evident, dispassionate calculation (Gorur, 2016b).
For example, the UNESCO Institute for Statistics is currently engaged in developing global indicators to monitor progress on Sustainable Development Goal 4 (SDG4) – the education goal.

Each indicator is developed through deliberate processes of discussion, contestation and mutual learning between various groups – statistical experts, representatives of various governments, civil society organisations and regional bodies. Mathematical expertise alone is not seen as adequate in the development of these measures. However contentious the arguments are before indicators and measures are stabilised, once they are in operation they appear self-evident and logical, and are confidently embedded in a series of other assemblages, making them difficult to displace even when their deficiencies are made apparent. A good example is GDP – Gross Domestic Product – which has dominated economic and political calculus for decades. GDP is put together from several different indicators incorporated into a range of surveys and other data instruments. This complex measure has for some decades been known to be deeply flawed, yet it continues to dominate economic planning and governance as a matter of course (Fix, Nitzan, & Bichler, 2019).
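The contingency of such composite measures can be illustrated with a toy calculation. The sub-indicators and weights below are invented for exposition and bear no relation to how GDP or the SDG4 indicators are actually compiled:

```python
# Illustrative sketch only: sub-indicators and weights are invented and bear no
# relation to how GDP or any SDG indicator is actually compiled.
def composite_index(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of sub-indicators, the basic form of a composite measure."""
    return sum(scores[k] * weights[k] for k in weights) / sum(weights.values())

country = {"output": 70.0, "investment": 55.0, "consumption": 80.0}
# Two equally defensible weightings yield different 'facts' about the same country:
print(composite_index(country, {"output": 0.5, "investment": 0.3, "consumption": 0.2}))  # 67.5
print(composite_index(country, {"output": 0.2, "investment": 0.5, "consumption": 0.3}))  # 65.5
```

Nothing in the data dictates which weighting is correct; the choice is made in committee, and once made it disappears into the published figure.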
Understanding numbers as assemblages that have come to be black-boxed enables the reopening of the taken-for-granted so that it is available again for debate. Moreover, a key aspect of assemblage is its contingency – it highlights the idea that a particular measurement is an assemblage that might well have looked very different had, for example, a different set of actors been involved. This contingency also gives hope and encouragement for the possibility of change – if things might have been otherwise, they might yet be changed (Law, 2009). This makes the sociology of measurement a political project which can raise important moral and ethical questions, highlight points of intervention, and provide hope for change.

Subjects, Subjectification, Involuntary Participation

One of the most powerful aspects of measurement is its ability to produce some subjectivities while suppressing others, as previously argued. Categorisation and standardisation label people and invest them with particular subjectifications. AI technologies such as facial recognition tools have upped the ante, and concerns with regard to subjectification have multiplied. Scholars such as Crawford (2021) and Benjamin (2019) have explored racial profiling, as well as bias in how machines are taught to recognise faces, and the effects such technologies are now having on every aspect of life – from influencing law enforcement to making hiring decisions. A key argument here is that machines learn from the past, so they incorporate the biases and the unfairness of the past and present and perpetuate them into the future. In India, for example, AI was employed to determine which students might be deserving of a loan, based on who might be more likely to repay. Unsurprisingly, it was found that the AI selection was biased towards the so-called 'upper castes' (Chamuah & Ghildiyal, 2020). CDS and the sociology of measurement seek to highlight these issues empirically and convincingly as more and more government decision-making is delegated to AI in the name of making decisions fairer and less biased.

Informed Publics

Scott (1998) has argued that statistics provided a synoptic view that enabled states to develop centres of calculation from which to exercise control over territories at a great distance from the seat of government. Advances in measurement, however, have meant that in addition to such distant steering, more intimate accountability practices are also now possible (Asdal, 2011; Gorur, 2018b).

Here, the entities to be governed are not abstract and formless, but illuminated in great detail, often on public websites. This changes the relations of accountability – not only forcing institutions to make themselves accountable, but also bringing 'informed publics' (Callon, Lascoumes, & Barthe, 2011) into the mix to enforce public accountability. In such imaginaries, the public, armed with detailed data about, for example, their child's school, can hold that school to account by responding to the data. If they are unhappy with the school's results on the national assessment, for example, they can 'vote with their feet' and switch schools. The participation of informed publics changes citizen-institution-state relations in contemporary governance. This is a key development which has as yet received less attention than it arguably deserves. With more and more non-state actors involved as participants in governance (Marres, 2011) – whether by contributing data voluntarily or involuntarily, responding to surveys, or exercising their rights as informed publics – this phenomenon may become a very significant factor in the production of the state and in the production of knowledge.

KEY CONTRIBUTIONS

One of the key strengths of the sociology of measurement and CDS is that they provide a multiscalar, poly-epistemic challenge to the contemporary datafication of governance on multiple fronts. They do so not from a luddite stance of rejecting new technologies of measurement, but from a technically informed critical understanding. Since the current explosion of measurement technologies has been embraced by a range of public and private actors, scholars from various disciplines, drawing on a variety of theoretical inspirations, are engaging in this emerging field. Focusing on the ontological (or productive, or performative) politics of datafication, sociologists of measurement are raising a series of important questions that challenge the narrative of progress and the claims of objectivity and lack of bias invoked in contemporary practices of measuring governance, as well as the hubris that big data heralds 'the end of theory' (Anderson, 2008). A significant aspect of the current moment is that ordinary citizens are not only co-opted as informed publics to participate in engaging with data; they are also subjected to involuntary participation in a range of data experiments and data-generation exercises. At issue is that citizens are drafted into unpaid labour to profit corporations and become targets for advertising or political propaganda and misinformation. These issues have led to new theorisation about what it means to be a participant in new experiments in participatory governance (Lezaun, Marres, & Tironi, 2016). Measurement technologies embed within them biases with effects not only for individuals but for groups as well as society at large. One key set of contributions of the sociology of measurement is that it highlights the various forms of inequity, exclusion and distortion wrought by the extensive use of data technologies in governance by institutions and states. Most studies in the sociology of measurement offer detailed empirical accounts of how data assemblages end up marginalising certain groups or perpetuating biases and inequities. The sociology of measurement thus offers an urgently required counterpoint to narratives of evidence-based, unbiased and fair governance made possible by the use of data and algorithmic decision-making.

STS has a long history of feminist critique, raising issues regarding the marginalisation of feminist concerns in the relations between science, technology and the state (see, for example, Haraway, 1991). Within the sociology of measurement, cyberfeminism raises issues of gender, womanhood, marginalisation and inequity in emerging technologies, including those deployed by various institutions and governments. It also offers possibilities for more inclusive and less exploitative cyberenvironments. The ways in which technologies of datafication reify, exclude or implicate race have also been a major concern within this field. Crawford (2021) and Benjamin (2019), for example, have highlighted how racial profiling, exclusion and other harms and violations occur in the deployment of tools such as facial recognition, which is extensively used in various aspects of government, such as law enforcement, national security, human resource management and human services. Data colonialism, data justice and data sovereignty are other concepts emerging in this field, challenging the accountability narratives offered by those in power. These issues span global infrastructures and assemblages of measurement, such as the measures used by the OECD or the World Bank, as well as by multinational corporations such as Twitter or Google. They incorporate such new fields as Indigenous rights in relation to datafication, data sovereignty, and decolonial approaches to data and governance. They deal with the changing nature of knowledge-making, including the effects of new forms of data production, crowd-sourced data and visualisation techniques, and the erosion of local, folk and Indigenous knowledges. They explore how technologies such as search engines may further the interests of some while suppressing those of others. In a world where the lines between the public and the private are blurred, they have engaged with emerging phenomena such as disaster capitalism and data philanthropy to highlight new risks and inequities arising through surveillance and datafication.

Since society itself is constantly changing, sociology finds itself requiring new conceptual and methodological approaches to study society. Advances in measurement and governance technologies have fundamentally altered society, producing new means of ordering and regulating, and new forms of citizen-state relations. Unprecedented challenges such as the climate crisis and the Covid-19 pandemic have further skewed these relations, with existential consequences for society. In the face of such fundamental shifts, the sociology of measurement has been able to keep renewing itself to remain relevant and effective in the contemporary social sciences. Its socio-technical approach to measurement and governance and its attention to the material have enabled it to engage in meaningful ways in a more-than-human world across a variety of fields. It has developed new, critical, empirical insights into how new technologies of measurement and governance are producing difference in society, giving rise to new inequalities of gender, race, class and ethnicity through the very means that purport to promote fairness and lack of bias. Most importantly, it has taken on a keenly critical stance, seeking not only to describe or criticise but to engage productively with the production of social orders, governance and measurement in contemporary times.

CONCLUSION

The sociology of measurement and, more recently, CDS are emergent fields which are learning on the go, as new technologies of datafication and measurement, as well as new developments in governance practices, present unexpected sites of importance.

Together they form a lively, interdisciplinary space which often uses the very technologies it critiques. The audience for these studies is not just the scholarly community or the policy and governance community, but people in every walk of life, since we are all entangled in these practices as citizens, consumers, producers and experimental participants. Beyond scholarly publications and grey literature, sociologists of measurement also use social media and traditional mass media – including documentaries, YouTube productions and TED Talks – to promote their ideas, contributing to novel methodologies of knowledge dissemination and new forms of impact and engagement. The nature, scope and speed of change in the use of technologies in governance pose ongoing theoretical and methodological challenges, as well as moral dilemmas. Theoretically, the issues arising from new developments in measuring governance are interdisciplinary – and despite declarations by governments, funding bodies and universities valuing interdisciplinary approaches, there are few university courses that are genuinely interdisciplinary, and there are many barriers to interdisciplinary research (Gorur, 2021). Moreover, issues of concern to sociologists of measurement spill across various fields, often involving complex fieldwork which can pose new methodological challenges. Governments and private corporations are not always eager to engage with researchers in relation to issues of datafication, making empirical work difficult. Despite calls for openness and sharing, many data operations remain the intellectual property of private entities or are simply opaque and inaccessible to researchers. Algorithms, including those used by governments, are often the intellectual property of private corporations and not made available for scrutiny. Despite this, there are more data, and more forms of data, publicly available today than at any other time in history. Technologies have also made it possible to employ traditional methods such as surveys and interviews easily, cheaply and in an environmentally friendly manner across the world. Dissemination can also be instant, with a wide reach through social media. Previously, as Latour (2010) has noted, quantitative research necessarily dealt with synoptic data or clustered information and patterns, since the numbers involved were so large that individual data were impossible to capture. Sociological investigations, on the other hand, often focused on smaller populations, without the luxury of very large amounts of data. If the key to sound sociological enquiry is to grasp not only the macro and the micro but the relations between them, then current technologies of big data might hold much promise, as they provide both large volumes of data (providing the contextual information) and highly detailed, individualised, intimate information. This may provide new ways to understand the relations between the actor and the network, giving rise to new methodologies and theories with which to study contemporary datafied societies and to new forms of regulation and governance.

REFERENCES

Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired Magazine, 16(7), 16 July.
Asdal, K. (2011). The office: The weakness of numbers and the production of non-authority. Accounting, Organizations and Society, 36(1), 1–9.
Ball, S.J., & Youdell, D. (2008). Hidden privatisation in public education. Education International. Retrieved from https://www.researchgate.net/profile/Deborah-Youdell2/publication/228394301_Hidden_privatisation_in_public_education/links/0a85e539232ed78325000000/Hidden-privatisation-in-public-education.pdf.
Bartlett, J., & Tkacz, N. (2017). Governance by dashboard: A policy paper. Demos. Retrieved from https://www.demos.co.uk/wp-content/uploads/2017/04/Demos-Governanceby-Dashboard.pdf.

Beck, S., Jasanoff, S., Stirling, A., & Polzin, C. (2021). The governance of sociotechnical transformations to sustainability. Current Opinion in Environmental Sustainability, 49, 143–52.
Bellamy, R., & Palumbo, A. (2017). From government to governance. Routledge.
Benjamin, R. (2019). Race after technology: Abolitionist tools for the new Jim Code. Polity Press.
Bowker, G.C., & Star, S.L. (2000). Sorting things out: Classification and its consequences. The MIT Press.
Braidotti, R. (2019). Posthuman knowledge (Vol. 2). Polity Press.
Callon, M., Lascoumes, P., & Barthe, Y. (2011). Acting in an uncertain world: An essay on technical democracy. MIT Press.
Castells, M. (2004). The network society (pp. 3–45). Edward Elgar.
Chamuah, A., & Ghildiyal, H. (2020). AI and education in India. Friedrich-Ebert-Stiftung.
Collins, C.S., & Wiseman, A.W. (2012). Education, development, and poverty: An introduction to research on the World Bank's education policy and revision process. In C.S. Collins & A.W. Wiseman (Eds.), Education strategy in the developing world: Revising the World Bank's education policy (Vol. 16, pp. 3–18). Emerald Group Publishing.
Crawford, K. (2021). The atlas of AI: Power, politics, and the planetary costs of artificial intelligence. Yale University Press.
Derksen, L. (2000). Towards a sociology of measurement: The meaning of measurement error in the case of DNA profiling. Social Studies of Science, 30(6), 803–45.
Desrosières, A. (1998). The politics of large numbers: A history of statistical reasoning. Harvard University Press.
Espeland, W.N., & Stevens, M.L. (2008). A sociology of quantification. European Journal of Sociology/Archives européennes de sociologie, 49(3), 401–36.
Fix, B., Nitzan, J., & Bichler, S. (2019). Real GDP: The flawed metric at the heart of macroeconomics. Real-world Economics Review, 88, 51–9.
Gorur, R. (2014). Towards a sociology of measurement in education policy. European Educational Research Journal, 13(1), 58–72.
Gorur, R. (2016a). Seeing like PISA: A cautionary tale about the performativity of international assessments. European Educational Research Journal, 15(5), 598–616.
Gorur, R. (2016b). The 'thin descriptions' of the secondary analyses of PISA. Education and Society, 37(136), 647–88.
Gorur, R. (2018a). Standards: Normative, interpretative, and performative. In S. Lindblad, D. Pettersson, & T.S. Popkewitz (Eds.), Education by the numbers and the making of society (pp. 92–109). Routledge.
Gorur, R. (2018b). Escaping numbers? Intimate accounting, informed publics and the uncertain assemblages of authority and non-authority. Science & Technology Studies, 31(4), 89–108.
Gorur, R. (2021). Opening the black box of peer review. In C. Addey & N. Piattoeva (Eds.), Intimate accounts of education policy research (pp. 62–76). Routledge.
Gorur, R., & Arnold, B. (2021). Governing by dashboard: Reconfiguring education governance in the Global South. In C. Wyatt-Smith, B. Lingard, & E. Heck (Eds.), Digital disruption in teaching and testing (pp. 166–81). Routledge.
Grek, S., Maroy, C., & Verger, A. (2020). Introduction: Accountability and datafication in education: Historical, transnational and conceptual perspectives. In World yearbook of education 2021 (pp. 1–22). Routledge.
Hacking, I. (2007). Kinds of people: Moving targets. Proceedings – British Academy (Vol. 151, pp. 285–318). Oxford University Press.
Haraway, D.J. (1991). A cyborg manifesto: An ironic dream of a common language for women in the integrated circuit. In S. Stryker & D. McCarthy Blackston (Eds.), The transgender studies reader remix (pp. 429–43). Routledge.
Hummel, P., Braun, M., Tretter, M., & Dabrock, P. (2021). Data sovereignty: A review. Big Data & Society, 8(1). https://doi.org/10.1177/2053951720982012.
Jasanoff, S. (Ed.) (2004). States of knowledge. Taylor & Francis.
Kitchin, R., & Lauriault, T. (2018). Towards critical data studies: Charting and unpacking data assemblages and their work. In J. Eckert, A. Shears, & J. Thatcher (Eds.), Geoweb and big data (pp. 3–20). University of Nebraska Press.

Kitchin, R., Lauriault, T., & McArdle, G. (2015). Knowing and governing cities through urban indicators, city benchmarking and real-time dashboards. Regional Studies, Regional Science, 2(1), 6–28.
Knorr-Cetina, K.D. (1991). Epistemic cultures: Forms of reason in science. History of Political Economy, 23(1), 105–22.
Landri, P. (2022). Waves of standardisation. In H. Riese, L.T. Hilt, & G.E. Søreide (Eds.), Educational standardisation in a complex world (pp. 25–42). Emerald Publishing.
Latour, B. (1987). Science in action: How to follow scientists and engineers through society. Harvard University Press.
Latour, B. (2010). Tarde's idea of quantification. In M. Candea (Ed.), The social after Gabriel Tarde (pp. 161–78). Routledge.
Latour, B. (2012). We have never been modern. Harvard University Press.
Law, J. (2009). Actor network theory and material semiotics. The new Blackwell companion to social theory, 3, 141–58.
Lezaun, J., Marres, N., & Tironi, M. (2016). Experiments in participation. The Handbook of Science and Technology Studies, 4, 195–221.
Marres, N. (2011). The costs of public involvement: Everyday devices of carbon accounting and the materialization of participation. Economy and Society, 40(4), 510–33.
Mayer-Schönberger, V., & Cukier, K. (2013). Big data: A revolution that will transform how we live, work, and think. Houghton Mifflin Harcourt.
Merry, S.E. (2016). The seductions of quantification. University of Chicago Press.
Nehring, D., & Plummer, K. (2014). Sociology: An introductory textbook and reader. Routledge.
Nichols, S.L., & Berliner, D.C. (2007). Collateral damage: How high-stakes testing corrupts America's schools. Harvard Education Press.
Parsons, W. (2002). From muddling through to muddling up – evidence based policy making and the modernisation of British Government. Public Policy and Administration, 17(3), 43–60.
Porter, T.M. (1996). Trust in numbers. Princeton University Press.
Porter, T.M. (2012). Thin description: Surface and depth in science and science studies. Osiris, 27(1), 209–26.
Power, M. (2000). The audit society – second thoughts. International Journal of Auditing, 4(1), 111–19.
Quinless, J.M. (2022). Decolonizing data: Unsettling conversations about social research methods. University of Toronto Press.
Rose, N. (1991). Governing by numbers: Figuring out democracy. Accounting, Organizations and Society, 16(7), 673–92.
Ruppert, E.S. (2008). 'I Is; Therefore I Am': The census as practice of double identification. Sociological Research Online, 13(4), 69–81.
Scott, J.C. (1998). Seeing like a state: How certain schemes to improve the human condition have failed. Yale University Press.
Shore, C., & Wright, S. (2015). Audit culture revisited: Rankings, ratings, and the reassembling of society. Current Anthropology, 56(3), 421–44.
Smyth, J.A. (2008). The origins of the international standard classification of education. Peabody Journal of Education, 83(1), 5–40.
Strathern, M. (2000). Audit cultures (Vol. 146). Routledge.
Taylor, L. (2017). What is data justice? The case for connecting digital rights and freedoms globally. Big Data & Society, 4(2), 2053951717736335.
Van Es, K., & Schäfer, M.T. (2017). The datafied society: Studying culture through data. Amsterdam University Press.
Woolgar, S. (1991). Beyond the citation debate: Towards a sociology of measurement technologies and their use in science policy. Science and Public Policy, 18(5), 319–26.

8. Governmentality and the measuring of governance
Peter Triantafillou

INTRODUCTION

Over the last three decades or so, the concept of governmentality has informed an increasing number of highly insightful studies of the relationship between measuring and the art and practices of governing. The term governmentality was coined in 1978 by the French historian and philosopher Michel Foucault in his lecture series at the Collège de France (Foucault, 2007). He used it to analyze how shifting forms of (secular) thinking had informed the governing of states and their populations in Western Europe since the Renaissance. The aim of Foucault's lectures on governmentality was twofold: to produce a more adequate understanding than that provided by traditional political science of how power is exercised in modern states, and to develop a critical analysis of this power that went beyond the various Marxist interpretations fashionable at the time (the 1970s). Initially, the term received little attention in social and political academic circles. Yet, from the early 1990s a number of Anglophone scholars started to employ it (Dean, 1994, pp. 174–93; Rose, 1996a; Rose & Miller, 1992), not least thanks to the edited volume The Foucault Effect: Studies in Governmentality (Burchell et al., 1991).

The neologism governmentality is a contraction of two terms: government and mentality or rationality (Foucault, 2007, p. 115). Government designates the 'conduct of conduct', that is, the art of conducting the ways in which others conduct themselves within a more or less open field of possibilities (Foucault, 1982, pp. 220–21). Foucault was not very clear about whether the second part of the term referred to mentality or to rationality. However, in the reflections that he and many of his followers have produced, rationality, or rationalities of government, is the preferred term. The term rationality refers not to abstract principles, ideologies or worldviews (mentalities?) but to the concrete reflections, calculations, tactics and ways of reasoning about how best to govern a state territory and, not least, the wealth and wellbeing of the population inhabiting this territory (Foucault, 1991, pp. 78–82, 2007, p. 108). Governmentality, governmental rationality or political rationality is thus a set of means-ends calculations about how best to govern a state. Such calculations draw on various forms of knowledge or theories that seek to provide secular truth about how states are, can be and, in many cases, ought to be governed. Economics and political science are common disciplinary foundations, amongst others.

Since the 1990s, we have seen a surge of governmentality studies that are in more or less direct ways inspired by Foucault's wider analytics of power-knowledge (for an overview, see Triantafillou, 2016). While the topics of these governmentality studies vary greatly, they all in some way examine how knowledge and power interrelate in ways that form active subjects, such as persons or organizations. Accordingly, the aim of these studies is not to create yet another theory about power and subjectivity, but to scrutinize how certain forms of knowledge

came to be regarded as truthful and enabled the exercise of power in general, and government in particular, in new ways. The type of critical inquiry exercised in these analyses is not the kind of more or less sophisticated critique of ideology that we find in Marxist theory, the Frankfurt School or recent French praxis theory, such as that of Bourdieu (Lemke, 2002, pp. 62–6). Instead, it is a perspectivist critique that tries to illuminate how the kinds of knowledge that we take for granted as epistemologically truthful or normatively desirable exclude certain ways of enacting our freedom (Owen, 2002).

The term governmentality has also found its way into the study of how measuring is linked to the art and practice of governing. As we shall see below, governmentality studies are interested not so much in how to measure governance as in how various ideas, techniques and schemes of measuring are enabling, transforming and at times contesting the art and practice of governing. In general terms, then, the aim of these studies is two-fold: to understand or render intelligible the emergence and functioning of measuring in the exercise of power, and to critically interrogate the effects this kind of power has on our way of thinking and acting, that is, on our freedom.

THE FIELD OF GOVERNMENTALITY STUDIES AND GOVERNANCE MEASURING

The field of governmentality studies of governance measuring is heavily indebted to Foucault's conceptual and analytical approach. Hence, it is necessary to provide a brief account of his engagement with the notion of governmentality and the specific power of government. Foucault studied the role of quantifying techniques and knowledge in the area of biopower and statecraft. He paid particular attention to the role of epidemiology, that is, the statistical knowledge of the health and vigour of the population, as the source of the new art of government (Foucault, 2007, pp. 55–67). He showed how the new knowledge of epidemiology, together with the invention of mass vaccination, enabled states to move away from exclusive reliance on sovereign power (dictating behaviour) and disciplinary power (seeking to make individuals abide by a standard code of conduct). Instead, it enabled power to be exercised in terms of calculated risk levels. Instead of trying to handle epidemic diseases with a view to eradicating them (as contemporary China seems intent on doing with COVID-19) or by fully controlling people's movements (via forced quarantine, isolation or vaccination), the new form of power, government, worked through various regulatory measures seeking to curb the spread of a virus to a degree at which incidence rates, hospitalization rates and mortality rates oscillate within a politically acceptable range. Such regulatory measures include restrictions (but not bans) on movement and travelling, voluntary vaccination, incentives to work from home, mask wearing in public indoor spaces, and so on.

With the French physiocrats and the Scottish political economists, a new rationality of government emerged, classical liberalism, characterized by the following problem: How can the state govern civil society without undermining the virtues and self-governing capacities of the latter? The general and always incomplete answer to that question would be to govern through the freedom and the self-steering capacities of markets, organizations, citizens, etc. Many of the indirect governing mechanisms invented during the twentieth century, such as contracts, schooling, campaigns, economic incentives, nudging and performance management, are predicated on techniques of quantification and schemes of measurement.
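The logic of governing through calculated risk levels rather than eradication, described above, can be caricatured in a few lines of code. The incidence bands and the measures attached to them are invented for illustration and do not correspond to any actual public health regime:

```python
# Illustrative sketch only: the bands and measures are invented and do not
# correspond to any actual public health regime.
def regulatory_response(incidence_per_100k: float) -> str:
    """Graduated measures keyed to a politically acceptable band, not to eradication."""
    if incidence_per_100k < 20:
        return "no restrictions"
    if incidence_per_100k < 100:
        return "voluntary measures: masks recommended, working from home encouraged"
    return "restrictions (but not bans): limits on gatherings and travel"

for rate in (5, 60, 250):
    print(rate, "->", regulatory_response(rate))
```

The point of the caricature is that the governing intervention is indexed to a measured rate oscillating within tolerated bands, rather than commanding behaviour directly.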

Ian Hacking has shown how a revolution in statistical techniques and probabilistic knowledge made a range of new economic, social and political phenomena visible and, potentially, governable in wholly new ways (Hacking, 1990). With the invention of statistical concepts, such as the standard normal distribution, and a gradual avalanche of quantitative surveys, phenomena like disease, suicide, crime, industrial accidents and unemployment were no longer regarded as matters of individual intention or action, but as regularly occurring social phenomena that were impossible to eradicate, yet nonetheless susceptible to regulation (Donzelot, 1984; Ewald, 1986). Moreover, as these measurement techniques sprang up in military academies and civil schools at the beginning of the nineteenth century, where they gradually replaced former oral modes of assessment, the physical character of soldiers, intelligence, and the academic performance of students became the object of systematic written recording and normalizing comparison (Hoskin, 1996; Hoskin & Macve, 1994). The new measurement systems changed not only the ways in which the character and performance of individuals and organizations were assessed, but also how the students, soldiers and inmates of these institutions were governed.

The more recent role of performance measuring in enabling various and often novel forms of governance has been examined in a wide range of areas. Governmentality studies have focused on the multifarious ways in which measuring is implicated in the governing of individuals, organizations, urban areas, populations (of states) and states. Studies of the role of measuring in the governing of individuals include human resource management (Power, 2004; Weiskopf & Munro, 2011), public crime prevention (Garland, 1997; Stenson, 1999) and employment policy (Grundy, 2015; Triantafillou, 2009, 2011). Most recently, we have seen governmentality studies of the emerging algorithmic logics and techniques in the design and governing of social services (Henman, 2021; Lam, 2022). A very large body of studies focuses on the ways in which measuring is linked to the governing of organizations. These include, for instance, public sector auditing (Gendron et al., 2007; Triantafillou, 2015), the managing of schools (Niesche, 2014) and prisons (Guter-Sandu & Mennicken, 2022) and, perhaps not so surprisingly, the governing of higher education staff and institutions (Cannizzo, 2015; Engebretsen et al., 2012; Morrissey, 2013; Shore, 2008). An increasing number of studies examine how measuring is linked to the governing of urban areas. Some of these studies address participatory planning (Rosol, 2014) and the governing of smart cities (Argento et al., 2020). We also find many studies of how measurement-government regimes target populations. They include, for example, public health and lifestyle governance (Evans & Colls, 2009; Henderson, 2015; Light, 2001; Lupton, 2013) and the governing of pandemics (Miller, 2022; Triantafillou, 2022). Finally, we find many interesting studies of the ways in which techniques of measurement are employed in the attempt to govern states. Some have studied the many attempts to spur the democratic and good governance of states (Hansen & Mühlen-Schulte, 2012; Leifert, 2014; Löwenheim, 2008). Others have looked at how benchmarking figures in the European Union's (EU) attempts to govern its member states in policy areas where the states retain sovereignty (Hansen & Triantafillou, 2011).
Taken together, these studies show just how influential and productive the notion of governmentality has been for analyzing the relationships between measuring and governance. The term has permeated a wide range of social and organizational sciences. One of the strong points of these studies is that they cross established academic disciplinary boundaries; in addition, they go beyond the disciplines by asking partly new questions and by answering these through innovative conceptual frameworks. The downside of this diversity is that the

studies differ significantly not only regarding the object of their study, but also regarding the ways in which they use the concept of governmentality and their wider methodological approach. This is the topic of the following section.

CONCEPTS, ANALYTICAL STRATEGIES AND ARGUMENTS IN GOVERNMENTALITY STUDIES
As suggested above, the key concept of governmentality denotes means-ends calculations and reflections about how best to govern a state, a society, a people, the economy or an organization. Foucault focused on means-ends calculations informed by relatively coherent and accepted bodies of scholarly knowledge, such as political economy, welfare economics or neoclassical economics (Gordon, 1991). The ends are about the vision of governing, its telos, which could involve maximizing the wealth of a state, the happiness and wellbeing of a population, the freedom of individuals, or the efficiency of markets or an organization. The means, also dubbed technologies of government (Rose & Miller, 1992), include the techniques, instruments, schemes and procedures implied in governing. Statistics, performance measurement systems, examination records and other instruments of measuring governance are all examples of technologies of government. Such technologies are not simple extensions of human intentionality or rationalities of government. Technologies are always informed by rationalities, but never determined by them. Conversely, governmental rationalities are always already inscribed in some kind of practical, technical activity (Dean, 1996; Foucault, 1991). Several distinct governmentalities, such as raison d'état, mercantilism, Polizeiwissenschaft, classical liberalism and neoliberalism, have been identified (Foucault, 2007). The latter governmentality, neoliberalism, has received particularly strong attention. Other scholars have added to these governmentalities social welfarism (Procacci, 1991; Rose & Miller, 1992), authoritarianism (Dean, 2002; Sigley, 2007) and authoritarian liberalism (Bonefeld, 2017). This is all well and good, as there is no reason to assume that Foucault was able to capture the entirety of governmentalities. Yet not all the recent academic constructs seem to satisfy the Foucauldian-inspired understanding of governmentality suggested above. For instance, it is not clear what kind of coherent and accepted body of knowledge underpins affective governmentality (Sauer & Penz, 2017) or green governmentality (Rutherford, 2016). Paul Henman has given careful consideration to the term algorithmic governmentality (Henman, 2021). Yet this governmentality seems to have an overly general telos or vision, namely the ability to better govern the present based on big data analysis and predictions of the future. This is not the place to go into a long conceptual discussion about the notion of governmentality. Suffice it to note that while Foucault was quite vague on how to define governmentality (Biebricher & Vogelmann, 2012) and insisted that analytical concepts be defined and redefined to suit current analytical and political purposes (Foucault & Deleuze, 1977), it is obviously problematic if the concept of governmentality is used to mean completely different things and is applied for entirely different purposes. In this chapter, the term broadly follows Foucault's understanding outlined above. Governmentality studies do not share a common analytical strategy or methodology. This has partly to do with their aim of producing critical accounts of the ways in which power-knowledge (measuring) regimes emerge and shape our freedom (Hansen & Triantafillou, 2022). Many, but not all, governmentality studies are designed to produce

critical, albeit logically coherent and well-documented, descriptive accounts. Conversely, they rarely seek to tease out and test causal relations. Accordingly, most governmentality studies of the role of measuring in governance are qualitative studies of one or more cases. They typically rely heavily on document analyses to bring forth the elaborate means-ends reflections found in the policy documents and expert reports underpinning and justifying the use of measurement systems in governance. This focus on policy documents and various expert reports has sparked a debate about whether governmentality studies tend to simply reproduce official policy discourse and overlook the messiness of programme implementation (O'Malley et al., 1997). The critique of governmentality studies as focusing too narrowly on rationalities and discourses at the expense of techniques and practices of governing – and, by the same token, measuring – may have to do with the fact that Foucault only dealt with governmentality in his lectures, not by way of in-depth genealogical analyses (Biebricher, 2008). I think this critique has a certain validity for the general corpus of governmentality studies. However, the problem does not really apply to governmentality studies of the links between measuring and governance. Here there is usually acute attention to the technical and practical ways in which measuring is employed in the art of governing. Nevertheless, the critique may have inspired the use of interviews and observations in governmentality studies in order to access the micro-operations of measurement systems and how target groups, such as public sector employees and citizens, interact with these (Brady, 2014). If there is a methodological bias in governmentality studies of measuring governance, it has less to do with a lack of attention to techniques and practices, and more to do with their often ahistorical character. Apart from a few very important exceptions (see the section on contributions below), most studies tend to focus on the present. A key argument in governmentality studies is that the art and practices of governing constitute a distinct form of modern power. We noted that government designates the conduct of conduct. This mode of power entails structuring the field of possibilities within which the governed governs herself. By implication, government is not an institution (the government) but a form of power that assumes that the subject over whom power is exercised has a significant level of choice or liberty to conduct herself. Government may thus be distinguished from other modern forms of power, such as sovereignty, discipline and biopower (Foucault, 2007; Lemke, 2011). Sovereignty may be seen as the very antithesis of government in that the former entails decrees or laws that aim to directly control the behaviour of a subject: thou shall (not). In contrast, the exercise of government is conditioned on the freedom of those over whom it is exercised, and it rarely emanates from a centre (a monarch or a government ministry). These latter features are shared with discipline and biopower. Disciplinary power seeks to create the civilized and productive individuals found in factories, schools, prisons and army barracks via constant recording, measuring and surveillance.
Government could be and is currently used in all these settings, but the rationale is not to make individuals behave in accordance with a more or less fixed norm, but rather to make them exercise their freedom within a wider space of possible but still desirable conduct (Foucault, 2007, p. 57). Finally, while government was historically linked to biopower's regulatory mechanisms seeking to secure and augment the biological quality of the population, government today is employed to serve a much wider set of purposes (Miller & Rose, 2008). In order to understand the role of measurement in governmentality studies, we need to recognize that the exercise of government, like that of discipline and biopower, is predicated

on more or less coherent and authoritative forms of knowledge. With the notion of regime of truth, Foucault emphasized that modern forms of power and knowledge are closely interlinked, though the one cannot be reduced to the other (Foucault, 1980, p. 131). Power may obviously shape the production of knowledge via the allocation of research funding, the creation of ministerial commissions and reporting, the establishment of statistical institutions, etc. This is not a novel insight. Foucault's innovation was to address how the production of knowledge enables, legitimizes and structures the exercise of certain forms of power – at the expense of other forms. For instance, modern psychiatric and psychological forms of mental therapy required detailed recording of the individual's childhood, social relations, occurrences of illness, responses to previous treatments, etc. (Rose, 1996b). In brief, what is interesting about knowledge is its performative dimension, not its truthfulness. To stay with the example, the point is not whether contemporary psychiatry is more (or less) true or scientific than earlier discourses of mental illness, but the – more or less – new forms of governing that it enables. Similarly, the analytical point about the measuring of governance is not whether the measuring represents a fair, adequate or true representation of the measured object, but how it renders that object visible and governable. Governmentality studies of measuring would obviously agree with the motto that 'what's measured is what matters' (Bevan & Hood, 2006). Yet governmentality studies would go further in at least three ways: constructivism, governability and plurality. Firstly, many scholars in the field maintain that measuring works to constitute the object that it measures (see Chapter 4; Dahler-Larsen, 2014; Power, 1996). For instance, the modern business organization, with its distinct operating units managed by a hierarchy of executives, was predicated on the emergence of new accounting systems (Hoskin & Macve, 1994, pp. 80–83). The point is not that these objects, such as the modern business organization, are somehow created out of nowhere, but that the way in which objects are understood, assessed and managed changes significantly when they are measured in new ways. Secondly, quantified measuring implies a given (normative) yardstick for undertaking the measuring and is therefore always predisposed to assess objects in one particular way, rather than others (Hoskin, 1996, p. 265). The ways in which an object is constituted and (quantitatively) measured through a fixed yardstick imply that the object may be governed in some ways rather than others. For instance, it is much more obvious for the employment services to pursue a work-first strategy, rather than a human capital strategy, if the performance of these services is measured in terms of the number of job placements within a certain period of time (Triantafillou, 2011). Finally, in contrast to oral assessments and various other qualitative assessments, numerical measuring imparts a certain sense of objectivity and impartiality (Porter, 1995). Thus, even if quantified measuring is predisposed to assess objects in one particular way, it is exactly because of this objectivity that quantitative measuring lends itself to a plurality of often contradictory interpretations and political purposes. That is, the objectivity of measures is both what creates a certain trust in them and what opens the potential for a plurality of uses.
Apart from these rather general arguments about the relationship between measuring and governance, it has been argued more specifically that the last three decades or so have brought about a situation of reflexive government (Dean, 1999, p. 193). What is meant by this is not that current forms of governance are more pensive or well considered, but that the contemporary quest for measuring governance constitutes a historically specific governmental ambition whereby measuring is employed as a tool to design, evaluate, improve and ultimately govern governance. Thus, under the heading of New Public Management and neoliberal government

more generally, government turns back upon itself (Triantafillou, 2017). Measuring is used not only in the act of governing civil society, the economy, public health, citizens, etc. It is also used to govern the activities of government itself. By the same token, accreditation, balanced scorecards, benchmarking, user satisfaction surveys and various other systems of performance measuring are used not (only) to govern society, but to assess and govern the actions of public authorities and organizations.

KEY CONTRIBUTIONS OF GOVERNMENTALITY STUDIES OF GOVERNANCE MEASURING
Governmentality studies have provided many important contributions to our understanding of the emergence, dynamics and power effects of governance measuring in contemporary societies. The type of contributions provided by governmentality studies obviously relates to their aims and overall methodological approach. Many, though far from all, of the subsequent studies of governmentality and measuring by other scholars aim to produce a more adequate understanding of how power is exercised in modern states and organizations than that provided by traditional political science or public administration, and to critically analyze this power. Regarding the methodological approach, it may be noted that few if any governmentality studies engage in rigorous causal analysis. Rather than trying to pin down social phenomena as discrete (independent and dependent) variables, governmentality studies are descriptive accounts guided by 'how' questions. How, for instance, did a contemporary governmentality emerge? How – in what ways – is it informed by various bodies of knowledge and measurement regimes? How – by what techniques – does measuring take place? And so on. As noted above, the studies of the relationship between measuring and the rationalities and technologies of government cover a very wide range of topics. The contributions provided by this very diverse set of studies obviously reflect the variations in their specific objects of study. Therefore, any attempt to account for their contributions cannot do full justice to the studies. At any rate, I will try to distil a mix of general contributions and some more issue-specific ones. At the most general level, it is possible to identify three distinct contributions. Firstly, governmentality studies have provided a historically sensitive understanding of measuring and governance. They have improved our understanding of the emergence and the often rather transient character of the link between governance and measuring by empirically demonstrating the historical specificity of these links. This is particularly clear in Hoskin's studies of accounting, Hacking's analyses of statistics, and Donzelot's and Ewald's studies of the link between social statistics and the governing of industrial and employment relations (see above). These studies are clearly indebted to Foucault's genealogical approach, which seeks to trace the historical formation and transformation of power-knowledge relations (Foucault, 1986). Apart from this historical sensitivity, there is also increasing attention to the spatial, or rather societal, specificity of the links between government and measuring. While many studies still focus on the Anglophone countries, we find an increasing number of both single-country and comparative studies covering a wide range of Central and Northern European countries (Triantafillou, 2012, pp. 92–168), and East and South East Asian countries (Chenchen, 2018; Lam, 2022; Triantafillou, 2002).

Secondly, governmentality studies have shed light on often overlooked forms of power (governing) enabled by measuring. These studies illuminate how power is exercised in the attempt to create productive citizens and how it is often predicated on the nurturing, shaping and structuration of the ways in which individuals and organizations exercise their freedom (Cannizzo, 2015). At the same time, many studies also demonstrate that measuring governance is a dividing practice that distinguishes between good and poor performers, and links to powers of exclusion and the sanctioning of those deemed unfit, unqualified or in other ways not performing as expected (Hansen & Mühlen-Schulte, 2012). By the same token, these studies have also shown that the indirect forms of power invoked by government may at times work in close tandem with other, direct and coercive, forms of power (Light, 2001). Thirdly, governmentality studies have provided new insights into the technical underpinnings of measuring and governance. The attention to the technical dimension of measuring governance, an analytical feature shared with Science and Technology Studies (Hackett et al., 2008), has not only improved our understanding of how measuring is linked to governance, but has also sharpened the critical analysis of the power-freedom relations implied by measuring practices. It is not that there is a lack of critical studies of the performance measuring of governance, but many of these studies tend to focus either on the technical and methodological problems of measuring or on the political ideas and discourses (e.g. Thomas & Hewitt, 2011). By shedding light both on the technical dimension and on the kind of knowledge and rationalities informing government (rather than general political ideas and discourses), many governmentality studies have provided very sharp accounts of the ways in which the choice of particular measuring techniques engenders new forms of power. Apart from these general contributions, governmentality studies have also provided many insights that are more specific to the particular empirical object of study. In the following, I outline three specific contributions. Firstly, Hoskin and Macve have demonstrated the importance of the written and numerical reporting of performance and its implications for governance. In their studies of the introduction of new grading systems at the US West Point Academy for the training of military officers and the accounting systems at the Springfield Armory, they show how these measuring systems enabled new ways of managing the organizations and the people inhabiting them. In the former case, the new examination system produced detailed knowledge about the individual performance of the cadets that allowed the commander of the school to dismiss or promote cadets. In the case of the Springfield Armory, the detailed knowledge about the costs and incomes of individual units and departments allowed central directors to hold lower-level department heads accountable and, partly, responsible for profits. In both cases, the new measurement systems enabled the emergence of genuinely modern forms of organizational management.
It is worth quoting Hoskin and Macve at length here: Weber’s high bureaucracies are giving way to accounting-led organisations which both give more space for calculative individual initiative and locate the power over individuals in more dynamic flexible control systems which ultimately empower them to discipline their selves in a constant play of accountability and responsibility. (Hoskin & Macve, 1994, p. 91)

Secondly, quite a few governmentality studies have shown how technologies and systems of performance measuring have enabled the governing of states in accordance with what may seem like rather lofty political ideals. One set of studies has examined how the ideals and norms of liberal democracy espoused by, for instance, Freedom House (Leifert, 2014) or the ideals of good governance emanating from the OECD (Hansen & Mühlen-Schulte, 2012) are used

to govern states. Another set of studies has looked at the ways in which social justice and sociological theories are employed in the EU's employment strategy with a view to governing member states' employment policy (Hansen & Triantafillou, 2011). Both sets of studies demonstrate how the measuring technology of benchmarking is instrumental in rendering quite abstract political and social ideals amenable to the governing of states. Obviously, these attempts to govern sovereign states tend to fail on their own terms, but they nevertheless spark ongoing debates, both outside and inside the benchmarked countries, on how to avoid under-performance in these assessments (Triantafillou, 2009). Thirdly and finally, governmentality studies have started examining the ways in which big data analytics, and the algorithms supporting these, are clearing new spaces for the governing of a wide range of social and political issues. Given increasing digitization and computing power and the political aspirations linked to the potential wonders of big data, it is very timely that governmentality scholars have begun to critically examine the use of big data analytics in governing social problems. The work of Paul Henman and others on liberal democracies (Henman, 2021; Schuilenburg & Peeters, 2021), and the studies of the Chinese social credit system (Lam, 2022; Zhang, 2020), demonstrate that the new algorithmic technologies resonate worryingly well with both neoliberal and authoritarian governmentalities.

CONCLUSION
Governmentality studies have produced many important insights. It is not only the concept of governmentality, but also the associated analytical concepts and strategies developed by Foucault and subsequent scholars, that have proved highly fruitful for scholars of the role of measuring in governance. The studies of the relationship between measuring and governance have proved remarkably apt at addressing rationalities and bodies of knowledge, on the one hand, and their interaction with the techniques and practices of governing and measuring, on the other. Thus, unlike many other studies applying the notion of governmentality, the studies surveyed in this chapter avoid reducing the measuring of governance to rationalities or discourses of government. Still, there is room for improvement. Too many governmentality studies, in my view, lack conceptual, analytical or critical ambitions. Merely invoking the term governmentality no longer provides any theoretical novelty to the analysis of the measuring of governance. Conceptually, we need more work on how to grasp the diversity within a particular governmentality and how to distinguish between governmentalities. With some notable exceptions (Henman, 2021), few studies examine the interaction of plural governmentalities. Instead, most studies tend to focus exclusively on neoliberalism. Analytically speaking, there are good reasons to focus on neoliberal governmentalities given their prevalence in liberal democracies and elsewhere since the 1980s. Yet we need more analytical innovation that goes beyond accounting for the complicity of neoliberalism in the measuring of governance. This does not necessarily imply a return to Foucault-style genealogical analysis. Yet there is a need for analytical innovation in order to better understand the historical emergence of, and polyvalent relationship between, contemporary measuring systems and governance. Finally, governmentality studies – at their best – have demonstrated a strong capacity to critically address how measuring is predicated on certain forms of power that at once enable some freedoms and curtail

others. It is this critical interrogation of power that distinguishes governmentality studies from many other fine studies of measuring systems, such as Science and Technology Studies. Governmentality studies need to retain an open mind and keep grinding their analytical tools to come up with novel and critical analyses of the power relations embedded in the measuring of governance.

REFERENCES
Argento, D., Grossi, G., Jääskeläinen, A., Servalli, S., & Suomala, P. (2020). Governmentality and performance for the smart city. Accounting, Auditing and Accountability Journal, 33(1), 204–32.
Bevan, G., & Hood, C. (2006). What's measured is what matters: Targets and gaming in the English public health care system. Public Administration, 84(3), 517–38.
Biebricher, T. (2008). Genealogy and governmentality. Journal of the Philosophy of History, 2, 363–96.
Biebricher, T., & Vogelmann, F. (2012). Governmentality and state theory: Reinventing the reinvented wheel? Theory & Event, 15(3).
Bonefeld, W. (2017). Authoritarian liberalism: From Schmitt via ordoliberalism to the Euro. Critical Sociology, 43(4–5).
Brady, M. (2014). Ethnographies of neoliberal governmentalities: From the neoliberal apparatus to neoliberalism and governmental assemblages. Foucault Studies, 18.
Burchell, G., Gordon, C., Miller, P., & Foucault, M. (1991). The Foucault effect: Studies in governmentality, with two lectures by and an interview with Michel Foucault. Harvester Wheatsheaf.
Cannizzo, F. (2015). Academic subjectivities: Governmentality and self-development in higher education. Foucault Studies, 20, 199–217.
Chenchen, Z. (2018). Governing neoliberal authoritarian citizenship: Theorizing hukou and the changing mobility regime in China. Citizenship Studies, 22(8), 855–81.
Dahler-Larsen, P. (2014). Constitutive effects of performance indicators: Getting beyond unintended consequences. Public Management Review, 16(7), 969–86.
Dean, M. (1994). Critical and effective histories: Foucault's methods and historical sociology. Routledge.
Dean, M. (1996). Putting the technological into government. History of the Human Sciences, 9(3), 47–68.
Dean, M. (1999). Governmentality: Power and rule in modern society. Sage.
Dean, M. (2002). Liberal government and authoritarianism. Economy and Society, 31(1), 37–61.
Donzelot, J. (1984). L'invention du social. Fayard.
Engebretsen, E., Heggen, K., & Eilertsen, H.A. (2012). Accreditation and power: A discourse analysis of a new regime of governance in higher education. Scandinavian Journal of Educational Research, 56(4), 401–17.
Evans, B., & Colls, R. (2009). Measuring fatness, governing bodies: The spatialities of the Body Mass Index (BMI) in anti-obesity politics. Antipode, 41(5), 1051–83.
Ewald, F. (1986). L'Etat-Providence. Grasset.
Foucault, M. (1980). Truth and power. In C. Gordon (Ed.), Michel Foucault: Power/Knowledge (pp. 109–33). Harvester Wheatsheaf.
Foucault, M. (1982). The subject and power. In H.L. Dreyfus & P. Rabinow (Eds.), Michel Foucault: Beyond structuralism and hermeneutics (pp. 208–26). The Harvester Press.
Foucault, M. (1986). Nietzsche, genealogy, history. In P. Rabinow (Ed.), The Foucault reader. Penguin Books.
Foucault, M. (1991). Questions of method. In G. Burchell, C. Gordon, & P. Miller (Eds.), The Foucault effect (pp. 73–86). Harvester Wheatsheaf.
Foucault, M. (2007). Security, territory, population: Lectures at the Collège de France, 1977–1978. Palgrave Macmillan.
Foucault, M., & Deleuze, G. (1977). Intellectuals and power: A conversation between Michel Foucault and Gilles Deleuze. In D.F. Bouchard (Ed.), Language, counter-memory, practice (pp. 205–17). Cornell University Press.

Garland, D. (1997). 'Governmentality' and the problem of crime: Foucault, criminology, sociology. Theoretical Criminology, 1(2), 173–214.
Gendron, Y., Cooper, D.J., & Townley, B. (2007). The construction of auditing expertise in measuring government performance. Accounting, Organizations and Society, 32(1–2), 101–29.
Gordon, C. (1991). Governmental rationality: An introduction. In G. Burchell, C. Gordon, & P. Miller (Eds.), The Foucault effect (pp. 1–52). Harvester Wheatsheaf.
Grundy, J. (2015). Performance measurement in Canadian employment service delivery, 1996–2000. Canadian Public Administration, 58(1), 161–82.
Guter-Sandu, A., & Mennicken, A. (2022). Quantification = economization? Numbers, ratings and rankings in the prison service of England and Wales. In A. Mennicken & R. Salais (Eds.), The new politics of numbers (pp. 307–36). Palgrave Macmillan.
Hackett, E.J., Amsterdamska, O., Lynch, M.E., & Wajcman, J. (2008). Science and Technology Studies and an engaged program. In E.J. Hackett, O. Amsterdamska, M. Lynch, & J. Wajcman (Eds.), The handbook of Science and Technology Studies (3rd ed.). The MIT Press.
Hacking, I. (1990). The taming of chance. Cambridge University Press.
Hansen, H.K., & Mühlen-Schulte, A. (2012). The power of numbers in global governance. Journal of International Relations and Development, 15(4), 455–65.
Hansen, M.P., & Triantafillou, P. (2011). The Lisbon Strategy and the alignment of economic and social concerns. Journal of European Social Policy, 21(3), 197–209.
Hansen, M.P., & Triantafillou, P. (2022). Methodological reflections on Foucauldian analyses: Adopting the pointers of curiosity, nominalism, conceptual grounding, and exemplarity. European Journal of Social Theory. https://doi.org/10.1177/13684310221078926.
Henderson, J. (2015). Michel Foucault: Governmentality, health policy and the governance of childhood obesity. In F. Collyer (Ed.), The Palgrave handbook of social theory in health, illness and medicine (pp. 324–39). Palgrave Macmillan.
Henman, P. (2021). Governing by algorithms and algorithmic governmentality: Towards machinic judgement. In M. Schuilenburg & R. Peeters (Eds.), The algorithmic society: Technology, power and knowledge. Routledge.
Hoskin, K. (1996). The awful idea of accountability: Inscribing people into the measurement of objects. In R. Munro & J. Mouritsen (Eds.), Accountability: Power, ethos & the technologies of managing (pp. 265–82). International Thomson Business Press.
Hoskin, K., & Macve, R. (1994). Writing, examining, disciplining: The genesis of accounting's modern power. In A. Hopwood & P. Miller (Eds.), Accounting as a social and institutional practice (pp. 67–97). Cambridge University Press.
Lam, T. (2022). The people's algorithms: Social credits and the rise of China's big (br)other. In A. Mennicken & R. Salais (Eds.), The new politics of numbers (pp. 71–95). Palgrave Macmillan.
Leifert, C.L. (2014). Indicating power: A Foucauldian analysis of Freedom House's democracy index. Political Perspectives, 8(2), 1–10.
Lemke, T. (2002). Foucault, governmentality, and critique. Rethinking Marxism, 14(3), 49–64.
Lemke, T. (2011). Biopolitics: An advanced introduction. New York University Press.
Light, D.W. (2001). Managed competition, governmentality and institutional response in the United Kingdom. Social Science & Medicine, 52(8), 1167–81.
Löwenheim, O. (2008). Examining the state: A Foucauldian perspective on international 'governance indicators'. Third World Quarterly, 29(2), 255–74.
Lupton, D. (2013). Quantifying the body: Monitoring and measuring health in the age of mHealth technologies. Critical Public Health, 23(4).
Miller, P. (2022). Afterword: Quantifying, mediating and intervening: The R number and the politics of health in the twenty-first century. In A. Mennicken & R. Salais (Eds.), The new politics of numbers (pp. 465–76). Palgrave Macmillan.
Miller, P., & Rose, N. (2008). Governing the present. Polity.
Morrissey, J. (2013). Governing the academic subject: Foucault, governmentality and the performing university. Oxford Review of Education, 39(6), 797–810.
Niesche, R. (2014). Governmentality and my school: School principals in societies of control. Educational Philosophy and Theory, 47(2), 133–45.

O'Malley, P., Weir, L., & Shearing, C. (1997). Governmentality, criticism and politics. Economy and Society, 26(4), 501–17.
Owen, D. (2002). Criticism and captivity: On genealogy and critical theory. European Journal of Philosophy, 10(2), 216–30.
Porter, T.M. (1995). Trust in numbers: The pursuit of objectivity in science and public life. Princeton University Press.
Power, M. (1996). Making things auditable. Accounting, Organizations and Society, 21(2–3), 289–315.
Power, M. (2004). Counting, control and calculation: Reflections on measuring and management. Human Relations, 57(6), 765–83.
Procacci, G. (1991). Social economy and the government of poverty. In G. Burchell, C. Gordon, & P. Miller (Eds.), The Foucault effect: Studies in governmentality (pp. 151–68). University of Chicago Press.
Rose, N. (1996a). Governing 'advanced' liberal democracies. In A. Barry, T. Osborne, & N. Rose (Eds.), Foucault and political reason (pp. 37–64). UCL Press.
Rose, N. (1996b). Inventing our selves: Psychology, power and personhood. Cambridge University Press.
Rose, N., & Miller, P. (1992). Political power beyond the state: Problematics of government. British Journal of Sociology, 43(2), 173–205.
Rosol, M. (2014). Governing cities through participation – a Foucauldian analysis of CityPlan Vancouver. Urban Geography, 36(2), 256–76.
Rutherford, S. (2016). Green governmentality: Insights and opportunities in the study of nature's rule. Progress in Human Geography, 31(3), 291–307.
Sauer, B., & Penz, O. (2017). Affective governmentality: A feminist perspective. In C. Hudson, M. Rönnblom, & K. Teghtsoonian (Eds.), Gender, governance and feminist analysis (pp. 39–58). Routledge.
Schuilenburg, M., & Peeters, R. (Eds.) (2021). The algorithmic society: Technology, power, and knowledge. Routledge.
Shore, C. (2008). Audit culture and illiberal governance: Universities and the politics of accountability. Anthropological Theory, 8(3), 278–98.
Sigley, G. (2007). Chinese governmentalities: Government, governance and the socialist market economy. Economy and Society, 35(4), 487–508.
Stenson, K. (1999). Crime control, governmentality and sovereignty. In R. Smandych (Ed.), Governable places: Readings on governmentality and crime control (pp. 45–73). Routledge.
Thomas, P., & Hewitt, J. (2011). Managerial organization and professional autonomy: A discourse-based conceptualization. Organization Studies, 32(10), 1373–93.
Triantafillou, P. (2002). Machinating the responsive bureaucrat: Excellent work culture in the Malaysian public sector. Asian Journal of Public Administration, 24(2), 185–209.
Triantafillou, P. (2009). The European employment strategy and the governing of French employment policies. Administrative Theory & Praxis, 31(4), 479–502.
Triantafillou, P. (2011). Metagovernance by numbers: Technological lock-in of Australian and Danish employment policies? In J. Torfing & P. Triantafillou (Eds.), Interactive policymaking, metagovernance and democracy (pp. 149–66). ECPR Press.
Triantafillou, P. (2012). New forms of governing: A Foucauldian inspired analysis. Palgrave Macmillan.
Triantafillou, P. (2015). Doing things with numbers: The Danish national audit office and the governing of university teaching. Policy and Society, 34(1), 13–24.
Triantafillou, P. (2016). Governmentality. In J. Torfing & C. Ansell (Eds.), Handbook on theories of governance (pp. 353–63). Edward Elgar.
Triantafillou, P. (2017). Neoliberal power and public management reforms. Manchester University Press.
Triantafillou, P. (2022). Biopower in the age of the pandemic: The politics of COVID-19 in Denmark. European Societies, 24(5), 657–81. https://doi.org/10.1080/14616696.2022.2061553.
Weiskopf, R., & Munro, I. (2011). Management of human capital: Discipline, security and controlled circulation in HRM. Organization, 19(6), 685–702.
Zhang, C. (2020). Governing (through) trustworthiness: Technologies of power and subjectification in China's social credit system. Critical Asian Studies, 52(4), 565–88.

PART III METHODS AND METHODOLOGIES FOR MEASURING GOVERNANCE

9. Approaches and methods for measuring governance: comparing major supranational institutions Andrea Bonomi Savignon, Lorenzo Costumato and Fabiana Scalabrini

INTRODUCTION
This chapter examines the main supranational institutions producing databases on the components of governance quality, comparing the methods and thematic focuses of the different databases and paying particular attention to their methodological fit with their respective aims. As the performance movement has intensified in national political agendas during the past three decades, increasing levels of formalized planning, control and reporting have been observed in governance structures at the global level, and especially across all Organisation for Economic Co-operation and Development (OECD) countries (Bouckaert & Halligan, 2008, p. 29). The main waves putting an emphasis on performance management can be identified in the scientific management movement (1900s–1940s); in the planning, programming and budgeting system (PPBS) and management by objectives (MBO); in New Public Management (NPM) theory (1980s–2000); and, distinctively, in the Public Governance approach (Van Dooren et al., 2015). There are at least four possible lenses through which we can look at the governance of results: (i) performance measurement at the global level (international institutions); (ii) national public sector policies introducing compulsory performance management systems; (iii) strategic performance management at the organizational level; (iv) individual performance assessment and pay. This chapter focuses on the first level and analyzes the main performance measurement systems used to compare national governance systems in different countries, focusing specifically on the span (input and process vs output and outcome measures) and depth (regional, national, global level) of their measurement approaches, and on the sources of data (primary vs secondary). Finally, we also provide an assessment based on the robustness and transparency of the methodological collection and aggregation of data, and on the interactivity and openness of the data used. Far from aiming to provide a complete overview of all institutions producing data on governance quality and impacts, we first introduce the measurement approaches of three exemplary cases of supranational measurement: the World Bank Worldwide Governance Indicators, the OECD Government at a Glance, and the United Nations (UN) Sustainable Development Goals. We show how the three databases focus on different dimensions of governance impacts and employ significantly different methods of data collection and representation. Then, we introduce a thematic focus on digital transformation policies, exemplified by an analysis of the UN E-government Development Index. The aim is to evaluate whether a narrower thematic focus induces a different methodological approach by measuring institutions.

Finally, we conclude by synthesizing similarities and differences among the measurement approaches, illustrating whether an association between the span of performance measures included and the methodologies employed is apparent.

ANALYSIS OF THE KEY PURPOSES AND TECHNICAL CHARACTERISTICS OF THE INSTITUTIONS AND RELATED DATABASES
Worldwide Governance Indicators
The Worldwide Governance Indicators (WGI) project is an index produced by the World Bank. The WGI is a long-standing research project to develop cross-country indicators of governance. It reports the views of a large number of enterprise, citizen and expert survey respondents in industrial and developing countries on the quality of governance. Before delving into this index in detail, it is essential to start with the definition of governance provided by the WGI:

Governance is the traditions and institutions by which authority in a country is exercised. This includes (a) the process by which governments are selected, monitored and replaced; (b) the capacity of the government to effectively formulate and implement sound policies; and (c) the respect of citizens and the state for the institutions that govern economic and social interactions among them. (Kaufmann et al., 2010)

The WGI definition of governance broadly delimits the empirical set that makes up the dimensions. However, the relevance of other definitions is also recognized. The WGI provides a measure of governance relating to the outcomes that can be generated in civil society. The WGI was therefore built along six interconnected dimensions:

1. Voice and Accountability – capturing perceptions of the extent to which a country's citizens are able to participate in selecting their government, as well as freedom of expression, freedom of association and free media;
2. Political Stability and Absence of Violence/Terrorism – capturing perceptions of the likelihood that the government will be destabilized or overthrown by unconstitutional or violent means, including politically motivated violence and terrorism;
3. Government Effectiveness – capturing perceptions of the quality of public services, the quality of the civil service and the degree of its independence from political pressures, the quality of policy formulation and implementation, and the credibility of the government's commitment to such policies;
4. Regulatory Quality – capturing perceptions of the ability of the government to formulate and implement sound policies and regulations that permit and promote private sector development;
5. Rule of Law – capturing perceptions of the extent to which agents have confidence in and abide by the rules of society, and, in particular, the quality of contract enforcement, property rights, the police and the courts, as well as the likelihood of crime and violence;

6. Control of Corruption – capturing perceptions of the extent to which public power is exercised for private gain, including both petty and grand forms of corruption, as well as 'capture' of the state by elites and private interests.

Points (a), (b) and (c) of the WGI governance definition each embrace two of the six listed dimensions. In particular, 'voice and accountability' and 'political stability and absence of violence/terrorism' refer to the process by which governments are selected, monitored and replaced; 'government effectiveness' and 'regulatory quality' refer to the capacity of the government to effectively formulate and implement sound policies; finally, 'rule of law' and 'control of corruption' refer to the respect of citizens and the state for the institutions that govern economic and social interactions among them. The WGI started in 1996, and the survey was biennial from 1996 to 2002. Since 2003 it has become annual. The WGI uses mainly secondary data deriving from 34 datasets provided by non-governmental organizations, multilateral organizations and other public sector bodies. Over time the WGI index has considered data from 214 countries. This dataset includes data from surveys of firms and households as well as the expert assessments of various commercial business information providers. For this reason, the data sources reflect the perceptions of a large and varied number of respondents. The data sources have to be harmonized with each other because, although most are updated annually, some surveys take place only every two or three years. Data are rescaled and combined to create the six aggregate indicators corresponding to the six dimensions of governance described above, using a statistical methodology known as an unobserved components model. The process of creating the WGI can be divided into three steps:

1. Assigning data from individual sources to the six aggregate indicators – every individual question from the underlying data sources is assigned to one of the six aggregate indicators;
2. Rescaling the individual source data to run from 0 to 1;
3. Using an unobserved components model (UCM) to construct a weighted average of the individual indicators from each source.

At the end of these three steps, data for each country are reported in percentile rank terms, ranging from 0 (lowest rank) to 100 (highest rank). The data underlying the WGI can be accessed and downloaded freely. On the official websites of the WGI and the World Bank, it is also possible to generate highly customizable data representations: one can choose the country or countries to be analyzed, the dimensions and the time interval. The intention of this indicator is not to provide an absolute governance value, but rather to give evidence of the evolution of governance over time, both within each country and in comparison with other countries. In particular, it becomes clear how difficult it is to capture short-term changes in the governance effects measured by the WGI, since governance outcomes can only be measured in the long run. Changes over time in a country's score on the WGI reflect a combination of three factors: (i) changes in the underlying source data; (ii) the addition of new data sources for a country that are only available in the more recent period; and (iii) changes in the weights used to aggregate the individual sources.
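To make steps 2 and 3 concrete, the following is a minimal sketch in Python. It is illustrative only: the source names, scores and error variances are invented, and the inverse-variance weighting is a simplified stand-in for the full unobserved components model, which estimates each source's error variance from the data rather than assuming it.

```python
import pandas as pd

# Hypothetical raw scores for one governance dimension from three
# sources, each on its own original scale (rows = countries).
raw = pd.DataFrame(
    {"src_a": [2.1, 4.5, 3.0, 1.2],   # e.g. a 1-5 expert rating
     "src_b": [55, 90, 70, 30],       # e.g. a 0-100 survey score
     "src_c": [0.3, 0.9, 0.6, 0.1]},  # e.g. a 0-1 index
    index=["Country A", "Country B", "Country C", "Country D"],
)

# Step 2: rescale each individual source to run from 0 to 1.
rescaled = (raw - raw.min()) / (raw.max() - raw.min())

# Step 3 (simplified): weight sources by their precision, i.e. the
# inverse of their assumed error variance, then average. The actual
# UCM estimates these variances; the values below are placeholders.
error_var = pd.Series({"src_a": 0.04, "src_b": 0.02, "src_c": 0.06})
weights = (1 / error_var) / (1 / error_var).sum()
composite = rescaled.mul(weights, axis=1).sum(axis=1)

# Report each country in percentile rank terms, 0 (lowest) to 100.
percentile_rank = composite.rank(pct=True) * 100
print(percentile_rank.sort_values(ascending=False))
```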
It should be emphasized that the most significant changes derive from greater data availability over time, which provides an increasingly complete picture.
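Because the underlying data are freely downloadable, such changes can also be inspected programmatically. A minimal sketch follows, assuming the World Bank's public v2 data API and the series code CC.EST for the Control of Corruption estimate; both reflect the World Bank's published conventions at the time of writing and should be verified before use.

```python
import requests

# Fetch the Control of Corruption estimate for Denmark, 2010-2021
# (series code CC.EST assumed; see lead-in above).
url = "https://api.worldbank.org/v2/country/DNK/indicator/CC.EST"
resp = requests.get(url, params={"format": "json", "date": "2010:2021"})
resp.raise_for_status()

# The API returns a two-element list: paging metadata, then records.
metadata, observations = resp.json()
for obs in observations:
    print(obs["date"], obs["value"])  # year and point estimate
```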

Conversely, it has been noted that the inclusion of different data sources over time, and the different methodologies adopted by data providers in different regional clusters, have not ensured a complete level of comparability within the WGI over space and time. Within the WGI, the World Bank also discloses whether the various organizations collecting data at the local level employ the same methodological standards.

Government at a Glance
Government at a Glance (GaaG) is a set of public governance indicators provided by the OECD. Its focus is on how governments work and how they perform. The main objective of GaaG is to allow countries to benchmark their performance and, consequently, to measure their progress over time (Lafortune et al., 2018). In line with this goal, the target audience of GaaG is predominantly made up of politicians, policymakers, academics, students and interested citizens. The first OECD GaaG report was published in 2009. For the first time, an international organization combined a large set of comparative data on the performance of the public sector in a coherent and accessible way (OECD, 2017). Since 2009, GaaG reports have been published regularly every two years. In addition, yearly updates for some selected GaaG indicators are available on OECD.stat. Since the first report, GaaG has provided data concerning not only OECD countries but also OECD key partner countries, OECD accession countries and other non-OECD countries considered relevant for the analysis. Looking at the group of OECD countries, starting in 2009 with the 30 countries that were OECD members at that time, the subsequent reports have also considered the entrance of Chile, Estonia and Israel (2011), Latvia (2013), Lithuania (2019) and Colombia (2021). For each OECD country a dedicated fact sheet is published in addition to the overall GaaG report, while data for non-OECD countries are presented separately and are not considered in the aggregate graphical representations. For both OECD and non-OECD countries, data are shown at the national level without considering regional or local institutional units. All the indicators presented in the reports are grouped into different areas of interest that recur frequently, with few differences across the seven reports. Some areas have appeared regularly since 2009, that is, 'public employment', 'HRM management', 'budgeting' and 'regulatory management'. Others have been repeatedly present since 2011, namely 'public finance and economics', 'public procurement' and 'open government'. In 2015, three other areas of interest were introduced and appeared with the same label also in 2017, 2019 and 2021: 'institutions', 'core government results' and 'serving citizens'. Finally, since 2015 GaaG has started to measure aspects related to digital government. Nevertheless, we notice that specific areas of interest were sometimes measured in single editions only, such as 'women in government' in 2013 and 'risk management and communication' in 2017. The number of indicators changes across the seven editions. This happens for three main reasons, as observed by Lafortune et al. (2018): first, in the last 15 years, new topics have risen to the forefront (e.g., digital government); second, the need to include an assessment of public sectors' outputs and outcomes has become more and more relevant for governments; third, after the first reports, the need to describe different aspects of each country's public sector increased.

Following only in part the taxonomy provided by Lafortune et al. (2018), we recognize that GaaG indicators can be divided into two groups: core indicators that are present in every edition of GaaG, and one-off indicators that address issues that are topical at the time of publication. By looking more carefully at the indicators within the different areas of interest, we recognize first of all that GaaG includes indicators covering the whole 'production chain' (Lafortune et al., 2018). In the first chapter of the 2017 GaaG report, Bouckaert noted that 'while the first edition [in 2009] contained indicators only on the context, inputs and processes, there has been a clear strategy over time to broaden the scope and the span of coverage to also include outputs and outcomes'. The resulting conceptual framework consists of four main dimensions of analysis:

1. contextual factors of each country;
2. inputs, referring to the resources used by governments;
3. processes used by governments for the implementation of public policies;
4. outputs and outcomes, which, according to GaaG, refer to 'the amount of goods and services produced by governments' (outputs) and to 'the effects of policies and practices on citizens and businesses' (outcomes).

However, if we look at the impact that all the indicators have on each area of interest (Table 9.1), we can still recognize a prevalence of indicators devoted to public management practices and procedures (processes). Indicators on outputs/outcomes remain residual even though they provide helpful information for governments and citizens. GaaG uses many data sources. Most of the data in GaaG have been collected directly by the OECD. However, in order to avoid duplication of data collection and to provide a comprehensive view of what governments do, data are also drawn from other international organizations. Using the 2021 report as the primary reference for the following analysis, it is possible to qualify the set of indicators by looking at their source (and whether they are primary or secondary data), how the data are collected (especially for primary data), whether the owner is the OECD itself and, finally, whether the data provide qualitative or quantitative information. All 15 indicators within the area of 'public finance and economics' are relatively homogeneous as regards their sources and main features. They derive from secondary data from the OECD National Accounts Statistics database, which includes a wide range of data based on the System of National Accounts (SNA): a set of internationally agreed concepts, definitions, classifications and rules for national accounting that allows the international comparability of data.

Table 9.1  Impact of GaaG areas of interest on what GaaG measures

Contextual factors: Country fact sheets; online resources
Inputs: Public finance and economics (ch. 2); public employment (ch. 3)
Processes: Institutions (ch. 4); budgeting (ch. 5); HRM (ch. 6); regulatory government (ch. 7); public procurement (ch. 8); open government (ch. 9); digital government (ch. 10); governance of infrastructure (ch. 11); public sector integrity (ch. 12)
Outputs and outcomes: Core government results (ch. 13); serving citizens (ch. 14)

Source: Own elaboration.
Approaches and methods for measuring governance  143 of data. In five cases, OECD data are merged with secondary data from Eurostat government finance statistics, while data for non-OECD countries are from the International Monetary Fund (IMF). Compared with the ‘public finance and economics’ indicators, the seven ‘public employment’ indicators exhibit greater heterogeneity. Here we have different sources. Data come from the OECD, International Labour Organization (ILO), the Inter-Parliamentary Union’s Parline database and the Eueopean Commission for the Efficiency of Justice (CEPEJ) mixing primary and secondary sources. The OECD directly collects data through the 2020 Composition of the Workforce in Central/Federal Governments survey for two indicators. The respondents are senior officials in central government HRM departments. The ‘institutions’ section is consistently dedicated to the central government’s role in managing the COVID-19 crisis. For all the indicators, the OECD has directly surveyed senior officials from OECD governments who provide direct support and advice to heads of government and the council of ministers or cabinet. The four ‘budgeting’ indicators are the result of two different surveys. The first concerns the spending reviews and represents the countries’ own assessment of current practices and procedures. Respondents are predominantly senior budget officials. The second is an OECD and European Commission joint survey and provides countries’ own assessment for two indicators of green budgeting practices; respondents are budget officials within central budget authorities. Furthermore, one indicator is the result of data from desk research conducted by the OECD and verified by the OECD’s Network of Parliamentary Budget Officials and Independent Fiscal Institutions. As for ‘institutions’ and ‘budgeting’, other sections are based on OECD surveys. In the case of ‘public procurement’, ‘open government’, ‘governance of infrastructure’ and ‘public sector integrity’, respondents are, respectively, country delegates responsible for procurement policies at the central government level, representatives to the OECD Working Party on Open Government, senior officials in the central/federal ministries of infrastructure, public works and finance, or infrastructure agencies, and senior officials responsible for integrity policies in central government. Indicators of ‘human resources management’ are interesting as they provide some peculiarities compared to what we have seen previously. All the indicators are based on primary data collected by OECD survey. Survey respondents are predominantly senior officials in central government HRM departments, and the data refer only to HRM practices at the central government level. The originality is that data are presented using an index with a composite indicator for three cases. These indexes have been developed to measure contemporary public sector HRM developments and dilemmas on how to best manage human resources in the public sector in the 21st century. Another peculiarity concerns the ‘measuring employee engagement’ indicator, which considers data from questions on respondents’ perceptions. The ‘regulatory governance’ section presents six indicators. Three are based on the Regulatory Policy and Governance (iREG) survey. The iREG index measures three fundamental principles: stakeholder engagement, regulatory impact analysis and ex post evaluation. The remaining three are based on the OECD product market regulation methodology and survey. 
They measure the independence of economic regulators, the accountability arrangements of economic regulators and the regulators’ performance. Also, the OECD collects data for the ‘digital government’ section through the Survey on Digital Government 1.0, which was designed to monitor the implementation of the OECD

144  Handbook on measuring governance Recommendation on Digital Government Strategies adopted on 15 July 2014 and to assess countries’ digital maturity across six dimensions of analysis: digital by design, data-driven public sector, government as a platform, open by default, user-driven and proactiveness. The resulting Digital Government Index (DGI) is a composite index focusing on ‘the implementation of cross-government digital and data standards, key enablers, and principles’. The final remarks are for indicators of ‘core government results’ and ‘serving citizens’. According to the GaaG conceptual framework, these sections refer to the outputs and outcomes of public administrations. From a methodological point of view, we recognize some substantial differences from the previous indicators, as they are based largely on secondary data from the Gallup World Poll, the World Justice Project, the United Nations Educational Scientific and Cultural Organization (UNESCO) and other European sources. OECD surveys are still present but residually. Several secondary sources considered for these indicators are based on the analysis of citizens’ (or specific stakeholders’) perceptions through surveys and interviews. To conclude, we summarize some highlights on some crucial aspects that we mentioned for each area of interest. In particular: 1. The OECD collects data for inputs and processes mainly through OECD surveys where respondents are senior officials of OECD countries’ governments. It implies that comparisons among OECD countries are quite reliable, and where indicators are recurrent along GaaG reports, it is possible to compare the advancement on a specific topic in one country over the years. 2. Indicators of outputs and outcomes of public sectors rely on a mix of primary and secondary data. In particular, when information about citizens’ or non-governmental stakeholders’ perceptions are needed, GaaG asks for the intervention of external sources. This approach has strengths and weaknesses: on the one hand, despite the external sources being well recognized as reliable sources, data analysis needs an effort in standardization by the OECD, and data are not always available for all the OECD countries; on the other hand, the intervention of external sources avoids the risk of duplication. 3. As regards the type of information delivered by the indicators, all the results shown in the GaaG publication are quantitative. For some indicators, the analysis returns an index that compares OECD countries in a sort of rank (i.e., the Digital Government Index – DGI). All data are presented graphically within the report. For each graph, an Excel file with all the data is provided within the text as a hyperlink. Additionally, on OECD.stat, some selected indicators of GaaG 2021 are accessible and reusable in aggregate. However, the possibility to interact with data directly on graphs and figures is something that the OECD must still improve. Table 9.2 improves Table 9.1 by giving an overview of what we have analyzed in previous paragraphs. Sustainable Development Goals (SDGs) – UN Agenda 2030 At the heart of the 2030 Agenda for Sustainable Development, adopted in 2015 by the UN, are the 17 Sustainable Development Goals (SDGs) with their 169 associated targets and 2312 indicators (A/RES/71/313). The Agenda was signed by 193 member countries of the UN. Since 2015, the SDGs have called for actions by developed and developing countries. 
Given the number of global, national and regional actors involved, assessing progress towards the achievement of the goals is a complex, multifaceted process (UNECE, 2020).

Table 9.2  Overview

Chapter in the OECD GaaG 2021
  Contextual factors: country fact sheets; online resources
  Inputs: public finance and economics (ch. 2); public employment (ch. 3)
  Processes: institutions (ch. 4); budgeting (ch. 5); HRM (ch. 6); regulatory government (ch. 7); public procurement (ch. 8); open government (ch. 9); digital government (ch. 10); governance of infrastructure (ch. 11); public sector integrity (ch. 12)
  Outputs and outcomes: core government results (ch. 13); serving citizens (ch. 14)

Source (primary or secondary)
  Contextual factors: summary of data presented in aggregate over the report
  Inputs: secondary data for finance and economics and for employment
  Processes: primary sources (surveys) where respondents are senior government officials
  Outputs and outcomes: a blend of primary and secondary sources for intercepting citizens' and stakeholders' perceptions

Qualitative vs quantitative
  Quantitative for all four dimensions

Reliability
  Inputs: medium; processes: high; outputs and outcomes: high

Comparability (horizontally and vertically)
  Inputs: high; processes: high; outputs and outcomes: medium

Openness and reusability
  Medium for inputs, processes, and outputs and outcomes

Interactivity
  Low for inputs, processes, and outputs and outcomes

Source: Own elaboration.

Since 2016, the UN Secretary-General has presented an annual SDG progress report in which data produced by national statistical systems are collected and delivered (see, e.g., the SDG Progress Report 2022). Additionally, a Global Sustainable Development Report is prepared once every four years by an independent group of scientists appointed by the Secretary-General to inform the quadrennial SDG review deliberations at the General Assembly (the first two editions of the GSDR were published in 2019 and 2023). Geographically speaking, except for the first annual progress report (2016), in which data were presented for countries in 'developed' or 'developing' regions, the country groups are based on the geographical regions defined under the Standard Country or Area Codes for statistical use (known as M49) of the UN Department of Economic and Social Affairs Statistics Division.3 It consists of eight main regional groups of countries: Sub-Saharan Africa, Northern Africa and Western Asia, Central and Southern Asia, Eastern and South-Eastern Asia, Latin America and the Caribbean, Australia and New Zealand, Oceania, and Europe and Northern America. Moreover, some countries are also classified as Least Developed Country (LDC), Landlocked Developing Country (LLDC) or Small Island Developing State (SIDS). The annual progress reports present data aggregated around the eight regional groups; however, in the online SDG Global Database, specific country reports are available.

If we look at the SDGs through the lens of performance management and its measurability, we can see that they represent the overall impacts that the global community must achieve.

This is true especially for the first 16 SDGs, while the last – 'Partnerships for the goals' – can be seen at the same time as the impact of the new international governance approach promoted by the UN in the Agenda and as the driver for the achievement of the other goals. While SDG 17 explicitly calls for a global partnership for development and has a target specifically related to multistakeholder collaboration (17.17), in reality all the SDGs require significant collaboration across all societal sectors and actors (Stibbe & Prescott, 2020).

As mentioned, each SDG presents a list of targets and related indicators (Table 9.3). Within this goal-target-indicator structure, the targets stand for the outcomes and, in some cases, the outputs (results) through which the achievement of the expected impacts is enabled. Consequently, the indicators also measure outcomes and outputs. In depicting this new approach to international development, the Agenda – and the SDGs themselves – has recognized the interconnectedness of prosperous business, a thriving society and a healthy environment (Figure 9.1). As stated in the 2030 Agenda for Sustainable Development, the Goals and targets 'will be followed-up and reviewed using a set of global indicators. These will be complemented by indicators at the regional and national levels which will be developed by member states, in addition to the outcomes of work undertaken for the development of the baselines for those targets where national and global baseline data does not yet exist.'

Following this approach, all the information presented in the SDG progress reports is based on data from the Global Indicator Framework for the SDGs adopted by the General Assembly in 2017 (A/RES/71/313). Given that the metadata for each indicator are available and transparent on the UN website, it is possible to understand the methodology behind the yearly measurement of the SDGs' progress. While it is common ground that all 231 indicators provide quantitative information, the most interesting aspect concerns the source of data collection, as it involves national and international organizations.

Table 9.3  SDGs with number of targets and indicators

SDG                                           No. of targets   No. of indicators
1. No poverty                                 7                13
2. Zero hunger                                8                14
3. Good health and well-being                 13               28
4. Quality education                          10               12
5. Gender equality                            9                14
6. Clean water and sanitation                 8                11
7. Affordable and clean energy                5                6
8. Decent work and economic growth            12               16
9. Industry, innovation and infrastructure    8                12
10. Reduced inequalities                      10               14
11. Sustainable cities and communities        10               15
12. Responsible consumption and production    11               13
13. Climate action                            5                8
14. Life below water                          10               10
15. Life on land                              12               14
16. Peace, justice and strong institutions    12               24
17. Partnerships for the goals                19               24
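Because the goal-level counts in Table 9.3 are easy to mis-total, a small script can serve as a consistency check. The sketch below is illustrative only, with the counts hard-coded from the table; it confirms the 169 targets and shows why the framework lists 248 indicator entries while containing only 231 unique indicators (see note 2).

```python
# Consistency check on Table 9.3 (counts hard-coded from the table above).
targets = [7, 8, 13, 10, 9, 8, 5, 12, 8, 10, 10, 11, 5, 10, 12, 12, 19]
indicators = [13, 14, 28, 12, 14, 11, 6, 16, 12, 14, 15, 13, 8, 10, 14, 24, 24]

assert sum(targets) == 169       # targets named in the 2030 Agenda
assert sum(indicators) == 248    # indicator listings across all 17 goals
# 13 indicators recur under two or three targets (note 2), which accounts
# for the 248 listings versus the 231 unique indicators of the framework.
print(sum(indicators) - 231, "listings are repetitions")  # -> 17
```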

Figure 9.1  The interconnected dimensions of sustainable development goals (Source: UN)

On the one hand, data collection is mainly based on secondary data provided by National Statistical Offices (NSOs), but also on data from other national government departments and agencies involved in the production of data, subnational government departments and agencies (especially municipalities), academic and research organizations and civil society organizations. On the other hand, the responsibility for global collection and reporting is shared between the UN and many other international organizations, such as the Food and Agriculture Organization (FAO), ILO, IMF, OECD, the World Health Organization (WHO), the World Bank and the World Trade Organization (WTO), among many others.

Figure 9.2 shows the data flow and the main players involved in the production of SDG measures. The process begins with the NSOs, which are responsible for preparing SDG indicators at the national level. Indicators intended for global reporting then flow to organizations of the UN or other global bodies, acting as 'custodian agencies' with responsibility for compiling specific indicators. The custodian agencies feed the indicators into the global SDG database, which in turn serves as background material for the annual meeting of UN member states, the High-Level Political Forum on Sustainable Development. Another set of inputs into the High-Level Political Forum are the National Voluntary Reports on SDG progress prepared directly by the member states themselves.

Figure 9.2  Data flow of SDG data collection (Source: UNECE, 2020)

As regards the openness of data, in addition to the annual progress reports in which data are presented in aggregate, the UN continuously publishes further information on the SDG Global Database.4 Once inside the portal, it is possible to access the complete set of metadata for each one of the 17 SDGs, with the possibility of filtering data for global, national, regional or subregional contexts. Specific web pages are also dedicated to individual country data, in which all the available information for the country is disclosed and reusable.

An aspect that must be highlighted here, and that illustrates the global and innovative power of the SDGs, is that progress towards the achievement of the SDGs' targets is measured not only by the UN, as we have seen in the previous paragraphs, but also by other international institutions and national governments. This serves two primary purposes: the first is to contribute to the global reporting on the achievement of the goals (the process described in Figure 9.2); the second is to connect national or international strategies, formally independent of the UN 2030 Agenda, to the SDGs' targets. This last aspect shows that the SDGs are meeting the initial expectations regarding their capacity to attract and orient government action worldwide.

An outstanding example is the European context. In 2016, the European Union (EU) announced the integration of the SDGs into the European policy framework, describing the EU's frontrunner role in promoting sustainable development as instrumental to the SDGs' main purposes. A few years later, in 2019, the European Commission published the reflection paper 'Towards a sustainable Europe by 2030', which presented the competitive advantages that implementing the SDGs would offer the EU. In July 2019, the newly elected President of the European Commission, Ursula von der Leyen, adopted her political guidelines, 'A Union that strives for more', setting out a 'whole of government' approach towards the implementation of the SDGs. Each of the six European political priorities has been linked to one or more SDGs based on the potential contribution of the former to the latter.

Beyond what is defined in the European Commission's political strategy, since 2017 the European statistical institution, EUROSTAT, has been monitoring the implementation of the SDGs in its annual SDG monitoring reports.5 These reports monitor progress towards the SDGs in an EU context, building on the EU SDG indicator set developed in cooperation with many stakeholders. The indicator set comprises around 100 indicators and is structured along the 17 SDGs, each focusing on aspects that are relevant from an EU perspective. The monitoring report provides a statistical presentation of trends relating to the SDGs in the EU over the past five years (short term) and, when sufficient data are available, over the past 15 years (long term). As well as the EUROSTAT yearly reports, the EU provides intelligent tools for disseminating the EU's contribution to the achievement of the SDGs: 'SDGs & me', for example, allows people to consult one European member state's progress towards the SDGs, compared to the other member states or to the European average; the 'SDG country overview' platform allows the same, with a different data visualization.

Despite the clear ambition of the EU to integrate the SDGs into European policies, a recent report delivered by the European Committee of the Regions (2022) identified several areas for improvement, finding that the EU lacks both a comprehensive strategy for the achievement of the SDGs and a structured monitoring methodology.
Moreover, observing the recently adopted National Recovery and Resilience Plans (NRRPs6), it is clear that the SDGs were not mentioned in the guidance provided by the European Commission. The analysis of the member states' NRRPs confirms this trend: the report's authors found that most member states merely mention the SDGs implicitly, with fewer countries explicitly linking NRRP components to the SDGs, and that the use of SDG indicators in their performance management systems is limited. An ex post effort to attribute NRRP targets to the SDGs is recognizable in some European member states: for example, the Italian Statistical Office (ISTAT) has recently delivered a dashboard in which the link between the Italian NRRP objectives and the SDGs is highlighted.7

A FOCUS ON DIGITAL GOVERNANCE INDICATORS: E-GOVERNMENT DEVELOPMENT INDEX (EGDI)

In the previous paragraphs, the starting point has been the concept of governance broadly understood. Still, it would be misleading to think that placing the letter 'e' in front of words such as government, governance and democracy is enough to make these concepts 'technological'. Adopting technologies does not simply mean computerization: it changes not only results and processes but also the relationships that shape them. In this specific case, e-governance cannot simply be governance that has been given an electronic patina. E-government and e-governance are two very different concepts, just as the concept of government differs from that of governance. There may be forms of governance with no direct government involvement, and the same occurs with the adoption of technology. It is, therefore, possible to affirm that e-governance through the use of Information and Communication Technologies (ICT) in government (a) changes governance structures and processes in ways that were not possible without the application of ICT, (b) creates new structures and new governance processes not possible without ICT, and (c) concretizes the issues that have arisen with the application of ICT systems in new rules and laws.

Bannister and Connolly (2012) noted a decade ago that the term e-governance had been used with considerable elasticity, leaving it without a univocal definition. This vagueness was compounded by the interchangeable use, over the years, of the words e-governance and e-government. The digital governance concept can be seen as an evolution of the e-governance concept (Misuraca & Viscusi, 2014). Digital governance is defined as digital technology ingrained in structures or processes of governance and their reciprocal relationships with governance objectives and normative values. Digital governance includes the utilization of digital capabilities and involves a transformation of structures, processes or normative values. Changes in the structure and processes of digital governance intervene in service delivery, regulation, policymaking, governance mechanisms, relationships, interaction and participation, coordination and decision-making; as regards values, they intervene in efficiency, transparency, accountability, participation, effectiveness, responsiveness, good governance, smart governance and economic development (Engvall & Flak, 2022).

To date, given the prominence of the topic, also in light of recent historical events, there is a strong focus on measuring the digitalization of a country in pursuit of broadly understood governance objectives. The relevant questions when measuring the digital context of a country concern both the positioning of that country compared to others, which allows for a snapshot of the state of the art, and how digital transformation contributes to the socio-economic development of the country. Among the most authoritative indexes, we single out the E-Government Development Index (EGDI).

The EGDI is a synthetic index produced every two years within the United Nations E-Government Survey report since 2001. It is the only global report that analyzes the development of e-government among the 193 member states of the UN. The report aims to provide a ranking of the performance of each country on a relative scale: each country's score ranges from 0 to 1, and it is not an absolute measurement. Every country is placed in one of four groups: 'Very High EGDI', 'High EGDI', 'Medium EGDI' or 'Low EGDI'. The index's purpose is to measure the effectiveness of e-government in the provision of public services. With the EGDI, it is possible to identify models for benchmarking, so that countries can learn from and improve upon each other, and to identify areas of strength and the challenges that e-government presents, thus allowing the shaping of the policies and strategies adopted in this sector. The results are tabulated and combined with a set of indicators embodying a country's capacity to participate in the information society, without which e-government development efforts are of limited immediate use.

The EGDI is a composite measure of three dimensions of e-government: the provision of online services (Online Service Index, OSI), telecommunication connectivity (Telecommunication Infrastructure Index, TII) and human capacity (Human Capital Index, HCI). The method and dimensions are constant, while the meaning of the values of these dimensions changes from one survey to another, because the technology changes; these changes produce an evolution in the concept of e-government. For the EGDI, this aspect is fundamental because it reinforces the idea of a comparative framework that evolves with the reference context. The path undertaken by this type of study is therefore not linear with an absolute objective, but changes with the evolution of society's needs. These aspects enable the index to provide contemporary and accurate views of the realities observed; indeed, the index reflects the SDGs' goals in all its indicators and aspects.

The first dimension, the scope and quality of online services (OSI), is investigated by the United Nations Department of Economic and Social Affairs (UNDESA). Its components are institutional framework, service provision, content provision, technology and e-participation. The online service index measures a government's capability and willingness to provide services and communicate with its citizens electronically. The second dimension, telecommunication (TII), is investigated by the International Telecommunication Union (ITU). Its components are internet users, fixed broadband subscriptions, wireless broadband subscriptions, fixed telephone subscriptions and mobile cellular subscriptions. The telecommunication infrastructure index measures the infrastructure required for citizens to participate in e-government. Finally, UNESCO investigates the dimension of human capital. Its components are the gross enrolment ratio, expected years of schooling, adult literacy and mean years of education. The human capital index measures citizens' ability to use e-government services.

The EGDI combines primary quantitative data (collected and owned by UNDESA) with secondary data from other UN agencies. The index is closely correlated with the SDGs and also offers good interactivity. The data are public, and many combinations at different levels of analysis can be obtained: country, region and city; for every level, the year and the type of data can be chosen.8
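In terms of construction, the EGDI is the simple arithmetic average of the three normalized component indices. The minimal sketch below illustrates this in Python; the equal-weight averaging follows the survey's stated approach, while the group cut-offs and all input values are assumptions for illustration, not official data.

```python
# A minimal sketch of the EGDI's composite construction: the arithmetic mean
# of three normalized sub-indices (OSI, TII, HCI). Input values are hypothetical.

def egdi(osi: float, tii: float, hci: float) -> float:
    """Equal-weight average of the three normalized (0-1) sub-indices."""
    return (osi + tii + hci) / 3

def egdi_group(score: float) -> str:
    """Four-level grouping used in the survey rankings (cut-offs assumed)."""
    if score >= 0.75:
        return "Very High EGDI"
    if score >= 0.50:
        return "High EGDI"
    if score >= 0.25:
        return "Medium EGDI"
    return "Low EGDI"

score = egdi(osi=0.82, tii=0.68, hci=0.90)  # hypothetical country
print(round(score, 3), "->", egdi_group(score))  # 0.8 -> Very High EGDI
```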


CONCLUSION

The aim of this chapter has been to describe the main databases produced by supranational organizations on the various components of governance quality, comparing the methods and thematic focuses of different databases, and focusing especially on their methodological robustness and their fit in relation to their respective aims. In doing so, we have examined: two of the most popular and long-lived databases (the Worldwide Governance Indicators of the World Bank and Government at a Glance of the OECD); the system of goals, targets and indicators behind the UN 2030 Agenda for Sustainable Development; and, finally, a sectorial database on the governance of digitalization, shown within a grey circle in Figure 9.3 (the E-Government Development Index, EGDI).

These databases have different aims and traditions, but by paying attention to three main variables of analysis it is possible to describe them from a comprehensive point of view. Two main variables to take into account are the span of governance measurement and the data sources: the first allows us to identify the object of the measurement, in particular whether a database focuses on the inputs, processes, outputs, outcomes or impacts of governance; the second considers whether the data are primary or secondary. Primary data are collected directly by the organization that owns the database, while secondary data are the result of data collection provided by other supranational or national institutions and, sometimes, non-governmental organizations. These two main variables are integrated by a third one, the robustness of data, which considers whether data sources are comparable over time and space and whether the methodology for data collection and aggregation is entirely transparent.

Figure 9.3  Analysis of each database in terms of span of measurement, data sources, and methodological robustness (shown as circle size). Source: Own elaboration.

Figure 9.3 shows the relationships among the three aforementioned variables. The horizontal axis shows the span of governance measurement, identifying the prevalent governance dimension for each database; the vertical axis considers whether data are mainly primary or secondary; and the degree of robustness is described by the size of the circles.

In summary, it is possible to observe that the three main databases on governance in a broad sense – namely, the World Bank (WGI), the OECD (GaaG) and the United Nations (SDGs) – provide a comprehensive eye on the different dimensions of governance. GaaG focuses mainly on the processes of governing, paying attention to the operations of OECD countries' public sectors. WGI and the SDGs are both focused on outcome and impact measurement, but while WGI considers specific dimensions of governance, the SDGs are more focused on the overall impact of an approach to global governance. In other words, the governance dimensions analyzed by the World Bank can be seen as implicit drivers of the desired effects described in the UN 2030 Agenda for Sustainable Development. If we look at the sectorial database on digitalization (EGDI), it is recognizable that it focuses more on the inputs and outputs of the digital society. This reflects the global need to understand and measure the enabling factors of digitalization (inputs, processes and outputs) that must contribute to achieving the relevant economic, societal and environmental outcomes and impacts, as the UN 2030 Agenda identifies them.

Furthermore, it must be noted that the data sources reflect the nature of the analyzed databases. Primary data are commonly used in thematic databases focused on inputs, processes and outputs (GaaG and EGDI). In contrast, secondary sources are used for more systemic databases dedicated to measuring global outcomes and impacts (WGI and SDGs).

Finally, with regard to the robustness of data, three main approaches can be recognized. The first considers databases with a high degree of robustness; it includes GaaG, which presents a robust and transparent methodology of data collection, using primary data directly collected by the database owners. The second includes databases presenting a moderate to high degree of robustness, namely EGDI and the SDGs. Both are characterized by the fact that a 'custodian' actor, often the UN itself, acts as a guarantor for the standardization of data collection and aggregation, giving the final database methodological robustness. Finally, a third approach is exemplified by the World Bank's WGI, in which indicators are based only on secondary sources with no clear and standardized methodology over time, because the data rest on different methodologies adopted by data providers in different regional clusters, which does not ensure full comparability within the WGI over space and time.

To conclude, a weak correlation between greater robustness and a wider use of primary sources seems apparent from our analysis. Conversely, and understandably, as the indicators move towards outcome and impact measures, a stronger use of secondary sources emerges.
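To make the layout of Figure 9.3 concrete, the following sketch reproduces its structure with matplotlib. The coordinates and circle sizes are our own illustrative reading of the qualitative comparison above, not the data behind the original figure.

```python
import matplotlib.pyplot as plt

# Illustrative rendering of the Figure 9.3 layout: span of measurement (x),
# prevalent data sources (y), methodological robustness (circle size).
# Positions and sizes are rough readings of the discussion above, not real data.
databases = {
    "GaaG (OECD)": (1.0, 1.0, 1200),  # processes, primary, high robustness
    "EGDI (UN)":   (2.0, 0.6, 800),   # inputs/outputs, mixed, moderate-high
    "SDGs (UN)":   (3.5, 0.2, 800),   # impacts, secondary, moderate-high
    "WGI (WB)":    (3.0, 0.0, 400),   # outcomes, secondary, lower robustness
}

fig, ax = plt.subplots()
for name, (x, y, size) in databases.items():
    ax.scatter(x, y, s=size, alpha=0.4)
    ax.annotate(name, (x, y), ha="center", va="center", fontsize=8)

ax.set_xticks(range(5))
ax.set_xticklabels(["Inputs", "Processes", "Outputs", "Outcomes", "Impacts"])
ax.set_yticks([0, 1])
ax.set_yticklabels(["Secondary", "Primary"])
ax.set_xlim(-0.5, 4.5)
ax.set_ylim(-0.5, 1.5)
ax.set_xlabel("Span of governance measurement")
ax.set_ylabel("Prevalent data sources")
plt.show()
```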
Primary sources are generally more reliable and normally more comparable over space and time; yet this should not discourage institutions from a continued effort towards the development of appropriate and reliable impact measures, which are arguably the most significant and useful ones in the context of the evolution of governance assessments. As such, there seems to be a trade-off between reliability (in terms of the use of primary data, mostly used for input and process assessments) and usefulness (in terms of the shift towards an outcome- and impact-based focus) within governance indicators.

It should be underlined how, particularly in the case of the assessment of outcomes within governance quality, the process of collating and harmonizing data stemming from different data sources is of paramount importance – the role of 'custodian agencies' in the case of the SDGs being a good example of this. Past experiences have shown the risks of systematic or external biases that can apply to databases using mixed sources and methodologies, especially when they are presented in the form of rankings – the most recent example being the 'Doing Business' (DB) report controversy in 2021.9 This case shows that methodological reliability and disclosure are even more relevant when the rankings presented may have explicit public discourse and policy implications, as was the case for DB and is the case for the databases discussed in this chapter.

NOTES

1. To see the complete list of data sources and data providers, consult the WGI web page: https://info.worldbank.org/governance/wgi/.
2. The total number of indicators listed in the global indicator framework of SDGs is 248. However, 13 indicators repeat under two or three different targets (see below).
3. https://unstats.un.org/unsd/methodology/m49/ (last accessed 4 January 2023).
4. https://unstats.un.org/sdgs/dataportal/ (last accessed 4 January 2023).
5. https://ec.europa.eu/eurostat/web/sdi/overview/ (last accessed 4 January 2023).
6. https://ec.europa.eu/info/business-economy-euro/recovery-coronavirus/recovery-and-resilience-facility_en (last accessed 4 January 2023).
7. https://www.istat.it/it/archivio/275128/ (last accessed 4 January 2023).
8. https://publicadministration.un.org/egovkb/Data-Center/ (last accessed 4 January 2023).
9. https://www.worldbank.org/en/news/statement/2021/09/16/world-bank-group-to-discontinue-doing-business-report (last accessed 28 December 2022).

BIBLIOGRAPHY

Alessandrini, M., Pasturel, A., Hat, K., Munch, A., Furtado, M.M., & Marinovic, P. (2022). Synergies between the Sustainable Development Goals and the National Recovery and Resilience Plans – best practices for local and regional authorities. European Union Committee of the Regions publication.
Bannister, F., & Connolly, R. (2012). Defining e-governance. e-Service Journal, 8(2), 3–25.
Bouckaert, G., & Halligan, J. (2008). Managing performance: International comparisons. Routledge.
Engvall, T., & Flak, L.S. (2022). Digital governance as a scientific concept. In Y. Charalabidis, L.S. Flak, & G.V. Pereira (Eds.), Scientific foundations of digital governance and transformation (pp. 25–50). Springer.
European Commission (2022). Digital Economy and Society Index (DESI) 2022 methodological note.
European Commission (2022). Digital Economy and Society Index (DESI) 2022 thematic chapters.
Kaufmann, D., Kraay, A., & Mastruzzi, M. (2010). The Worldwide Governance Indicators: Methodology and analytical issues. World Bank Policy Research Working Paper, No. 5430.
Lafortune, G., Gonzalez, S., & Lonti, Z. (2018). Government at a glance: A dashboard approach to indicators. In D.V. Malito, G. Umbach, & N. Bhuta (Eds.), The Palgrave handbook of indicators in global governance (pp. 207–38). Palgrave Macmillan.
Misuraca, G.C., & Viscusi, G. (2014). Digital governance in the public sector: Challenging the policy-makers innovation dilemma. In 8th International Conference on Theory and Practice of Electronic Governance (ICEGOV 2014) (pp. 146–54). ACM.
OECD (2011). Government at a Glance 2011. OECD Publishing.
OECD (2013). Government at a Glance 2013. OECD Publishing.
OECD (2015). Government at a Glance 2015. OECD Publishing.
OECD (2017). Government at a Glance 2017. OECD Publishing.
OECD (2019). Government at a Glance 2019. OECD Publishing.

OECD (2021). Government at a Glance 2021. OECD Publishing.
Stibbe, D., & Prescott, D. (2022). The SDG partnership guidebook: A practical guide to building high-impact multi-stakeholder partnerships for the Sustainable Development Goals. United Nations.
United Nations (2014). A world that counts – mobilising the data revolution for sustainable development. Report prepared at the request of the United Nations Secretary-General by the Independent Expert Advisory Group on a Data Revolution for Sustainable Development.
United Nations (2015). Indicators and a monitoring framework for the Sustainable Development Goals – launching a data revolution. A report to the Secretary-General of the United Nations by the Leadership Council of the Sustainable Development Solutions Network.
United Nations (2017). Global indicator framework for the Sustainable Development Goals and targets of the 2030 Agenda for Sustainable Development. A/RES/71/313.
United Nations Department of Economic and Social Affairs (2022). E-government survey 2022: The future of digital government.
United Nations Economic Commission for Europe (UNECE) (2020). Measuring and monitoring progress towards the Sustainable Development Goals.
Van Dooren, W., Bouckaert, G., & Halligan, J. (2015). Performance management in the public sector. Routledge.

10. Measuring the quality of collaborative governance processes

Joop Koppenjan

INTRODUCTION

This chapter focuses on the evaluation of collaborative governance processes and the measurement of the quality of these processes. Although the research on collaborative governance has grown enormously over the last decades, authors do not necessarily agree upon the exact nature of collaborative governance (see Emerson et al., 2011; Huxham, 1996; O'Leary & Vij, 2012). Ansell and Gash (2008), for instance, define collaborative governance as a 'governing arrangement where one or more public agencies engage nonstate stakeholders in a collective decision making process that is formal, consensus-oriented, and deliberative and that aims to make or implement public policy or manage public programs or assets' (p. 544). Emerson et al. (2011), in contrast, do not limit their definition to formal, state-initiated arrangements. In their view, collaborative governance can include the involvement of various public, private, and societal parties and is not necessarily government-initiated, nor is a formal collaborative arrangement required. In this chapter, this broader view on collaborative governance is followed.

A first step in seeking to develop a methodology to evaluate the quality of collaborative governance processes is to define these processes. Various authors have developed theoretical frameworks that conceptualize the phenomenon of collaborative governance (e.g., Ansell & Gash, 2008; Bryson et al., 2006, 2015; Emerson et al., 2011; Koppenjan & Klijn, 2004; Sørensen & Torfing, 2009). Despite commonalities, the ways in which these frameworks define and organize the variables that make up the collaborative governance process vary. Attempts to deal with the interactive and dynamic nature of the collaboration process often result in a conflation of components, antecedents, and effects of the collaboration process. In this study, a definition is sought that does not include antecedents and effects. The collaborative governance process is defined as a series of activities and interactions among participants involved in the co-creation or joint implementation of public policies, programmes or assets. These activities and interactions may be self-governing, but may also include deliberate governance efforts of actors that aim to enhance collaboration and realize collaborative advantages and public values (Bryson et al., 2015; Emerson et al., 2011; Klijn & Koppenjan, 2016; Provan & Kenis, 2008).

In attempts to show that collaboration is more than 'drinking cups of tea', various authors have suggested evaluation frameworks aimed at measuring the performance of collaborative governance practices (Emerson & Nabatchi, 2015; Mandell & Keast, 2007; Voets et al., 2008). Performance, however, is an ambiguous concept. Performance may refer to the realization of joint activities and substantial outcomes in the short and the long run (Emerson & Nabatchi, 2015; Innes & Booher, 1999; Provan & Milward, 2001). Evaluation studies also mention the development of rules, norms, processes, and procedures for collaboration as dimensions of success and effectiveness (Bryson et al., 2015; Turrini et al., 2010). These are institutional outcomes that contribute to the stabilization of the collaborative network and may enhance future collaborations (cf. Emerson & Nabatchi, 2015; Innes & Booher, 1999; Mandell & Keast, 2008; Voets et al., 2008).

This chapter, however, does not focus on substantive and institutional outcomes, but rather on the process qualities of collaboration. As the way in which actors interact and relate to one another forms a crucial precondition for the emergence and success of joint activities and the realization of outcomes and mutual gains, assessing the quality of collaboration processes is an important dimension of measuring the success of collaborative governance (Bianchi et al., 2021; Bryson et al., 2015; Klijn & Koppenjan, 2016; Mandell & Keast, 2008; Voets et al., 2008). Process quality refers to the ways in which actors engaged in collaborative governance behave and the collective effects of their actions on how interactions evolve and what this means for other actors participating in these processes. Whereas substantive outcomes refer to the 'what' that is accomplished by collaboration, and institutional outcomes to the institutional conditions that will enhance or hinder future collaboration, a process perspective refers to 'how' collaborations evolve and how substantive outcomes are accomplished.

It should be noted that, in practice, it is hard to distinguish substantive outcomes, institutional impacts, and process quality, because they influence one another and evolve cyclically and dynamically. As Innes and Booher (1999, p. 415) state: 'Processes and outcomes cannot neatly be separated … because the process matters in and of itself and because the process and outcome are likely to be tied together.' Nevertheless, a focus on the quality of the process is important, certainly if we acknowledge that the process 'matters in and of itself'. A process perspective has a distinct nature and purpose compared to an evaluation of substantive outcomes or institutional effects. A successful process may mean something quite different than the accomplishment of a substantive mutual benefit. It may well be that a successful substantive outcome is realized but the process falls short, and the other way around.

KEY PURPOSES OF MEASURING THE QUALITY OF COLLABORATIVE GOVERNANCE PROCESSES

O'Leary and Vij (2012) take stock of the research on collaborative public management, and they state that there is no 'set of valid, reliable, recognizable measures for analysing and comparing different collaborations and drawing conclusions on how to foster and maintain effective collaborations' (p. 517). Therefore, clarifying how collaboration and collaboration processes can be measured is an important endeavour to advance the research on collaborative governance. In this chapter, both the conceptual and the methodological topics that underlie the measurement of collaborative governance processes are discussed. These topics will be addressed in the sections that follow.

This section reflects on the key purposes of the measurement of collaborative governance processes. These purposes may differ, with implications for the type and the shape of measurements. Broadly speaking, the quality of collaborative governance processes can be measured in the context of research of collaborative practices and research for these practices (see Enserink et al., 2013). Research of collaboration practices will first and foremost be focused on trying to understand and explain the dynamics and complexities of interaction processes and the patterns and mechanisms that underlie them. Research for collaborative practices will be aimed at informing these practices and therefore has a focus on evaluation and the identification of direct antecedents that can be mitigated. Despite these different purposes, the distinction between these two kinds of analysis should not be exaggerated. The first type of research produces knowledge that may inform collaborative practices too, although research for collaborative practices will probably be more instrumental to the design, management, and evaluation of collaborative processes. Moreover, although research for collaborative practices may in the first instance be focused on assessing the quality of processes, audits and assessments may be looking for causes of performance too.

Skelcher and Sullivan (2008) make a distinction between a theory-driven and a metric-driven approach in analysing collaborative performance, the latter being based on the availability of indicators that allow measurement. In this chapter, the suggested measures of process quality are derived from theoretical frameworks, informed by research of collaborative practices. They are intended to be of use for both types of research.

To sum up, the purpose of measuring the quality of collaborative governance practices may be to contribute to the further theoretical and methodological development of collaborative governance research. It may also aim to inform the design, management, and evaluation of collaborative processes, play a role in how these practices are held accountable, or give account to participants and the outside world. Thus, it may contribute to the quality and legitimacy of these practices and the capacity to learn and improve (Bianchi et al., 2021; Bryson et al., 2015).

MEASURES FOR ASSESSING THE QUALITY OF COLLABORATIVE GOVERNANCE PROCESSES

In this section, an overview of process measures is presented, based on characteristics of collaboration processes derived from the conceptual frameworks of prominent collaborative governance scholars. Although each of these frameworks uses its own terminology, and orders and connects the components of the collaborative process differently, they generally agree on which characteristics of the process are important and why. In the literature, a high process quality is predominantly defined as contributing to collaboration success in terms of effective collaboration resulting in effective collaborative (substantive) outcomes (Bryson et al., 2015; Emerson & Nabatchi, 2015; Mandell & Keast, 2007). In addition, various authors stress the importance of good relationships, democratic values, and an open and fair process (Klijn & Koppenjan, 2016; Newman et al., 2004; Purdy, 2012; Sørensen & Torfing, 2009). The set of measures presented here builds on both effectiveness considerations and considerations of good and democratic governance.

Since in the frameworks the collaboration process itself is not always clearly delineated and is often conflated with antecedents, the measures we suggest are based on both characteristics of the collaboration process and their direct antecedents. Antecedents that could be more clearly distinguished from the process, and contextual factors, were not included, nor were substantive and institutional outcomes. Process quality measures can take various values that contribute to, or detract from, process quality. They are clustered under the headings: nature of interaction; shared motivation; democratic legitimacy; leadership, management and (meta-)governance; and governance structure. For an overview, see Figure 10.1.

Figure 10.1  Measures for assessing the quality of collaborative governance processes

1. Nature of Interaction

The literature mentions various qualities of collaboration processes that reinforce the willingness to collaborate and contribute to the realization of joint and informed outcomes.

Variety and enrichment
The variety of considerations, problem definitions, and alternatives that are considered in the collaboration process is a first indicator of process quality. This variety increases the capacity to realize enriched, inclusive solutions and mutual benefits that do justice to the various diverging or even conflicting interests involved (Ansell & Gash, 2008; Sørensen & Torfing, 2009). Variety in ideas and solutions also enlarges the potential to overcome controversies, which often become stuck on specific perceptions and solutions (Klijn & Koppenjan, 2016).

Decisiveness and transaction costs
Interactions are expected to produce results. The extent to which decisions are reached and collaborative activities are undertaken is a measure of quality (Emerson et al., 2011). It contributes to the outcome legitimacy of these processes. Sørensen and Torfing (2009) speak of the smoothness of (implementation) processes. The capacity to produce results is also reflected in the absence of prolonged and stagnated interactions and in low transaction costs (Klijn & Koppenjan, 2016; Voets et al., 2008). Various authors also mention small wins as indicators of the effectiveness of collaboration. Small wins are important in enhancing shared motivation and internal and external support for the collaboration process (Ansell & Gash, 2008; Emerson et al., 2011; Sørensen & Torfing, 2009).

Flexibility and innovativeness
Variety also refers to the presence of new and innovative ideas and of actors who fulfil the role of catalysts, coming up with unexpected, unorthodox ideas, enriching the debate and widening the scope of the discussion. Openness to new developments and changing priorities during the collaboration is important. It implies responsiveness, flexibility and adaptiveness, which are preconditions for being innovative and seizing unanticipated opportunities to realize mutual gains (Klijn & Koppenjan, 2016; Sørensen & Torfing, 2009).

Intensity of interaction and proximity
The intensity of interaction refers to the frequency with which actors meet, interact, and communicate. The more frequently they interact, the more they will get to know and trust one another and develop a sense of commitment and mutual obligation towards one another and a sense of ownership of the process (Klijn & Koppenjan, 2016; Mandell & Keast, 2007). Besides frequency, authors point to the need for face-to-face meetings, rather than other ways of communicating (Ansell & Gash, 2008; Bryson et al., 2015). Proximity and personal involvement matter.

Knowledge sharing and learning
An important process quality is the extent to which deliberations and decisions are informed by correct information and knowledge (O'Leary & Vij, 2012; Sørensen & Torfing, 2009). Certainly, in the current societal and political climate, in which governmental institutions, experts, science, and mass media are distrusted and contested, and discussions are infested with dubious truth claims and conspiracy theories, this process quality is important. However, science often is not conclusive and experts may disagree (Koppenjan & Klijn, 2004). What is more, solutions require the integration of various strands of knowledge from different experts and stakeholders. However, actors find it difficult to accept knowledge and expertise that does not confirm their own convictions. The literature suggests that actors should engage in joint fact finding in order to arrive at negotiated truths (Ansell & Gash, 2008; De Bruijn & Ten Heuvelhof, 2010; Klijn & Koppenjan, 2016). Various authors emphasize the importance of the capacity to learn (Bryson et al., 2015; Sørensen & Torfing, 2009).

2. Shared Motivation

Various authors mention the presence of a shared motivation among actors as a quality of the collaboration process. In this section, various dimensions of shared motivation are discussed.

Mutual understanding
Collaboration requires a certain level of mutual understanding (Ansell & Gash, 2008; Emerson et al., 2011). Actors have different backgrounds, different interests and missions, different skills, resources, and working procedures, and different cultures. Institutional logics may diverge or even conflict (Bryson et al., 2015; Newman et al., 2004; O'Leary & Vij, 2012). These backgrounds shape and constrain actors' capabilities to collaborate. Knowledge of these backgrounds is helpful in collaboration. Certainly, actors should acknowledge their mutual dependence (Ansell & Gash, 2008; Emerson et al., 2011; Sørensen & Torfing, 2009). Ansell and Gash (2008) mention the importance of common norms and values.

Internal legitimacy, commitment, and shared mission
Collaboration requires the support of the participating actors (Ansell & Gash, 2008; Emerson et al., 2011; O'Leary & Vij, 2012; Sørensen & Torfing, 2009). A certain minimal level of agreement is needed (Ansell & Gash, 2008; Bryson et al., 2006; O'Leary & Vij, 2012). Some authors mention the development of a sense of shared ownership of the project (Ansell & Gash, 2008). The extent to which participants share a general agreement on the problem and a common mission and purpose is important too, as opposed to a situation in which they pursue different objectives and strategies (Ansell & Gash, 2008; Bryson et al., 2015; Emerson & Nabatchi, 2015; Mandell & Keast, 2008). De Bruijn and Ten Heuvelhof (2010) speak of a sense of urgency that brings actors together and creates an urge to collaborate and arrive at coordinated outcomes. Koschmann et al. (2012) emphasize the need for a compelling story and 'authoritative texts' that give the collaboration direction and shape the perceptions of participants.

Quality of relations and mutual trust
Mandell and Keast (2008) state that the nature of the relations among collaborating actors is an important indicator of process quality. It is important that participants get along in terms of working together and resolving conflicts. The presence of conflicts does not in itself indicate bad process quality. Conflicts can be seen as an expression of actors' commitment to the common purpose and the process. They become problematic when they are not addressed, turn dysfunctional, and result in dialogues of the deaf (Klijn & Koppenjan, 2016).

There is a firm consensus among authors that mutual trust is an important condition for collaboration to be successful (Ansell & Gash, 2008; Bryson et al., 2006; Newman et al., 2004; O'Leary & Vij, 2012; Sørensen & Torfing, 2009). Trust refers to the conviction that the actors participating in a collaboration have good intentions, are reliable and competent, and will not harm others, especially not in situations of vulnerability. Trust may refer to how the intentions of other actors are valued, but also to their competences. The level of trust among individuals in collaboration processes may differ from the trust among the involved organizations (Bryson et al., 2015).

3. Democratic Legitimacy

Various authors stress the importance of the democratic nature of collaborative processes. Sørensen and Torfing (2009) speak of democratic anchorage. The question of what exactly is considered to be democratic can be answered in different ways. The following measures can be derived from the literature.

Political mandate, accountability, and political support
A first indicator of the democratic legitimacy of collaboration processes is the extent to which they are influenced and monitored by elected politicians (Bryson et al., 2006; Sørensen & Torfing, 2009). Collaboration processes may be initiated and mandated by elected politicians. Monitoring implies that accountability mechanisms are in place by which collaborations give account to elected politicians or representative bodies. This also allows elected politicians to influence the course taken by the collaboration. Mandates and accountability mechanisms contribute to the democratic legitimacy of collaborative processes (Bryson et al., 2015; Koliba et al., 2018; O'Leary & Vij, 2012; Voets et al., 2008). Of course, these formal arrangements do not guarantee politicians' support for the collaboration process. Dynamics in the relationship between politicians and the collaborative process may lead to diminished support. Apart from the presence of a formal mandate and accountability mechanisms, the actual support of elected politicians is an indicator of the democratic legitimacy of the collaboration (Sørensen & Torfing, 2009).

Inclusiveness, representativeness, and power balance
An important dimension of the democratic nature of collaboration processes is the participation not only of the actors whose resources and skills are needed to make the collaboration successful, but also of the actors affected by the collaboration (Bryson, 2004). It can, however, also be argued that participation should not be open to just anyone. Actors should have a clear stake in order to be allowed to join the collaborative process and have a voice. Sørensen and Torfing (2009) argue that actors should be representative of the constituencies that they are supposed to represent. In addition, not all actors may have the same resources and skills to participate on an equal footing. Various authors mention power imbalances as important threats to successful collaboration (Ansell & Gash, 2008; Bryson et al., 2015; Emerson et al., 2011; O'Leary & Vij, 2012; Purdy, 2012; Sørensen & Torfing, 2009). Power imbalances are caused by an unequal distribution of resources over participants. The democratic quality of the collaboration process depends on the presence of efforts to empower actors, for instance, by allocating resources or offering training programmes to enhance skills (Ansell & Gash, 2008; Purdy, 2012; Sørensen & Torfing, 2009).

Due deliberation and good faith negotiation
Due deliberation refers to the extent to which the interaction among participating actors is characterized by an open conversation, in which different opinions are exchanged and discussed and in which the pros and cons of a variety of solutions, problem definitions, and underlying considerations and values are expressed and critically discussed (Emerson et al., 2011; Klijn & Koppenjan, 2016; Sørensen & Torfing, 2009). This also presupposes actors' willingness to collaborate and their openness to exploring mutual gains instead of pursuing their own interest (Ansell & Gash, 2008; Sørensen & Torfing, 2009). The aim is to look for collaborative actions and solutions that do justice to the various interests and concerns involved, or for ways to compensate interests or actors that are harmed (Klijn & Koppenjan, 2016). In the literature, this is referred to as due deliberation, good faith negotiation, and frame reflection (Ansell & Gash, 2008; Rein & Schön, 1993). The opposite consists of processes in which alternative opinions are discouraged, disqualified, or excluded, and actors are misled or deflected (compare Newman et al., 2004).

Transparency and procedural fairness
Frequently mentioned democratic qualities of collaboration processes include transparency and open communication (Ansell & Gash, 2008; Klijn & Koppenjan, 2016; O'Leary & Vij, 2012; Sørensen & Torfing, 2009). Transparency is of interest to actors potentially affected by the collaboration but not yet participating in it. Transparency is also important to allow external scrutiny and to prevent internal practices that cannot bear the daylight. Some authors stress the importance of the collaboration's visibility in the outside world and the use of active communication strategies and media to reach the general public (Sørensen & Torfing, 2009).

An important dimension of the democratic legitimacy of collaboration is the extent to which procedures are clear, rules are fair, core interests are protected, and expectations of a fair outcome are justified (Bryson et al., 2015). Procedures that regulate conflicts and allow for objection and appeal contribute to safeguarding fairness (Klijn & Koppenjan, 2016; Ostrom, 1990; Sørensen & Torfing, 2009). As far as expectations are concerned, it may well be that some actors are only invited to be informed or to give their consent to decisions, without much room to influence them. Their presence may be of a symbolic nature, meant to further the external legitimacy of the collaboration. Newman et al. (2004) speak of the political opportunity structure and show that, in the collaborative cases they studied, local collaboration processes were kept at a distance from what city governments considered strategic issues. Various authors argue that the collaboration's scope or mission should be made clear to avoid unrealistic expectations and delusions (Ansell & Gash, 2008; O'Leary & Vij, 2012).

4. Leadership, Management, and (Meta-)Governance

Authors agree on the presence of leadership as a condition for successful collaboration (Ansell & Gash, 2008; Bryson et al., 2015; Emerson et al., 2011). Some authors speak of the management of the collaboration process or of the network of collaborating actors (Klijn & Koppenjan, 2016; Koliba et al., 2018; O'Leary & Vij, 2012; Turrini et al., 2010), others of (meta-)governance (Sørensen & Torfing, 2009). Various dimensions of leadership, management, and (meta-)governance are mentioned.

Facilitation and conflict regulation
Various authors state that collaboration requires a facilitator, who invests resources to bring actors together, organizes meetings, provides meeting places and facilities, and structures deliberations (Ansell & Gash, 2008; Bryson et al., 2015; Emerson et al., 2011; Koliba et al., 2018; Turrini et al., 2010). Facilitation also implies the presence of planning activities to manage the collaboration process: the setting of goals, activities, results, and deadlines. Facilitation further implies the enforcement of agreements and rules (Ansell & Gash, 2008; Bryson et al., 2015; Klijn & Koppenjan, 2016; Sørensen & Torfing, 2009). The independence of the facilitator is important, given the diversity and sometimes conflicting views and interests of participants. Besides facilitation, the presence of conflict management and rules for conflict regulation are important conditions for successful collaboration. These imply the presence of mediation and arbitration in situations of disagreement and conflict (Bryson et al., 2015; Klijn & Koppenjan, 2016; Sørensen & Torfing, 2009).

Motivation and communication
An important dimension of leadership is motivating and persuading actors to participate and invest their resources (Emerson et al., 2011; Koschmann et al., 2012; O'Leary & Vij, 2012; Sørensen & Torfing, 2009). This is often done by framing the issue at hand, stressing the urgency of collaboration, and providing a clear purpose and direction and an attractive agenda that emphasizes the benefits of collaboration to potential participants (Bryson et al., 2015; De Bruijn & Ten Heuvelhof, 2010). Here, not only the formal position of leaders or managers is important, but also their personal skills and qualities. The authoritativeness and charisma of leaders and managers matter, as does the extent to which they prove to be capable leaders and managers during the process, taking credible decisions (Emerson et al., 2011). This will enhance participants' trust in the leadership and the process. Communication during the various phases of the collaboration to keep participants informed and motivated is paramount (Koschmann et al., 2012; O'Leary & Vij, 2012; Sørensen & Torfing, 2009).

Securing resources and external support
An important function of leadership, management, and (meta-)governance is to create and maintain external support and legitimacy. It is important that participants are supported by their parent organizations, as these provide participants with the mandate to collaborate and supply resources, and may defend the collaboration vis-à-vis the outside world. The wider societal support of relevant constituencies, media, and the general public for the collaborative process is important to create stable conditions for its performance over time (Ansell & Gash, 2008; Bryson et al., 2015; Emerson & Nabatchi, 2015; Emerson et al., 2011; Mandell & Keast, 2008; Sørensen & Torfing, 2009). Securing resources and external legitimacy requires an external orientation, anticipating and responding to external threats and opportunities, and communication with relevant publics (Koliba et al., 2018; Koschmann et al., 2012; Turrini et al., 2010).

5. Governance Structure and the Capacity for Joint Action

Various authors state that structural configurations and arrangements enhance the capacity for joint action within collaborations (Bryson et al., 2015; Koliba et al., 2018; Turrini et al., 2010).
This structure may have emerged during the collaboration, but it may also have been consciously designed.

Ansell and Gash (2008) speak of institutional design, others of process design (De Bruijn & Ten Heuvelhof, 2010; Klijn & Koppenjan, 2016). Governance structures are made up of the following dimensions.

Actor constellation and availability of resources and skills
The constellation of participants is an important dimension of the governance structure of a collaborative process (Ansell & Gash, 2008; Klijn & Koppenjan, 2016; O’Leary & Vij, 2012; Sørensen & Torfing, 2009). It determines the presence of resources, expertise, and skills, and thereby collaborative capacity (Bryson et al., 2015; Emerson et al., 2011). The strength and structure of relations matter too. Strong ties between actors make the exchange of resources easier, as does their central position in the network. Skills and resources of actors that are less central in the network are harder to mobilize for common purposes (Kapucu & Hu, 2020; Koliba et al., 2018). The ambitions and purpose of collaborations and the accompanying transaction costs should be in balance with actors’ resources and skills and the structure of their relationships.

Supportive arrangements
The presence of informal and formal arrangements, the availability of platforms, and the accessibility of arenas in which decision making takes place are important conditions for successful collaboration (Ansell & Gash, 2008; Bryson et al., 2006; Emerson et al., 2011; O’Leary & Vij, 2012; Sørensen & Torfing, 2009). These structural provisions make collaboration less dependent upon persons and eventualities. Agreements like covenants and contracts reduce uncertainties, make actors’ behaviour predictable, and may strengthen trust or compensate for a lack of trust (Klijn & Koppenjan, 2016; Turrini et al., 2010). Provan and Kenis (2008) suggest that the type of governance structure is contingent upon the structural constellation of the collaboration: when collaborations have a low number of participants, a high level of consensus and trust, and well-developed skills to collaborate, self-governance by participants suffices, whereas in other situations a more centralized mode of governance is needed: governance by a lead organization or a network organization. The governance structure should match the characteristics of the collaboration process that it is meant to support.

Clear and agreed-upon ground rules
Various authors mention the presence of clear and agreed-upon ground rules that guide the collaboration as an important condition for success and legitimacy (Ansell & Gash, 2008; Bryson et al., 2015; Emerson et al., 2011; Klijn & Koppenjan, 2016; O’Leary & Vij, 2012; Sørensen & Torfing, 2009). These rules may specify the scope of the collaboration, the nature and division of roles and responsibilities, the conditions under which actors may join or leave the collaboration, the way in which decisions are taken, and how benefits, costs, and risks will be divided (cf. Klijn & Koppenjan, 2016; Ostrom, 1990). These rules align expectations, reduce the uncertainties and risks under which actors collaborate, and protect participants against arbitrariness.

Results of the Measurement
Measuring collaboration process quality may result in findings that show that the process scores differently on the various measures.

The results of the measurement (for instance, visualized in a spiderweb diagram) provide an overview of the strengths and weaknesses of the process in terms of quality, and may give an indication of the respects in which the collaboration falls short and can be improved. This raises the question of whether all measures are equally important and whether they can compensate for one another. As the measures are mentioned in the various frameworks on which this overview is based, it can be argued that they are all important and that no clear hierarchy exists. Moreover, as these criteria are used to measure process quality, they cannot simply be added up or compensate for one another: they each have a value of their own. It should be acknowledged, though, that they are hard to realize all at the same time, and that they may conflict and present practitioners with dilemmas. Bryson et al. (2015) and O’Leary and Vij (2012) argue that managing collaboration processes implies dealing with tensions and paradoxes. Measuring process quality may help to make these dilemmas explicit and open to deliberation. Furthermore, process characteristics are not static. They may evolve over time, as may the quality of the process. Assessment, therefore, should take this dynamic nature of processes into account. A measurement carried out at a single juncture may inform the collaborating participants at that point in time, but it does not necessarily say anything about the future or past quality of the process. Mandell and Keast (2008) argue that collaborations go through a life cycle and that the characteristics and requirements of what can be seen as a good process evolve over time. In earlier phases of collaborations, qualities like ambitions, motivation, and facilitation are considered important, whereas shared motivations and relationships still have to be built and outcomes cannot yet be expected. In later phases, commitment, accountability, and rule enforcement become more important, as do outcomes.
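To illustrate how such results might be handled, the following minimal sketch (in Python, with invented measure names and 1–5 scores) assembles scores on a set of process-quality measures and flags the weakest ones, as one might do before plotting a spiderweb diagram. It deliberately avoids aggregating the measures into a single score, reflecting the point above that each criterion has a value of its own.

    quality_scores = {  # hypothetical scores: 1 = very low ... 5 = very high
        "facilitation and conflict regulation": 5,
        "motivation and communication": 3,
        "securing resources and external support": 2,
        "supportive arrangements": 4,
        "clear ground rules": 4,
    }

    # Report each measure in its own right rather than summing them.
    lowest = min(quality_scores.values())
    weakest = [m for m, s in quality_scores.items() if s == lowest]
    print("Scores:", quality_scores)
    print("Potential areas for improvement:", ", ".join(weakest))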

METHODS: HOW TO MEASURE THE QUALITY OF COLLABORATIVE GOVERNANCE PROCESSES

The quality of collaborative governance processes can be measured in various ways. Here, we discuss six methods that are commonly used: process mapping by temporal bracketing, participatory observations, quantitative research using surveys, social network analysis, serious gaming and simulations, and participatory evaluations.1

Process Mapping by Temporal Bracketing
The first method to measure the quality of collaborative governance processes consists of the in-depth reconstruction of these processes by qualitative research, often case studies. By describing how processes have evolved over time, it becomes possible to characterize them and assess their qualities. This method analyses processes in retrospect, implying that it is an ex post or an ex durante exercise. Process mapping involves temporal bracketing: deconstructing the processes into successive periods and describing participants’ activities within these periods (Langley, 1999). An example of temporal bracketing is the so-called rounds model (Koppenjan & Klijn, 2004; Teisman, 2000). It suggests reconstructing a timeline of events and decisions in the collaboration process and distinguishing various periods during which the conditions under which actors interact are relatively stable. Rounds are separated from one another by shifts in conditions that have an important impact upon the nature of the interaction, either content- or process-wise.

These shifts may consist of an external event or a far-reaching decision that acts as a ‘game changer’ and rearranges the conditions under which actors interact in the next round. Rounds do not coincide with formal process stages, because processes do not develop in a linear way, although the realization of an essential in-between process outcome that brings the process further may mark the transition between rounds. A round has a specific, relatively stable configuration of conditions that constrain and shape the collaboration process and result in a specific process quality. The reconstruction of the various rounds makes it possible to measure the way in which the quality of the process has evolved over time. Measuring process quality is not just the application of a checklist to see whether certain process characteristics are present, but rather a contingent assessment that should do justice to the difficulties and dilemmas encountered by the process over time. This method is labour intensive, however. Data sources include documents (official statements, reports, agreements, communications, and media reports and posts), observations, and interviews with participants, experts, and outsiders. The risk of ex post rationalizing by respondents is a challenge. This may be countered by triangulation of interviews and data sources.
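As a simple illustration of temporal bracketing, the sketch below (in Python; all dates and events are invented, and the bracketing rule is an assumption for illustration only) partitions a reconstructed timeline into rounds wherever a ‘game changer’ shifts the conditions of interaction. In practice, of course, the analyst judges which shifts count as game changers.

    # Each entry: (date, event, acts_as_game_changer)
    timeline = [
        ("2019-03", "collaboration launched", False),
        ("2019-09", "covenant signed", False),
        ("2020-01", "national funding cut", True),   # game changer: new round
        ("2020-06", "new lead organization", True),  # game changer: new round
        ("2020-11", "joint plan adopted", False),
    ]

    rounds, current = [], []
    for date, event, game_changer in timeline:
        if game_changer and current:
            rounds.append(current)  # close the round before the shift
            current = []
        current.append((date, event))
    rounds.append(current)

    for i, r in enumerate(rounds, 1):
        print(f"Round {i}: {r}")  # each round can then be assessed for quality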
Participatory Observations
The quality of the collaboration process can also be measured by using ethnographic methods, and more specifically by observations (Ansell & Gash, 2008; Mandell & Keast, 2007). This implies that the researcher is present during sessions and meetings that take place in the context of a collaborative process. The various participants involved in the process must agree to the presence of the researcher. It also presupposes that the researcher merely observes what is happening without interfering. The observations should be aimed at identifying and recording actions and behaviours that are indicative of the process quality measures. The strength of this method is that it reports what is actually happening during the collaboration process. It may, on the other hand, be hard to understand or interpret what is happening. This implies that the observer must conduct additional research to become informed about the background of behaviour and interactions and conduct interviews with participants (cf. Meads, 2017). This method is therefore labour intensive and only reveals the quality of the process in a micro-setting over a specific time span, whereas a collaborative process may evolve in various arenas over a prolonged time period. Studies are, however, available with observations over an extended period (see, e.g., Ulibarri, 2019).

Quantitative Research Using Surveys
A third measurement method is that of surveys (see, e.g., Ansell & Gash, 2008; van Meerkerk et al., 2019). By asking process participants to fill in questionnaires with items that measure process quality, it is possible to include a greater number of participants in the analysis and to measure the various aspects of quality in a more systematic, controlled way. It also allows for including a larger number of cases in an assessment, thus measuring the process quality of a collaborative practice, or making comparisons between various processes. Conducting a survey implies measuring at one specific juncture. It provides a snapshot rather than a more dynamic image of the development of quality over time. However, as the survey can be repeated fairly easily, this method allows for monitoring the quality of the collaboration process over time.

It can be done at the start of the collaboration, at various points during the collaboration, and at the end. In this way, the issue of the cyclical and dynamic nature of collaboration processes is dealt with too. The measurement can differentiate between the types of stakeholders involved and assess in more detail the differences in how various stakeholders appreciate the process, and how the process takes their specific needs and ambitions into account. At the same time, it should be acknowledged that a questionnaire can only have so many items, thereby limiting the possibility of measuring quality dimensions exhaustively. It should also be emphasized that surveys measure perceptions (see, e.g., Hui & Smith, 2022; Warsen et al., 2018). So, despite the larger n and the quantitative way in which the answers are processed, what is measured is subjective by nature and vulnerable to socially desirable responses. These risks need to be countered by a careful selection of participants and by well-developed and validated questionnaires and items. Furthermore, outcomes may be validated by additional analysis or focus groups.

Social Network Analysis
By mapping the frequency of contacts (ties) between participants (nodes) interacting in a collaboration process, it is possible to identify the structure of the network and the strength of the relationships between actors (Kapucu & Hu, 2020; Koliba et al., 2018; Lemaire & Raab, 2019). This information can be gathered by conducting a survey among the process participants, asking them to mention the frequency of their contacts with other participants over a certain timeframe. Analysis then allows for identifying the density of interactions, the centrality of certain participants in the process, and the marginal position of others. This is relevant, because one of the qualities of processes is the establishment of stable relationships and the inclusiveness of the interactions (Mandell & Keast, 2007; Turrini et al., 2010). A limitation of this method is that it does not provide information on the quality of contacts. Additional methods should be applied to get a fuller picture. If a survey is conducted, additional questions addressing more qualitative dimensions of relationships may be added. Social network analysis may also be supplemented with interviews with a representative subset of participants.
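A minimal sketch of such an analysis, in Python using the networkx library, might look as follows; the participant names and contact frequencies are invented, and a weighted, undirected graph is assumed for simplicity.

    import networkx as nx

    # Hypothetical survey responses: (participant, participant, contacts/month)
    contacts = [
        ("municipality", "water board", 8),
        ("municipality", "residents' group", 5),
        ("water board", "residents' group", 1),
        ("municipality", "firm", 3),
    ]

    G = nx.Graph()
    G.add_weighted_edges_from(contacts)

    # Density indicates the inclusiveness of the interactions; degree
    # centrality helps distinguish central from marginal participants.
    print("Density:", nx.density(G))
    print("Degree centrality:", nx.degree_centrality(G))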
Serious Games and Simulations
Another method to measure process quality is to simulate a collaborative interaction situation, for instance by using a serious game (Bryson et al., 2015; Kelley & Johnston, 2012; Medema et al., 2016; Olejniczak et al., 2020). A serious game simulates a real-life situation by way of roleplay in which the conditions for interaction are controlled and manipulated. This roleplay may be IT supported, but this is not essential. The aim is to introduce actors into a situation in which they are challenged to fulfil their role and to experience the implications of their behaviour. Serious games are especially used for training and education purposes. In the context of measuring the quality of interaction processes, these games may be used to simulate and evaluate collaborative governance processes and to inform discussions among (potential) collaboration process participants on what can be considered high-quality collaboration processes and how they can contribute to them. This may concern hypothetical interaction situations, but also the replay of interactions that actually happened. Playing these games and doing simulations may make participants more sensitive to the requirements of high-quality collaboration processes.

Participatory Evaluations
The quality of collaboration processes can also be assessed by inviting process participants to engage in focus groups to collaboratively assess the quality of the process (Cousins & Whitmore, 1998; Emerson & Nabatchi, 2015; Mandell & Keast, 2007; Massey, 2011). This requires a conscious selection of participants, representing the various actors who affect or are affected by the collaboration process. It also implies a well-designed assessment process, in which actors feel free to share their experiences and concerns, guided by an independent research team and facilitator who are trusted by the participants. Participatory evaluation can be informed by an upfront set of evaluation criteria, as suggested in this chapter, but this set of criteria should be discussed, adapted, and prioritized by the participants. Indeed, participatory evaluation can itself be seen as a collaborative process and may be subject to similar antecedents and quality criteria. Focus group assessments may be embedded in the accountability arrangements agreed upon in the collaborative processes to be assessed. Participatory evaluations may be organized ex ante, ex durante, and ex post. Their strength is that they not only include actors’ various perceptions, but also involve a deliberation process on what quality actually means and on the extent to which the process meets standards, given the specific challenges and opportunities encountered, thereby arriving at an intersubjective measurement of quality. The above-mentioned considerations are also intended to mitigate the potential weaknesses of participatory evaluations and focus groups, such as a lack of representativeness, the risk of actors not feeling free to voice their opinions, and the strategic motivations underlying contributions.

CONCLUSION

Building on frameworks of prominent collaborative governance scholars, this chapter has presented a synthesis of process measures that could be deduced from these frameworks. Assessing collaboration processes with these measures can fulfil various objectives. Such assessments may contribute to the further theoretical and methodological development of collaborative governance research. They may also inform the design, management, and evaluation of processes, and enhance accountability. Various assessment methods have been discussed that can be used to apply these measures. These methods have their specific strengths and weaknesses, and their use is contingent upon the objectives of the assessment and the resources available. Combined, they can complement one another and compensate for each other’s weaknesses. The major challenge for further research on the evaluation and assessment of collaboration practices is to validate these process measures. Such research may also further increase our knowledge and understanding of methods to measure processes and of their strengths and weaknesses.

NOTE

1. In this chapter, the emphasis is on measuring the quality of collaboration processes. Therefore, methods aimed at arriving at explanations, such as comparative case studies, qualitative comparative analysis (QCA), and process tracing, are not discussed (for these methods, see Voets et al., 2019).


REFERENCES

Ansell, C., & Gash, A. (2008). Collaborative governance in theory and practice. Journal of Public Administration Research and Theory, 18(4), 543–71.
Bianchi, C., Nasi, G., & Rivenbark, W.C. (2021). Implementing collaborative governance: Models, experiences, and challenges. Public Management Review, 23(11), 1581–9.
Bryson, J.M. (2004). What to do when stakeholders matter: Stakeholder identification and analysis techniques. Public Management Review, 6(1), 21–53.
Bryson, J.M., Crosby, B.C., & Stone, M.M. (2006). The design and implementation of cross-sector collaborations: Propositions from the literature. Public Administration Review, 66, 44–55.
Bryson, J.M., Crosby, B.C., & Stone, M.M. (2015). Designing and implementing cross-sector collaborations: Needed and challenging. Public Administration Review, 75(5), 647–63.
Cousins, J.B., & Whitmore, E. (1998). Framing participatory evaluation. New Directions for Evaluation, 1998(80), 5–23.
De Bruijn, H., & Ten Heuvelhof, E. (2010). Process management: Why project management fails in complex decision making processes. Springer Science & Business Media.
Emerson, K., & Nabatchi, T. (2015). Evaluating the productivity of collaborative governance regimes: A performance matrix. Public Performance & Management Review, 38(4), 717–47.
Emerson, K., Nabatchi, T., & Balogh, S. (2011). An integrative framework for collaborative governance. Journal of Public Administration Research and Theory, 22(1), 1–29.
Enserink, B., Koppenjan, J.F.M., & Mayer, I.S. (2013). A policy sciences view on policy analysis. In W.A.H. Thissen & W.W. Walker (Eds.), Public policy analysis: New developments (pp. 11–40). Springer.
Hui, I., & Smith, G. (2022). Private citizens, stakeholder groups, or governments? Perceived legitimacy and participation in water collaborative governance. Policy Studies Journal, 50(1), 241–65.
Huxham, C. (1996). Creating collaborative advantage. Sage.
Innes, J.E., & Booher, D.E. (1999). Consensus building and complex adaptive systems: A framework for evaluating collaborative planning. Journal of the American Planning Association, 65(4), 412–23.
Kapucu, N., & Hu, Q. (2020). Network governance: Concepts, theories, and applications. Routledge.
Kelley, T.M., & Johnston, E. (2012). Discovering the appropriate role of serious games in the design of open governance platforms. Public Administration Quarterly, 36(4), 504–54.
Klijn, E.H., & Koppenjan, J. (2016). Governance networks in the public sector. Routledge.
Koliba, C.M.J., Zia, A., & Mills, R. (2018). Governance networks in public administration and public policy. Routledge.
Koppenjan, J.F.M., & Klijn, E.H. (2004). Managing uncertainties in networks: A network approach to problem solving and decision making. Routledge.
Koschmann, M.A., Kuhn, T.R., & Pfarrer, M.D. (2012). A communicative framework of value in cross-sector partnerships. Academy of Management Review, 37(3), 332–54.
Langley, A. (1999). Strategies for theorizing from process data. Academy of Management Review, 24(4), 691–710.
Lemaire, R.H., & Raab, J. (2019). Social and dynamic network analysis. In J. Voets, R. Keast, & C. Koliba (Eds.), Networks and collaboration in the public sector: Essential research approaches, methodologies and analytic tools (pp. 160–88). Routledge.
Mandell, M., & Keast, R. (2007). Evaluating network arrangements: Toward revised performance measures. Public Performance & Management Review, 30(4), 574–97.
Mandell, M., & Keast, R. (2008). Evaluating the effectiveness of interorganizational relations: Developing a framework for revised performance measures. Public Management Review, 10(6), 715–31.
Massey, O.T. (2011). A proposed model for the analysis and interpretation of focus groups in evaluation research. Evaluation and Program Planning, 34(1), 21–8.
Meads, G. (2017). From pastoral care to public health: An ethnographic case study of collaborative governance in a local food bank. The Open Public Health Journal, 10(1), 106–16. doi: 10.2174/1874944501710010106.
Medema, W., Furber, A., Adamowski, J., Zhou, Q., & Mayer, I. (2016). Exploring the potential impact of serious games on social learning and stakeholder collaborations for transboundary watershed management of the St. Lawrence River Basin. Water, 8(5), 175. https://doi.org/10.3390/w8050175.
Newman, J., Barnes, M., Sullivan, H., & Knops, A. (2004). Public participation and collaborative governance. Journal of Social Policy, 33(2), 203–23.
O’Leary, R., & Vij, N. (2012). Collaborative public management: Where have we been and where are we going? The American Review of Public Administration, 42(5), 507–22.
Olejniczak, K., Newcomer, K.E., & Meijer, S.A. (2020). Advancing evaluation practice with serious games. American Journal of Evaluation, 41(3), 339–66.
Ostrom, E. (1990). Governing the commons: The evolution of institutions for collective action. Cambridge University Press.
Provan, K.G., & Kenis, P. (2008). Modes of network governance: Structure, management, and effectiveness. Journal of Public Administration Research and Theory, 18(2), 229–52.
Provan, K.G., & Milward, H.B. (2001). Do networks really work? A framework for evaluating public-sector organizational networks. Public Administration Review, 61(4), 414–23.
Purdy, J.M. (2012). A framework for assessing power in collaborative governance processes. Public Administration Review, 72(3), 409–17.
Rein, M., & Schön, D.A. (1993). Reframing policy discourse. In F. Fischer & J. Forester (Eds.), The argumentative turn in policy analysis and planning (pp. 145–66). Duke University Press.
Skelcher, C., & Sullivan, H. (2008). Theory-driven approaches to analysing collaborative performance. Public Management Review, 10(6), 751–71.
Sørensen, E., & Torfing, J. (2009). Making governance networks effective and democratic through metagovernance. Public Administration, 87(2), 234–58.
Teisman, G.R. (2000). Models for research into decision-making processes: On phases, streams and decision-making rounds. Public Administration, 78(4), 937–56.
Turrini, A., Cristofoli, D., Frosini, F., & Nasi, G. (2010). Networking literature about determinants of network effectiveness. Public Administration, 88(2), 528–50.
Ulibarri, N. (2019). Collaborative governance: A tool to manage scientific, administrative, and strategic uncertainties in environmental management? Ecology and Society, 24(2). https://www.jstor.org/stable/26796941.
van Meerkerk, I., Edelenbos, J., & Klijn, E.H. (2019). Survey approach. In J. Voets, R. Keast, & C. Koliba (Eds.), Networks and collaboration in the public sector: Essential research approaches, methodologies and analytic tools (pp. 64–81). Routledge.
Voets, J., Van Dooren, W., & De Rynck, F. (2008). A framework for assessing the performance of policy networks. Public Management Review, 10(6), 773–90.
Voets, J., Keast, R., & Koliba, C. (Eds.) (2019). Networks and collaboration in the public sector: Essential research approaches, methodologies and analytic tools. Routledge.
Warsen, R., Nederhand, J., Klijn, E.H., Grotenbreg, S., & Koppenjan, J. (2018). What makes public-private partnerships work? Survey research into the outcomes and the quality of cooperation in PPPs. Public Management Review, 20(8), 1165–85.

11. A framework for measuring the effects of policy processes on health system strengthening

Fabiana da Cunha Saddi, Stephen Peckham, Peter Lloyd-Sherlock and Germano Araujo Coelho

INTRODUCTION

Complex policy processes, characterized by ambiguity, conflict and unanticipated effects, have raised questions about positivist types of outcomes and forms of measurement in public policies, as these lack an understanding of how processes can be empirically linked and related to effects. This points to a relevant knowledge gap in Public Policy and Management Studies: the need to better understand and systematize which causal mechanisms or processes can effectively affect outcomes, and how they do so in varied contexts. In this chapter we draw attention to the need to develop and apply comprehensive syntheses or forms of measurement of policy processes and their effects. We focus on policy mechanisms or theoretical policy drivers that provide clues about where to look in the public policy process, and on ways to measure them, in order to give depth to analyses of the effects of the political process in a systematic and synthesized way that can provide adaptable or contextual evidence in different methodological traditions. Comprehensive forms of measurement of policy processes can be an important approach, offering tools/frameworks and indicators to evaluate programmes.

The predominant concern – academic and practical – with policy outcomes and quantifiable evidence reflects changing forms of governance related to New Public Management, which framed government approaches so as to demonstrate efficiency and transparency in the management of public funds and public policies (Hood, 1991; Van Dooren et al., 2010; Wood, 2014). Although these transformations have had positive effects on the analysis of public policies, such as the search for greater methodological rigour or for evidence-based public practices, the excessive focus on results and on quantitative data makes it difficult to fully understand policy processes, and hard to justify keeping the emphasis on results alone instead of adding a further and complementary focus on more comprehensive types of data and analysis.

Considering these methodological and practical challenges and the correlated need to advance knowledge of both policy processes and effects/outcomes, it is possible to develop comprehensive forms of measurement that capture the complexity of policy processes. This also offers synthesized contextual evidence that can be explored quantitatively and in a mixed methods perspective, to better understand the process and its effect on outcomes. An important alternative or complementary approach has been to construct qualitative and mixed methods indicators.

In this chapter we present an analytical framework for measuring the effects of policy processes on health system strengthening, as theoretical and methodological guidance.

The Policy Integration and Performance Framework (PIPF) highlights the importance of focusing on the integration between the formulation and implementation processes (policy drivers) as performance enhancers and as a route to health system strengthening. It also suggests the use of qualitative methods as a way of comprehensively observing these processes, while proposing synthesis indicators that can mediate the application of mixed methods by relating these qualitative data to quantitative ones – when available, or when produced by the research itself.

This chapter is structured as follows. In the first section we present the PIPF and its theoretical underpinnings. We then discuss the methodological aspects involved in PIPF application, its qualitative and quantitative dimensions, the proposed indicators and the potential uses of mixed methods. Finally, we present three examples of research that originated from the application of the framework: the analysis of a payment for performance (P4P) health programme in Brazil, the comparison of P4P health programmes in low- and middle-income countries (LMICs) and a research project on primary health care (PHC) strengthening during and after COVID-19.

THE POLICY INTEGRATION AND PERFORMANCE FRAMEWORK

We have developed the Policy Integration and Performance Framework (PIPF) and its analytical instruments to explore how politically relevant policy and performance concepts/drivers can enable us to understand how distinct levels of policy integration between the formulation and implementation processes can engender changes in distinct types of performance drivers (changes in policies, structures and behaviour) and effectively strengthen two Health System Strengthening (HSS) building blocks: leadership (at the managerial level) and workforce (at the frontline level). Despite being originally oriented towards the analysis of health policies, the PIPF can offer insights and analytical support for other sectors of public policy, since it is theoretically and methodologically based on recent discussions – academic and practical – on the challenges of understanding the complexity that surrounds social interventions, as well as on ways of grasping, changing and evaluating their effects, which are to a large extent associated with the ability to measure the multiple levels and dimensions of the policy process within governance arrangements.

Contemporary policy processes are marked by numerous governance challenges that cross overlapping stages and phases of the entire policy chain. These are complex processes, whether from the point of view of resource management – physical, human and informational – or, mainly, from the political point of view, whose particularity is the ambiguity of policy objectives (Zahariadis, 2003) and the conflicting interests of the actors (Lipsky, 2010; Sabatier, 1988) who participate in the governance arrangements. Therefore, the PIPF seeks to associate multiple levels of analysis – systemic/governance, organizational/management and individual/frontline – and different methodological traditions through mixed methods, in order to understand ways of integrating conflicting demands that affect the strengthening of the governance system.

The emergence of performance-based management schemes – linking public expenditure to predetermined performance levels – grew from this concern to measure the results of public policies, part of a broader neoliberal transformation that called into question the very capacity and efficiency of public organizations (Crouch, 2011; Van Dooren et al., 2010). While initially designed for developed countries, pay-for-performance (P4P) programmes subsequently expanded across the globe to LMICs, as international organizations conditioned their funding for these countries on the implementation of P4P administrative schemes (Saddi and Peckham, 2018; Saddi et al., 2019a).

Although P4P performance measurement initiatives for health, and PHC in particular, have been developed in several parts of the world, there are still significant gaps or challenges in comparative public policy knowledge on how these programmes have been effectively formulated and implemented, as well as on how different processes of implementation and reformulation (and the integration between both processes) can contribute to the strengthening of health systems in several parts of the world (Diaconu et al., 2022; Singh et al., 2021).

Studies or evaluations of P4P, although employing different methods of analysis – mixed, quantitative and qualitative (Saddi and Peckham, 2018) – tend to focus mainly on the results of these programmes, or to study the phases of policy formulation and implementation separately, emphasizing implementation (when studied). Formulation tends to be neglected (as happens in Brazil) or only initially explored in terms of its public policy contributions, as occurs in Africa. This is particularly relevant in the case of PHC, given the number of performance-based payment (P4P) programmes implemented worldwide, particularly in developing countries, where P4P is used to make advancements in the construction of health systems (HS). It is therefore necessary to develop studies that explore both public policy processes – formulation and implementation – in an integrated way, and that verify how these processes influence (or fail to influence) the strengthening of health systems through P4P programmes.

Within the Public Policy field, both formulation and implementation processes respond to policy drivers – the factors that influence the process – that help explain why some policies are more responsive/responsible than others, and why some fail and others succeed. Concepts or categories such as agents, ideas and interests, organizational capacity, policy diffusion/transfer, policy tools, policy learning and feedback influence policy design and formulation and help to provide a way of understanding why some policies are more receptive and responsive, or better integrated. In addition, for understanding implementation, assessing concepts such as street-level bureaucracy (frontline workers), policy knowledge, participation or involvement, motivational reasons and feedback helps identify ‘drivers’ in the implementation of public policies. In a relational way, these concepts can reveal multiple forms of integration between formulation and implementation, allowing lessons on how to obtain better results in public policies to emerge.

With regard to the literature on strengthening health systems, six ‘building blocks’ (WHO, 2007) are effective in strengthening health systems: leadership, resources and equipment, workforce, health service delivery, financing and information. As defined by Chee et al. (2013), the concept of ‘system strengthening’ has particular significance:

Strengthening the health system is accomplished by more comprehensive changes to performance drivers such as policies and regulations, organizational structures, and relationships across the health system to motivate changes in behavior and/or allow more effective use of resources to improve multiple health services. (Chee et al., 2013, p. 85)

Thus, health system strengthening relates to a large extent to the human and social aspects of the public policy process (Saddi et al., 2023; Witter et al., 2019).

Through the analysis and understanding of key concepts and challenges concerning the formulation and implementation processes, it is possible to examine how to generate or improve these health system building blocks and strengthen systems in different countries and parts of the world.

METHODOLOGICAL ASPECTS OF THE FRAMEWORK APPLICATION

The PIPF is a theory-driven framework comprising three analytical steps to capture the distinct phases and levels of analysis that constitute the policy process (Figures 11.1 and 11.2). The first step focuses on the level of integration between the policy formulation and implementation processes. As integration is by definition a relational concept, its measurement is based on the analysis of the mechanisms (policy drivers) that mediate the interconnection between the two phases of the policy. In the second step, the framework looks at how the dynamics of integration between (re)formulation and implementation can generate different levels of performance. Thus, for performance measurement, the framework establishes fundamental variables that condition policy results (design, infrastructure and organization, and behaviour). The level of performance is measured, then, by the adequacy of the governance arrangement to change the performance drivers according to the demands, interests and problems identified by the policy integration process. In the third step, the framework connects the relationship between policy integration and performance levels to policy system strengthening through two social analytical categories that are part of the system’s building blocks: leadership and workforce. These main components (policy and performance drivers, and outcomes) and related concepts are structured by the institutional characteristics and governance arrangement that define the formal possibilities of choice/influence in (re)formulation and implementation.

Methodologically, the application of the framework requires both qualitative and quantitative methods of analysis, given the complexity of the policy process. Qualitative methods are essential to understand in depth the processes and mechanisms of integration and performance, which go unnoticed in the cold numbers of quantitative analysis between static variables typical of performance measurements. From both an academic and a practical point of view, turning the analytical lens to these mechanisms can elucidate conflicts, unwanted consequences, design flaws and unnoticed contextual aspects. It can also unveil practical and negotiable solutions, innovative forms of engagement and participation, and new important variables for measurement. Therefore, the measurement of policy and performance drivers involves the application of research techniques that capture both process and outcome, as processes are fundamental to understanding the causal mechanisms that lead to certain outcomes. It also considers the effect of policy outcomes on the reconfiguration of governance arrangements and on subsequent policy process changes.

The policy drivers are divided into the formulation and implementation drivers identified in the public policy literature, while the performance drivers are based on the health system strengthening literature. Formulation drivers can be both extrinsic and intrinsic.
Extrinsic drivers involve the identification of coalitions, their ideals, interests and power resources (Sabatier, 1988); the multiple, sometimes contradictory, streams that influence policymaking (Zahariadis, 2003); the tools and institutional arrangements that shape the governance structure (Margetts and Hood, 2016); the ways of learning and the permeability of ideas and interests arising from actors participating in the implementation, from public organizations, civil society and other organized groups (Dunlop, 2015; Jacobs and Weaver, 2010); the engagement and participation of managers and frontline professionals; and the consideration and understanding, in the policy redesign, of gaming and cheating practices that occur during implementation (Lewis, 2015; Pollitt, 2013).

[Figure 11.1: The Policy Integration and Performance Framework (PIPF). Source: adapted from Saddi et al. (2019a, p. 10) and Saddi et al. (2019b, p. 11).]

[Figure 11.2: The Policy Integration and Performance Framework (PIPF): sub-dimensions and questions. Source: Saddi et al. (2019a, p. 10) and Saddi et al. (2019b, p. 11).]

Implementation drivers cover the quality and extent of the knowledge transmitted to the professionals supposed to deliver public services; the types of professionals’ motivation – intrinsic and extrinsic (Mickel and Barron, 2008); the forms of learning and feedback generated (Dunlop, 2015; Jacobs and Weaver, 2010); changes in the work process carried out by frontline professionals throughout the process; and the forms of gaming and cheating within the rules of the game. Performance drivers, in turn, are composed of changes in policy, in its organizational and infrastructure characteristics, and in the behaviour of actors.

The application of qualitative methods depends on the synthesis and well-defined elaboration of the above theoretical mechanisms, in order to verify them empirically. This does not mean that the mechanisms proposed in the framework are exhaustive or that new mechanisms cannot be found empirically. The types of methods to be used depend on the temporal, spatial, resource and access conditions available to researchers. For example, as an alternative to individual interviews, researchers with limited resources and time can use focus groups where all that is sought is a range of views/perceptions of a policy process. The process-tracing method can also serve as an auxiliary method to verify the causal mechanisms of integration between formulation and implementation, as well as the impact of these mechanisms on the performance and strengthening of the policy. In the next section, the use of some methods during the application of the PIPF is discussed in the case of a P4P programme in the health sector in Brazil.

In addition to in-depth research, the application of the framework requires a level of generalization capable of verifying the scope of integration and performance in strengthening the health system. One approach is to create comprehensive qualitative syntheses and concept indicators that can be used quantitatively: distinct combinations of public policy drivers can be associated with effective policy integration levels (EPILs) between implementation and reformulation, while distinct combinations of performance drivers can be classified as performance impact perception levels (PIPLs). In case studies and comparative analyses, it is possible to explore the relationships between EPILs and PIPLs in order to create the variable levels of Health System Strengthening (LHSS-leadership and LHSS-workforce).

Effective policy integration levels (EPILs) constitute a synthesis indicator expressing the extent to which the formulation process is connected to the implementation process (and vice versa), revealing, therefore, the level of integration between both processes. Analyses of policy suggest that implementation is more effective and successful when both processes are more integrated or connected (Peckham et al., 2022). Performance impact perception levels (PIPLs) refer to the extent to which integrative policy drivers have generated new mechanisms, or strengthened old ones, considered powerful enough to initiate or promote the development of system strengthening.
While policy drivers consist mainly of the actions or activities promoted in the policy process, performance drivers refer to performance mechanisms caused or strongly influenced by integrative policy drivers. The idea is that whenever people get together or connect in diverse ways to discuss, design or implement a public policy, they do so with policy purposes. In those diverse forms of integration, it is possible that new performance mechanisms will emerge from those integrative policy activities (gatherings or connections), generating performance drivers such as: (1) new policies or strategies; (2) changes in organization, procedures or structure; (3) changes in behaviour. Policy drivers thus facilitate the introduction of performance drivers that can affect system strengthening.

During the construction of indicators, it is important, at first, to synthesize the qualitative evidence. A concern at this stage is to limit the loss of information and complexity present in the qualitative data. Some techniques help in this process: triangulation between different sources of evidence (between methodological techniques or observations) can help to verify the most recurrent and widespread mechanisms in the integration process; thematic or content analysis facilitates the organization of textual data, in order to synthesize evidence in clear, short and objective sentences that represent specific and well-defined categories, distinguishable between different contents and themes; and the establishment of clear parameters helps to define the intensity of variation in the empirical occurrence of drivers found among the researched cases. Then, after performing the qualitative synthesis – by defining the main types, representative sentences of each type and their intensities – these categorical data can be transformed into discrete or nominal quantitative data. The scale to be used depends on the amount of information obtained for each category or variable (so as not to reduce too much the perceived variation of the indicators), on the quantitative techniques intended to be used, and on the adequacy of other secondary data available to the research for comparison with the primary data collected.

The ways in which the comprehensive qualitative syntheses and the integration, performance and system strengthening indicators are used vary according to the research objectives and the practical concerns related to improving the system of governance. In general, the use of mixed methods in applying the framework to the analysis of the effects of the policy process makes it possible to understand in-depth, complex and contradictory dynamics without losing sight of the ability to generalize and of the impact of these dynamics on the effects and performance of social interventions. The inclusion of a qualitative perspective in the management of the governance system offers a more sensible and comprehensive interpretation of the evidence, so as to include political factors in the analysis of policy performance and results.
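As a minimal sketch of the transformation just described – qualitative driver codings recoded as discrete levels and combined into a synthesis indicator – consider the following Python fragment. The driver names, codings and the simple averaging rule are invented for illustration; in the PIPF, the attribution of levels is guided by the framework’s own criteria rather than by a mechanical formula.

    LEVELS = {"very low": 1, "low": 2, "middle": 3, "high": 4, "very high": 5}

    # Hypothetical coded intensities of policy drivers for one case, as might
    # result from thematic analysis of interview transcripts.
    drivers = {
        "knowledge": "middle",
        "motivation": "low",
        "feedback": "low",
        "participation": "high",
    }

    # Recode the categorical data as discrete 1-5 values.
    numeric = {d: LEVELS[v] for d, v in drivers.items()}

    # One simple illustrative combination rule: the rounded mean level.
    epil = round(sum(numeric.values()) / len(numeric))
    print("Driver levels:", numeric, "-> EPIL:", epil)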

PIPF APPLICATION

The PIPF was initially designed for application in the analysis of the National Program for Improving Access and Quality to Primary Care (PMAQ) in Brazil. Its development was based on the understanding that the framework had the potential for a more comprehensive analysis of the public policy process than the more quantitative assessment frameworks that had previously been applied. It was then adapted for research on and comparison of P4P programmes in LMICs and to design a project on PHC policy performance and strengthening during and after COVID-19.

The PMAQ Research
The PMAQ was developed in 2011, at a politically significant time. The programme was adopted after the expansion of PHC, aiming to increase access to and the quality of basic care in tandem with this expansion process. It was also a response to fiscal restriction, through a design that stimulates the development of comparable quality standards in PHC by means of performance pay incentives, associated with the promotion of cross-cutting strategic actions (Brasil – Ministério da Saúde, 2017).

The programme has a complex governance arrangement, formed by different levels of government (federal, state and municipal) and by multiple actors involved in the formulation, implementation and evaluation of the programme. The main body responsible for coordinating the PMAQ is the Department of Primary Care (DAB) of the Ministry of Health (MoH). Policy decisions are deliberated and negotiated in the Tripartite Inter-management Commission (CIT), with the participation of all federative bodies. The programme’s evaluation instrument was developed by the DAB, but widely discussed and negotiated inside the CIT and with the research institutions, contracted by the MoH, responsible for carrying out the external evaluation. At each cycle, the PMAQ underwent reformulations and rounds of negotiation and agreement between the multiple actors involved.

The research included six cities in Brazil and 36 health units, during the third cycle of the programme (2015–19). The selection of the six cities was based on the variation of management/leadership arrangements and organizational capacities. The first phase of the research involved interviews with policymakers, external evaluators, managers and frontline staff to explore their perspectives and experiences during the formulation and implementation of the PMAQ, taking into account the integrative policy drivers of the framework. Five interview guides were prepared, orienting the interviews carried out with policymakers in (1) the DAB/MoH (n = 11) and (2) the State Health Secretariats’ National Council (CONASS) and the Municipal Health Secretariats’ National Council (CONASEMS) (n = 4); with the policymakers and implementers in the municipalities, allocated to (3) basic care in the Municipal Health Secretariats (SMSs) and Health Districts (n = 20) and (4) managers and teams (doctors, nurses and community health workers) of the health units in the municipalities (n = 174); as well as with experts influencing the process, namely (5) official PMAQ external evaluators (n = 8).

The interviews were recorded and transcribed. Transcriptions were exported to the NVivo 12 software, where they were organized, coded and analyzed according to the main integrative policy drivers and the actors’ perceptions of changes in performance. A thematic framework analysis employing the PIPF was conducted. Using the software, we generated statistics for each code (node) and grouped them according to distinct combinations or frequencies of policy drivers. Those groups were interpreted in terms of effective policy integration levels (EPILs) between implementation and reformulation. The attribution of the levels was guided by the general criteria of the framework. Each informant and type of answer related to a policy driver was recoded according to the following levels: very low, low, middle, high and very high. This attribution of a new value was made for each dimension of the framework: formulation, performance and strengthening. Distinct combinations of performance drivers were classified as performance impact perception levels (PIPLs). In case and comparative analyses, we explored the relationship between EPILs and PIPLs by generating crosstabulations. We also made field notes and used them during the interpretation of the data, as they maintain contextual details and non-verbal expressions about the subject.
The quantitative data were extracted from the microdata available in the MoH information system, which depict the results of the external evaluation of the PMAQ third round. We used data from two external evaluation questionnaires: characteristics of the basic health units (Module I – UBS) and of the teams (Module II – Teams). Those secondary indicators from the official PMAQ database – collected by the MoH – were organized and classified into different result-based performance levels (RBPLs) for each health team, by means of attributing values to the indicators (1 to 5).

The MoH collected structural, process and outcome indicators in all basic health care units in Brazil. For our research purposes, we selected a few variables that enabled us to map different types of health team and management capacity in the units. They were classified into five dimensions (Human Resources Capacity, Organization of the Work Process, Management Instruments, Health Teams Practices and Structural Conditions) as a way of distinguishing different aspects of management and the work process.

We employed a convergent mixed methods design as a way of integrating the qualitative (phase 1) and quantitative data (phase 2). The mixed methods phase (phase 3) was intended to fill knowledge gaps between the partial views of each type of analysis (Creswell and Clark, 2018), one based on performance measurement, the other on the frontline work process. The convergent design sought to expand understanding of how aspects of the integration between the formulation and implementation processes, obtained in the qualitative data/analysis, were associated with the performance results quantified by the PMAQ. It also enabled us to check the extent to which both types of data converge, diverge or explain the performance scores obtained by the health teams.

We adopted data transformation and integration procedures to merge evidence/indicators. Qualitative synthesis indicators/data based on the analytical framework were transformed into quantitative scales from 1 to 5 (1 = very low; 2 = low; 3 = medium; 4 = high; 5 = very high). For each health team, a quantitative value was attributed to the synthesis variables (indicators): effective policy integration levels (EPILs), performance impact perception levels (PIPLs) and health system strengthening leadership (HSSlead) and workforce (HSSwf). Further, we generated crosstabulations exploring the relations between EPILs, RBPLs, HSSlead and HSSwf, in order to analyze the extent to which qualitative policy process indicators (EPILs and PIPLs) and result-based indicators (RBPLs) were associated and correlated with health system strengthening (of leadership and workforce) in case and comparative analyses. Results from the health system strengthening of each team were compared in crosstabulations with the incidence of different levels of the policy drivers that compose the EPIL (integration between formulation and implementation) and the dimensions that form the RBPL indicator (performance-based results), with the purpose of identifying which policy drivers and dimensions may be associated with health system strengthening.
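A minimal sketch of such crosstabulations, in Python using the pandas library, might look as follows; the team identifiers and scores are invented, and Spearman correlation is shown as one plausible way of relating the ordinal indicators.

    import pandas as pd

    # Hypothetical 1-5 scores per health team.
    teams = pd.DataFrame({
        "team": ["A", "B", "C", "D", "E"],
        "EPIL": [4, 2, 3, 5, 2],    # qualitative policy process indicator
        "RBPL": [4, 2, 3, 4, 3],    # result-based performance level
        "HSSwf": [4, 1, 3, 5, 2],   # workforce strengthening
    })

    # Crosstabulate process integration against workforce strengthening ...
    print(pd.crosstab(teams["EPIL"], teams["HSSwf"]))

    # ... and check how the ordinal indicators co-vary.
    print(teams[["EPIL", "RBPL", "HSSwf"]].corr(method="spearman"))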
The analysis found that implementation drivers can directly influence system performance and strengthening. We found that motivation and feedback drivers, for instance, play a significant role in the generation of higher levels of performance drivers. Frontliners’ levels of motivation ranged from middle to low in the cities, due to the lack of satisfactory working conditions, lack of support and work overload. The levels of feedback ranged from low to medium and were perceived by frontliners as incomplete or partial. The level of feedback was higher in the two smaller cities, Senador Canedo (in Goias) and Paulista (in Pernambuco), and even higher (high-medium level) in Senador Canedo, where we also found higher levels of motivation. In the cities where we found low levels of motivation, feedback and change in the work process, the frontline does not see changes in performance drivers and system strengthening due to the PMAQ. This was also reflected in low scores in the quantitative indicators produced by the DAB and the programme’s external evaluation. Moreover, we found that the integration between formulation and implementation can contribute to increases in the overall performance of P4P/PBF (performance-based financing). In the cities where frontliners were – though in different ways – involved in the policymaking and discussion process, policymaking is characterized by higher levels of knowledge and learning about the implementation challenges, presenting higher levels of impact on performance drivers and system strengthening.

Researching P4P programmes in PHC in LMICs
The PIPF was also utilized to explore the extent to which the formulation and implementation of P4P programmes in LMICs in several regions around the world have been developed as integrated/responsive processes, with different capacities to strengthen the performance of health systems (Saddi, 2017; Saddi et al., 2019b). We developed a comparative policy analysis that employed a multi-method approach to analyze the different types of data used in the investigation: literature data, documents, and the perspectives/opinions of the agents involved internationally and nationally in P4P formulation and implementation processes.

In the first phase of the research, we carried out a systematic review of the empirical literature on P4P in LMICs, selecting texts that directly addressed the programmes’ processes of formulation and/or implementation. Evidence was triangulated with official and academic documents. We searched for relevant papers using the MEDLINE, Cochrane and SCOPUS databases, and for reports and other publications on the websites of governments, international institutions and policy/research networks. Data extracted from the review of the selected countries were systematized in a matrix according to their EPILs, corresponding to the formulation and implementation processes, as well as according to the degree to which integration triggers performance drivers and the improvement of the system’s building blocks.

In the second phase, we interviewed 14 international experts/scholars and surveyed 52 health policy experts about current challenges and future perspectives in P4P using a short questionnaire. The results of this survey and the expert interviews were used to complement the findings of the literature review. This phase enabled us to delve into specialists’ experiences with specific P4P programmes in LMICs: processes of improvement and integration between formulation and implementation (with or without the presence of public policy drivers), as well as actions to strengthen health systems, triggered in the cases addressed. At the same time, the short questionnaires made it possible to understand more generally the perspective and experience of the epistemic community related to P4P policymaking in LMICs, so that the data could be quantitatively systematized.

Finally, we triangulated quantitative and qualitative data using the EPIL indicators, so that the impact of the integration between formulation and implementation on the performance of the programmes and the strengthening of the health system could be understood. The formulation and implementation stages were categorized separately, based on the public policy drivers and their respective EPILs, and then analyzed together to identify the degree of integration for each country, both from the perspective of the literature and from the perspective of the experts who have a practical view of the programmes. Evidence on the formulation and implementation processes was related to the presence of performance drivers to then understand how integration in the programmes strengthens health systems by enhancing the building blocks of workforce and leadership.
While quantitative data provided generalizations on the impact of drivers on performance and systems, qualitative data deepened the evidence on how these processes and mechanisms relate to the practice of P4P programmes.
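The matrix step lends itself to a simple computational treatment. Below is a minimal sketch, using entirely hypothetical country names, component labels, score ranges and cut-offs (none of which come from the actual coding scheme), of how EPIL-style scores might be systematized and banded into integration levels.

import pandas as pd

# Hypothetical scores only: the labels, 0-3 ranges and bands are illustrative
# assumptions, not the EPIL coding used in the studies described above.
epil = pd.DataFrame(
    {
        "formulation_epil": [3, 1, 2],      # e.g. policy knowledge and feedback
        "implementation_epil": [2, 1, 3],   # e.g. frontline participation
        "performance_drivers": [2, 0, 2],   # e.g. motivation, feedback, change
    },
    index=["Country A", "Country B", "Country C"],
)

# Degree of integration between formulation and implementation: here simply
# the mean of the two EPILs, banded into low/medium/high.
epil["integration"] = epil[["formulation_epil", "implementation_epil"]].mean(axis=1)
epil["integration_level"] = pd.cut(
    epil["integration"], bins=[-0.1, 1, 2, 3], labels=["low", "medium", "high"]
)

# Triangulation step: relate integration levels to the presence of performance
# drivers, mirroring the qualitative-quantitative comparison described above.
print(epil.sort_values("integration", ascending=False))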

Research Project: PHC Strengthening during and after COVID-19 – an Intersectoral and Equity Approach

In preparation for a new study on PHC, we have redesigned the PIPF to better understand the implementation of PHC and how it affects strengthening, in an intersectoral perspective with basic social care, in areas with high inequality rates, during and after COVID-19 (Saddi, 2022). Special attention will be given to PHC policies directed to Black and elderly people. The intersectoral dimension has been included due to the exacerbation of social inequalities during the pandemic.

We have added critical policy drivers to the framework that need to be considered during the pandemic and in the post-pandemic period. These drivers refer to concepts such as: wicked factors (Klasche, 2021), which involve challenging and complex issues such as persistent poverty; intersectoral relationships (and forms of coordination or integration) between PHC and social care, including whether they involve experiments in social prescribing or support – such as prescriptions for social, psychological and/or physical activities – and whether they occur within independent or institutionalized networks; policy capacity in its multiple dimensions (organizational, system and individual), as employed in the Policy Capacity framework (Wu et al., 2015); the specificities of governance arrangements established at the municipal and district levels for the COVID-19 and post-pandemic period when responding to pandemics/crises (Barbazza et al., 2019; Capano et al., 2020), and how these relate to implementation at the frontline; and forms of resilience at the frontline of health work (Abimbola and Topp, 2018; Blanchet et al., 2017).

This project will conduct interviews and participatory meetings with communities to collect new qualitative data in basic health and social care units in poor areas of the City of Brasília, Brazil, building contextual synthesis indicators to better understand changes taking place in the policy process and how they have affected results in terms of system strengthening. As a summary synthesis indicator, levels of policy integration in implementation will be constructed, by unit and district, based on how policy drivers have played out. Higher levels of integrative implementation will be associated with higher levels of leadership and workforce strengthening. Our hypothesis is that higher levels of policy drivers, policy capacity and resilience (or a mix of them), operating in an interconnected way and generated at the frontline of health and social units, can entail high levels of implementation performance during and after the pandemic, affecting PHC strengthening within a system perspective. We expect to find variance across three main stages of the process: the first wave of COVID-19, the second wave, and the initial post-pandemic period. As potential users of the synthesis indicators constructed in this project, policymakers at the municipal and national levels will be involved in the participatory meetings (which also serve as data collection) and will contribute to discussions in a seminar or webinar.

CONCLUSION

By developing the PIPF, we expect to encourage new and more comprehensive evidence on P4P programmes and PHC. In addition, we hope it provides insights and recommendations on how leadership and workforce can be built and fostered through the integration of a programme's reformulation and implementation processes, in order to strengthen PHC policy and other governance arrangements.

In our work to date this approach shows promise. In the research focused on LMICs, both levels of policy knowledge (involving a diversity of types and forms of knowledge and policy feedback in the formulation) and levels of participation of national actors in policymaking appear as significant formulation drivers associated with higher levels of performance – though improvements in performance may take a while to materialize. The two research projects found that policy drivers fostering integration or closer relationships between actors and/or policy spheres or cycles are the types of policy initiatives or strategies that could be privileged by policymakers and implementers if P4P/PBF is employed to improve system strengthening. However, the relations between performance drivers and system strengthening are complex and non-linear: they depend on context, and the results can be mixed. Impact results tend to be more consistent and significant with respect to leadership strengthening than to the strengthening of the workforce. Whenever we see closer relationships and/or higher levels of integration – either within the implementation process or between policymaking and implementation – we find better indicators concerning performance and leadership strengthening.

In future research it would be interesting to further investigate alternative policy drivers that can impact the workforce and to better examine the variance they produce – in the case of the LMICs research, by employing not only surveys but also qualitative interviews and/or focus groups and country/local analyses. It would also be important to employ a more participatory approach in the research process, by developing several rounds of conversations with frontliners and managers and using their results (indicators) to produce changes in the implementation and performance process in distinct phases of the research. In practice, this would contribute to increasing levels of performance and, thus, to strengthening the relation between performance drivers and HSS during the research process. Moreover, results could influence reformulation at the federal level and/or policymaking processes in the municipalities, by proposing adaptations or changes to the programme/policy.

The empirical evidence from case and comparative studies may reflect the problems identified in the literature but, we believe, it broadens the scope of the analysis. Thus, it is expected that the PIPF may also point out directions and gaps to be explored by the field of study in high-, low- and middle-income countries that adopt performance-based financing programmes as a way to strengthen health systems. Qualitative data can, and should, be used to generate comprehensive indicators. The occurrence and form of the policy and performance drivers that make up the analytical framework can be observed empirically in the cases studied, so that in each case different levels of integration between reformulation and implementation, and of performance, can be understood, associated and measured by EPIL and PIPL indicators, respectively. The indicators may be analyzed for each case and comparatively. Public policy managers and implementers can take advantage of the evidence to learn about ways to better disseminate knowledge and foster participation at the frontline, reducing the occurrence of unexpected consequences. Research results may also impact frontline professionals: disseminating the research among participants, together with their possible influence on management, may increase health professionals' knowledge about the programmes and make them more open to participating in the construction of a new, more sustainable evaluation and planning culture for their work at the frontline.

Systematization of available official data can also be combined with qualitative indicators in order to produce a deeper understanding of the relationship between work processes, the roles played by the actors, service quality and the results of the programme. It is expected that the findings could improve policy tools and strategies for subsequent implementation cycles, taking into account the knowledge or contextual evidence obtained from the research.

REFERENCES

Abimbola, S., & Topp, S.M. (2018). Adaptation with robustness: The case for clarity on the use of 'resilience' in health systems and global health. BMJ Global Health, 3(1), e000758.
Barbazza, E., Kringos, D., Kruse, I., Klazinga, N.S., & Tello, J.E. (2019). Creating performance intelligence for primary health care strengthening in Europe. BMC Health Services Research, 19(1), 1–16.
Blanchet, K., Nam, S.L., Ramalingam, B., & Pozo-Martin, F. (2017). Governance and capacity to manage resilience of health systems: Towards a new conceptual framework. International Journal of Health Policy and Management, 6(8), 431.
Brasil – Ministério da Saúde (2017). Programa Nacional de Melhoria do Acesso e da Qualidade da Atenção Básica (PMAQ): Manual Instrutivo 3º Ciclo (2015–2016). Brasília, DF: Ministério da Saúde.
Capano, G., Howlett, M., Jarvis, D.S., Ramesh, M., & Goyal, N. (2020). Mobilizing policy (in)capacity to fight COVID-19: Understanding variations in state responses. Policy and Society, 39(3), 285–308.
Chee, G., Pielemeier, N., Lion, A., & Connor, C. (2013). Why differentiating between health system support and health system strengthening is needed. International Journal of Health Planning and Management, 28(1), 85–94. doi: 10.1002/hpm.2122.
Creswell, J.W., & Clark, V.L. (2017). Designing and conducting mixed methods research. Sage.
Crouch, C. (2011). The strange non-death of neo-liberalism. Polity.
Diaconu, K., Witter, S., Binyaruka, P., Borghi, J., Brown, G.W., Singh, N., & Herrera, C.A. (2022). Appraising pay-for-performance in healthcare in low- and middle-income countries through systematic reviews: Reflections from two teams. Cochrane Database of Systematic Reviews, 20(5), May, ED000157. doi: 10.1002/14651858.ED000157. PMID: 35593101; PMCID: PMC9121198.
Dunlop, C.A. (2015). Organizational political capacity as learning. Policy and Society, 34(3–4), 259–70. doi: 10.1016/j.polsoc.2015.09.007.
Hood, C. (1991). A public management for all seasons? Public Administration, 69(1), 3–19.
Jacobs, A.M., & Weaver, R.K. (2010). Policy feedback and policy change. APSA 2010 Annual Meeting Paper. Available from: https://ssrn.com/abstract=1642636.
Klasche, B. (2021). After COVID-19: What can we learn about wicked problem governance? Social Sciences & Humanities Open, 4(1), 100173. https://www.sciencedirect.com/science/article/pii/S2590291121000693.
Lewis, J.M. (2015). The politics and consequences of performance measurement. Policy and Society, 34, 1–12.
Lipsky, M. (2010). Street-level bureaucracy: Dilemmas of the individual in public services. Russell Sage Foundation.
Margetts, H., & Hood, C. (2016). Tools approaches. In B.G. Peters & P. Zittoun (Eds.), Contemporary approaches to public policy (pp. 133–54). International Series on Public Policy. Palgrave Macmillan.
Mickel, A.E., & Barron, L.A. (2008). Getting 'more bang for the buck': Symbolic value of monetary rewards in organizations. Journal of Management Inquiry, 17(4), 329–38. https://doi.org/10.1177/1056492606295502.
Peckham, S., Hudson, B., Hunter, D., & Redgate, S. (2022). Policy success: What is the role of implementation support programmes? Social Policy & Administration, 56(3), 378–93.
Pollitt, C. (2013). The logics of performance management. Evaluation, 19(4), 346–63.
Sabatier, P.A. (1988). An advocacy coalition framework of policy change and the role of policy-oriented learning therein. Policy Sciences, 21(2), 129–68.
Saddi, F.C. (2017). How to strengthen leadership and the workforce through the implementation of a pay for performance program (PMAQ) in PHC. A comparative health system and policy analysis in LMICs. Summary of the research project submitted to the Social Sciences Approach for Research and Engagement in Health Policy and Systems Thematic Working Group (SHAPES) at Health Systems Global (HSG).
Saddi, F.C. (2022). Research project and plan of work: How to strengthen primary care policy for older and black people after Covid-19 in a Quilombo and vulnerable areas in the Federal District, in Brazil. University of Brasilia, Selection of Visiting Professors, August.
Saddi, F.C., & Peckham, S. (2018). Brazilian payment for performance (PMAQ) seen from a global health and public policy perspective. Journal of Ambulatory Care Management, 41(1), 25–33.
Saddi, F.C., Peckham, S., Coelho, G.A. et al. (2019a). A comprehensive mixed-method policy framework to evaluate how the re-design and implementation of a P4P program can effectively affect health system strengthening: and applying it to the Brazilian case (PMAQ). Presented at the 5th International Public Policy Conference (ICPP5), Montreal, 2019. https://www.ippapublicpolicy.org//file/paper/5d112a8164521.pdf. Accessed 20 August 2023.
Saddi, F.C., Peckham, S., Coelho, G.A. et al. (2019b). A public policy and health system strengthening analysis to explore the politics and effectiveness of P4P/PBF programs in LMICs around the globe: Mixing the qualitative review, interviews and survey results. Presented at ICPP5, Montreal, 2019. https://www.ippapublicpolicy.org//file/paper/5d024c4354525.pdf. Accessed 20 August 2023.
Saddi, F.C., Peckham, S., Bloom, G., Turnbull, N., Coelho, V.S., & Denis, J.L. (2023). Employing the policy capacity framework for health system strengthening. Policy and Society, 42(1), March, 1–13. https://doi.org/10.1093/polsoc/puac031.
Singh, N.S., Kovacs, R.J., Cassidy, R., Kristensen, S.R., Borghi, J., & Brown, G.W. (2021). A realist review to assess for whom, under what conditions and how pay for performance programmes work in low- and middle-income countries. Social Science & Medicine, 270, February, 113624. doi: 10.1016/j.socscimed.2020.113624. Epub 2020 Dec 18. PMID: 33373774.
Van Dooren, W., Bouckaert, G., & Halligan, J. (2010). Performance management in the public sector. Routledge. http://ndl.ethernet.edu.et/bitstream/123456789/24068/1/11.pdf. Accessed 9 September 2023.
WHO (World Health Organization) (2007). Everybody business: Strengthening health systems to improve health outcomes: WHO's framework for action. https://apps.who.int/iris/bitstream/handle/10665/43918/9789241596077_eng.pdf?sequence=1&isAllowed=y. Accessed 9 September 2023.
Witter, S., Palmer, N., Balabanova, D., Mounier-Jack, S., Martineau, T., Klicpera, A., & Gilson, L. (2019). Health system strengthening – reflections on its meaning, assessment, and our state of knowledge. The International Journal of Health Planning and Management, 34(4), e1980–e1989.
Wood, M. (2014). Bridging the relevance gap in political science. Politics, 34(3), 275–86.
Wu, X., Ramesh, M., & Howlett, M. (2015). Policy capacity: A conceptual framework for understanding policy competences and capabilities. Policy and Society, 34(3–4), 165–71.
Zahariadis, N. (2003). Ambiguity and choice in public policy: Political decision making in modern democracies. Georgetown University Press.

12. Measuring micro-foundations of governance: a behavioral perspective
Sjors Overman, Emma Ropes and Wouter Vandenabeele

INTRODUCTION

In recent years, scholars of public governance have increasingly concentrated on the individual as a unit of analysis. In particular, with the rising interest in behavioral public administration, the individual has moved to the forefront of governance research (Grimmelikhuijsen et al., 2017). This development has had both theoretical and empirical implications. Not only do we witness an increase in the use of individual level theoretical perspectives, we also witness increased use of measurements that focus on individual level behavior and attitudes. The most prevalent exponent of this movement is the survey scale, the type of individual level measurement that has gained the greatest momentum within public governance research. This chapter provides an overview of the use of individual level measurement in public governance research, and discusses the development, validation, and refinement of survey scales as a measurement instrument for public governance.

Working with many concepts and theories in the social sciences, governance included, requires the measurement of concepts that are not directly observable. This is true for concepts at the institutional or organizational level – as discussed at length in the other chapters of this handbook – as well as for concepts at the individual level. Some concepts, such as height or hair color, may be quite straightforward to measure objectively. Yet, whenever we want to empirically underpin statements about attitudes, motivations, or personality, we need more sophisticated measurement instruments. The more ambiguous the concepts we want to study, the more complex their measurement may become. With the rise of interest in the individual level in governance research, governance scholars have rapidly acquired more experience in the application of measurement theories and instruments for individuals. These are the focus of the current chapter.

The structure of this chapter is as follows. We first discuss theories of measurement, with a focus on classical test theory, which underlies the development of most survey scales. We continue with an overview of the proliferation of this method in public governance research. To do so, we present the results of a systematic review of the governance literature of the past 30 years. We concisely discuss 51 articles that have been published in academic journals and in which a survey scale has been developed or refined. We then discuss the implications of this type of measurement for governance research. We conclude with an outlook on other instruments for measurement at the individual level that may gain ground in the years to come.



THEORIES ON MEASUREMENT

Measurement of intangible concepts, such as public service motivation (Vandenabeele, 2008), felt accountability (Overman et al., 2020), or compassion (Ropes and de Boer, 2021), has proven a key challenge for scholars of governance. The field has proven apt in conceptualizing and theorizing such concepts over the last century, but for empirical research, valid measurements of these concepts are essential. Moreover, because of the sophisticated nature of many conceptualizations, they cannot be translated directly into questions in interviews or questionnaires. Asking a civil servant about their public service motivation can raise multiple problems: the respondent may not understand the concept at all, one respondent may have a different conceptualization from the next, or respondents may answer the question in a socially desirable way. The operationalization of concepts and the development of their measurement in surveys facilitate empirical research into these difficult concepts. Classical test theory was developed to deal with these and other measurement problems. In this section, we discuss issues of concept definition and operationalization, classical test theory, and evaluation criteria for measurement instruments.

Conceptualization

Many phenomena in the social sciences have been extensively described, but cannot be observed directly. We are, therefore, confronted with many latent constructs in the social sciences in general, and in governance research in particular. Such constructs include performance and leadership, or the constructs discussed at the start of this section. The latent character of these constructs poses several problems for their measurement. The constructs exist in the minds of individuals or in the dialogue between individuals, but they have no physical properties (Searle, 1996). Any measurements are, therefore, mere proxies for the constructs that we want to observe (DeVellis, 2009). To make sure that the constructs are well represented by their measurements, it is thus essential to define and operationalize constructs in an adequate way.

Based on a definition, measurement items can be generated in either a deductive or an inductive way. Deductive scale development applies a classification scheme or typology prior to data collection, with item generation based on an understanding of the literature (Hinkin, 1995). This approach can be used in two ways: researchers base their items on the literature or on existing scales, or they gather opinions from subject experts in the field (Hinkin, 1995). Inductive scale development, on the other hand, is based on qualitative information regarding a construct, obtained from the target population through, for example, focus groups, interviews, expert panels, and qualitative exploratory research methodologies (Hinkin, 1995; Morgado et al., 2017). Sometimes, combinations of the two are used in forming the item pool.

The goal for scholars who measure constructs is to approach the true score of the construct of interest. The true score of a latent variable is and will remain unknown; the objective, therefore, becomes to approach the true score as closely as possible. The starting point for attaining this objective is to define and operationalize the concept of interest as clearly as possible. In practice, this may involve defining broad concepts, such as performance, compassion, or trust, much more narrowly than in theoretical studies. It may also involve dissecting concepts into two or more dimensions.

The concept of public service motivation, for example, encompasses three, four or five dimensions, depending on the perspective and whom you ask (Perry, 1996; Vandenabeele, 2008; Giauque et al., 2011). These dimensions include, among others, attraction to public policy making and commitment to the public interest. While both are components of public service motivation, they are distinctly measurable from each other and require different batteries of questions to evaluate the scores. Determining the number of dimensions is a process that requires both theoretical and empirical information, and the empirical analysis often reveals other structures than initially theorized.

Classical Test Theory

Charles Spearman (1904), reflecting on the measurement of individual psychological characteristics, laid the basis for what we have come to consider psychometrics (Raykov and Marcoulides, 2011). The main purpose of this field was and still is the estimation of a true score of the concept of interest, based upon measurement of what Spearman called 'the psychics of real life' (analogous to the 'physics of real life' in other scientific fields). One of the most important theoretical foundations of this field is classical test theory, which states that X = T + E. The X can, for example, be the answer to a survey question. X, being the observed score, is a function of a true score (T), on the one hand, and an error score (E), on the other. The former refers to variance in the object of interest in a particular observation, whereas the latter concerns variance that is not related to the object of interest. In other words, when asking a survey question about compassion, the answer (X) contains the score on compassion (T), as well as some 'noise' that may – or may not – include whether the respondent has eaten well and feels relaxed, their general attitude toward surveys, or a genuine error in typing in an answer (E).

This seemingly simple equation has nevertheless engendered substantial discussion, as it proved to be much more complex than expected. An important element in explaining this complexity is that a true score is even more difficult to assess in fields that concern human behavior, since the sources of error in such an environment far exceed those in other fields (Raykov and Marcoulides, 2011). In engineering, measurement error is oftentimes systematic and can be attributed to the features of the measurement tool. In the social sciences, measurement error can stem from such sources, but it can also be attributed to various unobserved factors, both systematic and random. The true score itself can thus be considered a Platonic concept: something which cannot be directly observed. Instead, the score is derived from the observation with reasonable accuracy. If we translate this to measuring a governance-related characteristic at the individual level, the measurement assumes that individuals have a true value of a certain characteristic, not unlike a person's height or weight. But it is challenging to transpose the technique of measuring a physical object to measuring less tangible concepts, such as psychological characteristics. After all, unlike with physical attributes, the intangibility of the characteristic prevents us from observing how far the measurement deviates from the actual attribute (Lord et al., 1968).
Therefore, a true score is often conceptualized at a more operational level, where a true score is the average of a set of scores. In particular,

a test should be administered many times, to the same individual in a way ensuring a large number of resulting statistically independent and identically distributed measurements of this person, assuming thereby … no underlying development or change of the construct of actual interest occurs. (Raykov and Marcoulides, 2011, p. 118)
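This thought experiment can be illustrated with a short simulation. The sketch below is ours, not part of the original argument: it assumes an arbitrary true score and a normally distributed error term, and shows that averaging many hypothetical administrations recovers T when the error is purely random, but not when a systematic component is added (a distinction taken up below).

import numpy as np

rng = np.random.default_rng(42)
true_score = 3.6             # the unobservable true score T of one individual
n_administrations = 10_000   # hypothetical repeated, independent administrations

# Random error: zero-mean noise that differs at every administration.
random_error = rng.normal(loc=0.0, scale=0.8, size=n_administrations)
observed = true_score + random_error       # X = T + E

print(round(observed.mean(), 2))   # ~3.6: the average of the observations approaches T
print(round(observed.var(), 2))    # ~0.64: here, observed variance is pure error variance

# A systematic error (e.g. question-order priming) shifts the average instead:
biased = observed + 0.5
print(round(biased.mean(), 2))     # ~4.1: repeated measurement no longer recovers T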

Evidently, assessing the true score in this way is practically impossible, but it at least offers an interesting way to think about the true score at a conceptual level.

Next to the true score, we also measure some error. In classical test theory, it is crucial to distinguish between systematic and random error. Random or non-systematic error refers 'to pure chance effects that are momentary and have nothing to do with construct being measured' (Raykov and Marcoulides, 2011, p. 116). This type of error influences individual scores in no particular direction, and repeated measures do not show this type of error, as it averages out to zero. Systematic error, on the other hand, is the proportion of the observed score that is unrelated to the concept of interest but occurs in a systematic fashion. Systematic error biases the score in one particular direction. For example, the order of questions in a citizen satisfaction survey can prime respondents: it matters whether a question about generalized satisfaction with the government is asked before or after satisfaction with specific services (Van De Walle and Van Ryzin, 2011; Thau et al., 2021). With systematic error, repeated measures therefore yield an average error that is non-zero. These features are important in evaluating individual and group or average scores, as they can have a tremendous effect on the validity of an estimate of an individual characteristic or on the average validity of a test.

Composite Measures

Usually, tests involve the use of multiple items in a single scale. The idea behind such composite measures is rooted in classical test theory, as discussed above. Items in a set partially overlap, which reveals the latent construct from slightly different angles. Individual scores can still differ in the way observed scores capture the true score but, on average, when multiple respondents answer these items, the items render the same results and the same average scores. In other words, even if certain items seem redundant, having multiple items for the same construct allows summation of the content that is common to all items, while it cancels out the idiosyncrasies of each separate item (DeVellis, 2009, p. 65). Including multiple tests or items is also worthwhile for assessing reliability (see the next section). Often, such composite measures assume that each item weighs equally toward estimating the true score – so-called tau-equivalence – but in current governance scholarship we see few examples of studies where this assumption is challenged.

Putting the above together, one can distinguish three different models based on classical test theory. First, a model of parallel tests assumes that, for any particular set of tests that are supposed to measure the same thing, the individual tests render the same true scores and equal error variances (Raykov and Marcoulides, 2011). Second, a model of tau-equivalent (or true score equivalent) tests relaxes the assumption of equal variances of the different tests (or items), but not that of equal true scores; tests are therefore not complete substitutes. Finally, congeneric models refer to a set of items that measure the same concept but are not necessarily on the same measurement scale, thus relaxing not only the assumption of equal error variances, but also that of equal true scores (Raykov and Marcoulides, 2011); the true scores of two tests are, however, perfectly correlated. This latter, congeneric model is probably the most common in the broader social sciences. One can think of it as a unidimensional factor model with freely estimated – and thus generally unequal – factor loadings; constraining the factor loadings to be equal yields a tau-equivalent model, and additionally constraining the error variances to be equal yields a parallel model.

Testing Classical Test Theory-based Models: Reliability

One aspect of measurement quality is reliability. Reliability is usually defined as the ratio of true score variance to observed score variance (Raykov and Marcoulides, 2011). Given that we have no information on the true score, computing this ratio directly is an impossible task. A workaround is using the correlation between two parallel tests, which equals the reliability (Joreskog, 1971). Squaring this correlation yields a number that is known as coefficient alpha. Yet again, completely parallel tests or items are hard to come by (try to imagine a composite measure – for example, a series of Likert questions used to measure a concept – that has two or more items which measure exactly the same score for each respondent). Therefore, other strategies have been developed. One of the most familiar strategies for estimating coefficient alpha, and thus the reliability of a composite measure, is reporting Cronbach's alpha (Sijtsma, 2009).

Using Cronbach's alpha is not without challenges. Its history is riddled with misuse, and the way it is calculated may create issues (Schmitt, 1996). In principle, Cronbach's alpha should only be used for assessing the reliability of parallel or tau-equivalent measures. For other (congeneric) measures, it may either over- or underestimate reliability (Raykov, 2001a; Trizano-Hermosilla and Alvarado, 2016). However, some alternatives are available. It has been illustrated that omega, and in particular weighted omega, is a much better estimate of reliability than Cronbach's alpha (Bacon et al., 1995), as it is better suited to the assumptions of the model (Raykov, 2001b).

Apart from these methods based on the covariance of items, which allow estimating reliability from a single test administration, reliability can also be assessed on the basis of two test administrations. Under certain conditions, the correlation between the items or complete tests can serve as an estimate of reliability. Based on similar reasoning, splitting the sample into two halves and administering only parts of the instrument to each half can render estimates of reliability. However, as these estimates may be more difficult to assess (Raykov and Marcoulides, 2011), and as they are hardly used within the fields of public administration or public management, they are beyond the scope of this chapter.

Testing Classical Test Theory-based Models: Validity

Reliability is only one way to assess whether a measure is a 'good' measure. Another way to assess the quality of a measure is to ask whether an instrument measures what it was initially intended to measure (Hu and Olshfski, 2008). This criterion, validity, is even more challenging to evaluate than reliability, as there is no list of definitive criteria (Cunningham and Olshfski, 2001). Validity cannot be established with absolute certainty; assessing validity is more a matter of degree (Raykov and Marcoulides, 2011). In Standards for Educational and Psychological Testing, validity is defined as 'the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests' (AERA, APA, and NCME, 1999, p. 9). As a consequence, evaluating validity requires continuous and cumulative evidence.

There are several types of validity with regard to the measure that is being evaluated (Giannatasio, 2008). The most intuitively appealing types are face validity and content validity. Face validity is a rudimentary way of assessing validity, evaluating whether a measure seems to be a good measure of the intended concept at face value. It is therefore prone to perceptual biases and rarely provides convincing arguments in favor of claiming validity. Content validity is a more advanced type of validity and refers to 'the degree to which one can draw correct inferences from test scores to performance on a larger domain of similar items' (Raykov and Marcoulides, 2011, p. 185). This translates into the idea that all dimensions of the concept should be covered by the measure. It should come as no surprise that this is mainly assessed in a qualitative fashion, paying close attention to theory. The assessment is sometimes quantified to some extent by having external reviewers score the validity (Lawshe, 1975). As discussed in the next section, the latter technique is regularly witnessed in current governance scholarship (Van Loon et al., 2016; Overman et al., 2021). But there are some drawbacks to this approach: the relationship between classical test theory and these two types of validity is sketchy at best, as there is no formal link to elements of classical test theory such as the observed score, true score or error score.

Other types of validity are better embedded within classical test theory. One such type is criterion validity, which refers to 'the degree to which there is a relationship between a given test's scores and performance on another measure of particular relevance' (Raykov and Marcoulides, 2011, p. 187). The latter test is referred to as the criterion. Depending on whether this criterion is situated in the future or closer by in time, the result is dubbed predictive validity or concurrent validity, respectively. The correlation between the test and the (pre-existing) criterion demonstrates the validity. For example, Grimmelikhuijsen and Knies (2017) developed a scale for citizen trust in government organizations and used an existing item for generalized trust to measure their correlation. Theoretically, concurrent validity equals 1 for perfect validity, but this assumes the error scores of both test and criterion to be zero. When the error score does not equal zero (and measurement is not perfect), validity cannot exceed the product of the respective reliability indexes – the correlation between the true score and the observed score – of both measures. Therefore, the more reliable both measures are, the higher the criterion validity.

A final type of validity is construct validity. This refers to the 'extent to which there is evidence based on which one can interpret the results of a given … instrument' (McDonald, 1999, p. x). Although constructs are only indirectly observable, there are theoretical assumptions about their relationships with other constructs relevant in the field, as well as with other components of the construct. It is self-evident that there is no single test for construct validity (Schwab, 1980). However, a number of approaches can be followed. Most importantly, there are various correlational methods to assess the relationship with other constructs. There are also factor-analytic strategies to assess the latent structure of the instrument, notably in terms of convergent and discriminant validity. The former, convergent validity, refers to the extent to which multiple measures relate to one another, therefore assuming a relationship with a latent construct; here, the link with congeneric measures is self-evident. This is a necessary condition for construct validity, but not a sufficient one (Carlson and Herdman, 2012). It needs to be complemented by discriminant validity (sometimes dubbed divergent validity), which refers to 'the ability of two tests or measures to be discriminated from one another'. It can be defined as follows: 'two measures intended to measure distinct constructs have discriminant validity if the absolute value of the correlation between

the measures after correcting for measurement error is low enough for the measures to be regarded as measuring distinct constructs' (Rönkkö and Cho, 2022, p. 11). Both convergent and discriminant validity are part of a broader validation strategy – the multitrait-multimethod or MTMM matrix – developed by Campbell and Fiske (1959). We see discriminant validity, for example, in survey scales that measure different dimensions of a concept. In public service motivation, for instance, attraction to public policy making is different from (and should be measured distinctly from) commitment to the public interest (Vandenabeele, 2008).

Other Considerations

Next to these test-theoretical considerations, the researcher needs to take some other theoretical and practical reflections into account. Survey length is an important issue when studying samples of respondents who are busy and have little time to spend on scholarly research. Survey platforms may also charge more for fielding longer surveys. The quest for validity is, therefore, always balanced against the practical execution of the study. This translates into the shortening of existing scales in search of parsimony, a development that is witnessed in governance research as well.

Another issue is the generalizability of survey scales. Simple translations between disciplines are not self-evident. Think, for example, of the different meanings of 'stock' for a farmer (an inventory of goods or livestock) and a financial broker (certificates of partial ownership of a firm). Such issues become even more apparent when generalizing surveys across countries or geographic regions. The meaning of question wordings may differ and, moreover, cultural differences may influence answering behavior. In some cultures, the use of extreme answers or endpoints on Likert-type scales is much more prevalent than in others; also, the configuration of a latent construct may not be similar across cultures (Jilke et al., 2015). Such issues refer to a lack of measurement invariance across cultures. In the next section, we will, therefore, also discuss the cross-cultural adaptation of survey scales, which has become more common.

Further Elaboration of Evaluating Strategies: Formulating Criteria

Based upon the above account of criteria, Table 12.1 provides an overview of the questions that should be asked in evaluating the psychometric properties and the general quality of a measure, along three dimensions. First, the conceptualization should be adequate; a clear and focused definition is essential, and items should reflect the full spectrum of aspects within the definition of the latent construct. Second, the scales should meet criteria of reliability and validity. Third, the scales should meet practical criteria. In Table 12.1, we distinguish between the types of criteria to be assessed. We describe general questions to be asked, referring to subdimensions of criteria or to the nature of the criteria at hand. Lastly, we mention the operational criteria. These entail strategies to assess whether the criteria have been satisfied.1 Such operational criteria can help in practically assessing a survey scale.

Table 12.1  Overview of theoretical and operational criteria for assessing measurement quality

Theoretical criteria | Questions to be asked | Operational criteria
Conceptualization | What do I intend to measure? | Clear and focused definition
 | Is the set of items exhaustive? | Items cover the definition (see also content validity)
Type of model | Is it a congeneric model? | Exploratory factor analysis (including correct estimation); confirmatory factor analysis (including good fit and correct estimation)
 | Is it a tau-equivalent model? | Confirmatory factor analysis ('congeneric' plus equal factor loadings)
 | Is it a parallel model? | Confirmatory factor analysis ('tau-equivalent' plus equal error variances)
Reliability | Is coefficient alpha a good strategy? | Reported Cronbach's alpha of over 0.70 and evidence of a tau-equivalent model
 | Is omega-based reliability a good strategy? | Reported omega-based reliability of over 0.70 and evidence of a congeneric model
Validity | Is there evidence of content validity? | A reflection on the relationship between the construct and the measure
 | Is there evidence of criterion validity? | Correlation between the construct at hand and other relevant constructs (concurrent or predictive validity)
 | Is there evidence of construct validity? | Evidence of convergent validity (evidence of a congeneric model); evidence of discriminant validity (significant chi-square difference between overlapping constructs, or a correlation confidence interval between constructs that does not include zero)
Practical implementation | Is the survey length acceptable? | Dependent on target group
 | Do the contents fit with cultural circumstances? | Dependent on context
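As an illustration of the reliability and criterion-validity rows in Table 12.1, the sketch below shows how coefficient alpha and a criterion correlation can be computed from an item-score matrix. It is a minimal sketch on simulated data: the data-generating values (sample size, loadings, error variances) are our own assumptions, the 0.70 cut-off is the rule of thumb from the table rather than a universal law, and alpha presupposes (at least) a tau-equivalent structure, as the table notes.

import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha for an n_respondents x k_items score matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(1)
latent = rng.normal(size=500)                                    # latent construct scores
items = latent[:, None] + rng.normal(scale=0.7, size=(500, 4))   # four noisy items

print(f"alpha = {cronbach_alpha(items):.2f}")                    # table criterion: > 0.70

# Criterion validity sketch: correlation of the composite with a separate,
# theoretically relevant measure (the 'criterion').
criterion = latent + rng.normal(scale=1.0, size=500)
r = np.corrcoef(items.mean(axis=1), criterion)[0, 1]
print(f"criterion correlation = {r:.2f}")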

THE USE OF SURVEY SCALES IN GOVERNANCE RESEARCH

To discuss the proliferation of individual level measurements in governance scholarship, we conducted a systematic review of the extant literature. We collected and analyzed articles in scholarly journals focused on public governance that develop, revise, or adapt survey scales. The literature collection was conducted in April 2022. The inclusion criteria are based on those of Van Engen (2017), who collected a similar set of studies. We searched for articles in Web of Science in the spring of 2022.2 We selected articles that referred to the development of a scale in the title, abstract, or keywords. Studies dedicated to the topic of scale development that did not use the selected search terms, or did not mention the goal of scale development explicitly in the title, abstract, or keywords, may therefore have been overlooked. We exclusively focused on scales developed in public administration that measure individual perceptions, attitudes, opinions, and values.

Based on this systematic literature search, we found 51 articles that develop, adapt, or shorten survey scales measuring individual level characteristics. The method was first described in a public governance context in the 1990s. In 1995, Hal Rainey, Sanjay Pandey, and Barry Bozeman published measurement scales for general red tape (GRT) and personnel red tape (PRT), immediately followed by James Perry's well-recognized measure of public service motivation (PSM) in 1996. It then took ten years for this exercise to be repeated in the public governance literature. The first further developments of survey scales in this stream of literature were closely related to the original two scales. From 2005 to 2010, we see the development of five more survey scales, most of which move the first two scales forward. These include revisions or adaptations of the PSM scale (Vandenabeele, 2008; Kim, 2009) and a scale on green tape, which is closely related to the earlier red tape scales (DeHart-Davis, 2009). Six more followed in the first half of the 2010s. Again, some of these demonstrate close relations to the then existing scales, including two refinements of the PSM scale (Kim, 2011; Kim et al., 2013) and a scale on public service ethos (Rayner et al., 2011). But we also see new topics appear on the stage, including the measure of policy alienation (Tummers, 2012). After 2015, the number of scales developed in the public governance literature increases strongly. From 2017 to 2022, we see on average more than five published articles per year that develop, revise, or shorten a survey scale. That is also when the measured topics start to diverge strongly.

The measurement of attitudes related to human resource management has thus dominated the literature on survey scales in public governance during its initial stages. The largest number of such articles builds on the theory of public service motivation (e.g. Perry, 1996; Coursey and Pandey, 2007; Vandenabeele, 2008; Kim, 2009, 2011; Kim et al., 2013; Ballart and Riba, 2017). Existing scales on PSM have been shortened (e.g. Coursey and Pandey, 2007), revised to fit cultural purposes (e.g. Kim et al., 2013), and their dimensions have been closely scrutinized and tested (e.g. Vandenabeele, 2008; Kim, 2009). During the surge of survey scales in the second decade of the twenty-first century, however, many more topics in public management have been considered.

The initial survey scales in public governance focused mainly on civil servants and their attitudes toward their job, the institutions of their work environment, and their leaders. Later, we see a wider focus and a growing interest in measuring the attitudes of civil servants toward citizen-clients. Examples are interaction style (Van Parys and Struyven, 2018), enforcement style (de Boer, 2019), and attitude and compassion toward clients (Keulemans and Van de Walle, 2020; Ropes and de Boer, 2021). Some studies also focus on clients' and stakeholders' attitudes toward government and public services, but these remain fewer in number. Existing scales that take the citizen or stakeholder perspective include, for example, the scales of bureaucratic reputation (Lee and Van Ryzin, 2019; Overman et al., 2020), public value of organizations (Meynhardt and Jasinenko, 2020), and trust in government (Grimmelikhuijsen and Knies, 2017). Still, more than 80 percent of the articles in our sample report on survey scales developed to test attitudes of public sector employees. Citizen and stakeholder attitudes remain a minority among the developed scales in the public governance literature, even now that much public policy research takes the individual citizen as a unit of analysis.
Conceptualization in Practice

Most concepts that have been measured have been theoretically described to some extent, either in previous research or as a starting point for the development of new measurement instruments. Many scholars choose a deductive strategy for their measurement: they start with (pre-existing) theory to develop a measurement approach. For example, both Lee and Van Ryzin (2019) and Overman et al. (2020) developed a measurement of the bureaucratic reputation of government agencies. Both build on the work of Carpenter (2010), who developed a theory of bureaucratic reputation with four theoretically separate dimensions: performative, moral, technical, and procedural reputation. The authors took the existing theory as a point of departure and, with the theoretical description in mind, formulated items to tap into the presumed dimensions. In their studies, they subsequently tested whether the dimensions could also be identified empirically in a reliable manner. This deductive approach is the method of choice for most authors, but not all. In our review, we found that almost half (45 percent) of the articles discussing a newly developed survey scale used a primarily deductive approach.

The alternative is a purely inductive approach, where the researcher presents a concept to the target group and interviews them about its dimensions and indicators. The interview results then serve as the basis for the construction of dimensions or items. The green tape scale that DeHart-Davis (2009) developed is an example. She interviewed 90 city employees about their view of written rules and distilled five attributes from these conversations, which represent elements of green tape – the opposite of red tape: effective rules. The attributes she identifies on the basis of these interviews are written rules, valid means-ends relationships, optimal control, consistent application, and understood purposes (DeHart-Davis, 2009, p. 375). In the measurement scale she develops, these five attributes form the dimensions for which she has generated survey items. In our overview, we find that only a handful of studies (mainly) apply an inductive approach.

Many studies combine the two and use a mix of theoretical deduction and inductive interviews to generate dimensions and items to measure their concept of interest. Van Loon and her colleagues (2016), for example, developed two dimensions for their concept of job-centered red tape, which represents public sector employees' experience of red tape in their jobs. Based on theoretical expectations, they identified the burden of compliance with rules and the lack of functionality of rules as the dimensions of job-centered red tape. They then qualitatively interviewed ten public sector employees to generate an item pool for these two dimensions, developing items based on the words that interview respondents used (including 'pressure' and 'delay'; Van Loon et al., 2016, p. 668).

It is recommended to use a combination of both approaches. A purely deductive approach often leads to theoretically sound measurement constructs, but obtaining reliable results can become problematic, as theoretically derived scales are often not focused precisely enough. Some researchers report that they had to drop many items after an initial validation effort, finding that items did not load on the latent construct or the dimension that was hypothesized (e.g. Yang, 2005; Overman et al., 2020). Conversely, a purely inductive approach also has its drawbacks. A scale that is inductively constructed may not align completely with the literature on the same topic if the measurement does not cover the concept in its entirety.
It may also result in survey scales that capitalize on specificities of the cultural setting in which the initial interviews were conducted.

Scale Adaptation

The length of a survey is an important consideration in the development of survey scales. More items may lead to higher construct validity, if a concept or dimension can be measured in great detail. It may also lead to higher internal consistency, which is particularly important if conclusions are drawn at an individual – rather than group – level, for example in clinical psychological tests for diagnoses (Ziegler et al., 2014). In many studies of governance, the focus of the conclusions lies on groups rather than individuals. High reliability, as measured by, for example, Cronbach's alpha, may then be less of a concern than measuring the complete concept. Moreover, the tradeoffs of longer survey scales include response time, costs, and respondent attrition. Increased response time may cost more money if the surveys are fielded by a commercial enterprise. More importantly, increased response time leads to survey attrition. Particularly with web surveys, short scales and shorter response times lead to higher response rates.

Shorter scales require sharply focused and well-defined concepts to be measured. Yet, traditionally, research in public governance is concerned with broad concepts; this is no different for studies that try to measure such concepts at the individual level. Examples of such broad concepts include leadership (Roman et al., 2019; Latif et al., 2022), public values (Wang and Wang, 2020), and public service motivation (Perry, 1996). As a result, measurement scales in public governance are often high-dimensional and contain large batteries of items. The original public service motivation scale contained 4 dimensions and 18 items (Perry, 1996). The first measurement scale on policy alienation – the tendency of public professionals to alienate themselves from the policies they have to execute – included 5 dimensions and 23 items (Tummers, 2012).

The push for shorter and more efficient scales is visible in the literature. The first study aiming to shorten a scale appeared in 2007, when Coursey and Pandey reduced Perry's original public service motivation scale from 4 dimensions and 18 items to 3 dimensions and 10 items. To do so, they dropped the self-sacrifice dimension of public service motivation on theoretical grounds. That decision implies that the shortened instrument may in essence measure a slightly distinct concept. There may, of course, be valid reasons to abridge a survey scale by opting for a theoretically narrower definition of the concept of interest. An empirical way to reduce the length of survey scales is to look for redundancy in the original items (a strategy sketched in code at the end of this subsection). That is what the authors of the first shortened public service motivation scale also did, and how they further condensed the scale. Van Engen (2017) condensed the policy alienation scale (Tummers, 2012) from 5 dimensions and 23 items to a very short scale with no more than 5 items. She used one representative item for each dimension and demonstrated the validity of her instrument in three studies.

As we see with many contemporary endeavors in the social sciences, a strong focus on the developed Western world predominates in the literature. This is no different for the development of survey scales in public governance research. Most survey scales in this area have been developed in Western countries, with samples from the US (in 29 percent of the reviewed studies) and the Netherlands (22 percent) standing out. The focus on these specific administrative and political cultures comes with the risk of measurement variance when scales are transposed to other cultures. Measurement invariance is, therefore, important to study (Jilke et al., 2015).
The oft-mentioned public service motivation scale has been tested in multiple contexts and countries, and the authors of a 12-country comparison warn against making direct comparisons across countries based on the same scale (Kim et al., 2013).

At the same time, we also see promising developments in the validation or adaptation of survey scales in different cultural contexts. In particular, the concept of public service motivation has been translated into a great number of varying cultural contexts. Some authors focus on revising existing scales to fit cultural purposes (e.g. Ballart and Riba, 2017; Xu and Chen, 2021), and it might be necessary to do so for other measurement scales. Ballart and Riba (2017), for example, broaden the definition of their concept: they add political loyalty as an additional dimension to the concept of public service motivation, to make the concept better suited to countries with a Napoleonic administrative tradition, such as France.
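Before turning to emerging trends, the redundancy-based shortening strategy discussed above can be illustrated with a small simulation. The sketch is ours and deliberately simplified: item quality is encoded in assumed factor loadings, items are ranked by item-rest correlation (one common, though not the only, redundancy diagnostic), and the content-validity judgment that must accompany any shortening decision is left aside.

import numpy as np

def item_rest_correlations(items: np.ndarray) -> np.ndarray:
    """Correlation of each item with the sum of the remaining items."""
    total = items.sum(axis=1)
    corrs = []
    for j in range(items.shape[1]):
        rest = total - items[:, j]
        corrs.append(np.corrcoef(items[:, j], rest)[0, 1])
    return np.array(corrs)

rng = np.random.default_rng(7)
latent = rng.normal(size=400)
# Six hypothetical items of varying quality; the weakest are candidates to drop.
loadings = np.array([0.9, 0.8, 0.8, 0.6, 0.4, 0.3])
items = latent[:, None] * loadings + rng.normal(size=(400, 6))

ranked = np.argsort(item_rest_correlations(items))[::-1]
print("items ranked by item-rest correlation:", ranked)
# Note: whether dropping the weakest items also drops a dimension (cf. the
# self-sacrifice example above) is a theoretical question, not a statistical one.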

EMERGING TRENDS IN MEASURING GOVERNANCE AT THE INDIVIDUAL LEVEL Other means of measuring individual level attributes exist beyond survey questions, and some studies promise to make important headway in developing innovative strategies, too. In this chapter, we have deliberately focused on survey scales as the most important way of measuring individual level attributes, as it is currently the most prevalent method of measurement in the discipline. Beyond Self-report Other types of measurement that are currently in use include direct observation. Many types of behavior can be observed directly or indirectly. Think of evaluations of policy interventions that are implemented to alter choice behavior. Usually, we are able to observe potential effects of policy interventions, such as vaccine uptake (Leight and Safran, 2019), sentences for criminal recidivism (Aslim et al., 2022), or public sector employees helping citizens (Szydlowski et al., 2022). Also, some studies have been published that use the information that is transmitted by mobile phones to measure movement through public spaces, such as train stations. But we cannot observe feelings, attitudes, or intentions; therefore, we do not know why any results from policy interventions have been attained. In many questions in the discipline, the intentions and attitudes, understandably, remain the concepts of interest. Do citizens trust their government? Are they happy with services? Or, do public sector employees intend to stay in their jobs? Those questions require an insight into the minds of individuals and, therefore, surveys have long been the go-to for scholars of governance. Recently, we have witnessed upswings in other types of measurements, which come closer to the motivations of individuals. Particularly in the area of feelings and emotions, some innovations have entered the discipline of governance. These measurements are rooted in the idea that emotional feelings can be assessed discretely, in two dimensions, or in a combination (Scherer, 2022). The two dimensions are usually referred to as arousal and valence. For example, happiness has positive valence and some arousal, whereas anger has negative valence and some arousal. In comparison, boredom is expressed with less arousal than anger, and excitation is expressed with more arousal than happiness (Russell, 1980). One example of a practical implementation is the work by Hattke and colleagues (2020), who measured emotional expressions as a result of bureaucratic red tape. They used facial expression recognition software to register micro-expressions in their subjects. Based on automated video image analysis, the authors analyzed reactions that can point to discrete emotions. Other variants have been used in political science, too. These include affective reactions to political rhetoric (Bakker et al., 2021). Bakker and colleagues measured skin conductance, which is representative of arousal, and facial muscle activation via facial electromyograms, which provide indications of changes in facial expressions. Heart rate variability is also monitored in some

studies; an increase can represent arousal or activation (Bakker et al., 2021), and a decrease can point to attentiveness – therefore, more variability can be interpreted as more engagement (Soroka et al., 2019). Current developments in computer science and engineering promise even more possibilities for measuring individual behavior. A recent review identifies various types of measurements for monitoring human behavior (Davila-Montero et al., 2021). Apart from the types discussed above (video and physiological), there have been developments in the measurement of audio, such as tone of voice. Vocal pitch is highly relevant in political communication, as established by Boussalis and colleagues (2021), and may be important for other aspects of governance, too. Spatial aspects of behavior, such as movement, orientation, and proximity, can also tell us about feelings and emotions, as they provide information about body posture and interaction. The instruments available in mobile phones, such as gyroscopes and accelerometers, can even be used to measure such spatial activity. The neurosciences may one day gain importance in our field as well.
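To make the two-dimensional valence–arousal logic concrete, here is a minimal sketch assuming hypothetical readings scaled to [−1, 1], as a facial-coding or psychophysiological pipeline might emit per time window; the thresholds and labels are illustrative choices, not part of any of the instruments cited above.

```python
# Illustrative sketch of the valence-arousal (circumplex) model described above;
# thresholds and labels are hypothetical, chosen only to show the idea.

def classify_affect(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) reading, each in [-1, 1], onto a coarse
    discrete emotion label in the spirit of Russell (1980)."""
    if abs(valence) < 0.1 and abs(arousal) < 0.1:
        return "neutral"
    if valence >= 0:
        return "excited/happy" if arousal >= 0 else "calm/content"
    return "angry/afraid" if arousal >= 0 else "sad/bored"

# Example: hypothetical per-frame coordinates from a video or sensor pipeline.
readings = [(0.7, 0.4), (-0.6, 0.8), (-0.4, -0.5)]
print([classify_affect(v, a) for v, a in readings])
# -> ['excited/happy', 'angry/afraid', 'sad/bored']
```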

CONCLUDING THOUGHTS

We have witnessed important advances in the measurement of individual-level aspects of governance over the last 20–30 years. In particular, the development, adaptation, and implementation of survey scales has matured to a great extent in the governance literature. At the same time, important steps still need to be taken. Two methodological cautions should be raised with regard to the use of survey scales. First, we already mentioned the assumption that all items weigh equally in the approximation of the true score. This has led to the use, and sometimes misuse, of Cronbach's alpha as a generalized quality criterion, even in cases where it should not be used. Second, it remains important to acknowledge the lack of measurement invariance between populations for many survey scales. We have discussed issues of invariance in the structure of latent constructs. We should also mention full measurement invariance, which includes the absolute values of test scores (Van De Schoot et al., 2013). Most survey scales have not been tested for the equivalence of values and means across groups, let alone the individual items. Therefore, we are able to draw conclusions about relationships between variables in a study, but we should refrain from drawing conclusions about the mean absolute values (scores) of individuals or groups. Until full measurement invariance has been established, these numbers should not be compared out of context.

We foresee an increased use of measurement instruments at the individual level, and it is likely that future instruments will go beyond self-report in surveys. Scholars continue to develop and improve reliable and valid instruments, which is a prerequisite for high-quality science. These developments will facilitate more valid conclusions and better recommendations for governance practice.
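To make the first caution concrete, the following minimal sketch computes the standard Cronbach's alpha formula on simulated item data; note that the formula implicitly weighs all items equally, which is exactly the assumption criticized above. The data and names are illustrative, not drawn from any study discussed in this chapter.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Standard Cronbach's alpha for an (n_respondents, k_items) matrix.
    Note the implicit assumption criticized above: every item contributes
    to the total score with equal weight."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of the sum score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
latent = rng.normal(size=200)                              # one latent trait
items = latent[:, None] + rng.normal(scale=0.8, size=(200, 4))  # four parallel items
print(round(cronbach_alpha(items), 2))  # high alpha for these simulated items
```

Congeneric reliability coefficients such as composite reliability or McDonald's omega (see McDonald, 1999; Raykov, 2001a) relax this equal-weighting assumption.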

NOTES

1. A large part of these strategies builds on statistical methods based on latent factor modeling (exploratory factor analysis, but mainly confirmatory factor analysis); given the scope of this chapter, the particularities of these methods are not elaborated upon.

2. We used the following query: ((LA = (English)) AND WC = (Public Administration) AND TS = (developing measure OR measurement instrument OR measurement scale OR scale development OR validating OR validation) AND PY = (1995–2022)).

REFERENCES

AERA, APA, & NCME (1999). Standards for educational and psychological testing (revised edn). American Educational Research Association.
Aslim, E.G., Mungan, M.C., Navarro, C.I., & Yu, H. (2022). The effect of public health insurance on criminal recidivism. Journal of Policy Analysis and Management, 41(1), 45–91. https://doi.org/10.1002/pam.22345.
Bacon, D., Sauer, P., & Young, M. (1995). Composite reliability in structural equation modeling. Educational and Psychological Measurement, 55(3), 394–406.
Bakker, B.N., Schumacher, G., & Rooduijn, M. (2021). Hot politics? Affective responses to political rhetoric. American Political Science Review, 115(1), 150–64. https://doi.org/10.1017/S0003055420000519.
Ballart, X., & Riba, C. (2017). Contextualized measures of public service motivation: The case of Spain. International Review of Administrative Sciences, 83(1), 43–62.
Borry, E.L. (2016). A new measure of red tape: Introducing the three-item red tape (TIRT) scale. International Public Management Journal, 19(4), 573–93.
Boussalis, C., Travis, G.C., Holman, M.R., & Müller, S. (2021). Gender, candidate emotional expression, and voter reactions during televised debates. American Political Science Review, 115(4), 1242–57. https://doi.org/10.1017/S0003055421000666.
Brudney, J., Cheng, Y.D., & Lucas, M. (2022). Defining and measuring coproduction: Deriving lessons from practicing local government managers. Public Administration Review, 82, 795–805. https://doi.org/10.1111/puar.13476.
Campbell, D.T., & Fiske, D.W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81.
Carlson, K.D., & Herdman, A.O. (2012). Understanding the impact of convergent validity on research results. Organizational Research Methods, 15(1), 17–32.
Carpenter, D. (2010). Reputation and power: Organizational image and pharmaceutical regulation at the FDA. Princeton University Press.
Chen, C.A., & Xu, C. (2021). No, I cannot just walk away: Government career entrenchment in China. International Review of Administrative Sciences, 87(4), 944–61.
Chen, C.A., Chen, D.Y., & Xu, C. (2018). Applying self-determination theory to understand public employee's motivation for a public service career: An East Asian case (Taiwan). Public Performance & Management Review, 41(2), 365–89.
Coursey, D.H., & Pandey, S.K. (2007). Content domain, measurement, and validity of the red tape concept: A second-order confirmatory factor analysis. The American Review of Public Administration, 37(3), 342–61.
Cunningham, R., & Olfski, D. (2001). Objectifying assessment centers. Public Personnel Administration, 5(3), 42–9.
Davila-Montero, S., Dana-Le, J.A., Gary, B., Hall, A.T., & Mason, A.J. (2021). Review and challenges of technologies for real-time human behavior monitoring. IEEE Transactions on Biomedical Circuits and Systems, 15(1), 2–28. https://doi.org/10.1109/TBCAS.2021.3060617.
de Boer, N. (2019). Street-level enforcement style: A multidimensional measurement instrument. International Journal of Public Administration, 42(5), 380–91.
DeHart-Davis, L. (2009). Green tape: A theory of effective organizational rules. Journal of Public Administration Research and Theory, 19(2), 361–84.
DeVellis, R.F. (2009). Scale development: Theory and applications (2nd edn). Applied Social Research Methods Series 26. Sage.

Giannatasio, N.A. (2008). Threats to validity in research designs. In G.J. Miller & K. Yang (Eds.), Handbook of research methods in public administration. Routledge, pp. 108–28.
Giauque, D., Ritz, A., Varone, F., Anderfuhren-Biget, S., & Waldner, C. (2011). Putting public service motivation into context: A balance between universalism and particularism. International Review of Administrative Sciences, 77(2), 227–53. https://doi.org/10.1177/0020852311399232.
Grimmelikhuijsen, S., & Knies, E. (2017). Validating a scale for citizen trust in government organizations. International Review of Administrative Sciences, 83(3), 583–601.
Grimmelikhuijsen, S., Jilke, S., Olsen, A.L., & Tummers, L. (2017). Behavioral public administration: Combining insights from public administration and psychology. Public Administration Review, 77(1), 45–56.
Guenoun, M., Goudarzi, K., & Chandon, J.L. (2016). Construction and validation of a hybrid model to measure perceived public service quality (PSQ). International Review of Administrative Sciences, 82, 208–30.
Hall, J.L., & Van Ryzin, G.G. (2019). A norm of evidence and research in decision-making (NERD): Scale development, reliability, and validity. Public Administration Review, 79(3), 321–9.
Han, Y., & Perry, J.L. (2020). Employee accountability: Development of a multidimensional scale. International Public Management Journal, 23(2), 224–51.
Hattke, F., Hensel, D., & Kalucza, J. (2020). Emotional responses to bureaucratic red tape. Public Administration Review, 80(1), 53–63. https://doi.org/10.1111/puar.13116.
Hinkin, T.R. (1995). A review of scale development practices in the study of organizations. Journal of Management, 21(5), 967–88.
Hu, L.T., & Olfski, D. (2008). Describing and measuring phenomena in public administration. In G.J. Miller & K. Yang (Eds.), Handbook of research methods in public administration. Routledge, pp. 205–12.
Jensen, U.T., Andersen, L.B., Bro, L.L., Bøllingtoft, A., Eriksen, T.L.M., Holten, A.L., & Würtz, A. (2019). Conceptualizing and measuring transformational and transactional leadership. Administration & Society, 51(1), 3–33.
Jilke, S., Meuleman, B., & Van de Walle, S. (2015). We need to compare, but how? Measurement equivalence in comparative public administration. Public Administration Review, 75(1), 36–48.
Joreskog, K.G. (1971). Simultaneous factor analysis in several populations. Psychometrika, 36(4), 409–26. https://doi.org/10.1007/BF02291366.
Kay, K., Rogger, D., & Sen, I. (2020). Bureaucratic locus of control. Governance, 33(4), 871–96.
Keulemans, S., & Van de Walle, S. (2020). Understanding street-level bureaucrats' attitude towards clients: Towards a measurement instrument. Public Policy and Administration, 35(1), 84–113.
Kim, S. (2009). Revising Perry's measurement scale of public service motivation. The American Review of Public Administration, 39(2), 149–63.
Kim, S. (2011). Testing a revised measure of public service motivation: Reflective versus formative specification. Journal of Public Administration Research and Theory, 21(3), 521–46.
Kim, S., Vandenabeele, W., Wright, B.E., Andersen, L.B., Cerase, F.P., Christensen, R.K., & De Vivo, P. (2013). Investigating the structure and meaning of public service motivation across populations: Developing an international instrument and addressing issues of measurement invariance. Journal of Public Administration Research and Theory, 23(1), 79–102.
Latif, K.F., Ahmed, I., & Aamir, S. (2022). Servant leadership, self-efficacy and life satisfaction in the public sector of Pakistan: Exploratory, symmetric, and asymmetric analyses. International Journal of Public Leadership, 18(3), 264–88.
Lawshe, C.H. (1975). A quantitative approach to content validity. Personnel Psychology, 28(4), 563–75.
Lee, D., & Van Ryzin, G.G. (2019). Measuring bureaucratic reputation: Scale development and validation. Governance, 32(1), 177–92.
Leight, J., & Safran, E. (2019). Increasing immunization compliance among schools and day care centers: Evidence from a randomized controlled trial. Journal of Behavioral Public Administration, 2(2). https://doi.org/10.30636/jbpa.22.55.
Liu, T.A.X., Juang, W.J., & Yu, C. (2022). Understanding corruption with perceived corruption: The understudied effect of corruption tolerance. Public Integrity, 1–13.
Liu, Y., Zhang, Z., Chen, H., & Zhao, H. (2021). Measuring the political cost of environmental problems (PCEP): A scale development and validation. Journal of Chinese Governance, 1–19.

Lord, F.M., Novick, M.R., & Birnbaum, A. (1968). Statistical theories of mental test scores. Addison-Wesley.
McDonald, R.P. (1999). Test theory: A unified treatment. Lawrence Erlbaum.
Meynhardt, T., & Jasinenko, A. (2020). Measuring public value: Scale development and construct validation. International Public Management Journal, 24(2), 222–49.
Morgado, F.F., Meireles, J.F., Neves, C.M., Amaral, A., & Ferreira, M.E. (2017). Scale development: Ten main limitations and recommendations to improve future research practices. Psicologia: Reflexão e Crítica, 30.
Nascimento, T.G., de Souza, E.C.L., & Adaid-Castro, B.G. (2020). Professional competences scale for police officers: Evidence of psychometric adequacy. RAP: Revista Brasileira de Administração Pública, 54(1).
Nowell, B., & Boyd, N.M. (2014). Sense of community responsibility in community collaboratives: Advancing a theory of community as resource and responsibility. American Journal of Community Psychology, 54(3), 229–42.
Overman, S., Busuioc, M., & Wood, M. (2020). A multidimensional reputation barometer for public agencies: A validated instrument. Public Administration Review, 80(3), 415–25.
Overman, S., Schillemans, T., & Grimmelikhuijsen, S. (2021). A validated measurement for felt relational accountability in the public sector: Gauging the account holder's legitimacy and expertise. Public Management Review, 23(12), 1748–67.
Perry, J.L. (1996). Measuring public service motivation: An assessment of construct reliability and validity. Journal of Public Administration Research and Theory, 6, 5–22.
Pratama, A.B., & Imawan, S.A. (2019). A scale for measuring perceived bureaucratic readiness for smart cities in Indonesia. Public Administration and Policy, 22(1), 25–39.
Prete, M.I., Guido, G., Pichierri, M., & Harris, P. (2018). Age-related differences when measuring political hypocrisy. Journal of Public Affairs, 18(4), e1707.
Rainey, H.G., Pandey, S., & Bozeman, B. (1995). Research note: Public and private managers' perceptions of red tape. Public Administration Review, 55, 567–74.
Raykov, T. (2001a). Bias of coefficient α for fixed congeneric measures with correlated errors. Applied Psychological Measurement, 25(1), 69–76.
Raykov, T. (2001b). Estimation of congeneric scale reliability using covariance structure analysis with nonlinear constraints. British Journal of Mathematical and Statistical Psychology, 54(2), 315–23.
Raykov, T., & Marcoulides, G.A. (2011). Introduction to psychometric theory. Routledge.
Rayner, J., Williams, H.M., Lawton, A., & Allinson, C.W. (2011). Public service ethos: Developing a generic measure. Journal of Public Administration Research and Theory, 21, 27–51.
Roman, A.V., Van Wart, M., Wang, X., Liu, C., Kim, S., & McCarthy, A. (2019). Defining e-leadership as competence in ICT-mediated communications: An exploratory assessment. Public Administration Review, 79(6), 853–66.
Rönkkö, M., & Cho, E. (2022). An updated guideline for assessing discriminant validity. Organizational Research Methods, 25(1), 6–14.
Ropes, E., & de Boer, N. (2021). Compassion towards clients: A scale and test on frontline workers' burnout. Journal of European Public Policy, 28(5), 723–41.
Russell, J.A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161.
Sauter, D.A., & Russell, J.A. (forthcoming). https://hdl.handle.net/11245.1/2dfc139c-93fd-412f-84cc-135a0aee2cc3.
Scherer, K.R. (2022). Theories in cognition & emotion – social functions of emotion. Cognition & Emotion, 36(3), 385–7.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8(4), 350–53. https://doi.org/10.1037/1040-3590.8.4.350.
Schwab, D.P. (1980). Research methods for organizational studies. Psychology Press.
Searle, J.R. (1996). The construction of social reality. Penguin Books Philosophy.
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach's alpha. Psychometrika, 74(1), 107–20.

Soroka, S., Fournier, P., & Nir, L. (2019). Cross-national evidence of a negativity bias in psychophysiological reactions to news. Proceedings of the National Academy of Sciences, 116(38), 18888–92. https://doi.org/10.1073/pnas.1908369116.
Spearman, C. (1904). 'General intelligence,' objectively determined and measured. The American Journal of Psychology, 15(2), 201–93. https://doi.org/10.2307/1412107.
Szydlowski, G., de Boer, N., & Tummers, L. (2022). Compassion, bureaucrat bashing, and public administration. Public Administration Review, 82, 619–33.
Taamneh, M., Nawafleh, S., Aladwan, S., & Alquraan, N. (2019). Provincial governors and their role in local governance and development in the Jordanian context. Journal of Public Affairs, 19(1), e1900.
Tangsgaard, E.R. (2021). Risk management in public service delivery: Multi-dimensional scale development and validation. International Public Management Journal, April, 1–22.
Thau, M., Mikkelsen, M.F., Hjortskov, M., & Pedersen, M.J. (2021). Question order bias revisited: A split-ballot experiment on satisfaction with public services among experienced and professional users. Public Administration, 99(1), 189–204. https://doi.org/10.1111/padm.12688.
Trizano-Hermosilla, I., & Alvarado, J.M. (2016). Best alternatives to Cronbach's alpha reliability in realistic conditions: Congeneric and asymmetrical measurements. Frontiers in Psychology, 26(7) (May), 769. doi: 10.3389/fpsyg.2016.00769. PMID: 27303333; PMCID: PMC4880791.
Tummers, L. (2012). Policy alienation of public professionals: The construct and its measurement. Public Administration Review, 72(4), 516–25.
Tummers, L. (2017). The relationship between coping and job performance. Journal of Public Administration Research and Theory, 27(1), 150–62.
Tummers, L., & Knies, E. (2016). Measuring public leadership: Developing scales for four key public leadership roles. Public Administration, 94(2), 433–51.
Tummers, L., Vermeeren, B., Steijn, B., & Bekkers, V. (2012). Public professionals and policy implementation: Conceptualizing and measuring three types of role conflicts. Public Management Review, 14(8), 1041–59.
Van De Schoot, R., Kluytmans, A., Tummers, L., Lugtig, P., Hox, J., & Muthen, B. (2013). Facing off with Scylla and Charybdis: A comparison of scalar, partial, and the novel possibility of approximate measurement invariance. Frontiers in Psychology, 4(770), 1–15.
Van De Walle, S., & Van Ryzin, G.G. (2011). The order of questions in a survey on citizen satisfaction with public services: Lessons from a split-ballot experiment. Public Administration, 89(4), 1436–50. https://doi.org/10.1111/j.1467-9299.2011.01922.x.
Van Engen, N.A. (2017). A short measure of general policy alienation: Scale development using a 10-step procedure. Public Administration, 95(2), 512–26.
Van Loon, N.M., Leisink, P.L.M., Knies, E., & Brewer, A.G. (2016). Red tape: Developing and validating a new job-centered measure. Public Administration Review, 76(4), 662–73. https://doi.org/10.1111/puar.12569.
Van Parys, L., & Struyven, L. (2018). Interaction styles of street-level workers and motivation of clients: A new instrument to assess discretion-as-used in the case of activation of jobseekers. Public Management Review, 20(11), 1702–21.
Vandenabeele, W. (2008). Development of a public service motivation measurement scale: Corroborating and extending Perry's measurement instrument. International Public Management Journal, 11, 143–67.
Vogel, D., Reuber, A., & Vogel, R. (2020). Developing a short scale to assess public leadership. Public Administration, 98(4), 958–73.
Wang, X., & Wang, Z. (2020). Beyond efficiency or justice: The structure and measurement of public servants' public values preferences. Administration & Society, 52(4), 499–527.
Xu, C., & Chen, C.A. (2021). Revisiting motivations for a public service career (MPSC): The case of China. Public Personnel Management, 50(4), 463–84.
Yang, K. (2005). Public administrators' trust in citizens: A missing link in citizen involvement efforts. Public Administration Review, 65, 273–85.
Ziegler, M., Kemper, C.J., & Kruyen, P. (2014). Short scales – five misunderstandings and ways to overcome them. Journal of Individual Differences, 35(4), 185–9.

13. Criteria-based measurement of collaborative innovation and its impact on public problem solving and value creation1

Jacob Torfing, Andreas Hagedorn Krogh and Anders Ejrnæs

INTRODUCTION

This chapter aims to answer the increasingly pertinent research question of how new and better measurement tools can help researchers, policymakers and program managers to assess collaborative efforts to enhance public innovation in order to solve complex problems and enhance the creation of public value (Crosby, 't Hart and Torfing, 2017). Building on a research project that developed and tested a criteria-based assessment tool on multiple cases of collaborative innovation in the field of crime prevention (Torfing, Krogh and Ejrnæs, 2017, 2020), it makes the case for criteria-based measurement of collaborative innovation and its impact on public problem solving and value creation.

Recent research on public innovation (Hartley, 2005; Eggers and Singh, 2009; Bommert, 2010; Sørensen and Torfing, 2011; Hartley, Sørensen and Torfing, 2013; Torfing, 2016) suggests that cross-boundary collaboration between a broad range of public and/or private actors with different innovation assets (e.g., professional skills, scientific and practical knowledge, creativity, fresh ideas, courage, stamina, financial resources and implementation capacity) spurs the development of innovative solutions. It proposes the hypothesis that relevant and affected actors create more innovative solutions when they engage in trust-based processes of expansive and transformative learning (see Mezirow, 2000; Engeström, 2008) and participate in co-creation and co-governance (Osborne, Radnor and Strokosch, 2016) than they do on their own. Moreover, it suggests that collaborative efforts to spur innovation are imperative for tackling wicked problems that are difficult to understand, tangled and ridden with conflict, and for creating public value for affected actors and society at large.

Some qualitative single case studies confirm the positive impact of multi-actor collaboration on the ability to craft new and innovative solutions in the public sector (Roberts and Bradley, 1991; Steelman, 2010; Ansell and Torfing, 2014), but there are surprisingly few comparative case studies of collaborative innovation (Newman, Raine and Skelcher, 2001; Dente, Bobbio and Spada, 2005; Krogh and Torfing, 2015). Quantitative studies are even scarcer (for a rare exception, see Borins, 2014). Empirical case studies showing how collaboration spurs public innovation rarely document the impact of the innovative solutions on the behavior and wellbeing of the target group, thus preventing us from testing whether collaborative innovation produces public value outcomes.

A crucial measurement problem explains why research has failed to test the capacity of collaboration to produce innovation and public value outcomes: it is much more difficult to measure the outputs and outcomes of collaborative initiatives than their programmatic

activities. Addressing this problem, we have developed a new criteria-based measuring tool that enables the quantitative empirical measurement of process (collaboration), outputs (innovation) and outcomes (public problem solving and value creation) (Torfing, Krogh and Ejrnæs, 2017, 2020). Taking us beyond impressionistic qualitative assessments, it opens up the avenue of comparative analysis across a large number of empirical cases. Specifically, it provides researchers with the methodological means to examine the causal mechanisms between key variables and their scope conditions. For practitioners, it offers much-needed support for evaluating the outputs and outcomes of collaborative governance initiatives; for instance, public managers may use the criteria-based measuring tool to identify the need for process design and leadership in order to enhance innovation and improve the outcomes of collaborative processes. If deployed wisely, it thus holds the promise of advancing collaborative innovation research and practice.

In this chapter, we first describe the main purposes of criteria-based measurement of collaborative governance initiatives. Then, we explain the key characteristics of criteria-based measuring tools and define the key variables and indicators of collaboration, innovation and value creation. Next, we present the results from deploying the tool in the field of local crime prevention. Finally, we summarize the main findings and briefly discuss the current status and prospects for the criteria-based measurement of collaborative innovation processes and their public value outcomes.

THE MAIN PURPOSES OF CRITERIA-BASED MEASUREMENT OF COLLABORATIVE INNOVATION

Public and private actors initiate collaborative innovation processes in order to co-create new and better solutions that outperform the existing ones in terms of complex problem solving and public value creation. However, they cannot assume that new and innovative solutions will bring about the desired outcomes automatically; instead, they must measure the impact of the innovative solution to secure the feedback needed to improve, adjust, revise and consolidate the collaborative innovation processes, as well as to allocate sufficient funds to successful collaborative innovation projects while reforming or terminating under-performing ones to free up resources for more effective alternatives.

Many public projects build on the normative assumption that both collaboration and innovation are commendable practices that contribute to desirable outcomes. For an array of reasons, however, they rarely define or measure the form and character of the collaborative processes, the innovative solutions and/or their impacts. Empirically, it is difficult to separate the collaborative process from its innovative results and their problem-solving effects (Innes and Booher, 1999). Theoretically and methodologically, both researchers and practitioners lack clear conceptualizations, operational measures and reliable data on processes, outputs and outcomes (Emerson and Nabatchi, 2015). By the same token, there is a lack of applied knowledge and a dearth of studies examining and connecting the quality of collaborative processes, their innovative outputs and the problem-solving and value-creating impact of innovative solutions.

To remedy the problem, we developed a criteria-based measurement tool that offers practitioners a means for improving their efforts to solve wicked problems by spurring collaborative innovation and provides researchers with a method to test whether and when collaboration spurs innovation that affects complex problem solving and public value creation.

The developed measuring tool is criteria-based in the sense that it combines qualitative indicators of collaboration, innovation and ability to impact (i.e., to create desirable outcomes) with quantitative measures of the variables by constructing additive indexes. Each indicator in the indexes is assigned a score on a numerical scale. Specifically, each additive index consists of four indicators that are scored on a scale from 1 to 5, so the total score for each variable varies between 4 and 20. The tool was designed and applied in close cooperation with the Safer City Agency, the Department of Social Services, and the staff of local crime prevention projects in the City of Copenhagen. The aim of the research was not only to produce new scientific knowledge about the impact of collaborative innovation but also to develop a practical, user-friendly and easy-to-use tool that enables local crime prevention projects to evaluate themselves in order to improve their performance through a process of organizational learning. In addition, the data generated by the criteria-based scoring of the local projects were to enable the municipality to prioritize resources across projects, either by picking winners or by supporting promising but failing projects. The criteria-based assessment tool proved apt for measuring the quality of the collaboration between the involved actors, the innovativeness of the co-created solutions, and their ability to effectively solve the problem at hand and create value for the target group.

Criteria-based measuring is valuable for practitioners in multiple functions and roles at various levels of the public organization. First, public executives and cross-program managers may use it to evaluate and compare their entire portfolio of collaborative innovation projects addressing the same wicked problem and creating value for a particular target group. Second, program and project managers may use it to initiate discussions on how to overcome barriers, strengthen drivers and improve the performance of their collaborative initiative. Third, higher-level political leaders and administrative directors may use the comparative information on the relative success of the different projects to prioritize the allocation of scarce public funding between them. They may decide to reward successful projects with scaling potential and cut back or even terminate less successful projects that fail to improve their performance. Alternatively, they may want to invest more in politically salient and promising projects that are struggling to realize their potential. In sum, the criteria-based measuring and assessment tool stimulates learning and facilitates performance-based management. To maximize the learning potential, however, public executives must carefully avoid introducing it as yet another top-down control and punishment mechanism, since doing so would tend to undermine local ownership of the tool among project managers and collaborators.

Criteria-based measuring is also valuable for research purposes. Tapping into the performance-information data, researchers may use the criteria-based measuring tool to test the hypothesized correlations between collaboration, innovation and desirable outcomes.
Analyzing the performance of collaborative innovation projects over time, researchers can test whether improved collaboration enhances the innovativeness of the solutions and whether a rising level of innovation results in improved problem solving and enhanced value creation. Depending on the extent to which the criteria-based measuring relies on self-reported information, the data quality may not allow the identification of causal inference. By constructing and exploiting time series datasets, however, research deploying the criteria-based measuring tool can provide a sound indication of the extent to which collaboration is conducive to producing innovative public value outcomes. Researchers may also supplement the criteria-based data with additional qualitative data, either provided by the project or collected by the researchers themselves, in order to explore how framing, resource allocation,

institutional design and the exercise of leadership impact project performance (Torfing, Krogh and Ejrnæs, 2017). This type of analysis will help to shed light on the impact of metagovernance on collaborative innovation (Sørensen and Torfing, 2009, 2017).

In sum, researchers and practitioners alike may benefit from precise and comparable performance data gathered through the criteria-based measuring tool. There is little reason to expect researcher‒practitioner conflicts regarding the use of the tool, since it is based on valid operationalizations of the key variables and reliable measurement of the indicators. To achieve this, researcher‒practitioner collaboration is conducive but not essential; indeed, practitioners may perform criteria-based measuring without researchers taking any interest in the data. We constructed three sets of indices (one for each of collaboration, innovation and ability to impact), with four components for each, giving a total of 12 indicators, as sketched below.
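As a minimal sketch of how the three additive indexes might be represented in code — the indicator names abbreviate the labels used in this chapter, and the project's scores are invented for illustration:

```python
# Sketch of the tool's structure: three additive indexes, each summing four
# indicators scored 1-5, so each index ranges from 4 to 20. Names and scores
# are illustrative, not the authors' published instrument.
INDEXES = {
    "collaboration": ["breadth", "scope", "depth", "management"],
    "innovation": ["ideational_depth", "practical_depth", "character", "reputation"],
    "ability_to_impact": ["risk_factors", "methods", "knowledge", "goal_achievement"],
}

def additive_index(scores: dict, indicators: list) -> int:
    """Sum four 1-5 indicator scores into an additive index (range 4-20)."""
    values = [scores[name] for name in indicators]
    assert all(1 <= v <= 5 for v in values), "each indicator is scored 1-5"
    return sum(values)

project = {"breadth": 4, "scope": 3, "depth": 5, "management": 4,
           "ideational_depth": 3, "practical_depth": 2, "character": 3, "reputation": 4,
           "risk_factors": 5, "methods": 4, "knowledge": 3, "goal_achievement": 4}

for index_name, indicators in INDEXES.items():
    print(index_name, additive_index(project, indicators))  # each between 4 and 20
```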

THE CRITERIA-BASED MEASURING TOOL AND ITS KEY VARIABLES AND INDICATORS

The criteria-based measurement of collaborative innovation requires clear definitions of the key variables of collaboration, innovation and the ability to impact problem solving and value creation, which can guide the construction of indicators. In this section we focus on collaboration and use one of the indices for collaboration as an example to demonstrate how each of the 12 indicators is constructed.

Collaboration is commonly defined as a temporal process in which two or more interdependent but operationally autonomous actors work together in an organized fashion and use their different skills, competences and forms of knowledge to transform problems and opportunities into joint solutions resting on provisional agreements that are reached despite the persistence of dissent (Gray, 1989; Roberts and Bradley, 1991; Torfing, 2016). Early in the process, collaboration may help to facilitate a cooperative exchange of knowledge and resources and improve coordination in order to eliminate gaps and create synergies (Thomson and Perry, 2006; Keast, Brown and Mandell, 2007). Collaboration may therefore have a direct impact on the ability to solve public problems and create public value. However, it may also spur innovation by triggering mutual learning and joint problem solving while promoting joint ownership of new and bold solutions, and thus have a more indirect impact on project outcomes (Ansell and Torfing, 2014). In order to measure the quality and degree of collaboration, researchers and practitioners may select one or more of the following four indicators.

Table 13.1 exhibits the additive index measuring "the breadth of collaboration," defined in terms of the range of relevant and affected actors from the public and/or private for-profit or non-profit sector that are involved in the project. A highly diverse set of actors is important to prevent tunnel vision and stimulate outside-the-box thinking (see Dente, Bobbio and Spada, 2005; Skilton and Dooley, 2010). We can learn something new from actors who are different from ourselves, so it is a good idea to collaborate across professional groups and public agencies. Moreover, since public employees often think along similar lines, it is even better if private for-profit or non-profit organizations are included in the collaborative arena. Finally, the affected actors (in particular the people who belong to the target group) may also contribute valuable inputs to the decision-making process and should be included. If, for some reason, the barriers to such inclusion are prohibitive, then someone able to speak on behalf of the users should at least be given a seat at the table. If all of these actors participate in the

collaborative process, the highest score will be awarded; and if only the relevant actors from one's own agency are involved, the lowest score will be awarded.

Table 13.1  The additive index measuring the breadth of collaboration

Score  Criteria
5      The collaboration involves a range of public and private actors as well as youth from the target group (or a representative who can convey their points of view, wishes and needs)
4      The collaboration involves one or more public actors and one or more private for-profit or non-profit organizations
3      The collaboration involves various public organizations (e.g., state, region, municipality), departments (e.g., social, leisure, culture) and professional groups (e.g., social workers, administrators) with different perspectives on the problems, challenges and solutions
2      The collaboration involves different professional groups within the same public organization or administration
1      The project only involves actors within a given public organization or administration with the same professional background

A second indicator of collaboration is "the scope of collaboration," defined in terms of the number of phases in the formation, development and execution of the local projects involving multi-actor collaboration. The expectation is that cross-cutting collaboration in all phases of a project (from problem definition and goal formulation, through the selection and design of solutions, to implementation and consolidated operations) will be a stronger driver of innovation than a collaborative process that merely covers the implementation and operation phases (Torfing, 2016). The highest score is given when all phases are subject to multi-actor collaboration, and the lowest score is given when collaboration is limited to the phase of implementation and consolidated operations, where external resources must be mobilized to cut costs.

A third indicator of collaboration is "the depth of collaboration." It is not enough to measure who participates in the project and in which phase(s) they participate and collaborate; we must also assess the quality of the collaborative process. Is the collaborative process an insincere, superficial and symbolic gesture, or are the actors committed to a deep and close cooperation in which conflicts are constructively managed, resources and ideas are exchanged, and the goal is to produce clear and tangible results? The underlying expectation is that the deeper and closer the collaborative interaction, the more it will stimulate innovation. If all of the involved actors are engaged in the dialogue-based co-creation of results over a longer period of time, the highest score will be awarded; whereas if the collaborative process is merely a question of informing relevant and affected actors so that they can voice their opinions and object to ideas and proposals put forward by the project leaders, the lowest score will be awarded.

A fourth indicator of collaboration is "collaborative management," which is an indirect measure assessing whether there is an appropriate and inclusive project management that facilitates rule-bound collaboration and ensures steady progression toward goal achievement. Thomson and Perry (2006) view management as an indispensable part of their multidimensional model of collaboration. Participation is voluntary, and participants may withdraw if collaborative inertia prevents short-term success and quick wins (Huxham and Vangen, 2004). Provan and Kenis (2008) have clearly demonstrated the importance of management for successful collaboration in networks, and they recommend an inclusive approach to collaborative management. In collaborative settings, management is less about giving orders, monitoring performance and sanctioning poor results, and more about mobilizing and activating actors, building trust, facilitating dialogue, lowering the transaction costs of collaborating and tracking results (Kickert, Klijn and Koppenjan, 1997). In the absence of good management,

collaboration will crumble and have a limited impact, and it is important to involve key actors in the exercise of collaborative management, as through the formation of a "Network Administrative Organization" (Provan and Kenis, 2008). Hence, the highest score is given where there is clear and effective management throughout the project, anchored in a governing board in which the key actors jointly reflect on how the collaborative endeavor can be furthered. By contrast, the lowest score is given when no one really takes responsibility for managing the collaborative endeavor other than organizing meetings by circulating agendas and minutes.

The second variable is innovation, defined as a step-change that disrupts the conventional wisdom and established practice in a particular context (Hartley, 2006; Torfing, 2016). Innovation can result both from the invention of something new and from the replication of new ideas from elsewhere in new contexts. Hence, it is not the source of an innovation that determines whether or not it counts as an innovation; as long as something is new to the context in which it is implemented, it qualifies as an innovation. While innovation processes are driven by intentions to find new and better solutions, it would be erroneous to assume that innovation is inherently good and praiseworthy. Hence, normative evaluations of whether an innovation is good or bad are always ex post and highly dependent on who is doing the evaluating. One or more of the following four indicators may be selected to measure the degree of innovation in local projects.

A first indicator of innovation is "the depth of innovation at the ideational level." Innovations have both ideational and practical aspects. This indicator concerns the magnitude of the ideational steps taken during innovative step-change processes. If an innovation builds on a brand new program or change theory, which not only defines a new set of goals and tools but also redefines the problems and challenges at hand, the highest score will be given. By contrast, if an innovation merely builds on old ideas that are combined in slightly different ways to produce new functionalities and perhaps also new results and effects, the lowest score will be given.

A second indicator of innovation is "the depth of innovation at the level of practice." New and creative ideas must be implemented in practice to count as innovations, and the implementation of new ideas can be more or less path-breaking. It is therefore important to measure whether it is merely the delivery of existing services that is transformed, or whether the service itself, the organizational context, or perhaps even the overall policy is disrupted as well. The highest score is given when an innovation includes a new policy, organizational transformation, or radical changes in the form and content of a service and how it is produced and delivered. When the practical scope of an innovation is limited to new ways of producing and delivering a service, the lowest score is given.

A third indicator of innovation is "the character of the innovation," defined in terms of whether the innovation as a whole can be characterized as radical or incremental.
The focus here is on the combined impact of the ideational and practical change involved; that is, whether it amounts to a radical transformation of the ideas, practices and roles of the actors involved, in which case it will receive the highest score; or whether it is an incremental innovation that differs in only a few, limited ways from ongoing attempts to improve or optimize the existing modus operandi, in which case the lowest score is given.

A fourth indicator of innovation is "the reputation of the innovation." There is always a subjective component in determining whether or not something represents an innovation. If the initiator and/or leader of a project are the only ones who regard it as innovative, it might not be all that path-breaking after all. Conversely, the perception of a project as

innovative becomes more plausible if actors external to the project recognize its innovative and path-breaking character. As such, external recognition and public reputation are important indicators of a project's innovativeness. Hence, the highest score is given to a project if the external environment, such as other municipalities, private associations and organizations, public authorities or research institutions, recognizes it as innovative and the project has won innovation prizes and/or received positive media attention. By contrast, the lowest score is given to projects that are only perceived as innovations by their initiators.

The third variable is ability to impact, defined as the ability to solve a problem and create value for the target group and society at large (Crosby, 't Hart and Torfing, 2017). Even well-meaning projects can miss their target and fail to achieve their goals, and it is therefore important to measure whether projects are likely to produce the desired impact. While measuring process performance is relatively easy, measuring the actual outcomes of a more or less innovative solution is complicated. In real-life settings, it is difficult to measure all of the direct and indirect effects on a particular problem and target group, and virtually impossible to control for external factors that might affect problem solving and value creation. Measuring the effects of local projects on the welfare and behavior of the target group is further complicated, as projects often lack comparable data on relevant parameters before and after an intervention. A target group may be extremely diffuse and subject to constant fluctuations as participants move in and out of local projects in contingent ways. In such cases, it is impossible to establish objective measures of the direct effects that the projects and their more or less innovative solutions have on the welfare and behavior of the individual members of the target group. It is therefore necessary to consider indirect and subjective measures of the ability to impact, based on robust knowledge about causes and effects that are likely to affect the ability of the project to produce the desired impact. Relevant indicators of the ability of collaborative initiatives to make an impact thus depend on the field of intervention and the state-of-the-art knowledge in that field.

Applying our tool in the field of crime prevention, we decided to include four field-specific indicators. First, "the targeting of known risk factors," where the highest score is given to projects that directly target relevant high-risk factors, and the lowest score is given to projects that only indirectly affect low-risk factors. Second, "the use of effective methodological approaches," where the highest score is awarded to projects that consistently apply methodologies that are known to be effective, and the lowest score is assigned to projects with limited focus on applying effective methodological approaches. Third, "the presence of safe and robust knowledge about outcomes," where the highest score is given to projects that build on evidence-based methods tested by other similar projects and which also provide solid documentation for their own effects, and the lowest score is given to projects that only build on assumptions about possible future effects and which are incapable of providing documentation for their own effects.
Fourth, “the achievement of stated goals in terms of results and effects,” where the highest score is awarded to projects that have reached their stated goals for activity-related results and effects in the past year, and the lowest score is attributed to projects that have only partially achieved the desired results and effects. Ultimately, the specific set of indicators deployed is a context-specific design choice.
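Looking back at Table 13.1, such a rubric can be encoded as a simple decision rule that awards the highest criterion a project satisfies. The function below is a hypothetical sketch: the boolean inputs compress the published criteria, and the comments paraphrase them.

```python
def breadth_of_collaboration_score(involves_target_group: bool,
                                   involves_private_actors: bool,
                                   spans_public_organizations: bool,
                                   spans_professional_groups: bool) -> int:
    """Return the 1-5 score for 'the breadth of collaboration' (Table 13.1),
    awarding the highest criterion that the project satisfies."""
    if involves_private_actors and involves_target_group:
        return 5  # public and private actors plus the target group (or a representative)
    if involves_private_actors:
        return 4  # public actors plus private for-profit or non-profit organizations
    if spans_public_organizations:
        return 3  # various public organizations, departments and professional groups
    if spans_professional_groups:
        return 2  # different professional groups within one public organization
    return 1      # a single organization, a single professional background

# Example: public and private partners, but no direct target-group voice.
print(breadth_of_collaboration_score(False, True, True, True))  # -> 4
```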


EMPIRICAL APPLICATION OF THE CRITERIA-BASED MEASURING AND ASSESSMENT TOOL

The project provided a complete list of variables, indicators and scoring rules. In an iterative process of testing an early prototype of the tool on four local projects, we adjusted and refined indicators and scoring tables in close cooperation with the Safer City Agency, the Department of Social Services, and the staff of the local crime prevention projects in the City of Copenhagen. To ensure some degree of continuity with past measuring practices and performance information data, we took into account the impact measures that the municipality had developed and deployed in previous years. In addition, we discussed the design of the tool with an advisory stakeholder board created for the sake of ensuring qualified feedback to and ownership of the new tool.

The municipality then selected 24 local crime prevention projects that participated in testing the measuring and assessment tool. Adhering to the defined selection criteria, the projects were all operative, publicly financed and had a clearly discernible core activity that was amenable to evaluation. Although not a selection criterion, it turned out that all of the projects also relied on some degree of collaboration, co-creation and co-governance. Before receiving written instructions on how to score their projects, the project managers were verbally briefed about the research project in a joint meeting. They were instructed to discuss the scores in the project leadership group (typically comprising three or four members) before ticking the right boxes in the electronic questionnaire, which, besides the 12 indicators, also solicited some basic project information. Upon submission, the managers were also asked to briefly explain the scores assigned to each indicator. Validating all the scores, the Safer City Agency's employees then used their detailed knowledge about the projects to appraise the scores and compare them with the explanations and evidence provided by the project managers. If the agency judged the scores to be too high or low, they contacted the project manager to adjust the score to better reflect the project's actual performance. The agency staff generally found the scores to be relatively precise. Only a handful of corrections of individual scores were made, which may be explained by the disciplining impact of the projects' anticipation of the external validation. The purpose of the external validation was to prevent "window dressing," a well-known phenomenon whereby project managers exaggerate their results to make a project look good in the eyes of their financial benefactors.

Before analyzing the data, we checked whether the projects scored consistently on our indicators for measuring each of the three variables. If there is inconsistency between the indicators measuring a particular variable, it will not make sense to add them together to construct the three additive indexes. Based on a factor analysis (PCA) and calculation of Cronbach's alpha, we concluded that the correlations were sufficiently high (Cronbach's alpha was above 0.7 for all three indexes: collaboration = 0.81; innovation = 0.71; ability to impact = 0.85) to merit the combination of the indicators into composite variables, justifying the construction of three additive indexes measuring collaboration, innovation and crime prevention effects, respectively.
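The kind of consistency check described above can be sketched as follows. Synthetic data stand in for the 24 project scores, and the 0.5 cut-off on the first principal component is a hypothetical rule of thumb, not the authors' published procedure.

```python
import numpy as np
from sklearn.decomposition import PCA

# Simulate four correlated indicators for 24 projects (n matches the study).
rng = np.random.default_rng(1)
latent = rng.normal(size=24)                                        # one latent quality per project
indicators = latent[:, None] + rng.normal(scale=0.7, size=(24, 4))  # four noisy indicators

# If the first principal component dominates, the four indicators plausibly
# tap one underlying dimension, which supports summing them into one index.
pca = PCA().fit(indicators)
print(pca.explained_variance_ratio_.round(2))

if pca.explained_variance_ratio_[0] > 0.5:   # hypothetical unidimensionality cut-off
    index = indicators.sum(axis=1)           # one additive index score per project
```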
The scores from the four indicators of each variable were aggregated into a Likert-type scale with values ranging from 4 (lowest) to 20 (highest). In order to analyze the relations among the three key variables, we first conducted a simple correlation analysis of the bivariate relations between the three composite variables. Table 13.2 shows the bivariate correlations between the variables.

Table 13.2  Correlation matrix for the relationship between collaboration, innovation and ability to impact

                    Ability to impact   Collaboration   Innovation
Ability to impact   1
Collaboration       0.50*               1
Innovation          0.68***             0.54**          1

Note: * p < 0.05; ** p < 0.01; *** p < 0.005.

The table shows significant positive correlations between collaboration, innovation and ability to impact: projects scoring high on collaboration and innovation also tend to score high on ability to impact. The correlation coefficient is highest for the relationship between innovation and crime prevention impact and lowest between collaboration and crime prevention impact.

In an additional analysis, we included both collaboration and innovation as independent variables in a regression analysis aimed at explaining the crime prevention effect. We are fully aware that the n is too small to run a multiple regression analysis, but the results may nevertheless indicate a direction for future studies of larger datasets. The analysis shows that collaboration does not appear to have much direct impact on the extent of the crime prevention effect, as the bivariate relationship becomes insignificant when innovation is introduced and controlled for. The standardized regression coefficient is also drastically diminished, from 0.50 to 0.19, when we turn from the bivariate to the multivariate regression analysis. In sum, the analysis lends support to the hypothesis that collaborative innovation (and not simply collaboration per se) enhances crime prevention impacts. In other words, collaboration appears to have an indirect influence on crime prevention by way of increasing innovation, which in turn magnifies the crime prevention effects of local projects. Figure 13.1 illustrates the suggested theoretical relations between collaboration, innovation and crime prevention effect as they appear in the multivariate regression analysis; the bivariate correlations are given in brackets.

Figure 13.1  Suggested causal relations between collaboration, innovation and ability to impact
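The analysis pattern reported above — bivariate correlations, then a two-predictor regression in which collaboration's coefficient shrinks once innovation is controlled for — can be sketched on synthetic data; the simulated coefficients and the n of 24 are illustrative only and do not reproduce the study's data.

```python
import numpy as np

# Simulate a mediation structure: collaboration raises innovation,
# and innovation raises impact; collaboration has no direct effect.
rng = np.random.default_rng(2)
collab = rng.normal(size=24)
innov = 0.6 * collab + rng.normal(scale=0.8, size=24)
impact = 0.7 * innov + rng.normal(scale=0.8, size=24)

X = np.column_stack([collab, innov, impact])
Z = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize all three variables
print(np.corrcoef(Z, rowvar=False).round(2))    # bivariate correlation matrix

# Standardized coefficients from regressing impact on both predictors:
A = np.column_stack([np.ones(24), Z[:, 0], Z[:, 1]])
beta, *_ = np.linalg.lstsq(A, Z[:, 2], rcond=None)
print(beta[1:].round(2))  # collaboration's beta shrinks once innovation is included
```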

While both the correlation and regression analysis are indicative of the possible causal relations between the key variables, we should not forget that two crucial limitations in our analysis prevent us from drawing unequivocal conclusions about the causal relations between them.

First of all, it is not possible on the basis of cross-sectional data to say anything definitive about the direction of the causal relations; that is, which variable is dependent and which are the independent ones, since the composite variables were all measured at the same time. Hence, we cannot state conclusively that multi-actor collaboration improves the ability to innovate. In principle, it might be the other way around, although this might seem counterintuitive. This problem is only partly mitigated by the fact that some of the Copenhagen crime prevention projects have been carefully studied in a qualitative comparative case analysis that suggests the existence of causal relations between collaboration and innovation, and innovation and effect, respectively (Krogh and Torfing, 2015; Krogh, 2017; Torfing, Krogh and Ejrnæs, 2017).

Second, the analysis is based on self-reported data, which carries the risk that respondents who have already scored the project high on one indicator will also be inclined to score it high on subsequent indicators. This problem is mitigated by the construction of clear, pre-defined measurement categories, the strict demand for a written explanation of the score, and the subsequent validation procedure that prevents the projects from pursuing their self-interest when scoring their own project. The ultimate solution, however, would be to include an objective measure of the crime prevention effect, such as panel data on participant behavior in the local projects. No such data are currently available, and it is important from the practitioners' perspective that the measuring tool is easy to use and produces quick results that can guide political and administrative action to improve processes and outcomes.

The criteria-based measuring of the 24 crime prevention projects provided the administrative leadership of the crime prevention efforts in the City of Copenhagen with a much-needed overview of the performance of the different projects. There were few surprises in relation to the existing knowledge about the projects, but the combination of quantitative scores and more detailed qualitative assessments provided an excellent starting point for conversations with individual projects about what and how to improve. The positive practical experience with the criteria-based measuring and assessment tool indicated that the development and testing of the tool succeeded in producing a double impact on research and administrative practice.

CONCLUSIONS

Recent research suggests that cross-boundary collaboration spurs innovation, but few have applied quantitative measures of collaboration, innovation and the impact on public problem solving and value creation, let alone developed criteria-based measuring tools for doing so. The empirical test of such a tool, reported on in this chapter, reveals that it is relatively easy to use and that its empirical application generates data that enable practitioners to work more systematically with the processes, results and impacts of collaborative governance. Applied to a larger number of cases over time, it permits researchers to test hypotheses about the causal relations between collaboration, innovation and effect. The reported research cautiously lends support to the hypothesis that collaboration spurs innovation, which in turn has a crime prevention impact. This indicative finding has important implications for practice: practitioners should see collaboration as an instrument for stimulating the development and implementation of innovative solutions that will make a real difference in addressing complex public problems and creating value for particular target groups and society at large.

Disseminated on web-based platforms and at practitioner conferences addressing local crime prevention, the positive experiences from using the criteria-based measuring and

214  Handbook on measuring governance ment tool have been inspiring similar work in other Danish municipalities. The tool has also been re-purposed to measure collaboration, innovation and the impact of social inclusion in the Danish general housing sector, which now mandates the formation of local collaboration in disadvantaged housing estates that receive funding for improving the social conditions for and wellbeing of the tenants. Nordic delegations have visited Denmark to learn about the use of criteria-based assessment, and the Danish General Housing Association provides free access to their tool. Hence, there appears to be a potential for diffusion across sectors and countries. Due to the analytical limitations of the present study, more research is needed on how to measure and achieve the desired impacts regarding collaborative innovation. Further research and feedback from empirical applications in different local settings and policy areas would enable the refinement of the measuring tool. It may also be possible to find ways to supplement self-reported data with more objective performance data collected independently of project managers’ assessments. However, the challenge remains to continue to collect data in a rapid and efficient manner, allowing practitioners easy access to information that they can use to enhance the impact of collaborative innovation.

NOTE

1. We are grateful that Policy & Politics kindly granted us permission to re-use parts of an article published as Torfing, J., Krogh, A.H., & Ejrnæs, A. (2020). Measuring and assessing the effects of collaborative innovation in crime prevention. Policy & Politics, 48(3), 397‒423.

REFERENCES

Ansell, C., & Torfing, J. (Eds.) (2014). Public innovation through collaboration and design. Routledge.
Bommert, B. (2010). Collaborative innovation in the public sector. International Public Management Review, 11(1), 15‒33.
Borins, S.F. (2014). The persistence of innovation in government. Brookings Institution Press.
Crosby, B.C., 't Hart, P., & Torfing, J. (2017). Public value creation through collaborative innovation. Public Management Review, 19(5), 655‒69.
Dente, B., Bobbio, L., & Spada, A. (2005). Government or governance of urban innovation? disP – The Planning Review, 41(162), 41‒52.
Eggers, B., & Singh, S. (2009). The public innovators playbook. Harvard Kennedy School of Government.
Emerson, K., & Nabatchi, T. (2015). Evaluating the productivity of collaborative governance regimes. Public Performance & Management Review, 38(4), 717‒47.
Engeström, Y. (2008). From teams to knots: Activity-theoretical studies of collaboration and learning at work. Cambridge University Press.
Gray, B. (1989). Collaborating: Finding common ground for multiparty problems. Jossey-Bass.
Hartley, J. (2005). Innovation in governance and public service. Public Money and Management, 25(1), 27‒34.
Hartley, J. (2006). Innovation and its contribution to improvement. Department for Communities and Local Government.
Hartley, J., Sørensen, E., & Torfing, J. (2013). Collaborative innovation: A viable alternative to market competition and organizational entrepreneurship. Public Administration Review, 73(6), 821‒30. doi: 10.1111/puar.12136.
Huxham, C., & Vangen, S. (2004). Realizing the advantage or succumbing to inertia? Organizational Dynamics, 33(2), 190‒201. doi: 10.1016/j.orgdyn.2004.01.006.

Innes, J.E., & Booher, D.E. (1999). Consensus building and complex adaptive systems. Journal of the American Planning Association, 65(4), 412‒23.
Keast, R., Brown, K., & Mandell, M. (2007). Getting the right mix: Unpacking integration, meanings and strategies. International Public Management Journal, 10(1), 9‒34.
Kickert, W.J.M., Klijn, E.-H., & Koppenjan, J.F.M. (1997). Managing complex networks. Sage.
Krogh, A.H. (2017). Preventing crime together. Doctoral dissertation, Department of Social Sciences and Business, Roskilde University. Available at: forskning.ruc.dk/en/publications/preventing-crime-together-the-promising-perspectives-and-complica. Accessed August 17, 2023.
Krogh, A.H., & Torfing, J. (2015). Leading collaborative innovation. In A. Agger, B. Damgaard, A.H. Krogh, & E. Sørensen (Eds.), Collaborative governance and public innovation in Northern Europe (pp. 91‒110). Bentham Science Publishers.
Mezirow, J. (2000). Learning as transformation. Jossey-Bass.
Newman, J., Raine, J., & Skelcher, C. (2001). Transforming local government. Public Money and Management, 21(2), 61‒8. doi: 10.1111/1467-9302.00262.
Osborne, S.P., Radnor, Z., & Strokosch, K. (2016). Co-production and the co-creation of value in public services. Public Management Review, 18(5), 639‒53.
Provan, K.G., & Kenis, P. (2008). Modes of network governance: Structure, management and effectiveness. Journal of Public Administration Research and Theory, 18(2), 229‒52.
Roberts, N.C., & Bradley, R.T. (1991). Stakeholder collaboration and innovation. Journal of Applied Behavioural Science, 27(2), 209‒27.
Skilton, P.F., & Dooley, K. (2010). The effects of repeat collaboration on creative abrasion. The Academy of Management Review, 35(1), 118‒34.
Sørensen, E., & Torfing, J. (2009). Making governance networks effective and democratic through metagovernance. Public Administration, 87(2), 234‒58.
Sørensen, E., & Torfing, J. (2011). Enhancing collaborative innovation in the public sector. Administration and Society, 43(8), 842‒68. doi: 10.1177/009539971141876.
Sørensen, E., & Torfing, J. (2017). Metagoverning collaborative innovation in governance networks. The American Review of Public Administration, 47(7), 826‒39.
Steelman, T.A. (2010). Implementing innovation. Georgetown University Press.
Thomson, A.M., & Perry, J.L. (2006). Collaboration processes: Inside the black box. Public Administration Review, 66(s1), 20‒32.
Torfing, J. (2016). Collaborative innovation in the public sector. Georgetown University Press.
Torfing, J., Krogh, A.H., & Ejrnæs, A. (2017). Samarbejdsdrevet innovation i kriminalpræventive indsatser [Collaborative innovation in crime prevention initiatives]. Copenhagen Municipality.
Torfing, J., Krogh, A.H., & Ejrnæs, A. (2020). Measuring and assessing the effects of collaborative innovation in crime prevention. Policy & Politics, 48(3), 397‒423.

14. Using collaborative performance summits to help both researchers and governance actors make sense of governance measures

Scott Douglas

INTRODUCTION: HOW CAN RESEARCHERS AND ACTORS MAKE SENSE OF GOVERNANCE MEASURES?

Government actors have long been keen to formulate concrete measures to assess how they are doing (Van Dooren and Hoffmann, 2018). However, empirical studies have shown that these same actors struggled to understand all the data generated and often failed to use it when actually assessing their governance efforts (James et al., 2020). Similarly, researchers have enthusiastically collected large measurement sets, but then struggled to comprehensively assess the multiple dimensions of government work and connect the often conflicting perspectives of the multiple actors involved (Moore, 1995; Moynihan, 2010). The difficulty of measuring government becomes especially pronounced when assessing complex governance arrangements (Emerson et al., 2012), such as collaborations between public agencies and community groups addressing thorny societal issues such as radicalization, domestic violence, or climate change (Head and Alford, 2015). For example, how should researchers and practitioners interpret a rise in reports of domestic violence after the formation of a taskforce to reduce domestic violence: Is the community facing an increase in violence or have people become more aware of the issue? And how should researchers and actors weigh the dissatisfaction of a community group about the fight against climate change against the progress reported by a panel of experts?

This chapter does not attempt to cut through this complexity by finding a new and perfect measure, but rather outlines how actors can use dialogue routines to bring together the information each participant has to jointly make sense of these multiple measures (Moynihan et al., 2011). The chapter also examines how researchers can use these same dialogue routines to collect data on diverse governance measures and at the same time observe how the participants make sense of these measures (Douglas and Ansell, 2023).

This chapter specifically explores the multiple purposes, technical characteristics, and current research use of collaborative performance summits. These summits are defined as "dialogue routines where partners in a collaborative governance arrangement gather to explicate their goals, exchange information about their activities, examine the progress towards their goals, and explore potential actions for improvement" (Douglas and Ansell, 2021). Other scholars have described similar dialogue routines using terms such as interorganizational learning forums (Moynihan et al., 2011), forums, arenas, and courts (Bryson and Crosby, 1993), and PerformanceStat sessions (Behn, 2014).

Collaborative performance summits are here framed as part of an action-oriented approach to research (Dekker et al., 2020). Researchers can actively propose, support, or even host a collaborative performance summit within the governance arrangements they study. Being closely involved in the preparation, conduct, and follow-up of a summit generates valuable data. This data takes the shape of access to the information participants hold about the functioning of their governance arrangement (e.g. data from the police about general trends in domestic violence reports, and from welfare agencies about trends in mental health problems), as well as observations of how actors interpret and discuss these measures. Moreover, as the participants derive practical value from the summit, they may be actively willing to participate in the exercise, sparing the researchers much work in trying to convince actors to share their data.

However, an active role of the researcher in the collaborative performance summit is also likely to influence the nature, substance, and outcomes of the discussion. Even if a researcher were merely to observe a summit as a fly-on-the-wall, actors may still feel compelled to change their tone and messaging (Dekker et al., 2020). This chapter therefore also frames collaborative performance summits as social, even political, interactions between the actors involved in a governance arrangement, where the participation of researchers in this process will alter the social dynamics. Using collaborative performance summits as part of the research process requires careful planning and active consultation with all the participants to still generate valid insights and comply with ethical standards. However, this hard work will generate a bounty of insights into different governance measures and how these measures are perceived by the actors involved.

PURPOSES OF METHOD: FACILITATING LEARNING, ACCOUNTABILITY, AND RELATIONSHIP-BUILDING WHILE COLLECTING RESEARCH DATA

To the actors in a governance arrangement, a collaborative performance summit can serve multiple purposes, ranging from collective learning and mutual accountability to relationship-building (Douglas et al., 2021). To the researchers involved in collaborative performance summits, the meeting serves the purpose of gaining access to the data that partners bring to the table (after acquiring the appropriate consent) and observing the dynamics of the discussion (while accounting for their own influence on this discussion) (Douglas and Ansell, 2023). However, it is important for actors and researchers alike to appreciate that the multiple purposes of a summit often play out all at once, either implicitly or explicitly, and this presents participants and researchers with tensions in the conduct of the meeting (Douglas and Ansell, 2021).

Collaborative Performance Summits as Instruments for Learning

Collaborative performance summits can firstly serve the purpose of collaborative learning (Heikkila and Gerlak, 2013). An effective summit would enable the actors participating in a governance arrangement to better understand what they hope to achieve, to get an overview of what all the actors have been doing, assess how much progress all this work has delivered, and identify what steps to take next. Or, correspondingly, summits serve their learning purpose when researchers can collect data about what the actors hope to achieve, what they have been doing so far, how the group assesses its performance, and what steps they hope to take next.

This learning purpose can be defined in a rather narrow, static sense, or in a more expansive, dynamic sense. In a narrow, static sense, a summit would serve its purpose as a learning tool if the actors participating in a forum walk away with a shared and precise understanding of their goals, a shared, comprehensive, and factually correct understanding of what has been done so far, a shared and honest assessment of their progress, and shared and logical conclusions about who should take what actions next. Researchers could collate this data in neat lists and tables, creating an overview for themselves and the participants.

In a more dynamic sense, a routine would serve its purpose as a learning tool if the actors walk away with a greater understanding of the different ambitions of the different partners and the overlaps or contradictions between them (Bryson et al., 2016). It would also be served if the actors have jointly made greater sense of their current activities and progress, which includes an appreciation of how the different partners within the arrangement may view the current state of affairs differently (Weick, 1995). And finally, an effective learning routine in this perspective would conclude with a tentative agreement about what things to try next and when to reconvene to jointly interpret the impact of these next steps. Researchers would then seek to trace and capture the different opinions within the group and the evolution of the various perspectives over time.

Collaborative Performance Summits as Instruments for Accountability

Collaborative performance summits can, whether by design or in practice, serve the purpose of organizing the accountability between actors for their work in a governance arrangement (Klijn and Koppenjan, 2014). From a narrow, static perspective, dialogue routines can be seen as a forum in which the principal holds its agent to account, asking them to explain what they have been doing and what this has achieved. Researchers could then trace the patterns of these conversations (who is holding whom to account), the information that is provided, and the judgements that are passed.

A more dynamic view on accountability would argue that the nature of the relationships between the many public organizations, private actors, and community groups involved in a summit can rarely be boiled down to a simple principal-agent relationship. Governance actors rarely have full formal power over each other, even if they rely on each other for doing their own work (Moynihan et al., 2011). Summits would then be about organizing mutual accountability between partners and collective accountability of the governance arrangement as a whole towards the external environment (Sørensen and Torfing, 2009). In this perspective, actors can challenge each other on the extent to which each has fulfilled their role and get their story to the outside world straight. Researchers can then learn from the discussion how responsibility (and blame) is shared across the group and what story the collective projects to the outside world.

Collaborative Performance Summits as Instruments for Relationship-building

Finally, summits can serve the purpose of relationship-building between actors (Ansell and Gash, 2008). In a narrow, static sense, this purpose is mainly about coordination, making sure that actors know who is involved in the arrangement, what each actor is capable of, and who is doing what.
Researchers can glean from these discussions how actors are (or are not) connected and how tasks are distributed amongst the actors, complementing the information they may have already collected through a Social Network Analysis or by reading the work plans of the governance arrangements.

In a more dynamic perspective, summits serve relationship-building proper: they provide actors with an opportunity to come to know, and hopefully trust, each other, and to reflect on the structure and culture that characterizes the governance arrangement (see Ostrom as discussed in McGinnis, 2011). From such discussions, researchers can aim to learn the nature and depth of the relationships between the actors, although these are often not fully revealed in a collective meeting, and how actors view the overall structure and culture of the arrangement.
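As a minimal illustration of how the Social Network Analysis mentioned above could complement summit observations, the following Python sketch (using the networkx library) computes degree centrality from survey-reported ties; the actor names and the survey question are invented for the example and do not come from any of the studies cited here.

```python
import networkx as nx

# Hypothetical answers to a pre-summit survey question:
# "Which other participants have you worked with in the past year?"
reported_ties = [
    ("Police", "Municipality"),
    ("Municipality", "Housing association"),
    ("Municipality", "Welfare agency"),
    ("Welfare agency", "Community group"),
]

G = nx.Graph()
G.add_edges_from(reported_ties)

# Degree centrality flags which actors tie the arrangement together
# and which sit at its margins.
for actor, centrality in sorted(nx.degree_centrality(G).items(),
                                key=lambda kv: -kv[1]):
    print(f"{actor:20s} {centrality:.2f}")

# Actors named by others but reporting no ties themselves, or missing
# from the survey entirely, would appear as low-centrality nodes or
# isolates, which can inform whom to invite to the next summit.
```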

COMPETING PURPOSES?

In theory, a summit can serve multiple purposes at the same time, bolstering learning, accountability, and relationship-building. In practice, all of these purposes may be in play, but there are tensions between them. Learning and accountability, for example, have been shown to crowd each other out (Van Dooren and Hoffmann, 2018). For a successful learning process, organizations need to show their doubts and vulnerability, while an accountability process may cause participants to clam up and defend their achievements. Similarly, it may be difficult to build trust between actors at the same time as holding each other accountable. The tensions between these purposes will determine what actors reveal or obscure, and subsequently what researchers can or cannot learn from observing summits.

One strategy for dealing with the competing nature of learning, accountability, and relationship-building is to strictly design summits for one purpose only (Behn, 2014). The organizers would then clearly communicate what the specific purpose of a summit is supposed to be and strictly police anyone who tries to change the nature of the conversation. However, preventing the purposes from crossing into other areas may not be achievable in practice (Douglas and Ansell, 2021). Organizers of a summit may say that the goal is purely to have a learning process, but at the same time the participants are likely to include actors who are in a direct principal-agent relationship with each other (think governments and the community organizations they subsidize). Furthermore, even if learning is the purpose of the discussion at the beginning, seasoned operators will know that accountability will inevitably come later anyway, and may anticipate this by emphasizing or withholding information during the summit.

A more pragmatic approach for actors and researchers involved in summits may be to acknowledge the enmeshed nature of collaborative governance, where learning, accountability, and relationship-building are fundamentally intertwined (Douglas et al., 2021). A more dynamic approach would be to create dialogue routines not in isolation, but in connection to other routines and recurring meetings, where different meetings can have a slightly different purpose, and actors can reconvene to reflect on different aspects. Similarly, researchers would accompany their data collection at the summit itself with pre- and post-summit interviews with the individual participants (e.g. Douglas and Ansell, 2023).


THE TECHNICAL CHARACTERISTICS OF COLLABORATIVE PERFORMANCE SUMMITS AS A METHOD

This method is depicted in summary form in Figure 14.1 and each component is described in the following paragraphs.

Figure 14.1  The roles of researchers during the various stages of a summit

STARTING FROM THE CONTEXT

To fully understand the technical characteristics of collaborative performance summits, it is necessary for both actors and researchers to first consider the context in which these dialogues take place. The wider institutional context or regime sets certain boundaries and expectations on what is imaginable and appropriate during a summit discussion (Douglas and Ansell, 2023; Emerson et al., 2012). Actors are not free, or without prejudice, when it comes to deciding who gets to participate in the dialogue or what goals and measures should be discussed. The wider institutional context of the governance arrangements shapes who is recognized as a relevant participant for a summit, what information is considered relevant or valid, and even who gets to initiate the organization of a summit. Actors and researchers looking to learn from summits should appreciate what barriers there are.

Moreover, a collaborative performance summit is rarely the only opportunity actors have to coordinate their actions and process information (Douglas and Ansell, 2023). Other routines such as joint budget reviews, operational troubleshooting sessions, or annual reporting cycles also serve as opportunities for the actors to explicate the goals, exchange information, examine progress, and explore future actions. These routines shape what mechanisms will feed into the dialogue routine and which other routines the dialogue could feed into in turn. Again, researchers observing summits would do well to seek data about what is happening within these other routines.

DIVING IN: CREATING AND RESEARCHING THE DIFFERENT STEPS OF A COLLABORATIVE PERFORMANCE SUMMIT

Selecting Actors

The first step in conducting a collaborative performance summit is the selection of participants; a crucial step for both the practitioners and the researchers involved. The process of selecting actors to participate in the summit is neither technocratic nor innocent. In typical governance arrangements there is an essential ambiguity in the demarcation of who is or is not involved (Moynihan et al., 2011). This means that deciding who sits at the table requires a judgement call. Moreover, who sits at the table matters for the nature of the discussion and its potential outcomes. Different partners will bring different measures to the table, have different interpretations of these measures, and have different types of relationships with each other (Douglas and Ansell, 2021).

In essence, selecting the participants of a dialogue is a political act. It entails reading the authorizing environment of a governance arrangement, both the formal and informal actors in play, and then calling them forward to reflect on the activities of the governance arrangement (Moore, 1995). Given the political nature of this act, it is important for actors to ensure proper democratic oversight of the invitation process. For researchers, it is, at a minimum, of scholarly interest to trace which actors get invited, which actors get excluded, and which actors get forgotten (Douglas and Ansell, 2023). Researchers taking a more action-oriented approach may even opt to actively suggest actors to include, to ensure the summit covers all perspectives relevant to their research interests (Douglas et al., 2021). For example, researchers interested in the functioning of collaborations seeking to promote literacy may promote the inclusion in the summit of citizens who have struggled with literacy themselves, to ensure their perspective is included. However, as noted, this would be a very consequential intervention requiring explicit approval from the other participants.

Setting the Agenda

Next to determining which actors will participate in the dialogue is the question of what is to be discussed. Picking the participants and setting the agenda are processes that influence each other, as what you want to discuss informs who you want there, and who is there will influence what will be discussed. Collaborative performance summits are typically designed to cover four topics: explication of the goals, exchange of information, examination of the progress, and exploration of the next steps (Douglas and Ansell, 2023).

In a narrow, static view on governance, some of these agenda items seem to require only limited attention from both the participants and the researchers. Why would it be necessary for a long-running governance arrangement to revisit its goals? This is especially true if these are clearly and formally laid down in a charter or covenant underpinning the governance arrangement, alongside clear indicators for these goals. And what need is there to have a lengthy discussion about what progress has been made if such an assessment would flow logically from the gap between the goals set and the data about what has been done? In this perspective, the focus of both actors and researchers would be to quickly collect the information, affirm that people agree with the assessment generated by the data, and then quickly move on to exploring next steps.

In a more dynamic view on summits, all of the items are on the menu, whether the organizers or researchers want to discuss them or not. When handling complex issues where much is unknown, the goals of the governance arrangement are always up for discussion (Head and Alford, 2015). And creating an overview of what is happening on the ground may actually lead to a reconsideration of what should be achieved on the whole. For example, if a literacy drive finds that the highest uptake of the programme is among mothers, the conclusion may be that the goals and resources of the programme should be redirected towards reaching more women. And the preferences and priorities of actors may also shift over time, autonomously of what is being achieved (Vangen and Huxham, 2013). An effective but flexible agenda helps actors to explore and home in on the issues that matter most, while researchers gain fascinating insights on what gets attention and what does not.

Preparing Information

The next aspect of the summit to consider is what information should be on the table and how it is presented. In a narrow sense, it may seem obvious to merely present the quantitative updates on the goals agreed upfront, but more effective information preparation would consist of a mix of 'objective' data and statistics, 'subjective' experiences of the various actors involved, and 'expert' opinions from researchers relevant to the challenge at hand (James et al., 2020). Moreover, effective information preparation would also invest considerable time in finding the best form to present the information in, using visual aids, vignettes, and in-person accounts to help the participants get an overview of the complex processes at play.

Governance actors but also researchers can take an active role in collecting and presenting the information for the summit. For researchers, this creates an excellent opportunity to collect data from the participants, as they have a clear understanding of how their data will be used in the upcoming summit. Douglas and Ansell (2023) describe, for example, how they supported multiple summits between actors working together on reducing illiteracy, sending out surveys to the participants to collect their views on the state of the collaboration before the summit, asking participants to rate the quality of the discussion at the end of the summit, and then returning to the participants a few months after the summit to check back on whether the collaboration had improved. Each of these three waves of data gathering contained information immediately relevant to the actors and was presented to them, helping them to steer their collaboration. At the same time, the data provided the researchers with a rich and longitudinal account of how the performance of each collaboration was viewed by the different participants and evolved over time.
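To give a concrete, if simplified, impression of how such three-wave data might be organized for analysis, the Python sketch below joins a pre-summit survey, an end-of-summit rating, and a follow-up survey; the column names, scales and values are invented for illustration and do not reproduce the actual instruments of Douglas and Ansell (2023).

```python
import pandas as pd

# Wave 1: pre-summit survey on the state of the collaboration (1-10 scale)
pre = pd.DataFrame({"participant": ["p1", "p2", "p3"],
                    "collab_rating_pre": [5, 4, 6]})

# Wave 2: quality of the discussion, rated at the end of the summit
post = pd.DataFrame({"participant": ["p1", "p2", "p3"],
                     "discussion_quality": [8, 7, 9]})

# Wave 3: follow-up a few months later
followup = pd.DataFrame({"participant": ["p1", "p2", "p3"],
                         "collab_rating_follow": [7, 4, 8]})

waves = pre.merge(post, on="participant").merge(followup, on="participant")
waves["change"] = waves["collab_rating_follow"] - waves["collab_rating_pre"]

print(waves)
print("Mean change in perceived collaboration:", waves["change"].mean())
```

Repeating this structure across several summits would yield the kind of longitudinal, per-participant account of perceived collaborative performance described in the text.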

BACK TO THE CONTEXT

Zooming out from the content of the summit itself, it is important for both actors and researchers to again consider the wider context after the summit and actively trace what happens to the governance arrangement after the meeting. Firstly, the characteristic of an impactful dialogue is that the findings, insights, and potential action points from the meeting are transmitted to the relevant other routines and forums in the wider governance arrangements. For example, if specific operational bottlenecks were identified in the course of the discussion, these need to be communicated to the line-managers of the various organizations involved so the problems can be resolved.

Moreover, impactful dialogues are often not one-off, standalone events but part of a recurring and continuing cycle of learning (Heikkila and Gerlak, 2013). In a narrow, static perspective, this recurrence would have to be regular, with a uniform set of participants, and returning to the same set of measures in order to come to a reliable pattern of planning, doing, checking, and acting (see the Deming approach to quality management). In a more expansive, dynamic perspective, the recurrence would not have to be as regimented, as the precise timing, composition, and subject matter of the next dialogue meeting is shaped by the unfolding insights and developments (Dekker et al., 2020; Head and Alford, 2015). Actors, and action-oriented researchers, can actively seek to maintain the momentum for learning and relationship-building by initiating recurring summits.

Finally, just as the institutional context previously shaped a summit, a summit can help to shape the context in turn. The institutional context of the summit shaped who was involved in the discussion and what was discussed. In turn, an effective dialogue can serve to impact the wider structure or regime (see the discussion of Bryson et al., 2020 on structuration). For example, a joint meeting might conclude that specific key organizations should be involved in the collaboration, such as citizens groups or partners from the private sector, which can lead to these actors formally joining the wider governance arrangement. Researchers taking a sociological institutionalist perspective could trace how the wider structure and specific summit practices influence each other over time.

USES OF COLLABORATIVE PERFORMANCE SUMMITS IN EMPIRICAL STUDIES FOR MEASURING GOVERNANCE

Use of Collaborative Performance Summits in Different Studies

Collaborative performance summits have existed in practice for a long time, and have similarly been used within research in various guises. Within public administration research, dialogue routines have been described and used under different labels, such as forums, arenas, and courts (Bryson and Crosby, 1993), PerformanceStat meetings (Behn, 2014), performance dialogues (Laihonen and Mäntylä, 2017), public value tables (Douglas et al., 2021), and collaborative performance summits (Douglas and Ansell, 2021). More substantively, they have been used by scholars to examine the creation of public value (Douglas et al., 2020) or the productivity of interagency action (Behn, 2014).

Collaborative performance summits also have deep roots in other disciplines, where they feature as both objects of study and instruments for studying objects. For example, in the literature on urban planning, multiple authors describe how planning agencies use interactive dialogues with partners and citizens to assess their plans and realizations, but scholars also organize dialogue routine sessions themselves to collect assessments of urban projects (Innes and Booher, 2010). Similarly, dual uses of collaborative performance summits can be observed in the literature on crisis management, environmental management, evaluation studies, and healthcare initiatives (see Douglas and Ansell, 2023).


PATTERNS EMERGING FROM APPLICATION OF COLLABORATIVE PERFORMANCE SUMMITS

Complex But Not Random Processes

When Moynihan (2006) described learning forums within organizations, he concluded that they were highly unpredictable. In his view, the links that the actors in the room make between the measures they get and the decisions they make are difficult to reconstruct, hard to follow, and impossible to predict. Empirical studies seem to confirm that this random coupling between data and conclusions also occurs, if not to an even greater extent, in interorganizational learning forums. For example, Douglas and Ansell (2023) report in their observation of 18 summits that the issues participants indicated they wanted to discuss before a summit bore little relationship to the issues actors ended up discussing in the end. Similarly, the pre-stated purpose of the summit, be it learning, accountability, or relationship-building, is often observed to morph as the summit date draws closer, or even to alter during the course of the summit itself (Douglas and Schiffelers, 2021).

These patterns can be seen as evidence for the utter randomness of summit processes, limiting their usefulness to both actors and researchers. However, another perspective would be that especially in governance arrangements – which are about complex interdependencies both in the nature of the societal problems they address and the constellations of actors they involve – considering and reconsidering things may be a necessary and even beneficial part of effective governing (Douglas and Ansell, 2023). A shift in goals necessitates thinking about what actors are needed or not needed to achieve these new ambitions. And vice versa, bringing new actors onboard will mean that these newcomers bring their particular priorities to the table. Problems arising at the very operational level – that is, the inability of two partners to share basic case data – may require action or sanction at very strategic levels of governance (Douglas and Ansell, 2023).

Effective routines may therefore not distinguish themselves by their ability to slice and dice the data, or arrive at decisions through neat and regimented processes. This would be expected from summits in a very rationalistic, static view on governance, but in a more dynamic view on governance, the ability to simultaneously embrace complexity and still take action might be more appropriate (see Noordegraaf et al., 2019). Similarly, the value of summits to researchers lies not in the absolute clarity they provide about the performance of governance arrangements, but in the insight they offer into the strategies for sense-making that actors employ.
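One simple, purely hypothetical way in which a researcher could quantify the mismatch between pre-stated and actually discussed issues reported above is a set-overlap measure such as the Jaccard index; the issue lists below are invented for illustration and are not drawn from the cited studies.

```python
def jaccard(a: set, b: set) -> float:
    """Share of issues appearing on both lists (0 = no overlap, 1 = identical)."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical issue lists for one summit
announced = {"funding", "data sharing", "volunteer recruitment"}
discussed = {"data sharing", "staff turnover", "media coverage"}

print(f"Overlap between announced and discussed issues: {jaccard(announced, discussed):.2f}")
# 0.20 with these toy values
```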

COMPETENCES VERSUS CONFIDENCE

It is important for both researchers and actors to understand who speaks up at summits and what motivates their judgements. Based on experiments with individuals, the psychologists Dunning and Kruger exposed a remarkable relationship between levels of competence and confidence. Individuals with a very low level of competence (e.g. in their ability to read budgets) were generally more confident of their abilities than people with a more advanced level of training and competence (e.g. people with basic accounting training) (Kruger and Dunning, 1999). Confidence in skills tends to go down as people find out more about the subject matter, and only starts to rise again as people approach expert levels of competence.

The Dunning-Kruger effect means that two types of people are likely to express their views with great confidence during summits: those with expert-level skills and those with very little skill at all. The ones in the middle, who are becoming aware of how much they do not know yet as their knowledge expands, might be more hesitant to speak up. Governance arrangements – as opposed to more straightforward principal-agent constructions – typically emerge around complex societal problems where much is unknown. This would mean that in a typical summit, most people know little, and their confidence actually goes down as they learn more about the topic through a rich discussion. Douglas and Schiffelers (2021) observed this pattern in action, as they noted that low-performing literacy networks tended to take more rash and bold actions than high-performing networks, whose meetings generated many insights. This means that many summits might end with the actors leaving more confused and the researcher leaving with few clear insights. However, both actors and researchers should consider whether such despondency is down to the actors actually growing in competence, and vice versa, whether perceived clarity might actually be the product of a lack of real understanding.

A Clash between Representative and Participative Democracy?

The experiences with summits brought to the fore a mismatch, if not open conflict, between participatory, networked democracy and the more traditional, representative democracy routines (Klijn and Koppenjan, 2014; Sørensen and Torfing, 2009). This mismatch is firstly structural in nature. The different actors that would have to come together to review the measures of a governance arrangement do not neatly fit under the purview of any one representative, democratic body. Moreover, there is an ethical dimension to this mismatch, as who is invited to the summit does matter, but the organizing actors and researchers have little standing when it comes to doing such political work.

However, both the actors and researchers organizing summits can take action to maximize the democratic and ethical nature of the summit process. Firstly, democratic representatives can be actively involved in the routine, merging the participative and representative democratic elements. For example, Douglas et al. (2021) describe how local legislators participated in dialogue routines with doctors, teachers, and parents about healthcare. Secondly, the legislators can take a more active role in setting the parameters for the participative process, actively conducting metagovernance (Sørensen and Torfing, 2009) to make sure the dialogues are inclusive, transparent and well run, and that potential decisions made reflect the mandate of the governance arrangement. Thirdly, researchers should actively apply the standards of ethical research (seeking informed consent, ensuring data is not privacy sensitive, etc.), especially when actively participating in the preparation and conduct of the summit.

CONCLUSION: STATUS TODAY AND THE PROSPECTS OF USING SUMMITS IN MEASURING GOVERNANCE

Public, private, and community organizations are increasingly required to work together in complex governance arrangements to address complex societal issues. The actors working in these arrangements need tools to measure their progress and, above all, routines for jointly making sense of these measures. Similarly, researchers studying complex governance arrangements need tools to make sense of the often contradictory, incomplete data that is available about the functioning of governance arrangements. These tools should not ignore or stifle the dynamic nature of governance arrangements, which emerged in response to the dynamic nature of the societal problems these arrangements have to address. This means that governance arrangements need tools to help actors and researchers make sense of measures and act on them, but that these tools need to be just as dynamic as the governance arrangements they are meant to support.

Considering the state-of-the-art in the development of dialogue routines in the shape of collaborative performance summits, progress has been made in taking lessons from performance measurement and management in organizations. The literature on performance measurement and management found that performance data is frequently not used by decision-makers. This chapter argues that dialogue routines can overcome this deficiency, ensuring that actors jointly understand and use governance measures, and that researchers can use these summits to observe governance arrangements while helping them at the same time.

However, this review also signals multiple challenges in the application of collaborative performance summits. A first problem is that different purposes can be at play in the same summit. Effective actors and researchers do not ignore this complexity, but carefully study the institutional context of a meeting and trace the dynamics between learning, accountability, and relationship-building during the summit itself. They do not seek to fully tame all the aspects of the meeting and pin everything down, as the dynamic nature of the discussion is an integral part of the process of sense-making and important to study in itself.

A second problem is that crafting the meeting itself is a highly political activity. Effective actors and researchers recognize that the selection of participants is highly consequential and consciously seek broader democratic support and ethical compliance when approaching actors. Moreover, such actors and researchers recognize that the actual agenda of a summit may change during the summit itself, and that what gets attention and what does not is instructive in itself. Similarly, collating and shaping the information is considered a highly important task, just as reconnecting the summit to its context after the meeting is concluded cannot be a mere afterthought.

On the whole, collaborative performance summits provide actors and researchers with a unique opportunity to work together. Both actors and researchers seek to better understand the quality of governance arrangements; summits provide them both with an instrument for sense-making and for observing that sense-making in action.

REFERENCES

Ansell, C., & Gash, A. (2008). Collaborative governance in theory and practice. Journal of Public Administration Research and Theory, 18(4), 543–71.
Behn, R.D. (2014). The PerformanceStat potential: A leadership strategy for producing results. Brookings Institution Press.
Bryson, J.M., & Crosby, B.C. (1993). Policy planning and the design and use of forums, arenas, and courts. Environment and Planning B: Planning and Design, 20(2), 175–94.
Bryson, J.M., Ackermann, F., & Eden, C. (2016). Discovering collaborative advantage: The contributions of goal categories and visual strategy mapping. Public Administration Review, 76(6), 912–25.

Using collaborative performance summits  227 Bryson, J.M., Crosby, B.C., & Seo, D. (2020). Using a design approach to create collaborative governance. Policy & Politics, 48(1), 167–89. Dekker, R., Contreras, F.J., & Meijer, A. (2020). The living lab as a methodology for public administration research: A systematic literature review of its applications in the social sciences. International Journal of Public Administration, 43(14), 1207–17. Douglas, S., & Ansell, C. (2021). Getting a grip on the performance of collaborations: Examining collaborative performance regimes and collaborative performance summits. Public Administration Review, 81(5), 951–61. Douglas, S., & Ansell, C. (2023). To the summit and beyond: Tracing the process and impact of collaborative performance summits. Public Administration Review. doi​.org/​10​.1111/​puar​.13598. Douglas, S., & Schiffelers, M.J. (2021). Unpredictable cocktails or recurring recipes? Identifying the patterns that shape collaborative performance summits. Public Management Review, 23(11), 1705–23. Douglas, S., van de Noort, M., & Noordegraaf, M. (2021). Prop masters or puppeteers? The role of public servants in staging a public value review. In The Palgrave handbook of the public servant (pp. 277–88). Springer International Publishing. Emerson, K., Nabatchi, T., & Balogh, S. (2012). An integrative framework for collaborative governance. Journal of Public Administration Research and Theory, 22(1), 1–29. Head, B.W., & Alford, J. (2015). Wicked problems: Implications for public policy and management. Administration & Society, 47(6), 711–39. Heikkila, T., & Gerlak, A.K. (2013). Building a conceptual approach to collective learning: Lessons for public policy scholars. Policy Studies Journal, 41(3), 484–512. Innes, Judith E., & Booher, D.E. (2010). Planning with complexity: An introduction to collaborative rationality for public policy. Routledge. James, O., Olsen, A.L., Moynihan, D.P., & Van Ryzin, G.G. (2020). Behavioral public performance: How people make sense of government metrics. Cambridge University Press. Klijn, E.H., & Koppenjan, J.F.M.S. (2014). Accountable networks. In The Oxford Handbook of Public Accountability (pp. 242–57). Oxford University Press. Kruger, J., & Dunning, D. (1999). Unskilled and unaware of it: How difficulties in recognizing one’s own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology, 77(6), 1121. Laihonen, H., & Mäntylä, S. (2017). Principles of performance dialogue in public administration. International Journal of Public Sector Management, 30(5), 414–28. McGinnis, Michael D. (2011). An introduction to IAD and the language of the Ostrom workshop: A simple guide to a complex framework. Policy Studies Journal, 39(1), 169–83. Moore, M.H. (1995). Creating public value: Strategic management in government. Harvard University Press. Moynihan, D.P. (2006). What do we talk about when we talk about performance? Dialogue theory and performance budgeting. Journal of Public Administration Research and Theory, 16(2), 151–68. Moynihan, D.P. (2010). From performance management to democratic performance governance. In R. O’Leary, D.M. Van Slyke, & S. Kim (Eds.), The Future of Public Administration around the World, The Minnowbrook Perspective (pp. 21–31). Georgetown University Press. Moynihan, D.P., Fernandez, S., Kim, S., LeRoux, K.M., Piotrowski, S.J., Wright, B.E., & Yang, K. (2011). Performance regimes amidst governance complexity. 
Journal of Public Administration Research and Theory, 21(suppl_1), i141–i155. Noordegraaf, M., Douglas, S., Geuijen, K., & Van Der Steen, M. (2019). Weaknesses of wickedness: A critical perspective on wickedness theory. Policy and Society, 38(2), 278–97. Sørensen, E., & Torfing, J. (2009). Making governance networks effective and democratic through metagovernance. Public Administration, 87(2), 234–58. Van Dooren, W., & Hoffmann, C. (2018). Performance management in Europe: An idea whose time has come and gone? In The Palgrave handbook of public administration and management in Europe (pp. 207–25). Palgrave Macmillan. Vangen, S., & Huxham, C. (2013). Building and using the theory of collaborative advantage. In R. Keast, M.P. Mandell, & R. Agranoff (Eds.), Network theory in the public sector: Building new theoretical frameworks (pp. 51–69). Routledge. Weick, K.E. (1995). Sensemaking in organizations (3rd ed.). Sage.

PART IV FIELDS OF MEASURING GOVERNANCE

15. Measuring active labour market policies

Niklas Andreas Andersen, Flemming Larsen and Dorte Caswell

INTRODUCTION

Labour market policies – or active labour market policies (ALMPs) as they came to be known by the end of the 20th century – have always been one of the most closely monitored and measured policy areas. Some of the first and most important measures of the many national statistics offices that emerged in the late 19th and early 20th centuries concerned the work status of the population (Desrosières, 1998). At first, the numbers of people in work and out of work were primarily seen as indicators of the state of the overall economy – indicators that were of utmost importance for calculating expected state revenues and costs, but which were themselves largely seen as outside the power of states to influence directly through specific policies.

However, as the depth and scope of welfare states expanded in the post-war period, so did the specific services and benefits targeted at the unemployed. Such policies are – in the words of Evelyn Brodkin – boundary-setting in their nature (Brodkin, 2021), as they draw the line between welfare and work. That is, they seek to move people from welfare to work – whether through help, coercion or simply by redrawing the boundaries of who is considered eligible for and deserving of welfare benefits (and on what conditions) and who must fend for themselves on the labour market. With the expansion of ALMPs in the late 20th and early 21st centuries (Lødemel & Trickey, 2001; Lødemel & Moreira, 2014), the level of unemployment came to be seen as directly within the purview of specific policies to tackle, rather than being a mere indicator of larger economic forces.

This active turn in the policies targeting the unemployed is grounded in the assumption that the number of benefit claimants does not merely rise and fall in tandem with economic cycles. Proponents of ALMPs argue, instead, that the number of people receiving benefits also reflects the motivations and/or competences among the unemployed for taking a job. While ALMPs may not be able to directly influence the greater economy (at least in the short run), they can influence the motivation and competences of the unemployed. Detailed measurements of, for example, the number and characteristics of claimants, the duration of benefit receipt, and participation in different activation schemes have thus taken centre stage as pivotal indicators of the success or failure of these policies.

In this chapter, we first identify the most important methods and measures of performance within the area of ALMPs and show how these measures have been shaped by policymakers' continual drawing and redrawing of the boundaries between welfare and work. We then discuss the most important political and administrative consequences of these developments, before finally ending the chapter by highlighting the main challenges currently facing the performance measurement of ALMPs.



MEASURING ALMPS – THE INDICATORS AND THEIR EVOLUTION OVER TIME

To understand how the success and failure of ALMPs – and the employment services responsible for implementing these policies – are predominantly measured, it is necessary to also understand their political and economic context. The post-World War II decades were marked by a parallel rise in the workforce (due to both growing populations and the expansion of female labour market participation) and in tax rates, which helped finance the expansion of welfare services and benefits across most of the OECD countries. But since the 1980s, we have witnessed an almost complete reversal of this trend, as both the size of populations (and thus the potential workforce) and tax rates have fallen and/or stagnated across most OECD countries (Piketty, 2014). Consequently, increasing the employment rate has today become the main (if not only) source of increasing state revenue and, by extension, of maintaining welfare services and benefits. ALMPs are thus naturally highly politically salient and contested – not least from the perspective of the national economy. As we will elucidate in the following, this political salience and contestation also extends to the question of how to measure both the outcomes (i.e., the results) and the outputs (i.e., the actions leading to those results) of ALMPs.

Measuring Outcomes – Increasing Incentives or Competences

Translated into the vocabulary of performance measurement, the economic importance of increasing the employment rate would naturally make the level of employment the key performance indicator (KPI) within the area of ALMPs. However, while this KPI is of course highly relevant and closely monitored, two other KPIs have historically been more dominant indicators of the success or failure of ALMPs: (1) the number of benefit recipients; and, closely related to this, (2) the duration of time spent on benefits.

On the face of it, there may not seem to be that much of a difference between these two indicators and the overall employment level. After all, you would expect an almost perfect inverse relation between overall employment levels and the number and duration of people receiving benefits – meaning that when the first goes up, the others go down. However, this is not necessarily the case, as there are many unemployed people who are not receiving any benefits – whether because of an explicit choice on their own part, because they are not deemed eligible or because they lack the necessary knowledge about their benefit entitlements. Nonetheless, the number of benefit recipients is typically used as a proxy for the level of unemployment when measuring the performance of ALMPs. From a purely pragmatic perspective, most countries have much more reliable and detailed statistics regarding benefit recipients than they have on the employment history of the general population. The level of benefit recipients and the duration of benefit periods are thus simply easier to monitor than the actual level of unemployment.

However, the centrality of these KPIs in the field of ALMPs is not merely – or mainly – a purely pragmatic choice. It is also closely related to broader developments in the policies and governance of welfare states in the last three to four decades. At the core of all the different iterations of ALMPs that have proliferated across many countries since the 1990s is a general shift from the demand to the supply side of the labour market.
By activating benefit recipients, ALMPs seek to enhance both the quantity and the quality of the labour supply, while generally shying away from more Keynesian-inspired tools of enhancing the demand for labour. In this regard, it seems quite natural to measure ALMPs on their influence on the level and duration of people on benefits rather than their influence on the level of employment – simply because benefit recipients are the main point of intervention of these policies. The KPIs thus mirror the goals and normative commitments already stipulated in the ALMPs enacted by elected politicians.

However, beneath this consensus about supply-side policies, the choice of how to stimulate the supply of labour has been much more politically contested. The main dividing line is whether the problem of unemployment is viewed as primarily caused by a lack of incentives or a lack of competences to take a job (Bonoli, 2010; Lindsay et al., 2007; Peck & Theodore, 2001). The former problem definition calls for a work-first approach (Bruttel & Sol, 2006) – sometimes also termed defensive workfare (Torfing, 1999) or negative activation (Taylor-Gooby, 2004) – which applies instruments such as reducing benefit levels and durations, increasing the use of sanctions, increasing demands on job-search activity, and enhancing the obligations of the unemployed to participate in work-for-benefit schemes. The latter problem definition instead calls for a human capital approach (Lindsay, 2007) – sometimes also conceptualized as human resource development (Lødemel & Moreira, 2014) or a learnfare approach (Jørgensen, 2009) – where the basic instruments are geared towards bettering both the social skills and the formal qualifications of the unemployed, for example, by using vocational training and education.

The choice between these two overarching approaches remains highly contested. Therefore, actual ALMPs often contain a complex mix of the two, which is not easily reconcilable at the level of implementation. In this ambiguous reality of diverging policy goals and normative commitments, the operationalization of KPIs then becomes a way of choosing the central goals and commitments within a given policy field, that is, a way of de facto doing politics (Bjørnholt & Larsen, 2014). This is also true of the KPIs of reducing the number of benefit recipients and the time they spend on benefits. Numerous studies have shown how the use of these indicators generally incentivizes both policymakers and implementing agencies to pursue a work-first rather than a human capital approach (Brodkin, 2011; Fording et al., 2009; Larsen, 2013). As the human capital approach seeks to upskill benefit recipients to enable them to qualify for available jobs, it hinges on the intermediary factor of available jobs and can thus only indirectly influence the goal of reducing benefit recipients. Even though upskilling may give claimants more sustainable jobs in a longer time perspective, it has the further disadvantage that achieving the goal of reducing benefit recipients takes more time – thus impeding the KPI of lowering the duration of time spent on benefits. Contrary to this, the work-first approach seeks to motivate recipients to get off their benefit as quickly as possible. Thereby it favours tools that directly influence these KPIs – such as stricter eligibility criteria, sanctions and work-for-benefit activation. The logical choice of the actors (be they policymakers, implementing agents etc.)
being held accountable to the two-outcome KPIs is therefore to choose the policies and programmes that most directly and quickly influence that which they are held accountable to. A highly illustrative example of this can be found in Evelyn Brodkin’s seminal work on the Temporary Assistance for Needy Families (TANF) reform enacted in the US in 1996 (Brodkin, 2011). Brodkin shows that while the reform was infused with diverging logics and goals – mirroring work-first, human capital and traditional public benefit approaches – the performance regime that was instated to measure the implementation of the reform mainly focused on the goal of reducing the number of benefit recipients. The primacy of this KPI created a ‘choice-calculus’ that effectively disincentivized implementing organizations to respond to the needs of their clients and instead incentivized them to do everything in their

232  Handbook on measuring governance power to limit the availability and responsiveness of their services and benefits (Brodkin, 2011). Measuring Outputs – from Triple Activation to Triple Accountability The TANF reform is admittedly an extreme case as it shows the consequences of measuring and incentivizing implementing organizations using the sole KPI of reducing the number of benefit recipients. The implementing organizations thus enjoyed an unusual amount of discretion, which they could then use rather creatively to achieve this KPI – albeit to detrimental effects for the clients. Generally, the dominant outcome goals of reducing the number of people and time spent on benefits have also been supplemented by a host of different detailed output goals concerning the measures taken to achieve these outcomes. Typically, these output measures have been closely related to the goal of activating the unemployed. As also implied in the term, the introduction of ALMPs is often seen as a pivotal example of a general shift in the perception of the rights and duties of citizens in the welfare states of the late 20th century. Today the idea of citizens passively receiving benefits when unable to support themselves is deemed passé. Instead, citizens must actively seek to better their situation to be eligible for benefits, while welfare states are to enable this new active citizenship through the provision of so-called activation schemes. The active citizen has thus become both the end-goal of ALMPs – in the form of making the unemployed independent of benefits – as well as the tool to achieve this goal – in the form of making the unemployed participate in activation schemes. While the former goal is captured by the above-mentioned KPIs on outcomes, the latter is typically measured through detailed monitoring of the participation of clients in activation schemes. The organizations responsible for implementing ALMPs are thus monitored on whether and to what degree they enable this transition towards more active citizenship. While such chains of accountability have always been part and parcel of public sector governance, the use of detailed performance measurements to bolster these accountability relations is a more recent development. This development is closely connected to the advent of New Public Management (NPM) as a new way of governing the public sector (Bevir, 2009). In the traditional Weberian understanding of government, implementing agents are held accountable to elected politicians through detailed procedural regulations and a clear hierarchical structure of command and control. Since the 1980s, proponents of NPM have critiqued such bureaucratic forms of governance for being too rigid and inefficient and called for governments to ‘steer not row’ (Osborne & Gaebler, 1992). The idea being that service delivery organizations should not be bogged down by detailed regulations, but instead should be set free to innovate and customize services to the needs of the citizens (often reframed as consumers). However, the arrival of NPM-inspired governance arrangements has not led to the disappearance of the need for holding service delivery organizations accountable. Rather, it has led to an altering or supplementing of existing procedural regulations with accountability through performance measurements. 
Nowhere has this been more visible than in the implementation of ALMPs, where the quest for activating the unemployed has been bolstered by a whole host of performance measures concerning the activities in which the unemployed participate (Brodkin, 2013; Larsen & van Berkel, 2009). Moreover, this close relation between activation and accountability extends beyond the individual client. Scholars have highlighted how the activation of the unemployed is mirrored in the activation of both the individual caseworker and their organizations – thus creating a form of double or triple activation (van Berkel, 2013). Looking at activation from the perspective of performance measurement, this also leads to a form of triple accountability: (1) the unemployed individual is held accountable to the caseworker through the monitoring of their participation in activation schemes; (2) the caseworker is held accountable to management through the monitoring of their use of specific activation schemes; and (3) the management is held accountable to policymakers through the monitoring of the organization's use of activation schemes.

While such detailed monitoring of output measures is surely indicative of policymakers' lack of trust in the organizations implementing ALMPs, the widespread proliferation of such external accountability performance regimes (Larsen & van Berkel, 2009; Marston & Brodkin, 2013) also points towards a more structural explanation. Whether by fully or partly privatizing the delivery of employment services or by devolving responsibility for service delivery from state to local government, the central administration's ability to control the implementation of ALMPs has generally become more limited in the last couple of decades – at least through traditional procedural regulations. As the delivery of employment services increasingly shifted towards private, third-sector or municipal organizations, central government and administration resorted to detailed performance measurement and monitoring to ensure some form of oversight over, and accountability of, the implementation of ALMPs (Borghi & van Berkel, 2007; Considine et al., 2015; Knuth et al., 2017).

While the creation of systems of performance measurement aligns closely with the central tenet of NPM to introduce managerial practices from the private into the public sector (Hood, 1991), the actual implementation and workings of these systems often differed markedly from the theoretical basis of NPM. The imperative for governments to 'steer not row' and 'let the managers manage' (Osborne & Gaebler, 1992) should ideally entail increased discretion and autonomy regarding the design and organization of services at the level of implementation, combined with increased responsibility and accountability for the results and outcomes of services. From the perspective of performance measurement, service delivery should be treated as a 'black box', and service delivery organizations should only be measured on their results. In practice, however, such black box approaches have never been the dominant or preferred way of measuring performance in the field of ALMP. Instead, governments in most countries opted to monitor and measure employment services on both their expected results (i.e., the aforementioned measures regarding the reduction of the number of benefit recipients and their time spent on benefits) and their approach to achieving these results. The latter has typically been measured on the different outputs of the employment services – for example, the frequency of client-caseworker meetings and activation offers, the preferred types of activation schemes used by employment services, the use of sanctions, categorizations of clients etc.

The reasons behind many governments opting for such detailed output measures, rather than primarily monitoring the outcomes of employment services, are manifold.
Firstly, the introduction of NPM-inspired governance instruments – such as systems of performance measurement – seldom completely displaced the former instruments and logics grounded in the ideal of traditional or 'old' public administration (OPA) (Christensen & Lægreid, 2010). Systems of performance measurement were more often layered on top of existing governance arrangements, where the logic of close procedural regulation prevailed.

Secondly – and as already touched upon earlier – policymakers are not merely concerned with achieving the goal of lowering the number of benefit recipients. Due to ideological considerations, most governments also have a preferred approach to employment services that they wish to further – such as a human capital or a work-first approach (Bruttel & Sol, 2006). It is therefore of paramount importance for governments to be able to closely monitor the approaches used by the organizations delivering employment services.

Thirdly, the marketization of employment services has often been accompanied by outcome-based pay-for-performance schemes, where the income of service providers is either fully or primarily determined by their success in getting clients off benefits. However, many studies of such pay-for-performance schemes have shown considerable effects of creaming and parking clients, as well as standardization rather than personalization of services (Considine et al., 2020; Knuth et al., 2017). The detailed monitoring of the outputs of service providers can thus function as a way of holding providers accountable for their methods as well as their results.

Fourthly, the advent of the ideal of evidence-based policymaking (EBPM) from the 1990s onwards (Andersen & Smith, 2022) has also had a strong influence on the policy and practice of ALMPs. Randomized controlled trials (RCTs) were already used to measure the effects of some of the earliest attempts at both human capital- and work-first-inspired activation schemes under the Californian GAIN programme (Greater Avenues for Independence) in the late 1980s and early 1990s (Baron, 2018; Considine et al., 2015). During the latter half of the 1990s and the 2000s the RCT increasingly became the 'gold standard' for measuring the effects – or lack thereof – of different activation schemes across countries. A central finding of many of these RCTs is a lack of effects, or even detrimental effects (measured on the level of benefit recipients), of activation schemes associated with the human capital approach, and a moderately positive effect of many of the work-first-oriented schemes (Card et al., 2010, 2018). Although the methods and generalizability of these studies have been critiqued and questioned (Andersen, 2020), they have had a substantial influence on both the policymaking and the governance of ALMPs. In the latter case, systems of performance measurement have been used to closely monitor the degree to which employment services apply the methods and activation schemes deemed most effective by the RCTs (Andersen, 2021).

Fifthly, and finally, the close monitoring of outputs has also been actively used to limit and/or restructure the discretionary autonomy of the professions responsible for implementing ALMPs (Evetts, 2009). Whether warranted or not, policymakers have generally shown limited trust in the ability and/or motivation of frontline professionals to implement ALMPs in accordance with both the political goals of work first and the evidence on 'what works'. This lack of trust is also vindicated by one of the foundational theories of NPM – principal-agent theory – which stipulates that agents will pursue self-interested ends when left to their own devices. By detailing the expected outputs and incentivizing the achievement of these outputs through performance measurements, policymakers are thus able to steer the discretion of frontline workers away from following either their own interests or the guidelines and ethics of their given profession, and towards the approaches favoured by policymakers and management (Evetts, 2009). Especially in the Scandinavian context, the use of performance measurement within the area of ALMPs has also been a way of curtailing the power and dominance that the profession of social workers has traditionally held over the services targeting the most vulnerable unemployed (Baadsgaard et al., 2015).
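To make the measurement logic concrete, the minimal sketch below shows how the two outcome KPIs discussed in this section (the stock of benefit recipients and the duration of benefit spells) and a typical output KPI (participation in activation schemes) might be computed from a register of benefit spells. This is only an illustration, not any country's actual monitoring system; the data layout and all field names are hypothetical.

```python
# Minimal sketch of the outcome and output KPIs described above, computed from
# a register of benefit spells. All field names and data are hypothetical.
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class BenefitSpell:
    person_id: str
    start: date
    end: Optional[date]   # None = still receiving benefits
    in_activation: bool   # currently enrolled in an activation scheme

def current(spells: List[BenefitSpell], day: date) -> List[BenefitSpell]:
    """Spells that are open on the given day."""
    return [s for s in spells if s.start <= day and (s.end is None or s.end > day)]

def recipients_on(spells: List[BenefitSpell], day: date) -> int:
    """Outcome KPI 1: the stock of benefit recipients on a given day."""
    return len(current(spells, day))

def mean_duration_days(spells: List[BenefitSpell], today: date) -> float:
    """Outcome KPI 2: mean spell duration in days (open spells censored at today)."""
    durations = [((s.end or today) - s.start).days for s in spells]
    return sum(durations) / len(durations)

def activation_rate(spells: List[BenefitSpell], day: date) -> float:
    """Output KPI: share of current recipients enrolled in activation schemes."""
    now = current(spells, day)
    return sum(s.in_activation for s in now) / len(now) if now else 0.0

spells = [
    BenefitSpell("a", date(2023, 1, 1), date(2023, 4, 1), False),
    BenefitSpell("b", date(2023, 2, 1), None, False),
    BenefitSpell("c", date(2023, 3, 1), None, True),
]
day = date(2023, 5, 1)
print(recipients_on(spells, day))       # 2
print(mean_duration_days(spells, day))  # 80.0
print(activation_rate(spells, day))     # 0.5
```

Even this toy register makes the proxy problem discussed above visible: a falling recipients_on says nothing about whether those who left the register actually found employment.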


THE CONSEQUENCES OF MEASURING ALMPS

The many different reasons and drivers behind the current dominance of KPIs concerning the number, duration and activities of benefit recipients are mirrored in the many different consequences of measuring the success of ALMPs and employment services on these indicators. It is of course extremely difficult – if not outright impossible – to isolate and determine the influence of performance measurement systems on the outcomes of ALMPs. And to the best of our knowledge, no such formal effect-evaluation of performance measurement within the field of ALMPs has ever been conducted. Existing meta-studies on the effects of performance measurement and management systems have generally applied a cross-sectoral perspective (Gerrish, 2016; Pollitt & Dan, 2013) and/or focused on the wider NPM-inspired governance arrangements into which such systems are inscribed (Hood & Dixon, 2015). However, the lack of effect-evaluations specifically trying to determine the causal relation between the use of performance measurement and the effects of ALMPs is not the same as a lack of knowledge of the consequences of performance measurement within the field.

Increased Accountability of Implementing Agents

Many studies have documented how the practices of the street-level workers and organizations delivering employment services have shifted towards work-first-inspired approaches following the introduction of performance measurement systems and other NPM-inspired governance instruments. Using a combination of ethnographic fieldwork and register data, scholars have documented this shift in relation to the aforementioned TANF reform in the US (Brodkin, 2011; Fording et al., 2009). Other scholars have used longitudinal survey data to show how caseworkers in the employment services of the UK, the Netherlands and Australia have changed their perceptions of their job and clients following the introduction of more marketized and managerial modes of governance (Considine et al., 2015). Similar results have also been found in surveys of the managerial staff in Danish job centres (Larsen, 2013; Larsen & Andersen, 2018). From the perspective of central governments, the greater bent towards work-first approaches at the frontline can be viewed as an indication of performance measurement systems succeeding in closing compliance gaps between policymakers' wishes and what is actually being implemented (Bredgaard, 2011). However, while closing compliance gaps between policymaking and implementation is certainly important from the perspective of representative democracy, it is not the same as employment services also performing better – for example, by being more efficient, effective or responsive.

Lowered Level of Benefit Recipients

Given the focus of performance measurement systems on the KPIs measuring the level and duration of benefit recipience, you could argue that the success or failure of such performance measurement systems should also be judged on these KPIs. The general drop in the number of benefit recipients experienced by many countries in the 1990s and 2000s following the introduction of ALMPs – and the parallel implementation of systems of performance measurement – could thus indicate such successful outcomes of the performance measurements. But even if it were methodologically feasible to measure the isolated effect of performance measurement systems on this general trend, the normative question remains whether a lowered level of benefit recipients can and should be considered a success. Few would consider the isolated outcome of a person no longer receiving income-supporting benefits a success if the benefit is not replaced by other means of income. Employment services would of course save some costs in benefit payments, but these savings would either be short-lived (as citizens reapply for benefits when having no other means of income) or transferred to other organizations (e.g., by citizens becoming more dependent on social or health services). And even if benefit recipients do not return to employment or other welfare services, there is still the risk of them merely being supported by family and friends or finding illegal means of income. To consider lowered benefit levels a success thus depends on this indicator being a reliable proxy for the outcome of attaining sustainable employment.

The strongest evidence of whether the main KPIs are reliable proxies for the outcome of sustainable employment would of course be to monitor the individual employment history of each benefit recipient from the moment they leave the employment system. As already mentioned, such detailed statistics have seldom been available in most countries. However, the sustainability of reduced levels of benefit recipients – that is, whether the level of benefit recipients remains lower over time and across economic cycles – can function as a crude indicator of whether it has also led to an actual increase in employment. Across countries there seems to be a general trend of ALMPs succeeding in lowering the level of unemployment benefit recipients, while similar success has not been achieved regarding the level of social assistance benefits or other benefits targeting the more vulnerable unemployed (Andersen et al., 2017; Lødemel & Moreira, 2014; O'Sullivan et al., 2021). This suggests that the KPI of a lowered benefit level is a more reliable proxy for employment outcomes for the most resourceful benefit recipients than for vulnerable benefit recipients facing problems other than unemployment (e.g., related to health). In other words, if systems of performance measurement have had an isolated positive effect on the outcomes of ALMPs, it seems mainly to have been for the most job-ready benefit recipients.

The Parallel Evolution of Ever New Forms of Gaming and KPIs

Another way of evaluating whether the KPIs have been reliable proxies for increased levels of sustainable employment is to look at the process leading up to citizens getting off their benefit. That is, have employment services sought to achieve the KPIs in ways that could logically support citizens' attainment of sustainable employment, or have they mainly focused on lowering the level of benefit recipients – no matter whether this would lead to sustainable employment or not? The answer to this question is of course as varied as there are individual employment service organizations – or even individual caseworkers. However, it is well documented in the literature on performance measurement systems – both within and outside the field of ALMP – that such systems tend to skew the focus of organizations and individuals towards what is measured (Bruijn, 2007; Gerrish, 2016; Pollitt, 2013).
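This skew can be illustrated with a deliberately stylized calculation (all parameters are invented for the purpose of illustration): an agent held accountable only to a short-horizon exit KPI will rationally prefer a work-first programme even where a human capital programme scores higher on the longer-term employment goal the KPI is meant to proxy.

```python
# Stylized illustration (all parameters invented) of how a short-horizon exit
# KPI can favour work-first programmes even when a human capital programme
# performs better on the longer-term goal of sustainable employment.
programmes = {
    # exit_6m: share of clients off benefits within 6 months (the measured KPI)
    # employed_24m: share in stable employment after 24 months (the actual goal)
    "work_first":    {"exit_6m": 0.55, "employed_24m": 0.35},
    "human_capital": {"exit_6m": 0.30, "employed_24m": 0.50},
}

for name, p in programmes.items():
    print(f"{name:14s} KPI (6-month exits): {p['exit_6m']:.0%}  "
          f"goal (24-month employment): {p['employed_24m']:.0%}")

# The programme that maximizes the measured KPI differs from the programme
# that maximizes the goal the KPI is meant to proxy.
print(max(programmes, key=lambda n: programmes[n]["exit_6m"]))       # work_first
print(max(programmes, key=lambda n: programmes[n]["employed_24m"]))  # human_capital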
As explained by Evelyn Brodkin, performance measurement systems create certain forms of routine discretion, where actions and choices that better the performance of the organization or the individual on the main KPIs become much more likely than actions that run counter to these KPIs (Brodkin, 2011). In other words, what gets measured gets done! This again accentuates the importance of KPIs being reliable proxies for what you want to get done. The general rule is that the less directly the indicator mirrors the preferred goal of the organization, the higher the risk of unintended (Birdsall, 2018; Courtney et al., 2004) or perverse (Fording et al., 2009; Munro, 2004) consequences – such as gaming (Benaine, 2020; Bevan & Hood, 2006) or goal displacement (Ordóñez et al., 2009).

Such perverse consequences of performance measurement are also well documented within the field of ALMP. The literature has found numerous examples of gaming, that is, the practice of 'massaging' (Hood, 2007) the numbers to better the score of the organization or the individual on the KPIs. Following the aforementioned TANF reform, welfare organizations developed routine practices that sought to hinder the ability of claimants to access and sustain benefits, for example, by increasing the waiting time of claimants or increasing their likelihood of being sanctioned (Brodkin, 2011). Similar examples abound from other countries (Møller et al., 2016), but it is especially within the context of the US, UK and Australia (O'Sullivan et al., 2021; Soss et al., 2011; Wright et al., 2020) that the quest for achieving the KPIs has led to such outright punitive approaches to benefit claimants. This aligns well with findings from the broader literature on performance measurement systems, which suggest that high-powered incentive structures – such as those typically inscribed within the performance measurement systems of Anglophone countries – are highly conducive to organizations doing whatever it takes to achieve the targets set by the KPIs (Moynihan, 2009). This includes ways of gaming that transfer burdens to the clients and in many ways run counter to the overarching goal of increasing the level of employment.

Policymakers and administrators have of course not been blind to these instances of gaming behaviour. Typically, the answer has been to devise new supplementary indicators and incentives that can limit the potential negative consequences of existing indicators. To deter employment service organizations from creaming, specific indicators and incentives can be set for different categories of benefit recipients. The logic is that the more vulnerable the unemployed, the greater the reward for getting them off the benefit. To deter the practice of parking, participation in specific types of activation can be set as a KPI. Or the KPI of getting people off benefits can be supplemented by performance targets on whether the benefit recipient exits into an actual job. One of the clearest consequences of the practice of performance measurement within the field of ALMP has thus been the continual increase in ever more detailed indicators – especially concerning the outputs of employment services. This adheres to the more general 'law of mushrooming', which specifies how the number of KPIs in performance measurement systems tends to increase over time (Bruijn, 2007; Pollitt et al., 2010; Woelert, 2015).

However, the introduction of new and seemingly better indicators has often just led to new creative ways of gaming rather than the disappearance of such behaviour. In Australia's privatized employment services, efforts to introduce new target indicators concerning the most vulnerable unemployed or actual job attainment have been met with creative workarounds, such as re-categorizing stronger recipients so they appear more vulnerable (O'Sullivan et al., 2021) or simply financing short-term jobs for the clients at the points in time where performance is measured (Bredgaard & Larsen, 2006). In both US and Scandinavian employment services, the introduction of KPIs concerning specific types of activation schemes has also been shown to create incentives for employment services to re-categorize existing activation schemes so that performance targets are met, while the content of the activation scheme remains the same (Breidahl & Larsen, 2015; Brodkin, 2011). These examples all suggest the continuing prevalence of 'Campbell's law', which states that: 'The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor' (Campbell, 1979).

Performance Measurement and Standardization

Performance measurement within the field of ALMP has, as indicated above, had a significant impact on the content of ALMP measures. The triple activation of unemployed citizens, frontline workers and the organizations within which they work (van Berkel, 2013) has been supported by continuously more comprehensive performance measurement systems. However, despite the theoretical assumption inherent in the NPM paradigm that bureaucratic procedures and rigid standardization would be reduced by primarily measuring outcomes, practice has not lived up to these expectations. An unexpected downside of the increasingly sophisticated performance measurement systems has been that implementing organizations often spend more resources on monitoring and documenting services than on delivering them. Contrary to the intentions, standardization of services has thus been the preferred answer to handling this pressure of meeting and documenting both activity and result targets. Standardization of services has been achieved both by the implementing organizations' own guidelines for meeting targets and by performance goals indirectly steering how frontline workers use their discretion in ways that ultimately 'raise the odds that preferred paths will be taken' (McGann, 2022; Soss et al., 2011). Especially as the target group for active labour market policy has been expanded to include all types of target groups, including the most vulnerable groups in society, employment services have been heavily criticized for being too standardized and for lacking the flexibility to adapt to citizens' needs (Caswell & Larsen, 2022; Considine et al., 2015; Lindsay et al., 2018). With the overall KPI of getting people off benefits in mind, this may not be that surprising (cf. also the conditionality discussion, see, e.g., Dwyer, 2018; Watts & Fitzpatrick, 2018; Wright et al., 2020), but it also relates to the standardized nature of services, which has often been criticized for being meaningless for both society and citizens.

CONCLUSION: THE FUTURE OF MEASURING ALMPS

The above-described critique of performance measurement is not isolated to the field of ALMP but is part of a growing general criticism of NPM (Hood & Dixon, 2015; Hood & Peters, 2004; Osborne et al., 2022). NPM – and by extension the above-mentioned forms of performance measurement – is, among other things, being criticized for the inappropriateness of its product-dominant approach, its emphasis on internal efficiency rather than external impact on societal and individual problems, and its challenges to democratic governance (Osborne et al., 2022). Post-NPM forms of governance such as New Public Governance (NPG) have offered answers to these challenges – also within the field of ALMP, where ideas of co-production (Lindsay et al., 2018), co-creation (Caswell & Larsen, 2022), personalized services (Thornton & Corden, 2017), user involvement (Djuve & Kavli, 2015) and integrated services (Minas, 2016) have gained prevalence in recent years.

These new reform paths are, however, neither uniform nor have they, so far, replaced earlier governance paradigms. Instead, new governance arrangements are layered on top of pre-existing reforms and established structures, which means that NPM logics remain inherent in new hybrid governance forms. Furthermore, these transformations are also shaped by the institutionalized organizational setup of already-established service systems, as new reform elements are to be incorporated into the existing welfare organizations. This is also the case with ALMPs, where performance measurements will most likely remain part and parcel of governance arrangements in the foreseeable future. A major challenge will therefore be how to make the transition from an NPM-inspired production logic towards an NPG-inspired citizen logic. Such a transition will be very difficult as long as legitimacy is predicated on performance measurements focused on efficiency and accountability. One likely scenario is that other logics gain a foothold – for example, alternative measurements targeting the value of services for users – but the traditional performance measures (see above) remain the most important ways of ensuring legitimacy for ALMPs. Another scenario is that the policy and governance changes are so radical that new KPIs and ways of measuring them replace the old ones. Finally, performance measurement might lose some of its importance for ensuring the legitimacy of ALMPs, simply by it being accepted that value creation for citizens is quite hard to fit into existing performance measurement systems, meaning that legitimacy must be ensured through other means (e.g., by professional knowledge and norms inherent in frontline professions).

No matter what, it is hard to imagine that performance measurement will not have an important role in the design and implementation of ALMPs in the future. How this will evolve depends on the extent to which it is possible to develop new meaningful measurements adapted to post-NPM logics. And – as we have learned from the past 30 years of measuring ALMPs – it may even be that the relation also works the other way around, meaning that one of the decisive factors for succeeding in designing and implementing more citizen-oriented approaches to ALMP is whether and how such approaches can be captured by the KPIs of performance measurement systems.

REFERENCES

Andersen, N.A. (2020). Evidensbaseret Beskæftigelsespolitik: Et studie af evalueringer, magt og offentlig politikudvikling. Aalborg Universitetsforlag.
Andersen, N.A. (2021). The technocratic rationality of governance: The case of the Danish employment services. Critical Policy Studies, 15(4), 425–43.
Andersen, N.A., & Smith, K. (2022). Evidence-based policy-making. In B. Greve (Ed.), De Gruyter handbook of contemporary welfare states (pp. 29–44). De Gruyter.
Andersen, N.A., Caswell, D., & Larsen, F. (2017). A new approach to helping the hard to place unemployed: The promise of developing new knowledge in an interactive and collaborative process. European Journal of Social Security, 19(4), 335–52.
Baadsgaard, K., Jørgensen, H., & Nørup, I. (2015). De-professionalization through managerialization in labour market policy: Lessons from the Danish experience. In T. Klenk & E. Pavolini (Eds.), Restructuring welfare governance (pp. 163–82). Edward Elgar Publishing.
Baron, J. (2018). A brief history of evidence-based policy. The ANNALS of the American Academy of Political and Social Science, 678(1), 40–50.
Benaine, S.L. (2020). Performance gaming: A systematic review of the literature in public administration and other disciplines with directions for future research. International Journal of Public Sector Management, 33(5), 497–517.
Bevan, G., & Hood, C. (2006). What's measured is what matters: Targets and gaming in the English public health care system. Public Administration, 84(3), 517–38.
Bevir, M. (2009). Key concepts in governance. Sage.
Birdsall, C. (2018). Performance management in public higher education: Unintended consequences and the implications of organizational diversity. Public Performance & Management Review, 41(4), 669–95.

Bjørnholt, B., & Larsen, F. (2014). The politics of performance measurement: 'Evaluation use as mediator for politics'. Evaluation, 20(4), 400–411.
Bonoli, G. (2010). The political economy of active labor-market policy. Politics & Society, 38(4), 435–57.
Borghi, V., & van Berkel, R. (2007). New modes of governance in Italy and the Netherlands: The case of activation policies. Public Administration, 85(1), 83–101.
Bredgaard, T. (2011). When the government governs: Closing compliance gaps in Danish employment policies. International Journal of Public Administration, 34(12), 764–74.
Bredgaard, T., & Larsen, F. (2006). Udliciteringen af beskæftigelsespolitikken – Australien, Holland og Danmark: markedets usynlige eller statens synlige hånd? Djøf Forlag.
Breidahl, K.N., & Larsen, F. (2015). The developing trajectory of the marketization of public employment services in Denmark: A new way forward or the end of marketization? European Policy Analysis, 1(1), 92–107.
Brodkin, E.Z. (2011). Policy work: Street-level organizations under new managerialism. Journal of Public Administration Research and Theory, 21, i253–i277.
Brodkin, E.Z. (2013). Street-level organizations and the welfare state. In G. Marston & E.Z. Brodkin (Eds.), Work and the welfare state: Street-level organizations and workfare politics (pp. 17–34). Georgetown University Press.
Brodkin, E.Z. (2021). Foreword: On history, poverty, and the continued quest for reform. In S. O'Sullivan, M. McGann, & M. Considine (Eds.), Buying and selling the poor – inside Australia's privatized welfare-to-work market (pp. 1–14). Sydney University Press.
Bruijn, J.A. (2007). Managing performance in the public sector. Routledge.
Bruttel, O., & Sol, E. (2006). Work first as a European model? Evidence from Germany and the Netherlands. Policy & Politics, 34(1), 69–89.
Campbell, D.T. (1979). Assessing the impact of planned social change. Evaluation and Program Planning, 2(1), 67–90.
Card, D., Kluve, J., & Weber, A. (2010). Active labour market policy evaluations: A meta-analysis. The Economic Journal, 120(548), F452–F477.
Card, D., Kluve, J., & Weber, A. (2018). What works? A meta analysis of recent active labor market program evaluations. Journal of the European Economic Association, 16(3), 894–931.
Caswell, D., & Larsen, F. (2022). Co-creation in an era of welfare conditionality – lessons from Denmark. Journal of Social Policy, 51(1), 58–76.
Christensen, T., & Lægreid, P. (2010). Complexity and hybrid public administration – theoretical and empirical challenges. Public Organization Review, 11(4), 407–23.
Considine, M., Lewis, J.M., O'Sullivan, S., & Sol, E. (2015). Getting welfare to work: Street-level governance in Australia, the UK, and the Netherlands. Oxford University Press.
Considine, M., O'Sullivan, S., McGann, M., & Nguyen, P. (2020). Contracting personalization by results: Comparing marketization reforms in the UK and Australia. Public Administration, 98(4), 873–90.
Courtney, M.E., Needell, B., & Wulczyn, F. (2004). Unintended consequences of the push for accountability: The case of national child welfare performance standards. Children and Youth Services Review, 26(12), 1141–54.
Desrosières, A. (1998). The politics of large numbers: A history of statistical reasoning. Harvard University Press.
Djuve, A.B., & Kavli, H.C. (2015). Facilitating user involvement in activation programmes: When carers and clerks meet pawns and queens. Journal of Social Policy, 44(2), 235–54.
Dwyer, P. (2018). Punitive and ineffective: Benefit sanctions within social security. Journal of Social Security Law, 25(3), 142–57.
Evetts, J. (2009). New professionalism and New Public Management: Changes, continuities and consequences. Comparative Sociology, 8(2), 247–66.
Fording, R., Schram, S.F., & Soss, J. (2009). The organization of discipline: From performance management to perversity and punishment. Journal of Public Administration Research and Theory, 21(2), i203–i232.
Gerrish, E. (2016). The impact of performance management on performance in public organizations: A meta-analysis. Public Administration Review, 76(1), 48–66.

Hood, C. (1991). A public management for all seasons? Public Administration, 69(1), 3–19.
Hood, C. (2007). Public service management by numbers: Why does it vary? Where has it come from? What are the gaps and the puzzles? Public Money & Management, 27(2), 95–102.
Hood, C., & Dixon, R. (2015). What we have to show for 30 years of new public management: Higher costs, more complaints. Governance, 28(3), 265–7.
Hood, C., & Peters, G. (2004). The middle aging of new public management: Into the age of paradox? Journal of Public Administration Research and Theory, 14(3), 267–82.
Jørgensen, H. (2009). From a beautiful swan to an ugly duckling – the renewal of Danish activation policy since 2003. European Journal of Social Security, 11(4), 337–68.
Knuth, M., Larsen, F., Greer, I., & Breidahl, K.N. (2017). The marketization of employment services: The dilemmas of Europe's work-first welfare states. Oxford University Press.
Larsen, F. (2013). Active labor-market reform in Denmark: The role of governance in policy change. In G. Marston & E.Z. Brodkin (Eds.), Work and the welfare state: Street-level organizations and workfare politics (pp. 103–24). Georgetown University Press.
Larsen, F., & Andersen, N.A. (2018). Beskæftigelse for alle? Den kommunale beskæftigelsespolitik på kontanthjælpsområdet siden 2000 (1st ed.). Frydenlund Academic.
Larsen, F., & van Berkel, R. (2009). The new governance and implementation of labour market policies (1st ed.). DJØF.
Lindsay, C. (2007). The United Kingdom's 'Work First' welfare state and activation regimes in Europe. In A.S. Pascual & L. Magnusson (Eds.), Reshaping welfare states and activation regimes in Europe (pp. 35–70). Peter Lang.
Lindsay, C., McQuaid, R.W., & Dutton, M. (2007). New approaches to employability in the UK: Combining 'Human Capital Development' and 'Work First' strategies? Journal of Social Policy, 36(4), 539–60.
Lindsay, C., Pearson, S., Batty, E., Cullen, A.M., & Eadson, W. (2018). Co-production as a route to employability: Lessons from services with lone parents. Public Administration, 96(2), 318–32.
Lødemel, I., & Moreira, A. (2014). Activation or workfare? Governance and neo-liberal convergence. Oxford University Press.
Lødemel, I., & Trickey, H. (2001). An offer you can't refuse: Workfare in international perspective. Policy Press.
Marston, G., & Brodkin, E.Z. (2013). Work and the welfare state: Street-level organizations and workfare politics. Georgetown University Press.
McGann, M. (2022). Meeting the numbers: Performance politics and welfare-to-work at the street-level. Irish Journal of Sociology, 30(1), 69–89.
Minas, R. (2016). The concept of integrated services in different welfare states from a life course perspective. International Social Security Review, 69(3–4), 85–107.
Møller, M.Ø., Iversen, K., & Andersen, V.N. (2016). Review af resultatbaseret styring – Resultatbaseret styring på grundskole-, beskæftigelses- og socialområdet. KORA – Det Nationale Institut for Kommuners og Regioners Analyse og Forskning.
Moynihan, D.P. (2009). Through a glass, darkly: Understanding the effects of performance regimes. Public Performance & Management Review, 32(4), 592–603.
Munro, E. (2004). The impact of audit on social work practice. The British Journal of Social Work, 34(8), 1075–95.
O'Sullivan, S., McGann, M., & Considine, M. (2021). Buying and selling the poor – inside Australia's privatized welfare-to-work market. Sydney University Press.
Ordóñez, L.D., Schweitzer, M.E., Galinsky, A.D., & Bazerman, M.H. (2009). Goals gone wild: The systematic side effects of overprescribing goal setting. Academy of Management Perspectives, 23(1), 6–16.
Osborne, D.E., & Gaebler, T. (1992). Reinventing government: How the entrepreneurial spirit is transforming the public sector. Addison-Wesley.
Osborne, S., Powell, M.G.H., Cui, T., & Strokosch, K. (2022). Value creation in the public service ecosystem: An integrative framework. Public Administration Review, 82(4), 634–45.
Peck, J., & Theodore, N. (2001). Exporting workfare/importing welfare-to-work: Exploring the politics of Third Way policy transfer. Political Geography, 20(4), 427–60.
Piketty, T. (2014). Capital in the twenty-first century. Harvard University Press.

Pollitt, C. (2013). The logics of performance management. Evaluation, 19(4), 346–63.
Pollitt, C., & Dan, S. (2013). Searching for impacts in performance-oriented management reform. Public Performance & Management Review, 37(1), 7–32.
Pollitt, C., Harrison, S., Dowswell, G., Jerak-Zuiderent, S., & Bal, R. (2010). Performance regimes in health care: Institutions, critical junctures and the logic of escalation in England and the Netherlands. Evaluation, 16(1), 13–29.
Soss, J., Fording, R.C., & Schram, S. (2011). Disciplining the poor: Neoliberal paternalism and the persistent power of race. University of Chicago Press.
Taylor-Gooby, P. (2004). New social risks and welfare states: New paradigm and new politics. In P. Taylor-Gooby (Ed.), New risks, new welfare – the transformation of the European welfare state (pp. 209–39). Oxford University Press.
Thornton, P., & Corden, A. (2017). Personalised employment services for disability benefits recipients: Are comparisons useful? In P. Saunders (Ed.), Welfare to work in practice: Social security and participation in economic and social life (pp. 173–87). Routledge.
Torfing, J. (1999). Workfare with welfare: Recent reforms of the Danish welfare state. Journal of European Social Policy, 9(1), 5–28.
van Berkel, R. (2013). Triple activation: Introducing welfare to work into Dutch social assistance. In G. Marston & E.Z. Brodkin (Eds.), Work and the welfare state: Street-level organizations and workfare politics (pp. 87–103). Georgetown University Press.
Watts, B., & Fitzpatrick, S. (2018). Welfare conditionality. Routledge.
Woelert, P. (2015). The 'logic of escalation' in performance measurement: An analysis of the dynamics of a research evaluation system. Policy and Society, 34(1), 75–85.
Wright, S., Fletcher, D.R., & Stewart, A.B.R. (2020). Punitive benefit sanctions, welfare conditionality, and the social abuse of unemployed people in Britain: Transforming claimants into offenders? Social Policy and Administration, 54, 278–94.

16. Governance in public health care: measurement (in)completeness

Margit Malmmose

INTRODUCTION

This chapter describes the health care governance themes related to performance measures that have developed over the last decades. During the past 30–40 years, on a global level, we have witnessed a continuous struggle to balance health care management's focus on cost and quality. In this pursuit, countries face similar challenges and public pressures despite differences in socio-political set-ups (World Health Organization et al., 2018). This chapter highlights global health care governance's continuous development and progress towards more completely capturing both financial resource management and constraints and individual patient-level needs and expectations. Such policy and strategic endeavours typically translate into different forms of financial and non-financial performance measurement systems, which will be described and linked to socio-political implications.

Emerging from the viewpoint of New Public Management (NPM), accounting has come to occupy a central place in health care policymaking and governance (Arnaboldi et al., 2015; Bevan and Hood, 2006; Haslam and Lehman, 2006; Lapsley, 2001). Within the NPM regime, a starting point was the focus on prospective payment systems – specifically that of diagnosis-related groups (DRGs), evolving during the 1980s and 1990s (Borden, 1988; Forgione et al., 2005). This focus established cost accounting's strong influence on health care (Chapman et al., 2014; Chua and Preston, 1994; Forgione et al., 2005; Lehtonen, 2007) and drove some of the early NPM market-based reforms, particularly seen in New Zealand (Lawrence et al., 1994) and the UK (Ezzamel and Willmott, 1993; Kurunmäki and Miller, 2008). Since the end of the 1990s, however, narrowly focused managerial control systems have been replaced by a more complete, nuanced and participative approach that builds on the cooperation and inclusion of health professionals (Christensen and Lægreid, 2011; Malmmose, 2015b; World Health Organization, 2000) as well as patients (Cardinaels and Soderstrom, 2013; Malmmose and Kure, 2021). This change has since expanded in various directions, emphasising quality as a driver of governance (Pflueger, 2020). Quality is widely applied to support specific and quantitative non-financial measures – such as quality indicators and the broader ideas of value-based health care (Kaplan et al., 2017; Porter and Lee, 2013) – which contain elements of cost accounting (Kaplan and Witkowski, 2014). Thus, quality is crucial when discussing the relations between governance and measurement. In addition, the concept of quality captures diffuse understandings, such as individual patient perspectives (Malmmose and Kure, 2021; Pflueger, 2016; Pflueger and Zinck Pedersen, 2022).

These changes in governance to engage more with qualitative aspects of health care services respond to an ongoing concern that a narrow productivity focus comes at the expense of the patient (Chua and Preston, 1994; Preston, 1992) and conflicts with the Hippocratic oath (Malmmose, 2015a). However, the challenge is to expand the performance measurement system to capture non-financial elements, which is difficult because, as Pflueger (2016) explains, individually experienced quality is an intimate space that can neither be monitored nor standardised. These concerns indicate that something else is at stake that cannot be solved by merely investigating the extension of accounting and measurement techniques; instead, a grasp of hospital management's ability to manage and understand the incompleteness of such measures is essential (Dahler-Larsen, 2019; Jordan and Messner, 2012). Relating to such understandings of what performance measures can and cannot represent, this chapter highlights continuing issues in deciding what to steer and control, which becomes increasingly complex and challenging as health care governance's focus expands to include quality and patient aspects. The chapter additionally suggests that such endeavours will be entwined with future health care sustainability efforts.

This chapter aims to map out the above-introduced performance management developments in health care. Such developments will elucidate hospital managerial challenges, enabling a more careful and informed research agenda. Thus, the chapter provides an in-depth description of the past decades' methods and regimes of performance measures. Following this description, the chapter elaborates on crucial changes and the forces behind them. The final section of the chapter discusses the political and administrative consequences of these changes. By relating to 'incompleteness', the chapter highlights the challenges of seeking to completely capture all elements of governance within a performance measurement system.

IDENTIFICATION OF THE MOST INFLUENTIAL REGIMES OF PERFORMANCE MEASURING IN HEALTH CARE

The most influential measurement regimes in health care policy development are distinguishable by financial and non-financial performance criteria, both of which are vital to capturing the inherent role of measurement. Still, it is crucial to remember that financial and non-financial performance measures interrelate.

Financial Performance Measures

Health care expenses, as a percentage share of gross domestic product (GDP), have doubled and, in some countries, tripled within the OECD countries during the past 40–50 years (OECD, 2021), stimulating a focus on the economy (Ashton, 2001; Covaleski and Dirsmith, 1983; Groot, 1999; Preston et al., 1992). This financial focus is predominantly marked by a management accounting orientation that stresses efficiency, cost control, budgeting and management control (Hood, 1995; Jackson and Lapsley, 2003; Lapsley and Wright, 2004; Pettersen, 2001). As a result, policymakers have continuously pursued managerial solutions for the public health care sector to steer finances.

A prevailing tool for dealing with health care costs while aiming for better managerial solutions is that of DRGs. The DRG prospective payment system's development in 1985 as part of the US Medicare Waiver became the cornerstone of a global emphasis on costs within health care (Malmmose, 2015b). The goal of creating the DRG system was to replace retrospective payment. Following the DRG implementation in the US, the World Health Organization (WHO) disseminated a similar test project from a UK health district in Wales. The WHO issued a report, 'The Application of Diagnosis-Related Groups (DRGs) for hospital budgeting and performance measurement' (World Health Organization, 1988; see also Chua and Preston, 1994), urging its member states to set up similar systems that demand records of hospital activity data and hospital cost data (WHO, 1988, p. 5). Consequently, most OECD nations adopted DRG systems during the late 1980s and the 1990s (Forgione et al., 2005), often applied as benchmarking and resource-allocation systems to increase transparency and improve efficiency (Busse et al., 2013). Recently, a report from the World Bank (Bredenkamp et al., 2020) provided examples of DRG cases from countries like the US, Russia, China and Germany and stated its aim 'to help policymakers who are considering or beginning to introduce diagnosis-related group based payments' (p. 1), which shows that the DRG remains a strong and enduring method in health care.

Thus, the DRG system represents a widely applied accounting method integrated with policies for measuring the costs of hospital activity (Forgione et al., 2005), hospital productivity (Cardinaels and Soderstrom, 2013; Chua and Preston, 1994) and reimbursement (Cardinaels and Soderstrom, 2013, p. 649; Tan et al., 2011). Variations exist in how DRGs are calculated (Bredenkamp et al., 2020; Tan et al., 2014), but the common denominator is the classification of patients – how they are grouped – and the determination of cost weights for each group of patients (Bredenkamp et al., 2020). For example, there can be differences in what cost rate calculations include and whether such rates are based on historical costs or negotiated (Bredenkamp et al., 2020; Busse, 2011; Tan et al., 2014). The US has developed its DRG system for pricing purposes according to an insurance and third-party-payer structure. Other OECD nations, with mostly collectively financed health care, instead apply DRGs for activity measurement and budget creation for resource allocation, which differs considerably from the former system (Busse, 2011).

An alternative to the DRG mode of steering activity and costs is the Time-Driven Activity-Based Costing (TDABC) system. There have been examples of traditional activity-based costing (ABC) in health care since the early 1990s (see, e.g., King et al., 1994). However, within the past ten years, ABC has gained significant attention in value-based health care (Kaplan, 2014). In this context, the balance between stakeholder perspectives should ultimately be measured in health outcomes compared to the costs incurred to create this value-focused outcome. In this newly established agenda, a focus on the cost of the entire patient continuity of care (Leung and van Merode, 2019) is essential. This agenda warrants the increasing relevance of investigating a patient's total use of resources across hospital departments, which has been empirically and theoretically lacking (Kaplan and Witkowski, 2014). Kaplan and Porter (2011) suggest applying TDABC, which seeks to allocate costs according to the core activities and the patient's medical condition rather than medical and surgical specialities (Kaplan and Witkowski, 2014; Kaplan et al., 2017). TDABC increases direct and indirect cost accuracy (Balakrishnan et al., 2018). Accordingly, research has demonstrated how to implement TDABC and relate it to outcome data and, thus, how to technically support the value-based approach with data (Leung and van Merode, 2019). So far, TDABC has not gained the same profound foothold in health care at national policy levels as the DRG system. Instead, we see sporadic TDABC applications, typically at a departmental level (Keel et al., 2017). One reason behind this disparity is the lack of sufficiently detailed data, since TDABC demands much more detailed activity and cost data (Malmmose and Lydersen, 2021). TDABC-associated, value-driven health care management ideas are therefore more often seen within health care policies from non-financial and strategic perspectives.
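The two costing logics can be summarised in a compact sketch (all figures are invented, and actual weight calculation varies by country, cf. Bredenkamp et al., 2020): DRG cost weights express each patient group's average cost relative to the overall average cost per case, while TDABC derives a capacity cost rate per minute and multiplies it by the minutes a patient's pathway consumes.

```python
# Invented figures illustrating (1) DRG cost weights and (2) a TDABC rate.

# (1) DRG-style cost weights: each group's mean cost / overall mean cost per case
cases = [("DRG_A", 4000), ("DRG_A", 4400), ("DRG_B", 9500), ("DRG_B", 10500)]
overall_mean = sum(cost for _, cost in cases) / len(cases)  # 7100.0

groups = {}
for drg, cost in cases:
    groups.setdefault(drg, []).append(cost)

weights = {drg: (sum(c) / len(c)) / overall_mean for drg, c in groups.items()}
print(weights)  # {'DRG_A': ~0.59, 'DRG_B': ~1.41}

# Reimbursement (or budget allocation) per discharge: base rate * cost weight
base_rate = 7100
print({drg: round(base_rate * w) for drg, w in weights.items()})

# (2) TDABC: capacity cost rate = resource cost / practical capacity in minutes;
# a patient's pathway cost = sum over activities of minutes consumed * the rate
dept_cost_per_month = 400_000       # staff, space, equipment
practical_capacity_min = 80_000     # staffed minutes available for patient care
rate_per_min = dept_cost_per_month / practical_capacity_min  # 5.0

pathway_minutes = {"triage": 10, "consultation": 25, "imaging": 20}
print(sum(m * rate_per_min for m in pathway_minutes.values()))  # 275.0
```

The sketch also hints at why TDABC is data-hungry: a DRG weight needs only a cost per discharge, whereas a TDABC pathway needs timed activity data for every step of every patient's care.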

From Non-financial Performance Measures to Strategic Goals

In light of the strong NPM influence during the 1980s and 1990s, with its narrow focus on cost, the WHO published a document in 2000 (World Health Organization, 2000) emphasising the need to apply a broader view of the performance measures driving health care governance policies and decision-making. The WHO stated that 'performance assessment allows policymakers, health providers and the population at large to see themselves in terms of the social arrangements they have constructed to improve health' (p. viii). Thus, the WHO encouraged performance measurement in ways beyond cost determinants. After this publication, 'patient-oriented', 'quality of care' and 'cooperation' became frequent keywords, while an emphasis on cost control, performance measures and budgets was sustained (Hyndman and Lapsley, 2016; Lapsley, 2008). Thus, the WHO's 2000 report marks a change in health care governance, encompassing various attempts to integrate non-financial performance measures and manifesting in various forms of policymaking. For example, the New Zealand Health Strategy 2000 (New Zealand's Minister of Health Hon Annette King, 2000) underscores quality while integrating 13 specific goals, such as reducing smoking and improving nutrition. Additionally, the strategy heavily emphasises collaboration and coordination, as does the Health Act 2002 in the UK.

While the methods of introducing quality assurance measures – particularly their indicators – varied widely across time and nations during the following decades, some similar sub-systems and approaches have evolved. One of the first methods introduced to improve health care quality is that of accreditation systems. Again, this method was summarised and disseminated by the WHO in 2003 in 'Quality and accreditation in health care services', which defines accreditation models as 'programmes that on a national level, aim to provide accreditation services to primary care, community services, hospitals or networks' (p. 106). The report outlines different action controls; thus, quality assurances are translated into checklists that state whether a health care organisation takes specific actions to ensure quality. The WHO 2003 report shows that this was a common approach among member states at the beginning of the 2000s, when 33 countries reported the application of accreditation systems. For example, 'The Malcolm Baldrige Model' from the US is a specific set of standards that some European nations and Australia have adopted. A recent review conducted by Tabrizi and Gharibi (2019) shows that most research studies have focused on the English-speaking nations' application of such systems; the Middle East (countries such as Jordan, Egypt and Lebanon) and Denmark have also produced some studies (Ehlers et al., 2017; Tabrizi and Gharibi, 2019; Triantafillou, 2014). Recent studies estimate that up to 80 countries today have accreditation models implemented at some level in their health care systems (Mannion et al., 2018; Mosadeghrad, 2021).

What followed simultaneously across nations were other developments of quality goals and standards at national levels, in all shapes and sizes. In 2006 the WHO published a report called 'Quality of care – a process for making strategic choices in health systems' to more precisely define what 'quality' in health care means.
This definition of quality included six areas within health care systems: (1) Effective, focusing on improved outcomes for individuals and communities; (2) Efficient, maximising resource use and avoiding waste; (3) Accessible, providing timely, geographically reasonable health care in appropriate settings; (4) Acceptable/patient-centred, giving individual service and understanding cultures and different needs; (5) Equitable, delivering equal health care irrespective of race, income, gender and geographical location; and (6) Safe, minimising harm and risk to patients (World Health Organization, 2006, pp. 9–10). Thus, performance measures started to take on varied shapes to capture diverse areas of health care governance.

Cardinaels and Soderstrom (2013) identify the increasingly legitimate adoption of quality in hospitals as a progressively more critical element of cost accounting. Finite budgets urge governments to optimise resources in health care while increasing individual safety, access and patient-centred care. Quality programmes' clinical databases and indicators are part of these different forms of governance (Mainz, 2003). Although such databases have existed within the medical profession for years, the introduction of quality indicators into governance systems has changed the scope of such performance measures, and they increasingly factor into health care management (Buetow and Roland, 1999). Consequently, non-financial performance measures have become part of a broad spectrum of management and governance systems, and such methods have therefore gradually been introduced alongside – and to some extent confused with – strategies.

Within the past decade we have seen the development of the more strategic phenomena of value-based management and a focus on 'patient first' programmes. Ideally, to '[achieve] high value for patients must become the overarching goal of health care delivery, with value defined as the health outcomes achieved per dollar spent' (Porter, 2010, p. 2477). This conceptualisation supports the theory that value-based management is strongly linked to cost accounting. However, in practice, a more strategic approach is identified. For example, the US state of Maryland rhetorically introduces the concept of value-based health care with its pricing changes towards global budgets, yet it does not mention cost accounting aspects (Malmmose and Fouladi, 2019; Patel et al., 2015). In Denmark, there has been a predominant focus on value-based health care at a regional level (Moll, 2018), yet with little link to cost accounting (Malmmose and Lydersen, 2021; Triantafillou, 2020). Likewise, a report from the UK on the NHS (Hurst et al., 2019) recommends value-based health care within the NHS, but it puts little emphasis on cost measures. In Australia, on the other hand, a more precise guideline for integrating value-based management, including the cost techniques, has been published by the Deeble Institute for Health Policy Research. Despite this example, overall we see an emphasis on the strategic part of health care services, which is also highlighted by Porter, who states that 'In any field, improving performance and accountability depends on having a shared goal that unites the interests and activities of all stakeholders' (Porter, 2010, p. 2477). Thus, with the concepts of quality and value-based management, more fluid approaches – that is, not indicator-specific ones – are implied.

Associated with value-based agendas is an emphasis on individual patients' perspectives, which become part of governance systems; for example, in 2016, the WHO published a substantial document of 700 pages, 'Strategizing national health in the 21st century: a handbook'. In this handbook, the WHO recommends that nations use situational analysis to understand the individual needs of their particular health system. Concurrently, a recent New Zealand report (2016) focuses on value and high-performance measures in health care by being more locally present and empowering patients.
The Minister of Health, Dr Jonathan Coleman, even states, 'Overwhelmingly, I heard the need for greater focus on people, how to engage better in designing services together and how to better understand people's need' (p. iii). This sentiment is echoed in Maryland's attempt to introduce more advanced management control systems aimed at 'moving towards a more patient-centred system' (Johns Hopkins and Maryland Hospital Association, 2015, session 1) and in Denmark's efforts to put the 'Patient First' (The Danish Ministry of Health, 2019; see also Malmmose and Kure, 2021). Thus, the empowerment of patients is becoming a central element in health care policy development.

Emphasising individual patient needs through discussions of quality and value-based approaches raises an expectation of something other than standardised indicators. Health care policymakers are thus moving into an intimate sphere, which is difficult to steer and govern (Pflueger, 2016, 2020), yet which has undergone multiple attempts to conform to performance measures. Hence, as non-financial performance measures have evolved in health care, so has confusion about what these measures can or should represent and about the extent to which policymakers, health care managers and stakeholders are aware of such measurement incompleteness.

DRIVING FORCES BEHIND PERFORMANCE MEASUREMENT INTRODUCTIONS AND CHANGES

Summing up, two main initiatives have had significant international impacts on health care management approaches: a focus on budgets and costs during the 1980s and 1990s and, from 2000 onward, a more expansive governance focus, including qualitative and patient-centred indicators. Hood (1991) highlights the complexity of the driving forces behind NPM reforms. Socio-technical system changes, such as technological developments and income levels, are often cited as motivating factors driving such reforms. However, health care expenditures suggest a more plausible explanation. Some researchers claim that rapid growth in health care expenditures during the 1960s, 1970s and 1980s sparked an interest in reforms to curtail growing costs (Oxley and MacFarlan, 1995). For decades, controlling costs in health care has been a core concern debated in reform development (Malmmose, 2015b). Looking back to the 1970s, the WHO Alma Ata declaration urged nations to reallocate their resources, particularly from military spending, to health care (World Health Organization, 1978). This declaration supports using social measures through economic rationalisation, highlighting the pre-eminence of budgets and costing methods during implementation. The declaration thus includes several elements familiar from NPM reforms and from management accounting terminology and techniques: governments are urged to set up structured health care systems by means of economically optimised programmes based on standardisation (Malmmose, 2015b). In addition, some nations – mainly English-speaking nations – developed market-based reforms, also known as marketisation (Pollitt and Bouckaert, 2004), which intensified the split between purchasers and providers (Ellwood, 1996; Northcott and Llewellyn, 2001). In the following decades, we witness continuously rising health care expenditures but also the increasing application of NPM reforms focusing on accountability and health care efficiency by applying private sector management tools (Clatworthy et al., 2000; Robson, 2008). Thus, increasing health costs, NPM trends and accounting technologies enabled an increasing focus on various forms of calculative practices in health care (Kurunmäki and Miller, 2006; Lapsley and Wright, 2004). Such movements laid the groundwork for the overall development and implementation of DRG as a standard involved in various structures, allocating resources and holding health care organisations accountable. In later years, we observe a manifestation of such initiatives in which calculative practices, embedded in NPM reforms (Gruening, 2001; Hood, 1995), sustain changes in health care practice strategies (Hyndman and Lapsley, 2016; Lapsley and Segato, 2019). However, it is evident that health care systems continue to struggle

with cost containment (Chapman et al., 2016; OECD, 2021), as recently seen in Maryland's Medicare agreement to introduce global budgets to contain costs (Maryland Health Services Cost Review Commission, 2015; see also Malmmose et al., 2018; Patel et al., 2015).

Critics of NPM in health care, both in academia and in practice, began voicing concerns during the early 1990s (see, e.g., Chua and Preston, 1994; Lawrence et al., 1994; Preston et al., 1992; Samuel et al., 2005). In New Zealand, criticism of market-based reforms led to a 'Coalition Agreement' in the late 1990s (Ashton, 1996, 2001). Similarly, in the UK, the 1997 white paper, 'The new NHS: modern, dependable', in which the third principle is 'to get the NHS to work in partnership' (The UK National Health Service, 1997), displays an emerging focus on partnerships. Early adopters, such as New Zealand and the UK, thus switched from the market-based model of NPM towards a more inclusive focus on non-financial indicators and health targets, including the wellness of patients (New Zealand's Minister of Health Hon Annette King, 2000; New Zealand Parliament, 2009). This movement was supported by the WHO's 2000 report, 'Health Systems – Improving Performance' (World Health Organization, 2000), which noted awareness of performance measures in various forms to steer health care settings. In 2004, the WHO argued for explicitly defined hospital standards. Such standards remain central today, as shown in the WHO's series of 'Targets and Indicators for Health 2020'. Yet the standards have now diffused into health care areas beyond the original economic focus.

Thus, we now see an increasing application of non-financial standards. This development is driven by an urge to emphasise the patient – particularly a wish for equal access to health care. It has sparked a conceptual focus on 'quality', with the underlying assumption that 'quality' connotes something good and positive for the patient (Dahler-Larsen, 2019). Thus, criticism of the NPM's focus on costing in health care has sparked a keen interest in non-financial elements. As highlighted by the WHO in 2006, 'Even where health systems are well developed and resourced, there is clear evidence that quality remains a serious concern' (p. 3). Patient groups and other stakeholders started pressuring health care systems to widen the perspectives of their governance (Cardinaels and Soderstrom, 2013). Financial standards were acknowledged to be incomplete because they do not capture all aspects of what is essential in a patient treatment situation. Additionally, several studies have focused on the medical profession's resistance to NPM financial reform efforts (Broadbent et al., 2001; Jacobs, 2005; Tummers and Van de Walle, 2012); thus, quality as a concept gains a foothold.

Quality serves as a meta-discourse, encompassing multiple aspects and perspectives of life. Quality can be deemed lesser or greater (Dahler-Larsen, 2019). Dahler-Larsen (2019) claims explicitly that the NPM movement enabled this focus on quality, saying,

NPM as a movement helped neutralize the public/private distinction, thereby allowing the transport of quality regimes inspired by private industry and service into the management of public service organizations … The general positivity of the notion of quality helped take the focus away from what might be lost when other discourses lost attraction. Quality also helped nullify potential social contradictions and conflicts.
Quality is for all, and apparently without opposition. (pp. 39–40)

This plausible explanation of the development and introduction of 'quality' in health care clarifies how quality has become manifest in several forms of representation. We may identify quality in initiatives ranging from accreditation models in health care, representing a specific level of standards (Agrizzi et al., 2016; Tabrizi and Gharibi, 2019), through value-based management aiming for high quality balanced with costs (Groenewoud et al., 2019), to various patient-centred projects translated into quality standards (Malmmose and Kure, 2021). Research has demonstrated

that including quality metrics lowers medical profession resistance to NPM reforms and enables the inclusion of this profession in performance measurement agendas (Pflueger, 2020). Consequently, reforms focus increasingly on quality. For example, a 2008 UK white paper, 'High quality care for all', developed with 2,000 clinicians and other health care workers, described new visions for high-quality services. It emphasises that clinicians and service providers should consider quality in everything they do, specifically focusing on performance, effectiveness and standards.

In the following period, quality commissions emerged. For example, a quality commission, 'The Care Quality Commission', was established in 2009 in the UK, merging three regulators of health care, mental health and social care (Parkin, 2020, p. 4). A comparable commission, the 'Health Quality & Safety Commission' in New Zealand, was established in 2010, seeking to integrate the work of clinicians, providers and consumers to improve patient health and safety (The Health Quality & Safety Commission, 2011). In the US, the Department of Health and Human Services had established a similar organisation, the 'Agency for Healthcare Research and Quality', in 1989 (Eisenberg, 2000). This agency initially focused more on clinical improvement within some regions of diagnosis, resembling clinical databases rather than a quality commission; in the following years, however, it developed to include the quality improvement areas outlined by the WHO, as mentioned above. Such establishments strongly encourage reforming quality standards in health care. Within this work, the patient becomes increasingly central; therefore, with these quality standards, a keen interest in empowering patients has grown (see, e.g., New Zealand Health Strategy, 2016; WHO, 2000). In 2000, the WHO labelled this ideology 'New Universalism', defined as 'delivery to all of high-quality essential care, defined mostly by criteria of effectiveness, cost and social acceptability' (p. xiii). This ideology emphasises individual choice and responsibility.

In summary, a cost-containment focus was introduced in health care systems in the 1980s and 1990s due to increasing health care costs and technological developments. However, as critics highlighted the negative consequences of a singular financial focus, non-financial indicators came into focus for health care systems and policymakers. The driving forces of this movement were the desire to give patients the best possible treatment, which includes different quality standards, and to involve patients in decision-making, which had been neglected in the productivity regimes of the 1980s and 1990s.

POLITICAL AND ADMINISTRATIVE CONSEQUENCES OF THE USE OF PERFORMANCE MEASURING

The primary consequences of applying performance measures as core managerial tools concern data reliability, the encouragement of opportunistic behaviour through integrated accountability schemes and, finally, a tendency to believe that performance measurement systems are capable of capturing all necessary aspects of managing.

One of the first issues identified with the costing regime of the original NPM reform wave was the dilemma of relating cost information to the ethical issues which shape clinical decision-making (Fischer and Ferlie, 2013; Hill et al., 2001; Kurunmäki et al., 2003; Preston et al., 1992). For example, Chua and Preston (1994) highlight that the new emphasis is on cost control and containment instead of calculating cost reimbursements. They further point to the

conundrum of 'the taken-for-grantedness of seemingly neutral accounting concepts such as "costs"' (pp. 4–5). Oakes et al. (1994) describe the problematic avenues of assuming that cost data represent objective evaluations of treatments. They depict the expanding role of monetary concerns in the jurisdiction of medicine, which silences other interests (e.g., in the US). This recalls the discussion of the powerful element of accounting numbers or calculative practices raised earlier by Miller and O'Leary (1987). In a similar vein, Arnold et al. (1994) suggest that a strong focus on costs dismisses democratic decision-making processes, as further demonstrated in the work of Malmmose (2015a), which shows how a cost focus silences other medical viewpoints in debates surrounding the restructuring of health care. Numerous studies have shown similar concerns (Arnold et al., 1994; Preston et al., 1992; Samuel et al., 2005), and in particular, researchers have highlighted how doctors engage in more financial responsibilities than in the past (Jacobs, 2005; Kurunmäki, 2004). Such engagements have also been highlighted by Donabedian (1983, 1985) as problematic, as they put practitioners into the dilemma of having to weigh social priorities against individual patient needs.

The technical application of DRG cost practices has likewise shown doubtful implications. As noted above, DRGs aim to increase transparency and improve efficiency (Busse, 2011). Research shows that cost data have a reliability issue: governments that collect cost data from different providers encounter variations in the practices of health care providers, raising doubts about the fairness of the prospective payment system (Chapman et al., 2014, p. 354). Issues with the calculative practices of cost accounting from DRGs persist because these methods are seldom questioned, owing to organisations' tendency to take numbers for granted.

Beyond such calculative issues, DRGs have proven to encourage different forms of inexpedient behaviour due to their application as performance measures, whether for reimbursement, budgets and resource allocations, or to measure productivity levels. Thus, researchers have detected issues such as manipulated waiting lists and other forms of 'gaming the numbers' as a ripple effect of these applications (Buchanan and Storey, 2010; Lægreid and Neby, 2016; Neby et al., 2015). The 2015 study by Neby et al. demonstrates various forms of system manipulation by providers in Norway and Germany, focused on adding extra DRG codes to receive more funding or show more productivity, or on withholding correct information on treatments from patients in order to favour providers' own waiting lists. Consequently, when DRGs are applied as prospective payment systems, they provide incentives for hospitals to treat more patients and limit the treatment service each patient receives (Busse, 2011; Chua and Preston, 1994), resulting in consequences such as cherry-picking and dumping (Busse, 2011; Busse et al., 2013; Chua and Preston, 1994). 'Cherry-picking occurs if certain patients within one group are systematically more costly than others, leading to incentives for hospitals to select the less costly, more profitable cases and to transfer or avoid the unprofitable ones (dumping)' (Busse et al., 2013, p. 2).
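To make this incentive structure concrete, the following minimal Python sketch illustrates the cherry-picking and dumping logic described by Busse et al. (2013): under a flat DRG tariff, every case in the group earns the same payment, so per-patient margins diverge with case complexity. The tariff and all patient costs are hypothetical figures chosen purely for illustration; they are not drawn from the chapter or its sources.

```python
# Hypothetical illustration of cherry-picking/dumping incentives under a flat
# DRG tariff: the payment per case is fixed, so the margin on each patient
# depends entirely on that patient's actual treatment cost.

DRG_TARIFF = 5_000  # assumed fixed payment per case in this DRG (hypothetical)

# Hypothetical patients grouped under the same DRG, with heterogeneous costs
patients = {
    "uncomplicated case": 3_200,
    "average case": 4_900,
    "multimorbid case": 7_400,  # systematically more costly than the tariff
}

for label, cost in patients.items():
    margin = DRG_TARIFF - cost
    # Positive margin -> incentive to select the case ("cherry-pick");
    # negative margin -> incentive to transfer or avoid it ("dump").
    action = "select (cherry-pick)" if margin > 0 else "transfer/avoid (dump)"
    print(f"{label}: cost {cost}, margin {margin:+d} -> incentive to {action}")
```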
These consequences have raised concerns about DRGs' appropriateness when considered from a patient perspective (Cardinaels and Soderstrom, 2013; World Health Organization, 2000). In this line are concerns about potential overutilisation. For example, in 2008, DRGs encouraged streamlining shoulder surgeries in Denmark, which led to a tremendous increase in surgeries – up to four times more than before – initiating a debate on whether surgery should be prioritised over physical therapy (Bjerno, 2008). Similar concerns about overutilisation are discussed in the US (Nassery et al., 2015), where research also highlights a tendency towards overcrowded outpatient services (Bahadori et al., 2017; Januleviciute et al., 2016; Kaplan and Witkowski, 2014; Livingstone

and Balachandran, 1977). While outpatient surgeries may be beneficial to the patient in many cases, there has been a tendency either to perform unnecessary procedures on patients or to split patients' treatments across different days in order to improve productivity performance measures (Malmmose and Kure, 2021), while Chua and Preston (1994) raised the concern that DRG rates drive the increase in outpatient treatments, believing that, in some cases, costs are merely transferred from hospitals to other care providers and families.

In light of a growing political understanding and awareness of the consequences, described above, of a singular productivity or cost accounting focus, several initiatives have been put into place. Firstly, most countries today have systems in place which prevent overutilisation and short-term re-admissions (Busse et al., 2013). Additionally, and more profoundly, different 'quality' initiatives have emerged to focus on non-financial performance measures, with the aim of supplementing a cost accounting focus with other perspectives – particularly those of patients (Cardinaels and Soderstrom, 2013). Yet, at present, the concept of quality within health care is ambiguous and multifaceted.

Two forms of quality appear in the existing health care literature. First is the tendency towards a quantitative conceptualisation of quality. Llewellyn (1993), for example, addressed the combination of cost and quality in accounting functions three decades ago, indicating that the issue of balancing cost and quality is far from novel. Llewellyn (1993) defines quality as health care delivery, noting that 'at the level of the individual, costs could be related to patient satisfaction' (p. 192). Thus, in this framework, quality is directly linked to and identified through cost. Likewise, Forgione et al. (2005) apply patient mortality and medical misadventures as proxies for quality; they acknowledge, however, that these are merely rough measures of quality. This is mirrored in policy developments, where we see a continuous struggle to define what quality means and to find more applicable ways of ensuring quality in health care work (Malmmose and Fouladi, 2019; Malmmose et al., 2018). Pflueger (2016) highlights, in his study of a customer survey in health care, that accounting 'was extended so as to complement and make up for the financial, to add more dimensions to financialised customers, and to add more qualities to the economy of care' (p. 29). The literature thus shows that ample issues and struggles emerge when working with performance measures, regardless of whether they are financial or non-financial.

Quality appears to be a core concept of consensus seeking in many policy developments. As Bouillon et al. (2006) note, performance improvements only occur when nurses, physicians and management personnel reach consensus on strategic decision-making. Thus, quality is a common denominator in which all stakeholders are interested. However, due to quality's ambiguity as a measure in health care, policymakers are left struggling to specify what it really is. Conversely, there may be a need to let go of such contemplations; for example, studies have shown how narrowly defined quality improvement initiatives have failed (Ehlers et al., 2017; Malmmose and Kure, 2021; Vaz and Araujo, 2022) or are at least controversial (Dixon-Woods and Martin, 2016).
Vaz and Araujo (2022) conducted a literature review on quality failures in health care, defining three different areas in which failures occur: patient-centred quality, concern for human factors (culture, leadership and workgroups) and technical issues of measurement information and standardisation. These concerns are in line with Jordan and Messner's (2012) distinction between narrow and broad incompleteness, where narrow incompleteness represents technical issues and faults in measurements, such as invalid or improper accounting data. In contrast, the 'softer' or more fluid side of quality (Pflueger, 2016) represents broader forms of incompleteness. As such, it is associated with flexibility; for example, if 'managers see the indicators only as part of what they should pay attention to, they

Governance in public health care  253 will exercise more flexibility in linking their actions to these indicators’ (Jordan and Messner, 2012, p. 559). In summary, while an ongoing tension between financial and non-financial indicators is continuously present in health care governance – how to balance cost and quality – another problematic issue is the perception of such indicators and understanding of what they represent. For example, performance-based measures will fall short in fully representing quality of care. While a few reforms have started supplementing these performance measures with ones that prioritise dialogue, the contradictory aspect of these efforts is that dialogue is difficult to account for.

CONCLUSION

The complexity of health care organisational structures, work processes and patient flows complicates the development of suitable performance management systems. In this chapter, two tracks of performance measures are identified: financial and non-financial measures. More specifically, these measures predominantly represent cost and quality indicators. The primary international financial tools in health care settings are DRG indicators. Additionally, parallel global attempts to restructure quality indicators are identified through the application of accreditation models and other quality indicator systems.

Miller (1998) notes how the addition of practices in accounting occurs through the process of problematising current practices. This perspective brings forth changes in health care organisations' performance measurement systems, highlighting that performance measures in particular are often identified as insufficient or incomplete. Still, the continuous effort put into developing various performance measurement systems may reveal that indicators simplify quality and make it manageable. Additionally, if used properly, performance measures can add perspective, visualise issues and enable problematisation, similarly to other approaches. The constant efforts to improve performance measures, and thus change them, can be traced in shifting health care strategies as health care organisations and policymakers struggle to find measurement systems capable of capturing patient needs and medical possibilities while accounting for costs. Expanding the repertoire of performance measures walks a fine line between ideally wishing for complete systems and utilising the systems' incompleteness to discuss and improve strategies and issues.

As sustainability enters the global consciousness in all avenues of organisation and life, we can expect additional challenges in establishing (in)complete performance management systems. Thus, the complexity of balancing cost with quality indicators and representing different stakeholders is likely to increase in the future. While we have not yet seen sustainability goals incorporated into national policies, they may well become part of them in the future. Already, a proposal from the WHO (2018) – sparked by the UN's sustainability goals – advises health care organisations to incorporate sustainability strategies. As this assessment of health care policy developments has described, the WHO plays an active role in global reform developments, suggesting that topics addressed by the WHO may come to play a leading role in future national reform developments.


REFERENCES

Agrizzi, D., Agyemang, G., & Jaafaripooyan, E. (2016). Conforming to accreditation in Iranian hospitals. Accounting Forum, 40, 106–24.
Arnaboldi, M., Lapsley, I., & Steccolini, I. (2015). Performance management in the public sector: The ultimate challenge. Financial Accountability & Management, 31, 1–22.
Arnold, P.J., Hammond, T.D., & Oakes, L.S. (1994). The contemporary discourse on health care cost: Conflicting meanings and meaningful conflicts. Accounting, Auditing & Accountability Journal, 7, 50–67.
Ashton, T. (1996). Health care systems in transition: New Zealand – Part I: An overview of New Zealand's health care system. Journal of Public Health Medicine, 18, 269–73.
Ashton, T. (2001). The rocky road to health reform: Some lessons from New Zealand. Australian Health Review, 24, 151–6.
Bahadori, M., Teymourzadeh, E., Ravangard, R., et al. (2017). Factors affecting the overcrowding in outpatient healthcare. Journal of Education and Health Promotion, 6, 21.
Balakrishnan, R., Koehler, D.M., & Shah, A.S. (2018). TDABC: Lessons from an application in healthcare. Accounting Horizons, 32(4), 31–47.
Bevan, G., & Hood, C. (2006). What's measured is what matters: Targets and gaming in the English public health care system. Public Administration, 84, 517–38.
Bjerno, S.M. (2008). Flere kirurger skaber boom i skulderoperationer [More surgeons create boom in shoulder surgeries]. Dagens Medicin [Daily Medicine].
Borden, J.P. (1988). An assessment of the impact of diagnosis-related group (DRG)-based reimbursement on the technical efficiency of New Jersey hospitals using data envelopment analysis. Journal of Accounting and Public Policy, 7, 77–97.
Bouillon, M.L., Ferrier, G.D., Stuebs, M.T., et al. (2006). The economic benefit of goal congruence and implications for management control systems. Journal of Accounting and Public Policy, 25, 265–98.
Bredenkamp, C., Bales, S., & Kahur, K. (2020). Transition to diagnosis-related group (DRG) payments for health: Lessons from case studies. International Bank for Reconstruction and Development and The World Bank.
Broadbent, J., Jacobs, K., & Laughlin, R. (2001). Organisational resistance strategies to unwanted accounting and finance changes: The case of general medical practice in the UK. Accounting, Auditing & Accountability Journal, 14, 565–86.
Buchanan, D.A., & Storey, J. (2010). Don't stop the clock: Manipulating hospital waiting lists. Journal of Health Organization and Management, 24, 343–60.
Buetow, S.A., & Roland, M. (1999). Clinical governance: Bridging the gap between managerial and clinical approaches to quality of care. Quality in Health Care, 8, 184.
Busse, R. (2011). Diagnosis-related groups in Europe: Moving towards transparency, efficiency, and quality in hospitals. European Observatory on Health Systems and Policies Series. Open University Press.
Busse, R., Geissler, A., Aaviksoo, A., et al. (2013). Diagnosis related groups in Europe: Moving towards transparency, efficiency, and quality in hospitals? BMJ, 346, f3197, 7 June. doi: 10.1136/bmj.f3197.
Cardinaels, E., & Soderstrom, N. (2013). Managing in a complex world: Accounting and governance choices in hospitals. European Accounting Review, 22, 647–84.
Chapman, C., Kern, A., & Laguecir, A. (2014). Costing practices in healthcare. Accounting Horizons, 28, 353–64.
Chapman, C., Kern, A., Laguecir, A., et al. (2016). Management accounting and efficiency in health services: The foundational role of cost analysis. In J. Cylus, I. Papanicolas, & P.C. Smith (Eds.), Health system efficiency: How to make measurement matter for policy and management (pp. 75–98). World Health Organization.
Christensen, T., & Lægreid, P. (2011). Democracy and administrative policy: Contrasting elements of New Public Management (NPM) and post-NPM. European Political Science Review, 3, 125–46.
Chua, W.F., & Preston, A. (1994). Worrying about accounting in health care. Accounting, Auditing & Accountability Journal, 7, 4–17.
Clatworthy, M., Mellett, H., & Peel, M. (2000). Corporate governance under 'New Public Management': An exemplification. Corporate Governance: An International Review, 8, 166–76.

Covaleski, M.A., & Dirsmith, M.W. (1983). Budgeting as a means for control and loose coupling. Accounting, Organizations and Society, 8, 323–41.
Dahler-Larsen, P. (2019). Quality: From Plato to performance. Springer International Publishing.
Dixon-Woods, M., & Martin, G.P. (2016). Does quality improvement improve quality? Future Hospital Journal, 3, 191–4.
Donabedian, A. (1983). Quality, cost, and clinical decisions. The Annals of the American Academy of Political and Social Science, 468, 196–204.
Donabedian, A. (1985). Some thoughts on cost containment and the quality of health care. Administration in Mental Health, 13, 5–14.
Ehlers, L.H., Jensen, M.B., Simonsen, K.B., et al. (2017). Attitudes towards accreditation among hospital employees in Denmark: A cross-sectional survey. International Journal for Quality in Health Care, 29, 693–8.
Eisenberg, J.M. (2000). The Agency for Healthcare Research and Quality: New challenges, new opportunities. Health Services Research, 35, xi–xvi.
Ellwood, S. (1996). Full-cost pricing rules within the National Health Service internal market – accounting choices and the achievement of productive efficiency. Management Accounting Research, 7, 25–52.
Ezzamel, M., & Willmott, H. (1993). Corporate governance and financial accountability: Recent reforms in the UK public sector. Accounting, Auditing & Accountability Journal, 6, 109–32.
Fischer, M.D., & Ferlie, E. (2013). Resisting hybridisation between modes of clinical risk management: Contradiction, contest, and the production of intractable conflict. Accounting, Organizations and Society, 38(1), 30–49.
Forgione, D.A., Vermeer, T.E., Surysekar, K., et al. (2005). DRGs, costs and quality of care: An agency theory perspective. Financial Accountability & Management, 21, 291–308.
Groenewoud, A.S., Westert, G.P., & Kremer, J.A.M. (2019). Value based competition in health care's ethical drawbacks and the need for a values-driven approach. BMC Health Services Research, 19, 256.
Groot, T. (1999). Budgetary reforms in the non-profit sector: A comparative analysis of experiences in health care. Financial Accountability & Management, 15, 353–77.
Gruening, G. (2001). Origin and theoretical basis of New Public Management. International Public Management Journal, 4, 1–25.
Haslam, C., & Lehman, G. (2006). Accounting for healthcare: Reform and outcomes. Accounting Forum, 30, 319–23.
Hill, W.Y., Fraser, I., & Cotton, P. (2001). On patients' interests and accountability: Reflecting on some dilemmas in social audit in primary health care. Critical Perspectives on Accounting, 12(4), 453–69.
Hood, C. (1991). A public management for all seasons? Public Administration, 69, 3–19.
Hood, C. (1995). The 'new public management' in the 1980s: Variations on a theme. Accounting, Organizations and Society, 20, 93–110.
Hurst, L., Mahtani, K., Pluddemann, A., et al. (2019). Defining value-based healthcare in the NHS. CEBM and Oxford University.
Hyndman, N., & Lapsley, I. (2016). New Public Management: The story continues. Financial Accountability & Management, 32, 385–408.
Jackson, A., & Lapsley, I. (2003). The diffusion of accounting practices in the new 'managerial' public sector. The International Journal of Public Sector Management, 16, 359–72.
Jacobs, K. (2005). Hybridisation or polarisation: Doctors and accounting in the UK, Germany and Italy. Financial Accountability & Management, 21, 135–62.
Januleviciute, J., Askildsen, J.E., Kaarboe, O., et al. (2016). How do hospitals respond to price changes? Evidence from Norway. Health Economics, 25, 620–36.
Johns Hopkins Bloomberg School of Public Health & Maryland Hospital Association. (2015). Global budgeting for hospital services: A webcast series. Available at: http://www.jhsph.edu/departments/health-policy-and-management/news-and-events/global-budget.html. Accessed 15 July 2022.
Jordan, S., & Messner, M. (2012). Enabling control and the problem of incomplete performance indicators. Accounting, Organizations and Society, 37, 544–64.
Kaplan, R.S. (2014). Improving value with TDABC. Healthcare Financial Management, 68, 76–83.
Kaplan, R.S., & Porter, M.E. (2011). How to solve the cost crisis in health care. Harvard Business Review, 89(9), 47–64.

Kaplan, R.S., & Witkowski, M.L. (2014). Better accounting transforms health care delivery. Accounting Horizons, 28, 365–83.
Kaplan, R.S., Porter, M.E., & Frigo, M.L. (2017). Managing healthcare costs and value. Strategic Finance, 98(7), 24–33.
Keel, G., Savage, C., Rafiq, M., et al. (2017). Time-driven activity-based costing in health care: A systematic review of the literature. Health Policy, 121, 755–63.
King, M., Lapsley, I., Mitchell, F., et al. (1994). Costing needs and practices in a changing environment: The potential for ABC in the NHS. Financial Accountability & Management, 10, 143–61.
Kurunmäki, L. (2004). A hybrid profession – the acquisition of management accounting expertise by medical professionals. Accounting, Organizations and Society, 29, 327–47.
Kurunmäki, L., & Miller, P. (2006). Modernising government: The calculating self, hybridisation and performance measurement. Financial Accountability & Management, 22, 87–106.
Kurunmäki, L., & Miller, P. (2008). Counting the costs: The risks of regulating and accounting for health care provision. Health, Risk & Society, 10, 9–21.
Kurunmäki, L., Lapsley, I., & Melia, K. (2003). Accountingization v. legitimation: A comparative study of the use of accounting information in intensive care. Management Accounting Research, 14(2), 112–39.
Lægreid, P., & Neby, S. (2016). Gaming, accountability and trust: DRGs and activity-based funding in Norway. Financial Accountability & Management, 32, 57–79.
Lapsley, I. (2001). Accounting, modernity and health care policy. Financial Accountability & Management, 17, 331–50.
Lapsley, I. (2008). The NPM agenda: Back to the future. Financial Accountability & Management, 24, 77–96.
Lapsley, I., & Segato, F. (2019). Citizens, technology and the NPM movement. Public Money & Management, 39(6), 1–7.
Lapsley, I., & Wright, E. (2004). The diffusion of management accounting innovations in the public sector: A research agenda. Management Accounting Research, 15, 355–74.
Lawrence, S., Alam, M., & Lowe, T. (1994). The great experiment: Financial management reform in the NZ health sector. Accounting, Auditing & Accountability Journal, 7, 68–95.
Lehtonen, T. (2007). DRG-based prospective pricing and case-mix accounting – exploring the mechanisms of successful implementation. Management Accounting Research, 18, 367–95.
Leung, T.I., & van Merode, G.G. (2019). Value-based health care supported by data science. In P. Kubben, M. Dumontier, & A. Dekker (Eds.), Fundamentals of clinical data science (pp. 193–212). Springer International Publishing.
Livingstone, J.L., & Balachandran, K.R. (1977). Cost and effectiveness of physician peer review in reducing medicare overutilization. Accounting, Organizations and Society, 2, 153–64.
Llewellyn, S. (1993). Linking costs with quality in health and social care: New challenges for management accounting. Financial Accountability & Management, 9, 177–95.
Mainz, J. (2003). Defining and classifying clinical indicators for quality improvement. International Journal for Quality in Health Care, 15, 523–30.
Malmmose, M. (2015a). Management accounting versus medical profession discourse: Hegemony in a public health care debate – a case from Denmark. Critical Perspectives on Accounting, 27, 144–59.
Malmmose, M. (2015b). National hospital development, 1948–2000: The WHO as an international propagator. Accounting History Review, 25, 239–59.
Malmmose, M., & Fouladi, N. (2019). Accounting facilitating socio-political aims: The case of Maryland hospitals. Financial Accountability & Management, 35, 413–29.
Malmmose, M., & Kure, N. (2021). Putting the patient first? The story of a decoupled hospital management quality initiative. Critical Perspectives on Accounting, 80, 102233.
Malmmose, M., & Lydersen, J.P. (2021). From centralized DRG costing to decentralized TDABC – assessing the feasibility of hospital cost accounting for decision-making in Denmark. BMC Health Services Research, 21, 835.
Malmmose, M., Mortensen, K., & Holm, C. (2018). Global budgets in Maryland: Early evidence on revenues, expenses, and margins in regulated and unregulated services. International Journal of Health Economics and Management, 1–14. http://doi.org/10.1007/s10754-018-9239-y.

Mannion, R., Shekelle, P.G., Whittaker, S., et al. (2018). Health care systems: Future predictions for global care. CRC Press.
Maryland Health Services Cost Review Commission. (2015). Completed agreements under the all-payer model – global budget revenue overview presentation. Available at: https://hscrc.maryland.gov/Pages/gbr-tpr.aspx. Accessed 29 August 2023.
Miller, P. (1998). The margins of accounting. European Accounting Review, 7, 605–21.
Miller, P., & O'Leary, T. (1987). Accounting and the construction of the governable person. Accounting, Organizations and Society, 12, 235–65.
Moll, P. (2018). Strategi og Styring med Effekt [Effective strategy and control]. DJØF.
Mosadeghrad, A.M. (2021). Hospital accreditation: The good, the bad, and the ugly. International Journal of Healthcare Management, 14, 1597–601.
Nassery, N., Segal, J.B., Chang, E., et al. (2015). Systematic overuse of healthcare services: A conceptual model. Applied Health Economics and Health Policy, 13, 1–6.
Neby, S., Lægreid, P., Mattei, P., et al. (2015). Bending the rules to play the game: Accountability, DRG and waiting list scandals in Norway and Germany. European Policy Analysis, 1, 127–48.
New Zealand's Minister of Health Hon Annette King. (2000). The New Zealand health strategy. The New Zealand Government.
New Zealand Parliament. (2009). New Zealand Health System reforms. Parliamentary Library Research Paper.
New Zealand Government. (2016). New Zealand Health Strategy – future direction. Ministry of Health. April. ISBN: 978-0-947491-87-1 (online).
Northcott, D., & Llewellyn, S. (2001). X-ray vision. Financial Management, 36–7.
Oakes, L.S., Considine, J., & Gould, S. (1994). Counting health care costs in the United States: A hermeneutical study of cost benefit research. Accounting, Auditing & Accountability Journal, 7, 18–49.
OECD. (2021). Health expenditure and financing. Organisation for Economic Co-operation and Development.
Oxley, H., & MacFarlan, M. (1995). Health care reform: Controlling spending and increasing efficiency. OECD Observer, 192, 23–7.
Parkin, E. (2020). The Care Quality Commission. House of Commons Library Briefing Paper, Number 08754, 1 May. https://researchbriefings.files.parliament.uk/documents/CBP-8754/CBP-8754.pdf. Accessed 29 August 2023.
Patel, A., Rajkumar, R., Colmers, J.M., et al. (2015). Maryland's global hospital budgets – preliminary results from an all-payer model. New England Journal of Medicine, 373, 1899–901.
Pettersen, I.J. (2001). Implementing management accounting reforms in the public sector: The difficult journey from intentions to effects. The European Accounting Review, 10, 561–81.
Pflueger, D. (2016). Knowing patients: The customer survey and the changing margins of accounting in healthcare. Accounting, Organizations and Society, 53, 17–33.
Pflueger, D. (2020). Quality improvement for all seasons: Administrative doctrines after New Public Management. Financial Accountability & Management, 36, 90–107.
Pflueger, D., & Zinck Pedersen, K. (2022). Assembling homo qualitus: Accounting for quality in the UK National Health Service. European Accounting Review, 32(4), 875–902.
Pollitt, C., & Bouckaert, G. (2004). Public management reform – a comparative analysis (2nd ed.). Oxford University Press.
Porter, M.E. (2010). What is value in health care? The New England Journal of Medicine, 363(26), 2477–81.
Porter, M.E., & Lee, T.H. (2013). The strategy that will fix health care. Harvard Business Review, October, 50.
Preston, A.M. (1992). The birth of clinical accounting: A study of the emergence and transformation of discourses on costs and practices of accounting in U.S. hospitals. Accounting, Organizations and Society, 17, 63–101.
Preston, A.M., Cooper, D.J., & Coombs, R.W. (1992). Fabricating budgets: A study of the production of management budgeting in the National Health Service. Accounting, Organizations and Society, 17, 561–94.
Robson, N. (2008). Costing, funding and budgetary control in UK hospitals. Journal of Accounting & Organizational Change, 4, 343–62.

Samuel, S., Dirsmith, M.W., & McElroy, B. (2005). Monetized medicine: From the physical to the fiscal. Accounting, Organizations and Society, 30, 249–78.
Tabrizi, J.S., & Gharibi, F. (2019). Primary healthcare accreditation standards: A systematic review. International Journal of Health Care Quality Assurance, 32, 310–20.
Tan, S.S., Serdén, L., Geissler, A., et al. (2011). DRGs and cost accounting: Which is driving which? In R. Busse, A. Geissler, W. Quentin, et al. (Eds.), Diagnosis-related groups in Europe: Moving towards transparency, efficiency and quality in hospitals (pp. 59–74). McGraw Hill.
Tan, S.S., Geissler, A., Serdén, L., et al. (2014). DRG systems in Europe: Variations in cost accounting systems among 12 countries. European Journal of Public Health, 24, 1023–8.
The Danish Ministry of Health. (2019). The patient first [Patienten Først] (pp. 1–44). The Ministry of Health.
The Health Quality & Safety Commission. (2011). Introducing the Health Quality & Safety Commission. Best Practice Journal, 36.
The UK National Health Service. (1997). The new NHS: modern, dependable. UK Department of Health.
Triantafillou, P. (2014). Against all odds? Understanding the emergence of accreditation of the Danish hospitals. Social Science & Medicine, 101, 78–85.
Triantafillou, P. (2020). Accounting for value-based management of healthcare services: Challenging neoliberal government from within? Public Money & Management, 1–10. doi: 10.1080/09540962.2020.1748878.
Tummers, L., & Van de Walle, S. (2012). Explaining health care professionals' resistance to implement Diagnosis Related Groups: (No) benefits for society, patients and professionals. Health Policy, 108, 158–66.
Vaz, N., & Araujo, C. (2022). Failure factors in healthcare quality improvement programmes: Reviewing two decades of the scientific field. International Journal of Quality and Service Sciences, 14, 291–310.
World Health Organization. (1978). Health for all – Declaration of Alma-Ata. In Report of the International Conference on Primary Health Care. WHO.
World Health Organization. (1988). The application of diagnosis-related groups (DRGs) for hospital budgeting and performance measurement. World Health Organization.
World Health Organization. (2000). The world health report 2000: Health systems – improving performance. WHO.
World Health Organization. (2003). Quality and accreditation in health care services – a global review. World Health Organization.
World Health Organization. (2006). Quality of care – a process for making strategic choices in health systems, pp. 1–50.
World Health Organization. (2018). Tool for mapping governance for health and well-being: The organigraph method. WHO Regional Office for Europe.
World Health Organization, OECD, & World Bank. (2018). Delivering quality health services – a global imperative for universal coverage. WHO, p. 100.

17. Made to measure: how central banks deliver performances of their worth and why unconventional monetary policy is reversing the burden of proof

Timo Walter

INTRODUCTION

Although they are rarely discussed in such a context, contemporary central banking and monetary policy constitute almost a 'poster child' case of 'governing by numbers' (Rose 1991). Since vanquishing the Great Inflation in the 1980s, central banks have evolved into a prototype of the independent technical agency tasked with the provision of a clearly demarcated public good that characterizes the 'regulatory capitalism' of the late 20th and early 21st centuries (Tucker 2018; Wansleben 2021). If central banking is rarely discussed as an issue of 'measuring (of) governance', and especially measuring performance, it is because the 'myth' of central bank independence (Binder and Spindel 2017) effectively obscures how central banks' autonomy as organizational actors depends on their ability 'to prove their worth by measuring their activities and results' (see Introduction in this volume).

In recent years, research in the fields of economic sociology and anthropology as well as political economy has increasingly ventured 'inside the black box' of how central banks govern and operate (Zayim 2022). This work has begun to uncover how central banks' independence is a practical accomplishment achieved under particular social, political and economic conditions. It shows that central banks' autonomy depends on a careful orchestration of performances that demonstrate their technical expertise, credibility and agency to multiple audiences. Central banks' autonomous agency is thus a performative accomplishment that depends on the construction of institutionally legitimate(d) scripts of competent action (Meyer and Jepperson 2000), demonstrable adherence to which continuously renders the organizational actor's performance measurable and legible in terms of institutionally legitimated orders of worth (Shore and Wright 2015, 23).

Demonstrating effective technical agency constitutes a particularly acute problem for monetary policy, which 'is generally focused on some relatively complex and uncertain object of knowledge' (Abolafia 2012, 169), namely controlling future inflation. Central banks have thus been forced to construct technical scripts allowing them to create a credible linkage between this 'proximate object of knowledge and a more remote object of knowledge. The former is chosen for its immediate efficacy in having an influence on the latter, less accessible goal' (Abolafia 2012, 169).

This chapter, therefore, seeks to show how contemporary central bank agency depends on the measurability of monetary policy, by looking specifically at the evolution and functioning of inflation targeting as the modern organizational script for central banking (Bernanke and Mishkin 1997). It first discusses how central banks constructed an uncontroversial measure of 'successful' monetary policy, allowing them to maintain a position of equidistance to several

social fields and their respective publics (e.g. Thiemann et al. 2021). Second, it shows how central banks' success in accomplishing autonomy has paradoxically entangled their agency with structural contradictions in global finance that undermine the measurable performance of monetary policy success. Finally, the chapter highlights how the 'unconventional' monetary policies (Borio et al. 2018) that central banks have adopted in the wake of the 2007–09 financial crisis to address these contradictions have created novel challenges for their attempts to restore a more unequivocal measurability of their policies.

HOW INFLATION TARGETING MADE MONETARY POLICY PERFORMANCE 'MEASURABLE'

Beginning in the 1980s, central banking has undergone a profound transformation. Where before monetary policy was the ugly duckling of fiscally oriented Keynesian macroeconomic management, since then it 'has emerged as a distinct and highly visible public policy domain', in which 'central bankers have acquired unprecedented power and now count as the quintessential technocratic authorities of our time' (Wansleben 2018, 774). In part, this change of central banks' fortunes can be explained by the diffusion of monetarist ideas about the 'neutrality' (Adolph 2013) of central banking, which has created an ideological environment highly conducive to the narrowing of monetary policy to the control of inflation and the creation of legally and operationally 'independent' central banks. However, the spectacular rise of central banks' (infra)structural power (Braun 2020; Walter and Wansleben 2020) is insufficiently explained as a 'functional' consequence of the importance of monetary policy for neoliberal supply-side reforms; it depends on a radical transformation of how monetary policy operates and of its social, political and economic conditions of possibility (see Goodhart 2015).

Until the 1980s, central banking 'resembled the administrative and regulatory agency through which other parts of the state act on the economy' (Braun 2020, 398), relying on market-constraining interventions such as controlling market interest rates or imposing credit ceilings. Since then, it has evolved into a market-based regime of governing that operates by rendering measurable the market processes that foreshadow the price level/inflation, and uses this legibility to conduct market expectations towards the desired outcome(s): 'modern' central banking thus works through a process of 'performative measurement' (Coombs 2020, 526). On this basis, by the early 2000s independent central banking had become a globally established standard and model (Polillo and Guillén 2005), and 'achieved an almost taken for granted quality in contemporary political life, with little questioning of its logic or effectiveness' (McNamara 2002, 47).

While in a formal-legal sense central banks' authority is delegated to them, in practice the conditions for such organizational autonomy need to be carefully constructed (Carpenter 2001; Goodhart 2015) – in this specific case, through a metamorphosis of central banking from an 'art' (Hawtrey 1970) into a fully 'scientized' agency (Marcussen 2006). Science provides a powerful means for legitimizing formal organization and its technical rationalities (Drori and Meyer 2006), particularly in cases where 'means–ends relationships are unclear or there is no agreement on performance criteria' (McNamara 2002, 64). To firmly establish their independence, central banks thus 'must demonstrate uniqueness and show that they can create solutions and provide services found nowhere else in the polity' (Carpenter 2001, 5). In light of their decisive role in defeating inflation in the 1980s, central banks appeared as

natural custodians of price stability. What scientization helped them achieve, however, was to translate this rather abstract public good into a concrete organizational script, adherence to which would render monetary policy's ability to govern price dynamics and control inflation visible and demonstrable. Drawing on state-of-the-art science allowed central banks to delineate a framework that measurably demonstrated the 'immediate efficacy' of monetary policy implementation as a 'proximate object' which would be 'having an influence on the … less accessible goal' of price stability (see Abolafia 2012, 169). While it took about a decade for this script to evolve into a scientifically codified (Bernanke and Mishkin 1997) technical standard of inflation targeting, there has been a global convergence on its key features since the 1990s (Polillo and Guillén 2005). This process of formal organizing, aligning monetary policy with formal economic knowledge and models, has involved two distinct, but interrelated, dimensions.

On the one hand, Rational Expectations (RE) economics and its formal models provided central banks both with a legitimating rhetoric and with a formal framework for organizing and describing how their manipulation of short-term interest rates (the 'implementation' of monetary policy goals) allows them to govern or control inflation over the long term (the remote object of price stability). While there remain profound doubts about the nature and extent of this 'control' (Blinder 2004, 77ff.), inflation targeting has proven a phenomenal success in measurably demonstrating central banks' technical agency. It removes the uncertainty about the effectiveness of central banks' control over future inflation by translating it into a technical procedure: the temporal transmission of present manipulations of the (short-term) interest rate into control over the (temporally) remote object of future inflation becomes a procedural issue of demonstrably shaping the proximate object of present expectations of that future (with RE providing the 'transmission belt'). This has allowed central banks to focus their technical acumen on developing and gradually 'formalizing' (Walter 2019) a well-delineated operative 'frame' of relevant variables within which precise control over the definition of the short-term interest rate in money markets becomes possible. This allows the central bank (in theory) 'to manage expectations of the future path of the official short-term rate' (Braun 2015, 370) and thus demonstrably 'target' a future inflation rate by shaping the expectations of the economic actors whose conduct produces the (future) price level – by 'anchor[ing] the very expectations it ostensibly measures' (Braun 2015, 379). To demonstrate their performative control over expectations, central banks therefore need markets to react 'in a way that is consistent with the central bank's model of the economy' (Braun 2015, 371), proving both the accuracy of central banks' expertise and the adequacy of their policy scripts.
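The logic by which an interest-rate instrument is tied, via a formal model, to an inflation objective is commonly illustrated in the economics literature with feedback rules of the Taylor type. The chapter itself presents no such rule, so the following Python sketch is only a hypothetical illustration of how inflation targeting renders policy legible as a measurable function of observable variables; the coefficients are the conventional textbook values from Taylor's well-known 1993 formulation, not parameters attributed to any particular central bank, and all input figures are invented.

```python
def taylor_rule(inflation, inflation_target, output_gap,
                neutral_real_rate=2.0, a_pi=0.5, a_y=0.5):
    """Taylor-type feedback rule: the short-term policy rate responds to
    deviations of inflation from its target and of output from potential.
    Used here only to illustrate how an inflation-targeting script makes
    policy 'success' checkable against a published, rule-like benchmark."""
    return (neutral_real_rate + inflation
            + a_pi * (inflation - inflation_target)
            + a_y * output_gap)

# Hypothetical readings: inflation at 3% against a 2% target, output 1% above potential
rate = taylor_rule(inflation=3.0, inflation_target=2.0, output_gap=1.0)
print(f"Implied short-term policy rate: {rate:.2f}%")  # 6.00% under these assumptions
```

The point of such a rule, in the terms of the chapter, is precisely its legibility: market participants who know the frame can verify that the observed policy rate is 'consistent with the central bank's model of the economy', which is what makes the performance of monetary policy measurable.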
Formal economic expertise has become a crucial medium for rendering central banks' communication 'transparent' and legible for market publics, for encoding policy signals in clearly defined 'frames' (Braun 2016; Holmes 2009; Tognato 2013; Velthuis 2015), and for the public display of expertise through the production of forecasts and increasingly sophisticated modelling of economic processes meant to shore up the credibility of monetary policy strategies to its audiences (Braun 2015; Hubert 2015).

On the other hand, RE models have also informed how central banks have cultivated an infrastructure of market-based technologies of governing, to make themselves operationally autonomous from the state. To improve the efficacy of those market-based instruments, central banks have strategically shaped the institutional evolution of financial, money and credit markets in ways that would improve their own ability to elicit demonstrable and measurable market reactions (e.g. Braun 2020; Gabor and Ban 2015; Gabor and Vestergaard 2018; Walter 2019; Walter and Wansleben 2020; Wansleben 2018; Wullweber 2021). Thus, not only have

'academic research and scientific prestige become a means to bureaucratic power' as a sort of reputational capital (Mudge and Vauchez 2016, 162); but scientific expertise and calculative techniques have also provided a common medium of communication allowing central banks to produce measurable effects by shaping the expectations of the 'insider audience' (Braun 2015, 369) of market participants.

Since the 1980s, then, central banks have proven shrewd strategists, 'capitalizing' on the crisis of organized capitalism (Krippner 2011) to reconstruct monetary policy as the purely formal-technical provision of price stability as an autonomous and well-delineated public good. The infrastructural symbiosis between financial markets and monetary policy described above has allowed them to create a source of performative power they can wield to generate scientific legitimacy and thus organizational autonomy at the interstices of multiple publics – spinning the yarn of central bank independence as a paradigmatic case 'where states can reap the benefits of delegating clearly delineated and unequivocal policy tasks to specialized technocrats' (Wansleben 2021, 909).

However, to preserve this legitimacy, central banks have become profoundly dependent on preserving the 'felicity conditions' for these demonstrations of their 'performative power' (Wansleben 2018). Scientization, in this sense, constitutes a double-edged sword. On the one hand, to derive institutional legitimacy from it, central banks have been forced to streamline the formal structure of their operations and their rhetoric to conform to the one-dimensional technical rationality implied by formal models (see Meyer and Rowan 1977 for how strategies of legitimation shape the formal structure of modern organizations). On the other hand, their ability to secure 'reliable, predictable governability only arises through central banks' purchase over processes of expectation formation' (Wansleben 2018, 777), creating a straitjacket of conformity to what and, more importantly, how markets and formal economic expertise conceive of financial processes. As formal economic expertise has become more and more inscribed into central banks' internal sense-making and interpretive strategies, their ability to 'see' beyond the limits of this expertise has become severely restricted, to the point where the carefully crafted myth of monetary policy as the neutral pursuit of the public good of price stability has been severely compromised by the failure to pre-empt or even notice growing financial fragilities and excesses before the crisis of 2007–09 (Abolafia 2004, 2010; Fligstein et al. 2017; Golub et al. 2014). This failure to 'see' beyond what is visible through formal economic models has been compounded by the symbiotic dependence on the integrity of their market-based infrastructure, whose stability needs to be constantly shored up against the (growing) instabilities and dysfunctions of global finance. As market dysfunctions have become a rather constant state of affairs since the mid-2000s, central banks have found it increasingly difficult to fend off the impression that their autonomy to pursue price stability as their sole technical mission might be compromised by this functional symbiosis with increasingly 'unfettered' financial markets (Walter and Wansleben 2020).

CENTRAL BANKING AND FINANCIALIZATION: MEASURABLE SUCCESS(ES) AND RITUALS OF GOVERNABILITY

Modern central banking has, by most conventional measures, proven a quite phenomenal success story. Following the end of the Great Inflation of the 1970s and early 1980s, central banks presided over nearly two decades of price stability and stable and favourable macroeconomic conditions (affectionately dubbed the ‘Great Moderation’: Bernanke 2004), during which they could quietly expand and perfect the infrastructural power which undergirds the measurable performance of monetary policy.

However, refashioning monetary policy into a sort of performative power has also produced, in the longer run, a number of adverse side effects. The (self-)conditioning of organizational actors to conform to ‘pre-specified and narrow accountabilities and performance criteria’ commonly leads them to ‘develop strategies, knowledge, tools, and professional identities that allow them to … optimize only towards measurable success’ (Wansleben 2021, 914). The more central banks have tailored their monetary policy scripts towards the performance of ‘narrow measures of success’ (Wansleben 2021, 911), the more they have both contributed to and become thoroughly dependent on structural felicity conditions over which they had little or no control. The ‘unfettering’ of market-based finance helped produce favourable macroeconomic conditions, fuelling stable economic growth while enclosing inflationary pressures largely within the financial system (Krippner 2011; Mehrling 2011). At the same time, as monetary policy was increasingly streamlined to align with a body of expertise that considered financial markets to be fundamentally efficient and the market economy as intrinsically stable, central banks became increasingly myopic not only with regard to the emergence of dysfunctions and fragilities in the market-based infrastructures on which they relied, but also with regard to how their own activities aimed at stabilizing markets (to secure the felicity conditions for the performance of monetary policy) contributed to this process (Fligstein et al. 2017; Golub et al. 2014; Walter 2019, 2020; Walter and Wansleben 2020). As central banks were streamlining their monetary policy apparatus rather exclusively towards measurable performance in relation to the public good of price stability, other goals traditionally pursued alongside or as instrumental to achieving price stability were gradually abandoned. The result has been a fragmentation of epistemic tasks within central banks (between the articulation of monetary policy strategy, its technical implementation, and financial supervision) (Wansleben 2021, 910), leading to a gradual dismantling of both the techniques required for observing and diagnosing these other objectives and the instruments required for actively pursuing anything other than ‘price stability’ in an increasingly narrow sense. Whereas central banks had previously used money market interventions as an instrument both for stabilizing the price level and for maintaining financial stability, these interventions gradually evolved into a tool tailored purely to the implementation of inflation targeting (Wansleben 2021, 914). Without any instruments for practically tackling the problem of financial (in)stability, central banks began to turn a blind eye to it, since available ‘response repertoires’ control what is being noticed (Weick 1979, 26). This ‘division of regulatory labour’ (Wansleben 2021) has thus certainly contributed to the failure to ‘see’ growing financial instabilities (Fligstein et al. 2017), but also to the widely noted (cultural or ideological) ‘capture’ (Kwak 2013) of monetary policy by the idea of market efficiency, as central banks grew accustomed to thinking of financial markets as a transmission belt for monetary policy, which needed to be left undisturbed in order for its policy signals to circulate efficiently.
Not only did financial stability gradually disappear as a separate objective and concern for monetary policy; in helping to engineer a more market-based system of finance, central banks actually contributed to making the financial system more prone to liquidity crises, unfettered credit expansion and thus the build-up of financial fragility. While central banks had traditionally been very concerned with controlling (overall) credit expansion in the financial system, and thus had been very sceptical of highly integrated and liquid interbank and capital markets (Wansleben 2018, 795), since the 1980s they have actively supported financial innovation and institutional reconstructions that would enhance reactivity to monetary policy signals (Braun 2020; Gabor 2016; Krippner 2011; Mehrling 2011; Walter and Wansleben 2020; Wansleben 2020).

In particular, central banks’ and regulatory agencies’ tacit regulatory approval of an increasingly complex system of ‘shadow’ finance (Thiemann 2018), which central banks hoped could improve the transmission of monetary policy (Walter and Wansleben 2020), has effectively eroded the ability of monetary policy to control the endogenous creation of (credit-)money by global finance and the various waves of asset inflation to which it has contributed (Mehrling 2011). Central banks’ efforts to disentangle monetary policy from ‘fiscal domination’ have thus, paradoxically, led them straight into a situation of ‘financial dominance’ (Diessner and Lisi 2020), in which the dependence of their technical agency on a market infrastructure conducive to its measurable performance meant they had little choice but to accommodate financial and credit expansion to avoid disrupting the market processes on which that agency depended. This, however, places additional constraints on central bankers to (publicly) conform to their narrow mission of price stability and to avoid any digression from their organizational script, in order to preserve both the ‘credibility’ of their commitment (Braun 2015, 381) and the purity of their signalling apparatus, which are crucial to their ability to ensure markets ‘perform’ in accordance with the inflation targeting script. However, financial dominance has another, more structural dimension. As central bank(er)s noticed early on, their ability to make markets’ expectations conform to central banks’ models depended on avoiding any interference with ‘orderly conditions’ (Mehrling 2011, 48ff.) in markets that could distort the signal or disrupt its transmission. While central banks’ active role in facilitating the diffusion of US-style liquid money and capital markets (Gabor 2016; Konings 2011) was meant to improve the liquidity of those markets in order to shore up their reactivity to monetary policy, its flip side was that market-based finance has exhibited an inherent tendency towards ‘pro-cyclical’ behaviour, producing cycles of credit expansion and subsequent fragility in which market liquidity evaporates (Goodhart 2015, 281). The Federal Reserve learned this lesson over the course of the 1980s, when it accidentally disrupted orderly conditions while trying (in line with remnant monetarist ideas) to rein in the rapid expansion of credit and the money supply in the US economy, and imparted it to what eventually became the global model of inflation targeting. This meant that, if the management of expectations in line with an RE model to demonstrate the ability to target inflation was to work efficiently, monetary policy had to give up all pretensions of controlling the expansion of credit, since such attempts might trigger a liquidity crisis and undermine the credibility and autonomy of the central bank itself (Walter and Wansleben 2020).
Over the course of the 1990s, as monetary policy increasingly shied away from curtailing liquidity and credit expansion in the financial system, the traditional role of central banks as ‘lenders of last resort’ during times of acute crisis thus morphed into one of durable underwriters of market liquidity, as ‘dealers’ or ‘market makers of last resort’ (Mehrling 2011) with a vested interest in ‘de-risking’ (Gabor 2020) financial markets, that is, in structurally propping up asset prices to prevent any disruption of market liquidity that could compromise the performance of (inflation targeting) monetary policy. Securing and protecting the felicity conditions required for producing measurable performances of their technocratic agency thus locked them into a symbiotic dependence with market-based finance, creating a strong and structural bias for central banks not to interfere in markets and their increasingly pro-cyclical dynamics. Over the course of the 1990s and 2000s, this led numerous observers to point out a pro-market and pro-finance bias of central banks that appeared to result from ideological or technocratic ‘capture’ of monetary policy.

As long as central banks could point to the successes of their inflation targeting ‘ritual performances’, however, this charge of technocracy did not pose a significant threat to their autonomy. This changed when the pro-cyclical dynamics and fragilities in financial markets that monetary policy (especially, but not only, in the US) had long turned a blind eye to erupted into the 2007–09 global financial crisis. With this sudden and complete disappearance of the ‘orderly conditions’ on which its institutional legitimacy and autonomy depended, central banking entered into a period of profound institutional transformations, the consequences and implications of which are still not fully understood.

THE ADVENT OF ‘UNCONVENTIONAL’ MONETARY POLICY AND ITS CONSEQUENCES FOR CENTRAL BANKS’ PERFORMANCE

The global financial crisis of 2007–09 (GFC), and subsequent debt crises such as that in the Eurozone between 2009 and 2014, have profoundly transformed the conditions under which central banks operate. The macroeconomic stability of the Great Moderation and its ‘financial Keynesianism’ (Minsky 2001), in which financial profits sustained demand without inflating real (as opposed to financial) prices, had played into the myth of central banks’ superior technical acumen and effectively immunized them against any critiques of their unwillingness to curb ‘unfettered markets’, to burst financial bubbles or to regulate rampant financial ‘innovation’. The GFC confronted central banks with multiple challenges. Their systematic reliance on and support for market-based finance put a dent in their technical credibility. The crisis also profoundly disrupted the normal functioning of market-based finance and thus robbed central banks of most of their tried-and-tested instruments for intervening in financial markets. Simultaneously, they faced mounting public pressure to come up with technical fixes to stabilize financial markets and fend off the looming economic crisis – requiring them to develop ad hoc solutions with little or no established scientific legitimacy, and whose effectiveness and side effects were difficult to gauge (at best). Two tasks that fell to central banks illustrate particularly well the tensions and fissures that emerged within the careful de-politicization of monetary policy in this new situation: first, the much discussed shift to ‘unconventional’ monetary policy (Borio et al. 2018) to fend off the looming threat of another Great Depression, and second, the creation of a framework for the ‘macro-prudential’ regulation of finance to prevent and manage the financial instabilities endogenous to market-based finance (Coombs 2017; Goodhart 2015; Thiemann 2019). The idea that there was a need for ‘macro’-prudential regulation emerged from the fact that the pre-crisis division of regulatory labour between a dominant micro-prudential regime of financial regulation and monetary policy had failed to stem the pro-cyclical build-up of financial risks and fragilities (Goodhart 2015; Wansleben 2021), as in the case of the US-based real-estate bubble that burst in 2007 and revealed the profound frailties of financial institutions and markets previously considered both robust and resilient. Political decision-makers responded by assigning central banks the task of developing and managing a framework for dealing with financial instability (Thiemann et al. 2021, 1434).

Since macro-prudential regulation potentially involved making decisions about the allocation of credit and interfering in the determination of risk and asset values, it posed a significant threat to the carefully calibrated de-politicization and autonomy of central banking. As neither scientifically authenticated ways of measuring and assigning macro-prudential risks nor well-scripted and legitimate techniques or instruments for counteracting them existed (Thiemann et al. 2021), central banks faced the unwelcome task of taking over a policy field that remained heavily politicized in the absence of consensual scientific or technical expertise, with no authoritative framework for action or accepted instruments. While there is some (minor) variation between different central banks (mostly as a function of their legal mandates), they generally agreed that ‘anti-cyclical policies threatened to re-politicize central bank action’ (Thiemann 2019, 564); they ingeniously navigated this potential minefield by transforming macro-prudential regulation into an exercise of ‘performative measurement’ (Coombs 2020). In the absence of well-defined measures of financial fragility and systemic risk (and unable to break down specific financial institutions’ contribution to this aggregate), central banks opted to ‘pass the buck’ to financial institutions; instead of imposing any firm (counter-cyclical) rules on them, there has been a convergence on the ‘Solomonic’ solution of conducting public ‘stress tests’, in which financial institutions are enrolled in performances of central bank agency on financial stability by self-auditing their resilience and preparedness for crises of market liquidity (Coombs 2020, 2022). Although there are ample grounds for being sceptical of the effectiveness of such ‘rituals of verification’ and of the ability of risk management to actually neutralize risk (Power 2007), from the point of view of central banks stress testing constitutes a highly effective public ‘ceremony’ (Meyer and Rowan 1977). It avoids politically controversial interventions while enabling them to perform a very public demonstration of their agency. While, to some extent, stress testing may also serve an epistemic function of sensitizing banks and other financial institutions to systemic risk(s) (Coombs 2020), its main attraction (for central banks) certainly lies in the fact that stress tests function as a sort of ‘performative measurement’ (Coombs 2020, 526): by communicating assessment criteria ‘transparently’, central banks can incentivize financial institutions to conform to them and produce measurable performances of central banks’ regulatory success. This allows central banks to demonstrate their agency as regulators, and to frame ‘financial stability as a discrete objective addressed by specific instruments’, so that ‘financial stability now augments (rather than replaces or amends) the inflation-targeting model of central banking’ (Levingston 2021, 1475) as a complementary but functionally separate task (Levingston 2021; Thiemann 2019). As a result, the initially announced ‘macro-prudential revolution’ has been ‘reduced to a much more scaled back incremental approach during the process of implementation. While focusing on increasing the resilience of the system, implemented measures largely refrain from intervening in the build-up of financial risks during the upswing of the cycle’ (Thiemann 2019, 562). Central banks have thus quite successfully incorporated macro-prudential policy as a complementary but functionally distinct task in a way that does not interfere with the standard monetary policy script.
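The basic mechanics of such an exercise can be pictured in a deliberately stylized sketch (in Python). Everything below is invented for illustration – the two banks, the uniform loss rate and the 8 per cent hurdle – and the sketch abstracts entirely from the scenario design, risk modelling and disclosure choreography of any actual supervisory stress test:

# Stylized sketch of the arithmetic behind a capital stress test.
# All figures and the hurdle rate are invented; real exercises rest
# on elaborate scenario narratives and institution-specific models.
from dataclasses import dataclass

@dataclass
class Bank:
    name: str
    capital: float                # loss-absorbing capital, in billions
    risk_weighted_assets: float   # in billions

def stressed_capital_ratio(bank: Bank, loss_rate: float) -> float:
    """Apply a hypothetical adverse-scenario loss and return the post-shock capital ratio."""
    losses = loss_rate * bank.risk_weighted_assets
    return (bank.capital - losses) / bank.risk_weighted_assets

HURDLE = 0.08  # publicly communicated pass threshold (assumed)

for bank in [Bank("Alpha", 14.0, 100.0), Bank("Beta", 9.0, 110.0)]:
    ratio = stressed_capital_ratio(bank, loss_rate=0.05)
    print(f"{bank.name}: stressed ratio {ratio:.1%} ->", "pass" if ratio >= HURDLE else "fail")

Even this toy version makes the ‘performative’ point visible: because the scenario and the hurdle are communicated in advance, institutions can manage their balance sheets towards a passing result, which is precisely what allows the exercise to generate public, measurable demonstrations of regulatory success.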
However, unconventional monetary policy has proven more difficult to absorb. ‘Unconventional’ monetary policy includes a bundle of distinct measures and techniques. Yet the key difference from conventional policy, with its focus on (short-term) interest rate manipulation, is the focus on balance sheet policies meant to lock in longer-term structural effects on financial conditions (Borio et al. 2018). Specifically, unconventional monetary policy has been described as ‘the unprecedented and extensive use of central bank balance sheets to shape financial conditions’, with the aim of removing ‘market dysfunctions’ and ‘maintaining the plumbing of [market-based] finance’ (Musthaq 2021, 1).
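What ‘balance sheet policy’ means mechanically can be shown in a few lines of stylized double-entry bookkeeping; a sketch with invented figures, abstracting from haircuts, valuation effects and the variety of instruments used in real operations:

# Stylized double-entry sketch of a quantitative easing purchase.
# Figures are invented and in billions; real operations are far richer.
central_bank = {"assets": {"bonds": 100.0}, "liabilities": {"bank_reserves": 100.0}}

def purchase_assets(cb: dict, amount: float) -> None:
    """Buying securities expands both sides of the balance sheet: the
    bonds are paid for by crediting commercial banks' reserve accounts,
    that is, by creating new central bank money."""
    cb["assets"]["bonds"] += amount
    cb["liabilities"]["bank_reserves"] += amount

purchase_assets(central_bank, 50.0)
assert sum(central_bank["assets"].values()) == sum(central_bank["liabilities"].values())
print(central_bank)

The point of the sketch is simply that asset purchases expand both sides of the balance sheet at once, which is why ‘unconventional’ policy leaves such a visible and durable footprint there.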

Although the various forms of ‘Quantitative Easing’ have proven very effective in stabilizing financial markets, they have also raised awkward questions about the boundaries and scope of monetary policy, and thus about the operational autonomy of central banks. On the one hand, central banks needed to demonstrate their ability and expertise to intervene – given that they had largely failed to anticipate the crisis and had sung the praises of ‘self-regulating finance’ until the very moment the crisis broke. Unconventional policies succeeded in demonstrating the scope and range of central banks’ infrastructural powers; by effectively de-risking financial assets and restoring the liquidity of the financial system, they also restored the felicity conditions of conventional monetary policy. However, they also extended central bank intervention (far) beyond the interbank money market in which interest-rate management had (exclusively) operated. This extension implicated central banks in distributional questions, such as what the appropriate price for an asset is and what losses should be borne by market participants. Another trend in unconventional monetary policy has been to extend liquidity measures not only to more markets and at longer maturities, but also beyond traditionally eligible institutions – including, in particular, so-called ‘shadow banks’ (Wullweber 2021; for an overview of the growth of shadow banking, see Thiemann 2018) – and to weaken the traditional requirement of full collateralization of liquidity provision, so that considerable market and credit risks have been transferred onto central banks’ balance sheets (Borio et al. 2018). This massive recourse to ‘unconventional’ balance sheet measures meant that central banks’ continuing adherence to inflation targeting and price stability as the normality, to be restored through necessary but temporary unconventional policies, came under (increasingly public) scrutiny. This problem was made worse by the widespread perception that central banks showed no hesitation in using these tools when it came to rescuing finance from its self-made crisis. Central banks were acutely aware of the dangers this perceived lack of neutrality posed to their independence, fearing that various publics might ‘judge the expansion of liquidity excessive and criticize the apparent accommodation of the financial sector’ or, in the case of the Eurozone crisis, accuse them of monetizing public debt (Mabbett and Schelkle 2019, 440). To avoid being left to assume sole responsibility for the fall-outs of instability, they thus sought to enrol the state, and more specifically fiscal policy, in these refinancing operations for financial markets (Abolafia 2012, 169). They faced the following dilemma: they needed to intervene to demonstrate their agency vis-à-vis a problem they were seen to have co-created – but they also needed to avoid normalizing unconventional measures and undercutting the very delimitations that had made this agency so (apparently) measurably effective in the first place.
Faced with such ‘institutional pressure to both sustain and not sustain [unconventional policies] as a regular practice’ (Ronkainen and Sorsa 2018, 711), central banks needed to portray unconventional monetary policy as consistent with and complementary to their primary script – they needed to weave a ‘sensible plot’ to reconcile unconventional policies with their standard scripts of operating (Abolafia 2004; Ronkainen and Sorsa 2018, 715). Unlike with macro-prudential regulation, central banks have struggled to ‘construct an overall narrative of operational schemas that accommodates QE consistently with … other activities to maintain credibility’ (Ronkainen and Sorsa 2018, 715). They have tried to portray unconventional monetary policy as little more than a temporary and exceptional crisis intervention – a more extensive version of the ‘lender of last resort’ rescue operations central banks had provided during traditional financial crises.

They have insisted that ‘these programmes serve monetary policy in two ways: (1) addressing disruptions in the monetary policy transmission channel; and (2) providing additional monetary stimulus once policy rates reach the lower bound’ (Musthaq 2021, 15) – programmes which, they maintained, would be phased out once the normal functioning of markets, and thus the conditions for returning to a narrow goal of price stability, had been restored. As critics have rightly pointed out, this (rather misleading) framing obscures the systemic and durable dependence of generally inflated financial markets on the (implicit and explicit) support of central banks (Mehrling 2011). In a tacit acknowledgement of this critique, central banks have increasingly begun to frame unconventional policies as a way of buying time until a more ‘stable’ system of market-based finance can be engineered (Braun 2020, 408). However, the process of making the financial system more stable and resilient bears a striking resemblance to the blueprints already used by central banks before the series of financial and debt crises that began in 2007–08 to make financial architectures more responsive and conducive to their monetary policy scripts, and thus to increase the resilience of their policy infrastructure. Indeed, numerous central banks around the world have adopted similar recipes of extending and deepening collateral markets (which serve as conduits for their money market operations), encouraging in particular the use of ‘repurchase’ or ‘repo’ techniques by financial institutions for managing their liquidity (Birk and Thiemann 2020; Braun 2020), as these improve the transmission of monetary policy impulses and thus the reactivity of markets to them. Despite central banks’ best efforts to (re)create conditions allowing them to return to the pre-crisis (near-)exclusive focus on price stability, unconventional monetary policy has thus become a sort of constant companion to the main organizational script of inflation targeting. For central banks, this situation is deeply ambivalent. On the one hand, the perpetuation of unconventional policies has become indispensable for stabilizing the liquidity of financial markets, which is required for inflation targeting to operate successfully. On the other, the same unconventional policies serve as a constant reminder of the artificiality and contingency of central banks’ (continued) focus on price stability as a narrow measure of success. Central banks are quite aware of how unconventional policies continuously controvert the ‘myth’ of central bank neutrality (on which their independence rests) by highlighting how their performance could be held accountable to many other measures or ‘accounts of worth’ (Stark 2009). However, faced with the alternative of a destabilization of financial markets that would undermine the operative foundation of the performances through which central banks continue to exhibit the effectiveness of, and justify (the need for), their autonomous agency, they have chosen to drink from the poisoned chalice. Central banks have thus come to systematically and ‘increasingly rely on unconventional tools in noncrisis times to maintain confidence in an unstable financial system … these interventions increasingly target “market dysfunction”, as opposed to (a narrow interpretation of) monetary policy …, suggesting a convergence in central bank operations around maintaining the plumbing of finance’ (Musthaq 2021, 1).

CONCLUSION: HOW NORMALIZING ‘UNCONVENTIONAL’ POLICIES DE-POLITICIZES CENTRAL BANKS’ PERFORMANCE

The need to formally institutionalize macro-prudential policy as a task for monetary policy and the de facto normalization of unconventional monetary policy have created considerable tensions in how the performance of monetary policy is being measured (Walter 2022).

As what constitutes effective and efficient monetary policy has become more and more ambiguous, central banks’ autonomous agency, premised on the performance of such efficacy, has come under increasing stress. The recent ‘return’ of inflation has added insult to injury, as central banks now face a problem of which they had, for the better part of three decades, claimed to have acquired technical mastery. Whereas before they could invoke (structurally) low inflation (in goods and services) as evidence that their relatively close control over financial markets’ inflation expectations worked, the current bout of inflation increasingly exposes how poorly these performances of measurable control within financial markets actually transmit into the broader economy – and how limited central banks’ autonomy vis-à-vis a financial system dependent for its stability on continued expansionary monetary policy has become. From the perspective developed in this chapter, the prospects for a quick return to the system of performance measuring on which independent central banking rests do not seem particularly promising. Beyond the problems these developments pose for the autonomous agency of central banks themselves, they also reveal the difficulties that de-centred governance through technical agencies, whose autonomy depends on continuous audit and performance measurement, faces when confronted with problems and crises that ‘overflow’ neat divisions of labour and narrow technical rationalities. As these agencies’ legitimacy depends on performances of ‘narrow measures of success’ (Wansleben 2021, 911), they will need to protect their particular formal scripts and organizational rationalities in times of crisis – seeking, as central banks have done, to protect and stabilize what they perceive as the structural felicity conditions for successful performances. Performance measuring, in this way, encourages a continuous black-boxing of structural tensions and contradictions that overflow the technical division of labour. It renders more reflexive governance, and organizational and social learning about the origins of crises, more difficult by committing technical agencies to increasingly ritualistic pursuits of formally rational goals, even while the broader structural conditions within which these tasks acquired a more substantive rationality may be eroding. As has been the case with central banking since the GFC, ‘rather than opening space for a discussion about the “social purpose” … rising political turbulence [may] actually strengthen … normative commitment to de-politicisation’ (Levingston 2021, 1479). In this way, the dependence of technical agency on measurable performance(s) may work to perpetuate, and even deepen, the entanglement of this agency in the production of the very structural problems that erode its felicity conditions.

REFERENCES

Abolafia, M.Y. (2004). Framing moves: Interpretive politics at the Federal Reserve. Journal of Public Administration Research and Theory, 14(3), 349–70. https://doi.org/10.1093/jopart/muh023.
Abolafia, M.Y. (2010). Narrative construction as sensemaking: How a central bank thinks. Organization Studies, 31(3), 349–67. https://doi.org/10.1177/0170840609357380.
Abolafia, M.Y. (2012). Central banking and the triumph of technical rationality. In K. Knorr Cetina & A. Preda (Eds.), The Oxford handbook of the sociology of finance (pp. 158–85). Oxford University Press.
Adolph, C. (2013). Bankers, bureaucrats, and central bank politics: The myth of neutrality. Cambridge University Press.
Bernanke, B. (2004). The Great Moderation. Remarks by the Governor of the Federal Reserve presented at the Meetings of the Eastern Economic Association, Washington, DC, 20 February. http://www.federalreserve.gov/boarddocs/speeches/2004/20040220/. Accessed 17 August 2023.

Bernanke, B., & Mishkin, F.S. (1997). Inflation targeting: A new framework for monetary policy? Journal of Economic Perspectives, 11(2), 97–116.
Binder, S.A., & Spindel, M. (2017). The myth of independence: How Congress governs the Federal Reserve. Princeton University Press.
Birk, M., & Thiemann, M. (2020). Open for business: Entrepreneurial central banks and the cultivation of market liquidity. New Political Economy, 25(2), 267–83. https://doi.org/10.1080/13563467.2019.1594745.
Blinder, A. (2004). The quiet revolution: Central banking goes modern. Yale University Press.
Borio, C., & Zabai, A. (2018). Unconventional monetary policies: A re-appraisal. In P. Conti-Brown & R.M. Lastra (Eds.), Research handbook on central banking (pp. 398–444). Edward Elgar Publishing.
Braun, B. (2015). Governing the future: The European Central Bank’s expectation management during the Great Moderation. Economy and Society, 44(3), 367–91. https://doi.org/10.1080/03085147.2015.1049447.
Braun, B. (2016). Speaking to the people? Money, trust, and central bank legitimacy in the age of quantitative easing. Review of International Political Economy, 23(6), 1064–92. https://doi.org/10.1080/09692290.2016.1252415.
Braun, B. (2020). Central banking and the infrastructural power of finance: The case of ECB support for repo and securitization markets. Socio-Economic Review, 18(2), 395–418. https://doi.org/10.1093/ser/mwy008.
Carpenter, D.P. (2001). The forging of bureaucratic autonomy: Reputations, networks, and policy innovation in executive agencies. Princeton University Press.
Coombs, N. (2017). Macroprudential versus monetary blueprints for financial reform. Journal of Cultural Economy, 10(2), 207–16. https://doi.org/10.1080/17530350.2016.1234404.
Coombs, N. (2020). What do stress tests test? Experimentation, demonstration, and the sociotechnical performance of regulatory science. The British Journal of Sociology, 71(3), 520–36. https://doi.org/10.1111/1468-4446.12739.
Coombs, N. (2022). Narrating imagined crises: How central bank storytelling exerts infrastructural power. Economy and Society, 51(4), 679–702. https://doi.org/10.1080/03085147.2022.2117313.
Diessner, S., & Lisi, G. (2020). Masters of the ‘Masters of the Universe’? Monetary, fiscal and financial dominance in the Eurozone. Socio-Economic Review, 18(2), 315–35. https://doi.org/10.1093/ser/mwz017.
Drori, G.S., & Meyer, J.W. (2006). Global scientization: An environment for expanded organization. In Globalization and organization: World society and organizational change (pp. 50–68). Oxford University Press.
Fligstein, N., Brundage, J.S., & Schultz, M. (2017). Seeing like the Fed: Culture, cognition, and framing in the failure to anticipate the financial crisis of 2008. American Sociological Review, 82(5), 879–909. https://doi.org/10.1177/0003122417728240.
Gabor, D. (2016). The (impossible) repo trinity: The political economy of repo markets. Review of International Political Economy, 23(6), 967–1000. https://doi.org/10.1080/09692290.2016.1207699.
Gabor, D. (2020). Critical macro-finance: A theoretical lens. Finance and Society, 6(1), 45–55. https://doi.org/10.2218/finsoc.v6i1.4408.
Gabor, D., & Ban, C. (2015). Banking on bonds: The new links between states and markets. JCMS: Journal of Common Market Studies. https://doi.org/10.1111/jcms.12309.
Gabor, D., & Vestergaard, J. (2018). Chasing unicorns: The European Single Safe Asset Project. Competition & Change, 22(2), 139–64. https://doi.org/10.1177/1024529418759638.
Golub, S., Kaya, A., & Reay, M. (2014). What were they thinking? The Federal Reserve in the run-up to the 2008 financial crisis. Review of International Political Economy, 1–36. https://doi.org/10.1080/09692290.2014.932829.
Goodhart, L.M. (2015). Brave new world? Macro-prudential policy and the new political economy of the Federal Reserve. Review of International Political Economy, 22(2), 280–310. https://doi.org/10.1080/09692290.2014.915578.
Hawtrey, R.G. (1970). The art of central banking (2nd ed.). F. Cass.
Holmes, D.R. (2009). Economy of words. Cultural Anthropology, 24(3), 381–419. https://doi.org/10.1111/j.1548-1360.2009.01034.x.

Hubert, P. (2015). Do central bank forecasts influence private agents? Forecasting performance versus signals. Journal of Money, Credit and Banking, 47(4), 771–89. https://doi.org/10.1111/jmcb.12227.
Konings, M. (2011). The development of American finance. Cambridge University Press.
Krippner, G. (2011). Capitalizing on crisis: The political origins of the rise of finance. Harvard University Press.
Kwak, J. (2013). Cultural capture and the financial crisis. In Preventing regulatory capture: Special interest influence and how to limit it (pp. 71–98). Cambridge University Press.
Levingston, O. (2021). Minsky’s moment? The rise of depoliticised Keynesianism and ideational change at the Federal Reserve after the financial crisis of 2007/08. Review of International Political Economy, 28(6), 1459–86. https://doi.org/10.1080/09692290.2020.1772848.
Mabbett, D., & Schelkle, W. (2019). Independent or lonely? Central banking in crisis. Review of International Political Economy, 26(3), 436–60. https://doi.org/10.1080/09692290.2018.1554539.
Marcussen, M. (2006). Institutional transformation? The scientization of central banking as a case study. In Autonomy and regulation: Coping with agencies in the modern state (pp. 81–109). Edward Elgar Publishing.
McNamara, K. (2002). Rational fictions: Central bank independence and the social logic of delegation. West European Politics, 25(1), 47–76. https://doi.org/10.1080/713601585.
Mehrling, P. (2011). The new Lombard Street: How the Fed became the dealer of last resort. Princeton University Press.
Meyer, J.W., & Jepperson, R.L. (2000). The ‘actors’ of modern society: The cultural construction of social agency. Sociological Theory, 18(1), 100–120. https://doi.org/10.1111/0735-2751.00090.
Meyer, J.W., & Rowan, B. (1977). Institutionalized organizations: Formal structure as myth and ceremony. The American Journal of Sociology, 83(2), 340–63.
Minsky, H.P. (2001). Financial Keynesianism and market instability. In R. Bellofiore & P. Ferri (Eds.), The economic legacy of Hyman Minsky, vol. 1. Edward Elgar Publishing.
Mudge, S.L., & Vauchez, A. (2016). Fielding supranationalism: The European Central Bank as a field effect. The Sociological Review, 64(2), 146–69. https://doi.org/10.1111/2059-7932.12006.
Musthaq, F. (2021). Unconventional central banking and the politics of liquidity. Review of International Political Economy, 1–26. https://doi.org/10.1080/09692290.2021.1997785.
Polillo, S., & Guillén, M.F. (2005). Globalization pressures and the state: The worldwide spread of central bank independence. American Journal of Sociology, 110(6), 1764–802. https://doi.org/10.1086/428685.
Power, M. (2007). Organized uncertainty: Designing a world of risk management. Oxford University Press.
Ronkainen, A., & Sorsa, V.P. (2018). Quantitative easing forever? Financialisation and the institutional legitimacy of the Federal Reserve’s unconventional monetary policy. New Political Economy, 23(6), 711–27. https://doi.org/10.1080/13563467.2018.1384455.
Rose, N. (1991). Governing by numbers: Figuring out democracy. Accounting, Organizations and Society, 16(7), 673–92. https://doi.org/10.1016/0361-3682(91)90019-B.
Shore, C., & Wright, S. (2015). Governing by numbers: Audit culture, rankings and the new world order. Social Anthropology, 23(1), 22–8. https://doi.org/10.1111/1469-8676.12098.
Stark, D. (2009). The sense of dissonance: Accounts of worth in economic life. Princeton University Press.
Thiemann, M. (2018). The growth of shadow banking: A comparative institutional analysis. Cambridge University Press.
Thiemann, M. (2019). Is resilience enough? The macroprudential reform agenda and the lack of smoothing of the cycle. Public Administration, 97(3), 561–75. https://doi.org/10.1111/padm.12551.
Thiemann, M., Melches, C.R., & Ibrocevic, E. (2021). Measuring and mitigating systemic risks: How the forging of new alliances between central bank and academic economists legitimize the transnational macroprudential agenda. Review of International Political Economy, 28(6), 1433–58. https://doi.org/10.1080/09692290.2020.1779780.
Tognato, C. (2013). Central bank independence: Cultural codes and symbolic performance. Palgrave Macmillan.
Tucker, P.M.W. (2018). Unelected power: The quest for legitimacy in central banking and the regulatory state. Princeton University Press.

Velthuis, O. (2015). Making monetary markets transparent: The European Central Bank’s communication policy and its interactions with the media. Economy and Society, 44(2), 316–40. https://doi.org/10.1080/03085147.2015.1013355.
Walter, T. (2019). Formalizing the future: How central banks set out to govern expectations but ended up (en-)trapped in indicators. Historical Social Research/Historische Sozialforschung, 44(2), 103–30. https://doi.org/10.12759/hsr.44.2019.2.103-130.
Walter, T. (2020). The Janus face of inflation targeting: How governing market expectations of the future imprisons monetary policy in a normalized present. In Futures past: Economic forecasting in the 20th and 21st century (pp. 105–38). Peter Lang.
Walter, T. (2022). The social sources of unelected power: How central banks became entrapped by infrastructural power and what this can tell us about how (not) to democratize them. In Central banking, monetary policy and social responsibility (pp. 195–218). Edward Elgar Publishing. https://doi.org/10.4337/9781800372238.00017.
Walter, T., & Wansleben, L. (2020). How central bankers learned to love financialization: The Fed, the Bank, and the enlisting of unfettered markets in the conduct of monetary policy. Socio-Economic Review, 18(3), 625–53. https://doi.org/10.1093/ser/mwz011.
Wansleben, L. (2018). How expectations became governable: Institutional change and the performative power of central banks. Theory and Society, 47(6), 773–803. https://doi.org/10.1007/s11186-018-09334-0.
Wansleben, L. (2020). Formal institution building in financialized capitalism: The case of repo markets. Theory and Society, 49(2), 187–213. https://doi.org/10.1007/s11186-020-09385-2.
Wansleben, L. (2021). Divisions of regulatory labor, institutional closure, and structural secrecy in new regulatory states: The case of neglected liquidity risks in market-based banking. Regulation & Governance, 15(3), 909–32. https://doi.org/10.1111/rego.12330.
Weick, K.E. (1979). The social psychology of organizing. Random House.
Wullweber, J. (2021). The politics of shadow money: Security structures, money creation and unconventional central banking. New Political Economy, 26(1), 69–85. https://doi.org/10.1080/13563467.2019.1708878.
Zayim, A. (2022). Inside the black box: Credibility and the situational power of central banks. Socio-Economic Review, 20(2), 759–89. https://doi.org/10.1093/ser/mwaa011.

18. We treasure what we measure: global development cooperation and the Sustainable Development Goals

Katja Freistein

INTRODUCTION

In the field of global development, arguably the most comprehensive regime for performance measuring has emerged in the form of the Sustainable Development Goals (SDGs). The SDGs function as a conceptual node for many vital dimensions of development, including human well-being, social justice and sustainable livelihoods. Following the first instance of development goals with a global reach, the Millennium Development Goals 2000–2015 (United Nations: Dag Hammarskjöld Library 2022), the SDGs are the latest generation of a comprehensive agenda and cover an enormous range of development-related practices and operations. In the SDGs’ catalogue, the number of goals, which rose from 8 in the Millennium Development Goals (MDGs) to now 17, is matched by 169 targets and (currently) more than 240 indicators (some of which overlap) (see United Nations 2015a). Moreover, during the early stages of the process, suggestions by experts invited to contribute on how to design these indicators added up to nearly 5000 – roughly 250 pages of fine-printed tables, which needed to be reviewed and processed (Sustainable Development Solutions Network 2015). Experts from various developmental government and non-government organisations put forward ideas, which were then either followed or abandoned (for a longer discussion, see Janouskova et al. 2018). In their sheer volume, the SDGs epitomise a trend towards measuring in global development that has grown both out of the despair that progress on persistent problems like fighting poverty or providing universal access to water was slower than the global development community had hoped, and out of the will to achieve progress by monitoring each step meticulously and systematically. While the (ongoing) global COVID-19 pandemic has stalled or, in some instances, seriously endangered the realisation of the SDGs, the process continues and is too big to be allowed to fail. The project of development measuring emerged out of a more general trend to monitor the capacity and performance of actors in the field of development. Growing out of earlier debates, the acknowledgement of the importance of monitoring and evaluation led to practices of measuring (Best 2017). Additionally, goal-setting became a widespread practice, replacing the emphasis on statistical averages with more fine-grained benchmarks and indicators that discipline actors towards compliance (Anand et al. 2009). Today, data plays a central role both in describing policy challenges in detail and in tracking how governments and other actors fare in implementing joint goals. The goal (or fantasy) of complete, global coverage of development data has also been reinforced by the SDGs and their claim of universality. The conviction that measuring is not only helpful but a clear precondition for the success of the SDGs has been reflected in the process of their realisation and puts a strong emphasis on measurability. Thus, both are true: the SDGs measure what they treasure and treasure what they measure.
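The nested architecture of goals, targets and indicators can be pictured as a simple hierarchy. The sketch below (in Python) fills in only one branch of the catalogue, with official labels heavily abbreviated; the counting logic, not the data, is the point:

# Minimal sketch of the SDG goal -> target -> indicator hierarchy.
# Only one branch is shown, with abbreviated labels; the full
# catalogue holds 17 goals, 169 targets and 240+ indicators.
sdgs = {
    "Goal 1: End poverty in all its forms everywhere": {
        "Target 1.1: Eradicate extreme poverty by 2030": [
            "Indicator 1.1.1: Proportion of population below the international poverty line",
        ],
        # ... six further targets under Goal 1 alone
    },
    # ... Goals 2-17
}

def count_items(catalogue: dict) -> tuple[int, int, int]:
    """Return the number of (goals, targets, indicators) in the catalogue."""
    goals = len(catalogue)
    targets = sum(len(t) for t in catalogue.values())
    indicators = sum(len(ind) for t in catalogue.values() for ind in t.values())
    return goals, targets, indicators

print(count_items(sdgs))  # (1, 1, 1) for this truncated example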

The chapter will trace some of the larger trends in measuring development aid and goals, focusing on the SDGs as a complex endeavour that brings together many different practices and measuring regimes that have emerged in the field of global development. It starts by identifying some of the key methods in performance measuring that structure the field, paying close attention to the debates between experts and to the experiences and lessons learned. The chapter will further address the context in which measuring practices have evolved over the last few decades and the changes in objectives and methods. Finally, some of the main consequences for the organisational environment in which measuring takes place are detailed, particularly the repercussions for the operations of specialised international organisations in the field. The SDGs serve as both an example and a focal point that demonstrates how measuring works and what it entails for global development.

MEASURING IN THE DEVELOPMENT FIELD

The many dimensions that can be measured in the field of development reflect the scope of a field that ranges from the basic human needs of aid recipients to the efforts made by donor countries, and that, particularly with broad frameworks like the SDGs, overlaps with many fields such as human rights, education, health or migration. Accordingly, many different definitions pre-structure the ways in which things are being measured in the field, and the mode of governance by numbers has had broader repercussions for the ways in which policies are framed and goals of national and international development cooperation are being formulated (Davis et al. 2012). Since many different state and non-state organisations operate in an environment characterised by a continuing scarcity of financial resources, measuring also has a competitive angle, for instance, when it comes to establishing new indicators or rankings. Even descriptive numbers such as aid flows can be subject to politics (Krause Hansen and Porter 2012). Since no unitary definition exists of what constitutes aid or ‘concessional development finance’ (Mitchell 2020), states tend to offer their own definitions. In an effort to track and compare the various sums allocated for development cooperation in different national budgets, flows defined as Official Development Assistance (ODA) are monitored and published in the form of comparative measurements by the Development Assistance Committee of the Organisation for Economic Co-operation and Development (OECD DAC), usually both as a percentage of gross national income (GNI) and in absolute US dollars. The DAC has developed a dataset based on a consistent, standardised definition that allows for comparing the efforts of its 30 member countries; even for this data, measurements can be subject to contestation.1 Larger states in particular, like the members of the BRICS (Brazil, Russia, India, China, South Africa), foremost among them China, have become notable aid donors – but they also remain recipients of development aid and prefer to call their development aid ‘South-South cooperation’. Some estimates of their ODA-like flows exist (see Donortracker 2021). In an environment that values ODA as a measure of states’ contributions to a common global good, comparisons are bound to affect how states present themselves to boost their reputation, and are thus also bound to be contested. Measuring ODA has been related to measuring national development capacity, defined as ‘the ability of individuals, institutions, and societies to perform functions, solve problems, and set and achieve objectives in a sustainable manner’, which is assessed with regard to the potential and real effectiveness of national and sub-national institutions as well as non-state partners (see OECD 2007).

The debate about ‘aid effectiveness’, which concerns the practices of aid allocation, emerged out of the observation that spending did not match or engender the desired results. Several multilateral meetings discussed and eventually agreed on key principles that would help to guarantee aid effectiveness (Rome 2003, Paris 2005, Accra 2008 and Busan 2011). The idea of a Global Partnership (for Effective Development Cooperation, GPEDC) emerged at the Paris meeting and was further discussed.2 The GPEDC has generated its own set of data and indicators that measure and compare the implementation of these principles, but it does not rank states.3 Other, related assessments exist that face similar challenges in collecting data through self-reporting by states and through surveys (Mitchell 2020); in fact, many of the reports in the field of development aid, capacity and effectiveness are dependent on sources provided by the states they monitor and on peer review within the field. This complexity has also been one reason for the limitations of the measuring regime, in which the ‘quality’ of donor practices and aid has only recently become a more quantifiable entity and remains subject to ongoing expert consultations. Such assessments of aid and aid effectiveness have also become closely interlinked with the 2030/SDG agenda. For instance, the DAC responded to this changed environment, in which ODA was complemented by various South-South and other growing initiatives, such as regional development banks or triangular cooperation initiatives, for example, on food security, vaccine technology or developing climate technologies,4 by initiating ‘Total Official Support for Sustainable Development (TOSSD)’, introducing its methodology in 2019. The TOSSD measures all official resources that are used for sustainable development in developing countries as well as contributions to ‘International Public Goods – up to now “invisible” in development finance statistics – that help countries reach their Sustainable Development Goals’ (see Total Official Support for Sustainable Development 2022). In broadening the scope of the flows that are being measured and compared, the DAC thus pays tribute to the centrality of the SDG agenda and to the efforts made by a variety of state and non-state actors towards realising the goals. Further, the TOSSD establishes and refines a specific methodology, which incorporates a comprehensive dataset geared to monitoring SDG success and builds on the Addis Ababa Action Agenda on financing for development, which is intended to align financing flows and policies with economic, social and environmental priorities (see United Nations 2015b). Both in terms of goal-setting and in terms of the dimensions of measuring, the SDGs have thus become one of the most complex measuring projects in the field of global development.
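How much the chosen yardstick matters for such comparisons can be illustrated with a small sketch. The donor figures below are invented; only the 0.7 per cent ODA/GNI benchmark is real, being the long-standing UN target against which DAC reporting is commonly read:

# Invented donor figures (billions of US dollars) showing how rankings
# shift between absolute ODA volumes and ODA as a share of GNI.
donors = {
    "Donor A": {"oda": 35.0, "gni": 23000.0},  # large economy
    "Donor B": {"oda": 7.0, "gni": 900.0},     # small but generous economy
    "Donor C": {"oda": 12.0, "gni": 4000.0},
}
UN_TARGET = 0.007  # the UN's 0.7 per cent ODA/GNI target

by_volume = sorted(donors, key=lambda d: donors[d]["oda"], reverse=True)
by_share = sorted(donors, key=lambda d: donors[d]["oda"] / donors[d]["gni"], reverse=True)
print("Ranked by absolute volume:", by_volume)  # Donor A first
print("Ranked by share of GNI:  ", by_share)    # Donor B first
for name, d in donors.items():
    share = d["oda"] / d["gni"]
    print(f"{name}: {share:.2%} of GNI", "(meets 0.7% target)" if share >= UN_TARGET else "")

The same three donors thus come out in opposite orders depending on whether generosity is read off absolute volumes or income-relative shares, which is one reason why even these descriptive numbers remain politically contested.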
Furthermore, various comparative indices report progress in the form of rankings. One such index is the Commitment to Development Index, which ranks 40 of the ‘world’s most powerful countries’ on how their policies affect developing countries, weighing beneficial against detrimental action (Center for Global Development 2021). The index includes a variety of indicators aggregated to an overall score (investment/aid, trade, finance, migration, environment, security, technology, health), which is adjusted for averages and country size and offers a ranking based on perceived virtue. As with all rankings, there has been criticism of the index regarding its components and their relative weights, but no major changes have been induced. More generally speaking, indicators, indices and benchmarks have come to structure the development field in similar ways as others, creating a massive, quantified knowledge regime that compares and ranks states based on their specific performance.

Building governance on benchmarking and ranking has created demand for statistics and data, for example, on household income, poverty or the calorie intake of individuals, where it was previously non-existent. That, in turn, has created new demand for qualified staff in both national and international agencies, for communication channels between them and for funding. The development field, in which reporting from the field used to be (and to some extent remains) the dominant form of accountability to donors, is more and more characterised by its need for numbers. Numbers can be constructed around simple, money-metric measurements such as the World Bank’s poverty line, which draws a binary distinction between extreme poverty and everything above it and is measured in purchasing power parity (PPP)-adjusted US dollars, raised from originally $1 to now $2.15 a day (as of September 2022). The money-metric format has been criticised (e.g. by Jerven 2012, 2013) and defended against critics, both for good reasons; it remains firmly in place and has been further anchored in the global fight against poverty by making the World Bank one of the central custodian agencies for SDG Goal 1, the eradication of poverty in all its forms (together with the International Labour Organization). While reductionist in nature, the poverty line can easily be adjusted to accommodate changed external circumstances and is clearly measurable and comparable. The trade-off between complexity and measurability is often a ground for both criticism and pragmatism, particularly for international organisations that need to prove progress to their donors. Since measurability has become instrumental, we often witness a back and forth between simple numbers and attempts to include more participative, open forms of describing challenges and progress, such as narratives and large open surveys, for example, the United Nations’ (UN) project ‘The World We Want’ (United Nations 2019). Composite indices such as the Human Development Index, for instance, combine narrative and quantified forms of reporting, and thus create a more comprehensive but also rather complex way of addressing policy problems. Ultimately, they too are geared towards measuring progress with regard to very specific objectives and thus need some form of measurable format. To take another example from the issue of poverty eradication, the United Nations Development Programme’s (UNDP) multidimensional poverty index (MPI) is based on several related, measurable dimensions of one phenomenon (health, education and standard of living) but does not use a money-metric approach (UNDP 2021). The MPI measures aggregates based on ten different indicators in three dimensions, thus aiming to depict the deprivation of human beings more accurately than a money-metric measure; we learn not only about who and how many people are poor but also about the ways in which poverty affects them. The MPI requires more and different data and introduces a complexity that is more challenging to handle than one-dimensional measurements.
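The counting logic behind such a multidimensional measure can be sketched in a few lines. The sketch follows the broad outline of the Alkire-Foster method underlying the MPI (equally weighted dimensions, a one-third deprivation cut-off, and the product of incidence and intensity), but the three stand-in indicators and the household data are invented simplifications of the actual ten-indicator index:

# Stylized Alkire-Foster counting sketch in the spirit of UNDP's MPI.
# The real index uses ten indicators across three equally weighted
# dimensions; three invented indicators stand in for them here.
weights = {"health": 1 / 3, "education": 1 / 3, "living_standard": 1 / 3}
CUTOFF = 1 / 3  # poor if deprived in at least a third of weighted indicators

# 1 = deprived, 0 = not deprived (invented households)
households = [
    {"health": 1, "education": 1, "living_standard": 0},
    {"health": 0, "education": 0, "living_standard": 1},
    {"health": 1, "education": 1, "living_standard": 1},
    {"health": 0, "education": 0, "living_standard": 0},
]

scores = [sum(weights[k] * h[k] for k in weights) for h in households]
poor_scores = [s for s in scores if s >= CUTOFF]

H = len(poor_scores) / len(households)  # incidence: share of people who are poor
A = sum(poor_scores) / len(poor_scores)  # intensity: average deprivation among the poor
print(f"H = {H:.2f}, A = {A:.2f}, MPI = H * A = {H * A:.2f}")

Unlike a headcount against a money-metric line, the product of incidence and intensity falls not only when fewer people are poor but also when the poor are deprived in fewer dimensions, which is what allows the index to say something about how poverty affects people, not just how many it affects.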
At the same time, several indices have been adjusted, and different measurable dimensions can be combined into even more complex indices. One example is the inequality-adjusted Human Development Index, which consists of four dimensions (mean years of schooling, expected years of schooling, life expectancy at birth, gross national income per capita), which are further broken down by their distribution across the national population, depicting domestic inequalities with regard to human development (UNDP 2023). Depending on the sub-field, but also on the development organisation and its internal trajectories, measuring can take very different forms – and we can trace many debates between experts both within and outside the main development organisations on how best to measure a certain problem (on the HDI, e.g., McGillivray and White 1993; Doessel and Gounder 1994). The availability and operability of data in particular have determined some of the objects and techniques of measuring, and data itself has sometimes become a driving force for the development of indicators or even for debates about certain subjects.

Overall, some convergence in the practices of measuring themselves can be observed. First, the importance of data and statistics, particularly the availability of micro-data and transparent forms of proceeding with this data, attests to the centrality of measuring, both in aggregate forms like indices or rankings and as single items. Second, the rise in the absolute number of measurable benchmarks, indicators etc. also underlines that measuring itself has become deeply embedded in the global governance of development (Best 2014; Cooley and Snyder 2015). As argued above, the SDGs can be seen as not only the latest, but also the most comprehensive or even the most complex measuring regime. Within the SDGs’ framework, the different ways of performance measuring build on these measuring regimes in the development field. While universal in their overall goal, their measuring of progress is mostly based on aggregate national data, which either already exists or has to be provided according to the formulation of indicators. The high number of recommendations combines different expectations by drawing on a broad basis of expertise to develop measurability through inclusion and participation (Freistein and Mahlert 2016). While not impossible to fulfil, the goals and targets and their respective indicators already carry a heavy burden of more than just measuring. The idea of multi-stakeholdership, which was reflected in the complex process of developing the SDGs, guided the selection of goals and targets to a similar extent as the lesson from the MDGs process that measurable targets would work best (Fukuda-Parr and McNeill 2015). This is anchored in a firm belief that better data produces better outcomes – or, conversely, in the conviction that
… the failure of development to fulfil its objectives, that is to really build states, empower the weak and marginalized, reduce poverty, or bring about peace, is linked to a lack of, inadequate, or even false knowledge, a diagnosis that is usually followed by a request to generate more or better knowledge and by a search for methodologies that are better suited to do so. (Bakonyi 2018: 270)

The process of governing by indicators comes with a number of challenges. One major problem is a data gap, which has several sources: (a) the indicator addresses an issue that has not yet been systematically measured, for example, illicit financial flows (indicator 16.4.1); (b) only some states have data available, while others first need to collect and publish the corresponding data (similarly: indicator 16.4.1, for which states only reluctantly provide data); (c) the quality of available data is too poor; (d) data may exist but has not been aggregated or made visible, and so on. The lack of quality data has been a great source of concern for the realisation of the SDGs (Avendano et al. 2021), and indicators are grouped into three tiers (no more tier III indicators were identified in 2022).5 Providing the data will be a grand endeavour, as a first pilot study of these capabilities by the UN Statistics Division brings to light:

It revealed that, on average, data for only 40 of the applicable global SDG indicators (20 per cent) are currently available; another 47 global indicators (23 per cent) are considered easily feasible, meaning that the data source is, in principle, available. Moreover, existing capacity is heavily reliant on external assistance. Additional resources are required to monitor additional indicators. (UNStats 2018)

For that reason, the massive allocation of resources has been seen as justified by both proponents and critics of the current SDG indicators (Cobham 2014), even though money to realise the SDGs is generally in short supply.

Furthermore, indicators are in some cases supported by cooperation with private, commercial actors like survey institutes. Although data can be improved by mandating surveys in certain sectors, for instance, to map attitudes towards migrant populations (Adams 2019), such public-private partnerships have always been challenging. Indicators can hide the power relations of their producers and the relations between those who are measured and those who measure (Mügge 2016), creating massive knowledge inequalities. Another effect that has been described is how indicators, particularly if poorly designed, can distract political attention and resources so that other important objectives are ignored (Fukuda-Parr and McNeill 2019). Similarly, some indicators are a poor match for the target or in direct contradiction to the 'spirit' of the target (Fukuda-Parr and McNeill 2015), which does not affect their measurability but decouples them from the overall goal. If trajectories for policies are predetermined by an indicator that does not match or even reinterprets the overall goal, the direction of policies can change in unintended ways. Practitioners in the field were aware of these potential pitfalls from the start: 'A balance has been sought between what is feasible in the short term and what is required in the long term, in such a way as not to dilute the ambition of the 2030 Agenda' (Ordaz 2019: 141). Some indicators, because none of the custodian agencies would claim them, initially remained 'orphans', running the risk of being meaningless (Kapto 2019: 134).

At the same time, new data can be accumulated and reused for different purposes, for instance, for monitoring problematic political practices that can only be reported because an indicator requires data that previously did not exist; or issues that were missing from the SDG goals can be included by adding a new indicator. To give one example, this was the case for the initial exclusion of displaced people (i.e. refugees) from the targets, which was remedied by adding another indicator (to Target 16.3) after long discussions on how to make the case for it (Nahmias and Krynsky Baal 2019). Since data describing the share of national populations that are displaced already (mostly) existed, the new indicator did not impose any further burdens on data collection but helped to anchor refugee protection explicitly in the SDGs. Indicator selection has been subject to both pragmatic concerns (e.g. drawing on available data) and political ones (e.g. steering away from overly sensitive issues). More generally speaking, both feasibility and normative considerations continue to characterise the indicator and measuring process of the SDGs and many other fields of development cooperation.

MEASURING IS TREASURING

The turn to measuring in global development cooperation built on established practices in states and arose out of the experience of a 'lost decade' in development. The gradual delegitimation of the 'structural adjustment' ideology promoted by the Bretton Woods institutions in the 1980s and 1990s, born of neoliberal beliefs (Fougner 2008), further reinforced the trend. In the absence of visible progress, global development agencies were under massive pressure to change course and to justify their efforts to donors. The idea of 'aid effectiveness', mentioned above, was one of the main lessons derived from earlier failures (Best 2017), and it systematically introduced ideas of country ownership and participation into the work of aid agencies. At the same time, the complexity created by acknowledging the importance of sectors besides the economy that were affected by interventions posed new challenges in the practice of development organisations. Drawing on ideas of New Public Management and adapting practices from the private sector (Seabrooke and Sending 2020), development organisations introduced measuring as a policy device that would allow them to make plausible claims about their contributions to the improvements achieved in developing countries (Ward 2004).

The World Bank, for instance, developed a variety of innovative instruments of performance measuring (Best 2014), which have both contributed to the discourse of accountability in global development and created their own pathologies (Clegg 2010) – for instance, when states game indicators to improve their position in rankings without actually implementing policies that benefit their populations (Broome and Quirk 2015). Furthermore, the asymmetries created in the seemingly participatory practices were more easily veiled but persisted nonetheless (Lie 2015). States both pushed for these developments and were caught up in them; not all donor states adopted measuring and reporting practices at the same time. The United Kingdom and the United States, but also some of the Scandinavian countries and Ireland, were forerunners, while states such as Germany came to the game much later and have only recently started to fully embrace comprehensive reporting systems. The financial crisis of 2008/9 acted as a catalyst in the discourse and practices of 'results-based management' (Binnendijk 2000), including the development of 'standard indicators' (for more on the German learning process, see Janus and Esser 2022). The development sector has since become dominated by managerial practices, including benchmarking and indicator use (Lie 2015; Lie and Sending 2015). Measuring itself has taken centre stage, and where it was initially more of a means towards a very different end, providing and processing data has now become an end in itself. In particular, frustration with a lack of identifiable progress drove the trend, which emulated and translated techniques developed in business regulation.

Although the origins of indicators as modes of knowledge and governance stretch back to the creation of modern nation-states in the early nineteenth century and practices of business management a few centuries earlier, their current use in global governance comes largely from economics and business management. Development agencies such as the World Bank have created a wide range of indicators, including indicators of global governance and rule of law, and gross domestic product is one of the most widely used and accepted indicators. Thus, the growing reliance on indicators is an instance of the dissemination of the corporate form of thinking and governance into broader social spheres. (Merry 2011: S83)

Development organisations copied tools for measuring good governance from the private sphere and started to use them, buying into the idea of evidence-based decision-making, not least to appease donors' frustration with a lack of aid effectiveness (Servén et al. 1999). The increasing availability of data through technical innovation and new methodologies, which emerged in the context of large organisations such as the World Bank, supported this turn towards measuring. The trend has solidified in the SDGs, which bring together an unprecedented set of goals and indicators. As was recognised early on in the process,

The Sustainable Development Goals (SDGs) – an ambitious set of 17 goals and 169 targets – represent one of the latest steps in the 'evolution from statistics as a governmental technique of the nation state to indicators as a technology of global governance' … Indeed, the SDGs usher in a reliance on goal setting as the principal method for the international community to reach consensus on a vision of development. (Fisher and Fukuda-Parr 2019: 375)

While the MDGs introduced goal-setting as a soft yet powerful tool of development governance, the SDGs built on the experience with the MDGs and changed the underlying logic from a reliance on statistics and aggregated data (Linsi and Mügge 2019) to developing indicators and using disaggregated data.

Moreover, proper measuring of individual indicators became a main driving force of the SDG process itself. The machinery that was set up, in very transparent ways, to support indicator use is indicative of the enormous importance attached to indicators in the SDG process. We can see this shift, which realised the transition from statistical averages to governance by indicators, in the MDG final report and how it reflected on the importance of data and indicators: 'Only by counting the uncounted can we reach the Unreached', 'Together we can measure what we treasure' and 'What gets measured gets done' are three lessons learned for the SDGs. The slogans point to the idea that action depends on reliable data, which would then help to prove that progress has been made – an idea that is central to the apparatus of SDG indicators. Knowledge and evidence-based decision-making are also part of the 'data revolution' for all UN organisations, leading to 'A World That Counts' (UNDATA Revolution 2014). The data revolution is closely related to creating an exhaustive information and knowledge device that overcomes asymmetric reporting and, ideally, informs and transforms societies. One of the more recent developments concerns the shift in measuring large sets of goals. The lessons learned from the MDGs concerned the set-up of the process but, more importantly, the key role of data:

A key lesson from the MDGs is that we need more and better data to monitor the implementation of the SDGs. We need a true 'data revolution' with new sources of data and better integration of statistics into decision-making … We need to go beyond the 'tyranny of averages', and ensure that the SDGs can reflect the needs of the least fortunate. (UNECE 2015)

Measuring through indicators instead of statistics and averages was, accordingly, regarded as necessary to counter existing inequalities in the depiction of problems and to identify problems in a way that paves the way for better, more adequate policies. However, different underlying logics can lead to inconsistency, as '(t)here has been a disconnect between the technical, quantitative SDG monitoring and the political, qualitative SDG reporting process at the High-level Political Forum, which is more accustomed to "adopted" inter-governmental decisions, negotiated and then nearly cast in stone' (Kapto 2019: 135). The self-perpetuation of both the measuring tools and their ideational support finds its (temporary) culmination in the SDGs, which embody and catalyse the enormous trust that measuring will ensure the necessary progress for global development in spite of all the known pitfalls and flaws of the process.

COMPETITION AND CUSTODIANSHIP

When the SDG agenda was agreed on, one of the main claims was that 'A sound indicator framework will turn the SDGs and their targets into a management tool to help countries develop implementation strategies and allocate resources accordingly, as well as a report card to measure progress towards sustainable development and help ensure the accountability of all stakeholders for achieving the SDGs' (Sustainable Development Solutions Network 2015: 2). Since this belief has become deep-seated in the field of development, creating and using indicators has paved the way for procedures and conditions that continue to ensure the creation and use of new indicators, aggregate indices etc. Indicator monitoring has been anchored in the institutional architecture of the UN (SDG Tracker 2015) by appointing custodians for each indicator, usually international organisations that have long-standing expertise in a field.

Some indicators are monitored by more than one organisation; custodian agencies are responsible for compiling and verifying country data and metadata, and for submitting the data (including regional and global aggregates) to the United Nations Statistics Division (UNSD). If data is incomplete or incoherent, custodian agencies should revise it in collaboration with countries and ensure that country data is approved before submitting it to UNSD. Custodian agencies – created because UN agencies wanted their place at the SDG table (Kapto 2019) – both administer data and strengthen their own work by drawing on this data. This creates implicit path dependencies in the ways indicators are tied to goals and targets and can strengthen the position of certain organisations in the field. In some cases, several different indicators were proposed under one target; then 'precedence was in general given to the proposals by agencies with a mandate in the specific area and/or already responsible for global monitoring on the specific indicator' (Ordaz 2019). UNDP, to give one example, pushed early on in the process of formulating the SDGs for a goal on 'governance' (UNDP 2016), which resulted in several targets in Goal 16, for which UNDP now (partly) acts as a custodian agency (UNStats 2022a). Agenda-setting was one way of becoming involved in the SDGs, but organisations are now also engaged in other activities that improve the process and, as a further benefit, may give them some clout – such as the World Health Organization, which acts as a coordinator in the Health Data Collaborative (van Driel et al. 2022). Competition for material support and for influence over knowledge in the field can, however, take very different forms. A mutual influence of indicator and organisational activity (in line with an organisational identity) may thus create benefits for the chances of implementation and for the organisation alike (Freistein 2015), but at the same time it narrows down the possible interpretations of the goal. Target 1.1 (By 2030, eradicate extreme poverty for all people everywhere, currently measured as people living on less than $1.25 a day) has two interpretations of indicator 1.1.1 (Proportion of population below the international poverty line, by sex, age, employment status and geographical location (urban/rural)): one monitored by the World Bank, the other by the International Labour Organization (ILO). In its rationale for the indicator, the World Bank points to global poverty reduction, to its data, its reports and its long-standing knowledge – one such reference quotes Martin Ravallion (2010), then director of the World Bank's research department – and to its tools, like PovcalNet, a poverty calculation instrument. The ILO, on the other hand, makes a short reference to 'the working poor' and the relationship between poverty and employment. Both rationales are perfectly in line with the organisations' respective profiles but offer different interpretations of a global goal. Both organisations can expect gains in legitimacy but create path dependencies based more on their organisational context than on the spirit of the development goals. Path dependencies are also created because the production of indicators and other measurable information is so costly – given the data needed, the staff employed and the longevity of the process (Cobham 2014).
Indicators are rarely explicitly abandoned, but are refined, supplemented by further information or subtly replaced by competing indicators in the same field. Competition between different actors for influence over the agenda remains a factor – among international organisations themselves, but also between civil society actors and members of the expert groups (Kapto 2019). For instance, as mentioned above, the Human Development Index (HDI), one of the most successful products of the UNDP as both index and ranking, has two versions, of which the newer one adds information on inequality (IHDI); a gender-disaggregated development index has also been added, and additional calculations, based on the same data, distinguish the two indices.

All versions of the HDI remain in place. While the information is, in fact, different, we can also see how relevant data has become in influencing what is measurable and, in turn, what can be discussed as an important issue. Data in these cases enables agenda-setting, as in the revival of a debate about 'global inequality' – understood as international asymmetries in income – which was in part initiated by the more recent availability of larger household-level data samples, which in turn led to new research in the field of inequality (Kanbur and Lustig 1999). Where at first only selected experts, for instance at the World Bank, talked about 'global inequality', inequalities became mainstreamed into the development agenda in relation to poverty, particularly after the financial crisis of 2008/9. We can tentatively identify a trend in which methodological and technological innovations in data production and processing become very closely intertwined with the agenda-setting process. Measurability can thus matter as much for the emergence and relevance of a policy issue as reporting from field offices or surveys, particularly because organisations are often dependent on data from external sources. Since the creation of indicators involves expert knowledge and feasibility concerns that reflect the specific capabilities and culture of organisations (Vetterlein 2012), access to data, expertise and strong organisational profiles have become important preconditions for the successful work of international organisations. Organisations are producers of indicators, some for their own field, some for others too: 'The production of indicators depends not only on expert opinion or on the relevant epistemic community but also on administrative infrastructures that collect and process data and on the larger institutional setup of which they are part. Successful indicators … are typically backed by powerful institutionalized organisations …' (Rottenburg and Merry 2015: 4). As we can observe in the changing environments of international organisations, which adapt their policies and staff recruitment to the importance of measuring for the development field, the need for specific expertise has created path dependencies that structure the ways in which large development organisations operate.

Rigorous, defensible and enduring systems of quantification require expertise, discipline, coordination and many kinds of resources, including time, money, and political muscle. This is why quantification is often the work of large bureaucracies … This is especially true when numbers circulate to places that are removed from the bureaucracies that manufactured them. (Espeland and Stevens 2008: 411)

The establishment of bureaucratic apparatuses and organisational change have practical reasons, like the allocation of resources, but they also have to do with external pressure for the success of long-term processes and organisations' credibility as experts in a field – which is why data is crucial. Measuring (in the form of indicators, benchmarks, rankings etc.) means taking responsibility for solving a problem that has been measured and implies the competence and, in turn, the mandate of an organisation. What is more, the mere observation of an ever-growing number of new indicators and of the resources going into the allocation and processing of data is striking. The need for individual states to invest in human resources and the accumulation of data (e.g. household surveys, censuses, geospatial data) for the SDG indicators is portrayed as pressing – perhaps even more pressing than addressing the 169 targets, since data is seen as a precondition for even tackling the SDGs. As observers have stated, the realisation of the SDGs has been extremely costly; providing good data has become an important cost factor, as the 2018 report 'Data for Development' makes clear (OECD 2021). It claims that

a total of US$1 billion per annum will be required to enable 77 of the world's lower-income countries to catch-up and put in place statistical systems capable of supporting and measuring the SDGs. Donors must maintain current contributions to statistics, of approximately US$300 million per annum, and go further, leveraging US$100–200 million more in Official Development Assistance (ODA) to support country efforts.

Even though the data revolution helps to tackle relevant problems, the logic of data as key is indicative of an organisational development towards data-based epistemic regimes, which will then keep calling for more and better data, new indicators etc.; this creates self-entanglement and makes it hard to suggest alternatives. Thus, in a field that is both fragmented and characterised by competition for different material and immaterial resources, with path dependencies created by indicator use and reinforced by the activities of custodian agencies, a reversal of the trend towards measuring has become almost inconceivable.

CONCLUSION

The observation of a massive quantification apparatus that characterises the field of development aid – involving new forms of expertise, financing for data production and even the idea that quantification can be realised as a common, participatory project such as the SDGs process – cannot simply be explained by the functionality of numbers. The preoccupation with measurability shows that measurable numbers represent a surplus of information that continues to shape how they will be employed and discussed. The fact that a whole process has been organised around providing suitable measurement can be seen as indicative of numbers' value as a symbol of rational, process-driven and fact-based politics. This quest for more and better measures derives from the hope that such unfathomable challenges as finally eradicating poverty and hunger, ensuring the survival of humankind in spite of climate change, and many other problems that have haunted efforts of global development can finally be met. The technocratic practice of measuring can thus also be seen as an indicator of hope in light of the hopelessness that characterises the daily practices of development cooperation.

Measuring in global development cooperation has become such a central practice that it has created new organisational structures, generated an enormous demand for human and financial resources and to some extent closed off alternative ways of monitoring progress. Occasionally, actors in the field reflect on this and suggest different forms of engaging with their stakeholders or gaining knowledge. In particular, methods such as 'narratives'6 or large-scale surveys7 that are meant to engage with the perceptions and needs of the people targeted by development measures have occasionally been used. However, since the investment in expertise and data production has been so massive, the trend towards measuring will be very hard to challenge or even reverse. The SDGs, as a complex, huge endeavour, suggest universality and participation and promote goals that most of us would easily support.

But the SDGs, like the MDGs, may even close off democratic space, not because their 'accountability' mechanisms are weak, but because they promote a dominant language that frames development debates in a technical, depoliticised way. Issues of global governance may be left out in response to the imperatives of simplification, quantification, and concreteness. (Fukuda-Parr and McNeill 2015: 16)

The dilemma created by relying on measurability and accountability as key success factors of the SDGs, while producing new challenges because of these very aims, cannot easily be resolved.

NOTES

1. Since the late 1990s, Southern donors have also increasingly entered the arena; however, it remains difficult to assess their contributions, since aid flows from emerging donors are not systematically monitored. Some data is collected from the Aid data (2022) website.
2. The Busan declaration states as the main effectiveness principles: 'Ownership of development priorities by developing countries; A focus on results; Partnerships for development; Transparency and shared responsibility.' See OECD (2011).
3. Three datasets exist to that end: (1) Recipient: containing data on the level of partner countries; (2) Providers: containing data on the level of development partners; (3) Provider-Recipient: containing disaggregated data of a specific development partner in a given partner country (see OECD 2019).
4. See, for example, such initiatives as the triangular cooperation between China, the Netherlands and the Food and Agriculture Organization, in which China and the Netherlands are the main donors for a joint project. See Food and Agriculture Organization of the United Nations (2022).
5. See UNStats (2022b). Tier 1: Indicator is conceptually clear, has an internationally established methodology and standards are available, and data are regularly produced by countries for at least 50 per cent of countries and of the population in every region where the indicator is relevant; Tier 2: Indicator is conceptually clear, has an internationally established methodology and standards are available, but data are not regularly produced by countries; Tier 3: No internationally established methodology or standards are yet available for the indicator, but methodology/standards are being (or will be) developed or tested.
6. See, for instance, the World Bank's section 'Stories' at https://datatopics.worldbank.org/world-development-indicators/stories.html (accessed: 12 May 2023).
7. See, for instance, the UNDP's Peoples' Climate Vote at https://www.undp.org/press-releases/worlds-largest-survey-public-opinion-climate-change-majority-people-call-wide-ranging-action (accessed: 12 May 2023).

REFERENCES

Adams, B. (2019). Commentary on special issue: Knowledge and politics in setting and measuring SDGs numbers and norms. Global Policy, 10, 157–8. https://doi.org/10.1111/1758-5899.12639.
Aid data (2022). Available at: https://www.aiddata.org/ (accessed: 10 January 2023).
Anand, P., Hunter, G., Carter, I., Dowding, K., Guala, F., & van Hees, M. (2009). The development of capability indicators. Journal of Human Development and Capabilities, 10(1), 125–52.
Avendano, R., Jütting, J., & Kuhm, M. (2021). Counting the invisible: The challenges and opportunities of the SDG Indicator Framework for Statistical Capacity Development. In S.C. Chaturvedi, H. Janus, S. Klingebiel, L. Xiaoyun, A. deMello e Souza, E. Sidiropoulos, & D. Wehrmann (Eds.), The Palgrave handbook of development cooperation for achieving the 2030 Agenda (pp. 329–45). Palgrave Macmillan.
Bakonyi, J. (2018). Seeing like bureaucracies: Rearranging knowledge and ignorance in Somalia. International Political Sociology, 12(3), 256–73.
Best, J. (2014). Governing failure: Provisional expertise and the transformation of global development finance. Cambridge University Press.
Best, J. (2017). The rise of measurement-driven governance: The case of international development. Global Governance, 23(2), 163–81.
Binnendijk, A. (2000). Results based management in the development co-operation agencies: A review of experience background report. DAC Working Party on Aid Evaluation. Available at: https://www.oecd.org/development/evaluation/dcdndep/31950852.pdf (accessed: 10 January 2023).

Broome, A., & Quirk, J. (2015). Governing the world at a distance: The practice of global benchmarking. Review of International Studies, 41(5), 819–41.
Center for Global Development. (2021). The Commitment to Development Index. Available at: https://www.cgdev.org/cdi#/ (accessed: 10 January 2023).
Clegg, L. (2010). Our dream is a world full of poverty indicators: The US, the World Bank, and the power of numbers. New Political Economy, 15(4), 473–92.
Cobham, A. (2014). Uncounted: Power, inequalities and the post-2015 data revolution. Development, 57, 320–32.
Cooley, A., & Snyder, J. (Eds.) (2015). Ranking the world: Grading states as a tool of global governance. Cambridge University Press.
Davis, K., Fisher, A., Kingsbury, B., & Engle Merry, S. (2012). Governance by indicators: Global power through classification and rankings. Oxford University Press.
Doessel, D.P., & Gounder, R. (1994). Theory and measurement of living levels: Some empirical results for the Human Development Index. Journal of International Development, 6, 415–35.
Donortracker. (2021). A new era? Trends in China's financing for international development cooperation. Available at: https://donortracker.org/insights/new-era-trends-chinas-financing-international-development-cooperation (accessed: 10 January 2023).
Espeland, W., & Stevens, M. (2008). A sociology of quantification. European Journal of Sociology, 49(3), 401–36.
Fisher, A., & Fukuda-Parr, S. (2019). Introduction – data, knowledge, politics and localizing the SDGs. Journal of Human Development and Capabilities, 20(4), 375–85.
Food and Agriculture Organization of the United Nations. (2022). Second Annual Joint Project Steering Committee meeting of the FAO-China Triangular Cooperation Project with the Netherlands. Available at: https://www.fao.org/partnerships/south-south-cooperation/news/news-article/en/c/1601779/ (accessed: 10 January 2023).
Fougner, T. (2008). Neoliberal governance of states: The role of competitiveness indexing and country benchmarking. Millennium, 37(2), 303–26.
Freistein, K. (2015). Effects of indicator use: A comparison of poverty measuring instruments at the World Bank. Journal of Comparative Policy Analysis, 18(4), 366–81.
Freistein, K., & Mahlert, B. (2016). The potential for tackling inequality in the Sustainable Development Goals. Third World Quarterly, 37(12), 2139–55.
Fukuda-Parr, S., & McNeill, D. (2015). Post 2015: A new era of accountability? Journal of Global Ethics, 11(1), 10–17.
Fukuda-Parr, S., & McNeill, D. (2019). Knowledge and politics in setting and measuring the SDGs: Introduction to special issue. Global Policy, 10(S1), 5–15.
Janouskova, S., Hak, T., & Moldan, B. (2018). Global SDGs assessments: Helping or confusing indicators? Sustainability, 10(5), 1540.
Janus, H., & Esser, D. (2022). New standard indicators for German development cooperation: How useful are 'numbers at the touch of a button'? IDOS Policy Brief, No. 7/2022 (German Institute of Development and Sustainability, IDOS). https://doi.org/10.23661/ipb7.2022.
Jerven, M. (2012). An unlevel playing field: National income estimates and reciprocal comparison in global economic history. Journal of Global History, 7(1), 107–28.
Jerven, M. (2013). Poor numbers: How we are misled by African development statistics and what to do about it. Cornell University Press.
Kanbur, R., & Lustig, N. (1999). Why is inequality back on the agenda? Cornell University Press.
Kapto, S. (2019). Layers of politics and power struggles in the SDG indicators process. Global Policy, 10(1), 134–6.
Krause Hansen, H., & Porter, T. (2012). What do numbers do in transnational governance? International Political Sociology, 6(4), 409–26.
Lie, S.J.H. (2015). An ethnography of the World Bank-Uganda partnership. Berghahn Books.
Lie, S.J.H., & Sending, O.J. (2015). The limits of global authority: World Bank benchmarks in Ethiopia and Malawi. Review of International Studies, 41(5), 993–1010.
Linsi, L., & Mügge, D. (2019). Globalization and the growing defects of international economic statistics. Review of International Political Economy, 26(3), 361–83.

McGillivray, M., & White, H. (1993). Measuring development? The UNDP's Human Development Index. Journal of International Development, 5, 183–92.
Merry, S. (2011). Measuring the world: Indicators, human rights, and global governance. Current Anthropology, 52(3), S83–S93.
Mitchell, I. (2020). Measuring development cooperation and the quality of aid. In S.C. Chaturvedi, H. Janus, S. Klingebiel, L. Xiaoyun, A. deMello e Souza, E. Sidiropoulos, & D. Wehrmann (Eds.), The Palgrave handbook of development cooperation for achieving the 2030 Agenda (pp. 247–70). Palgrave Macmillan.
Mügge, D. (2016). Studying macroeconomic indicators as powerful ideas. Journal of European Public Policy, 23(3), 410–27.
Nahmias, P., & Krynsky Baal, N. (2019). Including forced displacement in the SDGs: A new refugee indicator. Available at: https://www.unhcr.org/blogs/including-forced-displacement-in-the-sdgs-a-new-refugee-indicator/ (accessed: 10 January 2023).
OECD. (2007). Glossary of statistical terms. Available at: https://stats.oecd.org/glossary/detail.asp?ID=7230 (accessed: 10 January 2023).
OECD. (2011). The Busan Partnership for Effective Development Co-operation. Available at: https://www.oecd.org/development/effectiveness/busanpartnership.htm (accessed: 10 January 2023).
OECD. (2019). Making development co-operation more effective: Progress report 2019. Available at: https://effectivecooperation.org/landing-page/2018-monitoring-results (accessed: 10 January 2023).
OECD. (2021). Development co-operation report 2021: Shaping a just digital transformation. Available at: https://www.oecd.org/dac/development-co-operation-report-20747721.htm (accessed: 10 January 2023).
Ordaz, E. (2019). The SDGs indicators: A challenging task for the international statistical community. Global Policy, 10(1), 141–3.
Ravallion, M. (2010). Mashup indices of development. Working Paper No. 5432. The World Bank Development Research Group.
Rottenburg, R., & Merry, S.E. (2015). The world of indicators: The making of governmental knowledge through indicators. In R. Rottenburg, S.E. Merry, S. Park, & J. Mugler (Eds.), The world of indicators: The making of governmental knowledge through indicators (pp. 1–33). Cambridge University Press.
SDG Tracker. (2015). Measuring progress towards the Sustainable Development Goals. Available at: https://sdg-tracker.org/about (accessed: 10 January 2023).
Seabrooke, L., & Sending, O.J. (2020). Contracting development: Managerialism and consultants in intergovernmental organizations. Review of International Political Economy, 27(4), 802–27.
Servén, L., Chang, C.C., & Fernández-Arias, E. (1999). Measuring aid flows: A new approach. World Bank Policy Research Paper.
Sustainable Development Solutions Network (SDSN). (2015). Indicators and a monitoring framework for Sustainable Development Goals: Launching a data revolution for the SDGs. Available at: https://resources.unsdsn.org/indicators-and-a-monitoring-framework-for-sustainable-development-goals-launching-a-data-revolution-for-the-sdgs (accessed: 10 January 2023).
Total Official Support for Sustainable Development. (2022). Available at: https://www.tossd.org (accessed: 10 January 2023).
UNDATA Revolution. (2014). A world that counts: Mobilising the data revolution for sustainable development. Available at: https://www.undatarevolution.org/wp-content/uploads/2014/11/A-World-That-Counts.pdf (accessed: 11 September 2023).
UNDP. (2016). Final report on illustrative work to pilot governance in the context of the SDGs. Available at: https://www.undp.org/publications/final-report-illustrative-work-pilot-governance-context-sdgs (accessed: 10 January 2023).
UNDP. (2021). 2021 Global Multidimensional Poverty Index (MPI). Available at: https://hdr.undp.org/content/2021-global-multidimensional-poverty-index-mpi#/indicies/MPI (accessed: 10 January 2023).
UNDP. (2023). Inequality-adjusted Human Development Index (IHDI). Available at: https://hdr.undp.org/inequality-adjusted-human-development-index#/indicies/IHDI (accessed: 10 January 2023).
UNECE. (2015). From MDGs to SDGs – what have we learned? Available at: https://unece.org/general-unece/press/mdgs-sdgs-what-have-we-learned (accessed: 10 January 2023).

United Nations. (2015a). The Millennium Development Goals Report 2015. Available at: http://www.un.org/millenniumgoals/2015_MDG_Report/pdf/MDG%202015%20rev%20%28July%201%29.pdf (accessed: 10 January 2023).
United Nations. (2015b). Third International Conference: Financing for Development. Available at: https://sustainabledevelopment.un.org/index.php?page=view&type=400&nr=2051&menu=35 (accessed: 10 January 2023).
United Nations. (2019). #TheWorldWeWant. Available at: https://www.un.org/en/exhibits/page/theworldwewant (accessed: 10 January 2023).
United Nations: Dag Hammarskjöld Library. (2022). UN documentation: Development: Millennium Development Goals (2000–2015). Available at: https://research.un.org/en/docs/dev/2000-2015 (accessed: 10 January 2023).
UNStats. (2018). Sustainable Development Goals report: A data revolution in motion. Available at: https://unstats.un.org/sdgs/report/2018/data_revolution (accessed: 10 January 2023).
UNStats. (2022a). SDG indicators: Data collection information & focal points. Available at: https://unstats.un.org/sdgs/dataContacts/ (accessed: 10 January 2023).
UNStats. (2022b). Sustainable Development Goals; IAEG-SDGs: Tier classification for global SDG indicators. Available at: https://unstats.un.org/sdgs/iaeg-sdgs/tier-classification/ (accessed: 10 January 2023).
van Driel, M., Biermann, F., Kim, R.E., & Vijge, M. (2022). International organisations as 'custodians' of the sustainable development goals? Fragmentation and coordination in sustainability governance. Global Policy, 13(5), 669–82. https://doi.org/10.1111/1758-5899.13114.
Vetterlein, A. (2012). Seeing like the World Bank on poverty. New Political Economy, 17(1), 35–58.
Ward, M. (2004). Quantifying the world: UN ideas and statistics. Indiana University Press.

19. Measuring democracy: capturing waves of democratization and autocratization

Marianne Kneuer

INTRODUCTION

Research in comparative politics has always been characterized by the contrast between democracy and autocracy and by the quest to define both concepts and to explain domestic structures, actors' preferences and behaviour, institutional mechanisms and processes. The necessity of a rigorous understanding of democracy is equally relevant for international relations research, which is guided by the question of to what extent the regime type – democratic or autocratic – has an effect on foreign policy behaviour, especially on war proneness or peacefulness, and of how the regime type influences other policy areas of international politics, such as trade relations, environmental or energy policy. Both for the assessment and analysis of domestic politics and for international interactions, the regime type has critical theoretical and normative implications. Think of democratic peace theory and the relevance of identifying what is a democracy and what is an autocracy. Regardless of whether democracy is used as an explanatory factor for domestic or international actions, structural conditions or policy outcomes, or whether factors influencing democracy are investigated, it is empirically critical to have clarity about (a) whether a country is actually democratic or not, (b) how high (or low) its quality of democracy or its democraticness is, (c) possible changes within the group of democracies or moves away from democracy, and (d) the comparability of all these aspects on a global scale. At the same time, any assessment of the democraticness of a country is based on theoretical concepts that are linked to normative pre-assumptions, such as that democracies are superior to other types of regime when it comes to protecting political rights and freedoms or controlling rule and those in power.

The research field of democracy measurement emerged only in the mid-twentieth century and was mainly driven by the real-world developments of democratization or reverse moves. Such developments underscored the increased scientific need for measuring and distinguishing regime types as well as capturing grades within the regime types themselves. To be precise, it was never just about measuring democracy, but always also about capturing autocracies and demarcating the two regimes, that is, also defining the threshold between the two types of regime (Kneuer 2020, p. 65ff.). In this respect, what is generally called measuring democracy, and often labelled democracy indices, is strictly speaking the measurement of the occurrence and variance of regime types and subtypes, going well beyond democracies alone.1

As democratizations have proliferated, especially in the last quarter of the twentieth century, the search for concepts and methods to measure regime types has intensified. The push from 1974 onward, identified by Huntington as the third wave of democratization (Huntington 1991), as well as the emergence of numerous new states and their democratization after 1989, led to a boom in democracy measurement tools as well as to the refinement of existing tools.

Within academia, but also among policy makers, the end of authoritarian, mostly rightist, military regimes in Southern Europe and Latin America, and later on in Asia, produced a new research thread on 'democratic transitions', but also a positive prospect of broadening the democratic spectrum on a global scale. The implosion of the communist bloc generated an even greater euphoria, prematurely and mistakenly referred to as the end of history. Democracy became the dominant global script of the 1990s, coupled with high expectations that democratization would be accompanied by more development and peacefulness (see the UN's Agenda for Democratization, Boutros Boutros-Ghali 1996). Driven by this spirit, democracy promotion has, since the end of the Cold War, become a growth sector both on the practical political side and in academic work, and it has moved up the foreign policy agenda of Western state actors and international organizations. With the European Union (EU), which was intensively involved in post-socialist transformations, another central actor entered the international arena. At the same time, the number of non-state actors mushroomed and engaged in the democracy promotion enterprise. In this respect, it can be summarized that the increased importance of democracy in the international community's perception, but also of democracy promotion, has in turn fuelled research interest in democracy measurement.

At the same time, the democratizations since 1974 produced two important insights. Firstly, it became apparent that beyond consolidated democracies and full autocracies there also exist subtypes that only partially cover the features of a full democracy or full autocracy. Thus, research was confronted with an increasingly expanding 'grey zone', which led to the conceptualization of regime subtypes such as delegative democracy (O'Donnell 1994), electoral democracy (Diamond 1996), illiberal democracy (Zakaria 1997), deficient democracies (Croissant & Thiery 2000; Merkel et al. 2003), semi-authoritarianism (Ottaway 2003), electoral authoritarian regimes (Schedler 2006) and competitive authoritarian regimes (Levitsky & Way 2010). Other approaches suggest that there is a third regime type between democracy and autocracy, understood as hybrid regimes (Diamond 2002; Karl 1995; Morlino 2009). Secondly, the studies focusing on the consolidation of democracies led to the research field of quality of democracy, which relies on a fine-grained measurement of democratic properties within the group of consolidated democracies (Altman & Pérez-Liñán 2002; Beetham et al. 2008; de la Fuente et al. 2020; Diamond & Morlino 2005; Geissel et al. 2016; Munck 2016). Thus, the approaches refining the continuum between democracy and autocracy had an impact on democracy measurement, as new concepts aimed to capture the subtypes and therefore needed to define thresholds between those subtypes. Generally, the spirit had already shifted towards more scepticism, replacing the democratic euphoria.

Beyond the scientific quest for capturing democraticness, there is also an important practical political need for democracy measurement. A number of policies, such as development cooperation, democracy assistance and democracy promotion, require a clear classification and a deeper understanding of how advanced (or not) a young democracy, a transition country, etc. is, and of which democratic structures, principles or processes are still deficient and therefore need to be supported. Thus, international organizations such as the EU and the Organization of American States (OAS), governments and non-governmental organizations (NGOs) use datasets on democracy and politics to determine which countries should be allocated funds or supported by programmes (Munck 2009, pp. 1–13). The assessment of changes – be it democratic progress or regression – also plays a role for governmental and non-governmental actors in their work on human rights, women's rights, the strengthening of civil society, justice, parliaments, etc. Equally for practitioners, the waves of democratization (but for some years now also waves of autocratization) entail an increased need for information, classifications and evaluations of states and their democratic status.

State actors as well as NGOs are increasingly relying on democracy measurement instruments, also because the landscape of regimes has differentiated into sub-regime types, and consequently classification requires more complex concepts and finer methods. In the real world of politics, a considerable amount of money is spent on the measurement of democratic quality as well as on the promotion and consolidation of democracy around the world (Geissel et al. 2016). Policy makers need to know how democratic quality can be measured and how good the quality of democracy is in a country. Incomplete and inadequate measurements and indices are not only pricey; they can lead to wrong results and investments (Geissel et al. 2016, p. 574). Thus, democracy measurement has become a growth industry in the past decades, and an indispensable element of research not only in democracy and autocracy studies, but beyond that also in comparative politics and international relations in general, and finally also for policy makers. No doubt, the relevance of this highly dynamic field of research and application will continue to develop, not least because the conjunctures of democratization and autocratization and their real-world implications make this necessary.

THE DEVELOPMENT OF CONCEPTS AND METHODS

Today, more than a dozen different measurements claim to evaluate democracy and the quality of democracy. Given that democracy is a contested concept, there is, however, no consensus about the underlying models of democracy, concepts, variables, yardsticks and methods (Geissel et al. 2016). Hence, both theory and methodology continue to be discussed, with most recent advances – due to new avenues for data extraction, mining, collection and analysis – concentrated on methodological issues, which are very much the focus of attention. This chapter presents the most relevant and most used measurement approaches and indices. It is important to underline that the choice of one tool or another is guided – both for scholars and for the policy makers who draw on them – by their own understanding of democracy or their expectations of a particular model of democracy, and by what it can uncover.

A first distinction can be drawn between 'thick' and 'thin' concepts of democracy. Measurement of democracy started by relying on a thin model. For the conceptualization of democracy and its measurement, the work of Robert Dahl (1971, 1989) has been highly influential. Several existing indices build on Dahl's minimalist version and start with the root concept consisting of participation – citizens choose their representatives via elections – and competition between parties and candidates, or rely on his catalogue of definitory elements of democracy.2 Later approaches criticize this narrow concept of democracy for lacking the component of control, whether in the sense of horizontal accountability, the rule of law or effective control mechanisms at the political, administrative and intermediary levels. Especially in the face of the ambivalent results of democratization and democratic consolidation in the 1990s and 2000s, which indicated that democratization goes beyond holding elections and is not a linear and irreversible process, there was a call for broader – hence 'thicker' – concepts of democracy (Carothers 2002; Lauth 2004; Merkel 2004; Zakaria 1997). This leads to the differentiation between 'electoral democracies', reflecting the minimal requirements of regular, free and fair electoral competition and universal suffrage, and 'liberal democracy', encompassing also features such as the absence of reserved domains, horizontal accountability, and extensive provisions for political and civic pluralism (Diamond 1996).

In this vein, more recent measurements add variables attributing relevance to the control of power to their concepts and measurements.3 Moreover, a new research interest arose in investigating specifically the 'quality of democracy' (Beetham 2004; Diamond & Morlino 2005). The endeavour of capturing degrees of democraticness within democracies – understandably – went along with attempts to propose measurements. The Democracy Barometer (see website), which focuses exclusively on democracies, is the most consistent result of this. A completely new and most recent stage in the development of democracy measurement was established by the idea of capturing democracy as a multidimensional concept (Coppedge et al. 2011), realized in the Varieties of Democracy project (V-Dem).

The following overview offers a chronological tracing of the important stages in the development of democracy measurement, its concepts and methods, and presents the most relevant indices. This overview does not intend to provide a completely comprehensive picture of all quality-of-democracy indices and does not capture all debates, distinctions and specifications. Nor does it address what can be considered a new and recent approach, the assessment of democracy measurement itself, started by the pioneering work of Gerardo Munck and Jay Verkuilen (2002, 2009). Democracy measurement can be considered a 'young' field of research, but at the same time a highly dynamic one. Different phases of democracy measurement have repeatedly been recognized (most recently, see Giebler et al. 2018). While new approaches for conceptualizing democracy emerged in the 1950s and 1960s, systematic measurement only came to be applied later on. In the following, three phases are presented.

FIRST COMPARATIVE INDICES

The first efforts at measuring democracy go back to the aftermath of the Second World War. The breakdown of democracies in the 1920s and 1930s and its consequences drew the interest of scholars to the question of the stability of democracies. The first 'truly comparative measurements of democracy' (Giebler et al. 2018), however, were produced later, in the 1970s and 1980s during the Cold War. Four widely important indices were generated during this period: Freedom in the World (1973), the Polity Index (1975), the Vanhanen Index (1971) and Bollen's Political Democracy Index (1980).

Freedom in the World is an index that assesses the level of freedom globally.4 According to its self-description it is the 'most widely read and cited report of its kind' (see Freedom House website). What certainly can be stated is that Freedom in the World – mostly labelled simply as Freedom House (FH) – has indeed been widely used by scholars and pundits, not least because of its easily accessible and understandable systematics. It was launched in 1973 by the NGO Freedom House, which had been engaged in the protection of freedom since its foundation in 1941. FH cannot be qualified as a democracy index in the strict sense, as it captures mainly freedom, that is, political rights (electoral process, political pluralism and participation, functioning of the government) and civil liberties (freedom of expression and belief, associational and organizational rights, rule of law, personal autonomy and individual rights). FH covers 195 countries and 22 territories, but its measurement only embraces the years from 1973 onwards, which limits longitudinal perspectives.

The Polity Index was mainly generated by Ted Gurr, and the later updates evolved in conjunction with Keith Jaggers and Monty G. Marshall (Marshall & Gurr 2020). The Polity project developed in five phases (Polity I–V), embodying methodological refinements and the expansion of data. As of today, Polity V covers the time span between 1800 and 2018. The data are hosted by the Center for Systemic Peace. The Polity Index is limited to states with a total population of 500,000 or more in the most recent year, which reduces the number of covered countries to currently 167. When it was developed in the mid-1970s, one of the innovative elements of Polity was the incorporation of executive factors into the measurement. While most other measurements are based on rights, freedoms and participation, Polity assesses the competitiveness of executive recruitment, the constraints on the chief executive, and the competitiveness of political participation. Polity uses three regime categories: democracies, autocracies and anocracies, the latter understood as a regime type with a mix of institutional characteristics from both democracies and autocracies. The labels semi-democracy or hybrid regime have also been used. Polity IV also provides a further differentiation into 'full democracy' and 'democracy' as well as open and closed anocracy, while the group of autocracies is not broken down into subtypes.

The Vanhanen Index, the first comparative index of democratization, generated by the Finnish political scientist Tatu Vanhanen, is based on the above-described democracy concept by Dahl. Hence, Vanhanen's definition reflects the understanding of democracy as contestation and participation and results in a highly parsimonious conceptualization, especially because he attributes only one indicator to each component. Thus, Vanhanen operationalizes participation by the share of the total population who voted in the last election and contestation by the vote share of the strongest party in the last election (Vanhanen 1990, 2000). Within the array of measurement methods, Vanhanen's index stands out for not relying on expert opinions or other indices and surveys, but on objective data such as vote shares and voter turnout. This entails several problems, such as the potential influence of electoral systems on parties' vote shares, which might distort the result. Vanhanen's previous index versions started in 1850 and ended in 1998, covering a high number of countries and country-years and allowing a perspective that reaches far back into history. Today, the index encompasses the period from 1810 until 2018 (Finnish Social Science Data Archive; see Vanhanen 2019).
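Vanhanen's published formula (Vanhanen 1990, 2000) is simple enough to restate and compute: competition is 100 minus the vote share of the largest party, participation is the share of the total population that voted, and the Index of Democratization (ID) is their product divided by 100. A minimal Python sketch with invented election figures:

# Vanhanen's Index of Democratization (ID), following the
# operationalization described above; the figures are invented.

def vanhanen_id(largest_party_share, voters, total_population):
    competition = 100 - largest_party_share          # in per cent
    participation = 100 * voters / total_population  # in per cent
    return competition * participation / 100

# Example: the largest party wins 45% of the vote and 6 million
# of a population of 15 million cast a ballot.
print(vanhanen_id(45.0, 6_000_000, 15_000_000))      # -> 22.0

The sketch also makes the distortion problem mentioned above visible: a highly fragmented party system mechanically raises the 'competition' component regardless of how meaningful the contest actually is.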
This approach to data collection, however, also raises concerns, which led Bollen to formulate recommendations for avoiding methodological pitfalls: maximizing the validity of measurement and minimizing measurement error, mostly by statistical methods (Bollen 1990).
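Vanhanen’s reliance on objective electoral statistics makes his approach unusually easy to reproduce. The following minimal sketch (Python, with purely illustrative numbers rather than actual election data) shows how his two components are standardly combined into the Index of Democratization:

    def vanhanen_index(largest_party_vote_share, turnout_share_of_population):
        """Vanhanen's Index of Democratization (ID), cf. Vanhanen (1990, 2000).

        largest_party_vote_share: vote share of the strongest party, in per cent.
        turnout_share_of_population: votes cast as a percentage of the total
        population (not of registered voters).
        """
        competition = 100.0 - largest_party_vote_share   # contestation component
        participation = turnout_share_of_population      # participation component
        # The two components are multiplied and rescaled to a 0-100 range.
        return competition * participation / 100.0

    # Purely illustrative numbers, not real election data:
    print(vanhanen_index(35.0, 55.0))  # (100 - 35) * 55 / 100 = 35.75

The sketch also makes the problem noted above tangible: a fragmented party system mechanically lowers the vote share of the strongest party and thus inflates the competition component, regardless of how democratic the polity actually is.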


SECOND PHASE: THE BLOOMING OF NEW INDICES

In the 2000s new indices for democracy measurement emerged. On the one hand, this can be traced back to the increased interest in following more closely the development of the transition countries that democratized after 1989/90, with more comprehensive conceptualizations of democracy coming into play here. The interest in measuring and understanding how, in particular, the new post-socialist democracies managed (or failed to manage) the regime change towards democracy and achieved democratic consolidation led to new democracy measurements, the most important of which are the Bertelsmann Transformation Index (BTI), which has measured 116 countries (later expanded to 137) every two years since 2006, and Nations in Transit (NiT), a much less comprehensive index that has been monitoring only 29 post-socialist states since 2005. Also in 2006, the Economist Intelligence Unit’s (EIU) Index of Democracy emerged, covering 167 countries (microstates excluded). On the other hand, these indices have obviously been driven by the quest for a more comprehensive conceptual approach to democracy. Hence, BTI, NiT as well as EIU go well beyond the thin understanding of electoral democracy. For its evaluation of democratic transformation, BTI uses five criteria: stateness, political participation, rule of law, stability of democratic institutions, and political and societal integration (quality of representation). Likewise, NiT is based on the broader concept of liberal democracy, resulting in seven criteria that evaluate, besides the electoral process, national and local state institutions, rule of law and corruption, media and civil society. The EIU describes itself as following a thick democracy definition and assesses five criteria: electoral process and pluralism, civil liberties, the functioning of government, political participation and political culture. What these three indices also have in common is that they place value on gradual differentiation between regime subtypes. Here, of course, the concept of democracy again takes centre stage, because the question of which and how many subtypes are established has a considerable influence on the classification of the cases (Table 19.1). Thus, many indices (including older ones such as Polity IV) combine two democratic subtypes in order to differentiate the large group of democracies in a more fine-grained manner. BTI stands out here as it encompasses three subtypes of democracy: ‘fully consolidated’, ‘defective’ and ‘highly defective’ democracies. The BTI’s approach thus contains a conceptual predetermination to expand the scope of democracies very strongly, which can lead to countries falling under the subtype ‘highly defective democracy’ that are no longer considered democracies in other indices. Regarding autocracies, we find variation again, as some indices introduce two subtypes (BTI, NiT) and others do not (Polity IV, EIU). The intermediary area is of interest, as some indices work with hybrid regimes (Polity IV, EIU, NiT), while BTI follows the dichotomous classification into democracy and autocracy, but again with a very wide range of democratic subtypes. These differences in the conceptualization of the regime subtypes can impact the assessments in a quite considerable way.
To give an example: until 2014, BTI categorized Russia as a highly defective democracy and thus subsumed it under the regime type democracy, while in the same time span the EIU classified Russia as a hybrid regime, and Polity IV as an open anocracy. While these indices have gained influence in scholarship, some new measurements have also been generated that offer meaningful approaches but have not achieved wider attention. Two of them deserve mention here: the Global Democracy Ranking, mainly developed by David Campbell, and the Combined Democracy Index by Hans-Joachim Lauth.5

Table 19.1  Regime types and subtypes in selected democracy indices

Index       Democracy                            Intermediates                          Autocracy
Polity IV   – fully democratic                   – open anocracy (mixed type)           – full autocracy
            – democratic                         – closed anocracy (mixed type)
BTI         – fully consolidated democracy                                              – moderate autocracy
            – defective democracy                                                       – full autocracy
            – highly defective democracy
EIU         – full democracy                     – hybrid regime (mixed type)           – autocracy
            – flawed democracy
NiT         – consolidated democracies           – hybrid or transitional regimes       – semi-consolidated autocracy
            – semi-consolidated democracies        (mixed type)                         – consolidated autocracy
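How strongly these subtype boundaries drive classification can be illustrated with a few lines of code. The cutoffs below are invented for a notional 0–10 scale and do not reproduce the actual thresholds of BTI, EIU, Polity or any other index; the point is only that the same underlying score receives different regime labels under differently drawn subtype schemes:

    # Invented cutoffs on a notional 0-10 scale - not the real thresholds
    # of any existing index. Each scheme lists (lower cutoff, label) pairs
    # from the most to the least demanding category.
    SCHEMES = {
        "wide democracy category": [(6.0, "consolidated democracy"),
                                    (4.5, "defective democracy"),
                                    (3.5, "highly defective democracy"),
                                    (0.0, "autocracy")],
        "hybrid-regime category":  [(6.0, "democracy"),
                                    (4.0, "hybrid regime"),
                                    (0.0, "autocracy")],
    }

    def classify(score, scheme):
        for cutoff, label in SCHEMES[scheme]:
            if score >= cutoff:
                return label

    for name in SCHEMES:
        print(name, "->", classify(3.8, name))
    # wide democracy category -> highly defective democracy
    # hybrid-regime category -> autocracy

This is, in stylized form, the Russia constellation described above: a democracy label under one scheme, a hybrid or autocracy label under another.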

Campbell and his index occupy a quite singular position in democracy measurement, as he is one of the few scholars advocating an even more substantial concept of democracy that includes output and performance variables such as gender, economy, knowledge, health and environment. The work of Lauth dates back to the beginning of the 2000s, when he developed a systematic approach to democracy measurement that builds on a three-dimensional understanding of democracy entailing the dimensions of freedom, equality and control, leading to what he calls the 15-fields matrix of democracy. It is this basic three-dimensional concept that guides the Democracy Barometer (DB), so far the only index that concentrates exclusively on democracies and on measuring their quality. Developed in a German–Swiss cooperation by Wolfgang Merkel (Berlin Social Science Center) and Marc Bühlmann (Center for Democracy Studies Aarau) in 2010, the DB measures exclusively established democracies and their democratic quality (Bühlmann et al. 2012). The seventh and most recent version of the DB contains data for 53 countries from 1990 to 2017 (Engler et al. 2020). As already mentioned, the DB implements the measurement of the quality of democracy in the most consistent way, insofar as it only includes those countries that qualify as democracies. Finally, another project belongs to this series of indices emerging in the 2000s, yet it is fundamentally different, since it is dedicated to qualitative assessment, in contrast to most of the other, quantitative democracy measurements. For IDEA, an intergovernmental organization of democracy assistance, David Beetham and his team created a different kind of instrument (Beetham 2004; Beetham et al. 2001). Beetham’s approach is driven by the identification of basic flaws of quantitative measurement (such as qualitative judgements being translated into quantitative measures), leading him to propose a different way of collecting data (qualitative data, complemented with quantitative data where appropriate) and transferring them into an index. Moreover, his intention differs from that of other measurement projects, as he understood the IDEA project as a tool for in-country stakeholders. This understanding as a ‘civil society project’ has several implications: the assessment is conducted by citizens of the country, and interpretation and emphasis are a matter for in-country assessors, as are the selection of evidence, the responsibility for the final judgements and their contextualization, and the mode of presentation (Beetham 2004, p. 5). The concept of democracy is rather maximalist – and thus also opposed to other democracy measurements – as it embraces social and economic rights as well as the international dimension of democracy, a rarely used variable in democracy measurement.

Beetham bases the IDEA project on two main principles (popular control and political equality) and assesses four bundles of elements: citizens’ rights, institutions of representative and accountable government, civil society and popular participation, and democracy beyond the state (Beetham 2004, p. 7). This singular index then constituted the basis for the ensuing Global State of Democracy Index by IDEA (see below).

THIRD PHASE: EXPANSION ON CONCEPT AND MEASUREMENT LEVEL

The current phase can be dated from the 2010s. Regarding the political context, the upheavals in the Arab region (the so-called ‘Arab Spring’) fuelled new hopes for democratizing one of the remaining clusters of hard autocracies in the Near East and Northern Africa. The fact that this did not materialize, and the evidence that democracy promoters did not even come close to unleashing a similar commitment in support of possible democratizations, resulted in an increasingly pessimistic climate concerning democracy that had already begun in the 2000s (Carothers 2009, 2012). Additionally, while Western democracies entered a self-critical phase of debate on the crisis of democratic capitalism after the global recession of the late 2000s, autocratizing countries such as Russia and Venezuela, as well as long-standing autocracies such as China, gained self-assertiveness through their economic successes and began to antagonize the democratic model by nurturing an attractive alternative model of rule (Kneuer and Demmelhuber 2020, p. 6). Indeed, in the following years the scholarly debate intensified over whether a reverse wave was already underway. As controversial as this debate was, there existed quite a broad consensus that the momentum of the third wave was over. The major innovation in democracy measurement since then is the Varieties of Democracy (V-Dem) project (see Varieties of Democracy website), launched in 2011. When introducing their concept, the early authors made it clear that they wanted to avoid the strategy of proposing any particular definition of democracy. Therefore, they constructed five indices covering five dimensions of democracy which are not aggregated. It is left to others to judge how the indices might be combined and aggregated into a summary index (Coppedge et al. 2011, p. 255). What seems a purely academic motivation for presenting a very different conceptual and methodological approach, however, has explicit and implicit implications for political assessments. Thus, the authors emphasize that as soon as a ‘set of indicators becomes established and begins to influence international policymakers, it also becomes fodder for dispute in other countries around the world’. Therefore, they strive to present a set of indicators that can claim ‘the widest legitimacy’ in order to prevent it from being perceived as a ‘tool of western influence or a mask for the forces of globalization (as Freedom House is sometimes regarded)’ (Coppedge et al. 2011, p. 259). Furthermore, the original idea was not to offer a classification scheme for distinguishing democracies and autocracies and their subtypes, or to offer a measurement of democraticness. According to the authors, ‘the goal of summarizing a country’s regime type is elusive’ (Coppedge et al. 2011, p. 258).6 All in all, this concept is demanding and presupposes that each user immerses themselves appropriately in the chain from concept to operationalization and application of indicators. At the same time, because the measurement provided by V-Dem is so disaggregated, it can be used – and this was indeed the intention – by other scholars for their own aggregations; it is thus open for subsequent use and application.
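What such subsequent use can look like may be sketched as follows (hypothetical dimension names and freely chosen weights; V-Dem’s actual variable names and any aggregation rule remain the user’s own analytic responsibility):

    # Hypothetical user-side aggregation of five dimension scores (0-1 scale).
    # The dimension names and weights are illustrative choices, not V-Dem's.
    def aggregate(scores, weights):
        total = sum(weights.values())
        return sum(scores[dim] * w for dim, w in weights.items()) / total

    country_year = {"electoral": 0.62, "liberal": 0.48, "participatory": 0.41,
                    "deliberative": 0.55, "egalitarian": 0.50}

    equal_weights = {dim: 1.0 for dim in country_year}
    electoral_heavy = dict(equal_weights, electoral=3.0)

    print(round(aggregate(country_year, equal_weights), 3))    # 0.512
    print(round(aggregate(country_year, electoral_heavy), 3))  # 0.543

Even this toy example shows why the authors insist on immersion in the whole chain from concept to indicator: the summary score shifts as soon as the weighting, and hence the implicit concept of democracy, changes.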

V-Dem constitutes in many aspects a highly ambitious endeavour: it is based on a complex conceptualization of democracy and a demanding measurement strategy, but also on extraordinarily broad and comprehensive data collection. The index covers the period from 1789 onwards and is thus the only index that encompasses modern history completely. Moreover, V-Dem relies on more than 3000 reviewers and on a highly sophisticated process of monitoring and controlling data collection. Located in Sweden, V-Dem has meanwhile become an influential voice in the chorus of democracy measurement, especially for scholarly work, but has lately also become more active in reaching out to policy makers and making its data more visible and tangible. Thus, since 2017, V-Dem has joined other indices in publishing an annual Democracy Report.

Two equally new indices – Global State of Democracy and Democracy Matrix – followed the idea of the V-Dem initiators of using their data for further aggregation and thus rely to some extent on the V-Dem data. Global State of Democracy (GSoD) is a project of International IDEA (International Institute for Democracy and Electoral Assistance) and has developed a measurement that includes three variants of democracy (high, mid-range and low performance), hybrid regimes and autocracies. The index covers the period from 1975 to 2020. GSoD draws on 12 different data sources, of which the V-Dem data set is the largest. The Democracy Matrix (DemMax) developed by Lauth also uses V-Dem data (see Democracy Matrix website). Lauth builds on his own concept of measuring democratic quality (see above) but now draws on V-Dem data. This makes his index more comprehensive, as it now offers a longitudinal perspective from 1900 to 2020 for 170 countries. Recall the example of Russia’s classification used above: GSoD assesses Russia as a hybrid regime until 2010 and as an authoritarian regime from 2011 on, while DemMax qualifies Russia almost continuously since 1996 as a moderate autocracy (Table 19.2). This spotlight example of the varying classifications of Russia shows that scholars and practitioners need to be well informed about the democracy concept, the inherent principles for defining regime subtypes, and the aspects of measurement and data collection in order to make sense of divergent results.

Table 19.2  Classification of Russia by selected indices for the time period 2008–22

Index           Classification                        Year of identified regime change
Freedom House   2008–22: not free
Polity IV       2000–06: democracy                    2007
                2007–16: open anocracy
NiT             –2008: semi-consolidated autocracy    2009
                2009–: consolidated autocracy
BTI             –2014: highly defective democracy     2014
                2014–: moderate autocracy
EIU             2008–14: hybrid regime                2014
                2014–: authoritarian
GSoD            –2010: hybrid                         2010
                2010–: authoritarian
DemMax          1996–: moderate autocracy

Finally, at the end of this chronological overview, it is important to emphasize some systematic points. In addition to the concepts of democracy outlined in detail here, the methods of data collection have also been both diverse and controversial. Two main oppositions exist: (a) quantitative methods (the vast majority of the indices) versus qualitative methods (e.g., Beetham), and (b) objective data (e.g., Vanhanen, GSoD) versus subjective data, that is, the assessments of evaluators (either in-country, or in-country and outside) (e.g., Beetham, BTI, EIU, V-Dem). To this are added numerous methodological issues that mirror a highly dynamic development, some of which are sketched below. Most importantly, however, the desideratum is to understand the theoretical and methodological differences so that the consequences of applying the various indices can be taken into account (Giebler et al. 2018, p. 1).

KEY CHANGES AND THE MAIN FORCES BEHIND

The tour d’horizon through the development of democracy measurement displays not only a significant variety of approaches; it also reflects a general trend towards expansion and refinement at the levels of conceptualization and methods of measurement, data collection and aggregation. There has been a significant reorientation in terms of the concept of democracy, which has led to a broadening beyond the minimal definition and more gradation within the regime types, in the form of the introduction of at least two subtypes of democracy and often also two subtypes of autocracy or a third type such as hybrid regimes. The indices that have emerged since the 2000s thus allow a more differentiated view of the group of democracies, whose heterogeneity could not be adequately captured before. Thus, FH has often been criticized for grouping dissimilar countries with the same score; think of Austria, Cabo Verde, Palau, Uruguay and Canada all receiving the highest score in 2022 (FH 2022; Coppedge et al. 2011). The development of more substantial democracy concepts, including rule of law or horizontal accountability, has been driven by the empirical evidence that ‘there is a life after elections’ (Zakaria 1997, p. 40), meaning that elections alone do not make a democracy work. Already in the 1990s, but even more so later, when young democracies did not consolidate and remained in a defective, semi-democratic or hybrid state, there was a need to reflect this variety in the measurement. At the same time, technological innovations – above all digitization and the resulting easier generation of considerably larger data sets and storage of information – have enabled a significant expansion of democracy measurement in terms of time periods, but also in terms of the quantity and combination of indicators. Digital data processing also facilitates application by the scientific community, for example by making data packages available to scientists more easily, quickly and broadly, and by making revisions to indices easier and faster. Moreover, there has been advancement and extensive methodological work on reducing measurement errors (see inter alia Armstrong 2009; Pemstein et al. 2010; Treier and Jackman 2008). Furthermore, the new possibilities of interactive data presentation have also simplified use in practice, whether by governments, NGOs or others. Practically all indices today offer interactive tools on their websites, which not only show the classification of a country or region at a glance, but also make the tracking of developments over longer periods visible and illustrative. Finally, this also opens up access to democracy measurement for students (secondary school, undergraduates) or interested citizens, who can thus gain an impression without having to use SPSS or R for complex calculations. Regarding other methodological issues, such as measurement and aggregation rules, there remains considerable variety as well as controversial discussion of the different approaches.

However, what the research branch of democracy measurement has also produced in the meantime is a reflection on its own approach, including in the form of the evaluation of indices. Implicitly, such evaluation has always taken place, in that new measurement instruments were generated in differentiation from, or as further developments of, others. However, the paper by Munck and Verkuilen (2002) and its revision (2009) represent a landmark in that they propose a framework for the systematic assessment of measurement instruments. Munck and Verkuilen identify three key challenges for democracy measurement – conceptualization, measurement and aggregation – and assign tasks to each. Thus, they point out the importance, for the concept of democracy, of correctly identifying attributes and of logically ordering attributes across the different levels of abstraction. For the measurement itself, they focus on the selection of indicators, the measurement level, and the documentation and publication of the coding rules, the coding process and the disaggregated data. Finally, they identify as tasks of aggregation the correct selection of the level of aggregation for the indicators, a correct rule for the aggregation of the attributes, and again the recording and publicizing of the aggregation rules and data (Munck & Verkuilen 2009, p. 15). Indeed, the postulate of data recording and publicizing has made considerable progress in the last decade, which has also been induced by stricter rules of scientific integrity. Many third-party funders in research, as well as scientific journals, now require a corresponding willingness and visibility of data disclosure. Funding agencies do so on the grounds that publicly funded data generation makes these data a public good that should be accessible. Journals are especially concerned with enabling verification as well as replicability of the data. Munck and Verkuilen’s pioneering work has found fertile ground, as several scholars have taken up this quest for systematic evaluation. Following Munck and Verkuilen (2002), Müller and Pickel (2007) propose a concept validation that critically examines the methodological design of an index and its components for reliability and validity, and they add concrete suggestions for operationalizing Munck and Verkuilen’s evaluation criteria in order to make them applicable. The result is a three-step evaluation index that assesses whether each criterion is met, not met, or met at an intermediate level (Müller & Pickel 2007, pp. 520–23; Pickel et al. 2015).
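The weight of the aggregation task in particular is easy to underestimate. A stylized illustration (invented scores on a 0–1 scale) shows that common aggregation rules applied to identical attribute scores embody different theories of democracy and yield different results:

    # Invented attribute scores for one country on a 0-1 scale.
    contestation, participation = 0.9, 0.3

    additive = (contestation + participation) / 2      # substitutable attributes
    multiplicative = contestation * participation      # closer to necessary conditions
    weakest_link = min(contestation, participation)    # strict necessary condition

    print(round(additive, 2), round(multiplicative, 2), weakest_link)  # 0.6 0.27 0.3

Hence Munck and Verkuilen’s insistence that the aggregation rule must be chosen, recorded and published, rather than treated as a technical afterthought.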

POLITICAL AND ADMINISTRATIVE CONSEQUENCES

The large body of literature, particularly since Munck and Verkuilen’s (2002) approach to the systematic evaluation of democracy measurement, has addressed the three key challenges: conceptualization, measurement and aggregation. Giebler et al. (2018) suggest also considering the applicability of democracy measurement. Indeed, this is an aspect that scholars might focus on more systematically in the future. How scholars interpret and practitioners apply indices has a practical implication connected to the ‘glass half-full or half-empty’ question. Thus, FH had already started to detect a global decline of democracy in 2008 (Puddington 2008), and this assessment has been continuously defended ever since. During this period, however, there were other analyses that refuted such a global democratic regression and warned against excessive alarmism (Levitsky & Way 2015; Skaaning & Jiménez 2017), stating that the widespread pessimism presents an overly dramatic storyline (Carothers 2009; Carothers & Youngs 2017).

This raises the question of how to interpret the data that democracy measurement produces, and here ‘the eye of the beholder matters a lot when deciding what to make of these results’ (Moller & Skaaning 2022). Caution is needed here on two counts. First, one must never lose sight of the fact that the underlying conceptualization already has a pre-decisive influence on the classification of countries and the mapping of trends. It is therefore important to note that one challenge is constructing a sound index of democracy, but that equally essential is a ‘profound understanding of the differences between various measures of democracy and their consequences for application’ (Giebler et al. 2018, p. 1). As more and more decision makers have taken note of democracy measurement in recent years and based policy assessments on it, it is important that scientists are aware of the impact on policy makers, pundits, journalists and the public when they interpret the data or – one step earlier – establish their definitions and underpin them with certain methodological choices. Second, caution is needed with regard to the practical effect of democracy measurement. As the third wave of democratization, and especially the 1990s, have shown, conjunctures of international euphoria can mislead. The same, however, applies to conjunctures of over-pessimism or resignation. The myopic idea that the 1990s sealed the final triumph of democracy over dictatorships obscured the view of the dynamics that took place below the bird’s-eye view – of dysfunctional developments or undesirable outcomes that became entrenched in the democratization process. Similarly, over-dramatizing the reverse phenomenon – autocratization – can lead to misperceptions. A recent debate in the community of democracy measurement exemplifies the relevance of how such waves are defined, operationalized and measured. A widely received paper by Lührmann and Lindberg (2019) (at the time the deputy director and the director, respectively, of the Varieties of Democracy project) argued that a wave of autocratization started in 1994 and proposed an operationalization for measuring such waves of autocratization. Both endeavours were challenged by Skaaning (2020), whose analysis leads to a quite different conclusion: he does not find the start of a reverse wave in 1994 and holds that ‘it is even uncertain whether we are currently in the midst of an outright wave of autocratization’ (Skaaning 2020, p. 1539). Consequently, Skaaning postulates that a critical discussion is needed, since the conceptualization and measurement suggested by Lührmann and Lindberg ‘seem to influence key substantive conclusions about trends in democratization and autocratization’ and ‘can lead to skewed perceptions of trends of autocratization (and democratization)’ (Skaaning 2020, p. 1534). This connects to the application dimension of democracy measurement. There certainly exists a tension between, on the one hand, creating alarmism, with the potential effect of pushing policy makers towards determined action that might not be the right thing to do, and, on the other, underestimating developments, which could hinder policy makers, officials and others from identifying threats that should be countered.7 The questions are therefore twofold: What is the responsibility of democracy measurement in creating or sustaining such conjunctures? And how can scholars and practitioners act responsibly in using and interpreting their data?
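The stakes of such operationalization choices can be made concrete with a stylized sketch (an invented score series and arbitrary thresholds, emulating neither Lührmann and Lindberg’s nor Skaaning’s actual operationalizations): whether a country counts as undergoing autocratization can hinge on the decline threshold alone.

    # Invented 0-1 democracy scores for one country over six years.
    scores = [0.62, 0.60, 0.57, 0.55, 0.54, 0.53]

    def is_autocratization_episode(series, threshold):
        """Flag an episode if the score declines by more than `threshold`
        between the first and the last observation."""
        return (series[0] - series[-1]) > threshold

    for threshold in (0.05, 0.10):
        print(threshold, "->", is_autocratization_episode(scores, threshold))
    # 0.05 -> True   (counted as autocratization)
    # 0.10 -> False  (not counted)

Aggregated across countries, such threshold choices are part of what turns the same data into either a ‘third wave of autocratization’ or a far less dramatic trend.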

CONCLUSION

Democracy measurement is an area of research that seeks to answer very central questions in political science and to make them tangible for many subdisciplines: When is a country a democracy? How democratic is a democracy? What has become clear today, after an evolving process of refinement and increasing demands for quality control, is that answering these questions continues to hinge on the challenge of what is actually being measured by the measurement instrument.

Thus: ‘… the measurement accuracy of an index and thus the answer to the question of how democratic democracies are, actually depends to a large extent on the quality of the respective measurement concept’ (Müller & Pickel 2007, p. 516 – own translation). Therefore, it is important, on the one hand, that the evaluation of democracy measurement continues within scholarship, and, on the other hand, that awareness is created of the need for a considered, prudent and responsible approach to measurement itself and to the handling of its results. Moreover, it is equally meaningful to ‘recognize the public good aspect of enhanced measures of democracy’ (Coppedge et al. 2011, p. 261). The strong mutual influence between political practice – governments and international actors need benchmarks for their decisions – and research, which can provide these benchmarks, becomes particularly obvious and tangible with respect to democracy measurement. Hence, as shown, there is a demand for endogenous quality reflection at every stage of measurement, applying such criteria as validity, reliability and replicability. But more than that, there is also a postulate for exogenous reflection on how to use and apply the results of measurement, and how to transfer interpretations and conclusions to policy makers and the public. Precisely because democracy measurement has significant influence on political decisions, the greatest possible legitimacy must be achieved with an index. Coppedge et al. point out that a poor index can also end up being ‘perceived as a tool of Western influence or a mask for the forces of globalization’ (Coppedge et al. 2011, p. 259). As repeatedly underlined, democracy measurement is a highly dynamic research field. More recent demands focus on complementing existing measures with the citizens’ perspective, which would methodologically involve the incorporation of survey data (Fuchs & Roller 2018; Mayne & Geissel 2016; Pickel et al. 2016). This demand addresses the micro or individual dimension beyond the legal and institutional – macro – level, and thus the subjective evaluations of citizens when it comes to assessing the quality of democracy. Another direction is induced by the emergence of digital tools for democratic communication and processes. New ways of decision making and new administrative procedures have been established in recent decades, and so far democracy measurement has not confronted this changed situation by adjusting criteria or indicators. Thus, the question is: how to measure e-democracy elements (Kneuer 2016)? A third thread, already made possible by various indices and their historical data, lies in further work exploiting this rich pool in order to fertilize historical analyses and open further avenues of diachronic comparative research.

NOTES

1. An exception is the Democracy Barometer, which actually only considers established democracies. See Engler et al. (2020).
2. Freedom to form and join organizations, freedom of expression, the right to vote, eligibility for public office, the right of political leaders to compete for support, alternative sources of information, free and fair elections, and institutions for making government policies depend on votes and other expressions of preference. Coppedge and Reinicke (1990) later keep the variable of suffrage and the three variables on freedoms (organization, information, expression) and subsume the remaining four under one variable, free and fair elections.
3. The concept by Merkel et al. (2003) and Merkel (2004) constitutes the basis for the Bertelsmann Transformation Index (BTI): https://bti-project.org/de (accessed 15.1.2023) and was also influential for the Democracy Barometer (see endnote 1). Bertelsmann also produces a different index, the Sustainable Governance Index (SGI): https://www.bertelsmann-stiftung.de/de/unsere-projekte/sustainable-governance-indicators-sgi (accessed 15.1.2023).
4. https://freedomhouse.org/ (accessed 15.1.2023).
5. Unfortunately, the KID website is not available in English. Combined Democracy Index (KID): https://www.politikwissenschaft.uni-wuerzburg.de/lehrbereiche/vergleichende/forschung/kombinierter-index-der-demokratie-kid/. The Global Ranking of Quality of Democracy: http://democracyranking.org/wordpress/ (accessed 15.1.2023).
6. In contrast to this original idea, some V-Dem scholars have set out to construct a measurement of regime types. See, for example, Lührmann et al. (2018).
7. This is what a recent discussion evinced. See Weyland (2022) and Moller and Skaaning (2022).

REFERENCES

Democracy Indices

Bertelsmann Sustainable Governance Index: www.bertelsmann-stiftung.de/de/unsere-projekte/sustainable-governance-indicators-sgi.
Bertelsmann Transformation Index: https://bti-project.org/de.
Democracy Barometer: https://democracybarometer.org.
Democracy Matrix: www.democracymatrix.com.
EIU: www.eiu.com/n/campaigns/democracy-index-2021.
Freedom House: https://freedomhouse.org.
IDEA – Global State of Democracy: idea.int/data-tools/tools/global-state-democracy-indices.
Nations in Transit by Freedom House: https://freedomhouse.org/report/nations-transit.
Polity V: www.systemicpeace.org/polityproject.html.
Polity IV: www.systemicpeace.org/polity/polity4.htm.
Vanhanen: services.fsd.tuni.fi/catalogue/FSD1289?tab=summary&lang=en&study_language=en.
Varieties of Democracy: https://www.v-dem.net.

Literature

Altman, D., & Pérez-Liñán, A. (2002). Assessing the quality of democracy: Freedom, competitiveness and participation in 18 Latin American countries. Kellogg Institute Working Paper, Notre Dame.
Armstrong, D.A. (2009). Measuring the democracy–repression nexus. Electoral Studies, 28(3), 403–12.
Beetham, D. (2004). Towards a universal framework for democracy assessment. Democratization, 11(2), 1–17.
Beetham, D., Bracking, S., Kearton, I., & Weir, S. (2001). International IDEA handbook on democracy assessment. Kluwer Academic Publishers.
Beetham, D., Carvalho, E., Landman, T., & Weir, S. (2008). Assessing the quality of democracy: A practical guide. International Institute for Democracy and Electoral Assistance.
Bollen, K.A. (1990). Political democracy: Conceptual and measurement traps. Studies in Comparative International Development, 25, 7–24. https://doi.org/10.1007/BF02716903.
Boutros-Ghali, B. (1996). An agenda for democratization. United Nations. https://www.un.org/fr/events/democracyday/assets/pdf/An_agenda_for_democratization.pdf (accessed 18.1.2023).
Bühlmann, M., Merkel, W., Müller, L., & Wessels, B. (2012). The Democracy Barometer: A new instrument for measuring the quality of democracy and its potential for comparative research. European Political Science, 11(1), 519–36.
Carothers, T. (2002). The end of the transition paradigm. Journal of Democracy, 13(1), 5–21.
Carothers, T. (2009). Stepping back from democratic pessimism. Carnegie Papers. Carnegie Endowment for International Peace.

Carothers, T. (2012). Democracy policy under Obama: Revitalization or retreat? Carnegie Endowment for International Peace.
Carothers, T., & Youngs, R. (2017). Democracy is not dying: Seeing through the doom and gloom. Foreign Affairs. https://www.foreignaffairs.com/united-states/democracy-not-dying (accessed 18.1.2023).
Coppedge, M., & Reinicke, W. (1990). Measuring polyarchy. Studies in Comparative International Development, 25, 51–72.
Coppedge, M., Gerring, J., Altman, D., et al. (2011). Conceptualizing and measuring democracy: A new approach. Perspectives on Politics, 9(2), 247–67.
Croissant, A., & Thiery, P. (2000). Von defekten und anderen Demokratien. Welttrends, 29 (Winter), 9–32.
Dahl, R.A. (1971). Polyarchy: Participation and opposition. Yale University Press.
Dahl, R.A. (1989). Democracy and its critics. Yale University Press.
de la Fuente, G., Kneuer, M., & Morlino, L. (2020). Calidad de democracia en América Latina: Una nueva mirada. Fondo de Cultura Económica.
Diamond, L.J. (1996). Is the third wave over? Journal of Democracy, 7(3), 20–37.
Diamond, L.J. (2002). Elections without democracy: Thinking about hybrid regimes. Journal of Democracy, 13(2), 21–35. https://doi.org/10.1353/jod.2002.0025.
Diamond, L.J., & Morlino, L. (2005). Assessing the quality of democracy. Johns Hopkins University Press.
Engler, S., Leemann, L., Abou-Chadi, T., et al. (2020). Democracy Barometer. Codebook. Version 7. Zentrum der Demokratie.
Fuchs, D., & Roller, E. (2018). Conceptualizing and measuring the quality of democracy: The citizens’ perspective. Politics and Governance, 6(1). https://doi.org/10.17645/pag.v6i1.1188.
Geissel, B., Kneuer, M., & Lauth, H.J. (2016). Measuring the quality of democracy: Introduction. International Political Science Review, 37(5), 571–9. https://doi.org/10.1177/0192512116669141.
Giebler, H., Ruth, S.P., & Danneberg, T. (2018). Why choice matters: Revisiting and comparing measures of democracy. Politics and Governance, 6(1), 1–10. https://doi.org/10.17645/pag.v6i1.1428.
Huntington, S.P. (1991). The third wave: Democratization in the late twentieth century. University of Oklahoma Press.
Karl, T.L. (1995). The hybrid regimes of Central America. Journal of Democracy, 6(3), 72–86.
Kneuer, M. (2016). E-democracy: A new challenge for measuring democracy. International Political Science Review, 37(5), 666–79. https://doi.org/10.1177/0192512116657677.
Kneuer, M. (2020). Fenómenos de límite en la medición de la calidad democrática: por qué Venezuela es un caso límite [Borderline cases in measuring the quality of democracy: Why Venezuela is a borderline case]. In G. de la Fuente, M. Kneuer, & L. Morlino (Eds.), Calidad de democracia en América Latina: Una nueva mirada (pp. 65–89). Fondo de Cultura Económica.
Kneuer, M., & Demmelhuber, T. (2020). Autocratization and its pull and push factors – a challenge for comparative research. In Authoritarian gravity centers: A cross-regional study of authoritarian promotion and diffusion (pp. 3–26). Routledge.
Lauth, H.J. (2004). Demokratie und Demokratiemessung. Springer Fachmedien.
Levitsky, S., & Way, L.A. (2010). Competitive authoritarianism: Hybrid regimes after the Cold War. Cambridge University Press.
Levitsky, S., & Way, L.A. (2015). The myth of democratic recession. Journal of Democracy, 26(1), 45–58.
Lührmann, A., & Lindberg, S.I. (2019). A third wave of autocratization is here: What is new about it? Democratization, 26(7), 1095–113.
Lührmann, A., Tannenberg, M., & Lindberg, S.I. (2018). Regimes of the World (RoW): Opening new avenues for the comparative study of political regimes. Politics and Governance, 6(1), 60–77.
Marshall, M.G., & Gurr, T.R. (2020). Polity 5: Political regime characteristics and transitions, 1800–2018. Dataset users’ manual. http://www.systemicpeace.org/inscr/p5manualv2018.pdf (accessed 18.1.2023).
Mayne, Q., & Geissel, B. (2016). Putting the demos back into the concept of democratic quality. International Political Science Review, 37(5), 634–44. https://doi.org/10.1177/0192512115616269.
Merkel, W. (2004). Embedded and defective democracies. Democratization, 11(5), 33–58. https://doi.org/10.1080/13510340412331304598.

Merkel, W., Puhle, H.-J., Croissant, A., Eicher, C., & Thiery, P. (2003). Defekte Demokratie. VS Verlag für Sozialwissenschaften.
Moller, J., & Skaaning, S.-E. (2022). Crisis of democracy: On the meaning and relevance of a much used and abused concept. Presentation at the APSA Annual Conference 2022, Montreal.
Morlino, L. (2009). Are there hybrid regimes? Or are they just an optical illusion? Cambridge University Press.
Müller, T., & Pickel, S. (2007). Wie lässt sich Demokratie am besten messen? Zur Konzeptqualität von Demokratieindices. PVS, 48(3), 511–39.
Munck, G.L. (2009). Measuring democracy: A bridge between scholarship and politics. Johns Hopkins University Press.
Munck, G.L. (2016). What is democracy? A reconceptualization of the quality of democracy. Democratization, 23(1), 1–26.
Munck, G.L., & Verkuilen, J. (2002). Conceptualizing and measuring democracy: Evaluating alternative indices. Comparative Political Studies, 35(1), 5–34.
Munck, G.L., & Verkuilen, J. (2009). Conceptualizing and measuring democracy: An evaluation of alternative indices. In G.L. Munck, Measuring democracy: A bridge between scholarship and politics (pp. 13–38). Johns Hopkins University Press.
O’Donnell, G.A. (1994). Delegative democracy. Journal of Democracy, 5(1), 55–69. https://doi.org/10.1353/jod.1994.0010.
Ottaway, M. (2003). Democracy challenged: The rise of semi-authoritarianism. Carnegie Endowment for International Peace. https://doi.org/10.2307/j.ctt1mtz6c5.
Pemstein, D., Meserve, S., & Melton, J. (2010). Democratic compromise: A latent variable analysis of ten measures of regime type. Political Analysis, 18(4), 426–49. https://doi.org/10.1093/pan/mpq020.
Pickel, S., Stark, T., & Breustedt, W. (2015). Assessing the quality of quality measures of democracy. European Political Science, 14(4), 496–520.
Pickel, S., Breustedt, W., & Smolka, T. (2016). Measuring the quality of democracy: Why include the citizens’ perspective? International Political Science Review, 37(5), 645–55. https://doi.org/10.1177/0192512116641179.
Puddington, A. (2008). The 2007 Freedom House Survey: Is the tide turning? Journal of Democracy, 19(2), 61–73.
Schedler, A. (2006). Electoral authoritarianism: The dynamics of unfree competition. Lynne Rienner Publishers.
Skaaning, S.E. (2020). Waves of autocratization and democratization: A critical note on conceptualization and measurement. Democratization, 27(8), 1533–42. https://doi.org/10.1080/13510347.2020.1799194.
Skaaning, S.E., & Jiménez, M. (2017). The global state of democracy 1975–2015. In IDEA (Ed.), The global state of democracy: Exploring democracy’s resilience (pp. 2–30). IDEA.
Treier, S., & Jackman, S. (2008). Democracy as a latent variable. American Journal of Political Science, 52(1), 201–17.
Vanhanen, T. (1990). The process of democratization: A comparative study of 147 states, 1980–1988. Crane Russak.
Vanhanen, T. (2000). Measures of democracy 1810–2010. Finnish Social Science Data Archive. https://www.fsd.tuni.fi/fi/aineistot/taustatietoa/FSD1289/Introduction_2010.pdf (accessed 18.1.2023).
Vanhanen, T. (2019). FSD1289 Measures of democracy 1810–2018. Dataset. Finnish Social Science Data Archive. https://services.fsd.tuni.fi/catalogue/FSD1289 (accessed 18.1.2023).
Weyland, K. (2022). Concept misformation in the age of democratic anxiety: Causes and downsides. Presentation at the APSA Annual Conference 2022, Montreal.
Zakaria, F. (1997). The rise of illiberal democracy. Foreign Affairs, 76(6), 22–43. https://doi.org/10.2307/20048274.

Index

#data4covid19 platform 39 ability to impact 210, 211–14 Abolafia, M.Y. 259, 261 accountability ALMPs 232–4 increased accountability of implementing agents 235 collaborative governance processes 159, 162 collaborative performance summits 217, 218, 219, 224, 226 NPM 47–8, 50, 51, 54–5, 56, 57 public administration theory 84, 92 sociology of measurement 112–13, 120 WGI 139, 140 accountability bias 65 accountability relations 66 accounting 3 health care 243, 244–5, 248–9, 250–51, 253 accounting-led organizations 132 accreditation systems 246 action research 216–17 collaborative performance summits see collaborative performance summits activation schemes 232 active labour market policies (ALMPs) 229–42 consequences of measuring 235–8 future of measuring 238–9 measuring outcomes 230–32 measuring outputs 232–4 actor constellation 159, 165 Addink, G.H. 100 Addis Ababa Action Agenda on financing for development 275 administrative traditions 20 agenda setting 220, 221–2 aggregate statistical concepts 18 aggregation 298 aid Official Development Assistance (ODA) 274–5 see also global development cooperation aid effectiveness 275, 278 Akselvoll, M. 69 Albrow, M. 81 algorithmic governmentality 128 algorithms 117–18, 122 alternative data sources 31

Anderson, B. 21 anomie 3 Ansell, C. 156, 216, 217, 219, 221, 222, 224 anthropology 7 anticipatory governance 68 OECD’s in education 71–2 Araujo, C. 252 Arnold, P.J. 251 arousal 198–9 artificial intelligence (AI) 119 assemblages 118–19, 120 audit society 67–8 Australia 237 Australian Bureau of Statistics (ABS) 22, 24, 25–6, 27 ‘Australian Settlement’ 23 Committee on Integration of Data Systems 25 Commonwealth Bureau of Census and Statistics (CBCS) 23–4, 25 National Health Performance Authority (NHPA) 52 OECD rankings of school systems 118 purposes of performance measurement 52–7, 61 state formation and statistics 15, 19, 20, 22–7, 28 White Australia Policy 23, 24, 26 authoritarian regimes 104–5, 106 autocracy 288, 289 autocratization, wave of 299 Bakker, B.N. 198 Bakonyi, J. 277 Ballart, X. 198 Bartlett, J. 116 Baumann, H. 72–3 Beaud, J.-P. 21, 22 Beetham, D. 294–5 behavioural perspective 187–203 Behn, R.D. 49–50 emerging trends in measuring governance 198–9 theories of measurement 188–94 use of survey scales 194–8 benchmarking 8, 132–3 benefit recipients number of 229, 230–32, 233, 235–6 time spent on benefits 230–32, 235–6


Bertelsmann Institute Sustainable Governance Index 99 Bertelsmann Transformation Index (BTI) 293, 294, 296 big data 34, 115, 133 biopower 129 Bollen, K. 292 Booher, D.E. 157 Booth, C. 3 Bouckaert, G. 81, 82 Bowker, G.C. 114 Bozeman, B. 194 Braun, B. 261 Brazil 183 National Program for Improving Access and Quality to Primary Care (PMAQ) 179–82 breadth of collaboration 207–8 Brodkin, E. 231–2, 236, 237 budgeting 3–4, 49–50, 51 GaaG 141, 142, 143, 145 Bühlmann, M. 294 Burchell, G. 17 bureaucracy 3, 46, 47 ideal type of efficient bureaucracy 81 public 46, 58 bureaucratic reputation 196 Busse, R. 251 campaign promises 102 Campbell, D. 294 Campbell’s law 237–8 Canada 19, 20 Canadian Institute for Health Information (CIHI) 52–3 purposes of performance measurement 52–7, 61 capacity to govern 98–9 capacity for joint action 159, 164–5 Carpenter, D. 196 case studies 107 categorization 64, 114 celebration 49–50, 51, 55 censuses see population censuses central banks 259–72 de-politicization of performance 268–9 and financialization 262–5 inflation targeting and measurability of monetary policy 260–62 unconventional monetary policy 265–9 centralized statistical systems 20 certainty 33–4, 40 chain of performance measurement 48 character of the innovation 209 charity 2

Chee, G. 174 cherry-picking 251 Chua, W.F. 250–51, 252 citizen-generated data (CGD) 33–4, 40 citizen science 40 citizens 116 classical liberalism 126 classical test theory 189–94 composite measures 190–91 reliability 191, 193, 194 validity 191–3, 194 coalition governments 102–3 coefficient alpha 191, 194 Coleman, J. 247 collaboration, indicators of 207–9, 211–14 collaborative governance 5–6, 102–3, 156 collaborative governance processes 156–71 measures for assessing quality of 158–66 methods for measuring quality of 166–9 purposes of measuring quality of 157–8 collaborative innovation 204–15 collaborative management 208–9 collaborative performance summits 216–27 accountability 217, 218, 219, 224, 226 competences vs confidence 224–5 competing purposes 219 current status and prospects for using in measuring governance 225–6 institutional context 220–21, 222–3, 226 learning 217–18, 219, 223, 224, 226 patterns emerging from application of 224 relationship-building 217, 218–19, 223, 224, 226 steps 220, 221–2 uses of 223 Combined Democracy Index 294 commensuration 64 commitment 159, 161 Commitment to Development Index 275 Common Assessment Framework (CAF) 88 communication 159, 164 competences ALMPs and increasing 230–32 vs confidence in collaborative performance summits 224–5 competition 280–83 composite measures 190–91 Comte, A. 2 conceptualization democracy 290–91, 297, 298 governance 96, 99–100 survey scales and 188–9, 193, 194, 195–6 confidence 224–5 conflict regulation 159, 164 congeneric models 190, 194

constitutive effects 62–78 contestation 68–70 distinguished from unintended effects 63–4 dynamics and histories in governance measurement 66–8 instruments, mechanisms and processes 64–6 OECD’s anticipatory governance in education 71–2 relevance of 62–3 SDGs 73–5 transparency in government 72–3 construct validity 192–3, 194 constructivism 130 content 65 content validity 192, 194 contestation 68–70 context collaborative performance summits 220–21, 222–3, 226 contextual factors in GaaG 142, 145 specificity in public administration 86 control 49–50, 51, 54, 57 conventions 7 convergent validity 192–3, 194 Coombs, N. 266 Copenhagen crime prevention projects 206, 211–13 Coppedge, M. 295, 300 core government results 141, 142, 144, 145 corporate governance 2, 47 correlation 34 corruption 97–8 control of 140 Corruption Perception Index (CPI) 72–3 cost-based accounting see accounting cost-cutting 91 Coursey, D.H. 197 COVID-19 pandemic 38, 39, 41, 273 health system strengthening before and after 183 crime 3 crime prevention projects 206, 211–13 Crisp, L. 25 criteria-based measurement 204–15 empirical application 211–13 key variables and indicators 207–10 main purposes of for collaborative innovation 205–7 criterion validity 192, 194 Critical Data Studies (CDS) 111, 119, 120, 121 critical policy drivers 183 Cronbach’s alpha 191, 194 custodian agencies 154, 280–83 customer-user satisfaction surveys 88

Dahl, R. 290 Dahler-Larsen, P. 69, 249 dashboard, governing by 115–16 data importance in global development cooperation 280, 281–3 interpretation and democracy measurement 299 quantification and global governance 31–44 data colonialism 121 data justice 112, 121 data sources alternative 31 used by supranational institutions 152–4 data sovereignty 112, 121 datafication 34, 114–15 decentralized statistical systems 20 decisiveness 159, 160 deductive scale development 188, 195–6 DeHart-Davis, L. 196 delegation 105–6 deliberative forums 76 democracy 73, 76 quality of 98, 291 representative vs participative in collaborative performance summits 225 Democracy Barometer 291, 294 Democracy Matrix (DemMax) 296 democracy measurement 288–303 development of concepts and methods 290–91 early comparative indices 291–2 expansion on concept and measurement level 295–7 key changes and the forces behind 297–8 new indices in the 2000s 293–5 political and administrative consequences 298–9 democratic legitimacy 159, 162–3 democratic regimes 101–4, 107 democratization 288–90, 299 Denmark 69–70, 247, 251 crime prevention projects in Copenhagen 206, 211–13 criteria-based measurement 206, 211–13, 214 Danish Accreditation Agency 70 Ligebehandlingsnævnet (The Council for Equal Treatment) 74 and the SDGs 73–5 de-politicization 268–9 depth of collaboration 208 depth of innovation at the ideational level 209 depth of innovation at the level of practice 209

Derksen, L. 112 Desrosières, A. 21, 33, 34, 113 development 115 evolution of measuring practices 278–80 global development cooperation 273–87 indicators 73–4, 118–19, 275–8, 279–82 MDGs 38, 39, 273, 277, 279, 280 methods for measuring in the development field 274–8 SDGs see Sustainable Development Goals UN 2030 Agenda 31, 33, 38, 39–40, 41, 144, 146, 153 deviance 18 diagnosis-related groups (DRGs) 243, 244–5, 248, 251–2 dialogue routines see collaborative performance summits digital governance indicators 138, 150–51, 152–3 digital government 141, 142, 143–4, 145 digitality 38–9, 41 direct observation 198 discipline 129 discriminant validity 192–3, 194 displaced persons 278 document analyses 129 documentation 69 domestic violence 27 domination 50–51, 54, 57 Donzelot, J. 127, 131 Douglas, S. 216, 217, 219, 221, 222, 224, 225 Du Gay, P. 58 due deliberation 159, 163 dumping 251 Dunning, J. 224–5 duration of time spent on benefits 230–32, 235–6 Durkheim, E. 2, 3 E-Government Development Index (EGDI) 138, 150–51, 152–3 economics 7 Economist Intelligence Unit (EIU) Index of Democracy 293, 294, 296 education 117 costs of measurement 69–70 higher education see higher education OECD’s anticipatory governance in 71–2 effective policy integration levels (EPILs) 176, 178, 180, 181, 182, 184 elections 102, 107 Emerson, K. 156 emotions 198–9 employment services 232–4, 235, 236, 238 enrichment 159, 160 epidemiology 126 Espeland, W. 282

European Committee of the Regions 149 European Union (EU) 7, 16, 132, 289 Directive 89/391 67 National Recovery and Resilience Plans (NRRPs) 149–50 and the SDGs 149–50 EUROSTAT 149 evaluation 49–50, 51 evidence-based policymaking (EBPM) 114, 234 Ewald, F. 127, 131 expectations management 261–2, 264 expert surveys 107 external support 159, 164 extrinsic policy drivers 176–8 face validity 192 facilitation 159, 164 failed states 101, 105–6 Federal Reserve 264 federalism 20, 52 feminism 121 Ferguson, J. 34, 41 financial domination 264 financial markets 261–2, 263–5 financial performance measures 243, 244–5, 248–9, 250–51, 253 financialization 262–5 Fisher, A. 279 Fitoussi, J.-P. 35 fixation 64, 65 flexibility 159, 160 focus groups 169 formulation drivers 174, 175–8, 181–2, 184 Foucault, M. 125, 126, 128, 129, 130, 131, 133 fragile and conflict-affected situations (FCS) 36 France 17 French Revolution 111 Francis Report 56 Freedom in the World (Freedom House) 291, 296, 297, 298 Fukuda-Parr, S. 279, 283 full measurement invariance 199 Gabrys, J. 33 gaming 68, 168 ALMPs 236–8 Gash, A. 156 Geissel, B. 290 gender 27 gender equality 74, 121 generalizability of survey scales 193 Germany 17, 20 Global Democracy Ranking 293–4 global development cooperation 273–87 competition and custodianship 280–83

evolution of measuring practices 278–80 key measurement methods 274–8 global financial crisis (GFC) 35, 265, 267, 279 Global Partnership for Effective Development Cooperation (GPEDC) 275 Global State of Democracy (GSoD) 295, 296 goals 101–3, 105, 106, 107 good faith negotiation 159, 163 good governance 97–8, 100–101 Gorur, R. 117, 118 governability 83, 130 rituals of 262–5 governance 1 conceptualization 96, 99–100 defining 1–2, 45, 139 generic conception of and its measurement 99–104 without government 96 governance capacity 98–9 governance failure 103–4, 106 governance measuring 2 disciplines and theories 6–7 emergence and development of 2–6 methods and methodologies 8 governance structure 159, 164–5 Government at a Glance (GaaG) 138, 141–4, 152–3 government effectiveness 139, 140 governmentality 7, 83, 125–37 concepts, analytical strategies and arguments 128–31 key contributions in measuring governance 131–3 studies and governance measuring 126–8 GovLab 39 Great Moderation 262–3, 265 green tape scale 196 Grimmelikhuijsen, S. 192 Gross Domestic Product (GDP) 7, 35, 119 ground rules 159, 165 Gurr, T. 292 Hacking, I. 17, 114, 126–7, 131 Hall, P.A. 16 Hassan, R. 38 Hattke, F. 198 Hawke, B. 26, 27 health care 243–58

consequences of the use of performance measurement 250–53 driving forces behind performance measurement 248–50 most influential performance measurement regimes 244–8 purposes of performance measurement 52–3, 54, 55–7 health system strengthening (HSS) 172–86 Policy Integration and Performance Framework (PIPF) 172–84 application in Brazil 179–82 during and after COVID-19 183 P4P programmes in LMICs 173–4, 182, 184 Henman, P. 128, 133 High-Level Political Forum on Sustainable Development 148 higher education purposes of performance measurement 52, 53, 54–5 students 70 historical institutionalism 82 historical sensitivity 131 Hood, C. 16, 46, 82, 83 Horn, R. 26 Hoskin, K. 127, 131, 132 Howard, J. 27 human capital approach to ALMPs 231, 234 Human Capital Index (HCI) 151 Human Development Index (HDI) 276, 281–2 Human Development Initiative (HDI) 35 human resources management (HRM) 141, 142, 143 human rights 97–8 ideas 15, 17–18, 22 Australia 23, 24–5, 26–7, 28 identities 66 national identity see national identity impact of collaborative innovation 210, 211–14 implementation 103, 104 implementation drivers 174, 175–8, 181–2, 183 improvement 49–50, 51, 55–6 incentives 230–32 inclusiveness 159, 162 incompleteness 252–3 independence of central banks 259, 260–62, 268, 269 Independent Expert Advisory Group on a Data Revolution for Sustainable Development (IEAG) 31, 33 India 119 indicators ALMPs 230–32, 235–8, 239

development 73–4, 118–19, 275–8, 279–82 digital governance indicators 138, 150–51, 152–3 see also supranational institutions Indigenous peoples 23, 24, 26 individual level measurement see behavioural perspective inductive scale development 188, 196 industrial capitalism 2–3 industrial revolution 111 inequality 41, 100, 282 inflation targeting 259, 260–62, 264–5, 268 information preparation 220, 222 informed publics 119–20 infrastructure, governance of 142, 143, 145 Innes, J.E. 157 innovation collaborative 204–15 indicators of 209–10, 211–14 innovativeness 159, 160 inputs 142–3, 144, 145 Institute of Democracy and Electoral Assistance (IDEA) 294–5, 296 institutional context 220–21, 222–3, 226 institutional lock-in 65 institutions GaaG 141, 142, 143, 145 state formation and statistics 15, 17, 19–21, 22 Australia 23–4, 25–6, 27, 28 supranational see supranational institutions integration theory 83–4 intensity of interaction 159, 160 intentions 63–4 interaction, nature of 159, 160–61 interactive data presentation 297 interactive governance 5–6 interdependency theory 83 interests 15, 17, 19, 22 Australia 23, 25, 27, 28 internal legitimacy 159, 161 international aid agencies 115 see also under individual agencies international competitiveness 5 International Development Association (IDA) 36 International Labour Organization (ILO) 281 International Monetary Fund 5, 16 international organizations 280–83 international statistical standards 16 interpellation 64–5 intrinsic policy drivers 175–8 involuntary participation 119, 120 Jaggers, K. 292 job-centred red tape 196

joint action, capacity for 159, 164–5 Jordan, S. 252–3 Keast, R. 166 Keating, P. 26, 27 Kenis, P. 6 key performance indicators (KPIs) 230–32, 235–8, 239 Keynes, J.M. 18 Kitchin, R. 31, 39 Knies, E. 192 knowledge-power relations 7, 125–6, 129–30 knowledge sharing 159, 160–61 Kruger, D. 224–5 labour market policies see active labour market policies (ALMPs) labour movement 23 Lafortune, G. 141, 142 Lascoumes, P. 47 Latour, B. 114, 122 Lauth, H.-J. 294, 296 Le Gales, P. 47 leadership collaborative governance process quality 159, 163–4 health system strengthening 175, 176, 177, 178, 181, 184 learning 206 collaborative governance process quality 159, 160–61 collaborative performance summits 217–18, 219, 223, 224, 226 performance measurement 49–50, 51, 55 learning forums 224 Lee, D. 195–6 legitimacy central banks 262 democratic 159, 162–3 internal 159, 161 liberalism 18 classical 126 social 23 Lindberg, S.I. 299 Lins Ribeiro, G. 37 Llewellyn, S. 252 logical framework (logframe) 36–7 Lopdrup-Hjorth, T. 58 low- and middle-income countries (LMICs) 173–4, 182, 184 Lührmann, A. 299 Luminate 39 Lynn, L.E. 81 macro-prudential financial regulation 265–6

310  Handbook on measuring governance Macve, R. 132 management of collaborative governance 159, 163–4 Management-By-Objectives (MBO) 3 managerialism 82 Mandell, M. 166 Manifestos Project 102 Marcoulides, G.A. 189–90, 191, 192 Margetts, H. 46 market-based finance 261–2, 263–5 market governance 47 Marshall, M.G. 292 McNeill, D. 283 means-ends calculations 125, 128 measurement error 189 Measuring for Results (MfR) rationale 36–7, 40–41 mercantilism 17 Merkel, W. 294 Merry, S. 279, 282 Messner, M. 252–3 metaphysics of correlation 34 (meta-)governance 159, 163–4 micromodels 37 migrants 24 Millennium Development Goals (MDGs) 38, 39, 273, 277, 279, 280 Miller, J. 25 Milward, B. 6 Mirowski, P. 40 mission, shared 159, 161 modernity 16, 67, 68 sociology of measurement 112–13 modernization 115 monetary policy 259–72 inflation targeting and measurability of 260–62 unconventional 265–9 normalization and de-politicization of central banks 268–9 motivation 49–50, 51, 159, 164 shared 159, 161–2 Moynihan, D.P. 224 Müller, T. 298, 300 multiculturalism 26 Munck, G.L. 291, 298 mutual trust 159, 162 mutual understanding 159, 161 nation-states 21 national development capacity 274–5 national identity 15, 17, 21, 22 Australia 24, 26, 27, 28 national income accounting 24 National Statistical Offices (NSOs) 148

Nations in Transit (NiT) 293, 294, 296 nature of interaction 159, 160–61 Neby, S. 251 Nehring, D. 112 neoliberal globalization 21, 22 Australia 26–7 neoliberalism 18, 19, 45, 46, 128, 133 neo-Weberian state (NWS) 80–82, 84–93 contribution to measuring governance 89–92 reasons to measure governance 84–7, 90 what and how of measuring governance 87–9, 90 network governance 5, 47, 83–4 networks 6 new institutional economics 86–7 new public governance (NPG) 5, 80–81, 83–93, 238, 239 contribution to measuring governance 89–92 reasons to measure governance 84–7, 90 what and how of measuring governance 87–9, 90 New Public Management (NPM) 4–5, 80–81, 82–3, 84–93, 114, 172 ALMPs 232–4, 238–9 contribution to measuring governance 89–92 health care governance 243, 248–9, 250–51 and performance measurement 45–61 in Australia, UK and Canada 52–7, 61 health sector 52–3, 54, 55–7 higher education sector 52, 53, 54–5 ideas and rationalities 46–9 purposes of performance measurement 49–51, 54 reasons to measure governance 84–7, 90 state formation and statistics 20–21, 27 what and how of measuring governance 87–9, 90 New Universalism 250 New Zealand 5 health care 247, 249, 250 Health Strategy 246 nodality function of government 16 normal distribution 3 normativity 68–70 Norris, P. 98 numbers, governing by 114–15 Oakes, L.S. 251 objectivity 117–18 observation direct 198 participatory 167 Official Development Assistance (ODA) 274–5 official statistics 15, 16, 19 O’Leary, R. 157

Index  311 Online Service Index (OSI) 151 open government 113, 141, 142, 143, 145 Organisation for Economic Co-operation and Development (OECD) 4, 76 anticipatory governance in education 71–2 ‘Data for Development’ report 282–3 Development Assistance Committee (DAC) 274, 275 Government at a Glance (GaaG) 138, 141–4, 152–3 Programme for International Assessment (PISA) 7, 71–2 organizational culture 7 organizational management theory 6 Osborne, S.P. 83 outcomes of ALMPs 230–32 GaaG 142, 144, 145 outputs of ALMPs 232–4 GaaG 142, 144, 145 Overman, S. 196 overutilization of health care 251–2 Pandey, S. 194 parallel tests models 190–91, 194 participant selection 220, 221, 226 participative democracy 225 participatory evaluations 169 participatory observations 167 path dependencies 82, 281–2 patient-centred health care 247–8 pay-for-performance programmes (P4P) 173–4 ALMPs 234 Brazil 179–82 LMICs 173–4, 182, 184 perfectibility 32–3, 40 performance drivers 176, 177, 178, 182, 184 performance impact perception levels (PIPLs) 176, 178, 180, 181, 184 performance management 48, 206 performance measurement 4–5 ALMPs 232–4, 236–9 and standardization 238 chain of 48 governmentality 127, 132–3 health care governance 243–58 NPM and 45–61 public administration 87, 89 performativity 118 Perry, J. 195, 197 Petty, W. 2 Pflueger, D. 252 Pickel, S. 298, 300 Planning Programming Budget Systems (PPBS) 3

Plummer, K. 112 plurality 130 policy documents 129 policy drivers 174, 175–8, 183, 184 formulation drivers 174, 175–8, 181–2, 184 implementation drivers 174, 175–8, 181–2, 183 Policy Integration and Performance Framework (PIPF) 172–84 application in Brazil 179–82 health system strengthening during and after COVID-19 183 methodological aspects of application 175–9 P4P programmes in LMICs 173–4, 182, 184 policy processes 172–86 Political Arithmetic 2 Political Democracy Index 291, 292 political economy 7 political mandate 159, 162 political parties 19 political science perspective 6, 96–110 authoritarian regimes 104–5, 106 current state of governance measurement 97–9 failed states 101, 105–6 generic conception of governance and its measurement 99–104 methodology and data 106–7 subtypes of governance 103, 106 political stability 139, 140 political support 159, 162 Polity Index 291, 292, 293, 294, 296 Pollitt, C. 48, 49, 81, 82 population 17 population censuses 15, 18, 21 Australia 24, 26, 27 Porter, M.E. 5, 247 Porter, T.M. 113 post-war consensus 46 poverty 3, 26, 74, 276, 281 power governmentality 125–6, 129–30, 131, 132, 133–4 knowledge-power relations 7, 125–6, 129–30 performance measurement 50–51, 54, 57 power balance 159, 162 prediction 34, 40 Preston, A. 250–51, 252 Prévost, J.-G. 21, 22 primary data sources 152–3 primary health care (PHC) 174 Brazil 179–82 LMICs 182, 184 strengthening during and after COVID-19 183

312  Handbook on measuring governance private firms 2 probability 18 procedural fairness 159, 163 process mapping 166–7 process quality see collaborative governance processes processes 83 constitutive effects 64–6 GaaG 142, 143–4, 145 measuring effects of policy processes on health system strengthening 172–86 professional autonomy 234 professional standards and norms 86, 89, 91 projects 36–8 promotion 49–50, 51 protectionism 25 proto-statistical regimes 21, 22 Provan, K. 6 proximity 159, 160 psychometry 189 public administration (PA) 6, 80–95 contribution of NWS, NPG and NPM to measuring governance 89–92 neo-Weberian state (NWS) 80–82, 84–93 new public governance see new public governance (NPG) New Public Management see New Public Management (NPM) reasons to measure governance 84–7, 90 what and how of measuring governance 87–9, 90 public bureaucracy 46, 58 public employment 141, 142, 143, 145 public finance and economics 141, 142–3, 145 public policy theory 99–100 public procurement 141, 142, 143, 145 public sector integrity 142, 143, 145 public service logic 88 public service motivation 189 PSM scale 195, 197–8 public value 87–8, 89, 92 pupil plans 69–70 quality of democracy 98, 291 health care governance 243–4, 246–7, 249–50, 252–3 quality improvement schemes 88, 91 quantification and global governance 31–44 historical transformations in measuring global governance 38–40 increasing role of quantification 32–4 instruments and tools 35–8 quantitative easing (QE) 267–8

Raab, J. 6 race 26, 121 Radin, B. 46 Rainey, H. 194 Ramalho, L. 39 random error 190 randomized controlled trials (RCTs) 234 rational expectations (RE) economics 261–2 rationality 86–7, 125 Ravallion, M. 281 Raykov, T. 189–90, 191, 192 reductionism 116–17 reflexive government 67, 130–31 reflexive modernity 67, 68 regime subtypes 289, 293, 294 political science perspective 103, 106 regime types 288–9, 293, 294 regulation 54–5, 56 macro-prudential financial regulation 265–6 regulatory government 141, 142, 143, 145 regulatory quality 139, 140 relations 65, 66 quality of 159, 161 relationship-building 217, 218–19, 223, 224, 226 reliability 191, 193, 194 representation 117–18 representative democracy 225 representativeness 159, 162 repurchasing (repo) techniques 268 reputation of the innovation 209–10 re-regulation 4 resources availability 159, 165 costs of measurement 69–70 securing 159, 164 responsiveness 102, 103, 104 Riba, C. 198 risk management 56 rituals of governability 262–5 Robertson, S. 71 robustness of data 152–3 Rocha de Siqueira, I. 33, 39 Rose, R. 102 Rotberg, R. 100 Rottenburg, R. 282 rounds model 166–7 rule of law 139, 140 Russia 293, 296 Sachs, J. 5 Sarkozy, N. 35 Scandinavia 234, 237 Schiffelers, M.J. 225 schools ranking lists 63, 68–9 Science and Technology Studies (STS) 111, 121

Index  313 scientization 260–62 scope of collaboration 208 Scott, J.C. 113–14, 116–17, 119 secondary data sources 152–3 selective extraction 31 Sen, A. 35 serious games 168 service delivery 83 serving citizens 141, 142, 144, 145 SERVQUAL 88 sexual identity 27 shadow banks 267 shadow governance 105–6 shared mission 159, 161 shared motivation 159, 161–2 signification 51, 54, 57 simulations 168 Skaaning, S.E. 299 Skelcher, C. 158 skills, availability of 159, 165 smart cities 116 Snow, J. 62 social inequality 69 social liberalism 23 social network analysis 168 Social Progress Imperative (SPI) 35 social research methods 66 sociology 2, 6–7 sociology of measurement 111–24 concepts, assumptions and arguments 116–20 historical overview 112–13 key contributions 120–21 statistics and the state 113–16 socio-technical imaginary 116 Sørensen, E. 83 sovereignty 16, 129 span of governance measurement 152–3 spatial aspects of behaviour 199 Spearman, C. 189 standardization 114 performance measurement and 238 Star, S. 114 state relationship between statistics and 113–16 statist conception of governance 96 state cultures 20 state formation 15–30 defining 15–16 dimensions of the relationship with statistics 16–22 paradoxical relationship with statistics 15–16 and statistics in Australia 15, 22–7, 28 statistical agencies 20–21 statistical macro-management 21, 22

Australia 24–6 statistical nationalization 21, 22 Australia 23–4 statistics 2–3, 127 defining 15 official 15, 16, 19 relationship between the state and 113–16 state formation and 15–30 steering 51, 54, 55, 57, 99, 104 Stevens, M. 282 Stiglitz, J. 35 Stiglitz-Sen-Fitoussi Commission 35 stress tests 266 structural adjustment 278 styles of measurement as/of governance 67–8 subjectification 119 subjects 119 suicides 3 Sullivan, H. 158 supportive arrangements 159, 165 supranational institutions 138–55 EGDI 138, 150–51, 152–3 GaaG 138, 141–4, 152–3 SDGs 138, 144–50, 152–4 WGI project 138, 139–41, 152–3 survey length 193, 194, 196–7 survey scales 187, 188–98, 199 adaptation 196–8 theories of measurement 188–94 use in governance research 194–8 surveys 107, 167–8 sustainability 253 Sustainable Development Goals (SDGs) 100 comparing supernational institutions 138, 144–50, 152–4 constitutive effects 73–5 development measuring 273–4, 277, 278, 279–80, 282–3, 283–4 education goal 118–19 poverty reduction goal 276 quantification and global governance 31, 39, 40 Sutherland, T. 38 symbolism 50–51 system strengthening see health system strengthening systematic error 190 systematic literature review 194–8 systemic interactions 65, 66 Talbot, C. 53 tau-equivalent tests models 190–91, 194 technical dimension of measuring governance 132 technocrats 48–9

314  Handbook on measuring governance Telecommunication Infrastructure Index (TII) 151 temporal bracketing 166–7 temporalization 64 terrorism/violence, absence of 139, 140 ‘thick’ and ‘thin’ concepts of democracy 290–91 Time Driven Activity-Based Costing system (TDABC) 245 time spent on benefits 230–32, 235–6 timing 65 Tkacz, N. 116 tools of government approach 46–7 top officials 48–9 Torfing, J. 1, 83 Total Official Support for Sustainable Development (TOSSD) 275 Total Quality Management (TQM) 88, 91 transaction costs 159, 160 transparency 159, 163 Corruption Perception Index (CPI) 72–3 Transparency International (TI) 72 true score 188, 189–90 Trump, D. 19 trust, mutual 159, 162 tunnel vision 117 unconventional monetary policy 265–9 understanding, mutual 159, 161 unemployment 229, 230, 231 see also active labour market policies (ALMPs) UNESCO 71 Institute for Statistics 118–19 unintended effects 62, 63–4 United Kingdom (UK) 19, 102, 104 Audit Commission 4, 5 health care 247, 249, 250 National Audit Office 4 National Health Service (NHS) 52, 56 NPM 4–5 purposes of performance measurement 52–7, 61 United Nations (UN) 16, 31 2030 Agenda for Sustainable Development 31, 33, 38, 39–40, 41, 144, 146, 153 data revolution 31, 33, 280 EGDI 138, 150–51, 152–3 indicator monitoring 280–81 Millennium Development Goals (MDGs) 38, 39, 273, 277, 279, 280 Sustainable Development Goals see Sustainable Development Goals United Nations Development Programme (UNDP) 98, 281

multidimensional poverty index (MPI) 276 United States of America (USA) 4, 98, 101, 237, 265 Central Intelligence Agency (CIA) 34 Federal Reserve 264 Government Performance and Results Act (GPRA) 4 health care 244, 245, 250, 251 Maryland 247, 249 ‘Obamacare’ 100 Program Assessment Rating Tool 4 Springfield Armory 132 State Failure Task Force 34 state formation and statistics 19, 20, 21 Temporary Assistance for Needy Families (TANF) reform 231–2, 237 Westpoint Academy 132 urban planning 223 valence 198–9 validity 191–3, 194 value-based health care 247 value creation processes 88 Van Dooren, W. 49, 50, 51 Van Engen, N.A. 197 Van Loon, N.M. 196 Van Ryzin, G.G. 196 Vanhanen Index 291, 292 Varieties of Democracy Project (V-Dem) 98, 291, 295–6 variety 159, 160 Vaz, N. 252 Verkuilen, J. 291, 298 Vij, N. 157 violence/terrorism, absence of 139, 140 viziers 62 vocal pitch 199 voice 139, 140 Von der Leyen, U. 149 Wansleben, L. 260, 262, 263 Washington Consensus 5 waves of democratization 288–90, 299 Weber, M. 3, 15–16, 81 wellbeing 35, 100 Whitlam, G. 25–6 Win, E. 37 ‘window dressing’ 211 Woolf, S. 19 Woolgar, S. 112 work-first approach to ALMPs 231, 234 workforce-based health system strengthening 175, 176, 177, 178, 181, 184 World Bank 5, 16, 32, 40–41, 71, 97, 245, 279, 281

Index  315 Country Policy and Institutional Assessment (CPIA) 36 Doing Business Report 32, 154 logframe 36–7 poverty line 276 Worldwide Governance Indicators (WGI) 138, 139–41, 152–3

World Health Organization (WHO) 244–5, 246–7, 249, 250, 253, 281 Alma Ata declaration 248 Worldwide Governance Indicators (WGI) 138, 139–41, 152–3 Zuboff, S. 38