Decision and Game Theory for Security: 10th International Conference, GameSec 2019, Stockholm, Sweden, October 30 – November 1, 2019, Proceedings [1st ed. 2019] 978-3-030-32429-2, 978-3-030-32430-8

This book constitutes the refereed proceedings of the 10th International Conference on Decision and Game Theory for Security, GameSec 2019, held in Stockholm, Sweden, in October–November 2019.





LNCS 11836

Tansu Alpcan Yevgeniy Vorobeychik John S. Baras György Dán (Eds.)

Decision and Game Theory for Security 10th International Conference, GameSec 2019 Stockholm, Sweden, October 30 – November 1, 2019 Proceedings

Lecture Notes in Computer Science 11836

Founding Editors
Gerhard Goos, Karlsruhe Institute of Technology, Karlsruhe, Germany
Juris Hartmanis, Cornell University, Ithaca, NY, USA

Editorial Board Members
Elisa Bertino, Purdue University, West Lafayette, IN, USA
Wen Gao, Peking University, Beijing, China
Bernhard Steffen, TU Dortmund University, Dortmund, Germany
Gerhard Woeginger, RWTH Aachen, Aachen, Germany
Moti Yung, Columbia University, New York, NY, USA

More information about this series at http://www.springer.com/series/7410


Editors

Tansu Alpcan, University of Melbourne, Melbourne, VIC, Australia
Yevgeniy Vorobeychik, Washington University in St. Louis, St. Louis, MO, USA
John S. Baras, University of Maryland, College Park, College Park, MD, USA
György Dán, KTH Royal Institute of Technology, Stockholm, Sweden

ISSN 0302-9743  ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-030-32429-2  ISBN 978-3-030-32430-8 (eBook)
https://doi.org/10.1007/978-3-030-32430-8
LNCS Sublibrary: SL4 – Security and Cryptology

© Springer Nature Switzerland AG 2019

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Preface

It is difficult today to imagine the modern world without connectivity, information, and computing. We are now entering a new era in which typically isolated residential, commercial, and industrial devices form Cyber-Physical Systems (CPS) or Internet of Things (IoT). An important aspect of this modern connected world is the complex interactions and decisions between humans, devices, and networks. Game theory, which studies multi-agent or multi-person decision making, provides a solid mathematical foundation for models investigating decisions in these emerging connected, distributed, and complex systems.

Ubiquitous connectivity creates enormous value, but also comes at a cost: connected devices and people are also more vulnerable, as malicious parties are now able to gain access to them in ways that would have been impractical only a decade ago. Consequently, information security and privacy have gained paramount importance. Traditional approaches to security and privacy view this largely as a system engineering problem, focusing on specific applications and systems. The GameSec conference, in contrast, aims to study it from a more holistic perspective, using the tools borrowed from decision theory (including optimization and control theories) and game theory, as well as, more recently, from AI and machine learning.

This volume contains the papers presented at GameSec 2019, the 10th Conference on Decision and Game Theory for Security, held during October 30 – November 1, 2019, in Stockholm, Sweden. The GameSec conference series was inaugurated in 2010 in Berlin, Germany. GameSec 2019 was the 10th instantiation, and in this span it has become widely recognized as an important venue for interdisciplinary security research. The previous conferences were held in College Park (Maryland, USA, 2011), Budapest (Hungary, 2012), Fort Worth (Texas, USA, 2013), Los Angeles (USA, 2014), London (UK, 2015), New York (USA, 2016), Vienna (Austria, 2017), and Seattle (Washington, USA, 2018).

As in past years, the 2019 edition of GameSec featured a number of high-quality novel contributions. The conference program included 21 full paper presentations, as well as 11 short papers. The program contained papers on traditional GameSec topics such as game-theoretic models of various security problems, as well as an increasing number of papers at the intersection of AI, machine learning, and security, particularly in reinforcement learning, some of which were presented at the Adversarial AI special track. In addition, the program showed a clear increase of interest in modeling and studying deception through a game-theoretic lens.

Several organizations supported GameSec 2019. We thank, in particular, KTH Digitalisation Research Platform, Association for Computing Machinery (ACM), Springer, Ericsson, SAAB, and F-Secure.


We hope that the readers will find this volume a useful resource for their security and game theory research.

October 2019

Tansu Alpcan Yevgeniy Vorobeychik John S. Baras György Dán

Organization

Program Committee

Habtamu Abie – Norwegian Computing Centre, Norway
Tansu Alpcan – The University of Melbourne, Australia
Saurabh Amin – Massachusetts Institute of Technology, USA
Bo An – Nanyang Technological University, Singapore
Konstantin Avrachenkov – Inria, France
Svetlana Boudko – NR, Norway
Alvaro Cardenas – The University of Texas at Dallas, USA
Andrew Clark – Worcester Polytechnic Institute, USA
Jens Grossklags – Technical University of Munich, Germany
Yezekael Hayel – LIA, University of Avignon, France
Hideaki Ishii – Tokyo Institute of Technology, Japan
Eduard Jorswieck – TU Dresden, Germany
Charles Kamhoua – US Army Research Laboratory, USA
Murat Kantarcioglu – The University of Texas at Dallas, USA
Arman Mhr Khouzani – Queen Mary University of London, UK
Christopher Kiekintveld – University of Texas at El Paso, USA
Sandra König – Austrian Institute of Technology, Austria
Aron Laszka – University of Houston, USA
Yee Wei Law – University of South Australia, Australia
Bo Li – University of Illinois at Urbana-Champaign, USA
Daniel Lowd – University of Oregon, USA
Mohammad Hossein Manshaei – Florida International University (FIU), USA
Aikaterini Mitrokotsa – Chalmers University of Technology, Sweden
Shana Moothedath – University of Washington, USA
Mehrdad Nojoumian – Florida Atlantic University, USA
Andrew Odlyzko – University of Minnesota, USA
Miroslav Pajic – Duke University, USA
Emmanouil Panaousis – University of Surrey, UK
Sakshyam Panda – Nokia, Finland
Radha Poovendran – University of Washington, USA
David Pym – University College London, UK
Bhaskar Ramasubramanian – University of Washington, USA
Stefan Rass – System Security Group, Universität Klagenfurt, Germany
Henrik Sandberg – KTH Royal Institute of Technology, Sweden
Stefan Schauer – AIT Austrian Institute of Technology GmbH, Austria
Arunesh Sinha – University of Michigan, USA


George Theodorakopoulos – Cardiff University, UK
Long Tran-Thanh – University of Southampton, UK
Yevgeniy Vorobeychik – Washington University in St. Louis, USA
Haifeng Xu – University of Southern California, USA
Quanyan Zhu – New York University, USA
Jun Zhuang – SUNY Buffalo, USA

Additional Reviewers

Basak, Anjon; Collinson, Matthew; Elfar, Mahmoud; Gan, Jiarui; Gutierrez, Marcus; Li, Zuxing; Liang, Bei; Milosevic, Jezdimir; Misra, Shruti; Nekouei, Ehsan; Ortiz, Anthony; Sagong, Sang Uk; Sahabandu, Dinuka; Saritaş, Serkan; Thakoor, Omkar; Tsaloli, Georgia; Veliz, Oscar; Wang, Yu; Williams, Julian; Xiao, Baicen; Zhang, Jing

Contents

Design of Load Forecast Systems Resilient Against Cyber-Attacks . . . . . 1
Carlos Barreto and Xenofon Koutsoukos

Identifying Stealthy Attackers in a Game Theoretic Framework Using Deception . . . . . 21
Anjon Basak, Charles Kamhoua, Sridhar Venkatesan, Marcus Gutierrez, Ahmed H. Anwar, and Christopher Kiekintveld

Choosing Protection: User Investments in Security Measures for Cyber Risk Management . . . . . 33
Yoav Ben Yaakov, Xinrun Wang, Joachim Meyer, and Bo An

When Is a Semi-honest Secure Multiparty Computation Valuable? . . . . . 45
Radhika Bhargava and Chris Clifton

You only Lie Twice: A Multi-round Cyber Deception Game of Questionable Veracity . . . . . 65
Mark Bilinski, Kimberly Ferguson-Walter, Sunny Fugate, Ryan Gabrys, Justin Mauger, and Brian Souza

Honeypot Type Selection Games for Smart Grid Networks . . . . . 85
Nadia Boumkheld, Sakshyam Panda, Stefan Rass, and Emmanouil Panaousis

Discussion of Fairness and Implementability in Stackelberg Security Games . . . . . 97
Victor Bucarey and Martine Labbé

Toward a Theory of Vulnerability Disclosure Policy: A Hacker's Game . . . . . 118
Taylor J. Canann

Investing in Prevention or Paying for Recovery - Attitudes to Cyber Risk . . . . . 135
Anna Cartwright, Edward Cartwright, and Lian Xue

Realistic versus Rational Secret Sharing . . . . . 152
Yvo Desmedt and Arkadii Slinko

Solving Cyber Alert Allocation Markov Games with Deep Reinforcement Learning . . . . . 164
Noah Dunstatter, Alireza Tahsini, Mina Guirguis, and Jelena Tešić

Power Law Public Goods Game for Personal Information Sharing in News Commentaries . . . . . 184
Christopher Griffin, Sarah Rajtmajer, Prasanna Umar, and Anna Squicciarini

Adaptive Honeypot Engagement Through Reinforcement Learning of Semi-Markov Decision Processes . . . . . 196
Linan Huang and Quanyan Zhu

Deceptive Reinforcement Learning Under Adversarial Manipulations on Cost Signals . . . . . 217
Yunhan Huang and Quanyan Zhu

DeepFP for Finding Nash Equilibrium in Continuous Action Spaces . . . . . 238
Nitin Kamra, Umang Gupta, Kai Wang, Fei Fang, Yan Liu, and Milind Tambe

Effective Premium Discrimination for Designing Cyber Insurance Policies with Rare Losses . . . . . 259
Mohammad Mahdi Khalili, Xueru Zhang, and Mingyan Liu

Analyzing Defense Strategies Against Mobile Information Leakages: A Game-Theoretic Approach . . . . . 276
Kavita Kumari, Murtuza Jadliwala, Anindya Maiti, and Mohammad Hossein Manshaei

Dynamic Cheap Talk for Robust Adversarial Learning . . . . . 297
Zuxing Li and György Dán

Time-Dependent Strategies in Games of Timing . . . . . 310
Jonathan Merlevede, Benjamin Johnson, Jens Grossklags, and Tom Holvoet

Tackling Sequential Attacks in Security Games . . . . . 331
Thanh H. Nguyen, Amulya Yadav, Branislav Bosansky, and Yu Liang

A Framework for Joint Attack Detection and Control Under False Data Injection . . . . . 352
Luyao Niu and Andrew Clark

QFlip: An Adaptive Reinforcement Learning Strategy for the FlipIt Security Game . . . . . 364
Lisa Oakley and Alina Oprea

Linear Temporal Logic Satisfaction in Adversarial Environments Using Secure Control Barrier Certificates . . . . . 385
Bhaskar Ramasubramanian, Luyao Niu, Andrew Clark, Linda Bushnell, and Radha Poovendran

Cut-The-Rope: A Game of Stealthy Intrusion . . . . . 404
Stefan Rass, Sandra König, and Emmanouil Panaousis

Stochastic Dynamic Information Flow Tracking Game with Reinforcement Learning . . . . . 417
Dinuka Sahabandu, Shana Moothedath, Joey Allen, Linda Bushnell, Wenke Lee, and Radha Poovendran

Adversarial Attacks on Continuous Authentication Security: A Dynamic Game Approach . . . . . 439
Serkan Sarıtaş, Ezzeldin Shereen, Henrik Sandberg, and György Dán

On the Optimality of Linear Signaling to Deceive Kalman Filters over Finite/Infinite Horizons . . . . . 459
Muhammed O. Sayin and Tamer Başar

MTDeep: Boosting the Security of Deep Neural Nets Against Adversarial Attacks with Moving Target Defense . . . . . 479
Sailik Sengupta, Tathagata Chakraborti, and Subbarao Kambhampati

General Sum Markov Games for Strategic Detection of Advanced Persistent Threats Using Moving Target Defense in Cloud Networks . . . . . 492
Sailik Sengupta, Ankur Chowdhary, Dijiang Huang, and Subbarao Kambhampati

Operations over Linear Secret Sharing Schemes . . . . . 513
Arkadii Slinko

Cyber Camouflage Games for Strategic Deception . . . . . 525
Omkar Thakoor, Milind Tambe, Phebe Vayanos, Haifeng Xu, Christopher Kiekintveld, and Fei Fang

When Players Affect Target Values: Modeling and Solving Dynamic Partially Observable Security Games . . . . . 542
Xinrun Wang, Milind Tambe, Branislav Bošanský, and Bo An

Perfectly Secure Message Transmission Against Independent Rational Adversaries . . . . . 563
Kenji Yasunaga and Takeshi Koshiba

Author Index . . . . . 583

Design of Load Forecast Systems Resilient Against Cyber-Attacks

Carlos Barreto and Xenofon Koutsoukos

Vanderbilt University, Nashville, USA
{Carlos.A.Barreto,Xenofon.Koutsoukos}@vanderbilt.edu

Abstract. Load forecast systems play a fundamental role in the operation of power systems, because they reduce uncertainties about the system's future operation. An increasing demand for precise forecasts motivates the design of complex models that use information from different sources, such as smart appliances. However, untrusted sources can introduce vulnerabilities in the system. For example, an adversary may compromise the sensor measurements to induce errors in the forecast. In this work, we assess the vulnerabilities of load forecast systems based on neural networks and propose a defense mechanism to construct resilient forecasters. We model the strategic interaction between a defender and an attacker as a Stackelberg game, where the defender decides first the prediction scheme and the attacker chooses afterwards its attack strategy. Here, the defender selects randomly the sensor measurements to use in the forecast, while the adversary calculates a bias to inject in some sensors. We find an approximate equilibrium of the game and implement the defense mechanism using an ensemble of predictors, which introduces uncertainties that mitigate the attack's impact. We evaluate our defense approach training forecasters using data from an electric distribution system simulated in GridLAB-D.

Keywords: Security · Machine learning · Power systems · Load forecast · Game theory

1 Introduction

Load forecast systems play a fundamental role in the operation of power systems, because utilities and generators need estimations of the future loads to plan their operation. For example, the utilities procure (or sell) energy in electricity markets based on estimations of the future demand. The relevance of forecast systems will increase due to uncertainties coming from new technologies (e.g., renewable generation, electric vehicles, and smart appliances); however, these technologies also introduce vulnerabilities.

Some works have demonstrated that false data injection (FDI) attacks, which manipulate sensor readings, can induce errors in state estimation systems of


power grids, affecting the system's operation [14,29]. An adversary can design attacks to damage the system or to change the electricity market prices. Likewise, an adversary can manipulate the forecast system exploiting vulnerabilities of artificial intelligence models [1,4].

In this work, we assess the vulnerabilities that forecast systems introduce in electricity markets. We focus on forecast models based on artificial neural networks (NNs) that accept as inputs the historical measurements from some sensors (e.g., power sensors and thermometers). We consider an adverse generator who injects a bias in some measurements to induce errors in the forecast. Unlike other works, this adversary must choose its strategy taking into account that the attack will affect future predictions that use the biased measurements.

We model the strategic interaction between the defender and the attacker as a Stackelberg game, in which the defender decides first the prediction scheme and the attacker chooses afterwards its attack strategy [9]. In this case, the defender chooses randomly the sensor measurements to use in the forecast. A near optimal defense strategy consists in selecting each sensor's measurements with the same probability. With this strategy the defender reduces the number of compromised sensors used in the prediction.

We find some practical limitations implementing the proposed defense strategy, due to the large strategy space. In particular, since the defender selects randomly the sensor measurements, the number of possible models grows exponentially with respect to the number of sensors. For practical reasons, we propose an approximate implementation of the defense mechanism using an ensemble of prediction models.

The defense strategy can fail if the ensemble becomes more sensitive to the attacks than the original model. This can happen because each model in the ensemble makes predictions using fewer sensors; therefore, an attack with fewer resources can still create a significant deviation in the predictions. We find that the ensemble becomes more resilient when its models predict a fraction of the total load.

We evaluate our defense approach training forecasters to predict the future load of an electric distribution system simulated in GridLAB-D. Our simulation includes both residential and commercial loads, which have appliances such as heating, ventilation, and air conditioning (HVAC) systems, water heaters, and pool pumps, among others.

The paper is structured as follows. In Sect. 2 we introduce a model of the electricity market and explain how an adverse generator can manipulate the sensor readings to profit. Section 3 presents our methodology to design resilient forecasters. In this section we introduce the game between the adversary and the defender and find an approximate equilibrium. Section 4 presents a way to implement the optimal defense policy using a small number of models, without losing its efficacy. In Sect. 5 we validate our approach with some experiments. Finally, in Sect. 6 we comment on related work and we conclude the paper in Sect. 7.

2 System Model

In this section we introduce an electricity market model and show how load forecasts affect the profit of both utilities and generators. Then we explain how an adverse generator can manipulate the sensor readings to profit. Also, we quantify the consequences of an attack, that is, the costs for the utility and the benefits for the adversary.

2.1 Electricity Markets

In general, electricity systems use markets as mechanisms to allocate resources efficiently. The electricity markets, unlike other markets, need an operator who guarantees that the system's equilibrium (allocation of resources) satisfies the system's physical constraints.¹

Some power systems use two markets, namely the day-ahead market (DAM) and the real-time market (RTM) [22]. The DAM accepts bids for the next day and produces commitments for demand and generation. The commitments reduce uncertainties of demand, which allows the system's operator to schedule generators with anticipation. However, unexpected events may change the production capacity or the needs for energy. The RTM complements the DAM, correcting periodically imbalances between demand and generation and preventing frequency deviations that may damage components with tight operational limits. These adjustments translate into trades settled at the price of the RTM [16]. The market participants must fulfill the agreements from both the day-ahead and the real-time markets. For example, buyers must pay sellers the price agreed in the DAM; however, if a generator fails to supply energy, then the system operator has to buy energy in the RTM. Likewise, if a customer uses less resources, then the system operator sells the excess in the RTM.

Why Load Forecasting Is Important? Customers usually do not perceive changes in prices because they pay a flat tariff to retailers, who serve as their intermediaries in the markets. Hence, the retailers deal with the risk of uncertain market prices, that is, they buy energy at variable prices in the market and sell it at a fixed price to their customers. For this reason, the retailers try to reduce uncertainties by forecasting the demand of their customers [25].

2.2 Load Forecasting

Utilities and generators use short-term load forecasting (STLF), which ranges from hours to weeks, to adjust their bids in electricity markets. Shorter forecast horizons help to control the power flow, while long-term forecasts help to plan the operation and the system expansion [10].

¹ These constraints prevent damage to the equipment and the environment. For example, generators may have operational limitations to prevent emissions or to regulate the use of water (for hydroelectric plants) [16].


Here we consider a forecaster that uses past sensor measurements of loads and the weather to predict the total demand during a future time period. In particular, the forecaster uses the measurements available at time $t$ to estimate the demand at time $t + \tau$, where $\tau$ is the forecast horizon.

Let us denote with $M = \{1, \ldots, m\}$ the set of sensors, which have measurements $l_k(t)$ at time $t = 1, 2, \ldots, T$, with $k \in M$. Each measurement $l_k(t)$ corresponds to average values during a period of one hour. Furthermore, we denote with $y(t)$ the total demand at time $t$. Thus, the prediction problem consists in finding a function $f(\cdot)$ that uses sensor measurements available at time $t$ to estimate the future demand $y(t + \tau)$. We denote with the vector $x_k(t) = [l_k(t-1), \ldots, l_k(t-H-1)]$ the historical measurements of the $k$th sensor available at time $t$. In this case we use $H$ past samples to estimate the future load. Moreover, we denote with the vector $X(t) = [x_k(t-\tau)]_{k \in M}$ the whole historical data at time $t$.

Remark 1. In this case we build a forecaster using load and temperature measurements; however, we assume that the adversary manipulates only the load measurements.

Here we use a NN to estimate $y(t)$ as a function of the historical data $X(t)$. The estimated demand is $\hat{y}(t) = f(X(t), w^*) = f(X(t))$, where the vector $w^*$ represents the weights of the NN that minimizes an error metric (loss function) $l(\cdot)$, that is, $w^* \in \arg\min_w l(y, f(X, w))$. Hence, the prediction error at time $t$ is $\varepsilon(t) = y(t) - \hat{y}(t)$.

In general, nonlinear distance metrics are more sensitive to outliers, since large errors in individual samples have a larger impact. Hence, the mean squared error (MSE) is more sensitive to outliers than the mean absolute error (MAE) [12] (we illustrate this in Sect. 5). For this reason, we choose MAE as loss function, that is,

$$l(y, \hat{y}) = \frac{1}{T} \sum_{t=1}^{T} |y(t) - \hat{y}(t)|. \quad (1)$$

Forecast's Economic Impact. Recall that the utility uses load forecasts to choose its bids, which in turn create commitments in the electricity market. In our model the utility purchases the estimated load $\hat{y}$ in the DAM and trades the demand imbalance (estimation error) $\varepsilon$ in the RTM. Hence, the utility pays

$$\Omega_u(y, \hat{y}) = \sum_{t=1}^{T} \left[ \hat{y}(t) p^{DA}(t) + \varepsilon(t) p^{RT}(t) \right], \quad (2)$$


where $p^{DA}$ and $p^{RT}$ represent the price in the DAM and RTM, respectively. On the other hand, we model the profit of generators (revenues minus generation costs) as

$$\Omega_g(y, \hat{y}) = \Omega_u(y, \hat{y}) - C(y), \quad (3)$$

where $C(\cdot)$ represents the generation cost. For simplicity, we formulate the generation cost as a function of the total energy produced $y$. However, in practice the trades in each market can affect the generation costs.

2.3 Adversary Model

According to Eq. (3), the generators can profit from estimation errors that increase the utility's cost $\Omega_u$. In particular, we consider a cyber-attack that injects false data in the sensor measurements and transforms them as $l_k^a(t) = l_k(t) + b_k(t)$, where $b_k(t)$ represents the bias in the $k$th sensor at time $t$. Likewise, the historical data of the $k$th sensor becomes $x_k^a(t) = x_k(t) + \varphi_k(t)$, where

$$\varphi_k(t) = [b_k(t-1), \ldots, b_k(t-H-1)]. \quad (4)$$

We denote the total historical data manipulated by an adversary as $X_a(t) = X(t) + B_a(t)$, where the vector

$$B_a(t) = [\varphi_k(t-\tau)]_{k \in M} \quad (5)$$

represents the bias observed by the forecast model. An attack with bias $B_a(t)$ transforms the load forecast as $\hat{y}_a(t) = f(X(t) + B_a(t)) \approx \hat{y}(t) - \delta(B_a, t)$, where $\delta(B_a, t)$ denotes the impact of the attack (the deviation from the original prediction), which satisfies $\delta(0, t) = 0$. Therefore, the net prediction error becomes

$$y(t) - \hat{y}_a(t) \approx y(t) - \hat{y}(t) + \delta(B_a, t) = \varepsilon(t) + \delta(B_a, t). \quad (6)$$

Impact of the Attack. From Eqs. (2) and (6), the utility's cost with an attack is

$$\Omega_u(y, \hat{y}_a) \approx \sum_{t=1}^{T} \left[ \hat{y}_a(t) p^{DA}(t) + \left( \varepsilon(t) + \delta(B_a, t) \right) p^{RT}(t) \right].$$

Hence, the benefit of the generator is

$$\Omega_g(y, \hat{y}_a) - \Omega_g(y, \hat{y}) \approx -\sum_{t=1}^{T} \delta(B_a, t) \left( p^{DA}(t) - p^{RT}(t) \right). \quad (7)$$
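As a concrete illustration of Eq. (7) with hypothetical numbers: suppose the attack pushes the forecast down by $\delta(B_a, t) = 10$ MWh in a single hour in which $p^{DA}(t) = 30$ \$/MWh and $p^{RT}(t) = 50$ \$/MWh. The utility then buys 10 MWh less than needed in the DAM and must cover the shortfall at the higher real-time price, and Eq. (7) gives the generator a benefit of $-10 \cdot (30 - 50) = 200$ \$ for that hour.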

The precise goal of the attack depends on the price difference between the DAM and the RTM. The next result shows some conditions in which the adversary benefits by either increasing or decreasing the forecasts.


Lemma 1. Assume that $\delta(B_a)$ and $p^{DA} - p^{RT}$ are independent random variables. If either $E[\delta(B_a)] \le 0$ and $E[p^{DA} - p^{RT}] \ge 0$, or $E[\delta(B_a)] \ge 0$ and $E[p^{DA} - p^{RT}] \le 0$, then the adversary profits from the attack.

In the remainder of the paper we assume that $\frac{1}{T}\sum_t p^{DA}(t) \le \frac{1}{T}\sum_t p^{RT}(t)$; hence, the adversary seeks to induce under-estimations of the future load ($\frac{1}{T}\sum_t \delta(B_a, t) \ge 0$). For this reason, we formulate the adversary's objective as

$$\begin{aligned}
\underset{[b_k]_{k=1}^{m}}{\text{maximize}} \quad & \frac{1}{T}\sum_{t=1}^{T} |y(t) - \hat{y}_a(t)| = \frac{1}{T}\sum_{t=1}^{T} |\varepsilon(t) + \delta(B_a, t)| \\
\text{subject to:} \quad & \text{Eq. (4)}, \quad \text{Eq. (5)}, \\
& \sum_{t=1}^{T} \delta(B_a, t) \ge 0, \\
& b_k(t) = 0 \ \text{if} \ k \notin M_a
\end{aligned} \quad (8)$$

We measure the impact of the attack for the defender as the damage in the forecast's accuracy. The defender's loss function (see Eq. (1)) with an attack becomes

$$l(y, \hat{y}_a) = \frac{1}{T}\sum_{t=1}^{T} |y(t) - \hat{y}_a(t)| \approx \frac{1}{T}\sum_{t=1}^{T} |\varepsilon(t) + \delta(B_a, t)|. \quad (9)$$

Thus, the defender and the attacker pursue opposite goals.

Remark 2. An FDI attack may have broader consequences, since other forecasters can use the same historical data with different purposes. For example, the system operator may calculate reserves based on load predictions. Thus, under-estimations in the future load may expose the system to both failures and other attacks.
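Problem (8) is solved numerically later in Sect. 5 with L-BFGS-B. The following is only a rough sketch of that idea under simplifying assumptions: the bias is a single 24-hour periodic pattern shared by all compromised sensors (consistent with the assumptions in Sect. 2.4), the sign constraint is handled with a penalty because L-BFGS-B supports only bound constraints, and `forecast`, the bias bounds, and the default finite-difference gradients are illustrative choices rather than details taken from the paper (the paper uses the gradient of the forecaster itself).

```python
import numpy as np
from scipy.optimize import minimize

H = 24  # the bias pattern repeats every H hours, b_k(t) = b_k(t + H)

def attack_objective(b, X, y, forecast, compromised, penalty=1e3):
    """Negative of the objective of (8), with sum_t delta >= 0 as a penalty.
    X has shape (T, H, m); `compromised` indexes the attacked sensors M_a."""
    Xa = X.copy()
    Xa[:, :, compromised] += b[None, :, None]   # same periodic bias in every compromised sensor
    y_hat = forecast(X)
    y_hat_a = forecast(Xa)
    delta = y_hat - y_hat_a                     # deviation induced by the attack
    eps = y - y_hat                             # nominal prediction error
    score = np.abs(eps + delta).mean()          # objective of Eq. (8)
    violation = max(0.0, -delta.sum())          # penalize sum_t delta(B_a, t) < 0
    return -(score - penalty * violation)       # minimize the negative

# Hypothetical call: `forecast` wraps the trained model, `attacked` lists M_a,
# and the bounds cap the injected bias magnitude.
# res = minimize(attack_objective, x0=np.zeros(H),
#                args=(X_hist, y_true, forecast, attacked),
#                method="L-BFGS-B", bounds=[(-0.05, 0.05)] * H)
```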

2.4 Attack Capabilities and Restrictions

We make the following assumptions about the attack:

1. The adversary knows both the forecast system (or estimates it [4,8,28]) and samples of the historical measurements. With this information the adversary can find an attack that solves Eq. (8).
2. The attack does not depend on the current state of the system. This can occur if the adversary cannot read the sensor measurements in real time or if it is unable to use such information to compute the bias.
3. The prices' distribution does not change with the attack; hence, the adversary's goal does not change after the attack.
4. The adversary compromises a subset of sensors $M_a \subseteq M$, with $m_a = |M_a|$. Hence, $b_k(t) = 0$ for the sensors $k \notin M_a$.
5. The adversary injects the same bias in all the compromised sensors. Hence, $b_k(t) = b_j(t) = b(t)$ for all $k, j \in M_a$.


6. The number of sensors compromised is the main variable that determines the impact of an attack.
7. The impact of the attack $\delta(\cdot)$ is concave increasing with respect to the number of sensors compromised. Intuitively, the attacker may experience diminishing returns in its attacks, that is, the impact increases with the number of sensors compromised, but the growth rate decreases with each additional sensor. Likewise, we assume that forecasters that use the same number of sensors have the same impact function.

The adversary must design its attack considering its future effects, because the utility uses the biased measurements during $H$ periods. In this case, the adversary can leverage the periodicity of the load to design a successful attack. In particular, the loads follow a 24 h period determined by the daily habits of the consumers. Likewise, the forecaster also has some periodicity, because it uses $H$ samples in its estimation. Thus, the adversary can manipulate the sensors to report periodically the same bias, that is, $b_k(t) = b_k(t + H)$.

3 Resilient Forecasting

The defense problem consists in designing a forecast system using data from untrusted sources. Here we consider the possibility of mitigating the impact of attacks by introducing randomness in the system, which in turn creates uncertainties for the attacker. In particular, we analyze the efficacy of building forecast models using randomly selected sensor measurements. Intuitively, uncertainties in the system's model can reduce the success of the adversary, who has to design the attack considering possible contingencies.

3.1 Game Formulation

We model the strategic interaction between the defender and the attacker as a Stackelberg game, where the defender decides first the prediction scheme and the attacker chooses afterwards its attack strategy [9].

Strategies. In this case, the defender chooses the probability $\rho_k^d \in [0, 1]$ of using the $k$th sensor, for $k \in M$; these probabilities satisfy $\sum_{k \in M} \rho_k^d = m_d$. The above condition implies that the forecaster uses on average $m_d$ sensors. We denote the defense strategy with the vector $\rho^d = [\rho_k^d]_{k \in M}$. Likewise, we represent the strategy of the adversary with the vector $\rho^a = [\rho_k^a]_{k \in M}$, where $\rho_k^a \in [0, 1]$ denotes the probability of attacking the $k$th sensor. In this case, the adversary compromises at most $m_a$ sensors on average; hence, $\sum_{k \in M} \rho_k^a \le m_a$.

Let us denote with $M_d$ and $M_a$ the sets of sensors selected by the defender and the attacker, respectively. Thus, the set $M_c = M_d \cap M_a$ contains the compromised sensors that the defender uses in the prediction. Let $W_k$ be a Bernoulli random variable with success probability $\rho_k = \rho_k^d \rho_k^a$. In other words, $W_k$ describes whether both the defender and the adversary select the $k$th sensor. Hence, the number of compromised sensors (attacked sensors used in the forecast) is

$$S_m = \sum_k W_k = |M_c|,$$

where $S_m$ follows the $m$-generalized binomial distribution. Hence, the expected number of compromised sensors is

$$E[S_m \mid \rho^d, \rho^a] = \lambda(\rho^d, \rho^a) = \sum_{k \in M} \rho_k^d \rho_k^a.$$
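As a quick numerical sanity check of this expression (with illustrative sensor counts and probabilities that are not taken from the paper), one can simulate the Bernoulli selections $W_k$ and compare the empirical mean of $S_m$ with $\lambda(\rho^d, \rho^a) = \sum_k \rho_k^d \rho_k^a$:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100                                  # illustrative number of sensors
rho_d = np.full(m, 0.5)                  # defender: uniform selection, m_d = 0.5 m
rho_a = np.zeros(m)
rho_a[:30] = 1.0                         # attacker: targets 30 specific sensors

trials = 100_000
# W_k ~ Bernoulli(rho_d[k] * rho_a[k]); S_m is their sum in each trial
W = rng.random((trials, m)) < rho_d * rho_a
S_m = W.sum(axis=1)

print(S_m.mean())                        # empirical E[S_m] ...
print(np.sum(rho_d * rho_a))             # ... vs. the analytical value 15.0
```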

Player's Payoff. Let us express the impact of the attack as a function of the sensors selected by the players ($M_d$ and $M_a$) and the bias $b$,

$$y(t) - \hat{y}_a(t) \approx \delta(M_d, M_a, b, t).$$

Since we assume that the impact depends only on the number of sensors compromised, two attacks on the sets $M_a^1$ and $M_a^2$ that satisfy $|M_d \cap M_a^1| = |M_d \cap M_a^2|$ have approximately the same impact,² that is,

$$\delta(M_d, M_a^1, b, t) \approx \delta(M_d, M_a^2, b, t). \quad (10)$$

Henceforth we denote the impact function as $\delta(S_m, t) = \delta(M_d, M_a, b^*, t)$, where $b^*$ represents the optimal attack schedule that solves Eq. (8). According to Eqs. (8) and (9), the defender's objective consists in reducing the expected impact of the attack, while the adversary attempts to create an error in the prediction. For this reason, we define the payoff of the adversary as $\Pi^a(\rho^d, \rho^a) = E[\delta(S_m) \mid \rho^d, \rho^a]$. On the other hand, we define the payoff of the defender as $\Pi^d(\rho^d, \rho^a) = -\Pi^a(\rho^d, \rho^a)$.

3.2 Game's Approximate Equilibrium

The equilibrium of the game is the solution to

$$\min_{\rho^d} \max_{\rho^a} \Pi^a(\rho^d, \rho^a). \quad (11)$$

The concavity of the impact with respect to the number of sensors compromised implies

$$E[\delta(S_m) \mid \rho^d, \rho^a] \le \delta\left( E[S_m \mid \rho^d, \rho^a] \right) = \delta(\lambda(\rho^d, \rho^a)). \quad (12)$$

² Although the impact depends on the particular model, that is, the set $M_d$, we assume that models with the same number of sensors $m_d$ have the same impact function.


The next result shows that an approximate equilibrium to the game in Eq. (11) comes from the solution to

$$\min_{\rho^d} \max_{\rho^a} \delta(\lambda(\rho^d, \rho^a)). \quad (13)$$

Proposition 1. Let $(\rho^d, \rho^a)$ be the solution to Eq. (13). Then $(\rho^d, \rho^a)$ is a $\xi$-equilibrium of the game in Eq. (11), that is,

$$\Pi^d(\rho^d, \rho^a) \ge \Pi^d(\tilde{\rho}^d, \rho^a) - \xi \quad \text{and} \quad \Pi^a(\rho^d, \rho^a) \ge \Pi^a(\rho^d, \tilde{\rho}^a) - \xi$$

for some strategies $\tilde{\rho}^d$ and $\tilde{\rho}^a$, and $\xi \ge 0$. This means that the players cannot get benefits superior to $\xi$ by adopting another strategy.

Moreover, the next result shows that the optimal defense strategy $\rho^d$ for the game in Eq. (13) consists in selecting the sensors with the same probability.

Proposition 2. The defense strategy $\rho^d$ in the equilibrium of Eq. (13) satisfies $\rho_k^d = \frac{m_d}{m}$, for all $k \in M$.

Remark 3. The adversary's optimal strategy consists in targeting the sensors with the highest selection probability. However, when the defender chooses all the sensors with the same probability, then the adversary doesn't have any preference for the sensors.

Properties of the Defense Mechanism. If the defender selects $m_d$ sensors with a uniform distribution, then the expected number of compromised sensors is

$$\lambda(\rho^d, \rho^a) = \frac{m_d}{m} m_a.$$

Thus, the proportion of compromised sensors is $\lambda(\rho^d, \rho^a)/m_d = m_a/m$, which doesn't depend on $m_d$. In other words, by selecting sensors randomly we reduce the number, but not the proportion, of compromised sensors. We improve the resiliency of the system if the ensemble has a lower impact than the original model; hence, from Eq. (12) we need

$$\delta(M, M_a, \tilde{b}, t) \ge E[\delta(S_m) \mid \rho^d, \rho^a], \quad (14)$$

where $\tilde{b}$ represents the optimal bias when the forecast model uses all the sensors. The above condition can fail if the ensemble becomes more sensitive to the attacks than the original model. Besides selecting the sensor measurements randomly, the defender may adjust the training of the models to guarantee Eq. (14). In particular, the defender may implement some form of regularization to make the models less sensitive. For example, [20] makes NNs robust against attacks implementing an algorithm equivalent to Lipschitz regularization. In Sect. 5 we explore how the target in the training phase affects the sensitivity of the models.

4 Implementation of the Forecast

From the game formulation, the defender selects randomly $m_d$ sensors and constructs a forecaster. Thus, the defender must either train a new model for each prediction task or train and store the models with anticipation. However, such approaches may require a prohibitively large amount of time and resources, because the defender can build $\binom{m}{m_d}$ different forecasters. For this reason, we approximate the defense strategy constructing an ensemble with $n$ models, guaranteeing that they use each sensor's data with probability $m_d/m$ (the desired defense strategy, see Proposition 2).

Let us partition the set of sensors $M$ in $n$ sets $P_i$ of size $m/n$, where $\bigcup_{i=1}^{n} P_i = M$ and $P_i \cap P_j = \emptyset$ for all $i \ne j$, $1 \le i, j \le n$. If $m_d/m \le 0.5$, then we construct an ensemble of $n = m/m_d$ models, where each model uses the set $M_i = P_i$ for its training. In this way, each model uses each sensor with probability $m_d/m$. On the other hand, if $m_d/m \ge 0.5$, then we construct $n$ models that use all except one of the subsets. In this case, the $i$th model uses $M_i = \bigcup_{j \ne i} P_j$ sensors. Thus, we select each sensor's data with probability $(n-1)/n$, which must satisfy $\frac{n-1}{n} = \frac{m_d}{m}$ (that is, $n = \frac{m}{m - m_d}$). This means that each partition has size $|P_i| = m - m_d$ and each model uses $|M_i| = m_d$ sensors.

When $n$ is not an integer, we can still achieve the desired selection probability merging two ensembles. Let us construct two partitions $\{P_i^a\}_{i=1}^{n_1}$ and $\{P_i^b\}_{i=1}^{n_2}$ with $n_1 = \lfloor n \rfloor$ and $n_2 = \lceil n \rceil$. With these partitions we can build two ensembles that select each sensor with probability $\gamma_k = \frac{n_k - 1}{n_k}$, for $k = 1, 2$. We can merge the ensembles selecting them with probability $\beta_k \in [0, 1]$ to satisfy $\sum_k \gamma_k \beta_k = \frac{m_d}{m}$, for $k = 1, 2$. In this way, we can construct an ensemble that guarantees that the prediction uses each sensor with probability $m_d/m$.

Since the ensemble uses models trained beforehand, the adversary may target the sensors of particular forecasters to improve its profit, rather than selecting them randomly. The next result shows that the adversary's optimal strategy consists in allocating its resources equally to all the partitions.

Proposition 3. Consider an ensemble constructed from a partition $\{P_i\}_{i=1}^{n}$ and let $\sigma_i$ be the proportion of resources allocated to the set $P_i$. Then the adversary maximizes the impact selecting $\sigma_i = \frac{1}{n}$, which leads to an expected impact

$$\delta\left( \frac{m_d}{m} m_a \right) = \delta(\lambda(\rho^d, \rho^a)).$$

Remark 4. According to the previous result, our mechanism to implement the ensemble has the same expected impact as the ensemble proposed in Sect. 3. Hence, combining multiple ensembles doesn't improve the forecaster's resiliency, because individually they have the same expected impact.

Remark 5. The previous results hold if the models of the ensemble have the same impact as a function of the number of sensors compromised.
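To make the construction above concrete, the following is a minimal sketch (not code from the paper) of the simplest case $m_d/m \le 0.5$: partition the sensors into $n = m/m_d$ disjoint groups, train one model per group on the aggregate load of its own sensors (the $f^4$-style target that Sect. 5 finds most resilient), and sum the sub-predictions. The helper `train_forecaster` is a placeholder for any regression model whose result exposes a `predict` method, and the windowing of the $H$ past samples is omitted for brevity.

```python
import numpy as np

def build_ensemble(loads, train_forecaster, m_d, seed=0):
    """loads: array of shape (T, m), one column of load measurements per sensor."""
    T, m = loads.shape
    n = m // m_d                                         # number of partitions and models
    rng = np.random.default_rng(seed)
    partitions = np.array_split(rng.permutation(m), n)   # disjoint sets P_i

    models = []
    for P_i in partitions:
        X_i = loads[:, P_i]                              # inputs: this partition's sensors only
        y_i = X_i.sum(axis=1)                            # target: total load of these sensors
        models.append((P_i, train_forecaster(X_i, y_i)))
    return models

def ensemble_predict(models, loads):
    # The total forecast is the sum of the per-partition forecasts.
    return sum(model.predict(loads[:, P_i]) for P_i, model in models)
```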

5 Evaluation

In this section we examine how some parameters of the forecasters affect their sensitivity to attacks. Based on the results from these experiments we design the ensemble and show its robustness against attacks.

5.1 Experimental Setup

Power System. We make a detailed simulation of an electric distribution system using GridLAB-D and the prototypical distribution feeder models provided by the Pacific Northwest National Laboratory (PNNL) [24]. The distribution models capture fundamental characteristics of distribution utilities from the U.S. In this case, we use the prototypical feeder R1-12.47-3, which represents a moderately populated area with 109 commercial and residential loads composed of appliances such as heating, ventilation, and air conditioning (HVAC) systems, water heaters, and pool pumps, among others. We simulate the distribution system during summer time (June to August) to build a dataset with measurements of the power consumed by each load and the outdoor temperature.

Forecast Models. We implement each forecaster in Keras [6] using NNs composed of five layers (three layers with 150 Long Short-Term Memory (LSTM) units [11] and two layers with 200 and 100 rectified linear units (ReLU), respectively). We train the NN using Adadelta as optimizer, which adapts the learning rate based on a moving window of gradient updates. We use as input data $X$ the last $H = 24$ measurements from 110 sensors (109 power sensors and 1 temperature sensor). We train the NNs to estimate the load during the next hour ($\tau = 1$), and we make the predictions every hour. In the experiments we use 80% of the samples to train the forecasters, 10% to determine the attack policy, and 10% to evaluate the impact of the attacks. Figure 1 shows an example of the prediction made with the forecaster. A minimal sketch of this architecture is given below.

Design of Attacks. We find the attack schedule solving Eq. (8) using the L-BFGS-B algorithm from [15]. We use the gradient of the forecaster (e.g., the expected gradient of the ensemble) and part of the samples (10%) to find the optimal attack schedule. In other words, the adversary uses the available information about the forecaster and the loads' behavior to design the attack. Moreover, we make Monte Carlo simulations to assess the impact of the strategy of each player. In particular, we train 20 forecasters reflecting the defense strategy $\rho^d$. Likewise, for each attack we choose randomly a forecaster and find an attack selecting randomly $m_a$ sensors (we repeat this random selection 20 times).
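The following Keras sketch mirrors the forecaster architecture described above (three 150-unit LSTM layers followed by 200- and 100-unit ReLU layers, MAE loss, Adadelta optimizer). It is an illustrative reconstruction rather than the authors' code; in particular, the exact input shape, the single-output layer, and the commented training call are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

H, m = 24, 110                      # window of past hours, number of sensors

model = keras.Sequential([
    layers.Input(shape=(H, m)),     # X(t): H past samples from each of the m sensors
    layers.LSTM(150, return_sequences=True),
    layers.LSTM(150, return_sequences=True),
    layers.LSTM(150),               # last LSTM layer returns a single vector
    layers.Dense(200, activation="relu"),
    layers.Dense(100, activation="relu"),
    layers.Dense(1),                # predicted total load y(t + tau)
])
model.compile(optimizer="adadelta", loss="mae")   # MAE is the loss of Eq. (1)

# Illustrative training call (X_train, y_train not defined here):
# model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.1)
```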

Fig. 1. Example of the load forecast (total load $y(t)$ and predicted load $\hat{y}(t)$, in MVA, over time $t$).

5.2 Sensitivity of Forecast Models

Loss Function. Figure 2 shows the expected attack's impact $\delta(B_a, t)$ on models trained with different loss functions, namely the MSE and the MAE. The experiment confirms that the NN trained with MSE suffers a higher impact with the attack. Moreover, the impact is approximately concave with respect to the number of sensors compromised $m_a$ for both MSE and MAE. In the remainder we use models trained with MAE.

Ensemble Training. Now we examine how to design the models of an ensemble guaranteeing that it has a lower impact than a single model (see Eq. (14)). In particular, we experiment with different targets of the models (the value that they learn). We construct four forecasters $\{f^j\}_{j=1}^{4}$ with different characteristics. The first forecaster $f^1$ estimates the total load $y$ using data from $m$ sensors (this is the nominal case). Each one of the remaining forecasters is an ensemble of two models, $f_1^j$ and $f_2^j$ for $j = 2, 3, 4$, trained using half of the sensors ($m_d = 0.5m$) to predict values $y_1^j$ and $y_2^j$, respectively. We build the second forecaster with models that estimate the total load; hence, $y_1^2 = y_2^2 = y$ and $f^2 = (f_1^2 + f_2^2)/2$. On the other hand, the models of the third forecaster estimate a fraction of the load, with $y_1^3 = y_2^3 = 0.5y$ and $f^3 = f_1^3 + f_2^3$. We define the last forecaster as $f^4 = f_1^4 + f_2^4$, where the models $f_1^4$ and $f_2^4$ estimate the total load of their own sensors, that is, $y_i^4 = \sum_{k \in M_i} l_k$.

Figure 3 shows that the model's target affects the sensitivity of the ensembles. In this case, the forecasters $f^2$ and $f^3$ suffer a larger impact than the original model $f^1$, while $f^4$ succeeds in reducing the impact of attacks (but has a larger prediction error (0.075) than the other forecasters).


Fig. 2. Attack's impact on models trained with MSE and MAE (average prediction error $\delta(B_a) = \hat{y} - \hat{y}_a$ versus the number of sensors compromised $m_a$). The model trained with MAE is more resilient to attacks.

Fig. 3. Impact of an attack on four different forecasters (average impact $\hat{y} - \hat{y}_a$ versus the number of sensors attacked $m_a$). The training's target $y_i$ affects the sensitivity of the ensemble.

For this reason, in the remainder of the text we train ensembles as $f^4$, dividing the prediction task among the ensemble's individual models.

5.3 Selection of the Best Ensemble

Now we test the robustness of ensembles using models with different values of $m_d$. Figure 4a shows that the ensemble's prediction error increases as we reduce the number of sensors used in each model $m_d$, but it does not increase significantly.

Fig. 4. Prediction error and impact of ensembles with different parameters $m_d$. (a) The ensemble has a larger prediction error, which increases as we reduce the number of sensors used in each model $m_d$. (b) The impact of the attack is approximately convex with respect to the number of sensors used by each model $m_d$.

This may happen because as $m_d$ decreases, the forecast tends to estimate the demand of fewer loads, giving the ensemble a greater detail of them.³ On the other hand, Fig. 4b shows that the impact increases with respect to $m_d$. Since both the prediction error and the attack's impact have concave shapes, the value of $m_d$ that minimizes the cost of the attack (see Eq. (9)) falls in one of the extremes, that is, $m_d \in \{1, m\}$. In our particular scenario, $m_d = 1$ attains the lowest cost of the attack (the ensemble that predicts each load individually).

Ensemble Size. Figure 5 shows the impact of attacks as a function of combined ensembles. We train the models selecting randomly $m_d = 0.5m$ models and consider random attacks on $m_a = 0.5m$ sensors. This experiment shows that the number of ensembles (or models) doesn't affect significantly the impact, confirming Remark 4.

6 Related Work

Previous works have analyzed the vulnerability of CPS against false data injection (FDI) attacks, which modify sensor measurements to manipulate the system's operation. The seminal work by Liu et al. [21] considers attacks on sensors that induce errors in the state estimation of power grids. Such errors can affect the system's operation, in particular, the electricity prices. An adversary that manipulates the electricity prices can profit and/or cause damage to the system. This attack requires historical data and real time measurements to calculate the attack.

³ Other forecast models make predictions using less information (e.g., the aggregate loads); hence, their accuracy decreases significantly with fewer loads [25].

Fig. 5. Combining multiple ensembles does not improve its accuracy nor its resiliency. (a) Average prediction error versus the number of ensembles. (b) Impact of an attack with $m_a = 0.5m$ versus the number of ensembles.

Other works have considered FDI attacks that modify information about the congestion patterns (the rate of the transmission lines) [14,29] and the topology of the power system [5]. The attacks can also target information that consumers use to make decisions, misleading them to take actions that benefit the adversary [2,3]. In most cases, the adversary calculates the attack based on the system's state.

Some works have recognized the vulnerabilities of cyber attacks on forecasting systems. For example, [19] analyzes attacks that manipulate load forecasts to distort the economic dispatch, which determines the production of each generator based on the estimations of future demand. However, the paper focuses on the consequences of the attack (e.g., how an adversary profits), rather than its precise implementation. Chen et al. [4] show how an adversary can manipulate the temperature measurements to increase or decrease load forecasts. In particular, the adversary does not need knowledge about the forecasting model (e.g., the precise structure of the NN) or the power system. This and other works (for example [8,28]) show that the adversary can estimate the system's model through either historical data or queries from the forecast system. Our work is closely related to [4]; however, our attack targets load, rather than temperature measurements. Nonetheless, our defense approach can be applied to the attack presented in [4].

The research of FDI attacks against NNs analyzes how an adversary can design adversarial examples to induce errors in the system's task (e.g., misclassify images) [27]. In particular, the attacks are transferable among models. In other words, two models trained independently (even with different data) to perform the same task can suffer from the same attacks [23]. Ilyas et al. [13] explain the existence of adversarial examples due to non-robust features (patterns in the data that are highly predictive). In other words, models may become sensitive to well-generalizing features of the data. Hence,


attacks that target such features, regardless of the model, can induce errors in the outcome (this also explains why we have transferability of the attacks).

Some papers design robust NNs in image classification applications by introducing randomness in the system. However, these approaches differ from ours in the way they introduce the uncertainties. For example, [18] proposes a robust NN borrowing ideas from differential privacy (DP). DP randomizes computations on databases such that a small change in the data set has a bounded change in the distribution over the outputs. This property guarantees that bounded changes in the input of a NN will induce bounded changes in the output, preventing the misclassification of images. On the other hand, [20] prevents gradient based attacks adding noise in the layers of the NN. In this way, a single NN acts as multiple models, which combined conform an ensemble of models. Also, [7] proposes stochastic activation pruning, which removes a subset of the activations (nodes) in each layer to protect pre-trained NNs against adversarial examples. This approach resembles dropout [26], but the selection is based on the magnitude of the activation. [7] also formulates the interaction between defender and attacker as a zero-sum game, but it does not present the equilibrium of the game.

Some literature on statistics considers the problem of designing robust predictors or estimators [17]. In this case, a robust predictor has a small sensitivity to outliers (e.g., random failures). In other words, it has the capacity to handle disturbances for a wide type of distributions. Thus, the design decisions focus on selecting the loss function, rather than manipulating the data. In general, nonlinear distance metrics are more sensitive to outliers, since large errors in individual samples have a larger impact. Hence, the MSE is more sensitive to outliers than the MAE [12]. Nonetheless, robust models have a cost in terms of efficiency, that is, they may have a larger variance (confidence interval).

7 Conclusions

In this work, we show that an adversarial generator can profit by inducing errors in the load forecasts of utilities. An adversary with knowledge of the forecast model and historical samples from the sensors can succeed in injecting a bias into the sensor measurements. We model the interaction between defender and attacker using game theory and find a defense strategy that can mitigate the attack's impact. In this case, building forecast models that use each sensor's measurements with a fixed probability reduces the number of compromised sensors. However, this strategy may fail if forecasters that use less information become more sensitive to attacks. Due to the large strategy space, we approximate the defense strategy with an ensemble of predictors. Besides its practical benefits, the ensemble allows us to divide the forecast task, improving the resiliency against attacks. In this case, the ensemble becomes more resilient as its models use fewer measurements, that is, as each model estimates fewer loads. In this way, with a careful selection of the training data we can incorporate uncertainties into regular NNs that help to mitigate the impact of attacks.
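As a rough illustration of this defense idea (an editorial sketch under simplifying assumptions, not the paper's implementation), one can partition the sensors and train each ensemble member on a different subset, so that compromising a few sensors only perturbs some members. The `train_forecaster` helper below is a hypothetical stand-in for any regression model (e.g., the NNs used in the paper); here it is just least squares on the selected columns.

```python
# Sketch of an ensemble of forecasters trained on sensor subsets (assumption:
# each member uses the union of all partition sets except one, as in Sect. 4).
import numpy as np

def train_forecaster(X, y):
    # Placeholder model: ordinary least squares on the selected sensor columns.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda X_new: X_new @ w

def build_ensemble(X, y, n_models, rng):
    m = X.shape[1]                                  # number of sensors
    parts = np.array_split(rng.permutation(m), n_models)   # disjoint sensor subsets
    members = []
    for i in range(n_models):
        keep = np.concatenate([p for j, p in enumerate(parts) if j != i])
        members.append((keep, train_forecaster(X[:, keep], y)))
    return members

def ensemble_forecast(members, X_new):
    return np.mean([model(X_new[:, keep]) for keep, model in members], axis=0)

# Toy usage with synthetic data: 120 samples from 12 sensors.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 12))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=120)
members = build_ensemble(X, y, n_models=3, rng=rng)
print(ensemble_forecast(members, X[:5]))
```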


Other protection schemes may complement the proposed approach. For example, regularization during training can also mitigate the attack's impact, because it makes the models less sensitive to deviations in the data.

A Appendix

Proof (Lemma 1). Let δ(B_a) and p^{DA} − p^{RT} be independent random variables; hence, we can approximate their expected values using a Monte Carlo integration with T terms, that is,

E[δ(B_a)] = (1/T) Σ_{t=1}^{T} δ(B_a, t),
E[p^{DA} − p^{RT}] = (1/T) Σ_{t=1}^{T} ( p^{DA}(t) − p^{RT}(t) ).

Now, since two independent random variables X and Y satisfy E[XY] = E[X]E[Y], we can approximate their expected product E[δ(B_a)(p^{DA} − p^{RT})] as

E[δ(B_a)(p^{DA} − p^{RT})] = ( (1/T) Σ_{t=1}^{T} δ(B_a, t) ) ( (1/T) Σ_{t=1}^{T} ( p^{DA}(t) − p^{RT}(t) ) ).

Thus, if either Σ_t ( p^{DA}(t) − p^{RT}(t) ) ≥ 0 and Σ_t δ(B_a(t)) ≤ 0, or Σ_t ( p^{DA}(t) − p^{RT}(t) ) ≤ 0 and Σ_t δ(B_a(t)) ≥ 0, then the attacker has positive profit (see Eq. (7)).

Proof (Proposition 1). Let us consider the following bounds on the difference between the expected impact and its approximation from Eq. (12):

\underline{ξ} ≤ δ(λ(ρ^d, ρ^a), t) − Π^a(ρ^d, ρ^a) ≤ \overline{ξ}.

Since Π^d(ρ^d, ρ^a) = −Π^a(ρ^d, ρ^a), the previous expression implies

Π^d(ρ^d, ρ^a) ≥ −δ(λ(ρ^d, ρ^a), t) + \underline{ξ}   (15)

and

−δ(λ(ρ̃^d, ρ^a), t) ≥ Π^d(ρ̃^d, ρ^a) − \overline{ξ}.   (16)

Moreover, the solution to Eq. (13), denoted (ρ^d, ρ^a), satisfies the following properties

δ(λ(ρ^d, ρ^a), t) ≥ δ(λ(ρ^d, ρ̃^a), t),
δ(λ(ρ^d, ρ^a), t) ≤ δ(λ(ρ̃^d, ρ^a), t),   (17)

for some strategies ρ̃^d and ρ̃^a. Thus, from Eqs. (15) and (17) we have

Π^d(ρ^d, ρ^a) ≥ −δ(λ(ρ^d, ρ^a), t) + \underline{ξ} ≥ −δ(λ(ρ̃^d, ρ^a), t) + \underline{ξ}.

Now, using the previous expression with Eq. (16) we obtain

Π^d(ρ^d, ρ^a) ≥ Π^d(ρ̃^d, ρ^a) − ξ,

where ξ = \overline{ξ} − \underline{ξ} ≥ 0. With a similar approach we can show that Π^a(ρ^d, ρ^a) ≥ Π^a(ρ^d, ρ̃^a) − ξ.


Proof (Proposition 2). Since δ(·) is increasing with respect to the number of sensors compromised, the following holds:

max_x δ(x) = δ( max_x x ).

The previous property can also be applied to minimization problems; hence, we can express the game's equilibrium of Eq. (12) as

min_{ρ^d} max_{ρ^a} δ(λ(ρ^d, ρ^a)) = δ( min_{ρ^d} max_{ρ^a} λ(ρ^d, ρ^a) ).

This means that the adversary designs its strategy to maximize the number of compromised sensors, while the defender pursues the opposite goal. The adversary's optimal strategy consists in attacking the sensors with the highest selection probability. Without loss of generality, let ρ^d_1 ≥ ρ^d_2 ≥ ... ≥ ρ^d_m. Then, the attack strategy ρ^a_i = 1 and ρ^a_j = 0 for 1 ≤ i ≤ m^a and j > m^a leads to the following expected number of compromised sensors:

λ(ρ^d, ρ^a) = Σ_{i=1}^{m^a} ρ^d_i.

Since a different attack strategy cannot increase the number of compromised sensors, this attack strategy is weakly dominant. Given the previous attack strategy, the defender's optimal strategy consists in selecting all the sensors with the same probability

ρ^d_k = m^d / m.

Observe that any deviation from this strategy increases the number of sensors compromised.

Proof (Proposition 3). Here we consider that the adversary compromises m^a sensors. Let σ_i be the proportion of resources allocated to the set P_i. According to Sect. 4, we create a partition of sensors {P_i}_{i=1}^{n}. First, let us consider ensembles trained with sensors in M_i = ∪_{j≠i} P_j, for i = 1, ..., n, where (n − 1)/n = m^d/m. Thus, the total number of compromised sensors used by the i-th model amounts to m^a Σ_{j≠i} σ_j = m^a (1 − σ_i). Due to the concavity of the impact function, the expected impact on the ensemble satisfies

(1/n) Σ_{i=1}^{n} δ( m^a (1 − σ_i) ) ≤ δ( (1/n) Σ_{i=1}^{n} m^a (1 − σ_i) ) = δ( m^a (n − 1)/n ) = δ( m^a m^d / m ).

Thus, the allocation that maximizes the impact attains the previous upper bound satisfying σ_i = 1/n, for all i = 1, ..., n. In other words, the adversary's best strategy consists in allocating its resources uniformly over the partition's sets.


Now, if M_i = P_i, for i = 1, ..., n, with n = m/m^d, then the expected impact on the ensemble becomes

(1/n) Σ_{i=1}^{n} δ( m^a σ_i ) ≤ δ( (1/n) Σ_{i=1}^{n} m^a σ_i ) = δ( m^a / n ) = δ( m^a m^d / m ).

In this case, the attack strategy that attains the upper bound satisfies σ_i = 1/n. Therefore, the adversary allocates its resources equally over all the sensors in the partition. In practice, the adversary can compromise at most m^d sensors from each partition. Hence, the optimal attack policy must satisfy σ_i = min{1/n, m^d/m^a}. When 1/n > m^d/m^a the adversary cannot implement its ideal strategy.
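As a quick numerical sanity check of the argument in Proposition 2 (an editorial illustration, not part of the paper): against an adversary who attacks the m^a sensors with the highest selection probabilities, no selection strategy can give those top sensors a smaller share of the total selection budget than the uniform strategy does.

```python
# Numerical check: the top-m_a share of any selection strategy is never below
# the uniform share m_a / m, so a uniform defense minimizes the adversary's
# best response for a fixed total selection budget m_d.
import numpy as np

m, m_d, m_a = 10, 4, 3          # sensors, defender's expected selections, attacked sensors
rng = np.random.default_rng(1)

for _ in range(1000):
    rho = rng.random(m)                                   # arbitrary selection probabilities
    top_share = np.sort(rho)[::-1][:m_a].sum() / rho.sum()
    assert top_share >= m_a / m - 1e-12                   # never below the uniform share

print("uniform defense: adversary compromises", m_a * m_d / m, "sensors in expectation")
```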

References

1. Alfeld, S., Zhu, X., Barford, P.: Data poisoning attacks against autoregressive models. In: Thirtieth AAAI Conference on Artificial Intelligence (2016)
2. Amini, S., Pasqualetti, F., Mohsenian-Rad, H.: Dynamic load altering attacks against power system stability: attack models and protection schemes. IEEE Trans. Smart Grid 9(4), 2862–2872 (2016)
3. Barreto, C., Cardenas, A.: Impact of the market infrastructure on the security of smart grids. IEEE Trans. Ind. Inform. 1 (2018)
4. Chen, Y., Tan, Y., Zhang, B.: Exploiting vulnerabilities of load forecasting through adversarial attacks. In: Proceedings of the Tenth ACM International Conference on Future Energy Systems, e-Energy 2019, pp. 1–11 (2019)
5. Choi, D.H., Xie, L.: Economic impact assessment of topology data attacks with virtual bids. IEEE Trans. Smart Grid 9(2), 512–520 (2016)
6. Chollet, F., et al.: Keras (2015). https://keras.io
7. Dhillon, G.S., et al.: Stochastic activation pruning for robust adversarial defense. arXiv preprint arXiv:1803.01442 (2018)
8. Esmalifalak, M., Nguyen, H., Zheng, R., Xie, L., Song, L., Han, Z.: A stealthy attack against electricity market using independent component analysis. IEEE Syst. J. 12(1), 297–307 (2015)
9. Fudenberg, D., Tirole, J.: Game Theory. The MIT Press, Cambridge (1991)
10. Hernandez, L., et al.: A survey on electric power demand forecasting: future trends in smart grids, microgrids and smart buildings. IEEE Commun. Surv. Tutor. 16(3), 1460–1495 (2014)
11. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
12. Hyndman, R.J., Koehler, A.B.: Another look at measures of forecast accuracy. Int. J. Forecast. 22(4), 679–688 (2006)
13. Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., Madry, A.: Adversarial examples are not bugs, they are features. arXiv preprint arXiv:1905.02175 (2019)
14. Jia, L., Thomas, R.J., Tong, L.: Malicious data attack on real-time electricity market. In: 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5952–5955 (2011)
15. Jones, E., Oliphant, T., Peterson, P., et al.: SciPy: open source scientific tools for Python (2001). http://www.scipy.org/


16. Kirschen, D.S., Strbac, G.: Fundamentals of Power System Economics. Wiley, Hoboken (2004)
17. Klebanov, L.B., Rachev, S.T., Fabozzi, F.J.: Robust and Non-robust Models in Statistics. Nova Science Publishers, Hauppauge (2009)
18. Lecuyer, M., Atlidakis, V., Geambasu, R., Hsu, D., Jana, S.: Certified robustness to adversarial examples with differential privacy. arXiv preprint arXiv:1802.03471 (2018)
19. Liu, C., Zhou, M., Wu, J., Long, C., Kundur, D.: Financially motivated FDI on SCED in real-time electricity markets: attacks and mitigation. IEEE Trans. Smart Grid 10(2), 1949–1959 (2019)
20. Liu, X., Cheng, M., Zhang, H., Hsieh, C.-J.: Towards robust neural networks via random self-ensemble. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11211, pp. 381–397. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01234-2_23
21. Liu, Y., Ning, P., Reiter, M.K.: False data injection attacks against state estimation in electric power grids. In: Proceedings of the 16th ACM Conference on Computer and Communications Security, CCS 2009, pp. 21–32 (2009)
22. Nudell, T.R., Annaswamy, A.M., Lian, J., Kalsi, K., D'Achiardi, D.: Electricity markets in the United States: a brief history, current operations, and trends. In: Stoustrup, J., Annaswamy, A., Chakrabortty, A., Qu, Z. (eds.) Smart Grid Control. PEPS, pp. 3–27. Springer, Cham (2019). https://doi.org/10.1007/978-3-319-98310-3_1
23. Papernot, N., McDaniel, P., Goodfellow, I.: Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277 (2016)
24. Schneider, K.P., Chen, Y., Chassin, D.P., Pratt, R.G., Engel, D.W., Thompson, S.E.: Modern grid initiative distribution taxonomy final report. Technical report, Pacific Northwest National Laboratory (2008)
25. Sevlian, R., Rajagopal, R.: A scaling law for short term load forecasting on varying levels of aggregation. Int. J. Electr. Power Energy Syst. 98, 350–361 (2018)
26. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
27. Szegedy, C., et al.: Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013)
28. Tan, S., Song, W.Z., Stewart, M., Yang, J., Tong, L.: Online data integrity attacks against real-time electrical market in smart grid. IEEE Trans. Smart Grid 9(1), 313–322 (2016)
29. Xie, L., Mo, Y., Sinopoli, B.: Integrity data attacks in power market operations. IEEE Trans. Smart Grid 2(4), 659–666 (2011)

Identifying Stealthy Attackers in a Game Theoretic Framework Using Deception

Anjon Basak1(B), Charles Kamhoua2, Sridhar Venkatesan3, Marcus Gutierrez1, Ahmed H. Anwar2, and Christopher Kiekintveld1

1 The University of Texas at El Paso, 500 W University Avenue, El Paso, TX 79968, USA
{abasak,mgutierrez22}@miners.utep.edu, [email protected]
2 Army Research Laboratory, 2800 Powder Mill Road, Adelphi, MD 20783, USA
[email protected], [email protected]
3 Perspecta Labs Inc., 150 Mount Airy Road, Basking Ridge, NJ 07920, USA
[email protected]

Abstract. A great deal of effort is devoted to detecting the presence of cyber attacks, so that defenders can respond to protect the network and mitigate the damage of the attack. Going beyond detection, identifying in as much detail as possible what specific type of attacker the defender is facing (e.g., what their goals, capabilities, and tactics are) can lead to even better defensive strategies and may be able to help with eventual attribution of attacks. However, attackers may wish to avoid both detection and identification, blending in or appearing to be a different type of attacker. We present a game-theoretic approach for optimizing defensive deception actions (e.g., honeypots) with the specific goal of identifying specific attackers as early as possible in an attack. We present case studies showing how this approach works, and initial simulation results from a general model that captures this problem.

1 Introduction

Cyber attackers pose a serious threat to economies, national defense, critical infrastructure, and financial sectors [3,5,11]. Early detection and identification of a cyber attacker can help a defender to make better decisions to mitigate the attack. However, identifying the characteristics of an attacker is challenging, as attackers may employ many defense evasion techniques [7]. Adversaries may also mask their operations by leveraging white-listed tools and protocols [22,23]. There are many existing intrusion detection methods to detect attacks [15,23]. We focus here on using honeypots (HPs) for detection and identification, though our models could be extended to other types of defensive actions. HPs are systems that are designed to attract adversaries [14,21] and to monitor [15] attacker activity so that the attack can be analyzed (usually manually by experts). There is work on automating the analysis process [18], using host-based and network-based data for correlation to identify a pattern. Deductive reasoning can be used to draw conclusions about the attacker [19].


Most current approaches focus on detection during an ongoing attack, possibly with some effort to categorize different types of detections. More detailed identification is done during later forensic analysis. Here we focus on formally modeling what types of actions the defender can take to support more detailed attacker identification early in the attack chain, allowing more information to target specific defensive responses. This can be challenging; for example, many different attackers may use the same core malware toolkits and common tactics [1,3–5]. Attackers may intentionally try to look similar to other actors, and may even change during the course of an attack (e.g., when compromised resources are sold to other groups), leading to a complete change in focus [2,6,11]. We focus our model on detecting an attacker type early. An attacker type is defined by an Attack Graph (AG) specific to that attacker that describes his possible actions and goals in planning an attack campaign. An AG represents the set of attack paths that an attacker can take to achieve their goal in the target network [12]. Depending on the observed network and the exploits they have available, the attacker tries to choose the optimal sequence of attack actions to achieve their particular goal. Identification of different attacker types in our model corresponds to identifying which of a set of possible attack graphs this particular attacker is using in a particular interaction. The defender chooses deception actions that try to force the attackers into choices that will reveal which type they are early on, which can then inform later defenses. We propose a multi-stage game theoretic model to strategically deploy honeypots to force the attacker to reveal his type early. We present case studies to show how this model works for realistic scenarios, and to demonstrate how attackers can be identified using deception. Finally, we present a general model, a basic solution algorithm, and simulation results that show that if the defender uses pro-active deception the attacker type can be detected earlier than if he only observes the activity of an attacker.

Fig. 1. An attack graph.

2 Background

AGs represent sequential attacks by an attacker to compromise a network or a particular computer [12]. AGs can be automatically generated using a database of known vulnerabilities [13,17]. Due to resource limitations, the automatically generated AGs are often used to identify high-priority vulnerabilities to fix [16,20]. We use AGs as a library of attack plans that can represent different attackers. The optimal plan for any attacker will depend on his particular options and goals, reflected in the AG. The AG for each attacker will also change depending on the network, including changes made by the defender (e.g., introducing honeypots).


We model a multi-stage Stackelberg Security Game (SSG) with a leader and a follower. The defender commits to a strategy considering the attacker's strategy. The attacker observes the strategy of the leader and chooses an optimal attack strategy using the AG. The AGs of the attackers we consider are defined by initial access and lateral movement actions [8] that lead to the corresponding goal node g_i, as shown in Fig. 1. Initial access represents the vectors an attacker uses to gain an initial foothold in the network. We consider the technique called Exploit Public-Facing Application [9] for initial access, where an attacker uses tools to exploit the weaknesses of public-facing systems. Lateral movement is a tactic to achieve greater control over network assets. We consider the example of the Exploitation of Remote Services technique for lateral movement [10], where an attacker exploits the vulnerabilities of a program. While there may be similarities in the AGs for different attackers, having different goals and options available means that the plans may eventually diverge. The overlap between multiple attack plans is the number of actions that are identical at the start of the plans.
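To make the representation concrete, the following sketch (a hypothetical toy network, not the case-study topology of Fig. 2) encodes an attack graph as exploit-labelled edges, computes an attacker's minimum-cost plan to its goal, and measures the overlap of two plans as the number of identical actions at the start.

```python
# Toy attack-graph representation: edges labelled with the exploit they need.
import heapq

# edge: (source, target, required_exploit, cost) -- hypothetical values
EDGES = [("H1", "R1", "phi1", 2.0), ("H1", "R2", "phi2", 1.5),
         ("R1", "DB", "phi5", 3.0), ("R2", "DB", "phi6", 2.0),
         ("R2", "W",  "phi7", 2.5)]

def min_cost_plan(entry, goal, exploits):
    """Dijkstra over the edges the attacker can actually use."""
    frontier, settled = [(0.0, entry, [])], {}
    while frontier:
        cost, node, plan = heapq.heappop(frontier)
        if node == goal:
            return cost, plan
        if settled.get(node, float("inf")) <= cost:
            continue
        settled[node] = cost
        for src, dst, exp, c in EDGES:
            if src == node and exp in exploits:
                heapq.heappush(frontier, (cost + c, dst, plan + [(src, dst, exp)]))
    return float("inf"), None

def overlap(plan_a, plan_b):
    """Number of identical actions at the start of the two plans."""
    n = 0
    for a, b in zip(plan_a, plan_b):
        if a != b:
            break
        n += 1
    return n

cost1, plan1 = min_cost_plan("H1", "DB", {"phi1", "phi2", "phi5", "phi6"})
cost2, plan2 = min_cost_plan("H1", "W",  {"phi1", "phi2", "phi6", "phi7"})
print(cost1, plan1)
print(cost2, plan2)
print("overlap:", overlap(plan1, plan2))
```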

3 Case Studies

We present three case studies that consider different types of attackers. We look at different pairs of attackers based on what exploits are shared between them and whether their final objective is the same or not. We use the network shown in Fig. 2, where R_i is a router, H_i is a host, F_i is a firewall, and S is a switch. An exploit φ_i(c) with cost c on an edge allows an attacker a_i to move laterally if the attacker has exploit φ_i(c). The cost of using an exploit represents both the time and effort and the risk of detection, which attackers want to minimize. An attacker tries to reach a goal by making lateral movements using an attack plan with the minimum cost. If there is more than one minimum-cost plan, attackers choose the one that maximizes the overlap with other attackers. We assume that when an attacker reaches his goal the game ends. He also has complete knowledge (e.g., vulnerabilities) about the network, but does not know which nodes are honeypots. For each case study we first analyze the attack plans based on the AGs. Then we analyze what proactive deceptive action a defender can take to detect the attacker type earlier. We assume that the attacks are fully observable. Since we are interested in scenarios where attackers have common attack plans in their AGs, we assume that host H1 is where all attackers initially enter the network.

Fig. 2. Network for case study.

3.1 Case Study 1: Attackers with Same Exploits but Different Goals

We consider two attackers a1 and a2 with the set of exploits φ1, φ2, φ5, φ6. The goal nodes for the attackers are g(a1) = DB and g(a2) = W. The attack plans are shown in Fig. 3a. The defender cannot distinguish which attacker he is facing until the attacker reaches his goal node. Here, the defender can use a decoy node of either the DB or the W to reduce the overlap in the attack plans of a1 and a2. Since both attackers have the same set of exploits, the defender cannot use a decoy node with vulnerabilities to separate them. The attackers have different goals, however, and the defender can take advantage of that situation.

Fig. 3. Case study 1. Attack plans of the attackers a1 and a2 (a) before and (b) after deception.

Figure 3b shows the use of a decoy DB (with a dotted line) between H1 and R2, with unique exploits on the edges to force only the targeted attacker to go through the decoy. Since other attackers will not use that plan, this creates a unique attack plan that can be identified. In Fig. 3b, we notice that if the acting attacker is a1, then he will go for the decoy DB. Attacker a1 does not take the longer path to compromise the DB because he chooses the attack plan with minimum cost. To maximize the overlap of the attack plans, attacker a2 will choose the plan through the decoy DB instead of R1, even though the two plans cost the same. In the next case studies, the defender avoids using a goal node as a decoy, to reduce the cost associated with the decoy server.

3.2 Case Study 2: Attackers with Shared Exploits and Different Goals

Now we consider attacker a1 with exploits (φ1, φ2, φ5, φ6) and g(a1) = DB. Attacker a3 has exploits φ1, φ2, φ6, φ7 and g(a3) = W. The attackers a1 and a3 compute attack plans as shown in Fig. 4a. If the defender just observes, he will be able to detect the acting attacker after R2. Since the attackers have both shared and unique exploits in their possession, this creates a non-overlapping


path in the middle of their attack sequences, which allows the defender to distinguish between the attackers' attack plans.

Fig. 4. Case study 2. Attack plans of the attackers a1 and a3 (a) before and (b) after deception.

However, if the defender uses a honeypot HP as shown in Fig. 4b, the attack sequences of the attackers change. Attacker a1 cannot go through the HP due to his lack of exploit φ7(2.5), which causes the attackers to diverge at an earlier stage, facilitating identification by the defender.

Fig. 5. Case study 3. Attack plans of the attackers a4 and a5 (a) before and (b) after deception.

3.3 Case Study 3: Attackers with Shared Exploits but Same Goals

Now we consider attacker a4 with exploits (φ1, φ2, φ5, φ6, φ7) and g(a4) = W, and attacker a5 with exploits (φ1, φ4, φ7, φ8) and g(a5) = W. The attack plans are shown in Fig. 5a. Attackers a4 and a5 execute the exact same attack plan. Even though attacker a4 could go directly to F3 from R2, he chooses the plan through R3 to minimize


the chance of getting his type detected. Here, the defender has zero probability of differentiating the attackers. The defender cannot use a honeypot at the initial stage as in case study 1 or 2, since it would not make any difference in the attackers' plans. However, if the defender deploys a honeypot HP as shown in Fig. 5b, the attackers choose the plans with the minimum costs, which leads to an earlier identification for the defender.

4 Game Model

We now present a general game-theoretic framework to model the interactions between the defender and the attacker. There is a network with nodes t_i ∈ T, similar to the enterprise network shown in Fig. 2. Each node t_i ∈ T has a value v_{t_i} and a cost c_{t_i}. Some nodes are public nodes which can be used to access the network. There are goal nodes g_j ⊂ T (e.g., DB), which are high-valued nodes. Each node t_i ∈ T has some vulnerabilities that can be exploited using φ_{t_i} ∈ Φ on the edges. A node t_i can be compromised from a node t_j using the exploit φ_{t_i} if there is an edge from t_j to node t_i, if node t_j allows the use of the exploit φ_{t_i}, and if the attacker has any of the φ_{t_i} exploits. The game has N attacker types a_i and one defender d. The attackers have complete knowledge about the network, have single deterministic policies π(a_i) to reach their goals, and have some exploits φ_{a_i} ∈ Φ. However, an attacker does not know whether a node is a honeypot or not. The defender can deploy deception by allocating honeypots h ∈ H in the network. The configurations of the honeypots are chosen randomly from real nodes T in the network. The game starts by choosing an attacker type a_i randomly. In round r, the defender knows the history up to round r − 1, which is used to update the defender's belief about the attacker type. Next, the defender allocates k honeypots to force the attacker to reveal his type earlier. The defense actions may also thwart the attacker's plan of reaching the goal, if the attacker moves laterally to a honeypot or if the attacker does not find any plan to move laterally. The network change is visible to the attacker in the current round. This can be justified because an APT usually performs reconnaissance again and monitors defender activity after making a lateral movement. The attacker recomputes his attack plan using his AG and chooses an optimal attack plan with the minimum cost from his current position to reach his goal. If there are multiple plans with the same cost, we break the tie in favor of the attacker, where the attackers have the maximum common overlapping length in their attack plans. The game ends if the attacker gets caught by the deception set up by the defender or if he reaches his goal node.
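The following is a minimal, self-contained sketch of one play of this game (toy data structures and a simplified defender rule, not the paper's implementation): the defender adds decoy edges each round, the attacker re-plans a minimum-cost route on the visible network and takes one lateral move, and the game ends on capture, goal, deterrence, or timeout.

```python
# Schematic round loop of the game in Sect. 4 (toy network, assumptions noted).
def cheapest_plan(edges, node, goal, exploits, visited=()):
    """Brute-force minimum-cost path search over simple paths (toy-sized graphs)."""
    if node == goal:
        return (0.0, [])
    best = None
    for (src, dst), (exp, cost) in edges.items():
        if src == node and exp in exploits and dst not in visited:
            sub = cheapest_plan(edges, dst, goal, exploits, visited + (node,))
            if sub is not None and (best is None or cost + sub[0] < best[0]):
                best = (cost + sub[0], [dst] + sub[1])
    return best

def play(edges, honeypot_edges, attacker_exploits, goal, rounds=5, k=2):
    position, history = "H1", []
    for _ in range(rounds):
        # Defender move: deploy up to k decoy edges (a simple stand-in for the
        # belief-based deployment rules of Sect. 5).
        for _ in range(min(k, len(honeypot_edges))):
            edges.update([honeypot_edges.pop(0)])
        # Attacker move: re-plan on the modified network and take one step.
        plan = cheapest_plan(edges, position, goal, attacker_exploits)
        if plan is None:
            return "deterred", history          # no feasible plan remains
        position = plan[1][0]
        history.append(position)
        if position.startswith("HP"):
            return "caught", history            # lured into a honeypot
        if position == goal:
            return "reached goal", history
    return "timeout", history

edges = {("H1", "R1"): ("phi1", 2.0), ("R1", "W"): ("phi7", 3.0)}
decoys = [(("H1", "HP1"), ("phi7", 1.0)), (("HP1", "W"), ("phi7", 1.0))]
print(play(edges, decoys, {"phi1", "phi7"}, goal="W"))   # -> ('caught', ['HP1'])
```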

5 Defender Decision Making

In each round of the game the defender updates his beliefs about the attacker types and the attackers’ goals. According to Bayes’ theorem, given the sequence of lateral movement seq(t) up to node t from the starting node tp , the probability that the defender is facing attacker ai is:


p(a_i | seq(t)) = p(seq(t) | a_i) p(a_i) / Σ_{j=0}^{N} p(seq(t) | a_j) p(a_j)

where p(a_i) is the prior probability of facing the attacker a_i and p(seq(t) | a_i) is the likelihood of the observation seq(t) given that we are facing the attacker a_i. Similarly, the belief about the goal can be computed. The probability that the plan of the attacker is P_g from the start node t_p is:

p(P_g | seq(t), a_i) = p(seq(t) | P_g, a_i) p(P_g, a_i) / Σ_{∀g ∈ G} p(seq(t) | P_g, a_i) p(P_g, a_i)
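A small illustration of the belief update (toy numbers, not from the paper): the posterior over attacker types given the likelihood of the observed move sequence under each type.

```python
# Posterior over attacker types after an observed sequence of lateral moves.
def posterior(prior, likelihood):
    """prior: {type: p(a_i)}; likelihood: {type: p(seq | a_i)} for the observed seq."""
    joint = {a: likelihood[a] * prior[a] for a in prior}
    z = sum(joint.values())
    return {a: (p / z if z > 0 else prior[a]) for a, p in joint.items()}

prior = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
# e.g. the observed moves are consistent with the optimal plans of a and b, but not c:
likelihood = {"a": 1.0, "b": 1.0, "c": 0.0}
print(posterior(prior, likelihood))   # {'a': 0.5, 'b': 0.5, 'c': 0.0}
```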

Next, the defender considers all the possible deception deployments c ∈ C, where there are edges t_m → t_n from the attacker's last observed position t_{lp} such that t_m can be reached from node t_{lp}. Deceptions are deployed between two nodes without affecting the existing connections of the network. The defender has a library of AGs, one for each of the attackers, which he can use to optimize the decision making. We consider three possible objectives the defender can use to make this decision. In Minimizing Maximum Overlapping Length, the defender chooses his deception deployment by minimizing the sum of the attackers' overlapping actions. Another variation would be to minimize an attacker's maximum overlapping length with the other attackers, by considering each of the attackers. Minimizing the maximum overlapping length of attack plans may not always focus on all the attackers' attack plans, e.g., if all the attackers have high overlap (of attack plans) with each other except the acting attacker. To overcome this issue the defender can compute the expected overlapping length of the attack plans: Minimizing Expected Overlapping Length. According to information theory, one way to reduce the anonymity between the attacker types is to deploy deception in such a way that it minimizes the entropy. If X_1 = p(a_0), X_2 = p(a_1) and X_3 = p(a_2) are three random variables for the attacker types with X_1 + X_2 + X_3 = 1, then the entropy can be written as H(X) = −Σ_i p(a_i) log_b p(a_i), where p(a_i) is the posterior probability for the attacker a_i. In Minimizing Entropy, the defender chooses the deception deployment which results in the minimum entropy over all the attackers A.
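The entropy rule can be sketched as follows (hypothetical helper names): among the candidate deployments, pick the one whose induced posterior over attacker types has the smallest entropy. Here `predicted_posterior` is an assumed function that simulates the attackers' re-planning under deployment c and returns the defender's posterior after the next observed move.

```python
# Minimizing-entropy deployment selection (toy beliefs, assumed helper).
import math

def entropy(dist, base=2):
    return -sum(p * math.log(p, base) for p in dist.values() if p > 0)

def min_entropy_deployment(candidates, predicted_posterior):
    return min(candidates, key=lambda c: entropy(predicted_posterior(c)))

# Toy usage: deployment "c2" separates the attacker types better than "c1".
beliefs = {"c1": {"a": 0.5, "b": 0.5, "c": 0.0},
           "c2": {"a": 0.9, "b": 0.1, "c": 0.0}}
print(min_entropy_deployment(["c1", "c2"], beliefs.get))   # -> "c2"
```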

6 Attacker Decision Making

Now we present the mixed integer program (MIP) for an attacker, where he chooses the minimum cost plan to reach his goal. Ties are broken by choosing a plan which maximizes the sum of the common overlapping lengths of the attack plans.

max Σ_{i,j} d_{ij}   (1)

0 ≤ d_{ij} − Σ_{m=0}^{M} Σ_{e} d_{ijme} ≤ 0   ∀ i, j   (2)

d_{ijme} ≤ e_{iem}  AND  d_{ijme} ≤ e_{jem}   ∀ i, j, m, e   (3)

d_{ijme} ≤ d_{ij(m−1)e}   ∀ i, j, e, m = 1, 2, ..., M   (4)

Σ_{e,m} e_{iem} c_e ≤ C_i   ∀ i   (5)

Σ_m e_{iem} − Σ_m e_{ie′m} = { 1 if s ∈ e; −1 if g ∈ e; 0 otherwise }   ∀ i, ∀ t   (6)

Σ_m e_{ie(m−1)} − Σ_m e_{ie′m} = 0   ∀ i, ∀ t, t ≠ s, t ≠ g   (7)

Equation 1 is the objective function, where the attacker computes the maximum sum of overlapping lengths of attack plans among all the attackers. Constraint 2 assigns the sum of the overlapping length between attackers i, j up to move m to d_{ij}. In d_{ijme}, e is the edge identifier. Constraint 3 computes the overlapping length between the attacker plans, where e_{iem} is a binary variable for attacker i representing the use of edge e (subscript) at the m-th move. Constraint 4 makes sure that the overlap starts from the beginning of the plans and not in the middle, in case two plans start out differently but merge somewhere in the middle. Constraint 5 ensures that each attacker chooses a minimum cost plan to reach the goal node. Constraints 6 and 7 are path flow constraints for the attackers.
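The decision rule encoded by the MIP can also be illustrated with a brute-force sketch (hypothetical inputs, not a solver for constraints (1)-(7)): each attacker keeps only its minimum-cost plans, and the tie is broken in favour of the plan profile with the largest total pairwise overlap at the start of the plans, as in the objective of Eq. (1).

```python
# Tie-breaking among minimum-cost plans by maximizing total prefix overlap.
from itertools import product

def prefix_overlap(p, q):
    n = 0
    for a, b in zip(p, q):
        if a != b:
            break
        n += 1
    return n

def break_ties(min_cost_plans_per_attacker):
    """Input: one list per attacker of its equally cheap plans (tuples of edges)."""
    best_profile, best_score = None, -1
    for profile in product(*min_cost_plans_per_attacker):
        score = sum(prefix_overlap(p, q)
                    for i, p in enumerate(profile)
                    for q in profile[i + 1:])
        if score > best_score:
            best_profile, best_score = profile, score
    return best_profile

# Toy example: attacker 1 has two equally cheap plans, attacker 2 has one.
plans_1 = [(("H1", "R1"), ("R1", "DB")), (("H1", "R2"), ("R2", "DB"))]
plans_2 = [(("H1", "R2"), ("R2", "W"))]
print(break_ties([plans_1, plans_2]))   # picks the plan through R2 for attacker 1
```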

7 Simulation Results

We want to show that using proactive deception a defender can reveal the attacker type earlier than otherwise. We define an early identification as one where the defender is able to use proactive deception to determine the attacker type in an earlier round compared to when the defender just observes. Vulnerabilities in the nodes are (indirectly) represented by exploits on the edge between two nodes. We randomly generated 20 networks, somewhat similar to Fig. 2, with 18 nodes (including one public node) with values and costs chosen randomly from the range [0, 10]. We considered exploits φ0, φ1, φ2, φ3, φ4, φ5 with costs chosen randomly from the range [0, 10]. Next, we assign exploits to the edges in such a way that the attackers have unique attack plans to their goals except for the starting node. Depending on the edge density (number of edges from a node) and shared vulnerability parameters, we randomly connect edges with exploits between nodes where the two nodes are in different attack plans of different attackers. We used six honeypots, and their vulnerabilities are picked from randomly chosen nodes existing in the network so that the honeypots can act as decoys. The games are limited to five rounds. In each round r the defender d deploys 0 ≤ k ≤ 2 decoys. In the attack plan library, we considered three attacker types, a, b, c, with different goals. In the first experiment, we show how the edge density and the shared vulnerabilities between nodes affect how early a defender can identify the attacker type he is facing. Each of the three attackers a, b, c has some unique and some shared exploits in its possession. Attacker a has exploits


Fig. 6. Comparison between just observation (first row) and use of pro-active deception (second row): (a) edge density and shared vulnerabilities 20%, (b) 40%, (c) 80%; (d) edge density and shared vulnerabilities 80%, (e) 20%, (f) 40%. The shared exploits are fixed to 40% between attackers.

φ0, φ1, φ2. Attacker b has φ2, φ3, φ1. Attacker c has exploits φ4, φ5, φ2. We picked attacker b as the acting attacker. Figure 6 shows the results. In the first row, in Fig. 6a, b and c, the defender just observes the attacker actions. As the density and shared vulnerabilities increase, it takes more rounds for the defender to identify the attacker type b. In the second row, in Fig. 6e and f, the defender deploys honeypots. If we compare the figures of the same edge density and shared vulnerabilities from the two rows, it is easy to notice that the use of deception facilitates early identification, except in Fig. 6d where it was the same. However, an increase in edge density and shared vulnerabilities between nodes harms performance. The second experiment is the same as the first, except that we kept the edge density and shared vulnerabilities between nodes fixed to 40% and we varied the shared exploits between the attackers. We chose attacker c as the acting attacker. We can observe in Fig. 7a, b and c that as we increase the sharing of exploits between the attackers, it takes longer for the defender to identify the attacker type c. When all the attackers have the same exploits, the defender was unable to identify the attacker type even at round 4 without using any deception. However, in the second row, in Fig. 7d, e and f, as the defender strategically uses deception, identification of attacker c happens earlier. The performance of the early identification decreases as the shared exploits between the attackers increase. Another observation is noticeable in Fig. 7d: the defender was not able to identify the attacker type; however, the attacker did not find any policy to continue its attack. This shows that the use of strategic deception can also act as a deterrent for the attackers.


Fig. 7. Comparison between just observation (first row) and use of pro-active deception (second row) as we increase the shared exploits between the attackers: (a)/(d) unique exploits for the attackers, (b)/(e) 40% shared exploits between attackers, (c)/(f) all attackers have the same set of exploits. The edge density and shared vulnerabilities between nodes are fixed to 40%.

For our last experiment, we compared different techniques a defender can use to facilitate early identification: minimizing maximum overlap (min-max-overlap), minimizing maximum expected overlap (min-max-exp-overlap) and minimizing entropy (min-entropy). Data are averaged over all the attacker types we considered. Figure 8a and b show on average how many rounds it took for the defender to identify the attackers using the different techniques. In both figures, we can see that using deception facilitates earlier identification. We did not notice any particular difference between the techniques except when all the attackers have the same exploits, in which case min-max-overlap performed better. From all the experiments it is clear that the use of deception will speed up the identification of an attacker, which is very important in cybersecurity scenarios, as different real-world

Fig. 8. Comparison between different techniques used by the defender.


attackers, e.g., APTs, can lay low for a long time, and detecting the attacker early can facilitate informed decision making for the defender.

8 Conclusions and Future Direction

Detecting and identifying attackers is one of the central problems in cybersecurity, and defensive deception methods such as honeypots have a key role to play, especially against sophisticated adversaries. Identification is an even harder problem in many ways than detection, especially when many attackers use similar tools and tactics in the early stages of attacks. However, any information that can help to narrow down the goals and likely tactics of an attacker can also be of immense value to the defender, especially if it is available early on. We present several case studies and a formal game model showing how we can use deception techniques to identify different types of attackers represented by the different attack graphs they use in planning optimal attacks based on their individual goals and capabilities. We show that strategically using deception can facilitate significantly earlier identification by leading attackers to take different actions early in the attack that can be observed by the defender. Our simulation results show this in a more general setting. In future work we plan to explore more specifically how this type of information can be used to respond dynamically to specific attackers during the later stages of an attack. We also plan to investigate how this model can be extended to different types of deception strategies, integration with other IDS techniques, as well as larger and more diverse sets of possible attacker types. Acknowledgment. This research was sponsored by the U.S. Army Combat Capabilities Development Command Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-13-2-0045 (ARL Cyber Security CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Combat Capabilities Development Command Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.

References

1. APT33. https://www.fireeye.com/blog/threat-research/2017/09/apt33-insightsinto-iranian-cyber-espionage.html
2. APT34. https://www.fireeye.com/blog/threat-research/2017/12/targeted-attackin-middle-east-by-apt34.html
3. APT37 (REAPER). https://www2.fireeye.com/rs/848-DID-242/images/rpt APT37.pdf. Accessed 03 June 2019
4. APT38. https://www.fireeye.com/blog/threat-research/2018/10/apt38-detailson-new-north-korean-regime-backed-threat-group.html
5. APT38, Un-usual Suspects. https://content.fireeye.com/apt/rpt-apt38
6. APT40. https://www.fireeye.com/blog/threat-research/2019/03/apt40-examininga-china-nexus-espionage-actor.html. Accessed 03 June 2019
7. Defense Evasion. https://attack.mitre.org/tactics/TA0005/
8. Enterprise Tactics. https://attack.mitre.org/tactics/enterprise/
9. Initial Access. https://attack.mitre.org/tactics/TA0001/
10. Lateral Movement. https://attack.mitre.org/tactics/TA0008/
11. M-Trends 2019. https://content.fireeye.com/m-trends. Accessed 04 June 2019
12. Durkota, K., Lisý, V., Bošanský, B., Kiekintveld, C.: Optimal network security hardening using attack graph games. In: Twenty-Fourth International Joint Conference on Artificial Intelligence (2015)
13. Ingols, K., Lippmann, R., Piwowarski, K.: Practical attack graph generation for network defense. In: 2006 22nd Annual Computer Security Applications Conference (ACSAC 2006), pp. 121–130. IEEE (2006)
14. Kreibich, C., Crowcroft, J.: Honeycomb: creating intrusion detection signatures using honeypots. ACM SIGCOMM Comput. Commun. Rev. 34(1), 51–56 (2004)
15. Nicholson, A., Watson, T., Norris, P., Duffy, A., Isbell, R.: A taxonomy of technical attribution techniques for cyber attacks. In: European Conference on Information Warfare and Security, p. 188. Academic Conferences International Limited (2012)
16. Noel, S., Jajodia, S.: Optimal IDS sensor placement and alert prioritization using attack graphs. J. Netw. Syst. Manag. 16(3), 259–275 (2008)
17. Ou, X., Boyer, W.F., McQueen, M.A.: A scalable approach to attack graph generation. In: Proceedings of the 13th ACM Conference on Computer and Communications Security, pp. 336–345. ACM (2006)
18. Raynal, F., Berthier, Y., Biondi, P., Kaminsky, D.: Honeypot forensics. In: Proceedings from the Fifth Annual IEEE SMC Information Assurance Workshop, pp. 22–29. IEEE (2004)
19. Raynal, F., Berthier, Y., Biondi, P., Kaminsky, D.: Honeypot forensics, part II: analyzing the compromised host. IEEE Secur. Priv. 2(5), 77–80 (2004)
20. Sheyner, O., Haines, J., Jha, S., Lippmann, R., Wing, J.M.: Automated generation and analysis of attack graphs. In: Proceedings 2002 IEEE Symposium on Security and Privacy, pp. 273–284. IEEE (2002)
21. Spitzner, L.: Honeypots: Tracking Hackers, vol. 1. Addison-Wesley Reading (2003)
22. Tsagourias, N.: Cyber attacks, self-defence and the problem of attribution. J. Conflict Secur. Law 17(2), 229–244 (2012)
23. Wheeler, D.A., Larsen, G.N.: Techniques for cyber attack attribution. Technical report, Institute for Defense Analyses (2003)

Choosing Protection: User Investments in Security Measures for Cyber Risk Management

Yoav Ben Yaakov1, Xinrun Wang2, Joachim Meyer1(B), and Bo An2

1 Tel Aviv University, 69978 Tel Aviv, Israel
[email protected], [email protected]
2 Nanyang Technological University, Singapore, Singapore
{xwang033,boan}@ntu.edu.sg

Abstract. Firewalls, Intrusion Detection Systems (IDS), and cyberinsurance are widely used to protect against cyber-attacks and their consequences. The optimal investment in each of these security measures depends on the likelihood of threats and the severity of the damage they cause, on the user’s ability to distinguish between malicious and nonmalicious content, and on the properties of the different security measures and their costs. We present a model of the optimal investment in the security measures, given that the effectiveness of each measure depends partly on the performance of the others. We also conducted an online experiment in which participants classified events as malicious or non-malicious, based on the value of an observed variable. They could protect themselves by investing in a firewall, an IDS or insurance. Four experimental conditions differed in the optimal investment in the different measures. Participants tended to invest preferably in the IDS, irrespective of the benefits from this investment. They were able to identify the firewall and insurance conditions in which investments were beneficial, but they did not invest optimally in these measures. The results imply that users’ intuitive decisions to invest resources in risk management measures are likely to be non-optimal. It is important to develop methods to help users in their decisions.

Keywords: Decision making · Cyber insurance · Cybersecurity

1 Introduction

1.1 Cybersecurity

Cybersecurity has become one of the major challenges for modern society [1], as the frequency and variety of cyber-attacks are steadily increasing [19], and the damage caused by cybercrime continues to rise [3]. The growing threats lead to a corresponding growth in the development of cybersecurity defense tools. The investment in these tools aims to diminish the risk of losses to the point where


the marginal cost of implementing security is equal to the additional reduction of the costs caused by security incidents [8]. Organizations need to decide how to allocate resources to different risk mitigation measures. Such decisions are based on the evaluation of the efficiency of the different security measures by decision makers in organizations. Investment decisions regarding different security measures are complicated by the fact that the consequences of implementing one measure may affect the efficiency of others. Also, securing the cyberspace is not only a technical issue [12]. Rather, it is strongly affected by the behaviors, perceptions and decision making of the people who use these systems [7].

1.2 User Decision Making

To design proper security systems, designers must understand how users make decisions regarding security [20]. Actual human decisions regarding risk taking do not always correspond with the optimal decisions prescribed by decision theory [20]. These deviations can result from users' evaluations of the tradeoff between the effectiveness of the security and the usability of the system [15]. Some users may be willing to accept high False Alarm rates of an alerting system, if they think that the expected damage from an undetected malicious event is sufficiently larger than the expected cost of taking unnecessary actions after False Alarms. Others may prefer low False Alarm rates, if alerts are seen as too disruptive, even at the cost of reducing the likelihood of detecting an attack [2]. Also, decisions may not be optimal because abstract outcomes tend to have less impact on decisions than concrete ones [4]. In security, the pro-security alternative (invest in a security mechanism) usually has the invisible outcome of protecting from attacks [20]. This benefit is often intangible for users, making it difficult to evaluate the gains, compared to the costs. The difficulty of assessing the benefits gained from investing in security is particularly large when users do not know the exact level of risk they are facing, or when they believe the risk is smaller than it actually is [20]. Risk taking is a complex combination of behaviors. It depends on the person's individual characteristics, on the available security mechanisms and information, and on the nature of threats [2]. For instance, people who feel better protected are likely to engage in more risky behavior [20]. A person may take greater risks, knowing that a warning system is installed and no warning has been issued [13].

1.3 Defending Against Cyber-Threats

To lower risks, people can either reduce the likelihood of a threat or reduce the severity of the damage, caused by a successful attack. Firewalls are a widely used mechanism to lower the likelihood of threats. They are barriers between the secured and controlled internal networks and the untrusted outside networks [9]. The quality of a firewall system is measured by its ability to stop malicious events from entering the system, while not interfering with non-malicious events [8].


Cyber-insurance limits the damage caused by a cyber-attack, once it has occurred. It does not protect from attacks, but it helps organizations or individuals reduce their risks by sharing the costs associated with recovery after a successful cyber-attack [10]. One can also reduce risks by providing users with information that allows them to detect and avoid risks, using decision support systems or alerting systems. The availability of additional information may change the strategies decision makers use to address a decision problem [6]. In the context of cyber security, Intrusion Detection Systems (IDS) are widely used. They monitor networks or systems for malicious activity or policy violations, and they issue alerts when suspicious events are detected [8]. The quality of an IDS is measured in terms of its ability to distinguish between malicious and non-malicious events. From the user's perspective, the IDS can improve the information the user has for deciding whether an event is malicious or not.

1.4 Our Contributions

We develop a model of the value of the investments, leading to the lowering of the likelihood of adverse events, the costs related to the adverse events, and the information available for detecting adverse events. These three security measures affect different parts of the coping with threats, but they are intricately related. For instance, lowering the chances for the occurrence of adverse events may lower the need for investment in improved decision support. Thus, decision makers may need to consider trade-offs when deciding on the optimal strategy to manage cyber risks. We also conducted a behavioral experiment to assess actual user behavior in a controlled lab setting and to compare it to the model predictions. The results of the study allow us to identify possible differences between user preferences and choices and the optimal security behavior, prescribed by the normative model. In particular, we aim to determine whether users respond more strongly and are more sensitive to the likelihood of damage or to its severity. We also observe whether users are able to choose the optimal investment in the alerting system, considering the properties of the system. The knowledge of such biased decisions can help us decide whether choices of risk-mitigation investments can be left to the user’s intuitive choices, or whether one needs to develop ways to help users choose optimal protection.

2 Model

2.1 Modeling Framework

We use Signal Detection Theory (SDT) [11,16–18] to describe and study binary classification decisions under uncertainty. It assumes that classifications are made by deciding from which of two overlapping distributions (often referred to as signal and noise) an observation was sampled [21].


Signal detection theory commonly assumes Gaussian distributions. However, to facilitate the development of analytical solutions, we use Exponential distributions to represent the signal and noise.
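One convenient consequence of the exponential assumption (a worked note, under the assumption that an alarm is raised when the observed value exceeds the threshold β^I and that malicious, i.e. signal, events follow the rate-λ_s distribution) is that the IDS hit and false-alarm rates have closed forms:

```latex
% Exceedance probabilities of exponential distributions at threshold \beta^{I}:
P^{I}_{TP} = \Pr[X_s > \beta^{I}] = e^{-\lambda_s \beta^{I}}, \qquad
P^{I}_{FP} = \Pr[X_n > \beta^{I}] = e^{-\lambda_n \beta^{I}}.
```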

2.2 Model Description

We consider a decision-making model of an agent (a company, an organization or an individual) in a given period (e.g., a year) to decide on cyber risk management. We consider three security measures for cyber risk management: firewalls, intrusion detection systems (IDS) and cyber-insurance. They become active sequentially: a threat first has to pass the firewall. It may then be detected by the IDS, and the user can respond to the indication of the IDS. The cyber-insurance will lower the damage if neither of these two security measures stops the threat and an attack occurs.

2.3 Firewall and IDS

Consider an information asset of the company with value v, which is also the loss caused when an attack succeeds. Let N be the number of events trying to access the information asset in the given period and ε of them are malicious, which can be estimated from the historical data. We assume the asset is protected by the typical cybersecurity system architecture. IDSs will raise an alarm when an event is identified as "malicious". The outputs of IDSs, together with the features of the events, are inputs to humans' classification decisions. We use P^F_m to denote the probability that a malicious event can pass the firewall, which depends on the agent's investment in firewalls, yielding P^F_m = f(x). We assume that all non-malicious events can pass the firewall. Therefore, the probability that an event that passed the firewall is malicious is

P_s = P(Malicious | after Firewall) = ε · P^F_m / ( (1 − ε) + ε · P^F_m )   (1)

and the probability that the event is non-malicious is P_n = 1 − P_s. The probability that the IDS correctly detects a malicious event (i.e., a true positive) is P^I_{TP}, and P^I_{FP} is the probability that the IDS incorrectly classifies an event as malicious (i.e., a false positive). The probabilities of false negatives and true negatives are denoted by P^I_{FN} = 1 − P^I_{TP} and P^I_{TN} = 1 − P^I_{FP}, respectively. The values ⟨P^I_{TP}, P^I_{FP}⟩ depend on the agent's investment in the IDS and the agent's alarming threshold β^I, according to the signal detection theory (SDT) parameters ⟨λ_s, λ_n⟩ displayed in Fig. 1. We can compute the probability that an event is malicious when the alarm is raised, i.e., the positive predicted value (PPV), and the probability that the event is non-malicious when the alarm is not raised, i.e., the negative predicted value (NPV).


PPV = P(Malicious | Alarm) = P^I_{TP} P_s / ( P^I_{TP} P_s + P^I_{FP} P_n )   (2)

NPV = P(Non-Malicious | No-Alarm) = P^I_{TN} P_n / ( P^I_{TN} P_n + P^I_{FN} P_s )   (3)

Fig. 1. Signal detection theory distributions for categorization decisions. The two exponential distributions for non-malicious and malicious events are parameterized by λ_s, λ_n, respectively. W.l.o.g., we assume that λ_s < λ_n. The threshold β^I indicates that events with values above β^I will trigger an alarm in the IDS. When applying the model to the human monitoring decisions, β^H_A is the human threshold when an alarm was raised and β^H_{Ā} is the human threshold when no alarm was raised.
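The following numerical illustration (made-up parameter values, not from the paper's experiments) chains Eqs. (1)-(3): exponential SDT rates for a given threshold, the post-firewall malicious probability P_s, and the predictive values PPV and NPV. It assumes alarms are raised above the threshold and that malicious events follow the λ_s distribution.

```python
# Chaining Eqs. (1)-(3) with illustrative parameter values.
import math

eps, P_m_F = 0.05, 0.4          # fraction of malicious events, firewall pass rate
lam_s, lam_n, beta = 0.5, 2.0, 1.0

P_tp = math.exp(-lam_s * beta)  # IDS true-positive rate
P_fp = math.exp(-lam_n * beta)  # IDS false-positive rate
P_tn, P_fn = 1 - P_fp, 1 - P_tp

P_s = eps * P_m_F / ((1 - eps) + eps * P_m_F)        # Eq. (1)
P_n = 1 - P_s
PPV = P_tp * P_s / (P_tp * P_s + P_fp * P_n)         # Eq. (2)
NPV = P_tn * P_n / (P_tn * P_n + P_fn * P_s)         # Eq. (3)
print(f"P_s={P_s:.3f}  PPV={PPV:.3f}  NPV={NPV:.3f}")
```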

2.4 Human Monitoring

The human monitoring specifies the agent's monitoring policy, considering the company's defense systems (firewall and IDS). The agent classifies the event as malicious or non-malicious, given the output of the IDS, i.e., alarm or no-alarm. The payoff matrix of the agent consists of: (i) if the event is malicious and the agent classifies it as "malicious", the agent suffers −γv; (ii) if the event is non-malicious and the agent classifies it as "malicious", the agent suffers −c; (iii) if the event is malicious and the agent classifies it as "non-malicious", the agent suffers −v; and (iv) if the event is non-malicious and the agent classifies it as "non-malicious", the outcome is d. We note that 0 ≤ γ < 1 and c < v. The agent's monitoring policy is also modeled by SDT with the additional outputs, and the decision variables are β^H_A, β^H_{Ā}, which specify the thresholds to classify the alarmed and no-alarmed events, as displayed in Fig. 1.
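A small worked example of this monitoring trade-off (toy numbers, not from the paper): given the payoff matrix above and the probability q that an alarmed event is actually malicious (e.g., the PPV of Sect. 2.3), classifying the event as "malicious" is the better response whenever its expected payoff exceeds that of classifying it as "non-malicious".

```python
# Expected-payoff comparison for the human's binary classification decision.
def classify_as_malicious(q, v, gamma, c, d):
    act = q * (-gamma * v) + (1 - q) * (-c)     # expected payoff of "malicious"
    ignore = q * (-v) + (1 - q) * d             # expected payoff of "non-malicious"
    return act > ignore

print(classify_as_malicious(q=0.40, v=10, gamma=0.2, c=1, d=3))   # True
print(classify_as_malicious(q=0.10, v=10, gamma=0.2, c=1, d=3))   # False
```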

2.5 Cyber Insurance

The agent can purchase cyber insurance from an insurer with a premium to cover all (or a fraction of) the damage caused by the malicious events. The set


of insurances provided by the insurer is denoted by I, and each insurance I_i ∈ I is specified by ⟨P_i, L_i⟩, where P_i is the premium and L_i is the limit of liability. The insurer will not pay more than the limit of liability. The agent's strategy is denoted by z, where z_i = 1 if I_i is purchased and z_i = 0 otherwise.

2.6 Combining the Security Measures

After specifying the investments x, y, z and the thresholds β^I, β^H_A, β^H_{Ā}, we can compute the agent's expected cost caused by an event that passes the firewalls:

û(x, y, z; β^I, β^H_A, β^H_{Ā})
= P_A · { P(Malicious | Alarm) [ P^H_{TP|A}(−γv) + P^H_{FN|A}(−v) ] } + P_A · { P(Non-Malicious | Alarm) [ P^H_{FP|A}(−c) + P^H_{TN|A}(d) ] } + P_{Ā} · { P(Malicious | No-Alarm) [ P^H_{TP|Ā}(−γv) + P^H_{FN|Ā}(−v) ] } + P_{Ā} · { P(Non-Malicious | No-Alarm) [ P^H_{FP|Ā}(−c) + P^H_{TN|Ā}(d) ] }
= P^I_{TP} P_s [ P^H_{TP|A}(−γv) + P^H_{FN|A}(−v) ] + P^I_{FP} P_n [ P^H_{FP|A}(−c) + P^H_{TN|A}(d) ] + P^I_{FN} P_s [ P^H_{TP|Ā}(−γv) + P^H_{FN|Ā}(−v) ] + P^I_{TN} P_n [ P^H_{FP|Ā}(−c) + P^H_{TN|Ā}(d) ]   (4)

where P_A is the probability that an event will cause an alarm when passing the IDS. Therefore, the agent's damage caused by all events can be computed as Ū(x, y, z; β^I, β^H_A, β^H_{Ā}) = N ( ε P^F_m + (1 − ε) ) · û(x, y, z; β^I, β^H_A, β^H_{Ā}). The agent's expected utility is

U = min{ 0, Σ_{I_i ∈ I} z_i · L_i − Ū } − I(B) · Ũ   (5)

with the constraint that

Ũ ≤ B   (6)

where Ũ = x + y + Σ_{I_i ∈ I} z_i · P_i is the sum of investments in the firewall, IDS and insurance. The first term of Eq. (5) implies that if the expected damage Ū is higher than the limit of liability, the agent will pay the excess cost. Otherwise the insurance will cover the entire cost. The indicator function in the second term is defined as: I(B) = 0 if B < ∞, which means the risk management is constrained by a finite budget B, and I(B) = 1 if B = ∞, which means the risk management is without any budget constraint. Our goal is to compute the optimal assignment of the budget to maximize the agent's expected utility.

3

Experiment

We conducted an experiment to test the extent to which model predictions correspond with actual user behavior, at least in controlled settings. To do so, we developed a web-based experimental system that presents trials, resembling


incoming email messages. Participants had to classify messages as malicious or non-malicious. The experiment consisted of six sessions. At the beginning of each session, participants decided how much they wanted to invest in a firewall, an IDS and insurance. Greater investment in each of the security measures led to greater protection. The system assigned the participants randomly to one of the four experimental conditions. The conditions differed in the optimal investment in the security measures (firewall, IDS, or insurance). In three of the four conditions, the optimal investment in one of the measures was the maximum 10, and it was 0 for the two other measures. In the fourth condition, the optimal investment was 0 for all measures. 3.1

Participants

Participants were 98 engineering students (59 females and 39 males), with 24 participants in condition 1, 24 participants in condition 2, 20 participants in condition 3, and 30 participants in condition 4. Most (84 participants) performed the experiment as part of a project in a course on quantitative models of human performance, while the remaining 14 responded to requests by the experimenter to participate in the study. 3.2

Experiment Description

The game is played by 98 independent players, who are all potential victims of attacks, where the attacker is modeled by the system. The game consists of 6 sessions, with 30 trials in each session. In each trial, security attacks occur probabilistically, according to a fixed, exogenous probability p = .3, which determines the baseline rate of security attacks. The success or failure of an attempted attack depends on the defensive measures purchased by the player in the beginning of each session. Each session began with investment screen in which the participant could explore and eventually choose a combination of investments, consisting of investments between 0 and 10 points in each of the three security systems, from a total budget of 30 points in each session: a An automatic system (resembling a firewall) that blocks part of the malicious events, letting all non-malicious events through. The greater the investment in the firewall, the larger the proportion of malicious components the system blocks. b An automatic alert system (referred to as the Intrusion Detection System, IDS), which either issues or does not issue an alert that indicates a malicious component. The greater the investment in the IDS, the greater its ability to distinguish between malicious and non-malicious components (in Signal Detection terms, its sensitivity increases). c Insurance that will compensate for some or all of the damage caused when a malicious component got through. The greater the investment in the insurance, the higher the proportion of the damage the insurance covers.

40

Y. B. Yaakov et al.

Participants could check the system quality as per the different potential investment mixtures as many times as they wanted. The quality of the systems was presented as the percentage of the malicious components that are stopped by the firewall, the True Positive and False Positive rate of the IDS and the percentage of compensation from the insurance for damage caused by malicious events. After selecting the investment mixture, the participant needed to classify 30 events per session. Events were blocked by the firewall with some probability p (p depends on the participant’s investment in the Firewall security system x). In Condition 1, when the optimal investment in the firewall was to invest 10 points p (x) = −0.2x , so that with the maximal investment, the firewall blocked .5 of 1.37 − e−1e −0.3x . Here the the malicious events. In the other conditions, p (x) = 1.9 − e−0.1e maximal investment in the firewall only blocked .105 of the malicious events. The participant’s investment in the IDS determined the system sensitivity. The larger the investment, the better was the ability to distinguish between malicious and non-malicious events. For Condition 2, in which the maximal −0.5x . For the other conditions, investment in the IDS was optimal, λIs = 1 − e−4e −0.5x λIn = 1 + e−4e . Figure 2 shows the probabilities PT P and PF P in the IDS for different investments for Condition 2 and the other conditions.

Fig. 2. PT P , PF P as functions of the investment in the IDS for Condition 2 and Conditions 1, 3 and 4.

If participants purchased insurance, they were compensated for some of the damage if they mistakenly classified a malicious event as non-malicious. For Condition 3, in which the optimal investment in the insurance was the maximal investment, the proportion of the damage covered by the insurance was I (z) = 0.08x. In the other conditions, the proportion of the damage covered by the insurance was I (z) = 0.03x. The private information the participant had about an event was shown as a rectangle. The longer the rectangle, the greater the probability that the component was malicious (length was presented visually, as well as with a numerical

Choosing Protection

41

value). Participants classified the events by using their private knowledge (the rectangle length) the likelihood of an event being malicious, given the properties of the firewall and the IDS, and the expected damage from malicious events, given the investment in the insurance. Classifying a malicious event as non-malicious resulted in the loss of 6 points, while correctly classifying a non-malicious event resulted in the gain of 3 points (no gain or loss of points in other cases). After classifying 30 components, the session ended and the next session began. Participants’ points were reset, and they could choose a new investment mixture and then moved on to classify 30 additional events.

4 4.1

Results Analysis of the Investment Choices

We analyzed the investment in the three security measures in the six sessions for the four conditions with a three-way Analysis of Variance. The security measure (firewall, IDS or insurance) and the session (six sessions) were within-subject variables, and the condition was a between-subject variable. We report the significance of effects with Greenhouse-Geisser corrections for all within- subject effects. There was a significant main effect of the security measure, F (1.995, 187.56) = 43.33, M Se = 22.11, p < .001. The investment in the IDS was overall higher (M = 6.675, SD = 2.98) than the investment in the other measures (Mf irewall = 4.17, SDf irewall = 3.32; Minsurance = 4.31, SDinsurance = 3.40). There was also a significant main effect of the condition, F (3, 94) = 6.94, M Se = 38.996, p < .001. Condition 1, in which investment in the firewall was optimal, showed more investment (M = 5.95, SD = 3.05) than Condition 2 (M = 3.76, SD = 4.02), p < 0.001, and Condition 4, in which the optimal strategy was not to invest at all, showed more investment (M = 5.11, SD = 3.23) than Condition 2, p < 0.05. Also significant were the interactions Measure X Condition, F (5.99, 22.11) = 7.54, M Se = 22.74, p < .001. and Measure X Session, F (8.37, 786.85) = 28.33, M Se = 6.68, p < .001. To gain a better understanding of the patterns of results, we conducted separate analyses for the different security measures. In the analysis of the investment in the firewall, there was a significant difference between the conditions, F (3, 94) = 9.52, M Se = 30.62, p < .001. The investments in Condition 1, in which investment in the firewall was optimal, were greater (M = 6, SD = 2.85) than those in Conditions 2 (M = 2.94, SD = 3.35), p < .001 and Condition 3 (M = 3.02, SD = 3.035), p < .001 according to Tukey HSD. There was no significant effect of the session for the firewall security measure. Participants rapidly realized that it is worthwhile to invest in the firewall when the investment was indeed optimal. However, they did not reach the maximal investment. There was no evidence for learning. Thus, participants did not approach the optimal investment in the firewall (raise the investment in Condition 1 to the maximum and lower the investment in all other conditions to 0).

42

Y. B. Yaakov et al.

The results for the investment in the insurance resembled those for the investment in the firewall. There was a significant difference between the conditions, F (3, 84) = 8.762, M Se = 30.77, p < .001. In Condition 3, in which investment in the insurance was optimal, participants invested more (M = 6.275, SD = 3.15) than in Condition 2, (M = 2.90, SD = 3.15), p < .001, and Condition 4 (M = 3.84, SD = 2.99), P < .005. Condition 1 showed more investment (M = 4.69, SD = 3.28) than Condition 2, p < 0.05 according to Tukey HSD. The effect of the session was again not significant. Thus, here, too, participants recognized the condition in which maximum investment in the insurance was optimal. Here, too, there was no evidence for learning over time. In the analysis of the investment in the IDS, there was a significant difference between the sessions, as can seen in Fig. 3, F (3.996, 375.61) = 6.38, M Se = 7.399, p < .001. Session 1 (M1 = 5.43, SD1 = 2.58), showed less investment than any other session (M2 = 7, SD2 = 2.75; M3 = 6.91, SD3 = 2.79; M4 = 7, SD4 = 3.00; M5 = 6.99, SD5 = 3.12; M6 = 6.72, SD6 = 3.27). There was no significant difference between the conditions for the IDS security measure. Thus, for the IDS, participants did not differentiate between the condition in which investments in the IDS were justified and the others. There was a change over time in the investments in IDS, but it was an overall increase in the willingness to invest in the IDS, irrespective of the effect of the investment.

Fig. 3. IDS: mean investment for the conditions and sessions

5

Discussion

We developed a model of the optimal investment in different security measures and examined actual investment choices in three different security measures, considering the quality of the systems. The results showed that participants preferred to invest in the IDS, compared to the firewall and insurance. However, the participants were not sensitive to the quality of the IDS. People invested between 60–70% of the maximum amount in the IDS, regardless of system quality. Investments in the IDS increased from the first session to the second and then remained fairly constant. Thus, even though there was a change over time, there

Choosing Protection

43

was no evidence for systematic learning that moves the investments towards the optimum in the different conditions. These findings are in line with the results of previous studies that show that user choices of the settings of alerting systems are often problematic [5,14]. There was, however, some awareness of the differences in the quality of the firewall and insurance. Players invested more in these two security measures when the maximal investment was optimal, although they did not reach the optimal investments. Neither did the investment in the conditions, in which the investment in the firewall or the insurance provided no benefits, diminish. Overall, there were no significant learning effects for these two measures. Participants decided on the level of investment in them during the first session, and they maintained approximately the same level throughout the experiment. These results indicate that people (at least in our controlled experiment) fail to make optimal decisions regarding risk taking/investment in defense systems. Therefore, an external mechanism is required to support these decisions. An additional solution is the government’s regulation that specifies the properties of the security measures (e.g., the required insurance coverage).

6

Implications from Our Study

Future research should validate and expand the results in additional systems. Still, our study does show that participants do not respond adequately to properties of the security measures. They over-invest in information that is supposed to help them differentiate between threat and non-threat situations, even when this information has limited value. They do not distinguish between systems in which the information benefits them and others in which it does not. Also, even though participants realized that some firewall and insurance conditions were better than the others, they did not adjust their responses sufficiently. They did not invest enough in the firewall or the insurance when investments could have benefited them, and they invested too much in these measures when they did not provide benefits that would have justified the investment. In summary, our study shows that users’ choices regarding the investment in security measures are problematic. Users can distinguish between better and worse investments in measures that lower the likelihood of attacks or the severity of the attack consequences. They invest more when the investment is justified, but not enough, and they invest too much when investments are not justified. When it comes to information, users invest in it, even if it is practically useless. These findings should be considered when planning risk mitigation strategies. It may be problematic to rely on users’ intuitive judgments to choose how to allocate resources to cyber security measures. Instead, it may be necessary to conduct systematic, decision-analytical evaluations of risk mitigation measures to optimally allocate the resources to cyber security measures. Acknowledgements. The research was partly funded by the Israel Cyber Authority through the Interdisciplinary Center for Research on Cyber (ICRC) at Tel Aviv University. This research was also supported by NCR2016NCR-NCR001-0002, MOE, and NTU.

44

Y. B. Yaakov et al.

References 1. Bajcsy, R., Benzel, T., et al.: Cyber defense technology networking and evaluation. Commun. ACM 47(3), 58–61 (2004) 2. Ben-Asher, N., Meyer, J.: The triad of risk-related behaviors (TriRB): a threedimensional model of cyber risk taking. Hum. Factors 60(8), 1163–1178 (2018) 3. Bissell, K., Ponemon, L.: The cost of cybercrime - unlocking the value of improved cybersecurity protection (2019). https://www.accenture.com/ acnmedia/PDF-96/ Accenture-2019-Cost-of-Cybercrime-Study-Final.pdf 4. Borgida, E., Nisbett, R.E.: The differential impact of abstract vs. concrete information on decisions 1. J. Appl. Soc. Psychol. 7(3), 258–271 (1977) 5. Botzer, A., Meyer, J., Bak, P., Parmet, Y.: Cue threshold settings for binary categorization decisions. J. Exp. Psychol.: Appl. 16(1), 1–15 (2010) 6. Botzer, A., Meyer, J., Borowsky, A., Gdalyahu, I., Shalom, Y.B.: Effects of cues on target search behavior. J. Exp. Psychol. 21(1), 73–88–539 (2014) 7. Bowen, B.M., Devarajan, R., Stolfo, S.: Measuring the human factor of cyber security. In: 2011 IEEE International Conference on Technologies for Homeland Security (HST), pp. 230–235. IEEE (2011) 8. Cavusoglu, H., Mishra, B., Raghunathan, S.: A model for evaluating it security investments. Commun. ACM 47(7), 87–92 (2004) 9. Cisco: Cisco website. https://www.cisco.com/c/en/us/products/security/ firewalls/what-is-a-firewall.html. Accessed 2 May 2019 10. Lindros: CIO website. https://www.cio.com/article/3065655/what-is-cyberinsurance-and-why-you-need-it.html. Accessed 2 May 2019 11. Marcum, J.: A statistical theory of target detection by pulsed radar. IRE Trans. Inf. Theory 6(2), 59–267 (1960) 12. MAS: Annual report 2014/15. http://www.parliament.gov.sg/lib/sites/default/ files/paperpresented/pdf/2015/. Accessed 2 May 2019 13. Meyer, J.: Conceptual issues in the study of dynamic hazard warnings. Hum. Factors 46(2), 196–204 (2004) 14. Meyer, J., Sheridan, T.B.: The intricacies of user adjustment of system properties. Hum. Factors 59(6), 901–910 (2017) 15. M¨ oller, S., Ben-Asher, N., Engelbrecht, K.P., Englert, R., Meyer, J.: Modeling the behavior of users who are confronted with security mechanisms. Comput. Secur. 30(4), 242–256 (2011) 16. Nevin, J.A.: Signal detection theory and operant behavior: a review of David M. Green and John A. Swets’ signal detection theory and psychophysics1. J. Exp. Anal. Behav. 12(3), 475 (1969) 17. Pastore, R., Scheirer, C.: Signal detection theory: considerations for general application. Psychol. Bull. 81(12), 945 (1974) 18. Tanner Jr., W.P., Swets, J.A.: A decision-making theory of visual detection. Psychol. Rev. 61(6), 401 (1954) 19. de Vries, J.: What drives cybersecurity investment?: organizational factors and perspectives from decision-makers. Master’s thesis, System engineering, Policy Analysis and Management, Technical University Delft, Delft (2017) 20. West, R.: The psychology of security. Commun. ACM 51(4), 34 (2008) 21. Wickens, T.D.: Elementary Signal Detection Theory. Oxford University Press, USA (2002)

When Is a Semi-honest Secure Multiparty Computation Valuable? Radhika Bhargava(B) and Chris Clifton(B) Purdue University, West Lafayette, IN 47907, USA {bhargavr,clifton}@purdue.edu

Abstract. Secure Multiparty Computation protocols secure under the malicious model provide a strong guarantee of privacy and correctness. The semi-honest model provides what appears to be a much weaker guarantee, requiring parties to follow the protocol correctly. We show that for all but a small class of problems, those in the non-cooperatively computable class, the correctness guarantee of the malicious protocol effectively requires semi-honest parties as well. This suggests a wider utility than previously thought for semi-honest protocols. Keywords: Secure multi party computation semi-honest models · Incentive compatibility

1

· Malicious and

Introduction

Secure Multiparty Computation (SMC)/Secure Function Evaluation (SFE) provide what appears a commercially invaluable tool: the ability to compute a shared outcome without sharing one’s own private data. While general tools exist allowing efficient computation of arbitrary (polynomial time) functionalities, the few practical uses have been in relatively simple auction protocols, in spite of the efficiency and strong guarantees of protocols secure under the malicious model. We show that there is a fundamental reason why such adoption has not happened: the class of protocols where a malicious protocol makes sense is relatively small. In most cases, even with a protocol secure under the malicious model, we still must make a semi-honest assumption about the parties involved. The exception is a somewhat limited class of problems studied in game theory, those that can reach equilibrium under the Non-cooperative computation (NCC) [35] model. We first introduce a class of problems where SMC would seem quite appropriate: revenue-sharing contracts. Under a revenue-sharing contract, a retailer pays a supplier a wholesale price for each unit purchased, plus a percentage of the revenue the retailer generates. A revenue sharing contract proceeds in two stages - in the first stage the quantity and the price at which the good has to be This work was partially supported by a grant from the Northrop-Grumman Cybersecurity Research Consortium. c Springer Nature Switzerland AG 2019  T. Alpcan et al. (Eds.): GameSec 2019, LNCS 11836, pp. 45–64, 2019. https://doi.org/10.1007/978-3-030-32430-8_4

46

R. Bhargava and C. Clifton

traded is decided by maximizing the profits of the supply chain. In the second stage, the retailer shares the profits with other entities. The ability to negotiate a revenue-sharing contract without revealing one’s own costs/sales/revenue would seem a valuable business tool. Unfortunately, we show that to maximize its own profit, a rational party should lie about its input. Input modification is outside the scope of the protections provided by SMC, and the absolute guarantees of the malicious model enable a party to lie with impunity. We look at an approach - incentive compatibility - that will help us to enable SMC to address the full range of malicious behavior in a real-world, multiparty problem. A protocol is incentive-compatible if every participant can achieve the best outcome to themselves just by acting according to their true preferences. Incentive compatibility is generally used in mechanism design; where if each agent has truthfulness as a dominant strategy for the game form, the mechanism is said to be straightforward (or “cheat proof”) [4,11,21,22]. There are several degrees of incentive compatibility: a stronger notion is that truth telling is a weak dominant strategy, i.e., a Nash equilibrium will exist if all the players report truthfully their strategy; a weaker degree is bayes-nash incentive compatibility where if all players are truthful then it is the best strategy or at least not worse for you to be truthful. There are also two basic notions of incentive compatibility - either it is enforced because truth telling is the dominant strategy, or if a person does not have any incentive to lie as he will be caught. When we apply cryptography to game theory, we seek to replace the trusted mediator with an SMC protocol. Our main contribution is to answer these research questions: – When we apply cryptographic protocols to replace a mediator in game theoretic problems, are malicious protocols enough to achieve a game theoretic equilibrium? What are the functions for which we can build a secure protocol? We prove that we have to incorporate rationality of the players to prevent cheating while designing protocols, which requires incentive compatibility. We do so by showing that a real world game theoretic problem (Supply chain coordination with revenue sharing contracts) is not incentive compatible and hence there will always be a strategy which will give a higher payoff when the party lies. Since, we cannot prevent a player from lying about his input, we cannot build a privacy preserving secure protocol. Then we study, under what conditions can an incentive compatible protocol can be developed? We have done this by introducing the Non-cooperative computation (NCC) [35] model and prove that if an incentive compatible protocol has to be developed and we want the function to be computed correctly, then we are restricted to the NCC model. We also show that if a function is in NCC, then we can build an incentive compatible protocol which is secure in the malicious model. The rest of the paper is organized as follows. Section 2 gives an example which is not incentive compatible that proves that an attack strategy will exist which ensures that a game theoretic Nash equilibrium is impossible. We discuss the conditions under which a function is incentive compatible in Sect. 3, and give an

When Is a Semi-honest Secure Multiparty Computation Valuable?

47

example of an incentive compatible protocol. In Sect. 5, we discuss our results. Finally, in Sect. 7 we present our conclusions.

2

Applying Secure Multiparty Model to Game Theory - The Supply Chain Problem and Revenue Sharing Contracts

Cryptography and game theory is concerned with developing protocols (games respectively) where there are mutually distrusting parties (players) aiming for a common objective. Most game theoretic equilibria (e.g., Bayes-Nash equilibria) are possible if there exists a trusted third party called a “mediator”. Cryptographic protocols attempt to solve this problem by replacing the mediator to ensure the privacy of the players input. However, game theoretic problems assume that the parties are rational, i.e., they seek their own self interest. Cryptographic protocols assume that the parties are either honest (they follow the protocol) or malicious (they may deviate from the protocol). The solution concept in game theoretic problems is an equilibrium which provides the maximum utilities to a party given the other parties strategies. In cryptographic problems the solution concept is a protocol which the parties are assumed to follow or if they deviate then they will be caught. Recently, cryptographic researchers have started exploring cryptographic information exchange protocol such as Secure multiparty communication (SMC) in Game Theoretic settings. A Secure multi-party computation is a problem where n players want to compute any given function of their inputs in a secure way. Ensuring security means guaranteeing the correctness of the output and privacy and independence of the players inputs [30] even when some players cheat. A secure protocol should prevent an adversary from generating an incorrect output by deviating from the function that the parties had set out to compute and simultaneously ensuring that no party learns any information during the executing of the protocol apart from its output. Formal definitions of SMC exist for two adversary models: semi-honest and malicious [31]. In the semi-honest adversarial model, every party (corrupted or honest) follows the protocol specification correctly. However, the corrupted parties may attempt, following the completion of the protocol, to use information exchanged to learn information that should remain private. This is a rather weak adversarial model and would seem to provide insufficient security for practical applications. A protocol secure in the malicious model guarantees that a party who diverge arbitrarily from the normal execution of the protocol will be caught by imposing penalties. This is a much stronger model for security; however it comes at a trade off - protocols in the malicious model may be much more expensive to compute than protocols secure in the semi-honest model. There are other frameworks for analyzing the security of cryptographic protocols. The Universally Composable (UC) security framework guarantees that if a protocol is secure in this framework then it will be secure even if it is executed concurrently with any arbitrary protocol [8]. It has been shown that regardless

48

R. Bhargava and C. Clifton

of the number of corrupted parties any two party or multi-party functionality can be performed provided that it is in the common reference string model [9]. This is orthogonal to the problem we study; we assume either semi-honest or malicious behavior is satisfied by the protocol (presumably satisfying universal composability as well, if that is appropriate for the use), and study when the guarantees under malicious actually provide a stronger guarantee on satisfying the overall task than the semi-honest model. Protocols secure in the semi-honest model achieve greater efficiency, but suffer from the strong assumption that parties will not deviate from the protocol even if they benefit from doing so. Perhaps most critically, even protocols secure under the malicious model do not prevent a party from lying about their input. The assumption is justified by the fact that the parties want to learn the correct output. However, this assumption is not guaranteed when the participating parties want to learn the correct result exclusively or they can ensure that they will compute the correct output even with an incorrect input. E.g., in a supply chain a supplier might inflate his price to keep a greater share of the profits. We want to answer the question - can we build secure protocols for such mechanisms. If not, then what are the functions for which we can build a secure protocol. We now discuss the supply chain problem and investigate if we can apply SMC to the problem. Firstly, we give an overview of the supply chain problem and revenue sharing contracts. We then go on to prove that unless we verify the players input we cannot build a secure model as one of the players always has an incentive to cheat. The aim of this problem is to demonstrate the limitations of a malicious model. The main aim of supply chain coordination is to coordinate autonomous entities in a supply chain so as to reduce global inefficiency which may arise because of conflicts of interests between the different entities [7,17]. Supply chain coordination mechanisms include [23] supply chain contracts which encompass information sharing and joint decision making. The aim of the supply chain contract is to negotiate between the buyer and the seller a set of parameters (like quantity, price) to increase the total Supply Chain profits and share the risk among the SC partners [37]. In information sharing the SC partners coordinate by sharing information about the market demand, inventory etc. For supply chain coordination, let the fixed cost incurred by the manufacturer be ψ(q), where q is the quantity transferred between the supplier and the retailer. The fixed cost incurred by the retailer is φ(q) and the retail price charged by the retailer is s per unit. The supplier and the retailer agree on the transfer cost c for each unit. In the absence of supply coordination, the retailer and the supplier will maximize their own individual objectives, i.e., their profits, which may not align with the supply chain objectives [2]. With no trusted mediator the retailer chooses a quantity q, that maximizes his profit function πr , max πr ((s − c)q − φ(q))

(1)

The supplier’s objective is to chooses a quantity q and transfer cost c to maximize his profit function πs , (2) max πs (cq − ψ(q))

When Is a Semi-honest Secure Multiparty Computation Valuable?

49

The supply chain’s profit function π is max πsc (sq − φ(q) − ψ(q))

(3)

We try to maximize the supply chain’s profits instead of maximizing each individual’s profits, as the latter leads to sub optimal performance due to double marginalization [36]. The double marginalization problem justifies the use of a trusted mediator to integrate the supply chain. However, with the trusted mediator issues of cheating arise 1. The supplier can over inflate the transfer cost, to keep a larger share of the profits, or 2. The retailer can cheat about the quantity sold, underreporting revenues in a revenue sharing contract. In the next section we prove that when n = 2, i.e., there is only one retailer and one supplier, it is impossible to build a protocol that is incentive compatible (the highest payoff is incurred by the strategy of reporting the true value.) However, for n ≥ 2 we prove that the mechanism is incentive compatible. 2.1

Incentive Compatibility

In this section we prove that the protocol is inherently not incentive compatible for n = 2 and incentive compatible for n ≥ 2 with respect to inputs from the supplier and the retailer. That is, the supplier or the retailer has no incentive to modify the value of their cost functions to achieve a higher utility. We begin by reviewing the formal definition for weakly dominated strategies as given by Katz [26] where a player can never increase their utility by playing a weakly dominated strategy. i=n Definition 1. Given a game Γ = ({Ai }i=n i=1 , {ui }i=1 ) where A is the set of action A1 X A2 X ... X An with a set of strategy (a1 , ..., am ) ∈ Ai and u is the set of utility functions u1 X u2 X ... X un . A strategy Aj is weakly dominated by Ak if it can sometimes improve its payoff by playing Ak instead of Aj .

Incentive Compatibility w.r.t. Supplier. From a game-theoretic perspective we assume that the supplier receives a higher utility when he sells a greater quantity to the retailer. Let q be the quantity that coordinates the supply chain, also called the equilibrium quantity. Let μ+ denote positive utility and let μ− denote negative utility. Let μ0 denote neutral utility s.t. μ+ > μ0 > μ− . Let q* be the actual quantity traded between the supplier and the retailer. Definition 2. Let μq be the utility defined because of the quantity traded. ⎧ + ⎨ μ : q∗ > q μq = μ0 : q∗ = q ⎩ − μ : q∗ < q

(4)

50

R. Bhargava and C. Clifton

The supplier prefers to sell the quantity q at a cost c* which is at least equal to the actual cost c so as to avoid losses. Definition 3. Let μc be the utility defined because of the cost at which the equilibrium quantity is traded.  + μ : c∗ ≥ c (5) μc = μ− : c ∗ < c Definition 4. Let the suppliers reward function ρs be defined as the summation of the cost utility μc and quantity utility μq ρs = μq + μc

(6)

Let (asu , asa , aso ) be the actions of under reporting, actual reporting and over reporting the transfer cost for the supplier respectively. Lemma 1. The strategy ao s is weakly dominated by the strategy asa . Proof. Let the actual cost for the supplier be c, given by the strategy asa , and the overinflated cost be c + . Given the transfer cost the retailer will maximize his own profit function to find the optimal quantity, i.e., max πr (s − c)q − φ(q). To maximize, we set the first order derivative to be 0: s − c − φ (q) = 0

(7)

When the supplier reports the actual value, s - φ (q) = c. For an inflated value, s - φ (q) = c + . Since the retail price s is a constant value, the value of the first order derivative of the cost function of the retailer (φ (q)) will be higher because c+ > c. If the retailer can ensure that his cost function φ(q) is concave, then a lower value of the derivative will result in a quantity which is less than the optimal. Thus, if the supplier over inflates his cost the retailer will buy a quantity q* < q leading to a negative utility μ− for the supplier. The reward function ρs,o for aso is given by μq + μc = μ− + μc . Since the supplier sells at a price greater than the actual cost, he gains positive utility: μc = μ+ . The reward function ρs,o = μ− + μ+ = μ0 . The reward function ρs,a for asa is given by μq + μc . The quantity traded is optimal, therefore μq = μ0 . Since the supplier sells at a price at least equal to the actual cost, he gains positive utility: μc = μ+ . The reward function ρs,o = μ0 + μ+ = μ+ . The strategy asa weakly dominates the strategy aso as the supplier is always better off by revealing the correct value of his input. Lemma 2. The strategy asu is weakly dominated by the strategy asa . Proof. Let the actual cost for the supplier be c which is given by the strategy asa and the reduced cost be c - . Given the transfer cost the retailer will respond by maximizing his profit function to find the optimal quantity max πr = (s − c)q − φ(q). To maximize, we set the first order derivative to be 0. At a reduced cost value (c − ); s − φ (q) = c − . As the retailer’s cost function is concave, the

When Is a Semi-honest Secure Multiparty Computation Valuable?

51

quantity sold is inversely proportional to its derivative. Therefore, the utility of the supplier due to quantity sold = μ+ . The reward function ρs,o for aso is given by μq + μc = μ+ + μc . Since the supplier sells at a price less than the actual cost, he incurs losses leading to a negative utility: μc = μ− . The reward function ρs,o = μ+ + μ− = μ0 . The reward function ρs,a for asa is μ+ as proved in the earlier section. Hence, the supplier is always better off by being honest about his input. Theorem 1. The protocol is incentive compatible w.r.t the suppliers inputs. Proof. It follows immediately from Lemmas 1 and 2 that the strategy asa gives the highest payoff to the supplier. 2.2

Incentive Compatibility w.r.t. Retailer

In a revenue sharing contract the retailer can over inflate his cost expenses to the trusted mediator to get a greater share of the revenues. The quantity and the trade price has already been negotiated between the supplier and the retailer in the earlier phase of the contract. From game theoretic perspective we assume that the retailer receives a higher utility when he receives a greater share of the revenue (profits). Let R(q) = (s − c)q − φ(q) be the revenue generated which has to be shared between the entities. Let μ+ denote positive utility and let μ− denote negative utility. Let μ0 denote neutral utility as above. Let R*(q) be the revenue share of the retailer. Definition 5. Let μr be the utility defined because of the revenue generated. ⎧ + ⎨ μ : R∗ (q) > R(q) μr = μ0 : R∗ (q) = R(q) (8) ⎩ − μ : R∗ (q) < R(q) Let (aru , ata , aro ) be the actions of under reporting, actual reporting and over reporting the cost function for the retailer respectively. Case 1: n = 2 Lemma 3. The strategy ara is weakly dominated by the strategy aro . Proof. The retailer over reports his cost function φ(q)+ to the trusted mediator. The trusted mediator, in order to maximize the supply chain profits, sets the revenue function’s first order derivative to 0: s − c − φ (q) = 0

(9)

If the inflated cost function φ(q)+ is parallel and above the actual cost function φ(q) then the retailer inflates the cost function, while keeping the quantity sold q  constant. (Note the retailers cost function does not effect the quantity traded because a constant value’s  s derivative is 0). The retailer has an incentive to cheat because maximizing the profit function considers only the derivative which is the same for two parallel curves at any given point, thereby ensuring that the quantity to coordinate the supply chain does not change.

52

R. Bhargava and C. Clifton

Theorem 2. For n = 2, the protocol is not incentive compatible w.r.t the retailers inputs. Proof. The retailer has a higher utility when he reports an inflated cost function as proved in Lemma 3. Hence, the protocol is not incentive compatible. Case 2: n > 2 When n > 2, we assume that there is one supplier and multiple retailers. Multiple retailers does not guarantee that for any given retailer trade will occur. Let μt be the utility defined because of the trade between the supplier and retailer.  + μ : trade occurs (10) μt = μ− : trade does not occur Lemma 4. The strategy aro is weakly dominated by the strategy ara . Proof. The retailer over reports his cost function φ(q)+ to the trusted mediator as before. However, the retailer runs the risk of trade not happening because there may be another retailer who might have a lower value of the cost function, hence higher profits. The retailers reward function ρr is defined as ρr = μt + μr

(11)

Let us consider the case when the retailers have complete information about all the other retailers, i.e., their cost functions, and when the retailers have incomplete information. Case 1: Complete Information: When the retailers have complete information then the equilibrium is not reached until every retailer increases their cost function to the maximum of all the cost functions of all the retailers. This is because a retailer with the lowest cost function can increase its cost function till the cost function becomes equal to the next retailer’s cost function without fearing loss of trade. Hence the reward ρr = μt + μr = μ+ + 2μ+ = 3μ+ increases because the payoff is higher due to increased revenue. This is recorded by increasing the revenue utility function to 2μ+ . This successively continues, till the equilibrium is reached when all the retailer’s cost function becomes equal. Case 2: Incomplete Information: In the case of incomplete information, the retailer’s reward function is defined as - E(ρr ) = Pr(trade occurs) ∗ (μt + μr ). The retailers reward function for in case of truthfully reporting his value is given by μ0 + μ+ = μ+ . We now calculate the retailer’s reward function when overreporting the cost function. To do so, we assume that due to competition, if a retailer with increased cost risks the loss of trade. Therefore, Pr(trade occurs) ∝ 1/n. Hence, E(ρr ) ∝ 1/n ∗ (μt + μr ) = E(ρr ) ∝ 1/n ∗ (μ+ + μ+ ) = E(ρr ) ∝ 1/n ∗ (2μ+ ) The expected value of the retailer’s utility decreases as n increases with its maximum at n = 3 (0.67 ∗ μ+ ).

When Is a Semi-honest Secure Multiparty Computation Valuable?

53

Theorem 3. For n > 2, the protocol is incentive compatible w.r.t the retailers inputs. Proof. The proof immediately follows from Lemma 4. Incentive Compatibility w.r.t. Multiple Suppliers and Retailers Corollary 1. For n > 2, the protocol is incentive compatible w.r.t the retailers inputs and the supplier inputs. Proof. The proof immediately follows from Theorems 1 and 3

3

Secure Multiparty Computation and NCC Model

In this section we briefly review the game theoretic approach to cryptography, NCC and the malicious model and we bridge the gap between them. 3.1

Cryptography Applied to Game Theory

We begin by introducing the notion of normal form games. A n-player game Γ i=n =({Ai }i=n i=1 , {ui }i=1 ) where A is the set of action A1 XA2 X...XAn with a set of strategy (a1 , ..., am ) ∈ Ai and u is the set of utility functions u1 Xu2 X...Xun → R. The utility function ui of party Pi expresses this player’s preferences over outcomes: Pi prefers outcome a to outcome a iff ui (a) > ui (a ). (We also say that Pi weakly prefers a to a if ui (a) ≥ ui (a )). The game assumes that the {Ai , ui } are common knowledge among the players, although the assumption of known utilities seems rather strong and it is preferable to avoid it (or assume only limited knowledge). The game is played by having each party Pi select an action ai ∈ Ai , and then having all parties play their actions simultaneously. The “payoff” to Pi is given by ui (a1 , ..., an ) and, as noted above, Pi is trying to maximize this value. A normal form game assumes that the players are rational and truthful because it is a complete information game. When we apply game theory to cryptography we assume that the cryptographic protocol itself is the game, by defining the parties’ utilities as functions of the inputs and outputs of the parties running the protocol. To relax the assumption about complete information Bayesian games have been developed in which the players have incomplete information about the other players. For example, a player may not know the exact payoff functions of the other players, but instead have beliefs about these payoff functions. These beliefs are represented by a probability distribution over the possible payoff functions. Bayesian Nash Equilibria (BNE) result in implausible equilibria in extensive form dynamic games as non-credible threats are not accounted for. BNE has been used as a solution concept in several cryptographic settings, e.g., rational secret sharing problem studied in [15]. A more general version of the Nash equilibrium is the correlated equilibrium. In some games, there may exist a correlated equilibrium that, for every party

54

R. Bhargava and C. Clifton

Pi , gives a better payoff to Pi than any Nash equilibrium, and the correlated equilibrium can be computed in polynomial time. The game is played in two stages and includes a mediator who recommends the actions. First, a mediator chooses a vector of actions ai ∈ A according to some known distribution, and then hands the recommendation to the players. A correlated equilibrium is a strategy profile s∗ = s∗ (A1 X....XAn ) = (s∗1 , ..., s∗n ) such that for any (a1 , ..., a2 ) in the support of s*, any ai ∈ Ai , we have μi (a∗1 , s∗−i |a1 )∗ ≥ μi (a∗1 , s∗−i |a1 ). Certain game-theoretic equilibria are possible if parties rely on the existence of an external trusted party called a mediator. When cryptographic protocols are applied to Game theoretic problems they are applied to replace a trusted mediator which ensures the privacy of the input. However, a SMC protocol π does not prevent lying about the input to the protocol itself. Incentive compatibility is a mechanism that has demonstrated in the past to develop protocols which prevent lying about inputs [10,38]. The following proposition captures the idea of realizing SMC protocols in real world problems (note - we assume that the players are rational, i.e., they deviate if and only if they get a better pay off). Proposition 1. Let π be a protocol which is secure in the malicious model under the assumption that players are rational and implement a game Γ . Then π will be secure w.r.t modified inputs if the function f that is securely computed by π is incentive compatible. This proposition gives a framework for us to apply cryptography to game theory. In cryptography when we compute a function with distributed inputs we assume that the protocol is secure either in the malicious or the semi-honest model. Both these models do not provide any guarantee about the player lying about their input to the protocol. Game theory protocols assume that players are rational; cryptographic protocols do not incorporate rationality but only semi-honest and malicious players. When we replace the mediator with a SMC protocol, we are only guaranteeing that nothing other than the final analysis result is revealed; it is impossible to verify whether participating parties are truthful about their private input data. Incentive compatibility is the mechanism that will ensure that players report their true values. 3.2

NCC Model

We now introduce a framework - Non cooperative computing (NCC), developed by Shoham et al. [35], in which there is an agent for n players to compute a n-ary function f in which each player holds one of the inputs to f. The only thing standing between the agents and successful computation are their conflicting self interests. A function f is NCC if the players can be incentivized to provide the correct input to the function f. In the NCC model, there is a trusted third party (TTP) to which each players sends its input xi , where i ∈ n, to compute the value of the function f (x1 , x2 , ..., xn ). We represent the joint input of the parties as x = (x1 , x2 , ..., xn ) and x−i = (x1 , x2 , ...xi−1 , xi+1 , xn ). Let Bi be the domain

When Is a Semi-honest Secure Multiparty Computation Valuable?

55

of the inputs from which a player i, can choose his input and the output be in the range R. For simplicity we assume that Bi = B. Definition 6. Let π be a protocol that computes the value of the function f : B n → R. We formally define the protocol π as follows 1. Each player i has a private input xi ∈ Bi , that it provides to the TTP. It is not necessary that xi is the correct value. 2. The TTP computes the value f (x) = f (x1 , x2 , ..., xn ) and announces it to the players of the value y. 3. Each party computes g(f (x), xi ) based on f(x) received from the TTP and xi , his correct input. We now state the conditions under which a condition f is deterministically NCC Definition 7. [35] Let N and f be defined as above. Then f is deterministically NCC if for every player i, ∀xi and for every strategy (ai , gi ), the following holds1. Either ∃x−i ∈ B−i such that gi (f (ai (xi ), x−i ), xi ) = f (xi , x−i ) 2. Or ∀x−i ∈ B−i , f (ai (xi ), x−i ) = f (xi , x−i ) These definitions are defined for Boolean functions, but can easily be extended to more general domains. 3.3

Secure Multiparty Computation Model

In game theoretic settings most players are assumed to be rational, i.e., they seek to maximize their utility. However, assuming rationality does not guarantee that a player will be truthful about its input. Most of the secure multi party computation work when assuming rational players has been concentrated on rational secret sharing where they assume that the agents are rational in a semihonest or malicious model. In the semi-honest model the players do not deviate from the protocol but are curious to know the value of the private input whereas in the malicious model the players can arbitrarily deviate from the protocol. The problem of rational secret sharing was first studied by Halpern and Teague [16] and then later revisited by Gordon and Katz [14]. If we hope to combine game theory with secure multi party then one of the steps in bridging the boundary is to ensure the truthfulness of the inputs. Most protocols have assumed that there exists a verification mechanism in the malicious model to ensure the correctness of the private input or there are extrinsic incentives to encourage the players to be truthful about the data. The malicious model can prevent the parties from modifying their input once the protocol has begun execution, but not before execution. We now attempt to bridge the gap between rational and malicious players. We answer the question whether rationality can ensure truthfulness of the private party in the absence of a verification mechanism. Before we do that, we introduce

56

R. Bhargava and C. Clifton

some concepts. Let us assume that there is a protocol π that securely computes the value of a function f (x) = f (x1 , x2 , ..., xn ), where n is the number of players and xi is the private input of the player i ∈ n. Note, our malicious model assumes that a player can modify the input arbitrarily but it ensures privacy. For the malicious model, we assume that a player Pi has higher utility if he can change the output in his favor by lying about his private input xi . Let us assume that the strategy set ai of a player Pi is limited to (at , af ) where at is the strategy of revealing the true input to the function f whereas af is the strategy of revealing the false input to the function f. Without loss of generality let us assume that μ+ denotes positive utility, μ− denotes negative utility and μ0 denotes no change in utility. Let us assume, without loss of generality, xi is the player’s Pi incorrect and private input, and xi is the correct and private input. Let us capture the notion of the utility of a player Pi being dependent on his ability to lie about his input in the following definition Definition 8. Let there be a protocol π as defined in Definition 6 that securely computes the value of a function f. Let ai be the set of the strategy (at , af ). Then for every x = (x1 , x2 , ..., xn ), the utility for modification in the input is defined as follows ⎧ + μ : ∃xi , ∀x−i , h(f (xi , x−i )) = f (xi , x−i ) ⎪ ⎪ ⎨ ∧f (xi , x−i ) = f (xi , x−i ) ∧ ai = af (12) μi (a1 , a2 , ..., an ) → 0 μ : ai = at ⎪ ⎪ ⎩ − μ : ∀xi , ∀x−i f (xi , x−i ) = f (xi , x−i ) ∧ ai = af Here, h denotes the function that a player Pi computes on the output received to arrive at the correct value of the output. This definition captures the fact that a player is rational and will lie about its input if this makes the TTP calculate an incorrect value whereas that player can calculate the correct value. 3.4

NCC and Malicious Model

We now bridge the gap between malicious and NCC model by the following theorem. Theorem 4. The protocol π which securely computes f is secure in the malicious model w.r.t modified inputs iff the function f is NCC. Proof. Let us assume that the protocol is secure in the malicious model w.r.t modified inputs, i.e., a player’s maximum utility is achieved by truthfully revealing input to the TTP. The maximum utility that a player can achieve is μ0 ; the case μ+ will not be possible. Hence, the following condition will hold ∀xi , ∃x−i , g(f (xi , x−i )) = f (xi , x−i ) ∨ f (xi , x−i ) = f (xi , x−i )). This satisfies the first condition of Definition 7, by ensuring that the player Pi will always get an incorrect result if it plays af . Hence, f is NCC. Let us assume that the function f is NCC and condition one holds. This ensures that the maximum utility a malicious player can achieve is μ0 . Hence, in

When Is a Semi-honest Secure Multiparty Computation Valuable?

57

this case a malicious player’s best strategy is to play at . If the second condition of Definition 7 is true then a malicious player’s utility in this case will be μ− . Therefore, it can increase utility by playing strategy at . Hence, if a function f is NCC, then it is secure with respect to modification of inputs in the malicious model as the player’s best interest is to always play the truth forward strategy.

4

Applying NCC Model to Rational Multiparty Model - Multivariate Statistics for Horizontally Partitioned Data

In this section, we consider the function f that calculates the mean of a dataset and prove that if it is dNCC [24] then it is secure in the malicious model w.r.t modification of inputs. Let us assume that V1 , V2 , ...Vn be n i.i.d 1 x p row vectors. The function fmean (V ) is given by E(Vi ) =

i=n 1  Vk N i=1

(13)

We assume that the data is horizontally partitioned, i.e., each player Pi holds a set Si where Si ∩ Sj = φ for any i and j and ∪i=n i=1 Si = V1 , V2 , ..., Vn . Theorem 5. [24] If N is private then f is dNCC. Proof. For proof, pleaser refer [24] When the set Si (and N = ∪i=n i=1 |Si |) is private, the function to be calculated becomes i=n i=1 Vi (14) E(Vi ) = i=n i=1 |Si | We now prove that the protocol is secure in the malicious model under the assumption that there is no violation of privacy during message exchange and a malicious player cannot gain any other information from the output as what an honest player can gain. For completeness, we give a protocol here that is secure in the malicious model. We use Homomorphic Encryption and the Pallier cryptosystem [34] to build the protocol (similar to electronic voting). In general, a cryptosystem supports the encryption Enck (·) and decryption Deck (·) operations such that Deck (Enck (x)) = x. A homomorphic encryption system has an additional property: Enck (x) · Enck (y) = Enck (x  y)

(15)

where  is a binary operator, such as addition or multiplication. By the definition of multiplication, we can observe that the following property also holds: Enck (x)c = Enck (x · c)

(16)

For simplicity, we assume that there is a trusted third party. The protocol proceeds as follows

58

R. Bhargava and C. Clifton

1. Each Player Pi computes Encpu (Vi ) and Encpu (|Si |) and sends it to the third party, where Encpu is encrypting with public key. 2. The third party adds all the values together: C1 = (Encpu (V1 ) + ... + Encpu (Vi )), C2 = (Encpu (|S1 |) + ... + Encpu |Si |) and then decrypts C1 , C2 to calculate the mean and announces it. To prove the theorem, we define a player’s utility as follows: Definition 9. Let μ be the utility defined based on the correctness of the output and let (at , af ) be the strategy set of a player ai , following the same convention as above. A player’s Pi utility is defined as follows ⎧ + μ : ∃xi , ∀V−i , h(f (Vi , V−i )) = f (Vi , V−i ) ⎪ ⎪ ⎪ ⎪ ∧f (Vi , V−i ) = f (Vi , V−i ) ∧ ai = af ⎨ 0 μi (a1 , a2 , ..., an ) → μ : ai = at ⎪ ⎪ μ− : ∀Vi , ∀V−i f (Vi , V−i ) = f (Vi , V−i ) ∧ ai = af ⎪ ⎪ ⎩ −− μ : ∀Vi , ∀V−i f (Vi , V−i ) = f (Vi , V−i ) ∧ ai = af

(17)



Here, Vi denotes the correct input and Vi denotes the incorrect input. This is similar to the Definition 8 with one modification - a player receives negative utility by lying about its input and cannot compute the correct value. The utilities are assumed to follow this order (μ+ > μ0 > μ− > μ+ ). To prove, the above theorem, we use the help of the following lemma Lemma 5. The strategy at is weakly dominated by the strategy af . Proof. We use the utility function defined in Definition 8. The only information that is publicly available is the mean. If a party lies about its input, i.e., adopts the strategy af and reports Vi = Vi , then the TTP will incorrectly calculate the mean. Even though the TTP will incorrectly calculate the mean, player the i=n cannot calculate the correct value of the mean as the value of the i=1 |Si | is unknown. Therefore, his utility will be μ− . Telling the truth gives utility μ+ . Theorem 6. Given fmean and private N, the protocol πmean as defined in 14 is secure in the malicious model w.r.t to input modification. Proof. This immediately follows from Lemma 5. To prove that this protocol is secure w.r.t to the malicious model, note that the only communication is encrypted and it involves one round of communication. Let us assume that m < n are corrupted parties. The simulator S receives from the adversary A the summation of the individual values  C1 = (Encpu (V1 ) + ... + Encpu (Vm )) and C2 (Encpu (|S1 |) + ... + Encpu |Sm |). These values fully determine the input to the function and so S just sends this value to the TTP. This is also true when m = n −1. This is because all the TTP i=n i=n reveals is the average/mean = x/y, where x = i=1 Vk and y = i=1 |Si |. It is impossible to calculate Vn or |Sn | from the mean as there will be more unknown variables than the number of equations.

When Is a Semi-honest Secure Multiparty Computation Valuable?

5

59

Discussions

In this section we discuss our results. We mainly focus on two main results: – If a protocol is incentive compatible then a protocol is secure in the malicious model w.r.t input modification: We have shown a counter example for this in Sect. 2 and have given a Proposition 1. This proposition gives us a way to apply secure multi party computation to game theory. Game theory assumes that the players are rational, however SMC only guarantees that a player will not deviate once the protocol has begun. If a mechanism is incentive compatible then we only have to build a protocol which is secure in the malicious model to guarantee the privacy of a player’s input while ensuring that he will be truthful. There are several ways by which we can enforce incentive compatibility - either the dominant strategy for a player is the truthful strategy (i.e., truthful strategy); or we can ensure that there exists a verification mechanism to enforce the truthfulness. The latter case requires either some external factors inherent in the problem that enable input to be verified [10], or a protocol that is not secure in the malicious model to enable input verification that “leaks” input, although this leakage can be minimized using techniques such as a zero-knowledge proof [13]. However, if we impose a few restrictions on the what the utility of the player, e.g., players value the correctness of the function; then we are forced into the space of the NCC functions. – If a player values correctness and exclusivity of the function f , then a protocol is incentive compatible if it is in the NCC model: A function in the NCC model guarantees that if a player is rational and seeking to maximize his utility, then he will prefer to be honest about his input. This gives us a method for building secure protocols. If we can prove that the function is NCC then this eliminates the need for a verification mechanism. This result is important because it limits the space of the functions that we can implement using SMC if we want to prevent lying about inputs.

6

Related Work

In this section we briefly review the work on secure multi party computation. 6.1

SMC and Game Theory

SMC is a powerful tool of cryptography which allows multiple parties to jointly compute the value of a function without revealing the individual private inputs [39]. There have been many works on bridging SMC and game theory [12,16,26, 27,33,38]. The impetus for this work has been hugely driven by Katz’s [26] seminal paper. They have developed protocols to replace the mediator in normal and extensive games. They also proved that Nash equilibrium which is the fundamental equilibrium cannot provide protection against the coalitions formed.

60

R. Bhargava and C. Clifton

However, their model allows players to engage in “cheap talk” phase. They have assumed that the players are rational, i.e., they seek to play an action which maximizes their utility. When SMC and game theory are bridged together there are two approaches taken – Applying SMC to Game Theory - In this approach, we attempt to replace a mediator with a cryptographic protocol. – Applying Game Theory to SMC - Cryptographic protocols assume either honest players or malicious players. Game-theoretic protocol assume players that are self interested. How can we design protocols for such as setting? We have extensively analyzed the first approach in this paper to understand how we can construct a secure protocol. In the second approach the protocol is itself a game where the parties utilities are functions of the inputs and output. This has led to the development of rational multi-party computation. The rational secret sharing problem was first developed by Halpern and Teague [16]. They showed if agents prefer to get the function value and secondly, prefer that as few people as possible get it then there is no mechanism with a fixed running time that can be applied for secret sharing or secure multi-party computation. Rational multi party computation was further developed by Izmalkov et al. in [19] and [18]. Another approach was take by Wallrabenstein and Clifton [38], where they have developed a model using game theory which allows the tˆ atonnement process to be evaluated privately, revealing only the final equilibrium price. They assume that the parties are rational and have developed equilibrium concepts which are stronger than the Bayes-Nash equilibrium. The work of Harplen and Teague [16] was extended by Abraham et al. [1], where they show that a k-resilient Nash equilibrium exist (where k is the size of the coalition) for secret sharing and multiparty computation, provided that players prefer to get the information than not to get it. Their results holds true even for 2 players, so they can do multiparty computation with only two rational agents. This was an improvement as Harpleen and Teague [16], showed that Nash protocols did not suffice. Similar work has been performed in [14,27,32]. We now give a brief overview of the related work where SMC has been applied to game theoretic protocolsIn recent years, various cryptographic approaches have been applied to achieve game theoretic equilibrium for these problems [12,38]. Dodis et al. [12] applied SMC protocol to achieve a correlated equilibrium with rational players, thereby replacing the trusted mediator. In a correlated equilibrium, the strategy recommended by the mediator gives the maximum payoff. Wallrabenstein et al. [38] arrived at the Walrasian equilibria without invoking a mediator or allowing trade to occur prior to arriving at a stable price using SMC. They further proved that the protocol is incentive compatible: there is no motive for the player to deviate from their strategy of telling the truth. SMC has been deployed for several other real world uses. Bogetoft et al. [5] developed a system for Danish farmers to trade contract for sugar-beet They used SMC to ensure the privacy of contract and to determine the market clearing


price in a double auction: the price at which demand equals supply. One disadvantage of their protocol is that they have not considered malicious parties, specifically parties that can lie about their input to the protocol. They also show that if a mechanism is randomized and f is non-cooperatively computable, then both are possible.

6.2 Incentive Compatible Protocols

We now give an overview of incentive compatible protocols that have been developed. Kantarcioglu and Jiang [24] have analyzed whether the functions used for various privacy-preserving distributed data analysis tasks, such as naive Bayes and association rule mining, are incentive compatible. Kantarcioglu and Nix [25] have used the Vickrey-Clarke-Groves (VCG) mechanism to encourage truthful data sharing for distributed data mining; their protocol does not rely on the ability to verify the data of the parties. Atallah et al. [3] have developed Secure Supply-Chain Collaboration (SSCC) protocols that enable supply-chain partners to cooperate on jointly computed decisions without revealing the private information of any of the parties, even though such decisions require the information of all the parties. They have also proved that their mechanism is incentive compatible. Brandt [6] has presented a privacy-preserving, incentive compatible Vickrey auction protocol that determines the second-highest bid without revealing any other information; incentive compatibility is assured by detecting dishonest bidders.

7 Conclusion

Distributed protocols are particularly important in applications which require distributed companies to collaborate. E.g., credit card companies want to build more accurate fraud detection models by combining their customers' data; protein sequence data is increasingly being collected by law enforcement agencies [20]; and coordination is needed in decentralized supply chains [29]. These tasks require collaborating (and possibly competing) geographically distributed parties to share their private data for building better data analysis models or coordination mechanisms. SMC provides a valuable tool for these applications. SMC assumes security in two models: semi-honest and malicious. While the semi-honest model has weaker guarantees compared to the malicious model, we have shown the need for semi-honest models by stating the limitations of the malicious model. We have shown that only a small class of functions (NCC functions) is secure in the malicious model with respect to input modification; otherwise we require semi-honest parties. We have used incentive compatibility as a mechanism to induce truth-telling (semi-honest parties) in the malicious model, which restricts us to functions in the NCC class. We have also proved that, under the guarantee of correctness, a protocol in the NCC class that is secure in the malicious model will also guarantee truthful inputs by the parties.


Outside the NCC class, malicious protocols may provide little benefit, as parties are assumed to be semi-honest with respect to their input even if the protocol is secure in the malicious model. While this seems a strong limitation, there are many situations where the semi-honest assumption is practically acceptable, but open sharing of data is not. One example is a data collection and analysis tool whereby employers submit (private) information for analysis by the Boston Women's Workforce Council [28]. This application uses secure multiparty computation to ensure employer anonymity, but employers are trusted to provide correct information. The tool has gained acceptance as a semi-honest protocol, providing protection against legal risks for processors of the data. This is only one example; the key point is that the guarantees provided should be well understood and communicated to ensure appropriate use and buy-in.
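To give a concrete sense of the kind of aggregate computation involved, here is a minimal sketch of additive secret sharing over a public prime modulus (illustrative only; this is not the protocol used by the tool described above): each contributor splits its private value into random shares, no single server ever sees an input, and only the sum is reconstructed. As in the semi-honest setting discussed above, inputs are simply trusted to be correct.

```python
import secrets

P = 2**61 - 1  # public prime modulus (illustrative choice)

def share(value, n_servers):
    """Split a private integer into n additive shares modulo P."""
    shares = [secrets.randbelow(P) for _ in range(n_servers - 1)]
    shares.append((value - sum(shares)) % P)
    return shares

def aggregate(all_shares):
    """Each server sums the shares it received; combining the totals reveals only the sum."""
    n_servers = len(all_shares[0])
    server_totals = [sum(s[j] for s in all_shares) % P for j in range(n_servers)]
    return sum(server_totals) % P

# Example: three employers report private totals to two non-colluding servers.
inputs = [120_000, 95_000, 143_000]          # hypothetical private values
all_shares = [share(v, 2) for v in inputs]
assert aggregate(all_shares) == sum(inputs) % P
```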

References

1. Abraham, I., Dolev, D., Gonen, R., Halpern, J.: Distributed computing meets game theory: robust mechanisms for rational secret sharing and multiparty computation. In: Proceedings of the Twenty-Fifth Annual ACM Symposium on Principles of Distributed Computing, pp. 53–62. ACM (2006)
2. Arya, A., Löffler, C., Mittendorf, B., Pfeiffer, T.: The middleman as a panacea for supply chain coordination problems. Eur. J. Oper. Res. 240(2), 393–400 (2015)
3. Atallah, M.J., Elmongui, H.G., Deshpande, V., Schwarz, L.B.: Secure supply-chain protocols. In: IEEE International Conference on E-Commerce, CEC 2003, pp. 293–302. IEEE (2003)
4. Bartal, Y., Gonen, R., Nisan, N.: Incentive compatible multi unit combinatorial auctions. In: Proceedings of the 9th Conference on Theoretical Aspects of Rationality and Knowledge, pp. 72–87. ACM (2003)
5. Bogetoft, P., et al.: Secure multiparty computation goes live. In: Dingledine, R., Golle, P. (eds.) FC 2009. LNCS, vol. 5628, pp. 325–343. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-03549-4_20
6. Brandt, F.: Secure and private auctions without auctioneers. Technical Report FKI-245-02. Institut für Informatik, Technische Universität München (2002)
7. Cachon, G.P., Netessine, S.: Game theory in supply chain analysis. In: Simchi-Levi, D., Wu, S.D., Shen, Z.J. (eds.) Handbook of Quantitative Supply Chain Analysis. ISOR, vol. 74, pp. 13–65. Springer, Boston (2004). https://doi.org/10.1007/978-1-4020-7953-5_2
8. Canetti, R.: Universally composable security: a new paradigm for cryptographic protocols. In: Proceedings 42nd IEEE Symposium on Foundations of Computer Science, pp. 136–145. IEEE (2001)
9. Canetti, R., Lindell, Y., Ostrovsky, R., Sahai, A.: Universally composable two-party and multi-party secure computation. In: Conference Proceedings of the Annual ACM Symposium on Theory of Computing (2003). https://doi.org/10.1145/509907.509980
10. Cho, R., Clifton, C., Iyer, A.V., Jiang, W., Kantarcioglu, M.: An approach to identifying beneficial collaboration securely in decentralized logistics systems (2003)
11. Dasgupta, P., Hammond, P., Maskin, E.: The implementation of social choice rules: some general results on incentive compatibility. Rev. Econ. Stud. 46(2), 185–216 (1979)


12. Dodis, Y., Halevi, S., Rabin, T.: A cryptographic solution to a game theoretic problem. In: Bellare, M. (ed.) CRYPTO 2000. LNCS, vol. 1880, pp. 112–130. Springer, Heidelberg (2000). https://doi.org/10.1007/3-540-44598-6_7
13. Feige, U., Fiat, A., Shamir, A.: Zero-knowledge proofs of identity. J. Cryptol. 1(2), 77–94 (1988)
14. Gordon, S.D., Katz, J.: Rational secret sharing, revisited. In: De Prisco, R., Yung, M. (eds.) SCN 2006. LNCS, vol. 4116, pp. 229–241. Springer, Heidelberg (2006). https://doi.org/10.1007/11832072_16
15. Groce, A., Katz, J.: Fair computation with rational players. In: Pointcheval, D., Johansson, T. (eds.) EUROCRYPT 2012. LNCS, vol. 7237, pp. 81–98. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-29011-4_7
16. Halpern, J., Teague, V.: Rational secret sharing and multiparty computation. In: Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pp. 623–632. ACM (2004)
17. Hennet, J.C., Arda, Y.: Supply chain coordination: a game-theory approach. Eng. Appl. Artif. Intell. 21(3), 399–405 (2008)
18. Izmalkov, S., Lepinski, M., Micali, S.: Verifiably secure devices. In: Canetti, R. (ed.) TCC 2008. LNCS, vol. 4948, pp. 273–301. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78524-8_16
19. Izmalkov, S., Micali, S., Lepinski, M.: Rational secure computation and ideal mechanism design. In: 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2005), pp. 585–594. IEEE (2005)
20. Jha, S., Kruger, L., Shmatikov, V.: Towards practical privacy for genomic computation. In: 2008 IEEE Symposium on Security and Privacy (SP 2008), pp. 216–230. IEEE (2008)
21. Jurca, R., Faltings, B.: An incentive compatible reputation mechanism. In: IEEE International Conference on E-Commerce, CEC 2003, pp. 285–292. IEEE (2003)
22. Kalai, E., Postlewaite, A., Roberts, J., et al.: A group incentive compatible mechanism yielding core allocations. J. Econ. Theory 20(1), 13–22 (1979)
23. Kanda, A., Deshmukh, S., et al.: Supply chain coordination: perspectives, empirical studies and research directions. Int. J. Prod. Econ. 115(2), 316–335 (2008)
24. Kantarcioglu, M., Jiang, W.: Incentive compatible privacy-preserving data analysis. IEEE Trans. Knowl. Data Eng. 25(6), 1323–1335 (2013)
25. Kantarcioglu, M., Nix, R.: Incentive compatible distributed data mining. In: 2010 IEEE Second International Conference on Social Computing, pp. 735–742. IEEE (2010)
26. Katz, J.: Bridging game theory and cryptography: recent results and future directions. In: Canetti, R. (ed.) TCC 2008. LNCS, vol. 4948, pp. 251–272. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78524-8_15
27. Kol, G., Naor, M.: Cryptography and game theory: designing protocols for exchanging information. In: Canetti, R. (ed.) TCC 2008. LNCS, vol. 4948, pp. 320–339. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78524-8_18
28. Lapets, A., Volgushev, N., Bestavros, A., Jansen, F., Varia, M.: Secure MPC for analytics as a web application. In: 2016 IEEE Cybersecurity Development (SecDev), pp. 73–74. IEEE (2016)
29. Li, L., Zhang, H.: Confidentiality and information sharing in supply chain coordination. Manag. Sci. 54(8), 1467–1481 (2008)
30. Lindell, Y.: Secure multiparty computation for privacy preserving data mining. In: Encyclopedia of Data Warehousing and Mining, pp. 1005–1009. IGI Global (2005)


31. Lindell, Y.: How to simulate it – a tutorial on the simulation proof technique. In: Lindell, Y. (ed.) Tutorials on the Foundations of Cryptography. ISC, pp. 277–346. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-57048-8_6
32. Lysyanskaya, A., Triandopoulos, N.: Rationality and adversarial behavior in multiparty computation. In: Dwork, C. (ed.) CRYPTO 2006. LNCS, vol. 4117, pp. 180–197. Springer, Heidelberg (2006). https://doi.org/10.1007/11818175_11
33. Milosavljevic, N., Prakash, A.: Game Theory and Cryptography. University of California, Berkeley (2009)
34. Paillier, P.: Public-key cryptosystems based on composite degree residuosity classes. In: Stern, J. (ed.) EUROCRYPT 1999. LNCS, vol. 1592, pp. 223–238. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48910-X_16
35. Shoham, Y., Tennenholtz, M.: Non-cooperative computation: boolean functions with correctness and exclusivity. Theor. Comput. Sci. 343(1–2), 97–113 (2005)
36. Spengler, J.J.: Vertical integration and antitrust policy. J. Polit. Econ. 58, 347–352 (1950)
37. Tsay, A.A.: The quantity flexibility contract and supplier-customer incentives. Manag. Sci. 45(10), 1339–1358 (1999)
38. Wallrabenstein, J.R., Clifton, C.: Privacy preserving tâtonnement. In: Christin, N., Safavi-Naini, R. (eds.) FC 2014. LNCS, vol. 8437, pp. 399–416. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-45472-5_26
39. Yao, A.C.: Protocols for secure computations. In: 23rd Annual Symposium on Foundations of Computer Science, SFCS 1982, pp. 160–164. IEEE (1982)

You only Lie Twice: A Multi-round Cyber Deception Game of Questionable Veracity

Mark Bilinski1(B), Kimberly Ferguson-Walter2, Sunny Fugate1, Ryan Gabrys1, Justin Mauger1, and Brian Souza1

1 Naval Information Warfare Center Pacific, San Diego, CA, USA
{bilinski,gabrys,jmauger}@spawar.navy.mil
2 Laboratory for Advanced Cybersecurity Research, San Diego, CA, USA

Abstract. Cyber deception focuses on providing advantage to defenders through manipulation of the information provided to attackers. Game theory is one of the methods that has been used to model cyber deception. In this work, we first introduce a simple game theoretic model of deception that captures the essence of interactions between an attacker and defender in a cyber environment. Second, we derive closed form expressions for a version of this model that capture the optimal strategies for both the attacker and defender. Third, we identify the potential behavior of an attacker at noncritical points via a Markov Decision Process (MDP) simulation.

1 Introduction

Cyber deception is a growing area of research focused on providing advantage to cyber defenses through manipulation of the information provided to cyber attackers [13] or covertly gathering information about attacker tactics, techniques and procedures [6]. Traditional network defenses focus on monitoring for and blocking suspicious activity. This remains a difficult task due to the complexity of detecting zero-day attacks and the difficulty in differentiating between anomalous and malicious activity. Cyber deception techniques provide a novel strategy that focuses on interacting with an attacker who has already breached network perimeter defenses [2,4]. Research is less mature on understanding the theory of when and how to best use cyber deception [9] as well as on how to use artificial intelligence (AI) to create adaptive cyber deception techniques [10]. Game theory is one of the methods that can be used to model cyber deception and provide a basis for modeling the changes that deception drives for both the attacker and defender [1,19]. This can lead to practical solutions needed to advance work in the application of deception techniques and can inform the feedback needed for AI systems to automatically adapt. In this paper we focus on modeling a situation where a defender has the ability to make a real machine look fake [16] (or look real) and make a decoy machine look real (or look fake). We examine how these decisions affect an attacker’s choice on which machine


to attack (and when). Furthermore, we enhance the model to consider the use of deception as a deterrence by allowing the attacker to simply quit the game. Including this extra action provides the promise of investigating the sunk cost fallacy–a cognitive bias theorized to be effective in mitigating cyber attacks [12].

2 Related Work

The topic of deception with applications to cybersecurity has been studied in the past. One such area looks at the problem of deploying honeypots in a distributed environment. Under this setup, there are usually two players: (1) a defender and (2) an attacker. The defender, who is monitoring some collection of network resources, is able to deploy a certain number of honeypots, usually at some predetermined cost. The goal of the attacker then is to identify the locations of these honeypots, and eventually to use this knowledge to expose system vulnerabilities. Typically these problems are modeled as two-person, zero-sum games, and such games have been studied extensively in the past [3,7,11]. To account for settings where one player does not fully understand the nature of the game, works have investigated the adoption of hypergame theory [15,18]. Similar to previous work on moving target defense strategies, the model studied in this work is largely concerned with preventing an attacker from gaining knowledge of a system and subsequently exploiting it. In a moving target defense environment, the defender is able to periodically modify the configuration of some resource they are trying to defend. In [8], the authors consider the case where the attacker is able to observe previous configurations chosen by the defender. They then derive the optimal configuration switching strategy for the defender using a Markov Decision Process (MDP). In [21], the authors instead use an MDP to model the system from the attacker's point of view. In [14], the authors developed a formal probabilistic logic of deception. The adversary's knowledge of the system is represented by a set of formulas in the logic. Unfortunately, various calculations necessary for optimal strategies prove to be NP-hard. In [17], the authors introduce the notion of a Cyber Deception Game, which is a zero-sum Stackelberg game played between an attacker and a defender. Under their setup, the defender moves first using a pure strategy. The attacker then moves by choosing a system to attack based on the defender's responses. Similarly, the problem of finding optimal strategies is shown to be NP-hard. In [20], the authors study a two-stage game where in the first stage the attacker probes the entire network and in the second stage the attacker probes specific hosts before any attack is launched. Under this setup, the defender's strategy is determined before the attacker begins probing, and minimizes their expected cost. Optimal strategies are proposed using mixed integer programming for very specific cases. Unlike previous work on moving target defense, our work does not allow the defender to change the underlying system configuration, but rather focuses on responses to the attacker's probes of the system. Our focus was to construct as simple a model as possible to capture the essence of deception. Similar to [17,20], we introduce a game with two repeated stages, where in the first stage


the attacker probes the machines. Due to the simplicity of our model and unlike [14,17], we are able to derive closed form expressions that represent optimal strategies for both the attacker and defender. Furthermore, our model allows for both the attacker and defender to adapt and evolve their strategies over time. In addition to the mathematical analysis, and similar to [8,21], we also adopt an MDP in simulation to observe the behavior of a player at noncritical points.

3 A Simple Masking Game

A simple form of deception game is one in which a defender masks the true nature of a computing system. This masking can be performed either to obscure true system behavior or to proactively lie to indicate false behavior of a system or software application. Honeypots and decoys can be thought of as sophisticated versions of this type of masking, where the signal being masked includes a cornucopia of information regarding system behavior. Unfortunately, the dynamic and complex nature of most computing systems makes these forms of defensive deception very difficult to analyze. Inspired by recent work in game theory [11], we present what we consider to be the simplest form of masking game. The intent of this model is to be realistic to implement, but general purpose enough to reflect a broad array of masking techniques and player behaviors. Prior work by others has focused on masking system configuration so that a system indicates incorrect software versions or an incorrect software or operating system status. For example, a Windows 10 host may be configured to present features which indicate that it is a Windows 8 host. If an attacker learns this feature and subsequently attempts to use this information to attack the host, their attack will likely fail. Depending on the configuration of network sensors, the attempted attack may also be more easily detectable – having targeted a specific version of an older, more vulnerable operating system generally means that their actions are easier to detect as malicious. Indeed, the defender can intentionally create this scenario by presenting a vulnerability for which attack detection is easy to perform. We distill this type of masking game to an even simpler form: one in which we have two types of machines (one real and one fake). Each machine can either lie or tell the truth about its type. This is meant to represent a signal received by an attacker which is either intentionally or unintentionally sent by a defender. It is the job of the attacker to attack the correct system. If they attack the wrong system then they achieve a negative payoff. If they attack the correct machine they achieve a positive payoff. For our research it is important that our model be simple enough to be analyzable, but realistic enough that abstract strategies can be conceived of for both players and these strategies can be simulated numerically. It is also important for our research goals that our game be feasible to implement and test in an actual network environment. To ground this example with a real-world context, the signal can be thought of as a presented configuration that makes the system either (a) look like a real host (consistent); or (b) look like a fake host (inconsistent). More concretely,


we can conceive of a method by which a real Microsoft Windows host presents either (a) only Windows service ports; or (b) a mixture of Windows and Linux service ports. When the host presents only Windows service ports it is signaling to the attacker that it is real. If the host presents a mixture of Windows and Linux service ports it is signaling to the attacker that it is fake. Whether the system is actually real or fake determines whether this signal is the truth or a lie. This is a scenario that we have run across in practice which can be implemented with little more than a filtering firewall and fake software to attach to selected service ports. In a system configured in this way, when a real system presents a mixture of service ports and service banners which are common to two types of systems, this makes the system appear inconsistent. Similarly, a fake system (actually a Linux host) could be configured to lie and present only Windows service ports. Alternatively, this fake system could tell the truth and present a mixture of Windows and Linux service ports. It may be noted by an astute reader that the quality of the simulated or fake service may be relevant to a real-world deployment of such a strategy; however, it is not important for the purposes of our current studies. In future efforts we intend to represent lossy signals using stochastic variation of the attacker’s model parameters or by directly adjusting attacker beliefs regarding the veracity of each received signal. In general, we treat the defender for the game as omniscient. This is sensible as they control the veracity of signals sent by either real or fake hosts. In terms of attacker knowledge, we assume that they are provided or have accurate estimates of the costs of each defender action, but do not know the value of the real system (the value of the decoy being zero). To simplify analysis our model currently treats the real system’s value, V, as being equivalent for both attacker and defender. The parameters are summarized in Table 1.

Table 1. Summary of game parameters

M_R   Real machine                                c_R   Defender cost of lying for real machine
M_F   Fake machine                                c_F   Defender cost of lying for fake machine
α_R   Probability of attacking the real machine   τ_R   Probability that real machine is truthful
α_F   Probability of attacking the fake machine   τ_F   Probability that fake machine is truthful
V     Value of a real machine                     N     Number of rounds
β^N   Exponential decay term for V

The defender can lie at a cost c_R or c_F for the real and fake machine respectively (the real and fake machine respond truthfully with probabilities τ_R and τ_F), but has no cost for telling the truth and indicating the true type of the machine (real or decoy). The attacker knows the costs of lying, but does not know the defender's probability of lying, nor does the attacker know the true value of the machine (the value of fake machines being zero). We assume that the attacker is able to learn the parameters of the game from verified priors. In a real-world scenario, verified knowledge of defender costs may


occur if the attacker had insider knowledge of the network and its management. Alternatively, we can enforce a situation in which the attacker has access to the true game parameters by allowing the defender to publish their costs defending (lying) as either a real or decoy system. We can examine the model presented in Sect. 4.1 from varying perspectives. In particular, the attacker’s knowledge of model parameters can be adjusted to suit various use cases. For example, an attacker may be naive to the presence of deception and may be entirely unaware of the existence of fake systems or alternatively of the ability of real or fake systems to lie about their types. Modeling such an attacker allows us to reason about the theoretically optimal behavior and choices for various defender types. Our model also allows us to reason about attacker perception and signal quality, where an attacker on some rounds is unable to perceive a signal or where the signal value is inverted (or is perceived as inverted by the attacker). Lastly, we can model an attacker who has an invalid set of priors for their estimates of defender costs or even their own attack probabilities.

4 Analysis

In this section, we provide a mathematical analysis for a simple instantiation of our two person game model. In the following, we first analyze the setup where the behaviors of the attacker and defender do not change over time. We will then carry over this analysis to the more involved setup where the attacker and defender adapt their strategies as the game proceeds.

4.1 Non-adaptive Game Model

The game is played for N rounds. At each round, the attacker asks either machine about their identity. In particular, the attacker can ask one machine at each round whether that machine is real or fake. The defender has the machine respond either truthfully or it has the machine lie. However, when the real machine M_R lies the defender pays a cost of c_R and similarly if the fake machine M_F lies, then the defender pays a cost of c_F. Telling the truth has no cost in either case. After N rounds, the attacker chooses to attack either machine M_R or machine M_F. Both the attacker and the defender behave probabilistically. The attacker chooses to probe host M_R with probability α_R and it probes host M_F with probability 1 − α_R. Note that the parameter α_R could be based on prior beliefs about the identity of each of the machines. Let τ_R denote the probability that the real machine responds truthfully and similarly let τ_F denote the probability that the fake machine responds truthfully. The attacker knows the defender costs c_R, c_F, and assumes the defender will act in a manner which will minimize its costs. Let C_1 represent what the attacker expects the cost to the defender to be for lying, under the assumption that M_R is the real machine:

C_1/N = c_R · α_R (1 − τ_R) + c_F · (1 − α_R)(1 − τ_F).    (1)


Note that C_1 is the expected cost to the defender for lying with probabilities 1 − τ_R, 1 − τ_F. If the identities of the machines were switched (i.e., if the real machine was fake and vice versa), then the attacker would believe the defender is paying a cost of c_F every time the real machine tells the truth and the attacker would believe the defender is paying a cost of c_R each time the fake machine tells the truth. This quantity is represented as C_2 below:

C_2/N = c_F · α_R τ_R + c_R · (1 − α_R) τ_F.    (2)

Under these assumptions, the attacker believes machine M_R is the real machine and will choose to attack M_R after N rounds if C_1 < C_2; otherwise it believes M_F is the real host and will choose to attack M_F after N rounds. Suppose p_{M_R} denotes the probability that the attacker attacks machine M_R. Since our model assumes that the attacker and defender behave probabilistically, p_{M_R} (which is derived in Lemma 1) is therefore equal to the probability that C_2 > C_1. Let β ∈ [0, 1], and suppose V is some positive integer. We let μ_D(N) be the total expected payoff to the defender, where:

−μ_D(N)/N = c_R · α_R (1 − τ_R) + c_F · (1 − α_R)(1 − τ_F) + β^N · V · p_{M_R}.    (3)

Under this setup, V can be interpreted as the value of attacking the real machine. Notice that the attacker does not know the value of V, but the defender does; the attacker only knows that there is a real and a fake machine and that the defender pays the penalties of c_R/c_F for lying on the real/fake machine, respectively. Since the term β^N · V · p_{M_R} decays with respect to β, the defender will prefer longer games (large N) to shorter ones. Our goal is to determine the parameters τ_R, τ_F that minimize the normalized expected cost −μ_D(N)/N. We derive the term p_{M_R} in the next claim using some elementary techniques.

Lemma 1. p_{M_R} = α_R τ_R + (1 − α_R) τ_F.

Proof. Since each round is played independently and the attacker and defender behave probabilistically according to the parameters α_R, τ_R, τ_F, the probability that C_2 > C_1 is equivalent to the probability that the following expression holds:

c_F ( X_1 X_2 − (1 − X_1)(1 − X_3) ) + c_R ( (1 − X_1) X_3 − X_1 (1 − X_2) ) > 0,    (4)

where X_1, X_2, X_3 are each Bernoulli random variables with the parameters α_R, τ_R, and τ_F respectively. Note that (4) is positive if and only if either (a) X_1 = 0, X_3 = 1 or (b) X_1 = X_2 = 1. Since (a) holds with probability (1 − α_R) τ_F and (b) holds with probability α_R τ_R, and these events are mutually exclusive, it follows that p_{M_R} = Pr[ C_2 − C_1 > 0 ] = (1 − α_R) τ_F + α_R τ_R, as desired.
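As a quick numerical illustration of Lemma 1 (a sketch with arbitrarily chosen parameter values), one can sample the Bernoulli variables X_1, X_2, X_3, evaluate the sign of expression (4), and compare the empirical frequency of the event C_2 > C_1 with the closed form α_R τ_R + (1 − α_R) τ_F:

```python
import random

def p_mr_closed_form(alpha_r, tau_r, tau_f):
    return alpha_r * tau_r + (1 - alpha_r) * tau_f

def p_mr_monte_carlo(alpha_r, tau_r, tau_f, c_r=2.0, c_f=1.0, trials=200_000):
    hits = 0
    for _ in range(trials):
        x1 = random.random() < alpha_r   # attacker probes the real machine
        x2 = random.random() < tau_r     # real machine tells the truth
        x3 = random.random() < tau_f     # fake machine tells the truth
        lhs = c_f * (x1 * x2 - (1 - x1) * (1 - x3)) + c_r * ((1 - x1) * x3 - x1 * (1 - x2))
        hits += lhs > 0                  # the event C2 > C1 from expression (4)
    return hits / trials

print(p_mr_closed_form(0.6, 0.8, 0.3))   # 0.6*0.8 + 0.4*0.3 = 0.60
print(p_mr_monte_carlo(0.6, 0.8, 0.3))   # approximately 0.60 up to sampling noise
```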

In light of the previous lemma, the goal is to determine the defender's optimal decision boundaries where the defender has the following payoff function:

−μ_D(N)/N = c_R · α_R (1 − τ_R) + c_F · (1 − α_R)(1 − τ_F) + β^N · V · ( α_R τ_R + (1 − α_R) τ_F ).    (5)


The next lemma follows immediately by taking the derivatives of (5) with respect to τ_R, τ_F.

Lemma 2. If V ≤ c_R/β^N, then τ_R = 1, and otherwise τ_R = 0. If V ≤ c_F/β^N, then τ_F = 1, and otherwise τ_F = 0.

Thus, the behavior of the defender depends entirely upon the relationship between the value of the machine V and the ratios c_R/β^N, c_F/β^N. If the cost is high relative to the value of the machine, the defender will choose to always tell the truth. Alternately, if the cost is low enough, they will always choose to lie. Note that under our model, we assumed the variable α_R was provided. In practice, however, this variable may be a function of the responses the attacker receives from the defender. From Lemmas 1 and 2 we see that since c_R > c_F, then τ_R ≥ τ_F. Therefore, from Lemma 1, since the attacker wants to maximize p_{M_R}, the attacker would want α_R to be as large as possible. In what follows, we analyze the setup where the behaviors of the attacker (and subsequently the defender) change over time using these observations.
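A small sketch of the threshold behavior in Lemma 2 (parameter values are arbitrary): the defender's best response is a pure threshold rule in V against c_R/β^N and c_F/β^N, and substituting it into (5) gives the resulting normalized cost.

```python
def defender_best_response(V, c_r, c_f, beta, N):
    """Lemma 2 thresholds: lie (tau = 0) only when the discounted value exceeds the cost ratio."""
    tau_r = 1.0 if V <= c_r / beta**N else 0.0
    tau_f = 1.0 if V <= c_f / beta**N else 0.0
    return tau_r, tau_f

def normalized_cost(alpha_r, tau_r, tau_f, V, c_r, c_f, beta, N):
    """Right-hand side of (5), i.e. -mu_D(N)/N."""
    p_mr = alpha_r * tau_r + (1 - alpha_r) * tau_f
    return (c_r * alpha_r * (1 - tau_r)
            + c_f * (1 - alpha_r) * (1 - tau_f)
            + beta**N * V * p_mr)

for V in (1, 5, 50):                       # sweep the machine value across the thresholds
    tau_r, tau_f = defender_best_response(V, c_r=2, c_f=1, beta=0.9, N=4)
    cost = normalized_cost(0.5, tau_r, tau_f, V, 2, 1, 0.9, 4)
    print(V, (tau_r, tau_f), round(cost, 3))
```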

4.2 Adaptive Game Model

With a slight abuse of notation, we now assume our game proceeds for MN rounds, where we sub-divide the MN rounds into M intervals each of length N. Let α_R^{(i)} denote the probability the attacker probes the real machine in interval i, where i ∈ {1, 2, . . . , M}. Similarly, let τ_R^{(i)} be the probability that the real machine tells the truth at interval i and let τ_F^{(i)} be the probability that the fake machine tells the truth at interval i. We model the attacker's behavior at interval i as follows:

p_{M_R}^{(i)} = α_R^{(i)} τ_R^{(i)} + (1 − α_R^{(i)}) τ_F^{(i)}.

Suppose γ ∈ [0, 1] is a real number. Then, we assume α_R^{(i)} is a linear combination of α_R^{(i−1)} along with p_{M_R}^{(i−1)}:

α_R^{(i)} = (1 − γ) · α_R^{(i−1)} + γ · p_{M_R}^{(i−1)}.    (6)

Since it is always more beneficial (for the attacker) to probe the real machine, the quantity α_R^{(i)} represents the attacker's belief at the start of interval i that machine M_R is the real machine. This belief is a function of the attacker's belief in the previous round (α_R^{(i−1)}) along with p_{M_R}^{(i−1)}, which is determined at the end of interval i.


We let μ_D(N, i) be the total expected cost to the defender at interval i, where:

−μ_D(N, i)/N = c_R · α_R^{(i)} (1 − τ_R^{(i)}) + c_F · (1 − α_R^{(i)})(1 − τ_F^{(i)}) + β^{iN} · V · p_{M_R}^{(i)}.    (7)

For shorthand, we refer to the quantity −μ_D(N, i)/N as the normalized expected cost at interval i, and we refer to

∑_{i=1}^{M} −μ_D(N, i)/N,    (8)

as the normalized expected cost of the game. Under this setup, we will be interested in how the game evolves given the parameters γ, V, c_R, and c_F along with the initial value of α_R^{(i)}, which we denote as α_R^{(0)}. The next theorem represents the main result of this section and it gives the normalized expected cost of the game for a broad range of parameters. A detailed proof can be found in the Appendix.

Theorem 1. Let α_R^{(0)}, γ, V, c_R, and c_F be positive integers and suppose α_R^{(1)} = α_R^{(0)}. If V > c_R/β^N and M ≥ log_β(c_F/V)/N, then the normalized expected cost of the game is:

c_F ( log_β(c_R/V)/N − 1 ) + [ α_R^{(0)} (c_R − c_F) / ( γ(1 − γ) ) ] · ( 1 − γ − (c_R/V)^{log_β(1−γ)/N} )
    + [ c_F log_β(c_F/c_R) / N ] · ( 1 − [ α_R^{(0)} / (1 − γ) ] (c_R/V)^{log_β(1−γ)/N} )
    + [ α_R^{(0)} (c_R − c_F) (c_R/V)^{log_β(1−γ)/N} ] / [ (1 − β^N)(1 − γ) ]
    + [ V / (1 − β^N) ] · ( c_F/V − β^{N(M+1)} ).

From Appendix A, we have that the behavior of the attacker and defender can be divided into three regimes:

1. V > c_R/β^{iN}: In this regime, the value of the game, represented by the parameter β^{iN} V, is high enough so that both the real and fake machine will lie at interval i. This behavior will continue until the value of the game becomes small enough so that it is no longer advantageous for the real machine to continue to lie, according to Lemma 2. During this regime, the parameter α_R^{(i)} is monotonically decreasing, so that the attacker's belief that the fake machine is real is increasing.
2. V ≤ c_R/β^{iN}, V > c_F/β^{iN}: In this regime, the fake machine will continue to lie, but the real machine will tell the truth. For this regime, the attacker's belief α_R^{(i)} remains constant.
3. V ≤ c_F/β^{iN}: Here, both machines will tell the truth, since the value of the game at interval i is low enough to justify telling the truth on either the fake or the real machine. In this case, α_R^{(i)} is increasing.

Thus, the behaviors of the attacker and defender are completely characterized by the ratios c_R/(V·β^{iN}) and c_F/(V·β^{iN}). In regime (1), since both these ratios are low, the defender minimizes its cost function by deceiving the attacker and, if the number of sub-intervals is small, the defender's normalized expected cost will be low. In regime (2), the ratios c_R/(V·β^{iN}), c_F/(V·β^{iN}) are such that the defender will lie on machine M_F but tell the truth on machine M_R, and the attacker's belief (α_R^{(i)}) during this regime is constant. Finally, in regime (3), the ratios c_R/(V·β^{iN}), c_F/(V·β^{iN}) have become high enough that it is no longer advantageous for the defender to lie. Here, the defender will tell the truth on both machines, which causes the belief α_R^{(i)} to increase. In the next section, we discuss how this analysis carries over to the more general model discussed in Sect. 3.
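The regime behavior can be made concrete with the following sketch (parameter values are arbitrary but chosen to satisfy V > c_R/β^N): at each interval the defender plays the Lemma 2 thresholds against the discounted value β^{iN} V, the attacker updates α_R^{(i)} via (6), and the per-interval costs (7) are accumulated into (8).

```python
def adaptive_game_cost(alpha0, gamma, V, c_r, c_f, beta, N, M):
    """Accumulate the normalized expected cost (8) over M intervals of length N."""
    alpha, p_prev, total = alpha0, None, 0.0
    for i in range(1, M + 1):
        if p_prev is not None:                          # belief update (6); alpha^(1) = alpha^(0)
            alpha = (1 - gamma) * alpha + gamma * p_prev
        discounted_value = beta ** (i * N) * V
        tau_r = 1.0 if discounted_value <= c_r else 0.0  # Lemma 2, applied interval by interval
        tau_f = 1.0 if discounted_value <= c_f else 0.0
        p_mr = alpha * tau_r + (1 - alpha) * tau_f
        total += (c_r * alpha * (1 - tau_r)
                  + c_f * (1 - alpha) * (1 - tau_f)
                  + beta ** (i * N) * V * p_mr)          # -mu_D(N, i)/N from (7)
        p_prev = p_mr
    return total

# With these numbers both machines lie while beta**(i*N)*V > c_R (regime 1), then only M_F lies,
# and finally both tell the truth once beta**(i*N)*V <= c_F.
print(adaptive_game_cost(alpha0=0.5, gamma=0.3, V=50, c_r=2, c_f=1, beta=0.9, N=4, M=15))
```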

5 Simulation

In the prior analysis from Sect. 4.1, the assumption is initially made that the attackers and defenders do not change their behavior over time. The more involved setup introduced in Sect. 4.2 then allows updates to strategies at different intervals. While this provides insight into the motivations present in different regimes of the game and makes calculating optimal strategies in this limited context tractable, the analysis for the attacker depends on the defender playing optimally according to its costs, which very well may not be the case. In this section, we turn to a numerical analysis of the game from the point of view of the attacker through simulation. In this early prototype we fix the parameters for the defender and evolve the attacker’s strategy in order to control the size of the state space, but the aim is to eventually relax this and allow both players to adversarially evolve their strategies.

5.1 Model

The game is a turn-based Stackelberg game in which the attacker plays as the leader and the defender is the follower. Similar to our previous setup, initially there are two machines M_1 and M_2. One of these machines is real and the other is fake. The game proceeds in rounds where an attacker probes a machine and the machine, which is controlled by the defender, responds that it is either real or fake. At the beginning of every round, the attacker has the option to either continue to probe or attack. The game ends after at most N rounds. The game proceeds as follows:

1. The game begins at round i = 0 with the game initially in state s = ∅.
2. Each turn begins with the attacker choosing an action from the following set: {P_1, P_2, A}. If the attacker chooses A, then they attack a machine and the game ends. The attacker's choice of which machine to attack is based on a hypothesis test


which will be discussed shortly. If the attacker chooses P_j, then the attacker probes M_j and the game continues.
3. If the attacker probes M_j, then the defender responds with one of the following actions: {R_j, F_j}. For example, if M_1 is probed, it can respond with the signal R_1, indicating to the attacker that it is a real machine.
4. We update the game state vector s by appending an element from the set {R_1, F_1, R_2, F_2} corresponding to the action by the defender. Next, the round number i is incremented. If i < N, then another turn is played, so that steps 2–4 are repeated. Otherwise the game terminates.

Remark 1. Note that the game can terminate early if the attacker chooses to attack. Moreover, the game can end without an attack, since the attacker has the choice of probing at the last round.

Next, we discuss the choice of which machine the attacker chooses to attack if the attacker chose action A in step (2). Recall that the defender knows the true state of the machines and accumulates costs c_R, c_F for lying accordingly. As the attacker does not know which machine is real, and similar to the approach from the previous section, they form two cost hypotheses as follows:

Cost Hypothesis 1 = C_1 = c_R · |{s_i = F_1}| + c_F · |{s_i = R_2}|, if M_1 is real,
Cost Hypothesis 2 = C_2 = c_R · |{s_i = F_2}| + c_F · |{s_i = R_1}|, if M_2 is real,

where |{s_i = F_j}| is the count of the number of times M_j signaled as fake and |{s_i = R_j}| is the count of the number of times M_j signaled as real. If the attacker chooses to attack (s_i = A) in round i, then the attacker will target machine 1 if C_1 < C_2. Otherwise, they target machine 2. The reward function for the attacker is as follows:

R = 0,        if the attacker never chooses to attack a machine,
R = V·β^i,    for attacking the real machine, and
R = −V·β^i,   for attacking the fake machine.

Figure 1 summarizes the above game. Note that in the summary M_R and M_F are indicated to us and the defender as omniscient observers, but the information set on the right indicates that the attacker is unaware of their true identities. Recall that the game terminates only if the attacker chooses to attack (or if the game exceeds a predetermined number of rounds).
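The following sketch (a simplified re-implementation of the description above, not the authors' released simulation code) captures the probe/respond/attack loop against a fixed (τ_F, τ_R) defender, the cost hypotheses C_1 and C_2, and the discounted reward; the policy passed in is a placeholder.

```python
import random

def play_episode(attacker_policy, tau_f, tau_r, N=8, V=10, beta=1.0, c_r=2, c_f=1):
    real = random.choice([1, 2])                      # which machine is real this episode
    counts = {"R1": 0, "F1": 0, "R2": 0, "F2": 0}     # signal tallies used by the hypotheses
    for i in range(N):
        action = attacker_policy(counts, i)           # returns "P1", "P2", or "A"
        if action == "A":
            c1 = c_r * counts["F1"] + c_f * counts["R2"]   # hypothesis: M1 is real
            c2 = c_r * counts["F2"] + c_f * counts["R1"]   # hypothesis: M2 is real
            target = 1 if c1 < c2 else 2
            return V * beta**i if target == real else -V * beta**i
        probed = 1 if action == "P1" else 2
        truthful = random.random() < (tau_r if probed == real else tau_f)
        says_real = truthful if probed == real else not truthful
        counts[("R" if says_real else "F") + str(probed)] += 1
    return 0.0                                        # never attacked

# Naive policy: probe M1 once, then attack. With truthful machines this always earns 10.
reward = play_episode(lambda counts, i: "P1" if i == 0 else "A", tau_f=1.0, tau_r=1.0)
```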

5.2 Simulation Parameters

The number of rounds was chosen to control state space explosion, and the total number of epochs was chosen empirically to ensure convergence. For each epoch, the simulation chooses one of the machines to be real uniformly at random at the start of each game. All games in this simulation were run with the following parameters:


Fig. 1. Summary of the game, with both attacker and defender actions in any round i.

– Number of epochs: 100,000
– Maximum number of rounds: N = 8
– Value of the real machine: V = 10
– Cost to defender of a real machine lying: c_R = 2
– Cost to defender of a fake machine lying: c_F = 1
– Discount of the value of the machine per round: β = 1
– Cost to attacker of attacking a fake machine: −10

The defender’s strategy was fixed by choosing parameters (τ_F, τ_R) which determine the probability of telling the truth on the fake and real machines respectively. The attacker was trained using the reinforcement agent in OpenAI Gym [5]. The RL agent was trained using the following parameters:

– Discount factor = 0.9
– Learning rate = 0.1
– Epsilon = 1
– Epsilon decay rate = 0.0001

The simulation was run for the following values of (τ_F, τ_R): (1, 1), (0, 0), (0, 1), (0.9, 0.9), (0.9, 1), (1, 0.9).
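For reference, the attacker training amounts to standard epsilon-greedy tabular Q-learning with the parameters listed above. The sketch below is generic rather than the exact Gym-based setup used in the experiments; in particular, the linear epsilon decay schedule and the `env_step` interface (which can be built from the episode logic sketched in Sect. 5.1, with the signal history as the state) are assumptions.

```python
import random
from collections import defaultdict

ALPHA, GAMMA_RL, EPS, EPS_DECAY = 0.1, 0.9, 1.0, 0.0001   # values from the parameter list above
ACTIONS = ["P1", "P2", "A"]

def train(episodes, env_step, initial_state):
    """Epsilon-greedy tabular Q-learning; env_step(state, action) -> (next_state, reward, done)."""
    Q = defaultdict(float)
    eps = EPS
    for _ in range(episodes):
        state, done = initial_state, False
        while not done:
            if random.random() < eps:
                action = random.choice(ACTIONS)
            else:
                action = max(ACTIONS, key=lambda a: Q[(state, a)])
            next_state, reward, done = env_step(state, action)
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in ACTIONS)
            Q[(state, action)] += ALPHA * (reward + GAMMA_RL * best_next - Q[(state, action)])
            state = next_state
        eps = max(0.0, eps - EPS_DECAY)    # decay schedule assumed linear
    return Q

# Usage (with a hypothetical environment): Q = train(100_000, env_step=my_env_step, initial_state=())
```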

5.3 Results

We report the results of each pair of (τ_F, τ_R) separately. Each has three graphs that show the evolution of an averaged parameter over the 100,000 epochs. We subdivide the 100,000 epochs into 1000 sequential bins of 100 each and take the average of the value in question over each bin of 100. The x axis shows the progression over the 1000 such bins. The y axis shows the average value of reward to the attacker, number of rounds at termination, and α_R (the proportion of probes that targeted the real machine).

Fig. 2. Average rewards, rounds, and α_R evolved over 100,000 epochs for (τ_F, τ_R) = (1, 1)

Figure 2, where both machines always tell the truth, is in some ways the simplest case. We should expect to see an attacker probe one system, then depending on its response immediately know which machine to attack the second round. As expected, in simulation we see the reward converge to near 10 and the number of rounds converge to 2. However, it did take a substantial fraction of the 100,000 epochs to get that convergence, and looking through the other cases 100,000 seems appropriate as well. The α_R plot, however, seems to be a random scatterplot.

In Fig. 3, both machines always lie. One would expect that an attacker could just invert their logic from (1, 1) and choose to attack the opposite machine. However, attacking is deterministic based on the cost hypotheses C_1 and C_2. Note how the number of rounds converges to 8, the maximum possible. In this case, the attacker is terminating the game by probing, resulting in a reward of 0 to avoid having to choose the fake machine. The parameter α_R appears to be random.

Fig. 3. Average rewards, rounds, and αR evolved over 100, 000 epochs for (τF , τR ) = (0, 0)

Figure 4 is the case where the fake machine lies but the real machine tells the truth. Thus the attacker only ever receives a signal of “Real”. Consequently, there is no way to distinguish between the two machines from these defenders’ responses and thus one would expect the attacker to guess randomly. We do see the reward converging to 0, but the number of rounds converges to 8, which implies that the game is terminating without an attack. The parameter αR seems random in this case as well.


Fig. 4. Average rewards, rounds, and αR evolved over 100, 000 epochs for (τF , τR ) = (0, 1)

The remaining three cases introduce mixed strategies for the defender's response. The results on rewards and rounds are intuitive, but difficult to guess a priori. Importantly, we start to see structure in the α_R plots.

Fig. 5. Average rewards, rounds, and αR evolved over 100, 000 epochs for (τF , τR ) = (0.9, 0.9)

Figure 5 is where both machines are truthful 90% of the time. We see the reward converge near 10, but with some variance, as expected from the defender's probabilistic approach. The number of rounds converges to a little above 4. Intuitively, if you know both machines tend to be truthful, you can probe multiple times to verify you get the same result. Perhaps the possibility of terminating through probing at round 8 played a part in depressing the number of rounds. In any case, the number of rounds was greater than 2 but less than 8, and some additional parameter exploration is warranted. α_R no longer seems random, but exhibits oscillating behavior.

Fig. 6. Average rewards, rounds, and α_R evolved over 100,000 epochs for (τ_F, τ_R) = (0.9, 1)

Fig. 7. Average rewards, rounds, and α_R evolved over 100,000 epochs for (τ_F, τ_R) = (1, 0.9)

Figures 6 and 7 are where one machine always tells the truth but the other only does so 90% of the time. Both converge to a reward of nearly 10 and just under 3 rounds on average. Interestingly, the number of rounds is still above the 2 from the (1, 1) case, still below the 4 from the (0.9, 0.9) case, and both of these have converged to roughly the same number of rounds. This seems to imply that the amount of deception is positively correlated with the number of rounds the attacker will probe, and also seems to imply that it does not matter which machine is doing the deception. And finally we have what seems like a coherent result for α_R. In both cases it converges to a value, and that value is about 0.2 on either side of 0.5. What seems to be happening is the attacker is learning which type of machine is more deceptive, and after the first probe will choose to probe again to go after what it thinks is the more deceptive machine. In Fig. 6, that is the fake machine and hence α_R goes down. In Fig. 7, that is the real machine and hence α_R goes up. Note there appears to be some quantization in some of the figures. We believe this is an artifact of the simulation, but further investigation is warranted.

The simulation results are notable for a number of reasons. First, they imply that increasing deception seems to increase the number of rounds an attacker will choose to probe. This is significant, as deception is a parameter the defender controls and can use to extend or shorten the desired interaction with the attacker. Future work should better uncover the relationship between deception and game longevity. It seems you only lie twice to get the desired effect. Second, the parameter α_R, which plays prominently in the mathematical analysis, seems to have


value in the simulation as well. In particular, in the last two cases, it is correlated with the attacker’s belief of the deceptive system. αR is particularly interesting because it can be measured in an individual game and can provide information on the attacker before they choose to attack. Combining these, it may be possible to construct a game where multiple rounds are required of the attacker and then the defender is able to predict the attacker’s beliefs before the attack takes place. Third, it seems in some scenarios that the attacker will prefer to let the game go on indefinitely rather than actually attack. This is akin to not wanting to play the game and could be a desired strategy for the defender to steer the attacker towards such despondency. In our future work, we intend to more directly address this by adding a “leave” option in every round. Lastly, there is potential for additional insights through further exploration of existing parameters. One such example we will address in future work would be to add some cost to probing each round in order to explore sunk cost.

6 Conclusion

The contributions of this work are three-fold. First, we introduce a simple model of deception that captures the essence of interactions between an attacker and defender in a cyber environment. Second, we derive closed-form expressions for a version of this model that capture the optimal strategies for both the attacker and defender. Finally, we identify the potential behavior of an attacker at noncritical points via an MDP simulation. Our network consists of only two systems, and the act of performing deception is restricted to a simple signal indicating that a system is real or fake. Our analysis allows for varying adversary types, but only addresses the adversary who is both aware of deception and able to estimate defender costs (a strong adversarial model). A simplified network environment allows for tractable analysis and simulation. However, the primary motivation for such a simple game is to pursue results which can generalize to larger networks. Intuition suggests that directly scaling the environment to consist of many hosts of varying configurations would be more representative of a realistic network environment, but such scaling is not necessary for our results to apply to larger environments. To this end, we believe that generalizing these results can form the basis of an inductive proof for larger network environments. In the next section, we discuss in more detail how our game may have applicability to more generalized setups, and we discuss other potential future work.

7 Future Work

In the context of defensive deception, the simpler environment represents a more difficult defensive environment. Because we have restricted the defense to a binary choice between two hosts, we have enforced a minimum chance of success by the attacker of 1/2. Since we do not allow adjustments in the ratio of real hosts to decoys, a defender in this environment is tautologically worse off


than an environment with a larger number of hosts. Any larger network would consist of a larger number of potential states. Depending on the attacker's goal (to attack a specific host, to attack all hosts, or to attack any host), extending the base case can take several forms. Each attacker type represents a different analysis – each being supported by the base-case analysis. For example, in the case where the attacker chooses to attack all hosts, the base case supports the analysis where an attacker has eliminated all but two hosts as being potential targets. Masking a single host between two hosts represents the base case for systems which consist of larger numbers of real and fake hosts. If properties of this base case can be proven and supported through both simulation and experimental evidence, this will provide a strong basis for extending the approach to a larger network environment. As such, our results represent a significant improvement over prior work in that we have selected a realistic scenario that will provide minimum bounds on defender performance and act as a base case for inductive proofs concerning larger networks. This represents a much more difficult task than a mixture of many systems with messy features that would obscure poorly conceived defensive deception techniques. Our future work will take our base-case results and show that adding additional systems results in similar or improved results. Extending this work will require careful analysis that increasing the number of systems (either real or fake) results in game behavior that is demonstrably equivalent to the base case or which improves defender advantage within the confines of key assumptions of the game environment. Variations of our game environment and model may demonstrate varying consequences for the defender and varying success for the attacker:

– Add in an action for the attacker to decide to walk away early (to avoid getting a large negative reward)
– Add in the ability for the defender to eject the attacker at any round
– Add costs to probing to explore the sunk cost fallacy (a small sketch of this option follows at the end of this section)
– Modify the simulation such that the defender is learning how to signal optimally.

In addition, we plan to follow the example of recent successes in large-scale human subjects research (HSR) evaluating cyber deception [9] and run HSR experiments to help better understand our model. We can use the results both to provide more realistic utilities, probabilities, and costs to this and future models, and to compare our mathematical results to real human behavior. This will provide additional key insights to help improve the realism and applicability of our contributions.
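One simple way to realize the probing-cost variation from the list above (a sketch only, with an arbitrary per-probe cost c_probe; not an implemented extension of the simulation) is to subtract the accumulated probing cost from the terminal reward:

```python
def reward_with_probe_cost(attacked, correct, i, V=10.0, beta=1.0, c_probe=0.5):
    """Terminal reward after i probes; the sunk probing cost is paid whether or not an attack happens."""
    sunk = c_probe * i
    if not attacked:
        return -sunk                                  # walking away still forfeits the probing effort
    return (V * beta**i if correct else -V * beta**i) - sunk
```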

A Proof of Theorem 1

Recall from Sect. 4.2 that we claimed the behavior of the attacker and defender can be divided into three basic regimes according to the interval i ∈ {1, 2, . . . , M }:

1. V > c_R/β^{iN},
2. V ≤ c_R/β^{iN}, V > c_F/β^{iN},
3. V ≤ c_F/β^{iN}.

Recall also that V > c_R/β^N. Since V > c_R/β^N, it follows from Lemma 2 that τ_R^{(1)} = τ_F^{(1)} = 0, and so p_{M_R}^{(1)} = 0, which according to (6) implies that α_R^{(2)} = (1 − γ) · α_R^{(1)}. More generally, note that whenever V > c_R/β^{iN}, the same logic will apply, so that

α_R^{(i)} = (1 − γ)^{i−1} · α_R^{(0)}.

Let T_R^{(∗)} + 1 denote the smallest i where V ≤ c_R/β^{iN}. Then, clearly T_R^{(∗)} + 1 = log_β(c_R/V)/N. For convenience, we will assume log_β(c_R/V)/N is a positive integer, so that we can omit the floor and ceiling functions from our notation. Recall that since during this regime both the real and fake machine will lie, we have p_{M_R}^{(i)} = 0 for all i ∈ {1, 2, . . . , T_R^{(∗)}} according to Lemma 1. From (7) and (8), we get that the normalized expected cost of the game in regime (1), which begins in interval 1 and lasts until interval T_R^{(∗)}, is

∑_{i=1}^{T_R^{(∗)}} ( c_R α_R^{(i)} + c_F (1 − α_R^{(i)}) )
    = c_R α_R^{(0)} ∑_{j=0}^{T_R^{(∗)}−1} (1 − γ)^j + T_R^{(∗)} c_F − c_F α_R^{(0)} ∑_{j=0}^{T_R^{(∗)}−1} (1 − γ)^j
    = T_R^{(∗)} c_F + α_R^{(0)} ( c_R − c_F ) · ( 1 − (1 − γ)^{T_R^{(∗)}} ) / γ.

Substituting in T_R^{(∗)} = log_β(c_R/V)/N − 1 gives that the normalized expected cost of the game in regime (1) is:

c_F ( log_β(c_R/V)/N − 1 ) + [ α_R^{(0)} ( c_R − c_F ) / ( γ (1 − γ) ) ] · ( 1 − γ − (c_R/V)^{log_β(1−γ)/N} ).    (9)

Next, we consider regime (2), where V ≤ c_R/β^{iN} and V > c_F/β^{iN}. As noted, this regime begins at interval T_R^{(∗)} + 1 and will continue until V ≤ c_F/β^{iN}, which occurs when i ≥ T_F^{(∗)} + 1 for T_F^{(∗)} + 1 = log_β(c_F/V)/N. Similarly to before, we assume for notational convenience that T_F^{(∗)} is a positive integer, so that we can remove floor and ceiling functions from our analysis. According to Lemma 2, we will have that for every interval in this regime, τ_R^{(i)} = 1 and τ_F^{(i)} = 0, so that p_{M_R}^{(i)} = α_R^{(T_R^{(∗)}+1)} for T_R^{(∗)} + 1 ≤ i ≤ T_F^{(∗)}. Note that this also means that α_R^{(i)} = α_R^{(T_R^{(∗)}+1)} for all i when T_R^{(∗)} + 1 ≤ i ≤ T_F^{(∗)}. From (7) and (8), this implies that the normalized expected cost of the game in regime (2) is given by

∑_{i=T_R^{(∗)}+1}^{T_F^{(∗)}} ( c_F · (1 − α_R^{(i)})(1 − τ_F^{(i)}) + β^{iN} · V · p_{M_R}^{(i)} )
    = c_F ( T_F^{(∗)} − T_R^{(∗)} ) ( 1 − α_R^{(T_R^{(∗)}+1)} ) + V α_R^{(T_R^{(∗)}+1)} ∑_{i=T_R^{(∗)}+1}^{T_F^{(∗)}} β^{iN}
    = c_F ( T_F^{(∗)} − T_R^{(∗)} ) ( 1 − α_R^{(0)} (1 − γ)^{T_R^{(∗)}} ) + V α_R^{(0)} (1 − γ)^{T_R^{(∗)}} · ( β^{N(T_R^{(∗)}+1)} − β^{N(T_F^{(∗)}+1)} ) / ( 1 − β^N )
    = [ c_F log_β(c_F/c_R) / N ] · ( 1 − [ α_R^{(0)} / (1 − γ) ] (c_R/V)^{log_β(1−γ)/N} ) + [ α_R^{(0)} ( c_R − c_F ) (c_R/V)^{log_β(1−γ)/N} ] / [ (1 − β^N)(1 − γ) ].    (10)

Finally, we consider regime (3). Here we have that V ≤ c_R/β^{iN} and V ≤ c_F/β^{iN}. This regime begins at interval T_F^{(∗)} + 1 and ends at interval M, when the game itself ends. As mentioned earlier, since V ≤ c_R/β^{iN}, it follows that τ_R^{(i)} = 1 and τ_F^{(i)} = 1 for all T_F^{(∗)} + 1 ≤ i ≤ M, which implies p_{M_R}^{(i)} = 1. Recall that

p_{M_R}^{(T_F^{(∗)})} = α_R^{(T_R^{(∗)}+1)} = [ α_R^{(0)} / (1 − γ) ] (c_R/V)^{log_β(1−γ)/N}.

As in the other regimes, we compute α_R^{(i)}, though its value will not affect the computation of the normalized expected cost. We have α_R^{(T_F^{(∗)}+1)} = [ α_R^{(0)} / (1 − γ) ] (c_R/V)^{log_β(1−γ)/N}. For i > T_F^{(∗)} + 1, using the recursion α_R^{(i)} = (1 − γ) · α_R^{(i−1)} + γ · p_{M_R}^{(i−1)} = (1 − γ) · α_R^{(i−1)} + γ (since p_{M_R}^{(i−1)} = 1), we have

α_R^{(i)} = (1 − γ)^{i−(T_F^{(∗)}+1)} α_R^{(T_F^{(∗)}+1)} + γ ∑_{j=0}^{i−(T_F^{(∗)}+1)−1} (1 − γ)^j
        = (1 − γ)^{i−(T_F^{(∗)}+1)} α_R^{(T_F^{(∗)}+1)} + 1 − (1 − γ)^{i−(T_F^{(∗)}+1)}.

Since (1 − γ)^{T_F^{(∗)}+1} = (c_F/V)^{log_β(1−γ)/N} and (1 − γ)^{T_R^{(∗)}} = (c_R/V)^{log_β(1−γ)/N} / (1 − γ), we have

α_R^{(i)} = (1 − γ)^{i−1} α_R^{(0)} (c_R/c_F)^{log_β(1−γ)/N} + 1 − (1 − γ)^{i} (V/c_F)^{log_β(1−γ)/N}.

Now, we consider the normalized expected cost of the game in regime (3). For this case, from (7) and (8) we have that the normalized cost is given by:

∑_{i=T_F^{(∗)}+1}^{M} β^{iN} · V = [ V / (1 − β^N) ] · ( β^{N(T_F^{(∗)}+1)} − β^{N(M+1)} )
                              = [ V / (1 − β^N) ] · ( c_F/V − β^{N(M+1)} ).    (11)

The result now follows from (9), (10), and (11).
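As a sanity check on the regime-(1) algebra, the following sketch compares a direct interval-by-interval accumulation with expression (9) written in terms of T_R^{(∗)} (the parameter values are arbitrary but chosen so that log_β(c_R/V)/N is an integer, as the theorem assumes):

```python
import math

def regime1_cost_direct(alpha0, gamma, c_r, c_f, t_r_star):
    """Sum c_R*alpha_R^(i) + c_F*(1 - alpha_R^(i)) for i = 1..T_R*, with alpha_R^(i) = (1-gamma)**(i-1)*alpha0."""
    total = 0.0
    for i in range(1, t_r_star + 1):
        alpha = (1 - gamma) ** (i - 1) * alpha0
        total += c_r * alpha + c_f * (1 - alpha)
    return total

def regime1_cost_closed(alpha0, gamma, c_r, c_f, t_r_star):
    """Expression (9) rewritten in terms of T_R*: T_R* c_F + alpha0 (c_R - c_F)(1 - (1-gamma)**T_R*)/gamma."""
    return t_r_star * c_f + alpha0 * (c_r - c_f) * (1 - (1 - gamma) ** t_r_star) / gamma

beta, N, V, c_r = 0.5 ** 0.25, 4, 512.0, 2.0
t_r_star = round(math.log(c_r / V, beta) / N) - 1       # equals 7 for these numbers
print(regime1_cost_direct(0.5, 0.3, c_r, 1.0, t_r_star))
print(regime1_cost_closed(0.5, 0.3, c_r, 1.0, t_r_star))  # both print the same value
```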

References

1. Aggarwal, P., Dutt, V., Gonzalez, C.: Cyber-security: role of deception in cyberattack detection. In: Nicholson, D. (ed.) Advances in Human Factors in Cybersecurity. AISC, vol. 501, pp. 85–96. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41932-9_8
2. Almeshekah, M.H., Spafford, E.H.: Planning and integrating deception into computer security defenses. In: Proceedings of the 2014 New Security Paradigms Workshop, NSPW 2014, pp. 127–138. ACM, New York (2014). https://doi.org/10.1145/2683467.2683482
3. Bilinski, M., Gabrys, R., Mauger, J.: Optimal placement of honeypots for network defense. In: Bushnell, L., Poovendran, R., Başar, T. (eds.) GameSec 2018. LNCS, vol. 11199, pp. 115–126. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01554-1_7
4. Bowen, B.M., Hershkop, S., Keromytis, A.D., Stolfo, S.J.: Baiting inside attackers using decoy documents. In: Chen, Y., Dimitriou, T.D., Zhou, J. (eds.) SecureComm 2009. LNICST, vol. 19, pp. 51–70. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-05284-2_4
5. Brockman, G., et al.: OpenAI gym (2016)
6. Campbell, R.M., Padayachee, K., Masombuka, T.: A survey of honeypot research: trends and opportunities. In: 2015 10th International Conference for Internet Technology and Secured Transactions (ICITST), pp. 208–212 (2015). https://doi.org/10.1109/ICITST.2015.7412090
7. Carroll, T.E., Grosu, D.: A game theoretic investigation of deception in network security. Secur. Commun. Netw. 4(10), 1162–1172 (2011)
8. Feng, X., Zheng, Z., Mohapatra, P., Cansever, D.: A Stackelberg game and Markov modeling of moving target defense. In: Rass, S., An, B., Kiekintveld, C., Fang, F., Schauer, S. (eds.) GameSec 2017. LNCS, vol. 10575, pp. 315–335. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68711-7_17
9. Ferguson-Walter, K.J., et al.: The Tularosa study: an experimental design and implementation to quantify the effectiveness of cyber deception. Maui, Hawaii, January 2019
10. Fugate, S., Ferguson-Walter, K.: Artificial intelligence and game theory models for defending critical networks with cyber deception. AI Mag. 40(1), 49–62 (2019). https://doi.org/10.1609/aimag.v40i1.2849. https://www.aaai.org/ojs/index.php/aimagazine/article/view/2849

84

M. Bilinski et al.

11. Garg, N., Grosu, D.: Deception in honeynets: a game-theoretic analysis. In: 2007 IEEE SMC Information Assurance and Security Workshop, pp. 107–113. IEEE (2007) 12. Gutzwiller, R., Ferguson-Walter, K.J., Fugate, S., Rogers, A.: “Oh, Look, A butterfly!” a framework for distracting attackers to improve cyber defense, Philadelphia, Pennsylvania, October 2018 13. Heckman, K.E., Stech, F.J., Thomas, R.K., Schmoker, B., Tsow, A.W.: Cyber Denial, Deception and Counter Deception: A Framework for Supporting Active Cyber Defense. Advances in Information Security. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25133-2. https://www.springer.com/ gp/book/9783319251318 14. Jajodia, S., et al.: A probabilistic logic of cyber deception. IEEE Trans. Inf. Forensics Secur. 12(11), 2532–2544 (2017) 15. Kovach, N.S., Gibson, A.S., Lamont, G.B.: Hypergame theory: a model for conflict, misperception, and deception. Game Theory 2015 (2015). Article ID 570639 16. Rowe, N.C., Custy, E.J., Duong, B.T.: Defending cyberspace with fake honeypots. J. Comput. 2(2), 25–36 (2007) 17. Schlenker, A., et al.: Deceiving cyber adversaries: a game theoretic approach. In: Proceedings of the 17th International Conference on Autonomous Agents and Multi Agent Systems, pp. 892–900. International Foundation for Autonomous Agents and Multiagent Systems (2018) 18. Vane, R.R.: Hypergame theory for DTGT agents. In: American Association for Artificial Intelligence (2000) 19. Wagener, G., State, R., Dulaunoy, A., Engel, T.: Self adaptive high interaction honeypots driven by game theory. In: Guerraoui, R., Petit, F. (eds.) SSS 2009. LNCS, vol. 5873, pp. 741–755. Springer, Heidelberg (2009). https://doi.org/10. 1007/978-3-642-05118-0 51 20. Wang, W., Zeng, B.: A two-stage deception game for network defense. In: Bushnell, L., Poovendran, R., Ba¸sar, T. (eds.) GameSec 2018. LNCS, vol. 11199, pp. 569–582. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01554-1 33 21. Zheng, J., Namin, A.S.: Markov decision process to enforce moving target defence policies. arXiv preprint arXiv:1905.09222 (2019)

Honeypot Type Selection Games for Smart Grid Networks

Nadia Boumkheld1, Sakshyam Panda1, Stefan Rass2, and Emmanouil Panaousis1

1 Department of Computer Science, University of Surrey, Guildford, UK
[email protected]
2 Institute of Applied Informatics, Universität Klagenfurt, Klagenfurt, Austria

Abstract. In this paper, we define a cyber deception game between the Advanced Metering Infrastructure (AMI) network administrator (henceforth, defender) and an attacker. The defender decides whether to install a low-interaction honeypot, a high-interaction honeypot, or a real system with no honeypot. The attacker decides whether or not to attack the system given her belief about the type of device she is facing. We model this interaction as a Bayesian game with complete but imperfect information. The choice of honeypot type is private information and characterizes the defender's objective, i.e., the degree of deception and the amount of threat intelligence gathered. We study the players' equilibrium strategies and provide numerical illustrations. The work presented in this paper has been motivated by the H2020 SPEAR project, which investigates the implementation of honeypots in smart grid infrastructures to (i) contribute towards creating attack data sets for training a SIEM (Security Information and Event Management) system and (ii) support post-incident forensic analysis by recording evidence of an attacker's actions.

Keywords: Game theory · Honeypots · Smart grid · Cyber security

1 Introduction

Smart grid adds information and communication technologies to the traditional grid in order to build a strong electrical grid capable of meeting the growing demand for electricity. A smart grid can be a large system connecting millions of devices and entities using different types of technologies, making it a complex and attractive target for cyber attackers. Attacks may aim to compromise grid devices with the goal of launching further attacks. For example, a hacked device might abruptly increase the load to cause a circuit overflow. Attacks against the "residential" part of the smart grid may also try to insert, change, or delete data or control commands in the network traffic, to mislead the smart grid and enforce faulty decisions, such as a compromised smart meter causing inaccurate electricity bills [1]. User privacy is also threatened, as cyber


adversaries who gain access to communication channels used by the smart meters can infer the presence or absence of occupants in a building [2].

A honeypot is a security mechanism set up as a decoy to lure cyber attackers. Objectives behind using honeypots include protecting real devices from being attacked, wasting attackers' time, and collecting information about the attack methods used towards improving threat intelligence [3]. Honeypots are used to deceive attackers, but they are limited in resources and thus should be deployed smartly to maximize the deception. Game theory has been used to decide strategic deployments of honeypots, and defensive deception for cybersecurity and privacy in general has been modelled using game theory; Pawlick et al. have published a taxonomy and survey on this topic [4].

A honeypot is a static trap network. Once attackers suspect its existence, they will be able to escape it by turning their effort towards other devices. Thus, it is extremely important, especially when dealing with critical infrastructure, that appropriate types of honeypots are chosen to satisfy specific objectives. To address this, we propose a game-theoretic model that optimizes the defender's choice of a system. The proposed model is not confined to smart grids, but we aim at using it as a basis, in future work as part of the H2020 SPEAR project, for investigating the optimal use of honeypots in a smart grid testbed. We aim to achieve this by installing different honeypot types (e.g., taking advantage of different configurations of Conpot [5]) at crucial smart grid infrastructure points such as the control center, the Remote Terminal Units (RTUs) and the smart meter gateways. The derived results will be used to assess the performance of game-theoretic strategies in the smart grid testbed.

The aim of this paper is to introduce a game-theoretic aid for the defender to optimally decide on the type of honeypot to protect a smart grid. The model aims to maximize threat intelligence while respecting the costs associated with implementing different honeypot types. These costs may include (i) network throughput overhead introduced by adding honeypots to the infrastructure, (ii) the hardware cost of these honeypots, and (iii) operational management cost (e.g., system administrators' time spent operating, auditing, and maintaining honeypots). In our setting, the attacker decides whether to attack or not, given that any unsuccessful attempt can lead to her attribution and disclose her attack methods. More precisely, we model the interactions between the defender and the attacker as a sequential game of complete but imperfect information. We compute the perfect Bayesian Nash equilibrium as guidance towards the optimal choice for each player. We assess our game model using numerical simulations to derive the probability of deploying a type of system and the attack probability.

The rest of the paper is organized as follows: Sect. 2 presents related work relevant to our model, Sect. 3 explains the game model we developed using honeypots, Sect. 4 presents the analysis for the calculation of the game equilibria, Sect. 5 displays the results of our simulations, and Sect. 6 concludes this paper.

2 Related Work

Configuration and deployment of honeypots have been extensively studied and carries a rich literature [6,7]. However, from a game-theoretic perspective there are relatively fewer studies on the strategic use of honeypots [8]. P´ıbil et al. categorises the studies based on modelling (i) ongoing attack phase which captures the interaction within the honeypots and attackers [9,10] and (ii) pre-attack phase where the attacker chooses a target [11]. They further investigated how a honeypot should be designed to optimize the probability that the attacker will attack the honeypot and not the real system [12]. The model also reflects on the probing capability of the attacker to determine whether the targeted machine is a honeypot before attacking. Garg and Grosu studied the strategic use of honeypots for network defense through a signalling game. They investigated the problem of allocating k honeypots out of n possible hosts within a block of IP addresses [13]. In [14], Ceker et al. modelled the interaction between defender and attacker as a signalling game to devise a deception method to mitigate DoS attacks. The defender chooses, for one system, whether to be a honeypot or real system. The attacker can either attack, observe or retreat. La et al. extended the analysis of this work from single-shot game to repetitive game taking into account the deceptive aspects of the players [8]. In [15] the defender deploys honeypots in an AMI network to detect and gather DoS attack information while the attacker has the option to deploy anti-honeypot mechanisms to detect honeypot proxy servers before deciding on whether to attack or not. Similar to our work, the listed work employs honeypots to gather information about the attackers and use deception as a defensive mechanism. Wagener et al. have trained a high-interaction honeypot to be capable of learning from attackers and dynamically changing its behaviour using reinforced learning [16]. Low interaction honeypots might reveal their true identity while high interaction honeypots may result in adverse conditions, e.g. attacker increases the chance to take control of the real system. Thus, with such adaptive techniques there is a need for finding an optimal response strategy for the honeypot to prolong its interaction with malicious entities. Motivated by [16], Hayatle et al. studied the honeypot detection by bootmasters through a Bayesian game of incomplete information [17]. The honeypot decides whether to execute the attack commands received from the botmaster or not; the attacker decides to attack, just test the type of system seen or not interact at all. Carroll and Grosu defined a signalling in which the type of a system is chosen randomly for a distribution of honeypots and real systems [18]. The defender chooses to be truthful or deceptive regarding the type of each system. Based on the received information, the attacker decides to attack, to withdraw, or condition his attack on testing the type of the system. The detection of a target adds additional cost to the attacker regardless of it being a normal system or a honeypot, but it mitigates the loss of the attacker incurred when attacking a honeypot. In [19], Pawlick and Zhu extended this work by considering the effect of determining the system type to be endogenous on the utility. It analyses two


models: a simple cheap-talk game and a cheap-talk game with evidence where the receiver can detect deception with some probability. The core structure of our game is motivated from [18,19]. We refine the choices of the defender further i.e, choosing between deploying a high-interaction honeypot or a low-interaction honeypot or normal system, rather than just between honeypot or normal system. Further, we have motivated the parameters of the defender and the attacker from [20]. In [20], Li et al. have presented a simplistic model where the defender decides whether to deploy a honeypot or not and the attacker decides whether to attack or not using a complete imperfect dynamic game. In addition, [21], Li et al. used a Bayesian game to model a distributed honeypot network. Similar to our work, they consider a decoy factor of honeypot which could be conceived as the efficacy of each type of system in our model. Furthermore, we consider types of honeypots (high/low interaction) and propose optimal mix between the choice of a type of honeypot and real system rather than randomly changing the types.

3 Game Model and Assumptions

We model the interaction between the defender D and the attacker A as a sequential game with complete but imperfect information called the Honeypot Type Selection Game (HTSG), represented in Fig. 1. Table 1 presents the list of symbols used in our model. While deploying a new system, D has to decide whether it should include a high-interaction honeypot (H), or a low-interaction honeypot (L), or should be a system with no honeypot, i.e., a normal system (R). This choice of the defender is private information and is unknown to the attacker. A low-interaction honeypot facilitates limited services such as internet protocol, network services and does not provide interaction with the operating system. They are easy to deploy, maintain and minimize the risk by containing the attacker’s activities. High-interaction honeypots are more sophisticated, complex to implement and maintain, and they provide interaction with a real/virtual operating system [22]. We further assume that each type of system has an efficacy which we define as the probability of a system to be recognized as a real system by an attacker during reconnaissance. We represent these efficacies as aL , aH and pR for type-L, type-H and type-R, respectively. This efficacy factor induces uncertainty in the attacker’s decision. The defender aims at taking advantage of the information asymmetry to gather information about the attacker’s behaviour and detecting potential cyber attacks. For example, [23] used honeypots to detect cyber attacks, [24] used honeypots to simulate and learn about Distributed Denial of Service (DDoS) attack on network infrastructure, and [25] used the data collected from honeypots to identify cyber attack trends. The defender has to bear additional cost for choosing type-L or type-H system compared to the type-R system. This cost may be introduced by network throughput due to honeypot in the infrastructure, cost of implementing, deploying and maintaining the honeypot and operational management costs. In our


Table 1. List of symbols

Symbol | Condition/range | Description
a_H | 0 < a_H < 1 | Efficacy of type-H system
a_L | 0 < a_L < a_H | Efficacy of type-L system
b^A | b^A > 0 | Attacker's benefit on attacking type-R system
b^D_H | b^D_H ≥ c^D_H | Defender's benefit when type-H system is attacked
b^D_L | c^D_L ≤ b^D_L < b^D_H | Defender's benefit when type-L system is attacked
c^D_H | c^D_H > 0 | Cost of running a type-H system
c^D_L | 0 < c^D_L < c^D_H | Cost of running a type-L system
d | d > b^D_H | Defender's loss when type-R system is attacked
l^A_H | l^A_H > 0 | Attacker's loss on attacking type-H system
l^A_L | 0 < l^A_L < l^A_H | Attacker's loss on attacking type-L system
p_R | 0 < p_R ≤ 1 | Efficacy of type-R system
p_1 | 0 ≤ p_1 ≤ 1 | Attacker's belief about type-R with information set {L, R}
p_2 | 0 ≤ p_2 ≤ 1 | Attacker's belief about type-R with information set {H, R}

model, $c^D_L$ and $c^D_H$ represent the aggregate cost of running a type-L and a type-H system, respectively. Based on these assumptions, we model the HTSG to highlight the strategic aspects of the interaction. The leaf nodes present the payoffs for the actions chosen by the players. The payoffs are represented in the form $(x, y)$, where $x$ and $y$ are the payoffs of D and A, respectively.
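As an illustration of the parameter space in Table 1, the following sketch (with hypothetical values, not taken from the paper) encodes the symbols together with the ordering conditions they must satisfy:

```python
# Hypothetical HTSG parameters; the asserts mirror the Condition/range column of Table 1.
params = dict(a_H=0.7, a_L=0.5, b_A=100.0, b_D_H=80.0, b_D_L=40.0,
              c_D_H=30.0, c_D_L=10.0, d=120.0, l_A_H=60.0, l_A_L=20.0,
              p_R=0.5, p1=0.4, p2=0.6)

p = params
assert 0 < p["a_L"] < p["a_H"] < 1          # honeypot efficacies
assert 0 < p["p_R"] <= 1                    # type-R efficacy
assert p["b_A"] > 0 and p["l_A_H"] > p["l_A_L"] > 0
assert p["b_D_H"] >= p["c_D_H"] > 0
assert p["c_D_L"] <= p["b_D_L"] < p["b_D_H"] and 0 < p["c_D_L"] < p["c_D_H"]
assert p["d"] > p["b_D_H"]                  # loss when the real system is attacked
assert 0 <= p["p1"] <= 1 and 0 <= p["p2"] <= 1
print("parameter set is consistent with Table 1")
```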

4 Equilibria Analysis

This section analyses the equilibria of the proposed HTSG in Fig. 1. We utilize the game-theoretic concept of the perfect Bayesian Nash equilibrium (PBNE), which helps us gain insight into the strategic behaviour of the players. PBNE refines the Bayesian Nash equilibrium to remove (some) implausible equilibria in sequential games [26]. PBNE, in the context of our game, is defined by the four requirements discussed in [27], which are met along our analysis. From the payoffs in Fig. 1, it can be seen that there is no preferred pure strategy for a player. This particular parametric configuration is prescribed to illustrate a network administrator's challenge in deciding which type of system to install in the presence of a threat. To analytically determine the optimal choice, represented as a PBNE, we segment our analysis into four cases:

Condition 1: when $U^D(L, A) > U^D(H, A)$ and $U^D(L, NA) \ge U^D(H, NA)$
Condition 2: when $U^D(L, A) > U^D(H, A)$ and $U^D(L, NA) < U^D(H, NA)$
Condition 3: when $U^D(L, A) \le U^D(H, A)$ and $U^D(L, NA) < U^D(H, NA)$
Condition 4: when $U^D(L, A) \le U^D(H, A)$ and $U^D(L, NA) \ge U^D(H, NA)$

Having defined the necessary concepts, we next determine the possible PBNEs of the game for these situations, where the PBNEs are strategy profiles and beliefs that satisfy all four requirements described earlier.

[Game tree of the HTSG: the defender moves first and chooses L, H, or R; the attacker then chooses A or NA. The leaf payoffs (defender, attacker) are $((b^D_L - c^D_L)a_L,\ -l^A_L a_L)$ and $(-c^D_L(1-a_L),\ 0)$ under type-L, $(-d\cdot p_R,\ b^A\cdot p_R)$ and $(0,\ 0)$ under type-R, and $((b^D_H - c^D_H)a_H,\ -l^A_H a_H)$ and $(-c^D_H(1-a_H),\ 0)$ under type-H, for the attacker playing A and NA, respectively.]

Fig. 1. Extensive form representation of the Honeypot Type Selection Game (HTSG) with the defender (D) choosing between the types of system (type-L, type-H and typeR) and the attacker (A) deciding to attack (A) or not attack (NA).

Condition 1: This case refers to the situation where L dominates H, i.e., $U^D(L, A) > U^D(H, A)$ and $U^D(L, NA) \ge U^D(H, NA)$, implying that H can be removed from the strategy set of the defender, reducing the 3 × 2 payoff matrix to a 2 × 2 payoff matrix. From the payoff matrix, it can be observed that neither player has a preferred pure strategy. Following the requirements for a PBNE, we have:

(a) Belief consistency: Requirement 1 states that if the play of the game reaches a player's non-singleton information set, then the player with the move must have a belief about which node has been reached. Let the attacker believe that the defender has chosen R with probability $p_1$.

(b) Attacker's sequentially rational condition given updated beliefs: Given the attacker's belief $p_1$, we calculate the payoffs for playing A and NA and choose the strategy that maximizes his payoff. For strategy A to be sequentially rational, $U^A(NA) < U^A(A)$, which gives

$$p_1 > \frac{a_L \cdot l^A_L}{p_R \cdot b^A + a_L \cdot l^A_L}. \qquad (1)$$

(c) Defender's sequentially rational condition given the attacker's best response: Knowing the best responses of the attacker, i.e., A for $p_1 > \frac{a_L\, l^A_L}{p_R\, b^A + a_L\, l^A_L}$ and NA for $p_1 < \frac{a_L\, l^A_L}{p_R\, b^A + a_L\, l^A_L}$, we determine the best response of the defender. When the attacker prefers to attack, the defender's best response is to play L, and for NA the defender's best response is R. Thus, the PBNEs of the game are $(L, A;\, p_1)$ where $p_1 > \frac{a_L\, l^A_L}{p_R\, b^A + a_L\, l^A_L}$, and $(R, NA;\, p_1)$

where $p_1 < \frac{a_L\, l^A_L}{p_R\, b^A + a_L\, l^A_L}$.

Condition 2: This case considers the situation when $U^D(L, A) > U^D(H, A)$ and $U^D(L, NA) < U^D(H, NA)$. We solve the 3 × 2 matrix game with a graphical solution approach. Let the attacker choose A with probability $\alpha$ and NA with probability $1-\alpha$. The attacker's average payoff when the defender plays L, R, or H is, respectively, $U^A = \alpha\, a_L (-l^A_L)$, $U^A = \alpha\, p_R\, b^A$, and $U^A = \alpha\, a_H (-l^A_H)$. We plot these linear functions for $0 \le \alpha \le 1$. For a fixed value of $\alpha$, the attacker aims at maximizing his average payoff. This is obtained by finding the $\alpha$ that achieves the maximum over the lower envelope of these functions. From Fig. 2, this is at the intersection of the three lines at $\alpha = 0$.
H a H (-l A ) H

Fig. 2. Attacker’s expected payoffs for attacking against defender’s strategy.

As more than two lines passes through the intersection point, we choose sets of two lines with opposite slopes. Applying the methodology as in Case A with lines L, R and attacker’s belief on defender playing R with probability p2 , we obtain the same PBNEs as in Case A. With lines R and H, PBNEs are ⎧ ⎨(H, A; p2 ),

where p2 >

⎩(R, N A; p2 ),

where p2
U D (H, A) (L, A; p1 ) p1 ≥

5

U D (L, N A) ≥ U D (H, N A) aL ·lA L pR ·bA +aL ·lA L aL ·lA L (R, NA; p1 ) p1 < A pR ·b +aL ·lA L aH ·lA H (H, A; p2 ) p2 ≥ pR ·bA +aH ·lA H aH ·lA H (R, NA; p2 ) p2 < pR ·bA +aH ·lA H aL ·lA L (L, A; p1 ) p1 ≥ A A pR ·b +aL ·l L aL ·lA L (R, NA; p1 ) p1 < pR ·bA +aL ·lA L

(L, A; p1 ) p1 ≥

Simulation Results

In this section, we present the results of our simulations, obtained by comparing our game-theoretic (HTSG) approach with a non-game-theoretic (No GT) approach in which the defender randomly chooses the type of system to deploy. We present the players' utilities first by varying the probability $p_R$ of the attacker detecting a real system, and second by varying the number of honeypots in the network. Furthermore, we present the players' utilities in the HTSG for different values of the beliefs $p_1$ and $p_2$.

First, we consider the case when no game-theoretic approach is used. We work with ten systems for this simulation, and the defender randomly decides the type of system to install. We first assume that the defender installs five high-interaction honeypots with efficacy values $a_H = 0.69, 0.71, 0.73, 0.75, 0.79$; four low-interaction honeypots with efficacy values $a_L = 0.45, 0.49, 0.51, 0.53$; and one real system. We consider different $p_R$ values. Figure 3 illustrates both players' utilities. We observe that the defender's utility decreases by 16% and the attacker's utility increases by approximately 10% as $p_R$ increases. This is expected because, with increasing $p_R$, the attacker becomes more capable of detecting the presence of real systems in the network and attacking them.

Second, the defender plays the equilibrium strategy of the HTSG, which advises the defender about which configuration to choose among (L, R, H). We again vary $p_R$ and set $p_1 = 0.4$ and $p_2 = 0.77$. Figure 3 shows that the defender's expected utility improves by 112.62% compared to the No GT case for $p_R = 0.72$.

Similarly, we consider the case of No GT, but this time we vary the number of high-interaction and low-interaction honeypots the defender installs, keeping the attacker's probability of detecting the real system fixed at $p_R = 0.5$. We plot the players' utilities in Fig. 4. The figure shows that the defender's utility increases with the number of honeypots, getting improved by

[Fig. 3 panels: players' expected utility in US$ versus $p_R$, the attacker's detection probability of the real system, under the No GT approach (left) and under the HTSG (right).]

100% when increasing the number of honeypots by 2/3. For the same increase in the number of honeypots, the attacker’s utility decreases by 33% when the number of honeypots goes up. We also simulated the case when the HTSG equilibrium strategy is played and the number of honeypots varies and the probability of the attacker detecting the real system equals 0.5 (pR = 0.5) with p1 = 0.4 and p2 = 0.6. Figure 4 shows that the defender’s utility improves by 110.98% compared to the No GT case for 10 honeypots. Last, Fig. 5 shows that the utility changes at the belief value p1 = 0.3 for both players, because at this point the equilibrium changes from (R, NA, p1 ) Players' utility VS the number of honeypots in the network (No GT approach)

Players' utility VS the number of honeypots in the network using HTSG approach

500

-1000 -2000

Players' utility in $ (US dollars)

Players' utility in $ (US dollars)

0 -3000 -4000 -5000 -6000 -7000 -8000 -9000

-500

-1000

-1500

-2000

Defender Attacker

-10000 6(3H,3L)

Defender Attacker

8(4H,4L)

Number of honeypots in the network

10(5H,5L)

-2500 6(3H,3L)

8(4H,4L)

10(5H,5L)

Number of honeypots in the network

Fig. 4. Players’ expected utilities VS the number of honeypots in the network.

94

N. Boumkheld et al. Players' expected utility VS the attacker's belief p1

20

50

Players' expected utility in $ (US dollars)

0

Players' expected utility in $ (US dollars)

Players' expected utility VS the attacker's belief p2

100

-20

-40

-60

-80

-100

0

-50

-100

-150

-200

-250

Defender Attacker

-120

0

0.2

Defender Attacker

0.4

0.6

Attacker's belief p1

0.8

1

-300

0

0.2

0.4

0.6

0.8

1

Attacker's belief p2

Fig. 5. Players’ expected utilities VS the attacker’s beliefs p1 and p2 for the HTSG approach.

to (L, A, p1 ). The utility also changes at p2 = 0.6, because for this value, the equilibrium changes from (R, NA, p2 ) to (H, A, p2 ).
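The belief thresholds at which the equilibrium switches can be computed directly from the expressions in Sect. 4. The sketch below does this for one hypothetical parameter set (the values of $b^A$, $l^A_L$ and $l^A_H$ are illustrative and not taken from the paper); it reports the threshold on $p_1$ (respectively $p_2$) below which $(R, NA)$ is played and above which $(L, A)$ (respectively $(H, A)$) is played.

```python
# Belief thresholds of the HTSG equilibria: p1* = aL*lA_L / (pR*bA + aL*lA_L),
# p2* = aH*lA_H / (pR*bA + aH*lA_H).  Parameter values below are hypothetical.
def thresholds(a_L, a_H, p_R, b_A, l_A_L, l_A_H):
    t1 = a_L * l_A_L / (p_R * b_A + a_L * l_A_L)
    t2 = a_H * l_A_H / (p_R * b_A + a_H * l_A_H)
    return t1, t2

a_L, a_H, p_R = 0.5, 0.7, 0.5
b_A, l_A_L, l_A_H = 100.0, 40.0, 90.0
t1, t2 = thresholds(a_L, a_H, p_R, b_A, l_A_L, l_A_H)

for p1 in (0.1, 0.3, 0.6):
    print(f"p1 = {p1}: equilibrium", "(L, A; p1)" if p1 >= t1 else "(R, NA; p1)")
for p2 in (0.2, 0.6, 0.9):
    print(f"p2 = {p2}: equilibrium", "(H, A; p2)" if p2 >= t2 else "(R, NA; p2)")
print(f"thresholds: p1* = {t1:.3f}, p2* = {t2:.3f}")
```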

6 Conclusions

In this work, we developed a game-theoretic model to analyze the challenge faced by the network administrator/defender in selecting among the following types of systems: a low-interaction honeypot, a high-interaction honeypot, and a system with no honeypot, each having its own set of costs and benefits. If the defender chooses to deploy a honeypot, her aim is to lure the attacker to this honeypot to gain threat intelligence. On the other hand, the attacker has to decide whether to attack or not, given the different costs and benefits of both choices. This interaction between the players is modeled as a dynamic game of complete but imperfect information. We derived its PBNE solutions and presented numerical results with the optimal probability of deploying a type of system for the defender and the optimal attack probability for the attacker under different parametric conditions.

This paper is a first step towards implementing game-theoretic strategies in actual smart grid networks as part of the H2020 SPEAR project. To this end, we are planning to collect data from these networks to instantiate the game parameters and derive the corresponding equilibria. We will then assess how these equilibria improve threat intelligence and defence of the smart grid network as opposed to existing strategies used by the project end users. In terms of theoretical extensions of this paper, future work may consider a more complex model capturing a number of different costs (e.g., deployment, configuration, maintenance) related to honeypots, rather than aggregated values. Secondly, future work may consider a repeated version of the game with belief-update


schemes and dynamic choice of the type of system to deploy based on the updated belief. In this case, we shall investigate the trade-offs between playing instantiated (single-shot) version and iterative version of the game. In addition, in contrast to the current work, future work could investigate the situation where the defender has multiple honeypots in the network. Finally, we could consider a more sophisticated attacker who is able to detect the presence of honeypots in the network using anti-honeypots techniques [15] and assess how the difference in efficacies of the honeypot types affect the players’ decision. Acknowledgement. We thank the anonymous reviewers for their comments. Nadia Boumkheld and Emmanouil Panaousis are supported by the H2020 SPEAR grant agreement, no 787011.

References 1. Li, X., Liang, X., Lu, R., Shen, X., Lin, X., Zhu, H.: Securing smart grid: cyber attacks, countermeasures, and challenges. IEEE Commun. Mag. 50(8), 38–45 (2012) 2. Petrovic, T., Echigo, K., Morikawa, H.: Detecting presence from a WiFi router’s electric power consumption by machine learning. IEEE Access 6, 9679–9689 (2018) 3. Barnum, S.: Standardizing cyber threat intelligence information with the structured threat information expression (STIX). Mitre Corp. 11, 1–22 (2012) 4. Pawlick, J., Colbert, E., Zhu, Q.: A game-theoretic taxonomy and survey of defensive deception for cybersecurity and privacy. arXiv preprint arXiv:1712.05441 (2017) 5. Jicha, A., Patton, M., Chen, H.: SCADA honeypots: an in-depth analysis of Conpot. In: 2016 IEEE Conference on Intelligence and Security Informatics (ISI), pp. 196–198. IEEE (2016) 6. Mairh, A., Barik, D., Verma, K., Jena, D.: Honeypot in network security: a survey. In: Proceedings of the 2011 International Conference on Communication, Computing and Security, pp. 600–605. ACM (2011) 7. Nawrocki, M., W¨ ahlisch, M., Schmidt, T.C., Keil, C., Sch¨ onfelder, J.: A survey on honeypot software and data analysis. arXiv preprint arXiv:1608.06249 (2016) 8. La, Q.D., Quek, T.Q., Lee, J., Jin, S., Zhu, H.: Deceptive attack and defense game in honeypot-enabled networks for the internet of things. IEEE Internet Things J. 3(6), 1025–1035 (2016) 9. Williamson, S.A., Varakantham, P., Hui, O.C., Gao, D.: Active malware analysis using stochastic games. In: Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, vol. 1, pp. 29–36. International Foundation for Autonomous Agents and Multiagent Systems (2012) 10. Wagener, G., State, R., Dulaunoy, A., Engel, T.: Self adaptive high interaction honeypots driven by game theory. In: Guerraoui, R., Petit, F. (eds.) SSS 2009. LNCS, vol. 5873, pp. 741–755. Springer, Heidelberg (2009). https://doi.org/10. 1007/978-3-642-05118-0 51 11. Rowe, N.C., Custy, E.J., Duong, B.T.: Defending cyberspace with fake honeypots. JCP 2(2), 25–36 (2007)


12. P´ıbil, R., Lis´ y, V., Kiekintveld, C., Boˇsansk´ y, B., Pˇechouˇcek, M.: Game theoretic model of strategic honeypot selection in computer networks. In: Grossklags, J., Walrand, J. (eds.) GameSec 2012. LNCS, vol. 7638, pp. 201–220. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34266-0 12 13. Garg, N., Grosu, D.: Deception in honeynets: a game-theoretic analysis. In: 2007 IEEE SMC Information Assurance and Security Workshop, pp. 107–113. IEEE (2007) 14. C ¸ eker, H., Zhuang, J., Upadhyaya, S., La, Q.D., Soong, B.-H.: Deception-based game theoretical approach to mitigate DoS attacks. In: Zhu, Q., Alpcan, T., Panaousis, E., Tambe, M., Casey, W. (eds.) GameSec 2016. LNCS, vol. 9996, pp. 18–38. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-47413-7 2 15. Wang, K., Du, M., Maharjan, S., Sun, Y.: Strategic honeypot game model for distributed denial of service attacks in the smart grid. IEEE Trans. Smart Grid 8(5), 2474–2482 (2017) 16. Wagener, G., State, R., Engel, T., Dulaunoy, A.: Adaptive and self-configurable honeypots. In: 12th IFIP/IEEE International Symposium on Integrated Network Management (IM 2011) and Workshops, pp. 345–352. IEEE (2011) 17. Hayatle, O., Otrok, H., Youssef, A.: A game theoretic investigation for high interaction honeypots. In: 2012 IEEE International Conference on Communications (ICC), pp. 6662–6667. IEEE (2012) 18. Carroll, T.E., Grosu, D.: A game theoretic investigation of deception in network security. Secur. Commun. Netw. 4(10), 1162–1172 (2011) 19. Pawlick, J., Zhu, Q.: Deception by design: evidence-based signaling games for network defense. arXiv preprint arXiv:1503.05458 (2015) 20. Li, H., Yang, X., Qu, L.: On the offense and defense game in the network honeypot. In: Lee, G. (ed.) Advances in Automation and Robotics, Vol. 2. LNEE, vol. 123, pp. 239–246. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-256462 33 21. Li, Y., Shi, L., Feng, H.: A game-theoretic analysis for distributed honeypots. Future Internet 11(3), 65 (2019) 22. Mokube, I., Adams, M.: Honeypots: concepts, approaches, and challenges. In: Proceedings of the 45th Annual Southeast Regional Conference, pp. 321–326. ACM (2007) 23. Jasek, R., Kolarik, M., Vymola, T.: APT detection system using honeypots. In: Proceedings of the 13th International Conference on Applied Informatics and Communications (AIC 2013), pp. 25–29. WSEAS Press (2013) 24. Weiler, N.: Honeypots for distributed denial-of-service attacks. In: Proceedings. Eleventh IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises, pp. 109–114. IEEE (2002) 25. Kelly, G., Gan, D.: Analysis of attacks using a honeypot. In: International Cybercrime, Security and Digital Forensics Conference (2011) 26. Fudenberg, D., Tirole, J.: Perfect Bayesian equilibrium and sequential equilibrium. J. Econ. Theory 53(2), 236–260 (1991) 27. Gibbons, R.: A Primer in Game Theory. Harvester Wheatsheaf, New York (1992)

Discussion of Fairness and Implementability in Stackelberg Security Games

Victor Bucarey1,2 and Martine Labbé1,2

1 Département d'Informatique, Université Libre de Bruxelles, Brussels, Belgium
{vbucarey,mlabbe}@ulb.ac.be
2 INOCS, Inria Lille-Nord Europe, Lille, France

Abstract. In this article we discuss the impact of fairness constraints in Stackelberg Security Games. Fairness constraints can be used to avoid discrimination at the moment of implementing police patrolling. We present two ways of modelling fairness constraints, one with a detailed description of the population and the other with labels. We discuss the implementability of these constraints. In the case that the constraints are not implementable, we present models to retrieve pure strategies that are, on average, the closest to the set of fairness constraints.

Keywords: Fairness · Implementability · Stackelberg Security Games

1 Introduction

In the last years, Stackelberg Security Games have been applied in several real domains such as airport security [11], IRIS for security of flights [7], port [13] and border [1] patrolling, and fare evasion [5], among others. In these games the leader, also called the defender, must protect a set of targets with the limited resources available from a possible attack performed by the follower, or attacker. The payoff structure depends only on whether the target attacked is being protected or not [8]. Each defender resource has a set of possible schedules, that is, the possible subsets of targets it can protect in one strategy. If all the resources have the same possible set of schedules, we say the game has homogeneous resources; otherwise, it has heterogeneous resources. Several mixed integer formulations, for both general Bayesian Stackelberg games and Bayesian Stackelberg Security games, are presented in [4,9]. One of the key points of the scalability of SSGs is the representation of the set of strategies of the defender. Instead of taking into account every single pure strategy, this set is represented through the frequency with which each target is protected. These frequencies are called the coverage distribution.

The authors have been partially supported by the Fonds de la Recherche Scientifique - FNRS under Grant no PDR T0098.18.


In some applications, payoff matrices are generated by a black box. In others, they are built from real data. In both cases these payoffs can lead to discriminative outcomes: in the first case because there is no known methodology, and in the second case because of the manner in which the data is gathered. For instance, according to [10], officers generally stop black drivers at higher rates than white drivers, and stop Hispanic drivers at similar or lower rates than white drivers. Another example comes from the United Kingdom. According to the governmental source Ethnicity in the UK¹, in 2016/17 there were 4 stop and searches for every 1,000 White people, compared with 29 stop and searches for every 1,000 Black people. Among specific ethnic groups, the Chinese and Mixed White/Asian groups consistently had the lowest rates of stop and search since 2009/10 [6].

The relationship between algorithms and discrimination is a major concern for the European Union Agency for Fundamental Rights - FRA [12]. They state: "When algorithms are used for decision making, there is potential for discrimination against individuals. The principle of non-discrimination, as enshrined in Article 21 of the Charter of Fundamental Rights of the European Union (EU), needs to be taken into account when applying algorithms to everyday life." And later: "Big data claims to be neutral. It isn't . . . machine learning depends upon data that has been collected from society, and to the extent that society contains inequality, exclusion or other traces of discrimination, so too will the data. Consequently, unthinking reliance on data mining can deny members of vulnerable groups full participation in society."

While there are several real-world problems where fairness is important, there is no single way to measure fairness. For instance, for classification problems the following considerations are used as fairness constraints: disparate treatment, which requires that the classifier output does not change after observing a sensitive feature; disparate impact, which requires that the probability of being classified with a positive value does not change with the sensitive feature; and disparate mistreatment, which requires that the probability of misclassification does not change with some sensitive feature [15]. In the SSG context, bias in the data can translate into allocating more or less surveillance based on race, wealth/poverty or any other type of discrimination. In this work we study different ways to include constraints in order to avoid discrimination in SSGs from a tactical point of view. We are interested in studying how these considerations can be implemented and how much we lose, in terms of expected utility, by adding them.

An SSG can be seen as the problem of allocating resources to targets in a random fashion. A relevant work originating from the problem of designing randomized allocations is presented in [3]. In this context, the authors define the concept

http://www.ethnicity-facts-figures.service.gov.uk/.


of implementability of a set of constraints, under which any random allocation satisfying this set of constraints can be decomposed into a convex combination of pure allocations satisfying it. Sufficient and necessary conditions for the implementability of a set of constraints are given, based on the bihierarchy structure of the constraints. Sometimes there are constraints that are necessary in real applications but are not implementable in general. In this article we show that fairness constraints based on a detailed description of the population inside each target might not be implementable. That means that, on average, solutions satisfy those constraints, but the pure strategies that implement such a solution might not satisfy them. We show models and algorithms to retrieve pure strategies minimizing the violations of such constraints. The questions that we aim to answer are the following:

– Is it possible to model coverage distributions including fairness considerations?
– Are they implementable? If not, how can we include those considerations in practical settings?
– How much does the defender lose, in terms of expected utility, by including these considerations?

Our first contribution is to model fairness constraints on the coverage probabilities in SSGs. We present two models, one focused on a detailed description of the population, the second one based on labels on the targets. Our second contribution is to show that the model based on labels is implementable. Our third contribution is to provide a methodology for implementing schedules that allocate low probability to strategies that violate the set of non-implementable constraints the most.

The rest of the paper is organized as follows. In Sect. 2 we introduce the main concepts related to SSGs, implementability and random allocations, and give an introductory example for this problem. In Sect. 3 we provide the models that are discussed in this work. In Sect. 4 we provide a discussion about the implementability of the coverage distributions returned by these models; some extensions are also presented. In Sect. 5 computational results are shown. Our models are tested on a realistic instance presented in Sect. 6. Finally, our conclusions are presented in Sect. 7.

2 Problem Statement and Notation

2.1 Stackelberg Security Games and Compact Formulations

In SSGs, the leader, named in this context the defender, must protect a set of targets $J$ from the followers, named attackers. The payoffs for the players only depend on whether the target attacked is protected or not. In consequence, several strategies have identical payoffs. Thus, we denote by $D^k(j|p)$ the utility of the defender when an attacker of type $k \in K$ attacks a covered target $j \in J$ and by $D^k(j|u)$


the utility of the defender when an attacker of type $k \in K$ attacks an unprotected target $j \in J$. Similarly, the utility of an attacker of type $k \in K$ when successfully attacking an unprotected target $j \in J$ is denoted by $A^k(j|u)$, and that attacker's utility when attacking a covered target $j \in J$ is denoted by $A^k(j|p)$. We denote by $\pi_k$ the probability of the defender facing attacker $k$. In the heterogeneous resources setting there is a set $\Omega$ of resources, $|\Omega| = m$, in which each one can be allocated to a possible subset of targets $J_\omega$. If $J = J_\omega$ for each $\omega \in \Omega$, we call it homogeneous resources. A pure strategy $i \in I$ for the leader is an allocation of resources to targets. That is,

$$I = \Big\{ \{a_\omega \in \{0,1\}^{|J_\omega|}\}_{\omega \in \Omega} \;:\; \sum_{j \in J_\omega} a_{\omega j} \le 1 \ \ \forall\, \omega \in \Omega \Big\}$$

in the case of heterogeneous resources, or

$$I = \Big\{ a \in \{0,1\}^{|J|} \;:\; \sum_{j \in J} a_j \le m \Big\}$$

in the case of homogeneous resources. In this context, in the homogeneous case $|I| = \sum_{k=1}^{m} \binom{n}{k}$. The authors in [8] provide a compact formulation using the transformation

$$c_j = \sum_{a \in I : a_j = 1} x_a, \qquad (1)$$

where $c_j$ represents the coverage frequency of target $j$ and $x_a$ represents the probability of playing strategy $a \in I$. The formulation for computing an SSE in the homogeneous case (HOM) is stated as follows:

(HOM)  $\max_{c,q,f,s} \ \sum_{k \in K} \pi_k f^k$  (2)

s.t.  $f^k \le D^k(j|p)\,c_j + D^k(j|u)(1 - c_j) + M(1 - q_j^k)$,  $j \in J,\ k \in K$  (3)
      $0 \le c_j \le 1$,  $j \in J$  (4)
      $\sum_{j \in J} c_j = m$  (5)
      $\sum_{j \in J} q_j^k = 1$,  $k \in K$  (6)
      $0 \le s^k - A^k(j|p)\,c_j - A^k(j|u)(1 - c_j) \le M(1 - q_j^k)$,  $j \in J,\ k \in K$  (7)
      $f^k, s^k \in \mathbb{R}$,  $k \in K$  (8)
      $q_j^k \in \{0,1\}$,  $j \in J,\ k \in K$  (9)


where cj represents the frequency with which target j is protected. Variables f k represent the expected utility for the defender of facing attacker k and sk the expected utility of attacker k. The objective function (2) to be maximised represents the expected utility of the defender. Expression (3) states an upper


bound for $f^k$ which is tight for the target selected by each attacker $k$. Expressions (4) and (5) define the coverage probabilities $c_j$. Expression (6) states that each attacker selects a pure strategy. For each type of attacker, expression (7) states that the target attacked maximizes their expected utility. An extension to the heterogeneous case is stated by introducing variables $c_{\omega j}$ satisfying

$$c_j = \sum_{\omega \in \Omega : j \in J_\omega} c_{\omega j}, \qquad j \in J, \qquad (10)$$

and adding constraints limiting the amount of coverage for every single resource:

$$\sum_{j \in J_\omega} c_{\omega j} \le 1, \qquad \omega \in \Omega. \qquad (11)$$
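As an illustration of the compact formulation above, the following is a minimal sketch of (HOM) in PuLP with small hypothetical payoff data (the payoff values, set sizes and big-M constant are illustrative only, not taken from the paper):

```python
# Minimal sketch of (HOM) with hypothetical data: D[k][j] = (D^k(j|p), D^k(j|u)),
# A[k][j] = (A^k(j|p), A^k(j|u)).  Requires: pip install pulp
import pulp

J, K, m, M = range(5), range(2), 2, 10 ** 4
pi = [0.6, 0.4]                                    # probability of each attacker type
D = {k: {j: (5, -10 - j) for j in J} for k in K}   # illustrative defender payoffs
A = {k: {j: (-4, 8 + j) for j in J} for k in K}    # illustrative attacker payoffs

prob = pulp.LpProblem("HOM", pulp.LpMaximize)
c = pulp.LpVariable.dicts("c", J, lowBound=0, upBound=1)   # coverage, (4)
q = pulp.LpVariable.dicts("q", (K, J), cat="Binary")       # attacked target, (9)
f = pulp.LpVariable.dicts("f", K)                          # defender utilities, (8)
s = pulp.LpVariable.dicts("s", K)                          # attacker utilities, (8)

prob += pulp.lpSum(pi[k] * f[k] for k in K)                # objective (2)
prob += pulp.lpSum(c[j] for j in J) == m                   # (5)
for k in K:
    prob += pulp.lpSum(q[k][j] for j in J) == 1            # (6)
    for j in J:
        Dp, Du = D[k][j]
        Ap, Au = A[k][j]
        prob += f[k] <= Dp * c[j] + Du * (1 - c[j]) + M * (1 - q[k][j])    # (3)
        prob += s[k] - Ap * c[j] - Au * (1 - c[j]) >= 0                    # (7), left
        prob += s[k] - Ap * c[j] - Au * (1 - c[j]) <= M * (1 - q[k][j])    # (7), right

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("coverage distribution:", [round(pulp.value(c[j]), 3) for j in J])
```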

By solving these optimization problems, a coverage distribution is obtained. In order to obtain a mixed strategy $x$ in the original space $I$ that fits with (1), the following method, described in Algorithm 1, could be applied. An example is presented in Fig. 1.

Algorithm 1. Box Method
Require: $c \in \mathbb{R}^{|J|}$ feasible coverage.
Step 1: For each resource $\omega \in \Omega$, consider a column of height 1.
Step 2: For each resource $\omega$, fill up the column with the values of $c_j$, with $j \in J_\omega$.
Step 3: Define $x$ by extending each rectangle line into a horizontal line crossing all columns. The area between two horizontal lines represents a defender strategy. This area identifies a set of protected targets, at most one for each resource $\omega$. The height of this area represents the probability of the corresponding strategy.

[Fig. 1 illustration data: coverage c = (0.5, 0.4, 0.5, 0.3, 0.3) over targets j1–j5 with two homogeneous resources ω1, ω2; the resulting mixed strategy is x{1,3} = 0.4, x{1,4} = 0.1, x{2,4} = 0.2, x{2,5} = 0.2, x{3,5} = 0.1.]

Fig. 1. Example of Algorithm 1 with 5 targets and two homogeneous resources. (a) Coverage probability (b) Step 1 and 2. (c) Step 3.
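A minimal sketch of the Box Method for the two-resource example of Fig. 1 is given below (the column filling used is one illustrative choice; other orderings of the targets yield different, equally valid decompositions):

```python
# Box Method (Algorithm 1) for two homogeneous resources, reproducing the
# decomposition of Fig. 1 for c = (0.5, 0.4, 0.5, 0.3, 0.3), m = 2.

def box_method(columns, eps=1e-9):
    """columns: list of lists of (target, coverage) filling each column up to height 1."""
    # Collect all horizontal cut points (partial sums inside each column).
    cuts = {0.0, 1.0}
    for col in columns:
        h = 0.0
        for _, cj in col:
            h += cj
            cuts.add(round(h, 9))
    cuts = sorted(cuts)

    strategies = []  # list of (probability, frozenset of protected targets)
    for lo, hi in zip(cuts, cuts[1:]):
        if hi - lo <= eps:
            continue
        protected = set()
        for col in columns:
            h = 0.0
            for target, cj in col:
                # the slice [lo, hi) lies inside this target's rectangle
                if h - eps <= lo and hi <= h + cj + eps:
                    protected.add(target)
                h += cj
        strategies.append((hi - lo, frozenset(protected)))
    return strategies

# Column filling used in Fig. 1: resource 1 holds j1, j2 and part of j3; resource 2 the rest.
cols = [[("j1", 0.5), ("j2", 0.4), ("j3", 0.1)],
        [("j3", 0.4), ("j4", 0.3), ("j5", 0.3)]]
for prob, strat in box_method(cols):
    print(f"x{sorted(strat)} = {prob:.1f}")
```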


2.2 Population and Unfair Allocations

The optimal solution found through the optimisation problem stated above relies on how the payoff matrices are computed. Even so, the security agency may end up discriminating by allocating fewer (or more) resources to some groups. We study two models to avoid such a situation. The first model considers that in each target $j \in J$ there is a population $p_{jt}$ of type $t$ from a set of possible population types $T$. Examples of $T$ can include different races, religions, or income levels, among others. In order to avoid discrimination, it could be desirable that the total coverage allocated to each population type is proportional to its number of inhabitants. A second model can be developed with a slightly different description of the population. Instead of considering a fraction of the population in each target, consider that each target has a label $\ell \in L$, denoting the most representative population in the target. For example, the Latin, Asian and African areas in the main cities in Europe; also, in some cities there is a division into high-income, middle-class and low-income areas. In the following example, we show that even in small instances the problem of unfairness may occur.

j j1 j2 j3 j4 j5

(a)
Target | D^{k1} | A^{k1} | D^{k2} | A^{k2} | D^{k3} | A^{k3}
j1 | (0, -6) | (34, -30) | (42, -9) | (15, -44) | (32, -29) | (49, -33)
j2 | (11, -26) | (9, -16) | (47, -3) | (12, 0) | (37, -39) | (16, -48)
j3 | (2, -4) | (11, -39) | (9, -11) | (15, -16) | (25, -47) | (26, -5)
j4 | (0, -35) | (3, -11) | (37, -6) | (22, -32) | (29, -23) | (21, -48)
j5 | (9, -28) | (20, -48) | (47, -33) | (7, -14) | (42, -25) | (30, -40)

(b)
Target | t1 | t2 | t3 | Label
j1 | 10 | 100 | 270 | ℓ3
j2 | 50 | 100 | 0 | ℓ2
j3 | 10 | 100 | 50 | ℓ2
j4 | 70 | 10 | 0 | ℓ1
j5 | 0 | 50 | 180 | ℓ3
Total | 14% | 36% | 50% |

In the last row of Table 1(b) a label $\ell$ for each target is stated. In this example, a target is labeled $\ell_i$ if population type $t_i$ is the most representative one. That is, we label each target in $L = \{\ell_1, \ell_2, \ell_3\}$ according to the most representative population type: $\ell_j = \ell_i$ if $i = \arg\max_{t_i \in T} p_{j t_i}$. It


would be desirable that the coverage that each part of the population receives is proportional to the share of the total population that it represents.

Table 2. Description of the solution of Example 1 in terms of population: (a) coverage in the Strong Stackelberg equilibrium; (b) coverage received by each type of population and the deviation from the fraction of population that it represents; (c) coverage received by each labeled group of targets.

(a)
Target | j1 | j2 | j3 | j4 | j5
c_j | 0.416 | 0.675 | 0.358 | 0.335 | 0.216

(b)
Type | c_t | % resources | Deviation
t1 | 0.552 | 27.6% | 97.143%
t2 | 0.872 | 43.6% | 21.111%
t3 | 0.576 | 28.8% | -42.4%

(c)
Label | c_ℓ | % resources | Deviation
ℓ1 | 0.335 | 16.75% | -16.25%
ℓ2 | 1.033 | 51.65% | 29.125%
ℓ3 | 0.632 | 31.6% | -21.0%

Table 2(a) shows the coverage given by model (HOM), and Table 2(b) shows how this coverage is allocated to each part of the population. The proportional coverage on population type $t$ is computed as $c_t = \sum_{j \in J} c_j \frac{p_{jt}}{\sum_{t' \in T} p_{jt'}}$. Population type $t_1$ receives proportionally 97.143% more coverage than the share of population it represents, while population type $t_3$ receives 42.4% less. Note that $\ell_1$ represents 20% of the targets, and $\ell_2$ and $\ell_3$ 40% each. In Table 2(c), we show the proportion of resources allocated to each label. In this instance, both evaluations give an unbalanced allocation of the resources.
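A quick numerical check of Table 2(b)–(c) from the data of Tables 1(b) and 2(a) (using the rounded coverage values, so the figures match up to rounding):

```python
# Proportional coverage per population type and per label for Example 1.
pop = {"j1": (10, 100, 270), "j2": (50, 100, 0), "j3": (10, 100, 50),
       "j4": (70, 10, 0),    "j5": (0, 50, 180)}
label = {"j1": "l3", "j2": "l2", "j3": "l2", "j4": "l1", "j5": "l3"}
c = {"j1": 0.416, "j2": 0.675, "j3": 0.358, "j4": 0.335, "j5": 0.216}
m = 2

# c_t = sum_j c_j * p_jt / sum_t' p_jt'
total_pop = sum(sum(p) for p in pop.values())
for t in range(3):
    ct = sum(c[j] * pop[j][t] / sum(pop[j]) for j in pop)
    share = sum(pop[j][t] for j in pop) / total_pop      # fraction of population of type t
    dev = (ct / m) / share - 1
    print(f"t{t+1}: c_t = {ct:.3f}, resources = {ct/m:.1%}, deviation = {dev:+.1%}")

# c_l = sum_{j in J_l} c_j, compared against the fraction of targets with label l
for l in ("l1", "l2", "l3"):
    cl = sum(c[j] for j in pop if label[j] == l)
    share = sum(1 for j in pop if label[j] == l) / len(pop)
    dev = (cl / m) / share - 1
    print(f"{l}: c_l = {cl:.3f}, resources = {cl/m:.2%}, deviation = {dev:+.2%}")
```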

Random Allocations, Implementability and Bi-hierarchies

We now introduce some mathematical notions that allow us to develop the models and results of this work. The problem of finding an SSE can be seen as finding a specific set of random allocations between resources Ω and a set of targets J under some considerations. Then, the coverage vector cωj can be seen as a random allocation as presented in [3]. They study constraints of the type: qSL ≤



cωj ≤ qSU

S ⊆ Ω × J,

(ω,j)∈S

where S is called constraint set and qSL , qSU are positive integers named quotas on S. The full of set of constraints is named constraint structure H. A random allocation c is implementable under the constraint structure H and quotas q = {(qSL , qSU )}S∈H if it can be written as a convex combination of pure allocations feasible under H and quotas q. Constraint structure H is universally

104

V. Bucarey and M. Labb´e

implementable if, for any integer quotas q, every random allocation satisfying constraints in H is implementable under q. A constraint structure is a hierarchy if for any pair S, S  either S ⊆ S  or S  ⊆ S or S ∩ S  = ∅. A constraint structure H is a bihierarchy if it can be partitioned in two hierarchies H1 and H2 , that is, H1 ∩H2 = ∅ and H1 ∪H2 = H. Necessary and sufficient conditions over the constraint structure and implementability are given by [3] through the following two theorems: Theorem 1 (Sufficiency). If a constraint structure is a bihierarchy and quotas q are integers, then it is universally implementable. Theorem 2 (Necessity). If a constraint structure contains all the rows and columns constraints and is not a bihierarchy, then it is not universally implementable. In the SSG context, the implementability of a set of constraints means that each coverage distribution feasible can be decomposed in a convex combination of pure strategies, or pure patrols, all of them satisfying the constraints. Note that if we only restrict to allocate resources to targets, the set of constraints forms a bihierarchy and all the coverage distributions c are implementable.

3 Models

In this section we discuss how to restrict the coverage distributions taking into account the issues described in Sect. 2.2. We describe two ways of modelling fairness constraints: first, we model constraints with a detailed description of the population; then, we restrict the possible coverage considering aggregated information in terms of labels on the targets. By doing this, we generate coverage probabilities that are not significantly correlated with sensitive features such as race, income level, etc.

3.1 Focus on the Population

In this setting we assume a description of the population in each target $j$, given by the percentage of population of type $t \in T$, denoted by $p_{jt}$. In order to avoid discrimination, we might consider restricting the amount of coverage that each population type receives. That is, the coverage vector $c$ should satisfy

$$q_t^L \le \sum_{j \in J} c_j\, \tilde p_{jt} \le q_t^U, \qquad t \in T, \qquad (12)$$

where $q = (q_t^L, q_t^U)$ are the quotas on the total coverage of population type $t \in T$ and $\tilde p_{jt} = \frac{p_{jt}}{\sum_{t' \in T} p_{jt'}}$ denotes the fraction of population of type $t$ in target $j$. In this work, we use quotas of the form

$$q_t^L = (1-\alpha)\, m \sum_{j \in J} \tilde p_{jt}, \qquad q_t^U = (1+\alpha)\, m \sum_{j \in J} \tilde p_{jt}, \qquad (13)$$

where q = (qtL , qtU ) are the quotas of the total coverage performed over the p population t ∈ T . We denote p˜jt =   jt p  the fraction of population type t t ∈T jt in target j. In this work, we use quotas of the form:

Discussion of Fairness and Implementability in Stackelberg Security Games

qtL = (1 − α)m



p˜jt

qtU = (1 + α)m

j∈J



p˜jt

105

(13)

j∈J

where α is the maximum acceptable percentage of deviation from the total fraction of the population t multiplied by the number of the resources available (e.g., 10%). In constraints (12), we assume that the total coverage inside a target is distributed proportionally to each population type. This assumption can be relaxed, through introducing some nonlinear relationship between the coverage and the probability of being covered given the composition of the population inside target j. Anyway, this topic is out of the scope of this paper, but it is still an interesting research question. As we mentioned before, constraints (12) are not universally implementable. This is mainly because these constraints do not induce integer extreme points. In consequence, the vector coverage c under its form can not be decomposed in terms of pure strategies satisfying constraints (12). We will discuss how to deal with this issue in Sect. 4. 3.2

Label Focus

A different approach can be stated as follows. There exists for each target a label  ∈ L, representing a type of population representative in that target. In that case, we should think about protecting targets with an amount of coverage proportional to the percentage of the population they represent. Information about population is aggregated in each target, and in consequence, it is a more relaxed way of modelling fairness constraints. Formally, define J the targets labeled by . Note that {J }∈L defines a partition of J. For each label , there is a minimum and a maximum number of resources assigned to protect zones in J , denoted again by qL and qU respectively. Then, constraints on the amount of coverage can be stated as:  cj ≤ qU  ∈ L. (14) qL ≤ j∈J

where the quotas q are given by:

|J | L q = (1 − α)m |J|

qU



|J | = (1 + α)m |J|

 (15)

In this case, we use integer quotas to establish the implementability result in Theorem 3. In the model focused on the population, even using integer quotas the implementability result does not hold. Example 1 Continued. Consider α = 25%. In Table 3, we show the bounds for each type of population and labels for both models respectively. Figure 2(a) shows the optimal coverage in each target by including the fairness constraints in both models and the coverage given by (HOM). For a fixed α, the model on

106

V. Bucarey and M. Labb´e

Table 3. Lower and upper bounds in the coverage for both models when (α = 0.25) Type qtL

qtU

Label qL qU

t1

0.21 0.35 1

0

1

t2

0.54 0.9

2

0

1

t3

0.75 1.25 3

0

1

(a)

(b)

Fig. 2. (a) Comparison of the optimal coverage without any fairness consideration and the models focused on Labels and Population. (b) Defender expected utility in function of α.

labels is less restrictive than the one focused on the population. This explains the difference between the optimal coverage given by (HOM). By the same reason, the difference in terms of Defender Expected utility for different values of α for the model of population is greater than the model with labels, as is shown in Fig. 2(b). The optimal coverage, in both cases should be implemented by sampling pure strategies. We showed in Table 4, one possible decomposition of these solutions using Algorithm 1. Note that strategies in model focused on population are not Table 4. Decomposition for the optimal coverage in (a) model focused on population and (b) model focused on labels, using Algorithm 1. In both cases we use α = 25%.

Target j1 j2 j3 j4 j5

λ1 λ2 λ3 λ4 λ5 cpop 0.324 0.17 0.074 0.191 0.241 0.494 1 1 0 0 0 0.265 0 0 1 1 0 0.565 1 0 0 0 1 0.244 0 1 1 0 0 0.432 0 0 0 1 1 (a)

Target j1 j2 j3 j4 j5

λ1 λ2 λ3 λ4 clab 0.063 0.349 0.376 0.212 0.412 1 1 0 0 0.651 1 0 1 1 0.349 0 1 0 0 0.376 0 0 1 0 0.212 0 0 0 1 (b)

Discussion of Fairness and Implementability in Stackelberg Security Games

107

implementable. In particular, the third strategy in Table 4(a) that covers targets j2 and j4 allocates 0 resources to the population type t3 .

4

Implementability and Extensions

In this section we discuss the implementability of both models. First, we show that if quotas used in the model of labels are integers, then every coverage satisfying fairness constraints are implementable. Then, we discuss how to find pure strategies that are the closest to the set of fairness constraints in the model of population and that fits with the coverage distribution. Finally, we discuss about some extensions of the models and when they preserve this property. 4.1

4.1 Labels Are Implementable, Population Constraints Are Not

Now we show that the model focused on labels is implementable.

Theorem 3. If for each ℓ ∈ L the quotas q_ℓ^L, q_ℓ^U are integers, then conditions (14) are universally implementable.

Proof. We prove that the set of constraints forms a bihierarchy in the problem of allocating targets to resources. In particular, any coverage vector satisfies:

$0 \le c_{\omega j} \le 1 \qquad \omega \in \Omega,\ j \in A(\omega) \qquad (16)$
$0 \le c_{\omega j} \le 0 \qquad \omega \in \Omega,\ j \notin A(\omega) \qquad (17)$
$0 \le \sum_{j \in J} c_{\omega j} \le 1 \qquad \omega \in \Omega \qquad (18)$
$0 \le \sum_{\omega \in \Omega} c_{\omega j} \le 1 \qquad j \in J \qquad (19)$
$q_\ell^L \le \sum_{\omega \in \Omega} \sum_{j \in J_\ell} c_{\omega j} \le q_\ell^U \qquad \ell \in L \qquad (20)$

Conditions (16) and (17) correspond to singletons {(ω, j)} ⊆ Ω × J. Condition (18) can be represented by the sets {ω} × J, and condition (19) by the sets Ω × {j}. Finally, conditions (20) can be represented by the sets {(ω, j) : j ∈ J_ℓ, ω ∈ Ω}. We show that by grouping conditions (16), (17) and (18) on one side, and conditions (19) and (20) on the other, we obtain a bihierarchy. Formally, we define the following families of sets:

H1 = {{(ω, j)} : (ω, j) ∈ Ω × J} ∪ {{ω} × J : ω ∈ Ω}
H2 = {Ω × {j} : j ∈ J} ∪ {{(ω, j) : j ∈ J_ℓ, ω ∈ Ω} : ℓ ∈ L}

First, we show that H1 is a hierarchy. Clearly, any two singletons are disjoint, and the same holds for any two elements of {{ω} × J : ω ∈ Ω}. Moreover, each singleton {(ω, j)} is included in the set {ω} × J and has empty intersection with any other set {ω'} × J, ω' ≠ ω. Hence, H1 is a hierarchy. Now we prove that H2 is a hierarchy. Any two elements of {Ω × {j} : j ∈ J} are disjoint, and the same holds for any two elements of {{(ω, j) : j ∈


J_ℓ, ω ∈ Ω} : ℓ ∈ L}, because {J_ℓ} induces a partition of J. Now, if we take one element from each group, say with indices j and ℓ, there are two cases: either j ∈ J_ℓ, in which case Ω × {j} ⊆ {(ω, j) : j ∈ J_ℓ, ω ∈ Ω}; or j ∉ J_ℓ, in which case Ω × {j} ∩ {(ω, j) : j ∈ J_ℓ, ω ∈ Ω} = ∅. Hence, H2 is also a hierarchy. Clearly H1 ∩ H2 = ∅. Therefore the set of conditions B = H1 ∪ H2 forms a bihierarchy and, under any integer quotas, the expected allocation is implementable; the result follows. A graphical representation of the proof is shown in Fig. 3.

Fig. 3. Representation of the bihierarchy. (a) H1, consisting of constraints (18) and the singletons. (b) H2, consisting of constraints (19) and (20).
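The structural fact used in this proof is that H1 and H2 are hierarchies, i.e. laminar families in which any two member sets are either disjoint or nested. The generic check below is only an illustration of that property; the tiny instance at the end (two resources, two targets) is hypothetical and not taken from the paper.

```python
from itertools import combinations

def is_hierarchy(sets):
    """Return True if every pair of sets is disjoint or nested (a laminar family)."""
    sets = [frozenset(s) for s in sets]
    for a, b in combinations(sets, 2):
        if a & b and not (a <= b or b <= a):
            return False
    return True

# Hypothetical instance with resources w1, w2 and targets j1, j2 (four cells of Omega x J).
cells = [("w1", "j1"), ("w1", "j2"), ("w2", "j1"), ("w2", "j2")]
H1 = [{c} for c in cells] + [{("w1", "j1"), ("w1", "j2")}, {("w2", "j1"), ("w2", "j2")}]
print(is_hierarchy(H1))  # True: each singleton is nested inside its row set {w} x J
```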

4.2 Approximating the Implementability

Coverage frequencies in the model focused on the population are not implementable in general; that is, they cannot be decomposed into pure strategies satisfying the fairness constraints (12). They can, however, be decomposed into pure strategies from the original strategy set, i.e. pure strategies covering at most m targets. This can be performed by Algorithm 1. On the one hand, this algorithm generates different decompositions depending on the order in which the targets are considered. For instance, in Fig. 1, targets were included in lexicographic order; if the algorithm considers the order j2, j3, j1, j4, j5, the output probabilities x will be different. On the other hand, Algorithm 1 does not take into account whether the strategies produced satisfy conditions such as fairness, or are close to satisfying them. We would like to obtain a decomposition that is as fair as possible, allocating low probability to strategies that are unfair and high probability to the fairest ones. Formally, let P1 be the convex hull of the binary encodings of the set I of pure strategies, and let P2 be the polyhedron of all coverage vectors satisfying the fairness constraints in the population model, that is, P2 = {c ∈ P1 : c satisfies (12)}. We want to find a convex decomposition of a point in P2 in terms of vertices of P1 such that the weighted sum of the


violations of constraints (12) over the strategies is minimised, where the weights are those of the convex decomposition. By doing this, we aim to obtain a set of strategies implementing the optimal fair coverage while allocating low probability to the strategies that are unfair. We now present some models to find such a decomposition. We first formulate a non-linear model that yields a decomposition into pure strategies in which the violation of constraints (12) is minimised. Let M = {1, . . . , U} be a set of indices, where U is an upper bound on the number of strategies needed to decompose c in terms of pure strategies in I. We create a vector of variables a^i as the binary encoding of strategy i ∈ M, where a_ij = 1 if target j is covered by strategy i. The formulation is as follows:



$\min_{a,\lambda,\epsilon} \ \sum_{i \in M} \sum_{t \in T} \lambda_i \epsilon_{it} \qquad (21)$
$\sum_{i \in M} \lambda_i = 1 \qquad (22)$
$\sum_{i \in M} \lambda_i a_{ij} = c_j \qquad j \in J \qquad (23)$
$\sum_{j \in J} a_{ij} \le m \qquad i \in M \qquad (24)$
$-\epsilon_{it} + q_t^L \le \sum_{j \in J} a_{ij}\, \tilde{p}_{jt} \le q_t^U + \epsilon_{it} \qquad t \in T,\ i \in M \qquad (25)$
$a \in \{0,1\}^{|J||M|}, \quad \lambda \in \mathbb{R}^{|M|}_{\ge 0} \qquad (26)$
$\epsilon \ge 0 \qquad (27)$

where λ_i is the weight of strategy i in the decomposition and ε_it measures the violation of the fairness constraint for each strategy. The objective function (21) minimizes the weighted violation of constraints (25). Equation (22) states that the weights must sum to 1 and Eq. (23) ensures that the convex combination matches c. Constraints (24) and (26) define the pure strategy i. Expression (25), together with ε_it ≥ 0, defines the maximum deviation for each population type. This formulation is a non-convex mixed-integer non-linear problem, intractable even for small instances. We linearize it by introducing variables γ_ij = λ_i a_ij and μ_it = λ_i ε_it, and re-scaling constraints (25) by λ_i. The resulting mixed-integer linear problem (MILP) is:

$(\text{DEC}) \quad \min_{a,\gamma,\mu} \ \sum_{i \in M} \sum_{t \in T} \mu_{it} \qquad (28)$
s.t. constraints (22), (24), (26)
$\sum_{i \in M} \gamma_{ij} = c_j \qquad j \in J \qquad (29)$
$-\mu_{it} + \lambda_i q_t^L \le \sum_{j \in J} \gamma_{ij}\, \tilde{p}_{jt} \le \lambda_i q_t^U + \mu_{it} \qquad t \in T,\ i \in M \qquad (30)$
$\gamma_{ij} \le \lambda_i \qquad j \in J,\ i \in M \qquad (31)$
$\gamma_{ij} \le a_{ij} \qquad j \in J,\ i \in M \qquad (32)$
$a_{ij} + \lambda_i - 1 \le \gamma_{ij} \qquad j \in J,\ i \in M \qquad (33)$
$\mu \ge 0 \qquad (34)$


where constraints (31), (32) and (33) define the product λ_i a_ij. We name this MILP (DEC). This formulation has two main drawbacks. First, we have to know a priori an upper bound on the number of strategies needed to achieve the best decomposition. Second, its linear relaxation always has optimal value equal to zero; the formulation is therefore weak, and as a consequence the MILP algorithms implemented in commercial optimization software may perform very poorly. We illustrate this issue in the next section. In order to solve the problem efficiently, we propose the following column generation algorithm. Consider a set of feasible strategies. For a given strategy, it is straightforward to compute the violation of the fairness constraints; we denote the violation of strategy a^i by v(a^i). We then state the following linear optimisation problem, the master problem (MP):

$(\text{MP}) \quad \min_{\lambda} \ \sum_{i \in M} \lambda_i\, v(a^i) \qquad (35)$
$\text{s.t.} \quad \sum_{i \in M} \lambda_i a^i = c \qquad (36)$
$\sum_{i \in M} \lambda_i = 1 \qquad (37)$
$\lambda_i \ge 0 \qquad (38)$

with the difference that here a^i and v(a^i) are known parameters of the problem. We denote by α ∈ R^{|J|} and β ∈ R the dual variables associated with constraints (36) and (37), respectively. The column generation algorithm works as follows. First, choose a set of feasible pure strategies; they can be obtained with Algorithm 1. Solve (MP) and retrieve the dual variables. Then, compute the most negative reduced cost of a possible new strategy. This optimization problem, called the column generator (CG), is stated as follows:

$(\text{CG}) \quad \min_{a} \ \sum_{t \in T} v_t(a) - \sum_{j \in J} \alpha_j a_j - \beta \qquad (39)$
$\text{s.t.} \quad \sum_{j \in J} a_j \le m \qquad (40)$
$-v_t(a) + q_t^L \le \sum_{j \in J} a_j\, \tilde{p}_{jt} \le q_t^U + v_t(a) \qquad t \in T \qquad (41)$
$a \in \{0,1\}^{|J|} \qquad (42)$

where the objective function minimizes the reduced cost of the newly generated strategy. If the strategy generated by (CG) has a nonnegative reduced cost, the algorithm stops and the optimal solution of the master problem is the optimal solution of the whole problem. Otherwise, the new strategy is added to M with cost v(a) = Σ_{t∈T} v_t(a).

Example 1 Continued. We now decompose the coverage c^pop using formulation (DEC). The decomposition is shown in Table 5. The distance decreases from 0.284 to 0.258. Note that the strategy covering targets j2 and j4 (which allocates no resources to population type t3) is no longer present.
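As an illustration of the pricing step, the sketch below builds and solves (CG) for given dual values α and β using the PuLP modelling library with its default CBC solver (the paper's experiments use CPLEX; PuLP is used here only to keep the sketch self-contained). The argument names and data layout are assumptions for illustration; in a full column-generation loop one would alternately solve (MP) as a linear program, pass its duals to this routine, and stop once the returned reduced cost is nonnegative.

```python
import pulp

def pricing_problem(alpha, beta, p_tilde, q_lo, q_hi, m):
    """One pricing step of the column generation: build and solve (CG).

    alpha: dict target -> dual of constraint (36); beta: dual of constraint (37)
    p_tilde: dict (target, type) -> population fraction; q_lo, q_hi: dicts type -> quotas
    Returns (reduced_cost, strategy as dict target -> 0/1, total violation v(a)).
    """
    targets = sorted(alpha)
    types = sorted(q_lo)
    prob = pulp.LpProblem("CG", pulp.LpMinimize)
    a = {j: pulp.LpVariable(f"a_{j}", cat="Binary") for j in targets}
    v = {t: pulp.LpVariable(f"v_{t}", lowBound=0) for t in types}
    # Objective (39): fairness violation minus the dual price of the new column.
    prob += pulp.lpSum(v[t] for t in types) - pulp.lpSum(alpha[j] * a[j] for j in targets) - beta
    prob += pulp.lpSum(a[j] for j in targets) <= m            # (40): at most m targets covered
    for t in types:                                            # (41): soft quota constraints
        covered = pulp.lpSum(p_tilde[(j, t)] * a[j] for j in targets)
        prob += covered >= q_lo[t] - v[t]
        prob += covered <= q_hi[t] + v[t]
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    strategy = {j: int(round(a[j].value())) for j in targets}
    violation = sum(v[t].value() for t in types)
    return pulp.value(prob.objective), strategy, violation
```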


Table 5. Decomposition of c^pop using (DEC).

Target | c^pop | λ1=0.491 | λ2=0.168 | λ3=0.265 | λ4=0.073 | λ5=0.003
j1     | 0.494 | 1 | 0 | 0 | 0 | 1
j2     | 0.265 | 0 | 0 | 1 | 0 | 0
j3     | 0.565 | 1 | 0 | 0 | 1 | 0
j4     | 0.244 | 0 | 1 | 0 | 1 | 1
j5     | 0.432 | 0 | 1 | 1 | 0 | 0

4.3 Some Extensions

Here we present two extensions of this discussion.

Multiple Labels: We consider the setting where each target has multiple labels representing different dimensions of analysis; for example, each target could be characterised by race, religion, wealth, etc. Consider a set of attributes A = {L1, L2, . . . , L_{|A|}}, each with its corresponding set of labels. Similarly to (15), quotas (q_{ℓ_k}^L, q_{ℓ_k}^U) are stated for each label ℓ_k of the k-th attribute. We would like to state an extension of Theorem 3 for this setting. If the labels satisfy that J_{ℓ_k} ⊆ J_{ℓ_{k'}} or J_{ℓ_k} ∩ J_{ℓ_{k'}} = ∅ for each ℓ_k ∈ L_k and ℓ_{k'} ∈ L_{k'}, then the constraints form a hierarchy and Theorem 3 applies directly. If this is not the case, we define L = L1 × L2 × . . . × L_{|A|}, which clearly induces a partition of J. If we set quotas as in (15), the result will be implementable under the labels L, but this is a relaxation of the original problem. We would need to find quotas (q_ℓ^L, q_ℓ^U) for each ℓ that produce the same constraints as the quotas (q_{ℓ_k}^L, q_{ℓ_k}^U). If we are able to find integer quotas satisfying this, then Theorem 3 applies again.

Penalizing Violations with Different Weights: As in Example 1, decision makers may particularly dislike strategies that do not cover one type of population at all. The model (DEC) and the column generation penalize in the same way strategies that cover more than the quotas and strategies that cover less. This symmetry can be broken by introducing different variables ε− and ε+ to measure the two violations, and replacing the objective function (21) and constraints (25) by:

$\min_{a,\lambda,\epsilon^-,\epsilon^+} \ \sum_{i \in M} \sum_{t \in T} \lambda_i \left( \kappa_t\, \epsilon^-_{it} + \epsilon^+_{it} \right) \qquad (43)$
$-\epsilon^-_{it} + q_t^L \le \sum_{j \in J} a_{ij}\, \tilde{p}_{jt} \le q_t^U + \epsilon^+_{it} \qquad t \in T,\ i \in M \qquad (44)$

where κ_t > 1 is a parameter assigning more weight to under-allocating protection to population type t. The linearization techniques and the column generation can be applied directly, as before.
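The asymmetric penalty can also be evaluated directly for a candidate pure strategy; the short sketch below (illustrative only, with hypothetical inputs) splits the deviation in (44) into under- and over-coverage and weights the former by κ_t.

```python
def weighted_violation(strategy, p_tilde, q_lo, q_hi, kappa):
    """Asymmetric violation of a pure strategy, in the spirit of objective (43)."""
    total = 0.0
    for t in q_lo:
        covered = sum(a * p_tilde[(j, t)] for j, a in strategy.items())
        under = max(q_lo[t] - covered, 0.0)   # eps^-: protection below the lower quota
        over = max(covered - q_hi[t], 0.0)    # eps^+: protection above the upper quota
        total += kappa[t] * under + over
    return total
```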

5 Computational Experiments

In our computational experiments we investigate three questions. First, the impact on the defender's expected utility of including fairness considerations in SSGs. Second, the computational performance of (DEC) and of the column generation approach for obtaining strategies that are close to implementable. Finally, how much is gained by decomposing coverage probabilities into almost-implementable pure strategies with our method instead of with the Box method presented in Algorithm 1.

Experimental Setting: We test our methods on randomly generated games. For each n ∈ {20, 30, 40}, m ∈ {5, 10} and {1, 3} we generate 10 instances. Payoff matrices were generated uniformly such that D(j|p), A(j|u) ∼ U(0, 100) and D(j|u), A(j|p) ∼ U(−100, 0). For each generated game, 1000 "inhabitants" were allocated among the targets. Finally, we divided this population into |T| = 3 or |T| = 7 types by drawing random partitions that sum to the total population allocated in each target. With this setting, we aim to obtain defender expected rewards that are comparable when the population is divided into 3 or 7 types.
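A minimal sketch of this instance generator is given below. It follows the stated payoff distributions; how the 1000 inhabitants are split into types is not fully specified in the text, so the Dirichlet split used here is an assumption for illustration.

```python
import numpy as np

def random_instance(n_targets, n_types, total_pop=1000, seed=0):
    """Randomly generated game in the spirit of the experimental setting above."""
    rng = np.random.default_rng(seed)
    D_p = rng.uniform(0, 100, n_targets)      # D(j|p) ~ U(0, 100)
    A_u = rng.uniform(0, 100, n_targets)      # A(j|u) ~ U(0, 100)
    D_u = rng.uniform(-100, 0, n_targets)     # D(j|u) ~ U(-100, 0)
    A_p = rng.uniform(-100, 0, n_targets)     # A(j|p) ~ U(-100, 0)
    # Allocate 1000 inhabitants among targets, then split each target's population into types.
    pop_per_target = rng.multinomial(total_pop, np.ones(n_targets) / n_targets)
    type_shares = rng.dirichlet(np.ones(n_types), size=n_targets)
    pop = np.round(type_shares * pop_per_target[:, None]).astype(int)
    return D_p, D_u, A_p, A_u, pop

D_p, D_u, A_p, A_u, pop = random_instance(n_targets=20, n_types=3)
```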

Fig. 4. Average defender's expected utility as a function of α. (a) |T| = 3. (b) |T| = 7.

All experiments were carried out using CPLEX 12.8 and Python 3.6, in a single thread on a server with a 3.40 GHz Intel i7 processor and 64 GB of memory.

Defender's Expected Utility: In order to measure the impact, we run the Population and Label models with different values of α ∈ {5%, 10%, 25%, 50%}. Figure 4 shows the average defender's expected utility as a function of α, separated by |T|. The model that considers labels on the targets is less restrictive, and thus for a fixed α it returns a larger defender expected utility. Logically, both are upper bounded by the SSE value returned by (HOM). Also, if we consider


a more detailed description of the population (i.e. more types), the defender expected utility decreases. Finally, as α increases, the defender expected utility of the three models converges to the SSE value.

Efficiency of Decomposition Methods: For α = 25%, we solve all instances with the model focused on the population. We test the MILP (DEC) and the column generation. For formulation (DEC), we set the upper bound on the number of strategies needed to decompose the coverage to U = |J| and to U = 1.5|J|. We compare them with the output of Algorithm 1, using a lexicographic order, and impose a time limit of 600 s on all algorithms. Figure 5(a) shows the average runtime on a logarithmic scale. (DEC) with U = |J|, respectively U = 1.5|J|, hits the time limit on 97%, resp. 99%, of the instances. Algorithm 1 takes less than 1 ms. The column generation takes between 0.75 s and 4 s. In Fig. 5(b), we compare the minimum weighted violation with the weighted violation of the solution generated by Algorithm 1. Algorithm 1 consistently returns a solution with a larger violation than the one returned by the column generation. We note that the minimum weighted violation decreases as the size of the instance increases. From a practical point of view, the models that minimize the weighted violation generate fairer allocations in reasonable time.

Fig. 5. (a) Solution time for the different methods to decompose a coverage c into implementable strategies x. (b) Weighted violation of the fairness constraints for each method.

6 Case Study

In this section we study the performance of our models on a realistic instance, based on data retrieved from Ñuñoa, a municipality in Santiago de Chile. This instance consists of 1266 census blocks, where the data considered are the level of income (medium-high, medium, low), the demand for police resources DEM, the


population in each census block and a measure of the criminal activity (amount of reported crime RC). The demand for police resources is computed by Carabineros de Chile, the national police of Chile, as the amount of resources necessary to carry out general deployments, court orders, monitoring of establishments and extraordinary services, as in [2]. Considering one target per block is too expensive in terms of computational time, so we apply the clustering algorithm integrated in QGIS [14] to aggregate the data and reduce the size of the problem. We finally consider 250 targets, represented in Fig. 6(a). In this municipality there are three types of population in terms of level of education and income (strongly correlated): people with low, medium and high income, who receive 432 US$, 750 US$ and 1400 US$ per month, respectively. The geographical distribution of the income level is represented in Fig. 6(b), whose legend distinguishes low, medium and medium-high income. Such an aggregation step can be reproduced with a standard clustering routine, as in the sketch below.
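A minimal sketch of the aggregation, assuming block centroids and per-block attributes are available as NumPy arrays; the paper uses the clustering tool built into QGIS, so the k-means call here is only a stand-in for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def aggregate_blocks(coords, dem, rc, pop, n_targets=250, seed=0):
    """Aggregate census blocks into targets by spatial clustering.

    coords: (n_blocks, 2) block centroids; dem, rc: per-block demand / reported crime;
    pop: (n_blocks, n_types) population per income type.
    Returns per-target sums of DEM, RC and population.
    """
    labels = KMeans(n_clusters=n_targets, random_state=seed, n_init=10).fit_predict(coords)
    dem_t = np.zeros(n_targets)
    rc_t = np.zeros(n_targets)
    pop_t = np.zeros((n_targets, pop.shape[1]))
    for block, target in enumerate(labels):
        dem_t[target] += dem[block]
        rc_t[target] += rc[block]
        pop_t[target] += pop[block]
    return dem_t, rc_t, pop_t
```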

Fig. 6. Data from Ñuñoa. (a) Targets. (b) Income level.

We consider two types of attackers: the first, named k1, related to criminal activity, and the second, named k2, related to the general activities that Carabineros perform. Both appear with the same frequency, so π_{k1} = π_{k2} = 0.5. Payoffs are built proportionally to DEM and RC. The penalties for the attacker were set to a constant for each attacker type (the punishment is the same no matter where the attacker is caught). The specific parameters are shown in Table 6.

Table 6. Payoff matrices built in the case study.

D_{k1}(j|p) | D_{k1}(j|u) | A_{k1}(j|p) | A_{k1}(j|u) | D_{k2}(j|p) | D_{k2}(j|u) | A_{k2}(j|p) | A_{k2}(j|u)
0           | −DEM_j      | −100        | DEM_j       | 0           | −RC_j       | −300        | RC_j
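The payoff construction of Table 6 is simple enough to state in code. The sketch below builds the payoff vectors for both attacker types from the aggregated DEM and RC values, reading the constant values as the attacker's penalty when caught at a protected target; the function and field names are chosen here for illustration, not taken from the paper.

```python
import numpy as np

def case_study_payoffs(dem_t, rc_t):
    """Payoff vectors of Table 6 for attacker types k1 (crime) and k2 (general duties)."""
    n = len(dem_t)
    return {
        "k1": {"D_prot": np.zeros(n), "D_unprot": -dem_t,
               "A_prot": np.full(n, -100.0), "A_unprot": dem_t},
        "k2": {"D_prot": np.zeros(n), "D_unprot": -rc_t,
               "A_prot": np.full(n, -300.0), "A_unprot": rc_t},
    }
```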

We test the model (HOM) (without any fairness consideration) and the models focused on population and labels. We deploy 120 homogeneous police


resources and use the fairness parameter α = 0.1. All models took less than 1 min to return the optimal coverage distribution. Results are shown in Fig. 7. The model focused on the population allocates resources where the model without fairness considerations does not: both high- and low-income areas that were not covered under (HOM) now receive patrol assignments. The model focused on labels, on the other hand, produces a result with behaviour similar to (HOM). Even though the results are comparable, these models are a useful tool for ensuring a fair distribution of police resources without a significant impact on the optimal allocation, which could be important for both police decision making and governmental policies.

Fig. 7. Coverage distribution in Ñuñoa.

7 Conclusions

In this paper, we have studied the impact of fairness constraints in SSGs. We have presented two models: one imposing fairness constraints on a detailed description of the population, the other imposing constraints on targets that are labeled.


These models aim to allocate a fair distribution of resources amongst the population, avoiding discrimination issues from a tactical point of view. We have studied our models on a realistic instance. We have shown that imposing constraints through labels on the targets is implementable, meaning that every coverage distribution satisfying these constraints can be decomposed into pure strategies that all satisfy them. This is not the case for the model with the detailed description of the population; for that case, we propose a MILP formulation to find the decomposition that is closest to the set of fairness constraints, as well as a column generation method to solve this problem efficiently. Computational tests have shown that the column generation approach efficiently finds the best decomposition.

References

1. Bucarey, V., Casorrán, C., Figueroa, Ó., Rosas, K., Navarrete, H., Ordóñez, F.: Building real Stackelberg security games for border patrols. In: Rass, S., An, B., Kiekintveld, C., Fang, F., Schauer, S. (eds.) Decision and Game Theory for Security. LNCS, vol. 10575, pp. 193–212. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68711-7_11
2. Bucarey, V., Ordóñez, F., Bassaletti, E.: Shape and balance in police districting. In: Eiselt, H., Marianov, V. (eds.) Applications of Location Analysis. ISOR, vol. 232, pp. 329–347. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-20282-2_14
3. Budish, E., Che, Y.K., Kojima, F., Milgrom, P.: Designing random allocation mechanisms: theory and applications. Am. Econ. Rev. 103(2), 585–623 (2013)
4. Casorrán, C., Fortz, B., Labbé, M., Ordóñez, F.: A study of general and security Stackelberg game formulations. Eur. J. Oper. Res. 278, 855–868 (2019)
5. Correa, J., Harks, T., Kreuzen, V.J., Matuschke, J.: Fare evasion in transit networks. Oper. Res. 65(1), 165–183 (2017)
6. Hargreaves, J., Linehan, C., Husband, H.: Police Powers and Procedures, England and Wales, Year Ending 31 March 2017. Home Office (2017)
7. Jain, M., Kardes, E., Kiekintveld, C., Ordonez, F., Tambe, M.: Security games with arbitrary schedules: a branch and price approach. In: Twenty-Fourth AAAI Conference on Artificial Intelligence (2010)
8. Kiekintveld, C., Jain, M., Tsai, J., Pita, J., Ordóñez, F., Tambe, M.: Computing optimal randomized resource allocations for massive security games. In: AAMAS 2008, pp. 689–696 (2009)
9. Paruchuri, P., Pearce, J.P., Marecki, J., Tambe, M., Ordonez, F., Kraus, S.: Playing games for security: an efficient exact algorithm for solving Bayesian Stackelberg games. In: AAMAS 2008, pp. 895–902 (2008)
10. Pierson, E., et al.: A large-scale analysis of racial disparities in police stops across the United States. arXiv preprint arXiv:1706.05678 (2017)
11. Pita, J., et al.: Deployed ARMOR protection: the application of a game theoretic model for security at the Los Angeles International Airport. In: AAMAS 2008, pp. 125–132 (2008)
12. European Union Agency for Fundamental Rights: Big data: discrimination in data-supported decision making. FRA FOCUS (2018). https://fra.europa.eu/en/publication/2018/big-data-discrimination


13. Shieh, E., et al.: PROTECT: a deployed game theoretic system to protect the ports of the United States. In: AAMAS 2012, pp. 13–20 (2012)
14. QGIS Development Team, et al.: QGIS geographic information system. Open Source Geospatial Foundation Project (2016)
15. Zafar, M.B., Valera, I., Gomez-Rodriguez, M., Gummadi, K.P.: Fairness constraints: a flexible approach for fair classification. J. Mach. Learn. Res. 20(75), 1–42 (2019). http://jmlr.org/papers/v20/18-262.html

Toward a Theory of Vulnerability Disclosure Policy: A Hacker's Game

Taylor J. Canann(B)

CEPA, McCombs School of Business, The University of Texas at Austin, Austin, TX 78712, USA
[email protected]

Abstract. A game between software vendors, heterogeneous software users, and a hacker is introduced in which software vendors attempt to protect software users by releasing updates, i.e. disclosing a vulnerability, and the hacker is attempting to exploit vulnerabilities in the software package to attack the software users. The software users must determine whether the protection offered by the update outweighs the cost of installing the update. Following the model is a description of why the disclosure of vulnerabilities can only be an optimal policy when the cost to the hacker of searching for a Zero-Day vulnerability is small. The model is also extended to discuss Microsoft’s new “extended support” disclosure policy.

Keywords: Game theory · Welfare analysis · Vulnerability disclosure policy

1 Introduction

In May of 2017 the WannaCry attacks infected over 300,000 systems in 150 countries, and the estimated cost of these attacks is approximately $4 billion. One month later, the NotPetya attacks, another major global attack that primarily targeted Ukrainian systems, began. The costs of the NotPetya attacks were even larger than those of WannaCry and have been estimated at around $10 billion. Following the NotPetya attacks, the Retefe banking Trojan began leveraging the EternalBlue exploit in September. Finally, in August of 2018 the Taiwan Semiconductor Manufacturing Company, an Apple chip supplier, was hit by a new variant of the WannaCry attack that cost the company approximately $170 million. The problem was not that Windows is an inherently flawed system, but instead that these attacks could have been avoided if

I am grateful to Richard Evans, Kerk Phillips, the BYU MCL workshops, Brennan Platt, Brad Greenwood, Robert Mrkonich, Samuel Kaplan, Kenneth Judd, Chase Coleman, Ryne Belliston, Jan Werner, David Rahman, and Aldo Rustichini for very helpful comments and to Alexander Pingry for excellent research assistance. Additional comments and proofs can be found in the online mathematical appendix.


users/firms had only updated. In March of 2017, Microsoft patched this vulnerability in its monthly, second-Tuesday update. This is not just a problem with Microsoft software: every piece of software, no matter what care is taken by the vendor, is riddled with vulnerabilities, which leaves users of the software open to attack by hackers. To protect users, software vendors release patches to address these found vulnerabilities, but this is a double-edged sword. Releasing updates, a.k.a. vulnerability disclosure, may in fact increase the susceptibility of current users to attack, in particular those who chose not to immediately install the updates, because an update can be reverse engineered quite easily by hackers. These types of hacks have been gaining prevalence over the last few years. In a seminal paper in the field of vulnerability disclosure, [1] asked whether finding vulnerabilities is optimal for social welfare. Since then, vulnerability disclosure policy has been greatly debated in the literature. The model outlined in this paper explores the decisions made by both the network of users and a hacker given the different policy regimes that could be implemented by the vendor. The interaction between vendors and software users was first modeled by [2], who find that vendors will always want to delay the release of patches, even though this action is not socially optimal. However, [2] do not answer whether a vendor should engage in disclosing vulnerabilities, which is the main focus of this paper. One of the first economic models of hacker behavior was developed in [3], which attempts to estimate the effects of the fixed costs of hacking on the incentives of a profit-maximizing hacker. Much of the recent literature that has attempted to model hacker behaviour, e.g. [4], follows models similar to the Becker model of criminal behavior [5], but this approach assumes that: (i) law enforcement can easily track and find a hacker, and (ii) hackers can easily be prosecuted. These assumptions are the exception, not the rule. To break from this convention, the hacker is modeled here as a profit-maximizing agent in order to bring hacker behavior into the vulnerability disclosure debate. Both the network framework of software users and the profit-maximizing hacker are extensions of the work in [6], which focuses on the welfare effects of disclosure policy for a representative set of users with the vendor facing a monopolistically competitive market. This paper follows the notation in [6] rather closely so as to maintain a consistent notational scheme within the vulnerability disclosure literature. Others have examined how attack propensity changes under different disclosure regimes (e.g. [7]) and have found that releasing patches tends to increase the number of attacks. This model identifies the reasons for this observed increase in attacks as being driven by the decisions of both the hacker and the software users. The hacker's decision is driven by parameters such as the probability of a successful attack, as well as the costs associated with finding new vulnerabilities in the software package. The software users must balance the value they place on using the software relative to the expected damages of an attack and the


cost of updating their machine. Therefore, this model is able to give a causal relationship between attack propensity and disclosure regimes, which strengthens the story behind these correlations. In order to describe the best type of disclosure policy, this paper develops a model of a heterogeneous network made up of an interconnected set of software users attempting to defend themselves against a profit-maximizing hacker. Within my model, there are three decisions to be made: (i) the strategy of attack to be played by the hacker, (ii) the optimal disclosure policy, and (iii) the updating decision made by each software user. Following the model setup, welfare-maximizing policies are formulated to decrease the hacker's efforts in infiltrating networks and increase the software users' utility. The optimal strategy for the software vendor, the optimal policy, is to maximize the egalitarian sum of utilities of the software users, i.e. the vendor acts as a social planner. The optimal policy depends both on the distribution of software users on the network and on how costly finding a previously unknown vulnerability, i.e. a Zero-Day vulnerability, is for the hacker. Software users that do not expect to bear the majority of the burden of an attack, known as low-type users, do not want vulnerabilities to be disclosed, i.e. they prefer a Non-Disclosure policy, since they will not update their machines, deeming it too costly. Also, if the cost of searching for a Zero-Day exploit is higher than the expected payoff of the exploit, then the hacker is not willing to expend the energy searching for a Zero-Day, and Non-Disclosure is the optimal policy. Therefore, the only case in which Disclosure can be an optimal policy is when search costs are low and there are enough users that desire to update their machines. Starting in January of 2020, Microsoft will no longer support Windows 7 unless users enroll in Extended Support1. The final result of the paper is that Microsoft's new policy increases the cost of exploiting the disclosed vulnerability and, even though the policy increases the cost of updating, under certain parameter values it causes the software users to receive higher payoffs. This new approach to disclosure policy can increase overall welfare relative to the policy of disclosing all vulnerabilities. The sections of the paper are as follows: the model is introduced, together with the first main contribution, a discussion of optimal policy when the hacker is a decision-making agent, in Sect. 2. Microsoft's newly proposed policy is analyzed in Sect. 4, and Sect. 5 concludes.

2 Static Game

The players within this static game are the software vendor, a hacker, and the software users. The software vendor follows a welfare-maximizing disclosure policy, and thus determines the rules of the game. The hacker maximizes his profits by choosing a hacking strategy: he can exploit a Zero-Day, exploit the patch released by the software vendor, i.e. an N-Day attack, or he can exit the game. Lastly,

See https://www.microsoft.com/en-us/microsoft-365/blog/2018/09/06/helpingcustomers-shift-to-a-modern-desktop/.


each software user must decide whether or not to update her machine if a vulnerability is disclosed, i.e. an update is released.

Table 1. Notations used in the paper

α | Probability the software vendor finds a vulnerability, α ∈ (0, 1)
D, ND | Disclosure or Non-Disclosure policy, respectively
I = {1, . . . , m} | Set of interconnected software users
θ | Set of software user weights, θ = {θ1, . . . , θm}
θi | ith software user's weight, 0 < θ1 ≤ θ2 ≤ · · · ≤ θm < 1
θi v | Value obtained by the ith software user for using the software package, v > 0
θi D | Damage done by the hacker to the ith software user via a hack, D > 0
cu | Opportunity cost of updating, cu > 0
u, nu | Software user choice of whether to update or not update, respectively
δ, δ̂ | Probability the hacker successfully finds a Zero-Day under D and ND, respectively, δ̂ > δ
cs | Opportunity cost of searching for a Zero-Day, cs > 0
E, S, X | Hacker choice to (E)xploit the N-Day, (S)earch for a Zero-Day, or e(X)it the game

The vendor of the software package is only concerned with maximizing software user welfare in an egalitarian manner, similar to a social planner. The vendor is unable to detect all vulnerabilities before selling the software, but the vendor, under a Disclosure policy, will attempt to find these vulnerabilities ex post. The probability of finding a vulnerability, α, can either be thought of as individual vendors searching for vulnerabilities by themselves or as bounty systems such as Microsoft’s Bounty System (E.g. See: [8–11]). The software user’s weight parameter in Table 1 can be thought of as the network centrality of the software user, or as how desirable the information found on the software user’s machine is to the hacker. Due to the hacker attempting to exploit the network and the inability of vendors to solve all vulnerabilities ex ante, each software user is vulnerable to an attack. The hacker is only able to extract as much information as is available to software user i. This damage can also be thought of a direct transfer from the software user to the hacker when the software user is hacked. The vendor does not usually charge the software users to install the updates, but the updates are still costly in terms of opportunity costs, i.e. the time to install the update. Updates often require software users to stop working or even shutdown their machines, thus cu > 0. For simplicity, this cost is assumed to be a fixed cost to be paid if the software user decides to update. To model the fact that some people do not update under any policy and there exists at least one software user that might update, the following assumption is made: Assumption 1. Let θ1
δD i∈I θi , then under Non-Disclosure, the unique Nash equilibrium is to exit the game, A∗nd = (X). 2.2

Disclosure Regime

If the vendor chooses D, the vendor releases updates every time they find a vulnerability. Each software user then must choose whether to update, and thus

Toward a Theory of Vulnerability Disclosure Policy: A Hacker’s Game

123

endogenously define the two sets Γnu and Γu as the set of software users that do not update and the set of software users that do update, respectively, and ξ = |Γu |, the number of software users that update under a Disclosure policy. When a software user chooses to update, she protects her machine from NDay exploits, but is still vulnerable to Zero-Days. Due to the costly nature of updating, some software users may choose not to update leaving their computers open to both Zero-Day and N-Day hacks (E.g. See [12]). Now there are two stages within the game, the first being the possible release of updates by the vendor, which happen with probability α, followed by the game between the hacker and the software users. When the vendor is unable to find a vulnerability, the game is identical to that of the Non-Disclosure regime in Sect. 2.1. The hacker’s action set when the vendor is unable to find a vul∈ {S, X}. When the nerability within the Disclosure regime is denoted as A1−α d vendor finds a vulnerability and releases an update, then both the hacker and i the software user must choose their actions, Aα d ∈ {E, S, X} and A ∈ {u, nu}, respectively. The expected utility of the software user i is defined as the function Udi : α × Ai × θi → R, where she receives vθi if her machine is not exploited, Ad × A1−α d −Dθi if the hacker is successful in attacking her machine, and −cu if she decides to update. Table 3. Hacker expected payoff functions: Disclosure Hacker action Payoff (E, S)

      D Π(E,S) (θ, {Γu , Γnu }) = α D i∈Γnu θi + (1 − α) δD i θi − cs

(E, X)

   D Π(E,X) (θ, {Γu , Γnu }) = α D i∈Γnu θi

(S, S)

   D Π(S,S) (θ, {Γu , Γnu }) = αδ + (1 − α)δ D i θi − cs

(S, X)

  D   θi − cs Π(S,X) (θ, {Γu , Γnu }) = α δD i

(X, S)

   D Π(X,S) (θ, {Γu , Γnu }) = (1 − α) δD i θi − cs

(X, X)

D Π(X,X) (θ, {Γu , Γnu }) = 0

There are three main drivers of the Nash equilibria under Disclosure: (a) Do there exist any software users that choose not to update when an update is released? (Notice that this is always satisfied via Assumption 1.) (b) If there is no update released, does the cost of finding a Zero-Day exceed the expected profits of searching? I.e.  θi . (1) cs ≶ δD i∈I

124

T. J. Canann

(c) If a vulnerability is disclosed, does the cost of finding a Zero-Day exceed the expected profits of searching? I.e.   cs ≶ δD θi . (2) i∈I

The  first case to examine is when the cost of searching is high, i.e. cs > δD i θi . Since both the hacker and the software users know whether an update has been released, then the solution can be split into the Non-Disclosure and the Disclosure sub-games. Similar to the Non-Disclosure case when there are high = (X) search costs, in the Disclosure game when no vulnerability is found A1−α∗ d is the equilibrium of the sub-game. Since the search costs are high for the hacker and there exists at least one software user that does not update, then Aα∗ d = (E) is the only strategy to survive elimination of strictly dominant strategies for the hacker, and is thus the only strategy in the best response for the hacker. Given the hacker strategy (E), cu ∗ , if θi < v+D . the best response of software user i is to not update, i.e. i ∈ Γnu cu Otherwise, for software user j such that θj > v+D , updating is optimal, j ∈ Γu∗ . cu , then she is indifferent between any mixture pj ∈ [0, 1] of U pdate If θi = v+D and N ot U pdate. Therefore, the Nash equilibrium of the Disclosure game under high search costs is (1−α)∗ ∗ , (u)j∈Γ ∗ ) ), (A∗i )i∈I ) = ((E, X), (nu)i∈Γnu (3) ((Aα∗ d , Ad u     cu cu ∗ and Γu∗ = j ∈ I|θj > v+D . = i ∈ I|θi < v+D Where Γnu The next case, denoted the medium search cost case, is when searching is profitable when the vendor is unable to find the vulnerability but not when the   θi ≤ cs < δD  θi . If vulnerability is disclosed by the vendor, i.e. δD i∈I i∈I the vendor is unable to find a vulnerability, the cost of searching is still exceeded (1−α)∗ by the expected profits of searching, and thus Ad = (S) is his best response. However, when the vendor finds a vulnerability, then the expected profits of (S) are surpassed by the cost of (S), then the action of (S) when a vulnerability is disclosed yields a strictly lower payoff then (X). Since there always exist software users that do not update, then the best action for the hacker to play is Aα∗ d = (E). cu ∗ will be in Γnu , and Then, notice that all software users such that θi < v+D cu ∗ all software users θj > v+D will be in Γu . Therefore, the Nash equilibrium of the medium search cost case is ∗ , (u)j∈Γ ∗ ) ((Aα∗ ), (A∗i )i∈I ) = ((E, S), (nu)i∈Γnu (4) d , Ad u     cu cu ∗ and Γu∗ = j ∈ I|θj > v+D . Where Γnu = i ∈ I|θi < v+D The final case is to determine what happens when  searching yields positive ˆ profits under both branches of the game, i.e. cs < δD i θi . In this low search cost case, with probability 1 − α, we obtain the same solution as in the Non(1−α)∗ = (S). Disclosure game in Sect. 2.1, i.e. Ad

(1−α)∗

Toward a Theory of Vulnerability Disclosure Policy: A Hacker’s Game

125

Next is to determine the best response of both the hacker and each software user when an update is released. The first thing to notice is that (X) is never a best response since exiting gives a payoff of zero while (S) and (E) both yield ∗ , is the software positive expected payoffs. Given the hacker strategy (E), i ∈ Γnu cu user i’s best response so long as θi < v+D . cu , then software user j’s best response is j ∈ Γu∗ . Whenever the If θj > v+D hacker plays (S), updating will not protect the software user from a hack, and ∗ is the best response for all i ∈ I. thus, i ∈ Γnu Allowing for the hacker to use mixed-strategies introduces the probability ρ ∈ (0, 1), where ρ is the probability that the hacker chooses (E) and (1 − ρ) gives (S). Using the expected payoffs of the software users given ρ, then any cu ∗ when θi < ρ(v+D) . Notice that for software user i’s best response is i ∈ Γnu cu any ρ ∈ [0, θm (v+D) ), (nu) is the best response for all software users. For all cu , j ∈ Γu∗ is their optimal action. For any software users j such that θj > ρ(v+D) cu software user k such that θk = ρ(v+D) , the software user is indifferent between updating and not updating, and will mix with probability pk ∈ [0, 1], where pk is the probability of choosing (u). Now to examine the best response of the hacker when a vulnerability is disclosed given the software users’ strategies. If all of the software users update, i.e. Γu = I, then the best response for the hacker is Aα∗ d = (S). Similarly, if the software user strategy is Γnu = I, then Aα∗ d = (E) is the only strategy in the best response for the hacker. Due to the monotonicity of the software users, and thus their optimal actions, all  that is left to do is to split I between high- and low-type users. Define Ω ≡ cu as the set of high-type software users, i.e. the users that will j ∈ I|θj ≥ v+D k update if the hacker chooses (E). For some k ∈ Ω, define Γnu = {i ∈ I|θi < θk } k k , (pk (u), and Γu = {j ∈ I|θj > θk }. Given a software user strategy of (Γnu k (1 − pk )(nu)), Γu ), for some mixed strategy pk ∈ [0, 1] for software user k, then the hacker’s expected payoff of mixing with ρ ∈ [0, 1] between exploiting and searching is        θi + (1 − pk )Dθk + (1 − ρ) δD θi − cs (5) ρ D i∈I

k i∈Γnu

For all ρ ∈ [0, 1], if  cs > δD



θi − D

i∈I



θi − 1 − pk Dθk

(6)

k i∈Γnu

then ρ∗ = 1 is the best response for the hacker given the software users’ strategy. However, if for every value ρ ∈ [0, 1],  

 θi − D θi − 1 − pk Dθk (7) cs < δD i∈I

then the hacker will send ρ∗ to zero.

k i∈Γnu

126

T. J. Canann

The last case is if there exists a pk ∈ [0, 1] such that Inequality 6 holds with equality, i.e.  

 cs = δD θi − D θi − 1 − pk Dθk (8) i∈I

k i∈Γnu

then any ρ∗ ∈ [0, 1] is the hacker’s best response to the software users’ strategy k , (pk (u), (1 − pk )(nu)), Γuk ). of (Γnu Theorem 1. Let kmin ∈ Ω be the minimal software user in Ω. If Inequality 6 holds for pkmin = 1, then the Nash Equilibrium is ∗ , (u)j∈Γ ∗ ) ((Aα∗ ), (A∗i )i∈I ) = ((E, S), (nu)i∈Γnu (9) d , Ad u cu cu ∗ and Γu∗ = i ∈ I|θi > v+D . = i ∈ I|θi < v+D Where Γnu ∗ Otherwise, there exists a pivotal software user, k ∈ Ω, and a mixed strategy for software user k ∗ , p∗k∗ ∈ [0, 1], such that Eq. 8 holds, and the Nash equilibrium is

(1−α)∗

(1−α)∗

((Aα∗ d , Ad ∗

), (A∗i )i∈I )

(10)

∗ ∗ = ((ρ (E, S), (1 − ρ∗ )(S, S)), (nu)i∈Γnu k∗ , (pk ∗ (u), (1 − pk ∗ )(nu)), (u)j∈Γ k∗ ) u

Where ρ∗ =

3

cu θk∗ (v+D) ,

k∗ Γnu = {i ∈ I|θi < θk∗ }, and Γuk∗ a = {i ∈ I|θi > θk∗ }.

Welfare Analysis

In this section, the “Optimal Disclosure Policy” is first defined followed by solving for the optimal policy for each of the different search cost scenarios found in Sect. 2. Definition 1. The optimal policy Ψ ∗ ∈ {Disclosure, N on−Disclosure} is chosen such that:

   (1−α)∗ ∗ α∗ ∗ ∗ Ψ = argmaxψ∈{d,nd} Ud (Ad , Ad , Ai , θi ), Und (And , θi ) (11) i∈I

i∈I

Where ((Aα∗ ), (A∗i )i∈I ) and (A∗nd ) are the Nash equilibria under Disd , Ad closure and Non-Disclosure, respectively. (1−α)∗

Under High Search Costs, recall that in the Nash equilibrium the hacker chooses to exploit the N-Day under Disclosure and to exit the game under NonDisclosure. Under Disclosure, all low-type software users, the software users in ∗ , are hacked if a vulnerability is found; while all other software users must Γnu pay the cost of updating, which is assumed to be strictly greater than zero. Under Non-Disclosure, the hacker exits the game, and all software users obtain the optimal policy isNon-Disclosure. θi v. Then  If δD i∈I θi ≤ cs < δD i∈I θi , i.e. the medium search cost case, then under a Non-Disclosure regime the hacker searches for a Zero-Day. However,

Toward a Theory of Vulnerability Disclosure Policy: A Hacker’s Game

127

under Disclosure, the hacker chooses to exploit the released vulnerability. Then, solving for the optimal policy is dependent on   cu ≶δ θi + ξ ∗ θi (12) v+D ∗ i∈Γnu

i∈I

As the expected losses from a Zero-Day exceed the cost of the low type software users being hacked since they did not update and the cost of updating for all ξ ∗ = |Γu∗ | of the high type software users, then Disclosure is the optimal the optimal policy under medium search costs is Disclosure if   policy. Thus, ∗ cu θi . Otherwise, the optimal policy is Non-Disclosure ∗ θi + ξ v+D < δ i∈Γ i∈I  nu  ∗ cu if i∈Γnu ∗ θi + ξ v+D > δ i∈I θi . The last case to examine is that of low search costs. Recall that the Nash (1−α)∗ = (S), while the Nash equiequilibrium of the Non-Disclosure game is Ad librium of the Disclosure game takes the form of mixing between (E) and (S) for k∗ , (p∗k (u), (1 − p∗k )(nu)), Γuk∗ ). the hacker while the software users split into (Γnu The analysis begins with the optimal policy for all low-type software users, followed by the optimal policy for all high-type software users. To conclude the section the combined results of both high- and low-type software users are used to find the optimal policy. k∗ , then we are able to analyze which policy they For all software users i ∈ Γnu would prefer by solving 

− δD

θi + (1 − δ) v

k∗ i∈Γnu





θi ≶ −ρ D

k∗ i∈Γnu



⎡  +(1 − ρ ) ⎣−δD ∗

k∗ i∈Γnu



 θi + (1 − δ)v

k∗ i∈Γnu





θi ⎦

k∗ i∈Γnu

(13) Disclosure is the optimal policy for all software users that do not update so long as    + (1 − δ)v  −δD − [−δD + (1 − δ)v] ρ∗ <  + D) (1 − δ)(v (14)  δ−δ ⇐⇒ ρ∗ < 1 − δ Notice that both the left-hand side and the right-hand side are strictly posik∗ tive. Thus, the software users that do not update, software users in Γnu , will sometimes want the policy to be Disclosure. High-type software users, j ∈ Γuk∗ , then face the welfare decision of −δD

 k∗ j∈Γu

θj + (1 − δ) v



  θj ≶ρ∗ v θj − ξcu

k∗ j∈Γu

k∗ j∈Γu



 + (1 − ρ ) ⎣−δD ∗

 k∗ j∈Γu

 θj + (1 − δ)v



⎤ θj − ξ cu ⎦ ∗

k∗ j∈Γu

(15)

128

T. J. Canann

k∗ For software users i ∈ Γnu , Disclosure decreases the probability of being hacked by a Zero-Day, but it also increases their probability of being hacked since the hacker can exploit the N-Day vulnerability that these software users are not willing to defend against. However, software users j ∈ Γuk∗ are more likely to want a Disclosure regime since they both obtain the benefit of hackers having less vulnerabilities to search over as well as protection from the N-Day exploits since they will sometimes update. Now to examine the welfare over all the software users by comparing the sum of all software users’ utility functions. The optimal policy condition can be written as      D δ − (1 − ρ∗ )δ  − δ + ξ ∗ θk ≶ θi + θi (16) v+D ρ∗ k∗ i∈I

i∈Γnu

 Hence, the optimal policy under low search costs is Disclosure if i∈Γnu k∗ θi +    ∗   δ−(1−ρ ) δ D ∗  θk < i∈I θi , or the optimal policy is Nonv+D − δ + ξ ρ∗        δ−(1−ρ∗ )δ D ∗  Disclosure if i∈Γ k∗ θi + v+D − δ + ξ θk > i∈I θi . ρ∗ 

nu

4

Microsoft’s “Extended Support”

This section contains an analysis of the forthcoming change to Microsoft 7 and 10’s updating procedures and how this change alters the game described in Sects. 2 and 3. The game is altered such that the software vendor, Microsoft, introduces a new monthly charge to receive updates. Microsoft intends to implement this policy starting on January 14th , 2020, which is the same day that Windows 7 will no longer be supported. But with a large number of Windows users still using Windows 7, Microsoft needed to come up with a policy to protect these users and maintain their market share (Table 4). Table 4. New notation for Microsoft’s “Extended Support” φu New service charge paid by the software user and hacker to access the update (also called the exploitation cost), φu > 0 cv

Cost to switch to the new version of the software package, cv > 0

v

Software user’s choice to install the new version of the software package

Updating is no longer the only available choice to the software user. The software user can also choose to shift toward using a different version, i.e. Windows 10, for which the software user must pay a cost cv > 0. If the software user shifts toward using the new version of the software, then the hacker is not able to attack the software user, not even via Zero-Days.

Toward a Theory of Vulnerability Disclosure Policy: A Hacker’s Game

Assumption 2. Let

cv δ(v+D)

129

∈ (θ1 , θm ).

If the hacker wants to gain access to the disclosure of the vulnerability, the hacker must pay the subscription fee for the “Extended Support”, φu . However, the hacker does not have to pay cu since the hacker could easily enroll an old computer in the updating scheme in order to be notified of vulnerabilities. Consequently, the cost of exploiting N-Days has increased since φu > 0. To be clear, Microsoft’s new policy is fascinating since it has the potential to increase the cost of exploiting N-Days while also decreasing the effectiveness of Zero-Days against Windows 7. Following the notation of the game in Sect. 2, this new policy can be explicitly defined. The first case to be described is the Non-Disclosure regime. The vendor was unable to find a vulnerability, and thus the hacker is only able to an action (1−α) ∈ {S, X}. Searching for a Zero-Day is not as effective as in the above AM games due to the fact that software users are now able to change their software (1−α) version to avoid being attacked. The software user choice is an action AM,i ∈ (1−α)

(1−α)

i × AM,i × θi . All players {v, nu}. The utility of software user i, UM ;nd : AM that use the old software are contained in Γnu , and all software users that switch versions are in Γv . The next step is to formalize the Disclosure sub-game. The hacker has the same set of actions in this case as in the Disclosure case above to pick from: α Aα M ∈ {E, S, X}. The action set for the software users is now AM,i ∈ {v, u, nu}. i α α The utility of software user i is now UM ;d : AM × AM,i × θi .

Table 5. Hacker expected payoff functions: Microsoft Hacker action Payoff (E, S) (E, X) (S, S) (S, X) (X, S) (X, X)

nu

   θi − φu + (1 − α) δD i∈Γ

nu

θi − φ u

  M Π(E,S) (θ, {Γnu , Γu , Γv }) = α D i∈Γ

  M Π(E,X) (θ, {Γnu , Γu , Γv }) = α D i∈Γ



nu ∪Γu

θi − cs



     M   Π(S,S) (θ, {Γnu , Γu , Γv }) = α δD i∈Γnu ∪Γu θi + (1 − α) δD i∈Γnu ∪Γu θi − cs   M   Π(S,X) (θ, {Γnu , Γu , Γv }) = α δD i∈Γnu ∪Γu θi − cs    M Π(X,S) (θ, {Γnu , Γu , Γv }) = (1 − α) δD i∈Γnu ∪Γu θi − cs M Π(X,X) (θ, {Γnu , Γu , Γv }) = 0

There are five main drivers of the Nash equilibria in this model: the three stated in Sect. 2 and the following two conditions. (d) Does the cost of updating exceed the cost of switching to the new version of the software package? I.e. (17) cv ≶ cu + φu (e) Does the cost of searching for an N-Day exceed the payoff?  φu ≶ D θi (18) i∈I

130

T. J. Canann

When search costs exceed the expected payoff of search under the NonDisclosure sub-game, the hacker will always play (X). Given the hacker strategy of  exiting the game, all software users will not update. Therefore, the equilibrium    (1−α)∗ (1−α)∗ , AM,i is AM = ((X), (nu)i∈I ). i∈I  If cs < δD i∈I θi , then via the best responses of both software users and the hacker, the Nash equilibria under medium  search costs are as follows in  cv Theorem 2. Define ΩM ≡ k ∈ I|θk ≥ δ(v+D) . Theorem 2. Let kmin ∈ ΩM be the minimal software user in ΩM . Then under low search costs in the Non-Disclosure sub-game, if  θi (19) cs < δD i∈I\ΩM

Then the Nash equilibrium is       (1−α)∗ (1−α)∗ AM , AM,i = (S), ((nu)i∈Γ kmin ,nd∗ , (v)j∈Γ kmin ,nd∗ ) nu

i∈I

v

(20)

kmin ,nd∗ = {i ∈ I|θi < θkmin }, and Γvkmin ,nd∗ = {j ∈ I|θj ≥ θkmin }. Where Γnu Otherwise, there exists a pivotal software user k ∗ ∈ ΩM and a mixed strategy for software user k ∗ strategy, pv∗ k∗ ∈ [0, 1], such that ⎞ ⎛  ⎠ (21) θi + (1 − pv∗ cs = δ ⎝D k )Dθk∗ ∗

k ,nd∗ i∈Γnu

Then the Nash equilibrium is     (1−α)∗ (1−α)∗ AM , AM,i (22) i∈I   ∗

∗ v∗ v∗ ∗ ∗ = ρ (S), (1 − ρ )(X) , (nu)i∈Γnu k ,nd∗ , (pk ∗ (v), (1 − pk ∗ )(nu)), (v) j∈Γvk ,nd∗ Where ρ∗ = θ j > θk ∗ .

cv θk∗ δ(v+D) ,



k ,nd∗ Γnu = {i ∈ I|θi < θk∗ }, and Γvk



,nd∗

=

j ∈ I|

Now to solve for the Nash equilibria under the Disclosure sub-game. Notice that both the hacker and the software users have three actions they could each take. In Sect. 2.2, the equilibria cases followed from the relation between the cost of searching and the expected payoffs from searching. However, due to the new action available to the software users, (v), and the enrollment fee, φu , there now exist extra cases dependent on Eqs. 17 and 18. If there are both high or medium search costs and high exploitation costs,   θi and φu > D  θi , then notice that both searching for i.e. cs > δD i∈I i∈I Zero- and N-Days are too costly, therefore, the hacker will always exit the game.

Toward a Theory of Vulnerability Disclosure Policy: A Hacker’s Game

131

Given this strategy, the workers will all not update. Hence, the Nash equilibrium α∗ is (Aα∗ M , (AM,i )i∈I ) = ((X), (nu)i∈I ). The last case to examine is when the exploitation costs of the N-Day are low.   θi and φu ≤ D  θi , while the software users Theorem 3. If cs > δD i∈I i∈I face cv < cu + φu , and  φu < D θi (23) i∈I\ΩM

Then the Nash equilibrium is



α∗ (Aα∗ M , (AM,i )i∈I ) =



(E), (nu)i∈Γnu d∗ , (v)j∈Γ d∗ v

(24)

d∗ = {i ∈ I \ ΩM } and Γvd∗ = {j ∈ ΩM }. Where Γnu   θi and φu ≤ D  θi , while the software users Otherwise if cs > δD i∈I i∈I face cv < cu + φu , and there exists k ∗ ∈ ΩM and a mixed strategy for software user k ∗ , pv∗ k∗ ∈ [0, 1], such that  φu = D θi + (1 − pv∗ (25) k∗ )Dθk∗ ∗ i∈Γnu

Then the Nash equilibrium of the game is α∗ (Aα∗ M , (AM,i )i∈I ) =



v∗ (ρ∗ (E), (1 − ρ∗ )(X)), (nu)i∈Γ d∗ , (pv∗ k∗ (v), (1 − pk∗ )(nu)), (v)j∈Γ d∗ nu

v

(26) d∗ = {i ∈ I|θi < θk∗ }, Γvd∗ = {j ∈ I|θj > θk∗ }, and ρ∗ = Where Γnu

4.1

cv θk∗ (v+D) .

Welfare Analysis

Now to investigate whether this new “Extended Coverage” will be a welfare improving policy. This section flows as follows: First, define the optimal policy; Then, the welfare improving policy will be solved for each of the different cost scenarios. Definition 2. The optimal policy Ψ ∗ Disclosure} is chosen such that: Ψ ∗ = argmaxψ∈{M,d,nd}





{M icrosof t, Disclosure, N on −

(1−α)∗

UM (Aα∗ M , AM

i∈I

 i∈I

(1−α)∗

Ud (Aα∗ d , Ad

(1−α)∗

, Aα∗ M,i , AM,i

, A∗i , θi ),



, θi ),

 (27) Und (A∗nd , θi )

i∈I

(1−α)∗ (1−α)∗ (1−α)∗ Where ((Aα∗ ), (A∗i )i∈I ), (A∗nd ), and ((Aα∗ ), (Aα∗ )i∈I ) are M , AM M,i , AM,i d , Ad the Nash equilibria of the Disclosure, Non-Disclosure, and Microsoft policies, respectively.

132

T. J. Canann

Beginning with the high search cost case, recall that the equilibria of the Microsoft policy game are split into  two sub-cases. These two cases can be identified by Inequality 18. If φu > D i∈I θi , then both Microsoft and Non-Disclosure are optimal policies. However, if φu ≤ D i∈I θi , then Non-Disclosure is the optimal policy. Therefore, for the new policy to be effective under high search costs, the  extended service fee must be large. Also notice that if φu ≤ D i∈I θi , i.e. the exploitation fee is low, then the Nash equilibrium of the hacker exit when a vulnerability is not found and to mix between exploitation of the N-Day and exiting the game. Then, Microsoft is preferred to Disclosure when ⎡ ⎤   ∗ M∗ ⎦ θi + (1 − pM cv < (v + D) θi (28) ρ∗M ⎣ k∗ )θk∗ + ξ M∗ i∈Γnu

d∗ i∈Γnu

Given medium search costs and high exploitation costs, the welfare equation for the software users is   (1−α)∗ (1−α)∗ UM (Aα∗ , Aα∗ , θi ) = v θi (29) M , AM M,i , AM,i i∈I

i∈I

Therefore, compared to Disclosure, the software users do not need to either update or be hacked via the released patch, and compared to Non-Disclosure, the hacker is not going to be searching for a Zero-Day, and thus the software users will not bear the burden of the expected damages. Hence, as discussed in Theorem 4, the new policy proposed by Microsoft is optimal.  The next case to discuss is when the exploitation cost is low, φu ≤ D i∈I θi , and the cost of installing the new version is less than the cost of updating, cv ≤ cu + φu . Comparing the new Microsoft policy to Disclosure and NonDisclosure, the following inequality describes when the new Microsoft policy is optimal. ⎡ αρ∗M (1

− δ) ⎣



M∗ i∈Γnu

⎤ θi + (1 −

⎦ pv∗ k∗ )θk∗

+ ξv∗

 cv ≤ min δ θi , v+D i∈I α

 d∗ i∈Γnu

θi + (1 − α)δ

 i∈I

θi + ξv∗

cu v+D

(30) Finally, if cv > cu + φu , then, under the Disclosure sub-game, the hightype software users will update. Whereas, in the Non-Disclosure sub-game, the high-type software users will install the new version of the software to protect nd∗ their computers, hence ρd∗ M = ρM . Thus yields the following condition for when “Extended Support” of Windows 7 is the optimal policy.

Toward a Theory of Vulnerability Disclosure Policy: A Hacker’s Game ⎡ ⎢ αρd∗ M ⎣





⎥ nd∗ ⎢ θi + (1 − pu∗ k∗ )θk∗ ⎦ + (1 − α)ρM ⎣

M∗ i∈Γnu,nd

133



⎥ θi + (1 − pv∗ k∗ )θk∗ ⎦

M∗ i∈Γnu,nd

⎧ ⎫ ⎨ ⎬ α(c + φ ) + (1 − α)c c u u v u θi , α θi + (1 − α)δ θi + ξ ∗ ≤ min δ + ξ∗ ⎩ v+D v + D⎭ d∗ i∈I

i∈Γnu

i∈I

(31)    Theorem 4. Let δD i∈I θi ≤ cs < δD i∈I θi . Then the cases satisfying Inequality 18 are  1. If φu > D  i∈I θi , then Microsoft is the optimal policy. 2. If φu ≤ D i∈I θi , cv ≤ cu + φu , and Inequality 30 is satisfied, then Microsoft is an optimal policy. 3. If φu ≤ D i∈I θi , cv ≤ cu + φu , and Inequality 30 is not satisfied, then Microsoft is not an optimal policy. 4. If φu ≤ D i∈I θi , cv > cu + φu , and Inequality 31 is satisfied, then Microsoft is an optimal policy. 5. If φu ≤ D i∈I θi , cv > cu + φu , and Inequality 31 is not satisfied, then Microsoft is not an optimal policy Notice that φu can be used as a weapon to harm hackers. In order for Microsoft’s new policy to be effective under medium search costs, the optimal extended service fee and cost of installing the new version are interdependent. The first way for Microsoft to maximize software user welfare is to pick a very large support fee, i.e. high exploitation costs. This prices the hacker out of the market, while also allowing for the software users to not have to pay to install updates or update their software version since the hacker is priced out of the exploitation market. However, under low exploitation costs, for the Microsoft policy to maximize software user welfare they must choose cv such that either Inequality 30 or Inequality 31 hold.

5 Conclusion

Sun Tzu said: “Know thy self, know thy enemy. A thousand battles, a thousand victories.” This sentiment is just as relevant in cybersecurity as it was in the 5th century BC. The optimal policy debate should be centered around how policies influence both the hacker’s and software users’ behavior. The ease with which the hacker is able to infiltrate the network can be decreased via appropriate disclosure policies. Since the cost of searching for Zero-Days has drastically increased over the last couple of years, the hacker desires more disclosure to decrease his costs. Hence, Disclosure can only be an optimal policy in cases when the cost to the hacker of searching for a Zero-Day vulnerability is small. The policies of Non-Disclosure and Microsoft’s new policy both decrease hacker interference in the network as well as increase overall software user welfare.


The idea of this paper is to push the vulnerability disclosure literature toward thinking about the appropriate assumptions faced by hackers, software users, and software vendors. As the title implies, this is a simplified explanation of the problem that firms face. For example, the Equifax hack can be traced to an unpatched vulnerability; however, there is more at play than is discussed in this static model. Many firms do not immediately update their software packages since doing so may inadvertently negatively affect other software packages. This is beyond the scope of this paper, as this is an introduction to a theoretical approach to the problem, and will be a focus of future research.


Investing in Prevention or Paying for Recovery - Attitudes to Cyber Risk

Anna Cartwright(1), Edward Cartwright(2)(B), and Lian Xue(3)

(1) School of Economics, Finance and Accounting, University of Coventry, Coventry, UK
(2) Department of Strategic Management and Marketing, De Montfort University, Leicester, UK
(3) School of Economics, Wuhan University, Wuhan, China

Abstract. Broadly speaking an individual can invest time and effort to avoid becoming victim to a cyber attack and/or they can invest resource in recovering from any attack. We introduce a new game called the prevention and recovery game to study this trade-off. We report results from the experimental lab that allow us to categorize different approaches to risk taking. We show that many individuals appear relatively risk loving in that they invest in recovery rather than prevention. We find little difference in behavior between a gain and loss framing.

Keywords: Cyber-security · Ransomware · Insurance · Recovery · Risk aversion

1 Introduction

Cyber-crime is a growing threat to society that will become increasingly important as reliance on technology grows (e.g. through the internet of things). Extensive evidence suggests, however, that individuals and organizations (both private and public) take excessive risk in cyber-space. Indeed, there are simple and low-cost behaviors, such as two factor authentication and regular offline back-ups, that would dramatically decrease the likelihood and costs of a cyber-attack. Yet many individuals, employees and managers do not routinely follow such behavior [1]. An analogy would be a society in which we all rely on cars and yet leave them unlocked with the key in the ignition. A fundamental question is why we observe such risk taking.

This research was funded by the Engineering and Physical Sciences Research Council (EPSRC) for project EP/P011772/1 on the EconoMical, PsycHologicAl and Societal Impact of RanSomware (EMPHASIS). The authors would like to thank three anonymous reviewers for their comments on an earlier version of the paper.


In looking at an individual’s approach to cyber-security we distinguish between prevention and recovery [2]. We view prevention as decisions enacted before an attack with the objective of reducing the likelihood and/or damage from attack. Regular software updates, anti-malware and two factor authentication are primarily aimed at preventing an attack from being ‘successful’. Similarly, regular offline back-ups and insurance can prevent an attack from causing significant damage. We view recovery as decisions enacted after an attack. For instance, a victim may have to reset passwords, reformat hard drives or pay for an IT company to recover data and restore systems. While prevention and recovery are not mutually exclusive we can delineate a strategy that focuses, consciously or not, on prevention and one that focuses on recovery.1 As a case in point consider ransomware. Crypto-ransomware is a relatively new form of malware in which an individual’s files are encrypted and a ransom is demanded for the key to decrypt the files [5–7]. If the encryption is done in a technically sound way then the files are only recoverable by paying the ransom. Crypto-ransomware provides, therefore, a viable business model for criminals and some variants of ransomware have a ‘good reputation’ for returning files to victims (if the ransom is paid) [8–10]. In another, older variant of ransomware, victims are held to ransom for the release of sensitive information [11]. Consider an individual who stores sensitive information on her computer and is aware of the threat of ransomware. Broadly speaking the individual has two options. She can limit the damage from attack through, say, regular back-ups or insurance against loss. Or she can pay the ransom if attacked in order to try and recover her files. Law enforcement clearly encourage individuals to take the former approach. It would seem, however, that many take the latter approach. For instance, estimates of the proportion of individuals paying the ransom are as high as 40% [12]. To focus on recovery rather than prevention would appear to be risk-taking behaviour. Evidence from the field of behavioural economics suggests that individuals are more risk seeking in the loss domain than the gain domain [13]. That is, they are more willing to take risks to regain ‘losses’ than to earn ‘gains’. This would appear to be relevant in a cyber context. For instance, an individual who has ‘lost’ files to a ransom attack may be willing to take a risk in paying the ransom. We also know that perception of loss and gain is sensitive to framing [14,15]. A growing body of literature has explored the effectiveness of using different frames to influence behaviour [16–18]. These studies suggest that loss aversion can be a factor in shaping individual choice. The relevance of framing for cyber-security is well acknowledged in the academic literature [19–25].2 However, prior studies have focused, using our terminology, on either prevention or recovery.

In our paper we explore prevention and recovery in tandem. This opens up interesting new avenues for exploration. In particular, we can check for consistency of behaviour across the two different decision tasks. For instance, do individuals who take risks in terms of prevention also take risks in terms of recovery? Indeed, can it be optimal for an individual to spend on recovery and yet not spend on prevention? We introduce a new game that captures investment in prevention and recovery in a cyber-security context. The game contains four stages in which individuals invest in prevention against a cyber attack, learn if they are attacked, can spend on recovery if attacked, and then learn if they have regained their files. We explore how behavior is likely to be influenced by a gain or loss frame. We show that an individual is predicted to invest more on recovery in the loss frame than the gain frame. Conversely, she is predicted to invest more on prevention in a gain frame than a loss frame. We then report an experiment designed to test the hypotheses of our model. We find only limited support that framing matters. More noteworthy is the large heterogeneity of behavior. A significant proportion of subjects invest mainly in prevention, some mainly in recovery, and then some in both or neither. Moreover, we see a lot of risk-taking behavior. The behavior we observe in the lab would seem relatively consistent with that in the field. We argue, therefore, that our game and experimental design can be extended to further explore the trade-off between prevention and recovery. The rest of the paper is organized as follows. Section 2 summarizes the game, Sect. 3 contains our theoretical results, Sect. 4 presents our experimental results, and Sect. 5 concludes. Supplementary material, including experiment instructions and data, is available on Figshare (https://dmu.figshare.com/).

1 One can also delineate strategies that focus on different aspects of prevention or recovery. For instance [3,4] compare protection versus insurance, where the former lowers the probability of attack and the latter the damage from attack.
2 It is also, arguably, acknowledged (consciously or not) by cyber-criminals, with ransomware demands threatening the permanent destruction of files etc.

2 The Prevention and Recovery Game

In this section we introduce a simple game designed to capture salient aspects of the choice between preventing and recovering from cyber-attack. The game consists of four stages. In the first stage, the prevention stage, the individual can spend resource to insure against attack. In the second stage Nature determines if the individual is attacked. In the third stage, the recovery stage, the individual can, if attacked, spend resource on trying to recover her files. In the fourth stage Nature determines if the losses are recovered. The game is carefully designed so that we can distinguish different potential influences on behavior. We now explain each stage in more detail. The individual has computer files worth V tokens. In Stage 1 of the game the individual chooses how much to spend on preventing loss from cyber-attack. She can spend any amount up to Ī tokens, where Ī is exogenously given. Let I ∈ [0, Ī] denote the amount allocated to prevention.

138

A. Cartwright et al.

In Stage 2 the individual may suffer a cyber-attack that means she loses the files worth V. The probability of attack is given by max{(100p − I)/(100 − I), 0},3 where p is an exogenous parameter capturing the activities of the criminal and factors in place to deter attack. In interpretation, we note that if the individual does not spend on preventing loss, I = 0, the probability of attack is p. If the individual spends I = 100p on prevention then the probability of attack is 0. Thus, the more spent on preventing loss (in Stage 1 of the game) the lower the probability of attack. To make the analysis sharper we assume that resources allocated to prevention are not sunk. This is consistent with compensation for spending on prevention if attacked. For instance, the individual may not need to pay a cyber-security provider if they are attacked. The payoff of the individual at the end of Stage 2 is, therefore, 0 if she is attacked and V − I if she is not attacked. Note that if the individual fully prevents attack her payoff is V − 100p. If the individual is not attacked then the game ends. If she is attacked then we proceed to Stage 3. In Stage 3 the individual has the opportunity to recover her losses. Specifically she can allocate resource to recovery. She can devote any amount up to R̄, where R̄ is exogenously given. Let R ∈ [0, R̄] denote the amount allocated to recovery. In Stage 4 the individual can recover her files. The probability of the individual recovering her files is given by R/100. So, if the individual devotes no resource to recovery she has no chance of recovering her losses. If she spends R̄ then she has an R̄% chance of recovering her files. The final payoff of the individual is given by V − R if the files are recovered and −R if they are not. Note that money spent on recovery is sunk and so paid irrespective of whether the file is recovered. This captures the notion of paying a ransom to a criminal who may or may not honour their part of the bargain. A strategy for the individual details the amount of resources allocated to preventing loss, I, and the amount that will be spent on recovery, R, in the event of attack. The strategy space is, therefore, [0, Ī] × [0, R̄].4

3 Here we assume that all attacks are ‘successful’. It would be equivalent to allow for deterrence and distinguish between successful and unsuccessful attacks.
4 If I = 100p then attack is impossible and so, theoretically, there is a redundancy in choosing R in this case.
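To make the timing of the four stages concrete, the following is a minimal simulation sketch of a single round under the payoff rules just described. It is our own illustration, not the authors' implementation; the function name, parameter defaults and use of Python's random module are assumptions.

```python
import random

def play_round(I, R, V=100, p=0.5, I_bar=50, R_bar=50, rng=random):
    """Simulate one round of the prevention and recovery game.

    I: tokens allocated to prevention (0 <= I <= I_bar)
    R: tokens allocated to recovery if attacked (0 <= R <= R_bar)
    Returns the individual's final payoff in tokens.
    """
    assert 0 <= I <= I_bar and 0 <= R <= R_bar
    # Stage 2: Nature decides whether an attack occurs.
    attack_prob = max((100 * p - I) / (100 - I), 0) if I < 100 else 0.0
    if rng.random() >= attack_prob:
        # Not attacked: prevention spending is not sunk, so the payoff is V - I.
        return V - I
    # Stages 3 and 4: attacked; recovery spending R is sunk.
    recovered = rng.random() < R / 100
    return (V - R) if recovered else -R
```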

3 Theory

We begin with some preliminary definitions. A prospect (x_1, q_1; x_2, q_2; . . . ; x_n, q_n) lists a set of n possible outcomes x_1, . . . , x_n and the probability of each outcome q_1, . . . , q_n (where Σ_i q_i = 1) [26]. The expected value of a prospect (x_1, q_1; . . . ; x_n, q_n) is given by Σ_{i=1}^{n} x_i q_i. The expected deviation from expected value is given by Σ_{i=1}^{n} |x_i − e| q_i, where e is the expected value. We say that a prospect is a sure thing if n = 1 and a risky prospect if n > 1. Consider two prospects, A = (x_1^A, q_1^A; . . . ; x_n^A, q_n^A) and B = (x_1^B, q_1^B; . . . ; x_m^B, q_m^B). Suppose that A and B have the same expected value and B has a smaller expected deviation from expected value. Adopting standard terminology (without being constrained to a particular functional form) we say that an individual is risk averse (on domain A, B) if she prefers prospect B to A, is risk loving if she prefers A to B and is risk neutral if she is indifferent between A and B. Note that this definition is agnostic on whether risk aversion is due to curvature of the utility function and/or loss aversion [27]. It merely says that a risk averse individual prefers more certainty. We further distinguish between gains and losses. We say that a prospect is on the gain domain if the expected value is positive and on the loss domain if the expected value is negative. This terminology allows us to capture a reflection effect in which an individual is risk averse on the gain domain (would prefer prospect B to A if the expected value of the prospects is positive) and risk loving on the loss domain (would prefer prospect A to B if the expected value of the prospect is negative) [13]. To solve for the optimal strategy of the individual in the prevention and recovery game we proceed by backward induction. This means we start by solving the optimal strategy in the recovery stage and then (knowing what will happen in the recovery stage) solve for the optimal strategy in the prevention stage.

3.1 Recovery Stage

Crucially, the prevention and recovery game is designed so that incentives and payoffs in the recovery stage, stage 3, are independent of the choice of I in the prevention stage, stage 1. We can, thus, analyze the recovery stage in a relatively straightforward way without taking into account I. In the recovery stage the individual has been attacked and simply has to decide how much to spend on recovery. Inspired by the evidence of loss aversion and reflection effect we hypothesize that attitudes to risk and, therefore, willingness to spend on recovery will be influenced by framing and the reference point. We contrast two possibilities. In a gain frame we think of 0 as the status-quo. This implies that the individual has already internalized the loss of her files and so is now in the mindset of potentially regaining them. To choose R is to choose prospect (V − R, r; −R, 1 − r), where r = R/100. In other words there is probability r of recovering the files and having payoff V − R and probability 1 − r of non-recovery and having payoff −R. Our game is set up in such a way that if V = 100 a risk neutral individual is indifferent as to how much she spends on recovery. Specifically, setting V = 100, the expected value from spending R on recovery is

$$EV(R) = V \times \frac{R}{100} - R = 0. \qquad (1)$$

Note that, in this case, expected value is independent of R. To investigate the incentives of an individual who is not risk neutral let us contrast the choice of R = 0 and R = R̄. Throughout we fix V = 100. If the individual chooses R = 0 then her final payoff is 0; so she has sure prospect (0, 1). If she chooses R = R̄ then she has an r = R̄/100 chance of recovering her files and getting payoff V − R̄ and a (1 − r) chance of not recovering her files and getting payoff −R̄; so she has prospect (V − R̄, r; −R̄, 1 − r). The individual, therefore, has a choice between a sure thing and a risky prospect (with the same expected value). So, a risk averse individual would set R = 0 and a risk loving individual would set R = R̄. In a loss frame we think of −V as the status-quo. This implies that the individual has not accepted the loss of her file and so sees recovery as a way to avoid the loss. Specifically, if she spends nothing on recovery she has sure loss (−V, 1). If she spends R on recovery she faces prospect (−R, r; −V − R, 1 − r). Again, if V = 100, a risk neutral individual is indifferent as to her choice of R, a risk averse individual would set R = 0 and a risk loving individual would set R = R̄. The preceding discussion leads to our first result.

Proposition 1. If V = 100 an individual who is risk averse will allocate no resource to recovery while an individual who is risk loving will allocate the maximum resource to recovery.

The reflection effect suggests that individuals will be risk averse in the gain frame and risk loving in the loss frame. Thus, there is a tendency towards R = 0 in the gain frame and R = R̄ in the loss frame. We, therefore, obtain a testable hypothesis.

Hypothesis 1. In a gain frame individuals will invest less in recovery than do individuals in a loss frame.

3.2 Prevention Stage

The optimal choice in the prevention stage, stage 1, will depend on what the individual is going to choose in the recovery stage, stage 3. From Proposition 1 we know that (unless the individual is risk neutral) the optimal strategy in the recovery stage will be to choose either R = 0 or R = R̄. Using a gain and loss frame we will consider each possibility in turn. In a gain frame the status quo is to have 0 payoff meaning that the individual has not internalized the ownership of the files. So to not be attacked would be a gain. Similarly, to recover the files would be a gain. Suppose that the individual would choose R = 0 in stage 3. This means that she makes no attempt to recover her files in the case of being attacked. Her expected value in stage 1, setting V = 100, is therefore

$$EV(I) = (V - I) \times \left(1 - \frac{100p - I}{100 - I}\right) = 100(1 - p). \qquad (2)$$

Again, a risk neutral individual is indifferent as to how much to spend on prevention.
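The cancellation behind Eq. (2) is worth spelling out. The intermediate algebra below is ours, using V = 100; the spending on prevention drops out, which is exactly why a risk neutral individual is indifferent as to I:

```latex
\begin{aligned}
EV(I) &= (V - I)\left(1 - \frac{100p - I}{100 - I}\right)
       = (100 - I)\cdot\frac{(100 - I) - (100p - I)}{100 - I} \\
      &= 100 - 100p = 100(1 - p).
\end{aligned}
```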


To progress further contrast the two extremes of I = 0 and I = V p. If the individual chooses I = 0 then there is a p chance they are attacked and have payoff 0 and a 1 − p chance they have payoff V. To not allocate to prevention is, therefore, to choose a risky prospect (0, p; V, 1 − p). If the individual chooses I = V p then she has final payoff of V (1 − p). To fully prevent is, therefore, to choose sure prospect (V (1 − p), 1). Thus, if V = 100 a risk averse individual would set I = V p and a risk loving individual would set I = 0. Next suppose that the individual would choose R = R̄ in stage 3. You can verify that the expected value is still V (1 − p) and so independent of I. To set I = V p is still to choose sure prospect (V (1 − p), 1) and so will appeal to someone who is risk averse. To set I = 0 is now to choose a risky prospect with possible payoffs −R̄, V − R̄ and V. This will appeal to someone who is risk loving. In a loss frame the status-quo is to have V meaning that the individual has internalized ownership of the file. So to be attacked would be a loss. The preceding analysis follows through independent of the frame. But we do need to reconsider the interpretation of fully preventing attack. If the individual sets I = V p then she has a sure loss of V p relative to her status-quo. This compares to a sure gain of V (1 − p) in the gain frame. The reflection effect would, therefore, point to more risk loving behavior in the loss frame. This leads to our second result and hypothesis.

Proposition 2. If V = 100 an individual who is risk averse will allocate maximum resource to prevention while an individual who is risk loving will allocate nothing to prevention.

Hypothesis 2. In the gain frame individuals will invest more on prevention than in the loss frame.

3.3 Summary

Our two propositions and two hypotheses are summarized in Table 1. Overall we expect a risk averse individual would allocate resource to prevention rather than recovery while a risk loving individual would allocate resource to recovery rather than prevention. Moreover, we predict that in the gain frame individuals are more likely to be risk averse than in the loss frame. We now proceed to an experiment designed to test these predictions.

Table 1. Behaviour in the prevention and recovery game.

Attitudes      Prevention   Recovery   Framing
Risk averse    I = 100p     R = 0      Gain
Risk loving    I = 0        R = R̄      Loss

4 Experiment

The experiment consisted of 20 rounds and was conducted in computer rooms at the University of Kent. Participants were recruited from the general student population and did not have experience in participating in an individual risk taking experiment. In total 77 participants were recruited, with ages ranging from 18 to 41, 49 of whom were female. At the end of the experiment, 2 of the 20 rounds were selected to be paid in cash. 20 tokens were converted to £1. All participants were given an additional £5 show-up fee. The average final earnings were £10.01. All tasks in the experiment were computerized using z-Tree [29]. In each round participants first performed a short real effort task in which they ‘earned’ a file worth 100 tokens. The task was to solve a travelling salesman problem (TSP) by finding a path between 10 points less than a pre-defined distance. The task was set up in such a way that it was easily solved. Even so, we expected that completing the task may give subjects more ownership over the ‘file’ and corresponding 100 tokens. An example of a completed file is shown in Fig. 1. A further benefit of using the TSP is that it gives us additional data on participants in terms of route length. This could be seen as a measure of engagement with the experiment.

Fig. 1. An example of a completed TSP task

After completing the TSP participants played the prevention and recovery game outlined above. We set p = 0.5 and R̄ = 50. This choice of parameters was designed to give maximal variation in risk. In particular, a subject could choose (in both the prevention and recovery stages) between a sure thing and a prospect with a 50–50 risk profile. In interpretation, p = 0.5 means there was a 50% chance of attack in the event of no prevention. This is consistent with an individual who does nothing to prevent attack having a relatively high chance of being attacked. Setting R̄ = 50 means that an individual has at most a 50% chance of recovering files after attack. This is consistent with the notion that there is no guarantee files can be recovered after an attack. In the experiment we consider a relatively continuous choice set in that subjects could allocate any integer amount to prevention, from 0 to 50, and to recovery, also from 0 to 50. In each round participants were asked to make their recovery choice without having been informed of the outcome of the prevention stage. This gives us additional data in that we see the recovery choice of the subject in every round (even if they were not attacked). At the end of the round participants were given full feedback on the outcome in that round (whether they were attacked and whether they recovered their files) before they moved on to the next round.

Fig. 2. An example of the prevention stage in the game (investing 10 tokens to prevent the file from attack; I = 10) (Color figure online)

Note that in this experiment we deliberately used a cyber-security frame that explicitly talked of files, cyber-attack and recovery etc. The experimental interface was also designed in a way to make the probability of attack and recovery as transparent as possible. Specifically, in the prevention stage participants were presented with a box with 100 balls on their computer screen, 50 of them red and 50 blue. For each token a participant put in a ‘cyber-security account’ a red ball was removed from the box (see Fig. 2 for an example). Once the participant had confirmed their choice the computer randomly selected a ball from those remaining. If the selected ball is red, the file would be attacked. If the selected ball is blue, the file would not be attacked.


In the recovery stage participants were presented with a box of 100 brown balls. For each token the participant put in a ‘recovery account’ one brown ball was replaced with a green ball (see Fig. 3 for an example). Once the participant had confirmed their choice the computer randomly selected a ball from the box. If the selected ball was brown the file was not recovered and if it was green it was. Note that the visual approach just described, for both the prevention and recovery stages, matches the incentives in the prevention and recovery game. For example, if I = 10 in the prevention stage then 10 red balls are removed and the probability of attack becomes (50 − I)/(100 − I) = 4/9. Similarly, if R = 10 in the recovery stage then the probability of recovery is R/100 = 1/10.
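As a quick sanity check that the ball interface reproduces the game's probabilities, the helper below (our own illustration, not part of the authors' z-Tree implementation) computes the implied attack and recovery probabilities.

```python
def interface_probabilities(I, R, p=0.5):
    """Attack/recovery probabilities implied by the ball boxes.

    Prevention box: 100 balls, 100*p of them red; each token removes one red ball.
    Recovery box: 100 brown balls; each token turns one brown ball green.
    """
    red = 100 * p - I                  # red balls left after removing I
    attack_prob = red / (100 - I)      # draw from the remaining 100 - I balls
    recovery_prob = R / 100            # green balls out of 100
    return attack_prob, recovery_prob

# Matches the worked example in the text: I = 10 gives 40/90 = 4/9.
assert abs(interface_probabilities(10, 10)[0] - 4 / 9) < 1e-12
```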

Fig. 3. An example of the recovery stage in the game (Color figure online)

We ran two treatments corresponding to a gain and loss frame. The treatments differed only in the instructions provided to subjects. The key differences are illustrated in Table 2. The gain frame emphasizes the potential to gain 100 tokens by keeping or recovering the file. By contrast, the loss frame emphasizes the potential to lose the 100 tokens. Note that a subject was only exposed to one of the treatments - gain or loss. At the end of each treatment, we included two additional sets of questionnaires: one is a domain specific risk taking (DoSpeRT) task [28], the other includes demographic questions and a survey on attitudes towards ransomware. The two framings allow us to test Hypotheses 1 and 2 while data on risk attitudes allows us to explore Propositions 1 and 2.

4.1 Results

Let us look first at the average amount invested in prevention and recovery. Table 3 reports the mean and median amount of tokens participants invested in prevention and recovery. Figure 4 presents a box-plot of spending in prevention and recovery. The conditional R box controls for subjects who fully prevent (and so the recovery decision is irrelevant). We find that in the gain treatment investment in prevention is significantly higher than investment in recovery (p < 0.05, two-sided Wilcoxon matched pairs signed-rank test with individual averages as unit observations). By contrast, in the loss treatment there is no significant difference (p > 0.1). This leads to our first experimental result.

Result 1. Participants invested more in prevention than recovery in the gain treatment. There is no significant difference between investment in prevention and recovery in the loss treatment.

Table 2. Comparison of two framings - gain and loss.

Gain, prevention stage: “... If you are not attacked, you will not lose access to the file and so it is still worth 100 Tokens.”
Gain, recovery stage: “... If the selected ball is green, your files are recovered. You regain the 100 Tokens.”
Loss, prevention stage: “... If you are attacked, you will lose access to the file saved in stage 1 and so it becomes worthless. You therefore lose the 100 Tokens.”
Loss, recovery stage: “... If the selected ball is brown, you do not recover your file. The 100 Tokens are lost.”

Table 3. Investment in Prevention (I) and Recovery (R) by treatments. Mean investments in I and R measure the average of participants' investment in prevention and recovery. Median is derived from the median of average investments.

              Prevention (I)   Recovery (R)
Gain  Mean        29.2             22.7
      Median     (28.6)           (22.4)
Loss  Mean        28.2             26.0
      Median     (31.1)           (25.4)

We next consider Hypotheses 1 and 2. Given that the observations of prevention and recovery are not independent, to assess the gain-loss treatment effect we take the average of participants' investment in back-up and recovery and run a bootstrap linear regression with robust standard errors. The regression results are reported in Table 4. The effect of treatment on prevention is statistically insignificant, whereas the effect on recovery is marginally lower in the gain treatment (p < 0.1). This result provides some marginal support for Hypothesis 1. There is no support for Hypothesis 2.

Result 2. The amount invested in prevention is the same in the gain and loss treatments. Investment in recovery is marginally higher in the loss treatment than in the gain treatment.


Fig. 4. Box plot of average numbers of tokens invested in prevention (I), and recovery (R and RCon).

The average data (see Table 3) suggests that spending on prevention and recovery is ‘in the middle’ of the permissible range. At the individual level, however, we see clustering at the extreme combinations (I, R) = (0, 0); (0, 50); (50, 0) and (50, 50). For instance, around 13% of subjects invest nothing in prevention (11% in the gain treatment and 15% in the loss treatment) while 23% invest the maximum amount (24% and 22%). Around 22% of subjects invest nothing in recovery (29% and 16%) while 24% invest the maximum amount (20% and 27%). To study individual behavior in more detail we classify behaviour into 5 categories detailed in Table 5. The categories are (1) prevention lover, who invests in prevention not recovery, (2) recovery lover, who invests in recovery not prevention, (3) payment lover, who invests in both prevention and recovery, (4) payment averse, who invests in neither prevention nor recovery, and (5) intermediate, who invests ‘in the middle’ for both prevention and recovery. Figure 5 provides a scatter plot of the frequency of each combination of I and R. We can see large clusters at payment lover, I = R = 50, and prevention lover, I = 50, R = 0. There are also smaller clusters at payment averse, I = 0, R = 0, and recovery lover, I = 0, R = 50. A lot of subjects also fit in the intermediate category. Overall, therefore, there is considerable heterogeneity in how subjects behaved in the task. One interesting thing to observe is the near symmetric split around I = R, meaning that a large proportion of subjects invest more in recovery than prevention. This leads to our next result.

Result 3. There is large heterogeneity in individual behaviour with clustering at the extremes of full prevention, no prevention or recovery, and maximum recovery.
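The five categories can be expressed as a simple classification rule. The sketch below is our own reading of the definitions; the exact cut-offs used for Table 5 are not reproduced in this excerpt, so the low/high thresholds are assumptions.

```python
def classify(I, R, low=10, high=40):
    """Classify a subject's average (I, R) choices into the five categories.
    The low/high thresholds are illustrative assumptions, not the paper's."""
    if I >= high and R <= low:
        return "prevention lover"   # invests in prevention, not recovery
    if R >= high and I <= low:
        return "recovery lover"     # invests in recovery, not prevention
    if I >= high and R >= high:
        return "payment lover"      # invests in both
    if I <= low and R <= low:
        return "payment averse"     # invests in neither
    return "intermediate"           # 'in the middle' for both
```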


Table 4. Bootstrap regression on back-up and recovery investments by participants. The unit observation is the average investment in recovery (R) and investment (I). Independent variables include treatment framings and individual characteristics, ethical/financial risk taking attitudes, gender, age, first language English and average total distance traveled in TSP games. Robust standard errors in parentheses.

Realistic versus Rational Secret Sharing (Y. Desmedt and A. Slinko)

[...] Ni > 0 is the value of the common good for party i and c > 0 is the cost of participation in the secret recovery. Here are some examples.

Authorising the project. Imagine that a sufficiently authoritative group in a society, say a city council, can launch (authorise) certain activity that will result in a public good which will be consumed by all members of the society, bringing a utility Ni to member i of the council (due to possible corruption the utility Ni of member i could be much larger than the utilities of other members). To authorise the project one needs to gather a coalition that would be authorised to learn (or use) the secret (say, unlocking funds for the activity). The structure of authorised coalitions is given by an access structure Γ of a secret sharing scheme, e.g., a 2/3 majority may be needed. The participation in the coalition, however, comes at a certain cost, say c > 0, which may come in the form of time spent on negotiations or the responsibility for supporting the project.

Launching a nuclear missile. In the former USSR any two of the three top state officials needed to activate their nuclear suitcases to launch a missile. The recovery of the secret could open a possibility of creation of a public good (defeating an enemy) that all the society will consume collectively. It also gives us a sense of the possible cost that participants can pay for their participation in the recovery of the secret. Inevitably, many people will die as a result and resolving this moral dilemma may be daunting. This is, by the way, why maintaining the privacy of participants (i.e., who actually participated in the recovery) becomes paramount.

Threshold cryptography: signing a message. If a message has to be signed on behalf of the organisation, say Microsoft, the signing key is split into shares between several authorised executives who sign the message with their shares and the combiner then transforms their results into the message signed by the organisation [8,10].


At this point we have to make several important observations.

1. We note that all three examples have a power sharing flavour. In all three, participants perform as experts approving or not approving the project in the first instance, a launch of a missile in the second, and a message in the third. Unlike the Halpern-Teague scenario, in such a case non-reconstruction may be beneficial: parties do not reveal/use their shares if they suspect that their utilities may be negative as a result, i.e., the common good may be not good at all. As a side note, the common bad for the organisation may actually be a public good for the society (but this is not captured by our model).
2. In all three cases the secret itself is a meaningless combination of zeros and ones and knowledge of it has no value to participants (unless they want to engage in an illegal activity). An authorised coalition is not aiming to recover the secret but to use it to launch certain activity. In fact, the secret should never be recovered.
3. It is amazing how different the two main applications of threshold cryptography are. The signing of a message on behalf of an organisation is totally different from decrypting an incoming message. Indeed, all incoming messages should be read, hence decrypted, while for signing a harmful message a person may be fired. Also there is no incentive to be the last signer.

We will show that the game with utilities satisfying V1 and V2 has a natural set of Nash equilibria provided the cost of participation c is not too high. Firstly, we show that the equilibria of such a game are all in pure strategies. A mixed strategy for participant i is characterised by a single non-negative number 0 ≤ αi ≤ 1 which is the probability of participant i disclosing her share in the recovery stage. Given a Nash equilibrium, we say that a player is inessential in this Nash equilibrium if her utility does not depend on her strategy.

Lemma 1. In every Nash equilibrium for the game with utilities satisfying V1 and V2 a player plays a pure strategy or else this player is inessential.

Proof. The strategy of the ith member of the society is the probability αi of participating in the recovery of the secret. Suppose that a vector of probabilities α∗ = (α1, α2, . . . , αn) is a Nash equilibrium. Suppose for some i ∈ [n] we have 0 < αi < 1. For convenience and without loss of generality we may assume that i = n. We look for the best response of participant n given the tuple of strategies (α1, α2, . . . , αn−1), which we can therefore view as fixed. Then the expected utility of participant n, when she uses strategy x, would be

$$E_n(x) = x \cdot (-c) + \big[x \cdot f_\Gamma(\alpha_1, \alpha_2, \ldots, \alpha_{n-1}) + g_\Gamma(\alpha_1, \alpha_2, \ldots, \alpha_{n-1})\big] N_n,$$

where fΓ and gΓ are two functions that depend on the access structure Γ. Namely, the value fΓ(α1, α2, . . . , αn−1) is the probability of a non-authorised coalition which is a subset of {1, . . . , n − 1} that together with n becomes authorised; and gΓ(α1, α2, . . . , αn−1) is the probability of an authorised coalition in {1, . . . , n − 1} without participation of n.


As we see, En(x) is a linear function in x and takes its extremal values either at 0 or at 1. This means the best response is a pure strategy unless x cancels out. In such a case player n is inessential. This happens if fΓ(α1, α2, . . . , αn−1)Nn = c. To illustrate the proof, here is the calculation of the expected utility of participant 3 in the case n = 3 and a 2-out-of-3 scheme:

$$E_3(x) = x \cdot (-c) + \big[x(\alpha_1(1 - \alpha_2) + \alpha_2(1 - \alpha_1)) + \alpha_1\alpha_2\big] N_3.$$

When [α1(1 − α2) + α2(1 − α1)]N3 = c player 3 becomes inessential. This may be true for all three players. This, for example, happens when 2α(1 − α)N = c and N = N1 = N2 = N3, with all three players choosing strategy α. Let X ⊆ [n], then the characteristic vector vX of the subset X is the vector for which (vX)i = 1 in case i ∈ X and (vX)i = 0, otherwise.

Theorem 3. Suppose we have a Nash equilibrium of the game with utilities satisfying V1 and V2 such that every player is essential. Then this Nash equilibrium is one of the vectors vX, where X is a minimal authorised coalition of Γ such that for every i ∈ X we have Ni > c. If Γ does not have a self-sufficient participant i with Ni > c, then the zero vector is also a Nash equilibrium. All Nash equilibria survive deletion of weakly dominated strategies.

Proof. It is easy to see that for a party i with Ni < c the dominant strategy is always to abstain. By Lemma 1 we may assume that all values of a vector of Nash equilibrium are either 0 or 1. It is easy to see that the zero vector (in the absence of self-sufficient participants) and vectors vX for X ∈ Γmin with Ni > c for every i ∈ X are Nash equilibria. Also vectors vY for Y ⊃ X, where X ∈ Γmin, are not Nash equilibria since some participants may change their probabilities to zero and save the amount c > 0. If X is not authorised and somebody is playing a non-zero strategy, then this participant can change their probabilities to zero and be better off. This proves the theorem.
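A small numerical sketch of the 2-out-of-3 calculation in the proof of Lemma 1 above (our own illustration, with assumed numbers): E3(x) is linear in x, and player 3 becomes inessential exactly when 2α(1 − α)N = c.

```python
def expected_utility_player3(x, a1, a2, N3, c):
    """E_3(x) for the 2-out-of-3 scheme: x is player 3's disclosure
    probability, a1 and a2 are the other players' probabilities."""
    pivot = a1 * (1 - a2) + a2 * (1 - a1)   # prob. exactly one other player discloses
    both = a1 * a2                          # prob. both other players disclose
    return -c * x + (x * pivot + both) * N3

# With N = N1 = N2 = N3 and all players using the same alpha, every player
# is inessential when 2*alpha*(1 - alpha)*N = c, e.g. N = 10, c = 4.8, alpha = 0.4.
N, c, alpha = 10.0, 4.8, 0.4
for x in (0.0, 0.5, 1.0):
    print(round(expected_utility_player3(x, alpha, alpha, N, c), 6))
# The three printed values coincide, showing E_3 does not depend on x.
```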

6 Conclusion

We have argued in this paper that secret sharing is used in different circumstances and there are many realistic situations where the assumptions and conclusions of Halpern and Teague do not apply. It all depends on the nature of the secret, and, if it opens the way for consumption of a common good, another set of assumptions is needed, leading to a range of non-trivial Nash equilibria. In particular, we show that there is a large set of scenarios, like threshold cryptography and MPC, for which secrets should not be recovered but used. Indeed, in threshold signatures, the signing key of the organisation should not be recovered. If parties want to co-sign a document, they should use their shares to do so. Similarly in MPC, the parties should not use their shares of the inputs, but only the shares of the output. Unlike the Halpern and Teague scenario, non-participation may be beneficial and incentives for non-recovery in such circumstances would make both threshold cryptography and MPC more attractive to potential users.


References 1. Beerliov´ a-Trub´ıniov´ a, Z., Hirt, M.: Perfectly-secure MPC with linear communication complexity. In: Canetti, R. (ed.) TCC 2008. LNCS, vol. 4948, pp. 213–230. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78524-8 13 2. Ben-Or, M., Goldwasser, S., Wigderson, A.: Completeness theorems for noncryptographic fault-tolerant distributed computation. In Proceedings of the Twentieth Annual ACM Symposium Theory of Computing, STOC, 2–4 May 1988, pp. 1–10 (1988) 3. Blakley, G.R.: Safeguarding cryptographic keys. In: Proceedings of the National Computer Conference, pp. 313–317 (1979) 4. Boyd, C.: Digital multisignatures. In: Beker, H., Piper, F. (eds.) Cryptography and Coding, pp. 241–246. Clarendon Press, Oxford (1989) 5. Croft, R.A., Harris, S.P.: Public-key cryptography and re-usable shared secrets. In: Beker, H., Piper, F. (eds.) Cryptography and Coding, pp. 189–201. Clarendon Press, Oxford (1989) 6. Desmedt, Y.: A high availability internetwork capable of accommodating compromised routers. BT Technol. J. 24, 77–83 (2006) 7. Desmedt, Y., Frankel, Y.: Threshold cryptosystems. In: Brassard, G. (ed.) CRYPTO 1989. LNCS, vol. 435, pp. 307–315. Springer, New York (1990). https:// doi.org/10.1007/0-387-34805-0 28 8. Desmedt, Y.G.: Threshold cryptography. Eur. Trans. Telecommun. 5, 449–457 (1994) 9. Desmedt, Y.: Society and group oriented cryptography: a new concept. In: Pomerance, C. (ed.) CRYPTO 1987. LNCS, vol. 293, pp. 120–127. Springer, Heidelberg (1988). https://doi.org/10.1007/3-540-48184-2 8 10. Desmedt, Y.: Some recent research aspects of threshold cryptography. In: Okamoto, E., Davida, G., Mambo, M. (eds.) ISW 1997. LNCS, vol. 1396, pp. 158–173. Springer, Heidelberg (1998). https://doi.org/10.1007/BFb0030418 11. Desmedt, Y.: Unconditionally private and reliable communication in an untrusted network. In: Proceedings of IEEE Information Theory Workshop on Theory and Practice in Information-Theoretic Security, 16–19 October 2005, pp. 38–41 (2005) 12. Dolev, D., Dwork, C., Waarts, O., Yung, M.: Perfectly secure message transmission. J. ACM 40, 17–47 (1993) 13. Garay, J., Katz, J., Maurer, U., Tackmann, B., Zikas, V.: Rational protocol design: cryptography against incentive-driven adversaries. In: 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pp. 648–657. IEEE (2013) 14. Gennaro, R., Rabin, M.O., Rabin, T.: Simplified VSS and fact-track multiparty computations with applications to threshold cryptography. In: Proceedings of the Annual ACM Symposium on Principles of Distributed Computing (PODC), pp. 101–111 (1998) 15. Gordon, S.D., Katz, J.: Rational secret sharing, revisited. In: De Prisco, R., Yung, M. (eds.) SCN 2006. LNCS, vol. 4116, pp. 229–241. Springer, Heidelberg (2006). https://doi.org/10.1007/11832072 16 16. Halpern, J., Teague, V.: Rational secret sharing and multiparty computation: extended abstract. In: Proceedings of the Thirty-sixth Annual ACM Symposium on Theory of Computing STOC ’04, New York, NY, USA, pp. 623–632. ACM (2004)


17. Kawachi, A., Okamoto, Y., Tanaka, K., Yasunaga, K.: General constructions of rational secret sharing with expected constant-round reconstruction. Comput. J. 60, 711–728 (2016) 18. Kol, G., Naor, M.: Cryptography and game theory: designing protocols for exchanging information. In: Canetti, R. (ed.) TCC 2008. LNCS, vol. 4948, pp. 320–339. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78524-8 18 19. Kurosawa, K., Suzuki, K.: Truly efficient 2-round perfectly secure message transmission scheme. IEEE Trans. Inf. Theory 55, 5223–5232 (2009) 20. Liu, C.L.: Introduction to Combinatorial Mathematics. McGraw-Hill, New York (1968) 21. McEliece, R.J., Sarwate, D.V.: On sharing secrets and Reed-Solomon codes. Commun. ACM 24, 583–584 (1981) 22. Mishra, A., Mathur, R., Jain, S., Rathore, J.S.: Cloud computing security. Int. J. Recent Innovation Trends Comput. Commun. 1, 36–39 (2013) 23. Nojoumian, M., Stinson, D.R.: Socio-rational secret sharing as a new direction in rational cryptography. In: Grossklags, J., Walrand, J. (eds.) GameSec 2012. LNCS, vol. 7638, pp. 18–37. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3642-34266-0 2 24. Samuelson, P.A.: The pure theory of public expenditure. Rev. Econ. Stat. 36, 387–389 (1954) 25. Samuelson, P.A.: Diagrammatic exposition of a theory of public expenditure. Rev. Econ. Stat. 37, 350–356 (1955) 26. Shamir, A.: How to share a secret. Commun. ACM 22, 612–613 (1979) 27. Shoham, Y., Tennenholtz, M.: Non-cooperative computation: boolean functions with correctness and exclusivity. Theoret. Comput. Sci. 343, 97–113 (2005) 28. Yao, A.C.: Protocols for secure computations. In 23rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 160–164. IEEE Computer Society Press (1982) 29. Yao, A.C.: How to generate and exchange secrets. In: 27th Annual Symposium on Foundations of Computer Science (FOCS), pp. 162–167 (1986). IEEE Computer Society Press

Solving Cyber Alert Allocation Markov Games with Deep Reinforcement Learning

Noah Dunstatter(B), Alireza Tahsini, Mina Guirguis, and Jelena Tešić

Department of Computer Science, Texas State University, 601 University Dr, San Marcos, TX 78666, USA
{nfd8,tahsini,msg,jtesic}@txstate.edu

Abstract. Companies and organizations typically employ different forms of intrusion detection (and prevention) systems on their computer and network resources (e.g., servers, routers) that monitor and flag suspicious and/or abnormal activities. When a possible malicious activity is detected, one or more cyber-alerts are generated with varying levels of significance (e.g., high, medium, or low). Some subset of these alerts may then be assigned to cyber-security analysts on staff for further investigation. Due to the wide range of potential attacks and the high degrees of attack sophistication, identifying what constitutes a true attack is a challenging problem. In this paper, we present a framework that allows us to derive game-theoretic strategies for assigning security alerts to security analysts. Our approach considers a series of sub-games between the attacker and defender with a state maintained between subgames. Due to the large sizes of the action and state spaces, we present a technique that uses deep neural networks in conjunction with Q-learning to derive near-optimal Nash strategies for both attacker and defender. We assess the effectiveness of these policies by comparing them to optimal policies obtained from brute force value iteration methods, as well as other sensible heuristics (e.g., random and myopic). Our results show that we consistently obtain policies whose utility is comparable to that of the optimal solution, while drastically reducing the run times needed to achieve such policies. Keywords: Game theory · Markov game · Network security Machine learning · Deep reinforcement learning

1 Introduction

Motivation and Scope: The rise of Advanced Persistent Threats (APT) against private and public organizations has put a significant strain on the resources held by these organizations and especially on their cyber-defense personnel (i.e., security analysts) who need to investigate security issues that arise. When an attack is launched against an organization, network and system resources (e.g., Intrusion Detection Systems, Anti-malware tools, etc.) typically generate cyber-alerts with varying levels of severity. These alerts need to be investigated by analysts to thwart ongoing attacks. Legitimate network activity, however, can also trigger the generation of some of these alerts, making it extremely challenging for analysts to discern a true attack from legitimate activity. A worst-case scenario happens when precious time is wasted investigating false alerts while true alerts from ongoing attacks are ignored. This in fact occurred during the Target attack wherein malware alerts were repeatedly generated but not addressed by analysts [21]. To highlight the severity of this issue within the network security domain, in [16] it was reported that an average of 17,000 alerts are generated by intrusion detection software every week at the surveyed organizations. Of these alerts, about 19% (3,218) are estimated to be legitimate, with only 4% (705) eventually being investigated by security analysts. This relatively low proportion of assignment makes the defender's allocation strategy (i.e., assignment of alerts to analysts) much more critical when it comes to minimizing risk. This high volume of alerts coupled with a typically small number of security analysts motivated us to investigate what a game-theoretic policy looks like in such a domain.

The use of game theory to study allocation of cyber alerts to analysts has been investigated in previous works [4,18,19]. The authors in [18,19] introduce a Stackelberg game-theoretic model to determine the best allocation of incoming cyber alerts to analysts. The model assumes a one-shot game in which both the alert resolution time and the arriving alert distribution are deterministically known. In [4] the authors develop a game-theoretic model that considers a series of games between the attacker and the defender with a state maintained between such sub-games that captures the availability of analysts as well as an attack budget metric. Using dynamic programming coupled with Q-maximin value iteration based algorithms they were able to obtain optimal policies for both players. However, as the state and action spaces of the games become larger, it becomes computationally prohibitive to use such methods to obtain optimal policies. Instead it is common to adopt methods capable of approximating the optimal policy. Methods such as deep reinforcement learning have shown great promise in the single agent domain wherein networks were shown to learn empirically successful policies from the raw frames of various video games [10,15]. As opposed to previous works, in this paper we adopt a deep reinforcement learning approach that we modify to handle the game-theoretic nature of uncooperative extensive form 2-player games.

Contributions: In this paper, we make the following contributions:

1. Develop a Deep Nash Q-Network framework that captures the game-theoretic behaviors of the attacker and defender by replacing the traditional greedy max operator with a minimax one.
2. Tame the curse of dimensionality caused by the prohibitively large state and action spaces by (1) relying on deep reinforcement learning to approximate the quality of state-action pairs, (2) performing a loss-less compression of player action spaces without sacrificing the representation of states and/or actions, and (3) using iterative fictitious play to approximate the solutions to sub-games.
3. Assess the performance of our proposed approximate game-theoretic solution method and its derived policies against the known optimal value-iteration based solution (where the state space allows) as well as other heuristics that are typically employed.

Paper Organization: In Sect. 2 we put this work in context with related work. In Sect. 3 we present our stochastic game formulation for the cyber-alert assignment problem and in Sect. 4 we present our solution framework, focusing on the methods developed to tackle the large state/action space and to solve the sub-games efficiently. In Sect. 5 we present our performance evaluation and conclude the paper in Sect. 6 with a summary.

2 Related Work

Improving the scheduling and efficiency of cyber-security analysts is a highly studied area of research [1,5,6,26]. The authors in [1] model the problem as a two-stage stochastic shift scheduling problem in which the first stage allocates cyber-security analysts and in the second stage additional analysts are allocated. The problem is discretized and solved using a column generation based heuristic. The authors in [5] study optimal alert allocation strategies with a static workforce size and a fixed alert generation mechanism. In [6] the authors develop a reinforcement learning-based dynamic programming model to schedule cyber-security analyst shifts with the model based on a Markov Decision Process framework with stochastic load demands. In [26], the author describes different strategies for managing security incidents in a cyber-security operation center. The authors in [20] propose a queuing model to determine the readiness of a Cyber-Security Operations Center (CSOC). This paper departs from this previous research by explicitly considering the presence of a strategic and well informed adversary within a temporally stateful environment. The use of game theory has been instrumental in advancing the state-of-the-art in security games and their wide range of applications [3,7,18,19,22,25]. As mentioned in Sect. 1, the work in [4,18,19] adopted a Stackelberg game-theoretic approach in allocating alerts to analysts. Unlike the one-shot game model in [18,19], we consider a series of games with a stochastic alert arrival process. Unlike the work in [4] which uses dynamic programming and value iteration, we adopt deep reinforcement learning techniques that enable us to tackle problems with large state and action spaces. In contrast to the previously mentioned Stackelberg games, the authors in [9,12,13] explore solution methods to stateful Markov games. Convergence properties and Q-minimax value iteration are studied in [12,13] providing encouraging guarantees for optimality and convergence of value functions. This process is then approximated by the authors in [9], wherein they use a least squares policy iteration approach to train a linear function approximator to predict Q-values.


The use of stateful Markov game models to study security games with real-world applications is limited. The authors in [14] used dynamic programming and value iteration to investigate attacks on power grids. Using a similar method, the authors in [4] investigated the use of full state space value iteration in the alert assignment domain. The authors in [24] used Markov games to model the level of worst-case threat faced by an institution given parameters surrounding its network infrastructure. Despite guarantees of optimality and convergence, the use of full state space value iteration in many previous works' solution methods scales poorly to large real-world sized models, prompting the need for approximate solution methods in the domain.

3 Cyber-Alert Assignment Markov Game

We consider a two-player zero-sum Markov game in which the Defender (D) and the Attacker (A) play a series of sub-games over an infinite time horizon [4]. At each time step t, a new batch of alerts ω ∈ Ω arrives in which A chooses some alert level(s) to attack in and D attempts to detect and thwart the incoming attack(s) by assigning available analysts to the incoming alerts. We let s ∈ S denote the current state of the player resources (e.g., availability of analysts as well as the budget available to the attacker). We also let Da and Aa denote the set of actions available to D and A, respectively. We define a transition function T : S × Da × Aa → Π(S) which maps each state and player action pair to a probability distribution over possible next states. We let T (s, a, d, s ) denote the probability that, after taking actions a ∈ Aa and d ∈ Da in state s, the system will make a transition into s . In general, the system can be described as follows: – Alerts: Our transition function T is manifested in the uncertainty of which i.i.d. batch of alerts will arrive in each state. At every time step t, some batch of alerts ω ∈ Ω arrives according to a pre-defined probability distribution Π(Ω), where Ω = {ω1 , ω2 , . . . , ω|Ω| }. Each alert σ ∈ ω belongs to one of three categories: High (h), Medium (m), or Low (l). Resolving an alert requires a certain number of time steps based on its category. The set holding each category’s work-time (i.e., the number of time steps needed by an analyst to investigate and resolve an alert) is defined as W = {wh , wm , wl } where wh > wm > wl and wh , wm , wl ∈ N. A similar reward structure U = {uh , um , ul }, where uh > um > ul and uh , um , ul ∈ N, is defined for each category as well. If the alert σ h is legitimate and is assigned to an analyst, D will receive a positive utility uh . Whereas not assigning the legitimate alert results in a negative utility −uh for D. If an alert is illegitimate (i.e., a false positive) then it awards no utility to either the attacker or defender, regardless of whether or not it is assigned. Since our model is zero-sum, the corresponding utilities for A are simply the additive inverse of those awarded to D. For example, a possible arrival set may be as follows: Ω = {ω1 , ω2 } where Pr(ω1 ) = 0.4 with ω1 = {σ1h , σ2m , σ3m }, and Pr(ω2 ) = 0.6 with ω2 = {σ1m , σ2l }. At the beginning of a particular sub-game, both players are aware

168

N. Dunstatter et al.

of the exact alert batch that has arrived (with future alert arrivals remaining probabilistic, thereby impacting the current resource allocations made by the players). It is important to note that not every alert may be assigned to an analyst, and not every alert may represent a legitimate attack (i.e., alerts can be false positives). – The Defender: D has n homogeneous cyber-security analysts available to handle incoming alerts. We define Rs as a vector of length n that describes the load of each analyst in state s. For example, Rs = [0 2 1] means that D has 3 analysts on their team in which analyst 1 is available for assignment (has load 0), analyst 2 will be available after two time steps, and analyst 3 will be available after one time step. We also define the function F(Rs ) as the number of analysts currently available for allocation in state s. In every time step t, D receives a batch of alerts ω and determines their allocation strategy based upon the current availability of analysts and the varying severity and volume of alerts in ω. Once the set of possible alert batches Ω and each alert category’s respective work-times W are known, we can construct the set of all possible analyst states. In general, |R| is bounded above by the following: |R| ≤ (wh + 1)n

(1)

– The Attacker: A has an attack budget B ∈ N and decides when to attack and in what category. We assume A knows the alert level that would be generated due to their attack. The set C = {ch , cm , cl } defines the respective cost to the attacker given the alert level their attack generates, where ch > cm > cl and ch , cm , cl ∈ N. A can attack with as many alerts as they wish as long as the sum of their costs is affordable given the current budget. The attacker’s budget enables us to model the amount of risk they are willing to undertake. Attacking more frequently with attacks that generate high level alerts would likely expose A. To capture this behavior, if A chooses to abstain from attacking in a state s with budget Bs they will be credited with 1 unit of budget in the subsequent state. However, their budget is capped to some value Bmax representing the maximum amount of risk they are willing to undertake in any one state. – State Representation: At the beginning of a time step, we assume that the system state is known to both players. A state is thus defined as follows: s = [Rs |Bs ]

(2)

Once we know the current state s and current alert arrival batch ω, we can formulate the action spaces of both players. The size of the defender’s action space Da grows combinatorially and is defined in Eq. 3. The size of the attacker’s action space Aa is defined in Eq. 4, where the indicator function 1{·} enumerates all the ways the attacker could attack using the alerts in ω (while also allowing them to abstain from attacking altogether). F (Rs ) 

|Da (s, ω)| =

 i=0

|ω| i

 (3)

Solving Cyber Alert Allocation Markov Games with Deep RL

169

|ω|

|Aa (s, ω)| =

2 

1{Bs

≥ Bin(i−1) · Cω }

(4)

i=1

Bin(i) is a function that maps the integer i to its binary representation and Cω is the cost vector for the alerts in ω according to C. For example, an alert arrival ω = {σ1m , σ2l , σ3l } yields a cost vector Cω = [3 1 1]. Thus, given an attacker budget Bs = 2 for this state and arrival Aa = {[0 0 0], [0 1 0], [0 0 1], [0 1 1]}. Using the players’ action spaces we can now formulate a zero-sum game represented by a payoff matrix Rs of instantaneous rewards where Rs (a, d) represents the reward received when players commit to actions a and d. Each state-arrival pair in our environment represents a potential sub-game where the defender attempts to detect incoming attacks through some of assignment of alerts to analysts and the attacker attempts to evade detection when launching their attacks. Since the game is zero-sum, for the remainder of the paper we will not distinguish between defender and attacker rewards and will only discuss rewards from the defender’s perspective (i.e., the defender seeks to maximize rewards and the attacker seeks to minimize rewards). Players follow policies πD and πA (e.g., πD (s, d) is the probability D takes action d ∈ Da (s, ω)). These policies are obtained by solving for the Nash equilibrium of the payoff matrix R. While many algorithms exist to solve for such Nash equilibrium (e.g., linear programming) we chose to use an iterative fictitious play algorithm as it was much faster than the other methods we explored and still yielded accurate results. Once derived, both players will sample from their Nash equilibrium mixed strategies and commit to their respective actions. They are then awarded their respective utility and the state evolves from s to s according to the transition function T .

4

Methods

In this section we will first describe our method for compressing the combinatorial action spaces of our players, followed by a description of the iterative fictitious play algorithm used to quickly solve sub-games. Lastly we will define the DNQN methodology used to obtain approximate Nash policies in large Markov games. 4.1

Action Space Compression

We can represent our agents’ actions as a binary vector where a 1 represents that a given alert is either assigned to an analyst for the defender, or attacked in by the attacker. For example, if we are in a state s = R = [0, 0, 0] | B = 10 and a batch of alerts ω = {σ1h , σ2h , σ3m } arrives, where ch = 5 and cm = 3, our defender and attacker action spaces would be formulated as follows: Da (s, ω) = {[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]} Aa (s, ω) = {[0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]}

170

N. Dunstatter et al.

Where d6 = [1, 0, 1] means that the defender assigns alerts σ1h and σ3m and ignores alert σ2h . Similarly a4 = [0, 1, 1] means that the attacker attacks in alerts σ2h and σ3m and ignores alert σ1h . However, consider the fact that alerts of the same severity level are homogeneous (i.e., their utility, work-time, and cost are equal). This means that our agents need not worry about which specific alerts are being assigned/attacked, only the number of alerts from each severity level being assigned/attacked. Thus we can represent actions as a 3-tuple h, m, l representing how many alerts from each severity level are being assigned/attacked: Dˆa (s, ω) = {0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 2, 0, 0, 2, 1, 0} Aˆa (s, ω) = {0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0} where d3 = 1, 1, 0 means that the defender assigns one of the two high alerts and the one medium alert. Similarly a3 = 1, 0, 0 means that the attacker attacks in one of the high alerts and ignores the other high and medium alert. To illustrate just how advantageous this compression is, consider an arrival ωs = {σ1h , σ2h , σ3m , σ4m , σ5l , σ6l } in a state where the number of available analysts F(Rs ) = 5. Under the combinatorial action space |Da (s, ω)| = 63 whereas the compressed action space |Dˆa (s, ω)| = 26 (a 58% reduction). The compression is even more substantial when arrival batches posses more redundancy. For instance, if ωs = {σ1h , σ2h , σ3h , σ4h , σ5h , σ6h } then |Dˆa (s, ω)| = 6 while our combinatorial action space remains unchanged at |Da (s, ω)| = 63 (while this small example is given for illustration purposes, in our larger models the compression can routinely get as high as 96%). Under the compressed action space formulation joint action rewards are calculated in expectation since we no longer specify exactly which alerts are being assigned/attacked. The simplified equation for deriving the compressed payoff ˆ is presented below: matrix rewards R(·) zmax = min(ak , σk )  0, if dk − (σk − ak ) < 0; zmin = dk − (σk − ak ), otherwise. ak σk −ak z max  z ˆ σdkk −z uk (2z − ak ) R(a, d) = k∈{h,m,l} z=zmin

dk

where σk is the number of alerts in severity level k, dk and ak are the number of alerts assigned/attacked from severity level k, and z represents the number of alerts caught by the defender in the current severity level. Most advantageous of all is that this compression is completely loss-less with respect to finding the value of a Nash equilibrium in our sub-games (i.e., whether ˆ our linear program will find the same expected value of the game). using R or R

Solving Cyber Alert Allocation Markov Games with Deep RL

171

The mixed-strategies will necessarily be different (since the action spaces are ˆ will be more meaningful different), however the mixed strategies derived from R to our neural network as they contain much less redundancy. 4.2

Fictitious Play

The biggest bottleneck for both of the algorithms developed in this work is the game solving mechanism. Solving the game via a linear programming approach, while accurate, incurs a large cost in terms of run-time. In both of our solutions we are solving millions of games – often containing hundreds of potential action pairs. Thus while solving one of these games is more or less instantaneous, solving our Markov game can take many hours. This motivated us to explore potential alternatives to using linear programming to solve our games. Ultimately we settled on the use of fictitious play, an iterative algorithm first introduced by Brown in 1951 [2]. In fictitious play each player tracks the empirical frequency of actions chosen by their opponent and best responds to this strategy. The other player, having also tracked the empirical frequency of their opponent, also plays their best response. By iterating this process many times we are able to derive close approximations of both the game’s value and the Nash equilibrium policies of the agents. A proof of convergence for this iterative approach in zero-sum games is given in [17].

Algorithm 1. Iterative Fictitious Play R is an m × n matrix of rewards rowReward and rowCnt are m-length arrays of zeros colReward and colCnt are n-length arrays of zeros Initialize bestResponse to any random row action for iterations do colReward = colReward + R[bestResponse, · ] bestResponse = argmin(colReward) colCnt = colCnt + 1 rowReward = rowReward + R[ · , bestResponse] bestResponse = argmax(rowReward) rowCnt = rowCnt + 1 end for    gameV alue = max(rowReward) + min(colReward) / 2 / iterations rowM ixedStrat = rowCnt / iterations colM ixedStrat = colCnt / iterations=0

The algorithm we use for our fictitious play was first introduced by Williams in [23] and is described in Algorithm 1. (please note that element-wise vector operations are implied here, with scalar values being broadcast to all elements of the vector in question).

172

4.3

N. Dunstatter et al.

Deep Nash Q-Network

Motivated by the success in [15] we wanted to explore the possibility of applying Deep Q-Networks in the Markov game domain. After all, even complex systems like those of Atari games can be formulated as MDP’s. Furthermore, the transformation from the single agent MDP to the multi-agent Markov game is straightforward enough that we hoped the approximation power achieved in [15] would carry over to the Markov game domain. Especially if we were able to train this network in a manner similar to the provably optimal value iteration approach presented in [4]. This move from single agent to multi-agent would naturally necessitate some changes to the Q-learning and network loss equations in [15]. Namely, while both algorithms employ a Q-function to estimate future rewards attainable from an action pair, their Q-target equation need only apply a greedy max over the next action for a single player, whereas our game theoretic approach must choose actions for two players in a minimax fashion. Given the convergence proofs for value iteration in Markov games provided in [12] and the empirical success of its use in our domain in [4], we can be confident that the substitution of a game theoretic maximin in place of the greedy max operator will provide a stable and meaningful reward signal. The equations used to train the network are as follows:   ˆ a, d yi = Es∼E Rs (a, d) + γ max min Q(s , a , d ; θ)|s,   ˆ ˆ d ∈Da a ∈Aa

 2 Li (θi ) = Es,a,d∼ρ(·) yi − Q(s, a, d; θi )

(5) (6)

Equation 5 is the target we are moving our approximator towards, with E being the environment we sample states from. Equation 6 is the loss function, with ρ(s, a, d) representing the behavior distribution (in our case, the current maximin -greedy policies of our agents) that defines a probability distribution over states and actions. Differentiating Eq. 6 with respect to the weights gives us the following,

∇θi Li (θi ) = Es,a,d∼ρ(·);s ∼E ∇θi Q(s, a, d; θi ) ·

ˆ − Q(s, a, d; θi ) Rs (a, d) + γ max min Q(s , a , d ; θ)

(7)

d ∈Dˆa a ∈Aˆa

To avoid computing the full expectations of Eq. 7 we can use single samples of actions from the behavior distribution ρ and transitions from our environment E while using an optimization function (e.g., stochastic gradient descent) to minimize the loss. The authors in [15] use two sets of networks while training – a target network θˆ and a training network θ. Their main purpose in using a target network was to hold the data the network is approximating towards fixed, empirically alleviating convergence issues. While this is also true in our case, we extended the

Solving Cyber Alert Allocation Markov Games with Deep RL

173

Algorithm 2. Deep Nash Q-Network with Experience Replay Initialize i = 0 Initialize learning network with random weights θi Initialize target network weights θˆ = θi Initialize τ to desired update cycle Initialize replay memory Z to capacity N for episode = 1, M do Initialize s1 randomly for t = 1, T do Qs = GetQMatrix(s, θi ) πA , πD  = GameSolver(Qs ) With probability  select a random action at otherwise sample at ∼ πA With probability  select a random action dt otherwise sample dt ∼ πD Execute actions at , dt  in E and observe reward rt and next state st+1 Store transition (st , at , dt , rt , st+1 ) in Z Sample random mini-batch of transitions (sj , aj , dj , rj , sj+1 ) from Z ˆ Qsj+1 = GetQMatrix(sj+1 , θ) vsj+1  = GameSolver(Qsj+1 ) Set yj = rj + γvsj+1  2 Take gradient descent step on yj − Q(sj , aj , dj ; θi ) using Eq. 7 Set i = i + 1 if i mod τ = 0, then θˆ = θi end if end for end for=0

use of a target network by delaying the amount of time between its successive updates (i.e., when θˆ is set to the current θ) by τ learning steps. This update cycle τ  1 allows the network to come closer to understanding the future rewards available from a state before we change the target data. This results in a loose approximation of the value iteration approach in [4]. For example, when learning begins θˆ initially makes prediction very close to zero. So before the first update takes place our target yi is essentially a slightly noisy representation of immediate rewards. By the end of our first update cycle θ will predict these ˆ our network will then immediate rewards very accurately and once we update θ, begin bootstrapping from the learned immediate rewards to understand future rewards. Our state-action inputs to the neural network were as follows: – For each analyst we calculate the current percentage of their remaining wait time resulting in n features. – The percent of currently available budget for the attacker. – Three features specifying the number of alerts from each severity level that had arrived (min-max normalized).

174

N. Dunstatter et al.

– Three features for the number of alerts assigned from each severity level (minmax normalized). – Three features for the number of alerts attacked in each severity level (minmax normalized). Most supervised machine learning algorithms rely on learning a fixed distribution given some set of samples. The difficulty in using such methods (e.g., stochastic gradient descent) in an RL context is that the samples obtained from an on-policy RL algorithm come from the estimator itself. Every time we update our policy we change the agent’s behavior and thereby the distribution of rewards we are going to see. This kind of moving target can lead to convergence issues and is addressed in part via the use of experience replay [11]. Experience replay is a method where we store a 5-tuple of the agents’ experience et = st , at , dt , rt , st+1  at each time step in some large data set Z = e1 , ..., eN from which we can sample from in a way analogous to traditional supervised learning. Readers are encouraged to review [15] for a detailed explanation of experience replay’s many benefits when training RL agents in a large state space. Build Q-Matrix for s using learning network

Game Solver

Execute Nash Strategies in s

s, a, d

Query Environment

s, a, d, r, s'

Store transition in replay memory

Sample mini-batch of transitions START

s = random state

Yes

s = s'

Trajectory over?

For each transition

Use target network to calculate y No

Update target network with learning network

Use learning network to calculate Q-estimate

Calculate average loss in mini-batch Yes No

Have learning steps occurred?

Backprop loss through learning network

Fig. 1. Diagram of the DNQN learning process

We refer to this learning architecture (depicted in Fig. 1) as a Deep Nash QNetwork (DNQN). This network learns off-policy and is model-free. The pseudocode for training a DNQN is presented in Algorithm 2. The function GetQMatrix (s, θ) re-populates the payoff matrix Rs with Q-values predicted using features about state the state-action pairs in Rs and network weights θ. GameSolver (Qs ) is a function that solves the matrix game Qs and returns the Nash equilibrium value vs of the game as well as the mixed strategies for both players that result in that value, πA and πD . Additionally, all experiments presented herein were performed using the Adam optimizer described in [8] in favor of the simple stochastic gradient descent approach. While both optimizers performed well in practice, Adam consistently lead to smoother learning curves and better performing policies.

Solving Cyber Alert Allocation Markov Games with Deep RL

5

175

Performance Evaluation

This section will present our experimental results. All results presented herein were obtained on a machine with two-dozen 2.2 GHz cores and 64 GB of RAM. Furthermore, the dynamic programming approach was run in a fully parallel manner while the DNQN approach was run serially. 5.1

Dynamic Programming Tractable Model

The first set of results we will discuss were obtained within a state space small enough to be solved in a reasonable amount of time (30 h) via the brute force value-iteration approach introduced in [4]. The parameters used when constructing the state space are presented in Table 1 and yield an environment with a total of 2,117,682 possible states our agents can inhabit. Table 1. Parameters used when constructing the DP tractable model. Parameter

Value

Number of experts n

5

Attack budget B

20

Utilities U

uh = 100, um = 20, ul = 5

Attack cost C

ch = 8, cm = 4, cl = 2

Work times W

wh = 6, wm = 3, wl = 1

Alert batches in Ω

ω1 = 0, 2, 2,

ω2 = 1, 2, 2,

where ω = h, m, l

ω4 = 1, 1, 4,

ω5 = 2, 2, 3,

Arrival prob. Π(Ω)

ω1 = 0.15,

ω2 = 0.21,

ω3 = 0.21

ω4 = 0.18,

ω5 = 0.20,

ω6 = 0.05

ω3 = 0, 3, 3 ω6 = 3, 3, 3

This dynamic programming solution provides us with an example of what optimal behavior looks like in the Markov game domain, allowing us to understand how well our approximate DNQN approach compares in terms of both long-term cumulative utility and solution time. When training our DNQN we used a fully connected 32×64×128×128×128 architecture with ReLU activation functions and an update cycle τ = 1, 000. We experimented with various network sizes and update cycle lengths, finding the aforementioned values to be both the most stable and fruitful. While they did impact the overall accuracy of the model, our solution remained quite robust across all the network sizes we tried. Figure 2 presents the mean loss of our network’s predictions on each learning batch after every step taken during training. As discussed in Sect. 4.3, over the first update cycle (the first 1,000 iterations) our agent’s are primarily learning immediate rewards and our network’s loss drops quite rapidly. However, when the second update to the target network takes place our loss suddenly spikes by roughly 2,000%. This initial convergence and sudden spike demonstrate two things. First, our network can quickly learn the immediate rewards of a given action pair. Second, these immediate rewards are a poor reflection of long term

176

N. Dunstatter et al.

reward (Q-values). After the first update cycle our agents understand immediate reward very well meaning solving the Q-matrices from Algorithm 2 yield nearperfect Nash play (for each normal form game). However, once the normal form game becomes an extensive form game their strategies must re-adjust greatly as their greedy actions now bear heavy consequences. 5000

4000

Loss

3000

2000

1000

0

0

0.5

1

1.5

2

2.5

3 10 4

Learning Iterations

Fig. 2. Loss values obtained while training the DNQN on the DP tractable model

While the early learning exhibits very noisy loss values this seems to stabilize drastically after about 5,000 iterations. Initially this steadiness seemed to imply that our network had essentially finished learning after the first 5,000 updates. To investigate, we ran simulations against an optimal dynamic programming opponent after each update cycle and plotted the average cumulative utility in Fig. 3. DNQN Attacker

120

DNQN Defender

-50 -100

100 -150 -200

Avrg. Utility

Avrg. Utility

80

60

40

-250 -300 -350 -400

20 -450 0

0

5,000

10,000

15,000

20,000

Training Iterations

25,000

30,000

-500

0

5,000

10,000

15,000

20,000

25,000

30,000

Training Iterations

Fig. 3. DNQN agents’ utility against a DP opponent at various stages of learning

The results of these simulations show that both agents continue to learn well past the point implied by the loss curve. It is interesting to note that the DNQN Defender seems to reach convergence at around the 15,000 iteration mark while the DNQN attacker continues to progress up until the end of its training. Taken together, Figs. 2 and 3 show that while later iterations’ exhibit rather accurate

Solving Cyber Alert Allocation Markov Games with Deep RL

177

Q-value predictions (i.e., low loss) the strategies derived from these Q-values continue to evolve throughout the learning process. After 30,000 training iterations we want to compare our solution methods’ policies against one another to understand how well they perform with respect to cumulative utility. We also include two heuristic policies, random and myopic. A random policy for either agent simply randomizes over their action space in each state. A myopic policy plays as if the agent is in a normal form game, solving the payoff matrix of immediate rewards in each state and then sampling from the derived mixed strategy. For each of the policy pairs we run 1,000 independent simulations with a time horizon of 100 rounds (starting from random states). The average discounted cumulative utility against the optimal dynamic programming opponents are presented in Fig. 4 for the defender and Fig. 5 for the attacker. Both figures show that our approximate solution maintains a similar utility as its optimal counterpart. 0 Random Defender Myopic Defender DNQN Defender DP Defender

-50 -100

Avrg Utility

-150 -200 -250 -300 -350 -400 -450

0

10

20

30

40

50

60

70

80

90

100

Round #

Fig. 4. Cumulative utilities obtained by all defender policies against the optimal DP attacker policy.

5.2

Dynamic Programming Intractable Model

The second set of results we will discuss were obtained within a state space too large to be solved by the dynamic programming approach. The parameters used when constructing the state space are presented in Table 2 and yield an environment with a total of 2.1 billion states. Performing the 30,000 training updates on our DNQN took little over 5.5 h in this environment. For comparison, a liberal estimate obtained by extrapolating the run-times in Sect. 5.1 would put the DP solution time at roughly 114 years.

178

N. Dunstatter et al. 250 Random Attacker Myopic Attacker DNQN Attacker DP Attacker

Avrg Utility

200

150

100

50

0

0

10

20

30

40

50

60

70

80

90

100

Round #

Fig. 5. Cumulative utilities obtained by all attacker policies against the optimal DP defender policy. Table 2. Parameters used when constructing the DP intractable environment. Parameter

Value

Number of experts n

8

Attack budget B

30

Utilities U

uh = 100, um = 20, ul = 5

Attack cost C

ch = 8, cm = 4, cl = 2

Work times W

wh = 6, wm = 3, wl = 1

Alert batches in Ω where ω = h, m, l

ω1 = 0, 2, 2,

ω2 = 0, 3, 3,

ω3 = 1, 2, 5

ω4 = 1, 3, 6,

ω5 = 1, 4, 4,

ω6 = 2, 2, 2

ω7 = 2, 2, 4,

ω8 = 2, 3, 4,

ω9 = 2, 3, 5

ω10 = 3, 3, 3, ω11 = 3, 4, 5, ω12 = 4, 5, 6 Arrival prob. Π(Ω)

ω1 = 0.02,

ω2 = 0.03,

ω3 = 0.08

ω4 = 0.08,

ω5 = 0.09,

ω6 = 0.10

ω7 = 0.10,

ω8 = 0.10,

ω9 = 0.10

ω10 = 0.13,

ω11 = 0.12,

ω12 = 0.05

In state spaces as large as this it can be very difficult to understand what optimal behavior looks like. In the absence of our optimal DP solution we can make no concrete guarantees as to the efficacy of our results. Despite this fact, our algorithm still maintains a converging loss curve and a superior utility when compared to our previously mentioned random and myopic policies. Figure 6 shows the loss curve obtained while training the DNQN on the DP intractable model. Similar to Fig. 2 we can see early convergence over the first update cycle followed by a large spike in loss as future rewards begin to be considered. After training we again want to assess the efficacy of our DNQN policy against the other policies but have lost the DP solution due to the intractability of the environment’s size. We run 1,000 independent simulations with a time horizon of 100 rounds to obtain the average cumulative utility in each policy pair using just the DNQN, myopic, and random policies. These utilities are presented for the attacker and defender in Figs. 7 and 8, respectively.

Solving Cyber Alert Allocation Markov Games with Deep RL

179

18000 16000 14000

Loss

12000 10000 8000 6000 4000 2000 0

0

0.5

1

1.5

2

2.5

3 10 4

Training Iterations

Fig. 6. Loss values obtained while training the DNQN on the DP intractable model 1000 900

Random Attacker Myopic Attacker DNQN Attacker

800

Avrg Utility

700 600 500 400 300 200 100 0

0

50

100

150

200

250

Round #

Fig. 7. Cumulative utilities obtained by all attacker policies against the approximate DNQN defender policy.

Figure 7 illustrates a harrowing fact. Due to the vast numbers of alerts, even a random policy can be effective at evading detection. While the random policy does not acquire as much utility as the DNQN attacker, it still performs quite well. The myopic attacker performs poorly in this environment because it acts greedily in each sub-game. This limits the amount of budget the attacker will build up before launching attacks and restricts the myopic attacker to using mostly low severity alerts. In Fig. 8 the DNQN policy performs much better than myopic and random and shows the importance of employing an intelligent alert allocation strategy. This figure demonstrates the disparity between the attacker and defender’s challenges within this domain. When the volume of alerts is so high even a random attacker policy can perform well as the increasing proportion of false positives provide ample camouflage for attacks with no input from the attacker. Contrast this with the defender, whose task only gets more challenging as the volume of alerts increases. This explains why we see a disparity between the relative efficacy of the DNQN policy compared to the myopic and random for the defender and attacker.

180

N. Dunstatter et al. 0

Avrg Utility

-500

-1000

-1500

Random Defender Myopic Defender DNQN Defender

-2000

-2500

0

10

20

30

40

50

60

70

80

90

100

Round #

Fig. 8. Cumulative utilities obtained by all defender policies against the approximate DNQN attacker policy.

When investigating further into the DNQN defender’s strategy we noticed that it would never fully allocate all of its analysts, preferring to hover around a 75% utilization. Essentially the defender was playing a game of chicken with the attacker, trying to discourage attacks by always keeping some analysts ready for assignment. While this kind of behavior is quite irrational against the random and myopic attackers, the DNQN defender plays very well against the intelligent DNQN attacker – almost quadrupling the utility of the other two policies. This represents a kind of trade-off between general coverage and acute prevention that is the core of a good defender policy. To show how influential just a few extra analysts can be we ran simulations using the values from Table 2, varying the number of analysts on staff while keeping all other parameters fixed. The results of this experiment are presented in Fig. 9 where we can clearly see a positive correlation between the number of analysts and the defender’s utility. Increasing the analyst on staff from five to 7 resulted in a 45% increase in utility. 0 -200 -400

Avrg. Utility

-600 -800 -1000 -1200 -1400 -1600 -1800

5

6

7

8

9

10

Number of Analysts

Fig. 9. Comparison of the cumulative utility obtained by a DNQN defender with varying numbers of analysts against a DNQN attacker.

Solving Cyber Alert Allocation Markov Games with Deep RL

181

As previously mentioned, a strong attacker policy is much easier to learn than a strong defender policy. The attacker only needs to stockpile their budget (a single resource) until the defender allocates a high percentage of their analysts, whereupon they flood the defender with attacks. Contrast this with the defender who must learn to balance the allocation times of their analysts (multiple resources) with the expected volume of incoming alerts. We find it quite remarkable that we are able to derive two very different policies from a single network while still maintaining a good performance for both tasks. As for implementing this kind of policy in the real world, defenders would obviously always want to maintain 100% utilization of their analysts and keep those who were unassigned by the model on call for reassignment if the model dictates. Furthermore, the strategies derived by our model could simply be viewed as a kind of threshold for game-theoretically sound behavior that could inform an organizations as to their level of security (or lack thereof). For instance, given the known volume of attacks they face, an organization could run simulations similar to those in Fig. 9 to investigate how many analyst they may need to have on staff to reach an acceptable level of security.

6

Conclusions

In this paper we provided a Markov game framework for modeling the adversarial interaction of computer network attackers and defenders in a game-theoretic manner. By framing this interaction as a series of zero-sum games wherein a state is maintained between each sub-game we were able to simulate long term periods of play and apply reinforcement learning algorithms to derive intelligent policies. An approximate solution method using our Deep Nash Q-network (DNQN) algorithm was presented that allowed for previous works’ results to be extended to much larger state spaces where an explicit model of the environment was not known. This DNQN approach was capable of deriving intelligent policies on par with the optimal approach in a much shorter time and with less information about the environment. These results motivate the use of DNQN-like architectures when solving very large Markov games as this approach proved to be both computationally expedient and empirically effective. Acknowledgement. This work was supported in part by NSF grant #1814064.

References 1. Altner, D., Servi, L.: A two-stage stochastic shift scheduling model for cybersecurity workforce optimization with on call options (2016) 2. Brown, G.W.: Iterative solution of games by fictitious play. In: Koopmans, T.C. (ed.) Activity Analysis of Production and Allocation. Wiley, New York (1951)

182

N. Dunstatter et al.

3. Brown, M., Sinha, A., Schlenker, A., Tambe, M.: One size does not fit all: a gametheoretic approach for dynamically and effectively screening for threats. In: AAAI Conference on Artificial Intelligence (2016) 4. Dunstatter, N., Guirguis, M., Tahsini, A.: Allocating security analysts to cyber alerts using markov games. In: 2018 National Cyber Summit (NCS) (2018) 5. Ganesan, R., Jajodia, S., Cam, H.: Optimal scheduling of cybersecurity analysts for minimizing risk. ACM Trans. Intell. Syst. Technol. 8, 52 (2015) 6. Ganesan, R., Jajodia, S., Shah, A., Cam, H.: Dynamic scheduling of cybersecurity analysts for minimizing risk using reinforcement learning. ACM Trans. Intell. Syst. Technol. 8, 4 (2016) 7. Jain, M., Kardes, E., Kiekintveld, C., Ord´ onez, F., Tambe, M.: Security games with arbitrary schedules: a branch and price approach. In: Proceedings of AAAI (2010) 8. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. CoRR (2014) 9. Lagoudakis, M.G., Parr, R.: Value function approximation in zero-sum markov games. In: Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pp. 283–292. Morgan Kaufmann Publishers Inc., San Francisco (2002) 10. Lample, G., Chaplot, D.S.: Playing FPS games with deep reinforcement learning. CoRR abs/1609.05521 (2016) 11. Lin, L.J.: Reinforcement learning for robots using neural networks. Ph.D. thesis, Pittsburgh, PA, USA (1992) 12. Littman, M.: Value-function reinforcement learning in markov games. Princeton University Press (2000) 13. Littman, M.: Markov games as a framework for multi-agent reinforcement learning. In: Proceedings of the Eleventh International Conference on Machine Learning, pp. 157–163. Morgan Kaufmann (1994) 14. Ma, C., Yau, D., Lou, X., Rao, N.: Markov game analysis for attack-defense of power networks under possible misinformation. IEEE Trans. Power Syst. 28, 1676– 1686 (2013) 15. Mnih, V., et al.: Playing atari with deep reinforcement learning. CoRR (2013) 16. Ponemon Institute: The cost of malware containment (2015) 17. Robinson, J.: An iterative method of solving a game. Ann. Math. 54(2), 296–301 (1951) 18. Schlenker, A., et al.: Don’t bury your head in warnings: a game-theoretic approach for intelligent allocation of cyber-security alerts. In: Proceedings of the TwentySixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 381– 387 (2017) 19. Schlenker, A., et al.: Towards a game-theoretic framework for intelligent cybersecurity alert allocation. In: Proceedings of the 3rd IJCAI Workshop on Algorithmic Game Theory, Melbourne, Australia (2017) 20. Shah, A., Ganesan, R., Jajodia, S., Cam, H.: A methodology to measure and monitor level of operational effectiveness of a CSOC. Int. J. Inf. Secur. 17(2) (2018) 21. Shu, X., Tian, K., Ciambrone, A., Yao, D.: Breaking the target: an analysis of target data breach and lessons learned. CoRR (2017) 22. Sinha, A., Nguyen, T., Kar, D., Brown, M., Tambe, M., Jiang, A.: From physical security to cybersecurity. J. Cybersecur. 1(1), 19–35 (2015) 23. Williams, J.D.: The Compleat Strategyst: Being a Primer on the Theory of Games of Strategy. Dover, New York (1986)

Solving Cyber Alert Allocation Markov Games with Deep RL

183

24. Xiaolin, C., Xiaobin, T., Yong, Z., Hongsheng, X.: A markov game theory-based risk assessment model for network information system. In: 2008 International Conference on Computer Science and Software Engineering, vol. 3, pp. 1057–1061, December 2008 25. Yin, Z., et al.: Trusts: scheduling randomized patrols for fare inspection in transit systems using game theory. In: Proceedings of the 24th IAAI, Palo Alto, CA (2012) 26. Zimmerman, C.: Ten strategies of a world-class cybersecurity operations center. MITRE corporate communications and public affairs (2014)

Power Law Public Goods Game for Personal Information Sharing in News Commentaries Christopher Griffin1(B) , Sarah Rajtmajer2 , Prasanna Umar2 , and Anna Squicciarini2 1

Applied Research Laboratory, The Pennsylvania State University, University Park, PA, USA [email protected] 2 College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA, USA {smr48,acs20,pxu3}@psu.edu

Abstract. We propose a public goods game model of user sharing in an online commenting forum. In particular, we assume that users who share personal information incur an information cost but reap the benefits of a more extensive social interaction. Freeloaders benefit from the same social interaction but do not share personal information. The resulting public goods structure is analyzed both theoretically and empirically. In particular, we show that the proposed game always possesses equilibria and we give sufficient conditions for pure strategy equilibria to emerge. These correspond to users who always behave the same way, either sharing or hiding personal information. We present an empirical analysis of a relevant data set, showing that our model parameters can be fit and that the proposed model has better explanatory power than a corresponding null (linear) model of behavior.

Keywords: Self-disclosure

1

· Public goods · Game theory

Introduction

Recent work acknowledges the importance of online social engagement, noting that the bidirectional communication of the Internet allows readers to engage directly with reporters, peers, and news outlets to discuss issues of the day [1–3]. In parallel, studies have noted several challenges linked with this new form of readership, particularly the high level of toxicity and pollution from trolls and even bots often observed in these commentaries [4]. While the negative impacts of trolling and abuse are well-studied (see, e.g., [5,6]), little attention has been paid to other more subtle risks involved with online commenting, particularly with respect to users’ privacy. In particular, as users engage in discussion online, they often resort to self-disclosure as a way to enhance immediate social rewards This is a U.S. government work and not under copyright protection in the U.S.; foreign copyright protection may apply 2019 T. Alpcan et al. (Eds.): GameSec 2019, LNCS 11836, pp. 184–195, 2019. https://doi.org/10.1007/978-3-030-32430-8_12

Personal Information as Cost in a Power Law Public Goods Game

185

[7], increase legitimacy and likeability [8], or derive social support [9]. By selfdisclosure, we refer to the (possibly unintentional) act of disclosing identifying (e.g., location, age, gender, race) or sensitive (e.g., political affiliation, religious beliefs, cognitive and/or emotional vulnerabilities) personal information [10]. In this work, we model the behavior of users posting comments about newspaper articles on major news platforms (e.g., NYT, CNN). We hypothesize that all users who participate in commentary about an article receive a “reward” that is proportional to the number of total comments posted; i.e., the net amount of social engagement generated. Hence, the act of self-disclosing comes at an information cost to the individual user yet may serve to increase the net return (e.g., total number of comments or impact the conversation in some capacity) all users receive. Accordingly, this scenario can be envisaged as a public goods game in which pay-in is measured in terms of personal information and pay-out is measured in net quantity of social interaction through a commenting system. Public goods games are mathematical representations of the Tragedy of the Commons [11,12] in which individuals must contribute to a common good in order to prevent that good from collapsing. Within a public goods game, cheating or freeloading is generally a more profitable choice; in this way, it is intellectually similar to the prisoner’s dilemma (see, e.g., [13,14]), and various approaches to resolving the tragedy have been taken (e.g., [15]). Public goods games have been widely studied as models of cooperation. In [16], the public goods game poses the following dilemma to a group of N agents: each agent is asked to contribute c monetary units towards a public good. Contributions earn a linear rate of return r, providing rc monetary units for sharing. Thus, if k individuals contribute, a contributing individual receives rck/N − c monetary units, while a non-contributing individual receives rck/N monetary units. Rational agents choose not to contribute. There are several extensions to the classical public goods framework discussed above. Archetti and Scheuring [17] and Young and Belmonte [18] use a non-linear (power law) form of the public goods return function. We adopt this model in Sect. 3. Cooperation in a public goods setting is difficult to explain using a rational agent assumption and several approaches have been used to explain it. Volunteering in public goods is considered in [19]. Punishment as a form of cooperation enforcement is discussed in [20,21]. Reputation in an evolutionary public goods game is considered in [22]. The approach we take in this paper is substantially simpler; as we discuss in Sect. 3, we assume that each agent has a distinct information sharing cost, which leads to the emergence of equilibria in which users will share. Primary contributions of this paper are: development of a public goods model of personal information disclosure; proof of sufficient conditions for this game to exhibit pure strategy equilibria; proof of existance of at least one equilibrium for any choice of model parameters; identification of necessary and sufficient conditions in specific cases; and, initial validation of the proposed model in a dataset of online comments on news articles. The remainder of this paper is organized as follows: In Sect. 2, we discuss the data set used for model development and testing. We present our proposed

186

C. Griffin et al.

model in Sect. 3. Mathematical analysis of the model is performed in Sect. 4. Experimental evidence supporting our model of user behavior is presented in Sect. 5. Finally, we provide conclusions and future directions in Sect. 6. Proofs of all propositions are available at https://arxiv.org/abs/1906.01677.

2

Data Description

We consider a set of user comments on news articles from four major English news websites [23]. The data set is composed of 59, 249 comments made by 22, 132 distinct users from March through August 2015. Comments are distributed across 2202 articles from The Huffington Post (1136), Techcrunch (119), CNBC (421) and ABC News (526). On average, each user contributes 2.68 comments and participates in discussions related to 1.77 articles. We use the unsupervised detection of self-disclosure proposed and validated in our earlier work [10] to label these comments. Each comment is labeled for the presence or absence of self-disclosure, and each incidence of self-disclosure is tagged by category. We determine 10, 858 of the total 59, 249 comments to be self-disclosing. Methods and initial results for self-disclosure tagging on this data set, including a breakdown of self-disclosures by category, are discussed in [10].

3

Model

Let Ri be the total number of comments associated with article i. This is also the common reward to all commenters regardless of whether they provide personal information. Define the binary variable δk = 1 if and only if User k provides personal information in a comment at least once. Using a public goods framework, we hypothesize the relationship:  γ  Ri ∼ A · δik + i , (1) k

where γ is a scaling factor and A is constant of proportionality. The quantity i is the (normally distributed) error associated with Article i. The individual payoff to users in this pubic goods framework is: rik = Ri − βk δk ,

(2)

where βk measures the sensitivity to information sharing for User k. In a totally symmetric game, β = β1 = · · · = βN .

4

Mathematical Analysis

We analyze the model assuming that: δj ∼ Bernoulli(xj ),

(3)

Personal Information as Cost in a Power Law Public Goods Game

187

where xj ∈ [0, 1] is the probability that user j will disclose personal information. In a simultaneous game with n users, each user will selfishly maximize her expected reward, which can be computed on the interior of the feasible region as:  γ   δ   ··· A δk xkk (1 − xk )1−δk − βj xj . (4) uj = E(rj ) = δ1 ∈{0,1}

δn ∈{0,1}

k

k

If for any k, xk is a pure strategy, then δk = xk and uj is modified in the obvious way to prevent expressions of the form 00 . In particular, if B = {0, 1} and x ∈ Bn is pure, then:   γ γ   uj = A xk − βj xj = A δk − βj δj . (5) k

k

Put more simply, this is just an n-player, n-array (tensor) game, where each player has two strategies: disclose or don’t disclose. The payoff structure is given by n multi-linear maps: A(j) : R2 × · · · × R2 → R.   n

The following result is guaranteed by Wilson’s extension [24] of Nash’s theorem [25] and the Lemke-Howson theorem [26]: Proposition 1. There is at least one Nash equilibrium solution in simultaneous play. If the game is non-degenerate there are an odd number of equilibria. Fix the strategies for all users other than j and denote this x−j . The (tensor) contraction A(j) (x−j ) is a one-form (row vector). Assume:

A(j) (x−j ) = C1(j) (x−j ) C0(j) (x−j ) , (6) with: (j) C1 (x−j )

=

 δ −j ∈Bn−1

(j)

C0 (x−j ) =

 δ −j

∈Bn−1

⎛ A ⎝1 + ⎛ A⎝



⎞γ δk ⎠

k=j

 k=j

⎞γ

δk ⎠

 k=j

 k=j

xδkk (1 − xk )1−δk − βj ,

xδkk (1 − xk )1−δk .

As in Eq. (5), care must be taken with this expression if xk is pure. If xj = xj , 1 − xj , then:   (j) (j) uj (xj , x−j ) = A(j) (x−j ), xj = C1 (x−j )xj + C0 (x−j )(1 − xj ) =   (j) (j) (j) C1 (x−j ) − C0 (x−j ) xj + C0 (x−j ).

(7)

188

C. Griffin et al.

A strategy vector x = (xj , x−j ) is an equilibrium precisely when it solves the simultaneous optimization problem:  max uj (xj , x−j ) (8) ∀j s.t. 0 ≤ xj ≤ 1. We note the optimization problem for each Player j is a linear programming problem. Proposition 2. A point x is an equilibrium if and only if there are vectors λ, μ ∈ Rn so that the following conditions hold:  ∀j xj − 1 ≤ 0 (9) PF −xj ≤ 0 ∀j ⎧ ∂uj ⎪ ⎪ ∀j ⎪ ⎨ ∂xj + λj − μj = 0 DF (10) ∀j λj ≥ 0 ⎪ ⎪ ⎪ ⎩ μj ≥ 0 ∀j  −λj xj = 0 ∀j CS . (11) μj (xj − 1) = 0 ∀j Here:

∂uj (j) (j) = C1 (x−j ) − C0 (x−j ). ∂xj

(12)

Corollary 1. A point x is an equilibrium if and only if there are vectors λ, μ ∈ Rn and the triple (x, λ, μ) is a global optimal solution to the following non-linear programming problem:     ⎧ (j) (j) ⎪ min λj xj + μj (1 − xj ) = μj − xj C1 (x−j ) − C0 (x−j ) ⎪ ⎪ ⎪ j j ⎪ ⎪ ⎪ ⎪ ⎨ s.t. C (j) (x ) − C (j) (x ) + λ − μ = 0 ∀j −j −j j j 1 0 . (13) ⎪ λ ≥ 0 ∀j ⎪ j ⎪ ⎪ ⎪ ⎪ μj ≥ 0 ∀j ⎪ ⎪ ⎩ 0 ≤ xj ≤ 1 Furthermore every global optimal solution has objective function value exactly equal to 0. We note that the KKT conditions of Proposition 2 can also be transformed into a complementarity problem [27] and solved accordingly. Phrasing the problem as a non-linear programming problem allows for solution of small-scale examples using readily available software packages. We show that pure strategy equilibria exist for this game. The following sufficient condition ensures there is at least one pure strategy equilibrium.

Personal Information as Cost in a Power Law Public Goods Game

189

Proposition 3. Assume β1 ≤ β2 ≤ · · · ≤ βn and that: βm ≤ Amγ − A(m − 1)γ βm+1 ≥ A(m + 1)γ − Amγ Then the point x1 = x2 = · · · xm = 1 and xm+1 = xi+2 = · · · = xn = 0 is an equilibrium in pure strategies. Because there may be many solutions to the KKT conditions from Proposition 2, there may be mixed strategies even if the sufficient conditions are met. However, we can construct both necessary and sufficient conditions for pure strategy equilibria in which all users either share personal information or withhold personal information. Proposition 4. The strategy x = 0 is an equilibrium if and only if A ≤ β1 . Proposition 5. The strategy x = 1 is an equilibrium if and only if A ≥ βn . These results yield a sensible interpretation for the parameter A. If βj is the perceived social cost of sharing personal information, then A is a common perceived social benefit of sharing information and the decision to share or not becomes a simple cost-benefit analysis on the part of the user. In practice, it is rare that all users in a thread will share personal information. Moreover, users may not consistently share (or withhold) personal information, as illustrated in Sect. 5. Consequently, mixed strategies may be common (as illustrated in Sect. 5) or A and βj (j = 1, . . . , n) may be context-dependent.

5

Experimental Results

Using the data set described in Sect. 2, we test our hypothesis that the number of comments (i.e., common reward) in a news posting game is modeled by Eq. (1). Articles with no comments were removed as they yield no additional information. This left 1977 articles for analysis. The proposed model is statistically significant above the 7σ level. Table 1 provides confidence information on the parameters of the model. The model explains 51% of the variance in the observed data (i.e., r2 − Adj ≈ 0.51). Figure 1 illustrates the fit of the data to the proposed model. Table 1. Parameters of the problem and confidence values Parameter Value p-value log(A) γ

2.20 0.71

Confid. Ival.

0

(2.15, 2.25) −312

1.4 × 10

(0.68, 0.74)

190

C. Griffin et al.

Fig. 1. We illustrate the goodness of the fit for the power law model, Eq. (1). A log-log scale is used.

The residual distribution is centered about zero and the Q−Q plot illustrates reasonable normality of the residual distribution (see Fig. 2). Normality tests of the residuals showed mixed results with five of eight tests performed rejecting normality and the remaining tests failing to reject normality. Raw output of distribution testing is given below:

Statistic Anderson-Darling 1.29886

p-value 0.00198841

Baringhaus-Henze 1.88282

0.00824273

Cram´er-von Mises 0.18939

0.0067714

Jarque-Bera ALM 3.51543

0.168992

Mardia Combined 3.51543

0.168992

Mardia Kurtosis

−1.88623 0.0592636

Mardia Skewness

0.0342508 0.853174

Pearson χ2

135.343

1.465 × 10−12

Shapiro-Wilk

0.997751

0.00679984

Fig. 2. (a) The histogram of the fit residuals using Eq. (1) illustrates symmetry about 0. (b) The Q − Q plot illustrates approximate normality of the residuals.

Personal Information as Cost in a Power Law Public Goods Game

191

A favorable comparison of this model to the null linear model is available at https://arxiv.org/abs/1906.01677. 5.1

Fitting βj : A Pilot Study

As noted, this data set is not longitudinal and only a small number of users are repeat posters. This makes it impossible to estimate either xj or βj for all users. However, there are a subset of users who are repeat posters making it possible to estimate their mixed strategies and consequently their βj . We outline the algorithm for this process and discuss results. This algorithm works particularly well when all players are using a mixed strategy. We note results in the remainder of this section are preliminary and this should be considered as pilot data. 1. Compute xj using standard the standard MLE proportion estimator: x ˆj =

Number of Self Disclosing Posts . Number of Posts

(14)

2. From Eq. (10) at equilibrium we must have: (j)

(j)

C1 (ˆ x−j ) − C0 (ˆ x−j ) = μj − λj ,

∀j.

These equations can be used to fit an estimate for βj + μj − λj . In particular, when x ˆj ∈ (0, 1), then λj = μj = 0 and: ⎞γ ⎛ ⎞γ ⎞ ⎛⎛  δ    A ⎝⎝1 + δk ⎠ − ⎝ δk ⎠ ⎠ x ˆkk (1 − x ˆk )(1−δk ) . βˆj = δ −j

k=j

k=j

(15)

k=j

If there are several articles (each with different number of users, N ), then βˆj is computed over all instances of Eq. (15) and the mean is the MLE of βˆj . In our analysis, x ˆj was not available for all users (because of data limitations). In analyzing an article with users who did not have a proper x ˆj , the mean of all ¯. available x ˆj was substituted. We denote this mean x In cases with articles where several user strategies were estimated with x ¯, we restricted the analysis to size N = 8 users for computational speed. By the central limit theorem, this approximation will not affect the resulting estimates of ˆbj substantially. Put more simply: An article with 34 users requires computing ¯, we used only 8 of those a sum with 233 summands. If 30 users are estimated as x users in computing Eq. (15). Using this approach we estimated the strategy for all users in the data set who posted to at least 15 articles. We used parameter estimates for A and γ obtained in the previous section. There were 135 users in this subsample. A histogram of their estimated strategies is shown in Fig. 3a. In particular, all estimated strategies were mixed, suggesting that pure strategies, while possible, are less likely to occur in real data. Using Eq. (15), we estimated βˆ in articles

192

C. Griffin et al.

containing at least three users for whom x ˆj had been estimated. We estimated ˆ βj for 14 users who had posted at least 15 times and who had posted in at least one article with 2 other such users. The histogram of these estimates is given in Fig. 3b.

Fig. 3. (a) The histogram of the strategies suggests an almost uniform distribution between x ˆ = 0.1 and x ˆ = 0.4 with a sharp dropoff after that. (b) Similarly, βj shows an almost uniform distribution with a high concentration of values near 4.8. This histogram is drawn from a small sample size, so we exercise care in interpreting these results.

To validate the hypothesis that higher βj is correlated with lower xj , we performed a simple linear fit, which is shown in Fig. 4 A table of parameter values and confidence regions are shown below, with the dependent variable being x ˆj and the independent variable βˆj .

Estimate

Standard error t-statistic P-value

1 0.707635 0.354241 βˆ −0.0922398 0.0745993

1.99761

0.0511066

−1.23647 0.221948

The coefficient of βˆ is negative (as predicted). However, the model is only significant at p ≈ 0.22. This is far too high to be considered conclusive, but is suggestive that additional data collection and analysis may be warranted. An alternate, more data intensive, approach to fitting βj is available at https:// arxiv.org/abs/1906.01677.

Personal Information as Cost in a Power Law Public Goods Game

193

Fig. 4. We illustrate the correlation between x ˆj and βˆj . We expect x ˆj ∼ a0 − a1 βj with a1 > 0, which we see.

6

Conclusions and Future Directions

In this paper, we have proposed a public goods model of personal information disclosure in news article commentaries. We have found sufficient conditions for the proposed public goods game to exhibit pure strategy equilibria and showed that for any choice of model parameters, there is always at least one equilibrium. Special necessary and sufficient conditions were identified for the case in which all users choose not to disclose personal information or when all users choose to disclose (some) personal information. We have validated this model using a dataset of online comments on news outlets and showed that the proposed common reward function fits the underlying data set better than a null (linear) model. For a small subset of users, we have estimated their strategy (ˆ xj ) as well ˆ as their sensitivity to personal information disclosure (βj ). We consider this a pilot study because the publicly available data set used in this study was not longitudinal, thus limiting our ability to study a large population of users over time. In future work, we will determine whether this model is valid using larger data sets when available. In particular, we are interested in a fitting approach for determining βˆj that relies on the solution to a large-scale mixed complementarity problem. (See https://arxiv.org/abs/1906.01677.) Studying this fitting problem, its complexity, and results from its application form the foundation of future work. In addition to this, we may investigate other commenting environments in which users may choose to share personal information to further validate this model and determine whether it holds across a broad spectrum of online platforms. Acknowledgements. Portions of Griffin’s work were supported by the National Science Foundation under grant CMMI-1463482. Griffin also wishes to thank A. Belmonte for the helpful discussion. Dr. Squicciarini’s work is partially funded by the National Science Foundation under grant 1453080. Dr. Squicciarini and Dr. Rajtmajer are also partially supported by PSU Seed grant 425-02.

194

C. Griffin et al.




Adaptive Honeypot Engagement Through Reinforcement Learning of Semi-Markov Decision Processes

Linan Huang and Quanyan Zhu

Department of Electrical and Computer Engineering, New York University, 2 MetroTech Center, Brooklyn, NY 11201, USA
{lh2328,qz494}@nyu.edu

Abstract. A honeynet is a promising active cyber defense mechanism. It reveals the fundamental Indicators of Compromise (IoCs) by luring attackers to conduct adversarial behaviors in a controlled and monitored environment. The active interaction at the honeynet brings a high reward but also introduces high implementation costs and risks of adversarial honeynet exploitation. In this work, we apply the infinite-horizon Semi-Markov Decision Process (SMDP) to characterize the stochastic transition and sojourn time of attackers in the honeynet and quantify the reward-risk trade-off. In particular, we design adaptive long-term engagement policies shown to be risk-averse, cost-effective, and time-efficient. Numerical results have demonstrated that our adaptive engagement policies can quickly attract attackers to the target honeypot and engage them for a sufficiently long period to obtain worthy threat information. Meanwhile, the penetration probability is kept at a low level. The results show that the expected utility is robust against attackers of a large range of persistence and intelligence. Finally, we apply reinforcement learning to the SMDP to solve the curse of modeling. Under a prudent choice of the learning rate and exploration policy, we achieve a quick and robust convergence of the optimal policy and value.

Keywords: Reinforcement learning · Semi-Markov decision processes · Active defense · Honeynet · Risk quantification

1 Introduction

(Q. Zhu—This research is supported in part by NSF under grants ECCS-1847056, CNS-1544782, and SES-1541164, and in part by ARO grant W911NF1910041.)

Recent instances of the WannaCry ransomware attack and the Stuxnet malware have demonstrated the inadequacy of traditional cybersecurity techniques such as the firewall and intrusion detection systems. These passive defense mechanisms can detect low-level Indicators of Compromise (IoCs) such as hash values, IP addresses, and domain names. However, they can hardly disclose high-level indicators such as attack tools and Tactics, Techniques, and Procedures (TTPs) of


the attacker, which means the attacker can adapt to the defense mechanism, evade the indicators, and launch revised attacks with little effort, as shown in the pyramid of pain [2]. Since high-level indicators are more effective in deterring emerging advanced attacks yet harder to acquire through traditional passive mechanisms, defenders need to adopt active defense paradigms to learn these fundamental characteristics of the attacker, attribute cyber attacks [35], and design defensive countermeasures correspondingly.

Honeypots are one of the most frequently employed active defense techniques to gather information on threats. A honeynet is a network of honeypots, which emulates the real production system but has no production activities nor authorized services. Thus, an interaction with a honeynet, e.g., an unauthorized inbound connection to any honeypot, directly reveals malicious activities. On the contrary, traditional passive techniques such as firewall logs or IDS sensors have to separate attacks from a large volume of legitimate activities, and thus produce many more false alarms and may still miss some unknown attacks.

Besides a more effective identification and denial of adversarial exploitation through low-level indicators such as inbound traffic, a honeynet can also help defenders achieve the goal of identifying attackers' TTPs under proper engagement actions. The defender can interact with attackers and allow them to probe and perform in the honeynet until she has learned the attacker's fundamental characteristics. The more services a honeynet emulates and the more activities an attacker is allowed to perform, i.e., the higher the degree of interaction, the larger the probability of revealing the attacker's TTPs. However, the additional services and reduced restrictions also bring extra risks. Attackers may use some honeypots as pivot nodes to launch attacks against other production systems [37].

Fig. 1. The honeynet in red mimics the targeted production system in green. The honeynet shares the same structure as the production system yet has no authorized services. (Color figure online)


The current honeynet applies the honeywall as a gateway device to supervise outbound data and separate the honeynet from other production systems, as shown in Fig. 1. However, to prevent attackers from identifying the data control and the honeynet, a defender cannot block all outbound traffic from the honeynet, which leads to a trade-off between the reward of learning high-level IoCs and the following three types of risks:

T1: Attackers identify the honeynet and thus either terminate on their own or generate misleading interactions with honeypots.
T2: Attackers circumvent the honeywall to penetrate other production systems [34].
T3: The defender's engagement costs outweigh the investigation reward.

We quantify risk T1 in Sect. 2.3, T2 in Sect. 2.5, and T3 in Sect. 2.4. In particular, risk T3 raises the problem of timeliness and optimal decisions on timing. Since persistent traffic generation to engage attackers is costly and the defender aims to obtain timely threat information, the defender needs cost-effective policies to lure the attacker quickly to the target honeypot and reduce the attacker's sojourn time in honeypots of low investigation value.

Fig. 2. Honeypots emulate different components of the production system. (Color figure online)

To achieve the goal of long-term, cost-effective policies, we construct the Semi-Markov Decision Process (SMDP) in Sect. 2 on the network shown in Fig. 2. Nodes 1 to 11 represent different types of honeypots, nodes 12 and 13 represent the domain of the production system and the virtual absorbing state, respectively. The attacker transits between these nodes according to the network topology in Fig. 1 and can remain at different nodes for an arbitrary period of


time. The defender can dynamically change the honeypots' engagement levels, such as the amount of outbound traffic, to affect the attacker's sojourn time, the engagement rewards, and the probabilistic transitions in that honeypot.

In Sect. 3, we define security metrics related to our attacker engagement problem and analyze the risk both theoretically and numerically. These metrics answer important security questions in the honeypot engagement problem as follows. How likely will the attacker visit the normal zone at a given time? How long can a defender engage the attacker in a given honeypot before his first visit to the normal zone? How attractive is the honeynet if the attacker is initially in the normal zone? To protect against Advanced Persistent Threats (APTs), we further investigate the engagement performance against attackers of different levels of persistence and intelligence.

Finally, for systems with a large number of governing random variables, it is often hard to characterize the exact attack model, which is referred to as the curse of modeling. Hence, we apply reinforcement learning methods in Sect. 4 to learn the attacker's behaviors represented by the parameters of the SMDP. We visualize the convergence of the optimal engagement policy and the optimal value in a video demo (see https://bit.ly/2QUz3Ok). In Sect. 4.1, we discuss challenges and future work for reinforcement learning in the honeypot engagement scenario, where the learning environment is non-cooperative, risky, and sample-scarce.

1.1 Related Works

Active defenses [23] and defensive deceptions [1] to detect and deter attacks have been active research areas. Techniques such as honeynets [30,49], moving target defense [17,48], obfuscation [31,32], and perturbations [44,45] have been introduced as defensive mechanisms to secure cyberspace. The authors in [11] and [16] design two proactive defense schemes where the defender can manipulate the adversary's belief and take deceptive precautions under stealthy attacks, respectively. In particular, many works [10,26], including ones with Markov Decision Process (MDP) models [22,30] and game-theoretic models [20,40,41], focus on adaptive honeypot deployment, configuration, and detection evasion to effectively gather threat information without the attacker's notice. A number of quantitative frameworks have been proposed to model proactive defense for various attack-defense scenarios, building on Stackelberg games [25,31,46], signaling games [27,29,33,42,51], dynamic games [7,15,36,47], and mechanism design theory [5,9,43,50]. Pawlick et al. in [28] have provided a recent survey of game-theoretic methods for defensive deception, which includes a taxonomy of deception mechanisms and an extensive review of the game-theoretic deception literature.

Most previous works on honeypots have focused on studying the attacker's break-in attempts yet paid less attention to engaging the attacker after a successful penetration so that attackers can thoroughly expose their post-compromise behaviors. Moreover, few works have investigated timing issues and risk assessment during the honeypot engagement, which may result in an



improper engagement time and uncontrollable risks. The work most related to this one is [30], which introduces a continuous-state infinite-horizon MDP model where the defender decides when to eject the attacker from the network. The author assumes a maximum amount of information that a defender can learn from each attack. The type of system, i.e., either a normal system or a honeypot, determines the transition probability. Our framework, on the contrary, introduces the following additional distinct features:

– The upper bound on the amount of information which a defender can learn is hard to obtain and may not even exist. Thus, we consider a discounted factor to penalize the timeliness as well as the decreasing amount of unknown information as time elapses.
– The transition probability depends not only on the type of system but also on the network topology and the defender's actions.
– The defender endows attackers with the freedom to explore the honeynet and affects the transition probability and the duration time through different engagement actions.
– We use reinforcement learning methods to learn the parameters of the SMDP model. Since our learning algorithm constantly updates the engagement policy based on the up-to-date samples obtained from the honeypot interactions, the acquired optimal policy adapts to the potential evolution of attackers' behaviors.

SMDPs generalize MDPs by considering a random sojourn time at each state, and are widely applied to machine maintenance [4], resource allocation [21], infrastructure protection [13,14], and cybersecurity [38]. This work aims to leverage the SMDP framework to determine the optimal attacker engagement policy and to quantify the trade-off between the value of the investigation and the risk.

1.2 Notations

Throughout the paper, we use a calligraphic letter X to define a set. The upper-case letter X denotes a random variable, and the lower-case x represents its realization. The boldface X denotes a vector or matrix, and I denotes an identity matrix of proper dimension. The notation Pr represents the probability measure, and ⋆ represents the convolution. The indicator function 1_{x=y} equals one if x = y, and zero if x ≠ y. The superscript k represents decision epoch k, and the subscript i is the index of a node or a state. The pronoun 'she' refers to the defender, and 'he' refers to the attacker.

2 Problem Formulation

To obtain optimal engagement decisions at each honeypot under the probabilistic transition and the continuous sojourn time, we introduce the continuous-time infinite-horizon discounted SMDPs, which can be summarized by the tuple {t ∈ [0, ∞), S, A(sj ), tr(sl |sj , aj ), z(·|sj , aj , sl ), rγ (sj , aj , sl ), γ ∈ [0, ∞)}. We describe each element of the tuple in this section.

2.1 Network Topology

We abstract the structure of the honeynet as a finite graph G = (N, E). The node set N := {n_1, n_2, · · · , n_N} ∪ {n_{N+1}} contains N nodes of hybrid honeypots. Take Fig. 2 as an example: a node can be either a virtual honeypot of an integrated database system or a physical honeypot of an individual computer. These nodes provide different types of functions and services, and are connected following the topology of the emulated production system. Since we focus on optimizing the value of investigation in the honeynet, we only distinguish between different types of honeypots by different shapes, and use one extra node n_{N+1} to represent the entire domain of the production system. The network topology E := {e_{jl}}, j, l ∈ N, is the set of directed links connecting node n_j with n_l, and represents all possible transition trajectories in the honeynet. The links can be either physical (if the connecting nodes are real facilities such as computers) or logical (if the nodes represent integrated systems). Attackers cannot break the topology restriction. Since an attacker may use some honeypots as pivots to reach the production system, and it is also possible for a defender to attract attackers from the normal zone to the honeynet through these bridge nodes, there exist links in both directions between honeypots and the normal zone.
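As a concrete, purely illustrative representation of this abstraction, the sketch below encodes a small honeynet graph as a directed adjacency structure; the node count and edge list are hypothetical and do not reproduce the exact topology of Fig. 2.

```python
# Minimal sketch of the honeynet graph G = (N, E), assuming a hypothetical layout:
# honeypots n1..n11, the normal zone n12, with directed links (Fig. 2's exact
# topology is not reproduced here).
NORMAL_ZONE = 12

edges = {
    1: [2, 3, 12],      # bridge honeypot n1 also links to the normal zone n12
    2: [1, 4, 5, 12],
    3: [1, 6],
    8: [9, 10, 12],
    12: [1, 2, 8],      # links back from the normal zone to the bridge nodes
    # ... remaining honeypot links would mirror the emulated production system
}

def neighbors(node):
    """Feasible one-step transitions allowed by the topology restriction."""
    return edges.get(node, [])

print(neighbors(NORMAL_ZONE))   # an attacker in the normal zone can only reach n1, n2, n8
```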

2.2 States and State-Dependent Actions

At time t ∈ [0, ∞), an attacker's state belongs to a finite set S := {s_1, s_2, · · · , s_N, s_{N+1}, s_{N+2}}, where s_i, i ∈ {1, · · · , N + 1}, represents the attacker's location at time t. Once attackers are ejected or terminate on their own, we use the extra absorbing state s_{N+2} to represent this virtual location. The attacker's state reveals the adversary's visits to, and exploitation of, the emulated functions and services. Since the honeynet provides a controlled environment, we assume that the defender can monitor the state and transitions persistently without uncertainty.

The attacker can visit a node multiple times for different purposes. A stealthy attacker may visit the honeypot node of the database more than once and revise data progressively (in a small amount each time) to evade detection. An attack on the honeypot node of sensors may need to check the node frequently for up-to-date data. Some advanced honeypots may also emulate anti-virus systems or other protection mechanisms, such as setting an authorization expiration time; then the attacker has to compromise the nodes repeatedly.

At each state s_i ∈ S, the defender can choose an action a_i from a state-dependent finite set A(s_i). For example, at each honeypot node, the defender can conduct action a_E to eject the attacker, action a_P to purely record the attacker's activities, the low-interactive action a_L, or the high-interactive action a_H to engage the attacker, i.e., A(s_i) := {a_E, a_P, a_L, a_H}, i ∈ {1, · · · , N}. The high-interactive action is costly to implement, yet it both increases the probability of a longer sojourn time at honeypot n_i and reduces the probability of attackers penetrating the normal system from n_i if connected. If the attacker resides in the normal zone, either from the beginning or later through the pivot honeypots, the defender can choose either action a_E to eject the attacker immediately, or action


a_A to attract the attacker to the honeynet by exposing some vulnerabilities intentionally, i.e., A(s_{N+1}) := {a_E, a_A}. Note that the instantiation of the action set and the corresponding consequences are not limited to the above scenario. For example, the action can also refer to a different degree of outbound data control. A strict control reduces the probability of attackers penetrating the normal system from the honeypot, yet it also brings less investigation value.

2.3 Continuous-Time Process and Discrete Decision Model

Based on the current state s_j ∈ S and the defender's action a_j ∈ A(s_j), the attacker transits to state s_l ∈ S with probability tr(s_l|s_j, a_j), and the sojourn time at state s_j is a continuous random variable with probability density z(·|s_j, a_j, s_l). Note that the risk T1 of the attacker identifying the honeynet at state s_j under action a_j ≠ a_E can be characterized by the transition probability tr(s_{N+2}|s_j, a_j) as well as the duration time z(·|s_j, a_j, s_{N+2}). Once the attacker arrives at a new honeypot n_i, the defender dynamically applies an interaction action chosen from A(s_i) and keeps interacting with the attacker until he transits to the next honeypot. The defender may not change the action before the transition, to reduce the probability of the attacker detecting the change and becoming aware of the honeypot engagement. Since the decision is made at the time of transition, we can transform the above continuous-time model on the horizon t ∈ [0, ∞) into a discrete decision model at decision epochs k ∈ {0, 1, · · · , ∞}. The time of the attacker's k-th transition is denoted by a random variable T^k, the landing state is denoted as s^k ∈ S, and the action adopted after arriving at s^k is denoted as a^k ∈ A(s^k).
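The continuous-time model can be simulated directly once tr and z are specified. The sketch below samples one decision epoch, i.e., a successor state and a sojourn time, assuming exponential sojourn distributions; the numerical parameters are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_epoch(s, a, tr, rate):
    """Sample (next_state, sojourn_time) for one decision epoch of the SMDP.

    tr[(s, a)]           : dict mapping successor state -> transition probability
    rate[(s, a, s_next)] : rate of the exponential sojourn density z(.|s, a, s_next)
    """
    succ, probs = zip(*tr[(s, a)].items())
    s_next = rng.choice(succ, p=probs)
    tau = rng.exponential(1.0 / rate[(s, a, s_next)])
    return s_next, tau

# Illustrative (hypothetical) parameters for one honeypot state s = 1 under a_L;
# state 13 plays the role of the absorbing state.
tr = {(1, "aL"): {2: 0.6, 12: 0.1, 13: 0.3}}
rate = {(1, "aL", 2): 0.5, (1, "aL", 12): 0.2, (1, "aL", 13): 0.1}

print(sample_epoch(1, "aL", tr, rate))
```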

2.4 Investigation Value

The defender gains a reward of investigation by engaging and analyzing the attacker in the honeypot. To simplify the notation, we divide the reward during time t ∈ [0, ∞) into rewards at the discrete decision epochs T^k, k ∈ {0, 1, · · · , ∞}. When τ ∈ [T^k, T^{k+1}] amount of time elapses at stage k, the defender's reward of investigation at time τ of stage k,

r(s^k, a^k, s^{k+1}, T^k, T^{k+1}, \tau) = r_1(s^k, a^k, s^{k+1}) \mathbf{1}_{\{\tau = 0\}} + r_2(s^k, a^k, T^k, T^{k+1}, \tau),

is the sum of two parts. The first part is the immediate cost of applying engagement action a^k ∈ A(s^k) at state s^k ∈ S, and the second part is the reward rate of threat information acquisition minus the cost rate of persistently generating deceptive traffic. Due to the randomness of the attacker's behavior, the information acquisition can also be random; thus the actual reward rate r_2 is perturbed by an additive zero-mean noise w_r.

Different types of attackers target different components of the production system. For example, an attacker who aims to steal data will take intensive


adversarial actions at the database. Thus, if the attacker is actually in the honeynet and adopts the same behavior as he would in the production system, the defender can identify the target of the attack based on the traffic intensity. We specify r_1 and r_2 at each state properly to measure the risk T3. To maximize the value of the investigation, the defender should choose proper actions to lure the attacker to the honeypot emulating the attacker's target in a short time and with a large probability. Moreover, the defender's actions should be able to engage the attacker in the target honeypot actively for a longer time to obtain more valuable threat information. We compute the optimal long-term policy that achieves the above objectives in Sect. 2.5.

As the defender spends more time interacting with attackers, investigating their behaviors, and acquiring a better understanding of their targets and TTPs, less new information can be extracted. In addition, the same intelligence becomes less valuable as time elapses due to timeliness. Thus, we use a discounted factor γ ∈ [0, ∞) to penalize the decreasing value of the investigation as time elapses.

2.5 Optimal Long-Term Policy

The defender aims at a policy π ∈ Π, which maps state s^k ∈ S to action a^k ∈ A(s^k), to maximize the long-term expected utility starting from state s^0, i.e.,

u(s^0, \pi) = E\Big[\sum_{k=0}^{\infty} \int_{T^k}^{T^{k+1}} e^{-\gamma(\tau + T^k)} \big(r(S^k, A^k, S^{k+1}, T^k, T^{k+1}, \tau) + w_r\big)\, d\tau\Big].

At each decision epoch, the value function v(s^0) = \sup_{\pi \in \Pi} u(s^0, \pi) can be represented by dynamic programming, i.e.,

v(s^0) = \sup_{a^0 \in \mathcal{A}(s^0)} E\Big[\int_{T^0}^{T^1} e^{-\gamma(\tau + T^0)} r(s^0, a^0, S^1, T^0, T^1, \tau)\, d\tau + e^{-\gamma T^1} v(S^1)\Big].   (1)

We assume a constant reward rate r_2(s^k, a^k, T^k, T^{k+1}, \tau) = \bar{r}_2(s^k, a^k) for simplicity. Then, (1) can be transformed into an equivalent MDP form, i.e., ∀s^0 ∈ S,

v(s^0) = \sup_{a^0 \in \mathcal{A}(s^0)} \sum_{s^1 \in \mathcal{S}} tr(s^1|s^0, a^0) \big(r^{\gamma}(s^0, a^0, s^1) + z^{\gamma}(s^0, a^0, s^1) v(s^1)\big),   (2)

where z^{\gamma}(s^0, a^0, s^1) := \int_0^{\infty} e^{-\gamma \tau} z(\tau|s^0, a^0, s^1)\, d\tau ∈ [0, 1] is the Laplace transform of the sojourn probability density z(\tau|s^0, a^0, s^1), and the equivalent reward r^{\gamma}(s^0, a^0, s^1) := r_1(s^0, a^0, s^1) + \frac{\bar{r}_2(s^0, a^0)}{\gamma}\big(1 - z^{\gamma}(s^0, a^0, s^1)\big) ∈ [-m_c, m_c] is assumed to be bounded by a constant m_c.


A classical regulation condition for SMDPs, which rules out an infinite number of transitions within a finite time, is stated as follows: there exist constants θ ∈ (0, 1) and δ > 0 such that

\sum_{s^1 \in \mathcal{S}} tr(s^1|s^0, a^0)\, z(\delta|s^0, a^0, s^1) \le 1 - \theta, \quad \forall s^0 \in \mathcal{S}, \ a^0 \in \mathcal{A}(s^0).   (3)

It is shown in [12] that condition (3) is equivalent to \sum_{s^1 \in \mathcal{S}} tr(s^1|s^0, a^0)\, z^{\gamma}(s^0, a^0, s^1) ∈ [0, 1), which serves as the equivalent stage-varying discounted factor for the associated MDP. Then, the right-hand side of (1) is a contraction mapping, and there exists a unique optimal policy π* = arg max_{π ∈ Π} u(s^0, π), which can be found by value iteration, policy iteration, or linear programming.

Cost-Effective Policy. The computation result of our 13-state example system is illustrated in Fig. 2. The optimal policies at honeypot nodes n_1 to n_{11} are represented by different colors. Specifically, actions a_E, a_P, a_L, a_H are denoted in red, blue, purple, and green, respectively. The size of node n_i represents the state value v(s_i). In the example scenario, the honeypots of the database n_{10} and the sensors n_{11} are the main and secondary targets of the attacker, respectively. Thus, defenders can obtain a higher investigation value when they manage to engage the attacker in these two honeypot nodes with a larger probability and for a longer time. However, instead of naively adopting high-interactive actions, a savvy defender also balances the high implementation cost of a_H. Our quantitative results indicate that the high-interactive action should only be applied at n_{10} to be cost-effective. On the other hand, although the bridge nodes n_1, n_2, n_8, which connect to the normal zone n_{12}, do not contain higher investigation values than other nodes, the defender still takes action a_L at these nodes. The goal is to either increase the probability of attracting attackers away from the normal zone or reduce the probability of attackers penetrating the normal zone from these bridge nodes.

Engagement Safety Versus Investigation Values. Restrictive engagement actions give attackers less freedom, so they are less likely to penetrate the normal zone. However, restrictive actions also decrease the probability of obtaining high-level IoCs and thus reduce the investigation value. To quantify the system value under the trade-off between engagement safety and the reward from the investigation, we visualize the trade-off surface in Fig. 3. On the x-axis, a larger penetration probability p(s_{N+1}|s_j, a_j), j ∈ {s_1, s_2, s_8}, a_j ≠ a_E, decreases the value v(s_{10}). On the y-axis, a larger reward r^γ(s_j, a_j, s_l), j ∈ S\{s_{12}, s_{13}}, l ∈ S, increases the value. The figure also shows that the value v(s_{10}) changes at a higher rate, i.e., is more sensitive, when the penetration probability is small and the reward from the investigation is large. In our scenario, the penetration probability has less influence on the value than the investigation reward, which motivates a less restrictive engagement.


Fig. 3. The trade-off surface of v(s_{10}) on the z-axis under different values of the penetration probability p(s_{N+1}|s_j, a_j), j ∈ {s_1, s_2, s_8}, a_j ≠ a_E, on the x-axis, and the reward r^γ(s_j, a_j, s_l), j ∈ S\{s_{12}, s_{13}}, l ∈ S, on the y-axis.
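To illustrate how the optimal policy can be computed in practice, the sketch below runs value iteration on the equivalent MDP form (2), using exponential sojourn times so that z^γ(s, a, s') = λ(s, a, s')/(λ(s, a, s') + γ). The toy transition probabilities, rates, and rewards are hypothetical placeholders rather than the 13-state model of Fig. 2.

```python
GAMMA = 0.1  # discount parameter gamma

# Hypothetical 3-state toy model: states 0, 1 and an absorbing state 2.
states = [0, 1, 2]
actions = {0: ["aP", "aH"], 1: ["aP", "aH"], 2: ["stay"]}
tr = {  # tr[(s, a)][s'] = transition probability
    (0, "aP"): {1: 0.7, 2: 0.3}, (0, "aH"): {1: 0.9, 2: 0.1},
    (1, "aP"): {0: 0.5, 2: 0.5}, (1, "aH"): {0: 0.8, 2: 0.2},
    (2, "stay"): {2: 1.0},
}
lam = lambda s, a, s2: 0.5  # exponential sojourn rate (placeholder, uniform here)
r1 = {(s, a): -0.2 if a == "aH" else 0.0 for s in states for a in actions[s]}
r2 = {(0, "aP"): 1.0, (0, "aH"): 1.5, (1, "aP"): 0.8, (1, "aH"): 1.2, (2, "stay"): 0.0}

def z_gamma(s, a, s2):           # Laplace transform of the exponential sojourn density
    return lam(s, a, s2) / (lam(s, a, s2) + GAMMA)

def r_gamma(s, a, s2):           # equivalent one-stage reward used in (2)
    return r1[(s, a)] + r2[(s, a)] / GAMMA * (1.0 - z_gamma(s, a, s2))

v = {s: 0.0 for s in states}
for _ in range(500):             # value iteration on the equivalent MDP
    v = {s: max(sum(p * (r_gamma(s, a, s2) + z_gamma(s, a, s2) * v[s2])
                    for s2, p in tr[(s, a)].items())
                for a in actions[s])
         for s in states}

policy = {s: max(actions[s],
                 key=lambda a: sum(p * (r_gamma(s, a, s2) + z_gamma(s, a, s2) * v[s2])
                                   for s2, p in tr[(s, a)].items()))
          for s in states}
print(v, policy)
```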

3 Risk Assessment

Given any feasible engagement policy π ∈ Π, the SMDP becomes a semi-Markov process [24]. We analyze the evolution of the occupancy distribution and the first passage time in Sects. 3.1 and 3.2, respectively, which leads to three security metrics during the honeypot engagement. To shed light on the defense against APTs, we investigate the system performance against attackers with different levels of persistence and intelligence in Sect. 3.3.

3.1 Transition Probability of Semi-Markov Process

Define the cumulative probability q_{ij}(t) of the one-step transition from {S^k = i, T^k = t_k} to {S^{k+1} = j, T^{k+1} = t_k + t} as

q_{ij}(t) := \Pr(S^{k+1} = j, T^{k+1} - t_k \le t \mid S^k = i, T^k = t_k) = tr(j|i, \pi(i)) \int_0^t z(\tau|i, \pi(i), j)\, d\tau, \quad \forall i, j \in \mathcal{S}, \ t \ge 0.

Based on a variation of the forward Kolmogorov equation, where the one-step transition lands on an intermediate state l ∈ S at time T^{k+1} = t_k + u, ∀u ∈ [0, t], the transition probability of the system being in state j at time t, given the initial state i at time 0, can be represented as

p_{ii}(t) = 1 - \sum_{h \in \mathcal{S}} q_{ih}(t) + \sum_{l \in \mathcal{S}} \int_0^t p_{li}(t - u)\, dq_{il}(u),
p_{ij}(t) = \sum_{l \in \mathcal{S}} \int_0^t p_{lj}(t - u)\, dq_{il}(u) = \sum_{l \in \mathcal{S}} p_{lj}(t) \star \frac{dq_{il}(t)}{dt}, \quad \forall i, j \in \mathcal{S}, \ j \ne i, \ \forall t \ge 0,

where 1 - \sum_{h \in \mathcal{S}} q_{ih}(t) is the probability that no transition happens before time t. We can easily verify that \sum_{l \in \mathcal{S}} p_{il}(t) = 1, ∀i ∈ S, ∀t ∈ [0, ∞). To compute p_{ij}(t) and p_{ii}(t), we can take the Laplace transform and then solve two sets of linear equations. For simplicity, we specify z(\tau|i, \pi(i), j) to be exponential distributions with parameters λ_{ij}(π(i)), and the semi-Markov process degenerates to a continuous-time Markov chain. Then, we obtain the infinitesimal generator via the Leibniz integral rule, i.e.,

\bar{q}_{ij} := \frac{dp_{ij}(t)}{dt}\Big|_{t=0} = \lambda_{ij}(\pi(i)) \cdot tr(j|i, \pi(i)) > 0, \quad \forall i, j \in \mathcal{S}, \ j \ne i,
\bar{q}_{ii} := \frac{dp_{ii}(t)}{dt}\Big|_{t=0} = -\sum_{j \in \mathcal{S} \setminus \{i\}} \bar{q}_{ij} < 0, \quad \forall i \in \mathcal{S}.

Define the matrix \bar{Q} := [\bar{q}_{ij}]_{i,j \in \mathcal{S}} and the vector P_i(t) = [p_{ij}(t)]_{j \in \mathcal{S}}; then, based on the forward Kolmogorov equation,

\frac{dP_i(t)}{dt} = \lim_{u \to 0^+} \frac{P_i(t+u) - P_i(t)}{u} = \lim_{u \to 0^+} \frac{P_i(u) - I}{u} P_i(t) = \bar{Q} P_i(t).

Thus, we can compute the first security metric, the occupancy distribution of any state s ∈ S at time t starting from the initial state i ∈ S at time 0, i.e.,

P_i(t) = e^{\bar{Q} t} P_i(0), \quad \forall i \in \mathcal{S}.   (4)

We plot the evolution of p_{ij}(t), i = s_{N+1}, j ∈ {s_1, s_2, s_{10}, s_{12}}, versus t ∈ [0, ∞) in Fig. 4 and the limiting occupancy distribution p_{ij}(∞), i = s_{N+1}, in Fig. 5. In Fig. 4, although the attacker starts at the normal zone i = s_{N+1}, our engagement policy can quickly attract the attacker into the honeynet. Figure 5 demonstrates that the engagement policy can keep the attacker in the honeynet with a dominant probability of 91% and, specifically, in the target honeypot n_{10} with a high probability of 41%. The honeypots connecting to the normal zone also have a higher occupancy probability than nodes n_3, n_4, n_5, n_6, n_7, n_9, which are less likely to be explored by the attacker due to the network topology.

Fig. 4. Evolution of p_{ij}(t), i = s_{N+1}.

Fig. 5. The limiting occupancy distribution.
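The occupancy distribution (4) can be evaluated numerically with a matrix exponential. The sketch below does so for a hypothetical three-state generator, using the standard convention that row i of e^{Q̄t} is the distribution at time t when starting from state i; the numbers are illustrative, not the 13-state generator of the paper.

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical 3-state generator Q_bar (rows sum to zero); entry [i, j], i != j,
# equals lambda_ij(pi(i)) * tr(j | i, pi(i)) under the fixed engagement policy pi.
Q_bar = np.array([
    [-0.6,  0.4,  0.2],
    [ 0.3, -0.5,  0.2],
    [ 0.1,  0.2, -0.3],
])

def occupancy(i, t):
    """p_ij(t) for all j: start in state i at time 0 and evaluate exp(Q_bar * t)."""
    return expm(Q_bar * t)[i]

print(occupancy(0, 5.0))    # transient occupancy distribution at t = 5
print(occupancy(0, 1e4))    # effectively the limiting distribution
```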

3.2 First Passage Time

Another quantitative measure of interest is the first passage time T_{iD} of visiting a set D ⊂ S, starting from i ∈ S\D at time 0. Define the cumulative probability function f^c_{iD}(t) := Pr(T_{iD} ≤ t); then

f^c_{iD}(t) = \sum_{h \in \mathcal{D}} q_{ih}(t) + \sum_{l \in \mathcal{S} \setminus \mathcal{D}} \int_0^t f^c_{lD}(t - u)\, dq_{il}(u).

In particular, if D = {j}, then the probability density function f_{ij}(t) := \frac{df^c_{ij}(t)}{dt} satisfies

p_{ij}(t) = \int_0^t p_{jj}(t - u)\, df^c_{ij}(u) = p_{jj}(t) \star f_{ij}(t), \quad \forall i, j \in \mathcal{S}, \ j \ne i.

Take the Laplace transform \bar{p}_{ij}(s) := \int_0^{\infty} e^{-st} p_{ij}(t)\, dt, and then take the inverse Laplace transform of \bar{f}_{ij}(s) = \frac{\bar{p}_{ij}(s)}{\bar{p}_{jj}(s)}; we obtain

f_{ij}(t) = \int_0^{\infty} e^{st} \frac{\bar{p}_{ij}(s)}{\bar{p}_{jj}(s)}\, ds, \quad \forall i, j \in \mathcal{S}, \ j \ne i.   (5)

We define the second security metric, the attraction efficiency, as the probability that the first passage time T_{s_{12},s_{10}} is less than a threshold t_{th}. Based on (4) and (5), the probability density function of T_{s_{12},s_{10}} is shown in Fig. 6. We take the mean, denoted by the orange line, as the threshold t_{th}; the attraction efficiency is then 0.63, which means that the defender can attract the attacker from the normal zone to the database honeypot in less than t_{th} = 20.7 with a probability of 0.63.

Fig. 6. Probability density function of T_{s_{12},s_{10}}.
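As an alternative to inverting the Laplace transform in (5), the attraction efficiency can be estimated by simulating the continuous-time chain. The sketch below estimates Pr(T ≤ t_th) by Monte Carlo for the same hypothetical three-state generator used earlier; it illustrates the metric rather than reproducing the paper's computation.

```python
import numpy as np

rng = np.random.default_rng(2)

def first_passage_time(Q_bar, start, target, t_max=1e3):
    """Simulate the CTMC with generator Q_bar until `target` is first hit."""
    t, s = 0.0, start
    while s != target and t < t_max:
        rate = -Q_bar[s, s]                      # total exit rate of the current state
        t += rng.exponential(1.0 / rate)         # holding time before the next jump
        probs = np.clip(Q_bar[s], 0.0, None)     # jump probabilities from off-diagonal rates
        s = rng.choice(len(probs), p=probs / probs.sum())
    return t

Q_bar = np.array([[-0.6, 0.4, 0.2], [0.3, -0.5, 0.2], [0.1, 0.2, -0.3]])
samples = np.array([first_passage_time(Q_bar, start=0, target=2) for _ in range(5000)])
t_th = samples.mean()                            # mean used as the threshold, as above
print("attraction efficiency Pr(T <= t_th):", np.mean(samples <= t_th))
```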

Mean First Passage Time. The third security metric of concern is the average engagement efficiency, defined as the Mean First Passage Time (MFPT) t^m_{iD} = E[T_{iD}], ∀i ∈ S, D ⊂ S. Under the exponential sojourn distribution, the MFPT can be computed directly through a system of linear equations, i.e.,

t^m_{iD} = 0, \quad i \in \mathcal{D},
1 + \sum_{l \in \mathcal{S}} \bar{q}_{il}\, t^m_{lD} = 0, \quad i \notin \mathcal{D}.   (6)


In general, the MFPT is asymmetric, i.e., t^m_{ij} ≠ t^m_{ji}, ∀i, j ∈ S. Based on (6), we compute the MFPT from and to the normal zone in Figs. 7 and 8, respectively. The color of each node indicates the value of the MFPT. In Fig. 7, the honeypot nodes that directly connect to the normal zone have the shortest MFPT, and it takes attackers a much longer time to visit the honeypots of clients due to the network topology. Figure 8 shows that the defender can engage attackers in the target honeypot nodes of the database and sensors for a longer time. The engagements at the client nodes are much less attractive. Note that the two figures have different time scales, denoted by the color-bar values, and the comparison shows that it generally takes the defender more time and effort to attract the attacker from the normal zone.

The MFPT from the normal zone t^m_{s_{12},j} measures the average time it takes to attract the attacker to honeypot state j ∈ S\{s_{12}, s_{13}} for the first time. On the contrary, the MFPT to the normal zone t^m_{i,s_{12}} measures the average time of the attacker penetrating the normal zone from honeypot state i ∈ S\{s_{12}, s_{13}} for the first time. If the defender pursues absolute security and ejects the attacker once he reaches the normal zone, then Fig. 8 also shows the attacker's average sojourn time in the honeynet starting from different honeypot nodes.
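The MFPT system (6) reduces to a linear solve. The sketch below computes t^m_{iD} for a hypothetical three-state generator with target set D = {2}; it illustrates the computation rather than reproducing Figs. 7 and 8.

```python
import numpy as np

Q_bar = np.array([[-0.6, 0.4, 0.2], [0.3, -0.5, 0.2], [0.1, 0.2, -0.3]])
D = {2}                                      # target set of states

# Solve 1 + sum_l q_bar[i, l] * t[l] = 0 for i not in D, with t[l] = 0 on D.
free = [i for i in range(len(Q_bar)) if i not in D]
A = Q_bar[np.ix_(free, free)]                # generator restricted to non-target states
t_free = np.linalg.solve(A, -np.ones(len(free)))

mfpt = np.zeros(len(Q_bar))
mfpt[free] = t_free
print("mean first passage times to D:", mfpt)
```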

Fig. 7. MFPT from the normal zone t^m_{s_{12},j}.

Fig. 8. MFPT to the normal zone t^m_{i,s_{12}}.

3.3 Advanced Persistent Threats

In this section, we quantify three engagement criteria for attackers of different levels of persistence and intelligence in Figs. 9 and 10, respectively. The criteria are the stationary probability of the normal zone p_{i,s_{12}}(∞), ∀i ∈ S\{s_{13}}, the utility of the normal zone v(s_{12}), and the expected utility over the stationary probability, i.e., \sum_{j ∈ \mathcal{S}} p_{ij}(∞) v(j), ∀i ∈ S\{s_{13}}. As shown in Fig. 9, when the attacker is at the normal zone i = s_{12} and the defender chooses action a = a_A, a larger parameter λ := λ_{ij}(a_A), ∀j ∈ {s_1, s_2, s_8}, of the exponential sojourn distribution indicates that the attacker is more inclined to respond to the honeypot attraction, and thus less time is required to attract the attacker away from the normal zone. As the persistence level λ increases from 0.1


to 2.5, the stationary probability of the normal zone decreases and the expected utility over the stationary probability increases; both converge to their stable values. The change rate is higher during λ ∈ (0, 0.5] and much lower afterward. On the other hand, the utility loss at the normal zone decreases approximately linearly during the entire period λ ∈ (0, 2.5].

As shown in Fig. 10, when the attacker becomes more advanced with a larger failure probability of attraction, i.e., p := p(j|s_{12}, a_A), ∀j ∈ {s_{12}, s_{13}}, he can stay in the normal zone with a larger probability. A significant increase happens after p ≥ 0.5. On the other hand, as p increases from 0 to 1, the utility of the normal zone reduces linearly, and the expected utility over the stationary probability remains approximately unchanged until p ≥ 0.9.

Figures 9 and 10 demonstrate that the expected utility over the stationary probability decreases significantly only in the extreme cases of a high transition frequency and a large penetration probability. Similarly, the stationary probability of the normal zone remains small in most cases except for the above extreme ones. Thus, our policy provides a robust expected utility as well as a low-risk engagement over a large range of changes in the attacker's persistence and intelligence.


Fig. 9. Three engagement criteria under different persistence levels λ ∈ (0, 2.5].

Fig. 10. Three engagement criteria under different intelligence levels p ∈ [0, 1].

4 Reinforcement Learning of SMDP

Due to the absence of knowledge of an exact SMDP model, i.e., the investigation reward, the attacker's transition probability (and even the network topology), and the sojourn distribution, the defender has to learn the optimal engagement policy based on the actual experience of the honeynet interactions. As one of the classical model-free reinforcement learning methods, the Q-learning algorithm for SMDPs has been stated in [3], i.e.,

Q^{k+1}(s^k, a^k) := (1 - \alpha^k(s^k, a^k)) Q^k(s^k, a^k) + \alpha^k(s^k, a^k) \Big[ \bar{r}_1(s^k, a^k, \bar{s}^{k+1}) + \bar{r}_2(s^k, a^k) \frac{1 - e^{-\gamma \bar{\tau}^k}}{\gamma} + e^{-\gamma \bar{\tau}^k} \max_{a' \in \mathcal{A}(\bar{s}^{k+1})} Q^k(\bar{s}^{k+1}, a') \Big],   (7)


where s^k is the current state sample, a^k is the currently selected action, α^k(s^k, a^k) ∈ (0, 1) is the learning rate, \bar{s}^{k+1} is the observed state at the next stage, \bar{r}_1, \bar{r}_2 are the observed investigation rewards, and \bar{\tau}^k is the observed sojourn time at state s^k. When the learning rate satisfies \sum_{k=0}^{\infty} \alpha^k(s^k, a^k) = \infty and \sum_{k=0}^{\infty} (\alpha^k(s^k, a^k))^2 < \infty, ∀s^k ∈ S, ∀a^k ∈ A(s^k), and all state-action pairs are explored infinitely often, \max_{a' \in \mathcal{A}(s^k)} Q^k(s^k, a'), k → ∞, in (7) converges to the value v(s^k) with probability 1.

At each decision epoch k ∈ {0, 1, · · · }, the action a^k is chosen according to the ε-greedy policy, i.e., the defender chooses the optimal action \arg\max_{a' \in \mathcal{A}(s^k)} Q^k(s^k, a') with probability 1 − ε, and a random action with probability ε. Note that the exploration rate ε ∈ (0, 1] should not be too small, so as to guarantee sufficient samples of all state-action pairs. The Q-learning algorithm under a pure exploration policy ε = 1 still converges, yet at a slower rate.

In our scenario, the defender knows the reward of the ejection action a_E and v(s_{13}) = 0, and thus does not need to explore action a_E to learn it. We plot one learning trajectory of the state transition and sojourn time under the ε-greedy exploration policy in Fig. 11, where the chosen actions a_E, a_P, a_L, a_H are denoted in red, blue, purple, and green, respectively. If the ejection reward were unknown, the defender should be restrictive in exploring a_E, which terminates the learning process. Otherwise, the defender may need to engage with a group of attackers who share similar behaviors to obtain sufficient samples to learn the optimal engagement policy.
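A minimal sketch of the update rule (7) together with ε-greedy action selection is given below. In the paper's setting, the state, action, reward, and sojourn-time samples come from the monitored honeynet; here they are treated as opaque inputs, and all constants are illustrative assumptions.

```python
import random
from math import exp
from collections import defaultdict

GAMMA, EPS = 0.1, 0.2             # discount parameter and exploration rate (placeholders)
Q = defaultdict(float)            # Q[(s, a)], initialized to zero

def epsilon_greedy(s, actions):
    """Choose the greedy action with probability 1 - EPS, a random one otherwise."""
    if random.random() < EPS:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_update(s, a, r1_obs, r2_obs, tau_obs, s_next, next_actions, alpha):
    """One application of the SMDP Q-learning update (7) with learning rate alpha."""
    discount = exp(-GAMMA * tau_obs)
    target = (r1_obs
              + r2_obs * (1.0 - discount) / GAMMA
              + discount * max((Q[(s_next, a2)] for a2 in next_actions), default=0.0))
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```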


Fig. 11. One instance of Q-learning on SMDP where the x-axis shows the sojourn time and the y-axis represents the state transition. The chosen actions aE , aP , aL , aH are denoted in red, blue, purple, and green, respectively. (Color figure online)

In particular, we choose

\alpha^k(s^k, a^k) = \frac{k_c}{k_{\{s^k, a^k\}} - 1 + k_c}, \quad \forall s^k \in \mathcal{S}, \ \forall a^k \in \mathcal{A}(s^k),

to guarantee the asymptotic convergence, where k_c ∈ (0, ∞) is a constant parameter and k_{\{s^k, a^k\}} ∈ {0, 1, · · · } is the number of visits to the state-action pair {s^k, a^k} up to stage k. We need to choose a proper value of k_c to guarantee a good numerical performance of convergence in finite steps, as shown in Fig. 12. We shift the green and blue lines vertically to avoid overlap with the red line and represent the corresponding theoretical values by dotted black lines. If k_c is too small, as shown in the red line, the learning rate decreases so fast that newly observed samples hardly update the Q-value, and the defender may need a long time to learn the right value. However, if k_c is too large, as shown in the green line, the learning rate decreases so slowly that new samples contribute significantly to the current Q-value. This causes a large variation and a slower convergence rate of \max_{a' \in \mathcal{A}(s_{12})} Q^k(s_{12}, a').

We show the convergence of the policy and value under k_c = 1, ε = 0.2, in the video demo (see URL: https://bit.ly/2QUz3Ok). In the video, the color of each node n_k distinguishes the defender's action a^k at state s^k, and the size of the node is proportional to \max_{a' \in \mathcal{A}(s^k)} Q^k(s^k, a') at stage k. To show the convergence, we decrease the value of ε gradually to 0 after 5000 steps. Since the convergence trajectory is stochastic, we run the simulation 100 times and plot the mean and the variance of Q^k(s_{12}, a_P) of state s_{12} under the optimal policy π(s_{12}) = a_P in Fig. 13. The mean in red converges to the theoretical value in about 400 steps, and the variance in blue reduces dramatically as step k increases.
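For concreteness, the count-based learning-rate schedule above can be implemented in a few lines; k_c is a tunable constant, and a larger k_c keeps the rate high for longer, matching the trade-off illustrated in Fig. 12. This is a sketch under the stated assumptions, not the authors' implementation.

```python
from collections import defaultdict

K_C = 1.0                          # constant parameter k_c of the schedule
visits = defaultdict(int)          # k_{s, a}: number of visits to each state-action pair

def learning_rate(s, a):
    """alpha^k(s, a) = k_c / (k_{s,a} - 1 + k_c); decays with the visit count."""
    visits[(s, a)] += 1
    return K_C / (visits[(s, a)] - 1 + K_C)

# First few values for one pair: 1.0, 0.5, 0.333..., i.e. a harmonic decay for k_c = 1.
print([round(learning_rate(1, "aP"), 3) for _ in range(4)])
```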


Fig. 12. The convergence rate under different values of kc . (Color figure online)

Fig. 13. The evolution of the mean and the variance of Q^k(s_{12}, a_P). (Color figure online)

4.1 Discussion

In this section, we discuss the challenges and related future directions of reinforcement learning in the honeypot engagement scenario.


Non-cooperative and Adversarial Learning Environment. The major challenge of learning in the security scenario is that the defender lacks full control of the learning environment, which limits the scope of feasible reinforcement learning algorithms. In a classical reinforcement learning task, the learner can choose to start at any state at any time and repeatedly simulate paths from the target state. In the adaptive honeypot engagement problem, however, the defender can remove attackers but cannot arbitrarily draw them to the target honeypot and force them to show their attacking behaviors, because the true threat information is revealed only when attackers are unaware of the honeypot engagement. Future work could generalize the current framework to an adversarial learning environment where a savvy attacker can detect the honeypot and adopt deceptive behaviors to interrupt the learning process.

Risk Reduction During the Learning Period. Since the learning process is based on samples from real interactions, the defender needs to be concerned with system safety and security during the learning period. For example, if the visit to and sojourn in the normal zone bring a significant amount of losses, we can use the SARSA algorithm to conduct a more conservative learning process than Q-learning. Other safe reinforcement learning methods are surveyed in [8] and are left as future work.

Asymptotic Versus Finite-Step Convergence. Since an attacker can terminate the interaction on his own, the engagement time with the attacker may be limited. Thus, compared to an asymptotic convergence of policy learning, the defender aims more at speedy learning of the attacker's behaviors in finite steps and, meanwhile, at achieving a good engagement performance in these finite steps. Previous works have studied the convergence rate [6] and the non-asymptotic convergence [18,19] in the MDP setting. For example, [6] has shown a relationship between the convergence rate and the learning rate of Q-learning, [19] has provided a performance bound on the finite-sample convergence rate, and [18] has proposed the E^3 algorithm, which achieves a near-optimal policy with a large probability in polynomial time. However, in the honeypot engagement problem, the defender does not know the number of remaining steps for which she can interact with the attacker, because the attacker can terminate on his own. Thus, we cannot directly apply the E^3 algorithm, which depends on the horizon time. Moreover, since attackers may change their behaviors during a long learning period, the learning algorithm needs to adapt to changes of the SMDP model quickly.

In this preliminary work, we use the ε-greedy policy to trade off exploitation and exploration during the finite learning time. The ε can be set at a relatively large value without the gradual decrease, so that the learning algorithm persistently adapts to the changes in the environment. On the other hand, the defender can keep a larger discounted factor γ to focus on the immediate investigation reward. If the defender expects a short interaction time, i.e., the


attacker is likely to terminate in the near future, she can increase the discounted factor in the learning process to adapt to her expectations.

Transfer Learning. In general, the learning algorithm on an SMDP converges more slowly than the one on an MDP because the sojourn distribution introduces extra randomness. Thus, instead of learning from scratch, the defender can attempt to reuse past experience with attackers of similar behaviors to expedite the learning process, which motivates the investigation of transfer learning in reinforcement learning [39]. Some side-channel information may also contribute to the transfer learning.

5 Conclusion

A honeynet is a promising active defense scheme. Compared to traditional passive defense techniques such as the firewall and intrusion detection systems, the engagement with attackers can reveal a large range of Indicators of Compromise (IoCs) at a lower rate of false alarms and missed detections. However, the active interaction also introduces the risks of attackers identifying the honeypot setting, penetrating the production system, and a high implementation cost of persistent synthetic traffic generation. Since the reward depends on the honeypot's type, the defender aims to lure the attacker into the target honeypot in the shortest time. To satisfy the above requirements of security, cost, and timeliness, we leverage the Semi-Markov Decision Process (SMDP) to model the transition probability, sojourn distribution, and investigation reward. After transforming the continuous-time process into the equivalent discrete decision model, we have obtained long-term optimal policies that are risk-averse, cost-effective, and time-efficient.

We have theoretically analyzed the security metrics of the occupancy distribution, attraction efficiency, and average engagement efficiency based on the transition probability and the probability density function of the first passage time. The numerical results have shown that the honeypot engagement can engage the attacker in the target honeypot with a large probability and at a desired speed. In the meantime, the penetration probability is kept under a bearable level most of the time. The results also demonstrate that it is a worthwhile compromise of immediate security to allow a small penetration probability so that a high investigation reward can be obtained in the long run.

Finally, we have applied reinforcement learning methods on the SMDP in case the defender cannot obtain the exact model of the attacker's behaviors. Based on a prudent choice of the learning rate and the exploration-exploitation policy, we have achieved a quick convergence rate of the optimal policy and the value. Moreover, the variance of the learning process decreases dramatically with the number of observed samples.


References

1. Al-Shaer, E.S., Wei, J., Hamlen, K.W., Wang, C.: Autonomous Cyber Deception: Reasoning, Adaptive Planning, and Evaluation of HoneyThings. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-02110-8
2. Bianco, D.: The pyramid of pain (2013). http://detect-respond.blogspot.com/2013/03/the-pyramid-of-pain.html
3. Bradtke, S.J., Duff, M.O.: Reinforcement learning methods for continuous-time Markov decision problems. In: Advances in Neural Information Processing Systems, pp. 393–400 (1995)
4. Chen, D., Trivedi, K.S.: Optimization for condition-based maintenance with semi-Markov decision process. Reliab. Eng. Syst. Saf. 90(1), 25–29 (2005)
5. Chen, J., Zhu, Q.: Security as a service for cloud-enabled internet of controlled things under advanced persistent threats: a contract design approach. IEEE Trans. Inf. Forensics Secur. 12(11), 2736–2750 (2017)
6. Even-Dar, E., Mansour, Y.: Learning rates for Q-learning. J. Mach. Learn. Res. 5(Dec), 1–25 (2003)
7. Farhang, S., Manshaei, M.H., Esfahani, M.N., Zhu, Q.: A dynamic Bayesian security game framework for strategic defense mechanism design. In: Poovendran, R., Saad, W. (eds.) GameSec 2014. LNCS, vol. 8840, pp. 319–328. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-12601-2_18
8. García, J., Fernández, F.: A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16(1), 1437–1480 (2015)
9. Hayel, Y., Zhu, Q.: Attack-aware cyber insurance for risk sharing in computer networks. In: Khouzani, M., Panaousis, E., Theodorakopoulos, G. (eds.) GameSec 2015. LNCS, vol. 9406, pp. 22–34. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25594-1_2
10. Hecker, C.R.: A methodology for intelligent honeypot deployment and active engagement of attackers. Ph.D. thesis (2012). AAI3534194
11. Horák, K., Zhu, Q., Bošanský, B.: Manipulating adversary's belief: a dynamic game approach to deception by design for proactive network security. In: Rass, S., An, B., Kiekintveld, C., Fang, F., Schauer, S. (eds.) GameSec 2017. LNCS, vol. 10575, pp. 273–294. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68711-7_15
12. Hu, Q., Yue, W.: Markov Decision Processes with Their Applications, vol. 14. Springer, Boston (2008). https://doi.org/10.1007/978-0-387-36951-8
13. Huang, L., Chen, J., Zhu, Q.: Distributed and optimal resilient planning of large-scale interdependent critical infrastructures. In: 2018 Winter Simulation Conference (WSC), pp. 1096–1107. IEEE (2018)
14. Huang, L., Chen, J., Zhu, Q.: Factored Markov game theory for secure interdependent infrastructure networks. In: Rass, S., Schauer, S. (eds.) Game Theory for Security and Risk Management. SDGTFA, pp. 99–126. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-75268-6_5
15. Huang, L., Zhu, Q.: Adaptive strategic cyber defense for advanced persistent threats in critical infrastructure networks. ACM SIGMETRICS Perform. Eval. Rev. 46(2), 52–56 (2018)
16. Huang, L., Zhu, Q.: A dynamic games approach to proactive defense strategies against advanced persistent threats in cyber-physical systems. arXiv preprint arXiv:1906.09687 (2019)
17. Jajodia, S., Ghosh, A.K., Swarup, V., Wang, C., Wang, X.S.: Moving Target Defense: Creating Asymmetric Uncertainty for Cyber Threats, vol. 54. Springer, New York (2011). https://doi.org/10.1007/978-1-4614-0977-9


18. Kearns, M., Singh, S.: Near-optimal reinforcement learning in polynomial time. Mach. Learn. 49(2–3), 209–232 (2002)
19. Kearns, M.J., Singh, S.P.: Finite-sample convergence rates for Q-learning and indirect algorithms. In: Advances in Neural Information Processing Systems, pp. 996–1002 (1999)
20. La, Q.D., Quek, T.Q., Lee, J., Jin, S., Zhu, H.: Deceptive attack and defense game in honeypot-enabled networks for the internet of things. IEEE Internet Things J. 3(6), 1025–1035 (2016)
21. Liang, H., Cai, L.X., Huang, D., Shen, X., Peng, D.: An SMDP-based service model for interdomain resource allocation in mobile cloud networks. IEEE Trans. Veh. Technol. 61(5), 2222–2232 (2012)
22. Luo, T., Xu, Z., Jin, X., Jia, Y., Ouyang, X.: IoTCandyJar: towards an intelligent-interaction honeypot for IoT devices. Black Hat (2017)
23. Mudrinich, E.M.: Cyber 3.0: the department of defense strategy for operating in cyberspace and the attribution problem. AFL Rev. 68, 167 (2012)
24. Nakagawa, T.: Stochastic Processes: with Applications to Reliability Theory. Springer, London (2011). https://doi.org/10.1007/978-0-85729-274-2
25. Paruchuri, P., Pearce, J.P., Marecki, J., Tambe, M., Ordonez, F., Kraus, S.: Playing games for security: an efficient exact algorithm for solving Bayesian Stackelberg games. In: Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems, vol. 2, pp. 895–902. International Foundation for Autonomous Agents and Multiagent Systems (2008)
26. Pauna, A., Iacob, A.C., Bica, I.: QRASSH - a self-adaptive SSH honeypot driven by Q-learning. In: 2018 International Conference on Communications (COMM), pp. 441–446. IEEE (2018)
27. Pawlick, J., Colbert, E., Zhu, Q.: Modeling and analysis of leaky deception using signaling games with evidence. IEEE Trans. Inf. Forensics Secur. 14(7), 1871–1886 (2018)
28. Pawlick, J., Colbert, E., Zhu, Q.: A game-theoretic taxonomy and survey of defensive deception for cybersecurity and privacy. ACM Comput. Surv. (CSUR) (2019, to appear)
29. Pawlick, J., Farhang, S., Zhu, Q.: Flip the cloud: cyber-physical signaling games in the presence of advanced persistent threats. In: Khouzani, M., Panaousis, E., Theodorakopoulos, G. (eds.) GameSec 2015. LNCS, vol. 9406, pp. 289–308. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25594-1_16
30. Pawlick, J., Nguyen, T.T.H., Colbert, E., Zhu, Q.: Optimal timing in dynamic and robust attacker engagement during advanced persistent threats. In: 2019 17th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), pp. 1–6. IEEE (2019)
31. Pawlick, J., Zhu, Q.: A Stackelberg game perspective on the conflict between machine learning and data obfuscation. In: 2016 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–6. IEEE (2016). http://ieeexplore.ieee.org/abstract/document/7823893/
32. Pawlick, J., Zhu, Q.: A mean-field Stackelberg game approach for obfuscation adoption in empirical risk minimization. In: 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 518–522. IEEE (2017)
33. Pawlick, J., Zhu, Q.: Proactive defense against physical denial of service attacks using Poisson signaling games. In: Rass, S., An, B., Kiekintveld, C., Fang, F., Schauer, S. (eds.) GameSec 2017. LNCS, vol. 10575, pp. 336–356. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68711-7_18


34. Pouget, F., Dacier, M., Debar, H.: White paper: honeypot, honeynet, honeytoken: terminological issues. Rapport technique EURECOM 1275 (2003)
35. Rid, T., Buchanan, B.: Attributing cyber attacks. J. Strateg. Stud. 38(1–2), 4–37 (2015)
36. Sahabandu, D., Xiao, B., Clark, A., Lee, S., Lee, W., Poovendran, R.: DIFT games: dynamic information flow tracking games for advanced persistent threats. In: 2018 IEEE Conference on Decision and Control (CDC), pp. 1136–1143. IEEE (2018)
37. Spitzner, L.: Honeypots: Tracking Hackers, vol. 1. Addison-Wesley, Reading (2003)
38. Sun, Y., Uysal-Biyikoglu, E., Yates, R.D., Koksal, C.E., Shroff, N.B.: Update or wait: how to keep your data fresh. IEEE Trans. Inf. Theory 63(11), 7492–7508 (2017)
39. Taylor, M.E., Stone, P.: Transfer learning for reinforcement learning domains: a survey. J. Mach. Learn. Res. 10(Jul), 1633–1685 (2009)
40. Wagener, G., State, R., Dulaunoy, A., Engel, T.: Self adaptive high interaction honeypots driven by game theory. In: Guerraoui, R., Petit, F. (eds.) SSS 2009. LNCS, vol. 5873, pp. 741–755. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-05118-0_51
41. Wang, K., Du, M., Maharjan, S., Sun, Y.: Strategic honeypot game model for distributed denial of service attacks in the smart grid. IEEE Trans. Smart Grid 8(5), 2474–2482 (2017)
42. Xu, Z., Zhu, Q.: A cyber-physical game framework for secure and resilient multi-agent autonomous systems. In: 2015 IEEE 54th Annual Conference on Decision and Control (CDC), pp. 5156–5161. IEEE (2015)
43. Zhang, R., Zhu, Q., Hayel, Y.: A bi-level game approach to attack-aware cyber insurance of computer networks. IEEE J. Sel. Areas Commun. 35(3), 779–794 (2017)
44. Zhang, T., Zhu, Q.: Dynamic differential privacy for ADMM-based distributed classification learning. IEEE Trans. Inf. Forensics Secur. 12(1), 172–187 (2017). http://ieeexplore.ieee.org/abstract/document/7563366/
45. Zhang, T., Zhu, Q.: Distributed privacy-preserving collaborative intrusion detection systems for VANETs. IEEE Trans. Sig. Inf. Process. Netw. 4(1), 148–161 (2018)
46. Zhu, Q., Başar, T.: Game-theoretic methods for robustness, security, and resilience of cyber-physical control systems: games-in-games principle for optimal cross-layer resilient control systems. IEEE Control Syst. Mag. 35(1), 46–65 (2015)
47. Zhu, Q., Başar, T.: Dynamic policy-based IDS configuration. In: Proceedings of the 48th IEEE Conference on Decision and Control, 2009 Held Jointly with the 2009 28th Chinese Control Conference, CDC/CCC 2009, pp. 8600–8605. IEEE (2009)
48. Zhu, Q., Başar, T.: Game-theoretic approach to feedback-driven multi-stage moving target defense. In: Das, S.K., Nita-Rotaru, C., Kantarcioglu, M. (eds.) GameSec 2013. LNCS, vol. 8252, pp. 246–263. Springer, Cham (2013). https://doi.org/10.1007/978-3-319-02786-9_15
49. Zhu, Q., Clark, A., Poovendran, R., Başar, T.: Deployment and exploitation of deceptive honeybots in social networks. In: 2013 IEEE 52nd Annual Conference on Decision and Control (CDC), pp. 212–219. IEEE (2013)
50. Zhu, Q., Fung, C., Boutaba, R., Başar, T.: GUIDEX: a game-theoretic incentive-based mechanism for intrusion detection networks. IEEE J. Sel. Areas Commun. 30(11), 2220–2230 (2012)
51. Zhuang, J., Bier, V.M., Alagoz, O.: Modeling secrecy and deception in a multiple-period attacker-defender signaling game. Eur. J. Oper. Res. 203(2), 409–418 (2010)

Deceptive Reinforcement Learning Under Adversarial Manipulations on Cost Signals

Yunhan Huang and Quanyan Zhu

New York University, New York City, NY 10003, USA
{yh.huang,qz494}@nyu.edu

Abstract. This paper studies reinforcement learning (RL) under malicious falsification of cost signals and introduces a quantitative framework of attack models to understand the vulnerabilities of RL. Focusing on Q-learning, we show that Q-learning algorithms converge under stealthy attacks and bounded falsifications on cost signals. We characterize the relation between the falsified cost and the Q-factors, as well as the policy learned by the learning agent, which provides fundamental limits for feasible offensive and defensive moves. We propose a robust region in terms of the cost within which the adversary can never achieve the targeted policy. We provide conditions on the falsified cost which can mislead the agent to learn an adversary's favored policy. A numerical case study of water reservoir control is provided to show the potential hazards of RL in learning-based control systems and corroborate the results.

Keywords: Reinforcement learning · Cybersecurity · Q-learning · Deception and counterdeception · Adversarial learning

1 Introduction

Reinforcement Learning (RL) is a paradigm for making online decisions in uncertain environments. Recent applications of RL algorithms to Cyber-Physical Systems (CPS) enable real-time data-driven control of autonomous systems and improve the system resilience to failures. However, the integration of RL mechanisms also exposes CPS to new vulnerabilities. One type of threat arises from the feedback architecture of the RL algorithms depicted in Fig. 1. An adversary can launch a man-in-the-middle attack to delay, obscure and manipulate the observation data that are needed for making online decisions. This type of adversarial behavior poses a great threat to CPS. For example, self-driving platooning vehicles can collide with each other when their observation data are manipulated [2]. Similarly, drones can be weaponized by terrorists to create chaotic and vicious situations where they are commanded to collide with a crowd or a building.

Q. Zhu—This research is supported in part by NSF under grants ECCS-1847056, CNS-1544782, and SES-1541164, and in part by ARO grant W911NF1910041.


Hence it is imperative to understand the adversarial behaviors of RL and establish a theoretic framework to analyze the impact of the attacks on RL. One key aspect that makes RL security unique is its feedback architecture, which includes components of sensing, control, and actuation as shown in Fig. 1. These components are subject to different types of cyber threats. For example, during the learning process, the agent learns an optimal policy from sequential observations of the environment. An adversary may perturb the environment to deteriorate the learning results; this type of attack is called an environment attack. Agents observe the environment via their sensors, but the sensory observation of the state may be delayed, perturbed, or falsified under malicious attacks, which are usually called sensor attacks. There are also actuator attacks and attacks on reward/cost signals. The latter refers to manipulation of the reward signal produced by the environment in response to the actions applied by an RL agent, which can significantly affect the learning process. Take an RL-based Unmanned Aerial Vehicle (UAV) as an example: if the reward depends on the distance of the UAV to a desired destination measured by GPS coordinates, spoofing of GPS signals by the adversary may result in incorrect reward/cost signals.

Fig. 1. Main components of a RL agent and potential attacks that can be applied to these components.

In this paper, we study RL under malicious manipulation of cost signals from an offensive perspective, where an adversary/attacker maliciously falsifies the cost signals. We first introduce a general formulation of attack models by defining the objectives, information structure and capability of an adversary. We focus our research on a class of Q-learning algorithms and aim to address two fundamental questions. The first one is on the impact of the falsification of cost signals on the convergence of the Q-learning algorithm. The second one is on how the RL algorithm can be misled under malicious falsifications. We show that under stealthy attacks and bounded falsifications on the cost signals, the Q-learning algorithm converges almost surely. If the algorithm converges, we characterize the relationship between the falsified cost and the limit of the Q-factors by an implicit map. We show that the implicit map has several useful properties, including differentiability, Lipschitz continuity, etc., which help to find fundamental limits of adversarial behavior. In particular, from the implicit map, we study how the falsified cost affects the policy that agents learn. We show that the map is uniformly Lipschitz continuous with an explicit Lipschitz constant, and based on this, we characterize a robust region: the adversary can never achieve his desired policy if the falsified cost stays in the robust region. The map is shown to be Fréchet differentiable almost everywhere, and the Fréchet derivative is explicitly characterized and is independent of the falsified cost. The map has a 'piece-wise linear' property on a normed vector space. The derivative and the 'piece-wise linear' property can be utilized by the adversary to drive the Q-factors to a desired region by falsifying cost signals properly. We show that once the falsified cost satisfies a set of inequalities, the RL agent can be misled to learn the policy targeted by the adversary. Further, we give conditions under which the adversary can attain any policy even if the adversary is only capable of falsifying the cost at a subset of the state space. In the end, an example is presented to illustrate potential hazards that might be caused by malicious cost falsification.

The main contributions of our paper can be summarized as follows:

1. We establish a theoretic framework to study strategic manipulations/falsifications of cost signals in RL and present a set of attack models on RL.
2. We provide analytical results to understand how falsification of cost signals can affect Q-factors and hence the policies learned by RL agents.
3. We characterize conditions on deceptively falsified cost signals under which the Q-factors learned by agents produce the policy that adversaries aim for.
4. We use a case study of a water reservoir to illustrate the severe damages that insecure RL can inflict on critical infrastructures and demonstrate the need for defense mechanisms for RL.

1.1 Related Works

Very few works have explicitly studied the security issues of RL [1]. There is a large literature on adversarial machine learning, whose focus is on studying the vulnerability of supervised learning. However, we aim to provide a fundamental understanding of the security risks of RL, which differs from both supervised learning and unsupervised learning [21]. So, there remains a need for a solid theoretic foundation on the security problems of RL so that many critical applications can be safeguarded from potential RL risks. One area relevant to the security of RL is safe RL [10], which aims to ensure that agents learn to behave in compliance with some pre-defined criteria. The security problem, however, is concerned with settings where an adversary intentionally seeks to compromise the normal operation of the system for malicious purposes [1]. Apart from the distinction between RL security and safe RL, RL security also differs from the area of adversarial RL. Adversarial RL is usually studied under multi-agent RL settings, in which agents aim to maximize their returns or minimize their costs in competition with other agents.


There are two recent works that have studied inaccurate cost signals. In [9], Everitt et al. study RL for Markov Decision Processes with corrupted reward channels where, due to sensory errors and software bugs, agents may receive corrupted rewards at certain states. But their focus is not on security perspectives, and they look into unintentional perturbation of cost signals. In [22], Wang et al. have studied Q-learning with perturbed rewards where the rewards received by RL agents are perturbed with certain probability and the rewards take values only on a finite set. They study unintentional cost perturbation from a robustness perspective rather than a security perspective. Compared with the two works mentioned above, our work studies RL with falsified cost signals from a security point of view, and we develop theoretical underpinnings to characterize how the falsified cost will deteriorate the learning result.

The falsification of cost/reward signals can be viewed as one type of deception mechanism. The topic of defensive deception has been surveyed in [17], which includes a taxonomy of deception mechanisms and a review of game-theoretic models. Game- and decision-theoretic models for deception have been studied in various contexts [12,27], including honeypots [16,18], adversarial machine learning [25,26], moving target defense [8,28], and cyber-physical control systems [15,19,20,29]. In this work, we extend the paradigm of cyber deception to reinforcement learning and establish a theoretical foundation for understanding the impact and the fundamental limits of such adversarial behaviors.

1.2 Organization of the Paper

In Sect. 2, we present preliminaries and formulate a general framework that studies several attack models. In Sect. 3, we analyze the Q-learning algorithm under adversarial manipulations on cost. We study under what conditions the Q-learning algorithm converges and where it converges to. In Sect. 4, we present an example to corroborate the theoretical results and their implications in the security problems of RL.

2 Preliminaries and Problem Formulation

2.1 Preliminaries

Consider an RL agent that interacts with an unknown environment and attempts to minimize the total of its received costs. The environment is formalized as a Markov Decision Process (MDP) denoted by (S, A, c, P, β). The MDP {Φ(t) : t ∈ Z} takes values in a finite state space S = {1, 2, . . . , S} and is controlled by a sequence of actions (sometimes called a control sequence) Z = {Z(t) : t ∈ Z} taking values in a finite action space A = {a_1, . . . , a_A}. In our setting, we are interested in stationary policies where the control sequence takes the form Z(t) = w(Φ(t)), where the feedback rule w is a function w : S → A. To emphasize the policy w, we denote Z_w = {Z_w(t) := w(Φ(t)) : t ∈ Z}. According to a transition probability kernel P, the controlled transition probabilities are given by p(i, j, a) for i, j ∈ S, a ∈ A. Commonly, P is unknown to the agent.


Let c : S × A → R be the one-step cost function, and consider the infinite-horizon discounted cost control problem of minimizing over all admissible Z the total discounted cost J(i, Z) = E[∑_{t=0}^∞ β^t c(Φ(t), Z(t)) | Φ(0) = i], where β ∈ (0, 1) is the discount factor. The minimal value function is defined as V(i) = min J(i, Z), where the minimum is taken over all admissible control sequences Z. The value function V satisfies the dynamic programming equations [3], V(i) = min_a [c(i, a) + β ∑_j p(i, j, a) V(j)], i ∈ S, and the optimal control minimizing J is given by the stationary policy defined through the feedback law w* given by w*(i) := arg min_a [c(i, a) + β ∑_j p(i, j, a) V(j)], i ∈ S. If we define Q-values via

Q(i, a) = c(i, a) + β ∑_j p(i, j, a) V(j),  i ∈ S, a ∈ A,

then V(i) = min_a Q(i, a) and the matrix Q satisfies

Q(i, a) = c(i, a) + β ∑_j p(i, j, a) min_b Q(j, b),  i ∈ S, a ∈ A.  (1)

If the matrix Q defined in (1) can be computed, e.g., using value iteration, then the optimal control policy can be found by w*(i) = arg min_a Q(i, a), i ∈ S. When transition probabilities are unknown, we can use a variant of stochastic approximation known as the Q-learning algorithm proposed in [23]. The learning process is defined through the recursion

Q_{n+1}(i, a) = Q_n(i, a) + a(n) [β min_b Q_n(Ψ_{n+1}(i, a), b) + c(i, a) − Q_n(i, a)],  i ∈ S, a ∈ A,  (2)

where Ψ_{n+1}(i, a) is an independently simulated S-valued random variable with law p(i, ·, a).

Notations. An indicator function 1_C is defined as 1_C(x) = 1 if x ∈ C, and 1_C(x) = 0 otherwise. Denote by 1_i ∈ R^S a vector with S components whose ith component is 1 and whose other components are 0. The true cost at time t is denoted by the shorthand notation c_t := c(Φ(t), Z(t)). For a mapping f : R^{S×A} → R^{S×A}, define f_{ia} : R^{S×A} → R such that for any Q ∈ R^{S×A} we have [f(Q)]_{i,a} = f_{ia}(Q), where [f(Q)]_{i,a} is the entry in the ith row and ath column of f(Q). The inverse of f is denoted by f^{-1}. Given a set V ⊂ R^{S×A}, f^{-1}(V) refers to the set {c : f(c) ∈ V}. Denote by B(c; r) := {c̃ : ‖c̃ − c‖ < r} an open ball in a normed vector space with radius r and center c. Here and in later discussions, ‖·‖ refers to the maximum norm. Given c ∈ R^{S×A} and a policy w, denote by c_w ∈ R^S the vector whose ith component is c(i, w(i)) for any i ∈ S. Define c_a ∈ R^S as the vector whose ith component is c(i, a). We define Q_w, Q_a in the same way. For transition probabilities, we define P_w ∈ R^{S×S} by [P_w]_{i,j} = p(i, j, w(i)) and P_{ia} = (p(i, 1, a), p(i, 2, a), . . . , p(i, S, a))^T ∈ R^S. Define P_a ∈ R^{S×S} as the matrix whose components are [P_a]_{i,j} = p(i, j, a).
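To make the recursion (2) concrete, the following is a minimal sketch (ours, not the authors' code) of tabular Q-learning with a simple 1/n stepsize per state-action pair; `sample_next_state` is a hypothetical stand-in for the simulator producing Ψ_{n+1}(i, a) with law p(i, ·, a), and `cost[i, a]` is the cost signal received for the pair (i, a).

```python
import numpy as np

def q_learning(sample_next_state, cost, S, A, beta=0.8, n_iters=200000, seed=0):
    """Tabular Q-learning following recursion (2).

    sample_next_state(i, a) returns a state j drawn with law p(i, ., a);
    cost[i, a] is the one-step cost signal received for the pair (i, a).
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((S, A))
    visits = np.zeros((S, A), dtype=int)        # per-pair counters for a tapering stepsize
    for _ in range(n_iters):
        i, a = rng.integers(S), rng.integers(A)  # asynchronous updates over all pairs
        j = sample_next_state(i, a)
        visits[i, a] += 1
        step = 1.0 / visits[i, a]                # a(n): sum a(n) = inf, sum a(n)^2 < inf
        Q[i, a] += step * (beta * Q[j].min() + cost[i, a] - Q[i, a])
    return Q
```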

2.2 General Attack Models

Under malicious attacks, the RL agent will not be able to observe the true cost feedback from the environment. Instead, the agent is given a cost signal that might be falsified by the attacker. Consider the following MDP with falsified cost (MDP-FC), denoted by (S, A, c, c̃, P, β). In an MDP-FC, at each time t, instead of observing c_t ∈ R directly, the agent only observes a falsified cost signal denoted by c̃_t ∈ R. The remaining aspects of the MDP framework stay the same. Attack models can be specified by three components: the objective of an adversary, the actions available to the adversary, and the information at his disposal. The adversary's task here is to design falsified cost signals c̃ based on his information structure and the actions available to him so that he can achieve certain objectives.

Objective of Adversary: One possible objective of an adversary is to maximize the agent's cost while minimizing the cost of attacks. This type of objective can be captured by a cost function

max_{c̃}  E[ ∑_{t=0}^∞ β^t c(Φ(t), Z_{w(c̃)}(t)) ] − AttackCost(c̃).

Other adversarial objectives would be to drive the MDP to a targeted process or to mislead the agent to learn certain policies the attacker aims for. Let w(c̃) denote the policy learned by the agent under falsified cost signals c̃ and let w† denote the policy that an attacker aims for. We can capture the objective of such a deceptive adversary by

max_{c̃}  1_{{w†}}(w(c̃)) − AttackCost(c̃).  (3)

Here, the second term AttackCost(c̃) serves as a measure for the cost of attacking while the first term indicates whether the agent learns the policy w† or not. We can, for example, define AttackCost(c̃) = ∑_{t=0}^∞ α^t d(c_t, c̃_t), where d(·, ·) is a metric and α is a discount factor. If d is a discrete metric, then ∑_{t=0}^T d(c_t, c̃_t) counts the number of times the cost signal is falsified before time T. Note that here, c̃ represents all the possible ways that an adversary can take to generate falsified signals.

Information: It is important to specify the information structure of an adversary, which determines different classes of attacks an adversary can launch. We can categorize them as follows.

Definition 1.
1. An attacker is called an omniscient attacker if the information the attacker has at time t, denoted by I_t, is defined as I_t = {P, Φ(τ), Z(τ), c : τ ≤ t}.
2. An attacker is called a peer attacker if the attacker only has access to the knowledge of what the agent knows at time t. That means I_t = {Φ(τ), Z(τ), c_τ : τ ≤ t}.


3. An attacker is called an ignorant attacker if at time t he only knows the cost signals before time t, i.e., I_t = {c_τ : τ ≤ t}.
4. An attacker is called a blind attacker if the information the attacker has at time t, denoted by I_t, is defined as I_t = ∅.

Remark 1. There are many other situations in terms of information sets of an attacker that we can consider. In the definition of an omniscient attacker, c represents the true cost at every state-action pair. One should differentiate it from c_τ; the latter means the true cost generated at time τ. That is to say, an omniscient attacker knows the true cost at every state-action pair (i, a) for all t.

Actions Available: Even if an adversary can be omniscient, it does not mean that he can be omnipotent. The actions available to an adversary need to be defined. For example, the attacker may only be able to create bounded perturbations of the true cost signals. In some cases, the action of an adversary may be limited to changing the sign of the cost at certain times, or he may only be able to falsify the cost signals at certain states in a subset S′ ⊂ S. The constraints on the actions available to an attacker can also be captured by the attack cost. The cost for the type of attacks whose actions are constrained to a subset S′ can be captured by the following: AttackCost(c̃) = 0 if c̃_t = c_t := c(Φ(t), Z(t)) for all t with Φ(t) ∈ S \ S′, and AttackCost(c̃) = ∞ otherwise. Moreover, the generation of falsified costs relies heavily on the information an attacker has. If the attacker is a peer attacker or an omniscient attacker, the falsified signal c̃ can be generated through a mapping C : S × A × R → R, i.e., c̃_t = C(Φ(t), Z(t), c_t). If the attacker only knows the state and the cost, c̃ can be generated by a mapping C : S × R → R. If the attacker is ignorant, we have C : R → R, and then c̃_t = C(c_t).

Definition 2 (Stealthy Attacks). If c̃_t takes the same value for the same state-action pair (Φ(t), Z(t)) for all t ∈ Z, i.e., for t ≠ τ, (Φ(t), Z(t)) = (Φ(τ), Z(τ)) implies c̃_t = c̃_τ, then we say that the attacks on the cost signals are stealthy.

The definition states that the cost falsification remains consistent for the same state-action pairs. In later discussions, we focus on stealthy attacks, which is a class of attacks that are hard to detect. Under stealthy attacks, the falsified cost c̃ can be viewed as a falsified cost matrix of dimension S × A. At time t, the cost received by the RL agent is c̃(Φ(t), Z(t)).
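To make the stealthiness requirement of Definition 2 concrete, here is a minimal sketch (ours, not from the paper) of the cost-channel wrapper a stealthy attacker effectively implements: the falsification is a fixed matrix indexed by state-action pairs, so repeated visits to the same pair always receive the same signal.

```python
import numpy as np

class StealthyCostFalsifier:
    """Replaces the true cost c(i, a) with a fixed falsified matrix c_tilde(i, a).

    Because the substitution depends only on the state-action pair and not on time,
    the resulting attack is stealthy in the sense of Definition 2.
    """
    def __init__(self, c_true, c_tilde):
        self.c_true = np.asarray(c_true, dtype=float)
        self.c_tilde = np.asarray(c_tilde, dtype=float)

    def observed_cost(self, state, action):
        # Signal actually delivered to the learning agent.
        return self.c_tilde[state, action]

    def attack_cost(self):
        # One possible AttackCost: max-norm distance between falsified and true costs.
        return np.abs(self.c_tilde - self.c_true).max()
```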

2.3 Q-Learning with Falsified Cost

If the RL agent learns an optimal policy by the Q-learning algorithm given in (2), then under stealthy attacks on the cost, the algorithm can be written as

Q_{n+1}(i, a) = Q_n(i, a) + a(n) [β min_b Q_n(Ψ_{n+1}(i, a), b) + c̃(i, a) − Q_n(i, a)].  (4)

Note that if the attacks are not stealthy, we need to write c̃_n in lieu of c̃(i, a). There are two important questions regarding the Q-learning algorithm with falsified cost (4): (1) Will the sequence of Q_n-factors converge? (2) Where will the sequence of Q_n converge to? We will address these two issues in the next section. Suppose that the sequence Q_n generated by the Q-learning algorithm (4) converges. Let Q̃* be the limit, i.e., Q̃* = lim_{n→∞} Q_n. Suppose the objective of an adversary is to induce the RL agent to learn a particular policy w†. The adversary's problem then is to design c̃ by applying the actions available to him based on the information he has, so that the limit Q-factors learned from the Q-learning algorithm produce the policy targeted by the adversary, w†, i.e., Q̃* ∈ V_{w†}, where

V_w := {Q ∈ R^{S×A} : w(i) = arg min_a Q(i, a), ∀i ∈ S}.

In the next section, we will develop theoretical underpinnings to address the issues regarding the convergence of (4) and the attainability of the adversarial objectives.
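The target set V_{w†} is easy to test membership in; the short sketch below (ours) spells out the check that is used implicitly throughout the next section: a Q matrix lies in V_w exactly when its row-wise argmin equals w.

```python
import numpy as np

def induced_policy(Q):
    """Greedy (cost-minimizing) policy induced by a Q matrix of shape (S, A)."""
    return np.argmin(np.asarray(Q), axis=1)

def in_V(Q, w):
    """Membership test for V_w: the policy induced by Q must equal w at every state."""
    return bool(np.all(induced_policy(Q) == np.asarray(w)))
```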

3 Analysis of Q-Learning with Falsified Cost

3.1 Convergence of Q-Learning Algorithm with Falsified Cost

In the Q-learning algorithm (2), to guarantee almost sure convergence, the agent usually takes a tapering stepsize [4] {a(n)} which satisfies 0 < a(n) ≤ 1, n ≥ 0, ∑_n a(n) = ∞, and ∑_n a(n)^2 < ∞. Suppose in our problem the agent takes a tapering stepsize. To address the convergence issues, we have the following result.

Lemma 1. If an adversary performs stealthy attacks with bounded c̃(i, a) for all i ∈ S, a ∈ A, then the Q-learning algorithm with falsified costs converges almost surely to the fixed point of F̃(Q), where the mapping F̃ : R^{S×A} → R^{S×A} is defined as F̃(Q) = [F̃_{ia}(Q)]_{i,a} with

F̃_{ia}(Q) = β ∑_j p(i, j, a) min_b Q(j, b) + c̃(i, a),

and the fixed point is unique and denoted by Q̃*.


Sketch of Proof. If the adversary performs stealthy attacks, the falsified costs for each state-action pair are consistent during the learning process. The Q-learning process thus can be written as (4). Rewrite (4) as Q_{n+1} = Q_n + a(n)[h̃(Q_n) + M(n + 1)], where h̃(Q) := F̃(Q) − Q and M(n + 1) is given as

M_{ia}(n + 1) = β [min_b Q_n(Ψ_{n+1}(i, a), b) − ∑_j p(i, j, a) min_b Q_n(j, b)],  i ∈ S, a ∈ A.

Note that for any Q_1, Q_2, h̃(Q_1) − h̃(Q_2) and F̃(Q_1) − F̃(Q_2) do not depend on the falsified costs. If the falsified costs are bounded, one can see that h̃(Q) is Lipschitz, and M(n + 1) is a martingale difference sequence. Following the arguments in [4] (Theorem 2, Chap. 2) and Sect. 3.2 of [5], the iterates of (4) converge almost surely to the fixed points of F̃. Since F̃ is a contraction mapping with respect to the max norm with contraction factor β [3] (p. 250), by the Banach fixed point theorem (contraction theorem), F̃ admits a unique fixed point. □
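When the transition kernel is known, the fixed point of F̃ in Lemma 1 can also be computed offline by simply iterating F̃, since it is a β-contraction; a minimal sketch (ours), assuming the array layout P[i, a, j] = p(i, j, a):

```python
import numpy as np

def fixed_point_of_F_tilde(P, c_tilde, beta=0.8, tol=1e-10, max_iter=100000):
    """Unique fixed point of F_tilde obtained by repeated application (value iteration on Q)."""
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(max_iter):
        V = Q.min(axis=1)                      # V(j) = min_b Q(j, b)
        Q_next = c_tilde + beta * (P @ V)      # c_tilde(i, a) + beta * sum_j p(i, j, a) V(j)
        if np.abs(Q_next - Q).max() < tol:
            return Q_next
        Q = Q_next
    return Q
```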

It is not surprising that one of the conditions given in Lemma 1 that guarantees convergence is that the attacker performs stealthy attacks. The convergence can be guaranteed because the falsified cost signals are consistent over time for each state-action pair. The uniqueness of Q̃* comes from the fact that if c̃(i, a) is bounded for every (i, a) ∈ S × A, then F̃ is a contraction mapping and, by Banach's fixed point theorem [13], F̃ admits a unique fixed point. With this lemma, we conclude that an adversary can make the algorithm converge to a limit point by stealthily falsifying the cost signals.

Remark 2. Whether an adversary aims for the convergence of the Q-learning algorithm (4) or not depends on his objective. In our setting, the adversary intends to mislead the RL agent to learn policy w†, indicating that the adversary promotes convergence and aims to have the limit point Q̃* lie in V_{w†}.

3.2 How Is the Limit Point Affected by the Falsified Cost

Now it remains to analyze, from the adversary's perspective, how to falsify the cost signals so that the limit point that algorithm (4) converges to is desired by the adversary. In later discussions, we consider stealthy attacks where the falsified costs are consistent for the same state-action pairs. Denote the true cost by a matrix c ∈ R^{S×A} with [c]_{i,a} = c(i, a); the falsified cost is described by a matrix c̃ ∈ R^{S×A} with [c̃]_{i,a} = c̃(i, a). Given c̃, the fixed point of F̃ is uniquely decided, i.e., the point that algorithm (4) converges to is uniquely determined. Thus, there is a mapping c̃ → Q̃* implicitly described by the relation F̃(Q) = Q. For convenience, this mapping is denoted by f : R^{S×A} → R^{S×A}.

Theorem 1. Let Q̃* denote the Q-factor learned from algorithm (4) with falsified cost signals and Q* be the Q-factor learned from (2) with true cost signals. There exists a constant L < 1 such that

‖Q̃* − Q*‖ ≤ (1/(1 − L)) ‖c̃ − c‖,  (5)

and L = β, where the discount factor β has been defined in the MDP-FC problem.

Proof. Define F(Q) by F_{ia}(Q) = β ∑_j p(i, j, a) min_b Q(j, b) + c(i, a). From Lemma 1, we know that Q̃* and Q* satisfy Q̃* = F̃(Q̃*) and Q* = F(Q*). We have Q̃* − Q* = F̃(Q̃*) − F(Q*). Since F̃ and F are both contraction mappings, by the triangle inequality, we have ‖Q̃* − Q*‖ ≤ L‖Q̃* − Q*‖ + ‖c̃ − c‖. Thus, we have (5). And the contraction factor L for F̃ and F is β. □

Remark 3. In fact, taking this argument slightly further, one can conclude that falsifying the cost c with a tiny perturbation does not cause significant changes in the limit point of algorithm (2), Q*. This indicates that an adversary cannot cause a significant change in the limit Q-factor by just a small perturbation in the cost signals. This is a feature known as stability that is observed in problems that possess contraction mapping properties. Also, Theorem 1 indicates that the mapping c̃ → Q̃* is continuous, and to be more specific, it is uniformly Lipschitz continuous with Lipschitz constant 1/(1 − β).

With Theorem 1, we can now characterize the minimum level of falsification an adversary needs to change the policy from the true optimal policy w* to the policy w† that the adversary aims for. First, note that V_w ⊂ R^{S×A} and it can also be written as

V_w = {Q ∈ R^{S×A} : Q(i, w(i)) < Q(i, a), ∀i ∈ S, ∀a ≠ w(i)}.  (6)

We can easily see that for any given policy w, V_w is a convex set, hence connected. This is because for any λ ∈ [0, 1], if Q_1, Q_2 ∈ V_w, then λQ_1 + (1 − λ)Q_2 ∈ V_w. Second, for any two different policies w_1 and w_2, V_{w_1} ∩ V_{w_2} = ∅. Define the infimum distance between the true optimal policy w* and the adversary's desired policy w† in terms of the Q-values by

D(w*, w†) := inf_{Q_1 ∈ V_{w*}, Q_2 ∈ V_{w†}} ‖Q_1 − Q_2‖,

which is also the definition of the distance between the two sets V_{w*} and V_{w†}. Note that for w* ≠ w† (otherwise, the optimal policy w* is what the adversary desires and there is no incentive for the adversary to attack), D(w*, w†) is always zero according to the definition of the set (6). This counterintuitive result states that a small change in the Q-values may result in any possible change of the policy learned by the agent from the Q-learning algorithm (4). Compared with Theorem 1, which is a negative result for the adversary, this result is in favor of the adversary. Similarly, define the point-to-set distance from Q* to V_{w†} by D_{Q*}(w†) := inf_{Q ∈ V_{w†}} ‖Q − Q*‖. Thus, if Q̃* ∈ V_{w†}, we have

0 = D(w*, w†) ≤ D_{Q*}(w†) ≤ ‖Q̃* − Q*‖ ≤ (1/(1 − β)) ‖c̃ − c‖,  (7)

where the first inequality comes from the fact that Q* ∈ V_{w*} and the second inequality is due to Q̃* ∈ V_{w†}.


The robust region for the true cost c to the adversary's targeted policy w† is given by B(c; (1 − β)D_{Q*}(w†)), which is an open ball with center c and radius (1 − β)D_{Q*}(w†). That means the attacks on the cost need to be 'powerful' enough to drive the falsified cost c̃ outside the ball B(c; (1 − β)D_{Q*}(w†)) to make the RL agent learn the policy w†. If the falsified cost c̃ is within the ball, the RL agent can never learn the adversary's targeted policy w†. The ball B(c; (1 − β)D_{Q*}(w†)) depends only on the true cost c and the adversary's desired policy w† (once the MDP is given, Q* is uniquely determined by c). Thus, we refer to this ball as the robust region of the true cost c to the adversarial policy w†. As we have mentioned in Sect. 2.2, if the actions available to the adversary only allow him to perform bounded falsification of cost signals and the bound is smaller than the radius of the robust region, then the adversary can never mislead the agent to learn policy w†.

Remark 4. First, in the discussions above, the adversary's policy w† can be any possible policy and the discussion remains valid for any choice. Second, the set V_w of Q-values is not just a convex set but also an open set. We thus can see that D_{Q*}(w†) > 0 for any w† ≠ w*, and the second inequality in (7) can be replaced by a strict inequality. Third, the agent can estimate his own robustness to falsification if he knows the adversary's desired policy w†. For an omniscient attacker, or attackers who have access to true cost signals, the attacker can compute the robust region of the true cost to his desired policy w† to evaluate whether the objective is feasible or not. When it is not feasible, the attacker can consider changing his objectives, e.g., selecting other favored policies that have a smaller robust region.

We have discussed how falsification affects the change of the Q-factors learned by the agent in a distance sense. The problem now is to study how to falsify the true cost in the right direction so that the resulting Q-factors fall into the region favored by the adversary. One difficulty in analyzing this problem comes from the fact that the mapping c̃ → Q̃* is not explicitly known. The relation between c̃ and Q̃* is governed by the Q-learning algorithm (4). Another difficulty is that, since both c̃ and Q̃* lie in the space R^{S×A}, we need to resort to the Fréchet derivative or the Gâteaux derivative [7] (if they exist) to characterize how a small change of c̃ results in a change in Q̃*.

From Lemma 1 and Theorem 1, we know that the Q-learning algorithm converges to the unique fixed point of F̃ and that f : c̃ → Q̃* is uniformly Lipschitz continuous. Also, it is easy to see that the inverse of f, denoted by f^{-1}, exists: given Q̃*, c̃ is uniquely decided by the relation F̃(Q) = Q. Furthermore, by the relation F̃(Q) = Q, we know f is both injective and surjective and hence a bijection, which can be shown simply by arguing that given different c̃, the solution of F̃(Q) = Q must be different. This fact tells us that there is a one-to-one, onto correspondence between c̃ and Q̃*. One should note that the mapping f : R^{S×A} → R^{S×A} is not uniformly Fréchet differentiable on R^{S×A} due to the min operator inside the relation F̃(Q) = Q. However, for any policy w, f is Fréchet differentiable on f^{-1}(V_w), which is an open and connected set due to the fact that V_w is open and connected and f is continuous.
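The uniform Lipschitz property in (5) is straightforward to check numerically by sampling random falsifications, which is essentially what Fig. 3 in Sect. 4 does; a self-contained sketch (ours), again assuming the array layout P[i, a, j] = p(i, j, a):

```python
import numpy as np

def limit_Q(P, cost, beta):
    """Fixed point of the Bellman-type map for a given cost matrix (value iteration on Q)."""
    Q = np.zeros(cost.shape)
    for _ in range(2000):
        Q = cost + beta * (P @ Q.min(axis=1))
    return Q

def check_bound_5(P, c, beta=0.8, trials=100, seed=0):
    """Verify ||Q_tilde* - Q*|| <= ||c_tilde - c|| / (1 - beta) for random falsifications."""
    rng = np.random.default_rng(seed)
    Q_star = limit_Q(P, c, beta)
    for _ in range(trials):
        h = rng.uniform(0.0, 10.0, size=c.shape)              # random bounded falsification
        gap = np.abs(limit_Q(P, c + h, beta) - Q_star).max()
        assert gap <= np.abs(h).max() / (1.0 - beta) + 1e-6
    return True
```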


Proposition 1. The map f : R^{S×A} → R^{S×A} is Fréchet differentiable on f^{-1}(V_w) for any policy w, and the Fréchet derivative of f at any point c̃ ∈ f^{-1}(V_w), denoted by f′(c̃), is a bounded linear map G : R^{S×A} → R^{S×A} that does not depend on c̃, with Gh given by

[Gh]_{i,a} = β P_{ia}^T (I − βP_w)^{-1} h_w + h(i, a)  (8)

for every i ∈ S, a ∈ A.

Proof. Suppose c ∈ f^{-1}(V_w) and c̃ = c + h ∈ f^{-1}(V_w). By definition, Q*, Q̃* ∈ V_w. By Lemma 1, we have Q* = F(Q*) and Q̃* = F̃(Q̃*), which means

Q̃*(i, a) = β P_{ia}^T Q̃*_w + c̃(i, a) = β P_{ia}^T Q̃*_w + c(i, a) + h(i, a),
Q*(i, a) = β P_{ia}^T Q*_w + c(i, a),  ∀i ∈ S, a ∈ A.  (9)

From (9), we have Q*_w = βP_w Q*_w + c_w. Thus, Q*_w = (I − βP_w)^{-1} c_w. Similarly, Q̃*_w = (I − βP_w)^{-1}(c_w + h_w), where (I − βP_w) is invertible due to the fact that β < 1 and P_w is a stochastic matrix. Thus, Q̃*_w = Q*_w + (I − βP_w)^{-1} h_w. Substituting this into the first equation of (9), one has

Q̃*(i, a) = β P_{ia}^T (Q*_w + (I − βP_w)^{-1} h_w) + c(i, a) + h(i, a) = Q*(i, a) + β P_{ia}^T (I − βP_w)^{-1} h_w + h(i, a).

Then one can see that ‖f(c + h) − f(c) − Gh‖/‖h‖ → 0 as ‖h‖ → 0. □

From Proposition 1, we can see that f is Fréchet differentiable on f^{-1}(V_w) and the derivative is constant, i.e., f′(c̃) = G for any c̃ ∈ f^{-1}(V_w). Note that G lies in the space of all linear mappings that map R^{S×A} to itself, and G is determined only by the discount factor β and the transition kernel P of the MDP problem. The region where differentiability may fail is f^{-1}(R^{S×A} \ (∪_w V_w)), where R^{S×A} \ (∪_w V_w) is the set {Q : ∃i, ∃a ≠ a′, Q(i, a) = Q(i, a′) = min_b Q(i, b)}. This set contains the places where a change of policy happens, i.e., Q(i, a) and Q(i, a′) are both the lowest value in the ith row of Q. Also, since f is Lipschitz, by Rademacher's theorem, f is differentiable almost everywhere (w.r.t. the Lebesgue measure).

Remark 5. One can view f as a 'piece-wise linear function' in the normed vector space R^{S×A} instead of on a real line. Actually, if the adversary can only falsify the cost at one state-action pair, say (i, a), while costs at other pairs are fixed, then for every j ∈ S, b ∈ A, the function c̃(i, a) → [Q̃*]_{j,b} is a piece-wise linear function.

Given any c ∈ f^{-1}(V_w), if an adversary falsifies the cost c by injecting a value h, i.e., c̃ = c + h, the adversary can see how the falsification causes a change in the Q-values. To be more specific, if Q* is the Q-value matrix learned from cost c by the Q-learning algorithm (2), then after the falsification c̃, the Q-values learned from the Q-learning algorithm (4) become Q̃* = Q* + Gh if c̃ ∈ f^{-1}(V_w). Then, an omniscient adversary can utilize (8) to find a falsification h such that Q̃* can be driven to approach a desired set V_{w†}, bearing in mind that D(w, w†) = 0 for any two policies w, w†.


One difficulty is to see whether c̃ ∈ f^{-1}(V_w), because the set f^{-1}(V_w) is now implicit. Thus, we resort to the following theorem.

Theorem 2. Let Q̃* ∈ R^{S×A} be the Q-values learned from the Q-learning algorithm (4) with the falsified cost c̃ ∈ R^{S×A}. Then Q̃* ∈ V_{w†} if and only if the falsified cost signals c̃ designed by the adversary satisfy the following conditions:

c̃(i, a) > (1_i − βP_{ia})^T (I − βP_{w†})^{-1} c̃_{w†}  (10)

for all i ∈ S, a ∈ A \ {w†(i)}.

Sketch of Proof. If Q̃* ∈ V_{w†}, then from the proof of Proposition 1, we know Q̃*_{w†} = (I − βP_{w†})^{-1} c̃_{w†}, and the ith component of Q̃*_{w†} is strictly less than Q̃*(i, a) for each a ∈ A \ {w†(i)}. That means Q̃*(i, a) > 1_i^T Q̃*_{w†}, which gives us (10). Conversely, if c̃ satisfies conditions (10), then Q̃* ∈ V_{w†} due to the one-to-one, onto correspondence between c̃ and Q̃*. □

With the results in Theorem 2, we can characterize the set f^{-1}(V_w): elements of f^{-1}(V_w) have to satisfy the conditions given in (10). Also, Theorem 2 indicates that if an adversary intends to mislead the agent to learn policy w†, the falsified cost c̃ has to satisfy the conditions specified in (10). Note that for a = w†(i), c̃(i, w†(i)) ≡ (1_i − βP_{iw†(i)})^T (I − βP_{w†})^{-1} c̃_{w†}. Suppose the objective of an omniscient attacker is to induce the agent to learn policy w† while minimizing his own cost of attacking, i.e., the attacker's problem we have formulated in (3) in Sect. 2.2. Given AttackCost(c̃) = ‖c̃ − c‖, where c is the true cost, the attacker's problem is to solve the following minimization problem:

min_{c̃ ∈ R^{S×A}} ‖c̃ − c‖  s.t. (10).  (11)

Remark 6. If the norm in the attacker's problem (11) is a Frobenius norm, the attacker's problem is a convex minimization problem which can be easily solved by omniscient attackers using software packages like MOSEK [14], CVX [11], etc. If AttackCost(c̃) is the number of state-action pairs where the cost has been falsified, i.e., AttackCost(c̃) = ∑_i ∑_a 1_{{c(i,a) ≠ c̃(i,a)}}, then the attacker's problem becomes a combinatorial optimization problem [24].

Remark 7. If the actions available to an adversary only allow the adversary to falsify the true cost at certain states S′ ⊂ S (or/and at certain actions A′ ⊂ A), then the adversary's problem (11) becomes

min_{c̃ ∈ R^{S×A}} ‖c̃ − c‖  s.t. (10),  c̃(i, a) = c(i, a) ∀i ∈ S \ S′, a ∈ A \ A′.

However, if an adversary can only falsify at certain states S′, the adversary may not be able to manipulate the agent to learn w†.
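For a known transition kernel, condition (10) can be checked directly; the sketch below (ours) assumes the layout P[i, a, j] = p(i, j, a) and a target policy given as an array w_dagger of action indices.

```python
import numpy as np

def satisfies_condition_10(c_tilde, P, w_dagger, beta):
    """Check the inequalities (10) for a candidate falsified cost matrix c_tilde."""
    c_tilde = np.asarray(c_tilde, dtype=float)
    S, A, _ = P.shape
    P_w = np.array([P[i, w_dagger[i]] for i in range(S)])      # rows p(i, ., w_dagger(i))
    c_w = np.array([c_tilde[i, w_dagger[i]] for i in range(S)])
    base = np.linalg.solve(np.eye(S) - beta * P_w, c_w)        # (I - beta P_w)^{-1} c_w
    for i in range(S):
        for a in range(A):
            if a == w_dagger[i]:
                continue
            rhs = base[i] - beta * P[i, a] @ base              # (1_i - beta P_ia)^T (I - beta P_w)^{-1} c_w
            if not c_tilde[i, a] > rhs:
                return False
    return True
```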


Without loss of generality, suppose that the adversary can only falsify the cost at a subset of states S′ = {1, 2, . . . , S′}. We rewrite the conditions given in (10) in a more compact form:

c̃_a ≥ (I − βP_a)(I − βP_{w†})^{-1} c̃_{w†},  ∀ a ∈ A,  (12)

where the equality only holds for one component of the vector, i.e., the ith component satisfying w†(i) = a. Partition the vectors c̃_a and c̃_{w†} in (12) into two parts: the part where the adversary can falsify the cost, denoted by c̃_a^fal, c̃_{w†}^fal ∈ R^{S′}, and the part where the adversary cannot falsify, c_a^true, c_{w†}^true ∈ R^{S−S′}. Then (12) reads

[c̃_a^fal; c_a^true] ≥ [R_a  Y_a; M_a  N_a] [c̃_{w†}^fal; c_{w†}^true],  ∀ a ∈ A,  (13)

where

[R_a  Y_a; M_a  N_a] := (I − βP_a)(I − βP_{w†})^{-1},  ∀ a ∈ A,

and R_a ∈ R^{S′×S′}, Y_a ∈ R^{S′×(S−S′)}, M_a ∈ R^{(S−S′)×S′}, N_a ∈ R^{(S−S′)×(S−S′)}. Note that the ith component of c̃_{w†}^fal is equal to c̃(i, w†(i)). If the adversary aims to mislead the agent to learn w†, the adversary needs to design c̃_a^fal, a ∈ A, such that the conditions in (13) hold. Whether the conditions in (13) are easy for an adversary to achieve or not depends on the true costs c_a^true, a ∈ A. The following results state that, under some conditions on the transition probability, no matter what the true costs are, the adversary can find proper c̃_a^fal, a ∈ A, such that conditions (13) are satisfied. For i ∈ S \ S′, if w†(i) = a, we remove the row of M_a that corresponds to the state i ∈ S \ S′. Denote the new matrix after the row removals by M̄_a.

Theorem 3. Define H := [M̄_{a_1}^T M̄_{a_2}^T · · · M̄_{a_A}^T]^T ∈ R^{(A(S−S′)−S′)×S′}. If there exists x ∈ R^{S′} such that Hx < 0, i.e., the column space of H intersects the negative orthant of R^{A(S−S′)−S′}, then for any true cost, the adversary can find c̃_a^fal, a ∈ A, such that conditions (13) hold.

Proof. We can rewrite (13) as c̃_a^fal ≥ R_a c̃_{w†}^fal + Y_a c_{w†}^true and c_a^true ≥ M_a c̃_{w†}^fal + N_a c_{w†}^true for all a ∈ A. If there exists c̃_{w†}^fal such that M_a c̃_{w†}^fal can be less than any given vector in R^{S−S′}, then c_a^true ≥ M_a c̃_{w†}^fal + N_a c_{w†}^true can be satisfied no matter what the true cost is. We need c_a^true ≥ M_a c̃_{w†}^fal + N_a c_{w†}^true to hold for all a ∈ A, which means that we need the range space of [M_{a_1}^T, . . . , M_{a_A}^T]^T ∈ R^{A(S−S′)×S′} to intersect the negative orthant. By using the fact that c̃(i, w†(i)) ≡ (1_i − βP_{iw†(i)})^T (I − βP_{w†})^{-1} c̃_{w†}, we can give less stringent conditions. Actually, we only need the range space of H = [M̄_{a_1}^T, . . . , M̄_{a_A}^T]^T ∈ R^{(A(S−S′)−S′)×S′} to intersect the negative orthant. If this is true, then there exists c̃_{w†}^fal such that c_a^true ≥ M_a c̃_{w†}^fal + N_a c_{w†}^true is feasible for all a ∈ A.


As for the conditions c̃_a^fal ≥ R_a c̃_{w†}^fal + Y_a c_{w†}^true, note that there are S′ × A variables c̃_a^fal, a ∈ A, and that c̃_{w†}^fal has been chosen such that the conditions c_a^true ≥ M_a c̃_{w†}^fal + N_a c_{w†}^true are satisfied. One can choose the remaining variables in c̃_a^fal, a ∈ A, sufficiently large to satisfy c̃_a^fal ≥ R_a c̃_{w†}^fal + Y_a c_{w†}^true, due to the fact that c̃(i, w†(i)) is equivalent to (1_i − βP_{iw†(i)})^T (I − βP_{w†})^{-1} c̃_{w†}. □

Note that H only depends on the transition probability and the discount factor. If an omniscient adversary can only falsify cost signals at the states denoted by S′, the adversary can check whether the range space of H intersects the negative orthant of R^{A(S−S′)} or not. If it does, the adversary can mislead the agent to learn w† by falsifying costs at a subset of the state space no matter what the true cost is.

Remark 8. To check whether the condition on H is true or not, one has to resort to Gordan's theorem [6]: either Hx < 0 has a solution x, or H^T y = 0 has a nonzero solution y with y ≥ 0. The adversary can use linear/convex programming software to check if this is the case. For example, by solving

min_{y ∈ R^{A(S−S′)}} ‖H^T y‖  s.t. ‖y‖ = 1, y ≥ 0,  (14)

the adversary knows whether the condition about H given in Theorem 3 is true or not. If the minimum of (14) is 0, the adversary cannot guarantee that, for any given true cost, the agent learns the policy w†. If the minimum of (14) is positive, there exists x such that Hx < 0. The adversary can select c̃_{w†}^fal = λx and choose a sufficiently large λ to make sure that conditions (13) hold, which means an adversary can make the agent learn the policy w† by falsifying costs at a subset of the state space no matter what the true costs are.
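In practice, the existence of x with Hx < 0 can also be checked without solving (14) directly: by scaling, Hx < 0 is feasible exactly when Hx ≤ −1 is feasible, which is a plain linear feasibility problem. A minimal sketch (ours) using SciPy:

```python
import numpy as np
from scipy.optimize import linprog

def find_direction(H):
    """Return some x with Hx < 0 if it exists (condition of Theorem 3), else None.

    By Gordan's theorem, if no such x exists then H^T y = 0 has a nonzero solution y >= 0.
    """
    H = np.atleast_2d(np.asarray(H, dtype=float))
    m, n = H.shape
    res = linprog(c=np.zeros(n), A_ub=H, b_ub=-np.ones(m), bounds=[(None, None)] * n)
    return res.x if res.success else None

# Example from Sect. 4: H = [-0.5905, -0.4762]; any sufficiently positive x works.
print(find_direction([[-0.5905, -0.4762]]))
```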

4 Numerical Example

In this section, we use the application of RL to water reservoir operations to illustrate the security issues of RL. Consider an RL agent aiming to create the best operation policies for the hydroelectric reservoir system described in Fig. 2. The system consists of the following: (1) an inflow conduit regulated by Val0, which can either be a river or a spillway from another dam; and (2) two spillways for outflow: the first penstock, Val1, which is connected to the turbine and thus generates electricity, and the second penstock, Val2, allowing direct water evacuation without electricity generation. We consider three reservoir levels: MinOperL, MedOperL, MaxExtL. Weather conditions and the operation of the valves are key factors that affect the reservoir level. In practice, there are usually interconnected hydroelectric reservoir systems located at different places, which makes it difficult to find an optimal operational policy. For illustrative purposes, we only consider controlling Val1. Thus, we have two actions: either a1, Val1 = 'shut down'; or a2, Val1 = 'open'. Hence A = {a1, a2}. We consider three states which represent three different reservoir levels, denoted by S = {1, 2, 3}, where 1 (2, 3) represents MaxExtL (MedOperL, MinOperL, respectively).


Fig. 2. A hydroelectric reservoir system.

The goal of the operators is to generate more electricity to increase economic benefits, which requires the reservoir to store a sufficient amount of water. Meanwhile, the operator also aims to avoid possible overflows, which can be caused by unexpected heavy rain in the reservoir area or in upper areas. The operator needs to learn a safe policy, i.e., the valve needs to be open at state 1, so the cost c(1, a1) needs to be high. We assume that the uncertain and intermittent nature is captured by the transition probabilities

P_{a1} = [1 0 0; 0.6 0.4 0; 0.1 0.5 0.4],  P_{a2} = [0.3 0.7 0; 0.1 0.2 0.7; 0 0 1].

The true cost is assumed to be c = [30 −5; 6 −10; 0 0]. Negative cost can be interpreted as the reward for hydroelectric production. Let the discount factor β be 0.8. The limit Q-values learned from the Q-learning algorithm (2) are approximately Q* = [8.71 −26.61; −15.48 −27.19; −19.12 −15.30]. The optimal policy thus is w*(1) = a2, w*(2) = a2, w*(3) = a1. Basically, the optimal policy indicates that one should keep the valve open to avoid overflowing and generate more electricity at MaxExtL, while at MinOperL one should keep the valve closed to store more water for water supply and power generation purposes. From (5), we know that the resulting change in Q* under malicious falsification is bounded by the change in the cost with Lipschitz constant 1/(1 − β). To see this, we randomly generate 100 falsifications h ∈ R^{3×2} using randi(10) * rand(3,2) in Matlab. For each falsified cost c̃ = c + h, we obtain the corresponding Q-factors Q̃*. We plot ‖Q̃* − Q*‖ against ‖c̃ − c‖ for each falsification in Fig. 3. One can clearly see the bound given in (5). The result in Fig. 3 corroborates Theorem 1. Suppose that the adversary aims to mislead the agent to learn a policy w† where w†(1) = a1, w†(2) = a2, w†(3) = a1. The purpose is to keep the valve shut down at MaxExtL, which will cause overflow and hence devastating consequences.
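The baseline numbers above can be reproduced offline with a few lines; the sketch below (ours) replaces the sample-based recursion (2) with Q-value iteration on the known model, which has the same limit.

```python
import numpy as np

# Transition kernel P[i, a, j] = p(i, j, a) and true cost for the reservoir example.
P = np.zeros((3, 2, 3))
P[:, 0, :] = [[1.0, 0.0, 0.0], [0.6, 0.4, 0.0], [0.1, 0.5, 0.4]]   # a1: valve shut
P[:, 1, :] = [[0.3, 0.7, 0.0], [0.1, 0.2, 0.7], [0.0, 0.0, 1.0]]   # a2: valve open
c = np.array([[30.0, -5.0], [6.0, -10.0], [0.0, 0.0]])
beta = 0.8

Q = np.zeros((3, 2))
for _ in range(500):                # Q-value iteration: Q <- c + beta * P V with V = min_a Q
    Q = c + beta * (P @ Q.min(axis=1))
print(np.round(Q, 2))               # reproduces Q* above (up to rounding)
print(np.argmin(Q, axis=1))         # [1 1 0], i.e. the optimal policy (a2, a2, a1)
```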


Fig. 3. ‖Q̃* − Q*‖ versus ‖c̃ − c‖ with 100 falsifications.

The adversary can utilize D_{Q*}(w†) to see how much, at the least, he has to falsify the original cost c to achieve the desired policy w†. The value of D_{Q*}(w†) can be obtained by solving the following optimization problem:

min_{Q ∈ R^{3×2}} ‖Q − Q*‖  s.t.  Q(1, a1) ≤ Q(1, a2), Q(2, a2) ≤ Q(2, a1), Q(3, a1) ≤ Q(3, a2).
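One way to evaluate this quantity without a general-purpose solver is to note that under the max norm the problem decomposes across states: at each state i the cheapest feasible change closes half of the gap between Q*(i, w†(i)) and the smallest competing entry. This decomposition is our own observation, used here only as a computational shortcut; a sketch:

```python
import numpy as np

def robust_radius(Q_star, w_target, beta):
    """(1 - beta) * D_{Q*}(w_target): radius of the robust region around the true cost."""
    Q_star = np.asarray(Q_star, dtype=float)
    per_state = []
    for i, a_dag in enumerate(w_target):
        competitors = np.delete(Q_star[i], a_dag)         # Q*(i, a) for a != w_target(i)
        per_state.append(max(0.0, Q_star[i, a_dag] - competitors.min()) / 2.0)
    return (1.0 - beta) * max(per_state)                  # max over states gives D_{Q*}(w_target)

Q_star = [[8.71, -26.61], [-15.48, -27.19], [-19.12, -15.30]]
print(robust_radius(Q_star, w_target=[0, 1, 0], beta=0.8))   # approx. 3.53
```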

The value of D_{Q*}(w†) is thus 17.66. By (5), we know that to achieve w†, the adversary has to falsify the cost such that ‖c̃ − c‖ ≥ (1 − β)D_{Q*}(w†) = 3.532. If the actions available to the adversary are to perform only bounded falsification of one state-action pair with bound 3.5, then it is impossible for the adversary to attain his goal, i.e., misleading the agent to the policy w† targeted by the adversary. Thus, in this MDP-FC, the radius of the robust region of c to the adversary's desired policy w† is 3.532. In Fig. 4, we plot the change of the limit Q-values when only the cost at one state-action pair is falsified while the other components are fixed at c = [9 −5; 6 −10; 0 0]. We can see that when the other costs are fixed, for every j ∈ {1, 2, 3}, b ∈ {a1, a2}, the function c̃(i, a) → [Q̃*]_{j,b} is piece-wise linear, and the change of slope happens only when the policy changes. This illustrates our argument about the differentiability of the mapping c̃ → Q̃* in Proposition 1. From the first two plots, one can see that changes in the cost at one state can change the policy at another state. That is, by altering the cost at MedOperL, an adversary can make the valve open at MinOperL so that the reservoir cannot store enough water to maintain the water supply and generate electricity. When an adversary aims to manipulate the policy at one state, he does not have to alter the cost at this state. Figure 5 illustrates Proposition 1 when the costs corresponding to two state-action pairs are altered. Furthermore, to illustrate Proposition 1 in general cases, i.e., in R^{3×2}, suppose c = [9 −5; 6 −10; 0 0]; the Q-factors learned from c are Q* = [−12.29 −26.61; −15.47 −27.19; −19.12 −15.30].


Fig. 4. The change of the limit Q-values when only the cost at one state-action pair is altered. Black line corresponds to state 1, red line corresponds to state 2 and green line corresponds to state 3. Solid (dash) line corresponds to a1 (a2 ). (Color figure online)

The optimal policy is thus w*(1) = a2, w*(2) = a2, w*(3) = a1. By (8) in Proposition 1, the derivative of f : R^{3×2} at c ∈ f^{-1}(V_{w*}) is a bounded linear map G : R^{3×2} → R^{3×2} with

[Gh]_{i,a} = 0.8 P_{ia}^T ( I − 0.8 P_{w*} )^{-1} h_{w*} + h(i, a),  (15)

where I is the 3×3 identity matrix, P_{w*} = [0.3 0.7 0; 0.1 0.2 0.7; 0.1 0.5 0.4] and h_{w*} = (h(1, a2), h(2, a2), h(3, a1))^T. One can see that G is a constant independent of c. Suppose that the adversary falsifies the cost from c to c̃ by h, i.e., c̃ = c + h, with h = [0.6 −0.2; 1 2; 0.4 0.7]. Then Gh = [3.74 3.92; 4.70 5.68; 4.39 4.21] by (15). Thus, c̃ = c + h = [9.6 −5.2; 7 −8; 0.4 0.7]. The Q-factors learned from c̃ are Q̃* = [−8.55 −22.69; −10.77 −21.51; −14.73 −11.08]. The resulting policy is still w*. One thus can see that Q̃* = Q* + Gh. If an adversary aims to have the hydroelectric reservoir system operate based on a policy w†, the falsified cost c̃ has to satisfy the conditions given in (10). Let the targeted policy of the adversary be w†(1) = a1, w†(2) = a2, w†(3) = a2. If the adversary can deceptively falsify the cost at every state-action pair to any value, it is not difficult to find c̃ satisfying (10). For example, the adversary can first select c̃_{w†} = [c̃(1, a1), c̃(2, a2), c̃(3, a2)]^T, e.g., c̃_{w†} = [3 2 1]^T, and then select the cost at other state-action pairs following c̃(i, a) = (1_i − βP_{ia})^T (I − βP_{w†})^{-1} c̃_{w†} + ξ for i ∈ S, a ∈ A \ {w†(i)}, where ξ > 0. Then c̃ satisfies conditions (10). For example, if the adversary chooses ξ = 1, the adversary will have c̃ = [3 10.86; −1.34 2; 0.34 1]. The Q-factors learned from c̃ are Q̃* = [15 18.46; 8.15 7.14; 5.99 5]. Thus, the resulting policy is the adversary's desired policy w†. Hence, we say that if the adversary can deceptively falsify the cost at every state-action pair to any value, the adversary can make the RL agent learn any policy.

Fig. 5. The alteration of the limit Q-values (shown for Q̃*(1, 2)) when only the costs c̃(2, 1) and c̃(1, 1) are altered.

If an adversary can only deceptively falsify the cost at states S′, we have to resort to Theorem 3 to see what he can achieve. Suppose S′ = {1, 2} and the adversary desires the policy w†(1) = a1, w†(2) = a2, w†(3) = a2. Given S′ and w†, (13) can be written as

[c̃(1, a1); c̃(2, a1); c(3, a1)] ≥ [1.0000 0 0; −2.0762 0.8095 2.2667; −0.5905 −0.4762 2.0667] [c̃(1, a1); c̃(2, a2); c(3, a2)],
[c̃(1, a2); c̃(2, a2); c(3, a2)] ≥ [3.5333 −0.6667 −1.8667; 0 1.0000 0; 0 0 1.0000] [c̃(1, a1); c̃(2, a2); c(3, a2)].  (16)

Note that the last row in the second inequality is automatically satisfied. Thus, we have H = [−0.5905 −0.4762], whose range space is R, which intersects (−∞, 0). Thus, no matter what values c(3, a1) and c(3, a2) are, the adversary can always find c̃(1, a1), c̃(2, a2) such that

c(3, a1) > M_{a1} [c̃(1, a1); c̃(2, a2)] + 2.0667 × c(3, a2).

Next, choose c̃(2, a1) and c̃(1, a2) by

c̃(2, a1) > [−2.0762 0.8095 2.2667] [c̃(1, a1); c̃(2, a2); c(3, a2)],
c̃(1, a2) > [3.5333 −0.6667 −1.8667] [c̃(1, a1); c̃(2, a2); c(3, a2)].


We hence can see that no matter what the true cost is, the adversary can make the RL agent learn w† by falsifying only the cost at states S′ = {1, 2}. It can also be easily seen that when the adversary can only falsify the cost at the single state S′ = {1}, he can still make the RL agent learn the policy w† independent of the true cost.
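As a closing sanity check on the constructions in this section, the sketch below (ours) builds c̃ from the recipe c̃(i, a) = (1_i − βP_{ia})^T(I − βP_{w†})^{-1} c̃_{w†} + ξ with c̃_{w†} = (3, 2, 1)^T and ξ = 1, recomputes the limiting Q-factors, and confirms that the induced policy is the adversary's target w† = (a1, a2, a2).

```python
import numpy as np

# Reservoir transition kernel P[i, a, j] = p(i, j, a) from this section.
P = np.zeros((3, 2, 3))
P[:, 0, :] = [[1.0, 0.0, 0.0], [0.6, 0.4, 0.0], [0.1, 0.5, 0.4]]   # a1: valve shut
P[:, 1, :] = [[0.3, 0.7, 0.0], [0.1, 0.2, 0.7], [0.0, 0.0, 1.0]]   # a2: valve open
beta, w_dagger, xi = 0.8, [0, 1, 1], 1.0                            # target policy (a1, a2, a2)

# (I - beta P_{w_dagger})^{-1} c_{w_dagger} for the chosen c_{w_dagger} = (3, 2, 1)^T.
P_w = np.array([P[i, w_dagger[i]] for i in range(3)])
base = np.linalg.solve(np.eye(3) - beta * P_w, np.array([3.0, 2.0, 1.0]))

# Build c_tilde from the recipe; at the target pairs the formula returns the chosen values exactly.
c_tilde = np.zeros((3, 2))
for i in range(3):
    for a in range(2):
        rhs = base[i] - beta * P[i, a] @ base
        c_tilde[i, a] = rhs if a == w_dagger[i] else rhs + xi

# Limiting Q-factors under the falsified cost and the policy they induce.
Q = np.zeros((3, 2))
for _ in range(500):
    Q = c_tilde + beta * (P @ Q.min(axis=1))
print(np.argmin(Q, axis=1))   # [0 1 1], i.e. the adversary's target policy (a1, a2, a2)
```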

5 Conclusion and Future Work

In this paper, a general framework has been introduced to study RL under deceptive falsifications of cost signals, and a number of attack models have been presented. We have provided theoretical underpinnings for understanding the fundamental limits and performance bounds on the attack and the defense in RL systems. The robust region of the cost can be utilized by both the offensive and the defensive sides. An RL agent can leverage the robust region to evaluate its robustness to malicious falsifications. An adversary can use it to estimate whether certain objectives can be achieved or not. Conditions given in Theorem 2 provide a fundamental understanding of the possible strategic adversarial behavior of the adversary. Theorem 3 helps understand the attainability of an adversary's objective. Future work would focus on investigating a particular attack model we have presented in Sect. 2.2 and developing defensive strategies based on the analytical tools we have introduced.

References

1. Behzadan, V., Munir, A.: The faults in our pi stars: security issues and open challenges in deep reinforcement learning. arXiv preprint arXiv:1810.10369 (2018)
2. Behzadan, V., Munir, A.: Adversarial reinforcement learning framework for benchmarking collision avoidance mechanisms in autonomous vehicles. IEEE Trans. Intell. Transp. Syst. (2019)
3. Bertsekas, D.P., Tsitsiklis, J.N.: Neuro-Dynamic Programming, vol. 5. Athena Scientific, Belmont (1996)
4. Borkar, V.S.: Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48. Springer, Heidelberg (2008). https://doi.org/10.1007/978-93-86279-38-5
5. Borkar, V.S., Meyn, S.P.: The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim. 38(2), 447–469 (2000)
6. Broyden, C.: On theorems of the alternative. Optim. Methods Softw. 16(1–4), 101–111 (2001)
7. Cheney, W.: Analysis for Applied Mathematics, vol. 208. Springer, Heidelberg (2013)
8. Clark, A., Zhu, Q., Poovendran, R., Başar, T.: Deceptive routing in relay networks. In: Grossklags, J., Walrand, J. (eds.) GameSec 2012. LNCS, vol. 7638, pp. 171–185. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34266-0_10
9. Everitt, T., Krakovna, V., Orseau, L., Legg, S.: Reinforcement learning with a corrupted reward channel. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 4705–4713. AAAI Press (2017)
10. García, J., Fernández, F.: A comprehensive survey on safe reinforcement learning. J. Mach. Learn. Res. 16(1), 1437–1480 (2015)


11. Grant, M., Boyd, S., Ye, Y.: CVX: Matlab software for disciplined convex programming (2008)
12. Horák, K., Zhu, Q., Bošanský, B.: Manipulating adversary's belief: a dynamic game approach to deception by design for proactive network security. In: Rass, S., An, B., Kiekintveld, C., Fang, F., Schauer, S. (eds.) GameSec 2017. LNCS, vol. 10575, pp. 273–294. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68711-7_15
13. Kreyszig, E.: Introductory Functional Analysis with Applications, vol. 1. Wiley, New York (1978)
14. Mosek, A.: The MOSEK optimization toolbox for MATLAB manual (2015)
15. Pawlick, J., Chen, J., Zhu, Q.: iSTRICT: an interdependent strategic trust mechanism for the cloud-enabled internet of controlled things. IEEE Trans. Inf. Forensics Secur. 14(6), 1654–1669 (2018)
16. Pawlick, J., Colbert, E., Zhu, Q.: Modeling and analysis of leaky deception using signaling games with evidence. IEEE Trans. Inf. Forensics Secur. 14(7), 1871–1886 (2018)
17. Pawlick, J., Colbert, E., Zhu, Q.: A game-theoretic taxonomy and survey of defensive deception for cybersecurity and privacy. ACM Comput. Surv. (2019, to appear)
18. Pawlick, J., Nguyen, T.T.H., Colbert, E., Zhu, Q.: Optimal timing in dynamic and robust attacker engagement during advanced persistent threats. In: 2019 17th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), pp. 1–6. IEEE (2019)
19. Pawlick, J., Zhu, Q.: Strategic trust in cloud-enabled cyber-physical systems with an application to glucose control. IEEE Trans. Inf. Forensics Secur. 12(12), 2906–2919 (2017)
20. Rass, S., Alshawish, A., Abid, M.A., Schauer, S., Zhu, Q., De Meer, H.: Physical intrusion games-optimizing surveillance by simulation and game theory. IEEE Access 5, 8394–8407 (2017)
21. Sutton, R.S., Barto, A.G., et al.: Introduction to Reinforcement Learning, vol. 2. MIT Press, Cambridge (1998)
22. Wang, J., Liu, Y., Li, B.: Reinforcement learning with perturbed rewards. arXiv preprint arXiv:1810.01032 (2018)
23. Watkins, C.J., Dayan, P.: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992)
24. Wolsey, L.A., Nemhauser, G.L.: Integer and Combinatorial Optimization. Wiley, Hoboken (2014)
25. Zhang, R., Zhu, Q.: Secure and resilient distributed machine learning under adversarial environments. In: 2015 18th International Conference on Information Fusion (Fusion), pp. 644–651. IEEE (2015)
26. Zhang, R., Zhu, Q.: A game-theoretic approach to design secure and resilient distributed support vector machines. IEEE Trans. Neural Netw. Learn. Syst. 29(11), 5512–5527 (2018)
27. Zhang, T., Huang, L., Pawlick, J., Zhu, Q.: Game-theoretic analysis of cyber deception: evidence-based strategies and dynamic risk mitigation. arXiv preprint arXiv:1902.03925 (2019)
28. Zhu, Q., Başar, T.: Game-theoretic approach to feedback-driven multi-stage moving target defense. In: Das, S.K., Nita-Rotaru, C., Kantarcioglu, M. (eds.) GameSec 2013. LNCS, vol. 8252, pp. 246–263. Springer, Cham (2013). https://doi.org/10.1007/978-3-319-02786-9_15
29. Zhu, Q., Başar, T.: Game-theoretic methods for robustness, security, and resilience of cyberphysical control systems: games-in-games principle for optimal cross-layer resilient control systems. IEEE Control Syst. Mag. 35(1), 46–65 (2015)

DeepFP for Finding Nash Equilibrium in Continuous Action Spaces

Nitin Kamra¹, Umang Gupta¹, Kai Wang¹, Fei Fang², Yan Liu¹, and Milind Tambe³

¹ University of Southern California, Los Angeles, CA 90089, USA
{nkamra,umanggup,wang319,yanliu.cs}@usc.edu
² Carnegie Mellon University, Pittsburgh, PA 15213, USA
[email protected]
³ Harvard University, Cambridge, MA 02138, USA
[email protected]

Abstract. Finding Nash equilibrium in continuous action spaces is a challenging problem and has applications in domains such as protecting geographic areas from potential attackers. We present DeepFP, an approximate extension of fictitious play in continuous action spaces. DeepFP represents players’ approximate best responses via generative neural networks which are highly expressive implicit density approximators. It additionally uses a game-model network which approximates the players’ expected payoffs given their actions, and trains the networks end-to-end in a model-based learning regime. Further, DeepFP allows using domain-specific oracles if available and can hence exploit techniques such as mathematical programming to compute best responses for structured games. We demonstrate stable convergence to Nash equilibrium on several classic games and also apply DeepFP to a large forest security domain with a novel defender best response oracle. We show that DeepFP learns strategies robust to adversarial exploitation and scales well with growing number of players’ resources.

Keywords: Security games · Nash equilibrium · Fictitious Play

1 Introduction

Computing equilibrium strategies is a major computational challenge in game theory and finds numerous applications in economics, planning, security domains etc. We are motivated by security domains which are often modeled as Stackelberg Security Games (SSGs) [5,13,24]. Since Stackelberg Equilibrium, a commonly used solution concept in SSGs, coincides with Nash Equilibrium (NE) in zero-sum security games and in some structured general-sum games [17], we focus on the general problem of finding mixed strategy Nash Equilibrium. Security domains often involve protecting geographic areas thereby leading to continuous action spaces [3,26]. Most previous approaches discretize players' continuous


actions [9,11,27] to find equilibrium strategies using linear programming (LP) or mixed-integer programming (MIP). However, a coarse discretization suffers from low solution quality and a fine discretization makes it intractable to compute the optimal strategy using mathematical programming techniques, especially in high-dimensional action spaces. Some approaches exploit spatio-temporal structural regularities [4,6,28] or numerically solve differential equations in special cases [13], but these do not extend to general settings. We focus on algorithms more amenable to tractable approximation in continuous action spaces.

Fictitious Play (FP) is a classic algorithm studied in game theory and involves players repeatedly playing the game and best responding to each other's history of play. FP converges to a NE for specific classes of discrete action games [18], and its variants like Stochastic Fictitious Play and Generalized Weakened Fictitious Play converge under more diverse settings [20,23,25] under reasonable regularity assumptions over the underlying domains. While FP applies to discrete action games with exact best responses, it does not trivially extend to continuous action games with arbitrarily complex best responses.

In this work, we present DeepFP, an approximate fictitious play algorithm for two-player games with continuous action spaces. The key novelties of DeepFP are: (a) it represents players' approximate best responses via state-of-the-art generative neural networks, which are highly expressive implicit density approximators with no shape assumptions on players' action spaces; (b) since implicit density models cannot be trained directly, it also uses a game-model network, which is a differentiable approximation of the players' payoffs given their actions, and trains these networks end-to-end in a model-based learning regime; and (c) DeepFP allows replacing these networks with domain-specific oracles if available, which makes it possible to operate in the absence of gradients for one or both players and to exploit techniques from research areas like mathematical programming to compute best responses. We also apply DeepFP to a forest security problem with a novel defender best response oracle designed for this domain. The proposed oracle is another contribution of this work and may be of independent interest for the forest protection domain.

Related Work: Previous approaches to find equilibria in continuous action spaces have employed variants of the Cournot adjustment strategy [1] but suffer from convergence issues [14]. Variants of FP either require explicit maximization of value functions over the action set, as in Fictitious Self-Play [12], or maintain complex hierarchies of players' past joint strategies, as in PSRO [19], which is only feasible with finite and discrete action sets (e.g., poker) and does not generalize to continuous action spaces. Since it is challenging to maintain distributions over continuous action spaces, recent works in multiagent reinforcement learning (RL) [21] often assume explicit families of distributions for players' strategies, which may not span the space of strategies to which NE distributions belong. More recently, update rules which modify gradient descent using second-order dynamics of multi-agent games have been proposed [2]. The closest method to our approach is OptGradFP [15], which assumes a multivariate logit-normal distribution for players' strategies. We show that due to explicit shape assumptions,
it suffers from lack of representational power and is prone to diverging since logit-normal distributions concentrate in some parts of the action space often yielding −∞ log-probabilities in other parts. DeepFP addresses the lack of representational power by using flexible implicit density approximators. Further, our model-based training proceeds without any likelihood estimates and hence does not yield −∞ log-likelihoods in any parts of the action space, thereby converging stably. Moreover, unlike OptGradFP, DeepFP is an off-policy algorithm and trains significantly faster by directly estimating expected rewards using the game model network instead of replaying previously stored games.

2 Game Model

We consider a two-player game with continuous action sets for players 1 and 2. We will often use the index p ∈ {1, 2} for one of the players and −p for the other player. Up denotes the compact, convex action set of player p. We denote the probability density for the mixed strategy of player p at action up ∈ Up as σp(up) ≥ 0 s.t. ∫_{Up} σp(up) dup = 1. We denote player p sampling an action up ∈ Up from his mixed strategy density σp as up ∼ σp. We denote joint actions, joint action sets and joint densities without any player subscript, i.e., as u = (u1, u2), U = U1 × U2 and σ = (σ1, σ2) respectively. Each player has a bounded and Lipschitz continuous reward function rp : U → R. For zero-sum games, rp(u) + r−p(u) = 0 ∀u ∈ U. With players' mixed strategy densities σp and σ−p, the expected reward of player p is:

$$E_{u \sim \sigma}[r_p] = \int_{U_p} \int_{U_{-p}} r_p(u)\, \sigma_p(u_p)\, \sigma_{-p}(u_{-p})\, du_p\, du_{-p}.$$

The best response of player p against player −p's current strategy σ−p is defined as the set of strategies which maximizes his expected reward:

$$BR_p(\sigma_{-p}) := \arg\max_{\sigma_p} E_{u \sim (\sigma_p, \sigma_{-p})}[r_p].$$

A pair of strategies σ* = (σ1*, σ2*) is said to be a Nash equilibrium if neither player can increase his expected reward by changing his strategy while the other player sticks to his current strategy. In such a case both these strategies belong to the best response sets to each other: σ1* ∈ BR1(σ2*) and σ2* ∈ BR2(σ1*).
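To make these definitions concrete, the expected reward under a pair of mixed strategies can be estimated by Monte Carlo sampling whenever both densities can be sampled from. The sketch below is only an illustration; the reward function and strategy samplers are placeholders of our own and are not objects defined in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def r1(u1, u2):
    # Hypothetical zero-sum reward for player 1 (placeholder, not from the paper).
    return -(u1 - u2) ** 2 + u2 ** 2

def sample_sigma1(n):
    # Placeholder mixed strategy for player 1: uniform on [0, 1].
    return rng.uniform(0.0, 1.0, size=n)

def sample_sigma2(n):
    # Placeholder mixed strategy for player 2: Gaussian clipped to [0, 1].
    return np.clip(rng.normal(0.5, 0.2, size=n), 0.0, 1.0)

def expected_reward(n=100_000):
    # Monte Carlo estimate of E_{u ~ sigma}[r_1].
    u1, u2 = sample_sigma1(n), sample_sigma2(n)
    return r1(u1, u2).mean()

print(expected_reward())
```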

3 Deep Fictitious Play

To compute NE for a game, we introduce an approximate realization of fictitious play in high-dimensional continuous action spaces, which we call Deep Fictitious


Play (DeepFP). Let the density function corresponding to the empirical distribution of player p's previous actions (a.k.a. belief density) be σ̄p. Then fictitious play involves player p best responding to his opponent's belief density σ̄−p:

$$BR_p(\bar{\sigma}_{-p}) := \arg\max_{\sigma_p} E_{u \sim (\sigma_p, \bar{\sigma}_{-p})}[r_p].$$

Repeating this procedure for both players is guaranteed to converge to the Nash equilibrium densities for both players for certain classes of games [18]. Hence extending Fictitious Play to continuous action spaces requires approximations for two essential ingredients: (a) belief densities over players’ actions, and (b) best responses for each player.

Fig. 1. Neural network models for DeepFP; blue color denotes player p, red denotes his opponent −p, green shows the game model network and violet shows loss functions and gradients. Panels: (a) sampling actions from the best response network, (b) learning game model network parameters φ, (c) learning best response network parameters θp. (Color figure online)

3.1 Approximating Belief Densities

Representing belief densities compactly is challenging in continuous action spaces. However, with an appropriate approximation to Fictitious Play, one can get away with a representation which only requires sampling from the belief density but never explicitly calculating the density at any point in the action space. DeepFP is one such approximation, and hence we maintain the belief density σ̄p of each player p via a non-parametric, population-based estimate, i.e., via a simple memory of all actions played by p so far. Directly sampling up from the memory gives an unbiased sample from σ̄p.
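A minimal sketch of such a memory-based belief representation is shown below; class and method names are ours, since the paper does not prescribe an implementation.

```python
import random

class BeliefMemory:
    """Stores every action played by a player; sampling uniformly from the
    stored actions gives an unbiased draw from the empirical belief density."""

    def __init__(self, max_size=None):
        self.actions = []
        self.max_size = max_size  # optional cap, analogous to the memory size E

    def add(self, action):
        self.actions.append(action)
        if self.max_size is not None and len(self.actions) > self.max_size:
            self.actions.pop(0)  # keep only the most recent actions

    def sample(self, n):
        # Unbiased samples from the empirical (belief) density.
        return random.choices(self.actions, k=n)
```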

3.2 Approximating Best Responses

Computing exact best responses is intractable for most games. But when the expected reward for a player p is differentiable w.r.t. the player’s action up and admits continuous and smooth derivatives, approximate best responses are feasible. One way is to use the gradient of reward to update the action up iteratively using gradient ascent till it converges to a best response. Since the best response needs to be computed per iteration of FP, employing inner iterations of gradient descent can be expensive. However since the history of play for players doesn’t change too much between iterations of FP, we expect the same of best responses. Consequently we approximate best responses with function approximators (e.g., neural networks) and keep them updated with a single gradient ascent step (also done by [8]). We propose a best response network for each player p which maps an easy to sample dp -dimensional random variable Zp ∈ Rdp (e.g. Zp ∼ N (0, Idp )) to the player’s action up . By learning an appropriate mapping BRp (zp ; θp ) parameterized by weights θp , it can approximate any density in the action space Up (Fig. 1a). Note that this is an implicit density model i.e. one can draw samples of up by sampling zp ∼ PZp (·) and then computing BRp (zp ; θp ), but no estimate of the density is explicitly available. Further, best response networks maintain stochastic best responses since they lead to smoother objectives for gradient-based optimization. Using them is common practice in policy-gradient and actor-critic based RL since deterministic best responses often render the algorithm unstable and brittle to hyperparameter settings (also shown by [10]). To learn θp we need to approximate the expected payoff of player p given by E(up ∼BRp (·;θp ),u−p ∼¯σ−p ) [rp ] as a differentiable function of θp . However a differentiable game model is generally not available a priori, hence we also maintain a game model network which takes all players’ actions i.e. {up , u−p } as inputs and predicts rewards {ˆ rp , rˆ−p } for each player. This can either be pre-trained or learnt simultaneously with the best response networks directly from gameplay data (Fig. 1b). Coupled with a shared game model network, the best response networks of players can be trained to approximate best responses to their opponent’s belief densities (¯ σ−p ) (Fig. 1c). The training procedure is discussed in detail in Sect. 3.3. When the expected reward is not differentiable w.r.t. players’ actions or the derivatives are zero in large parts of the action space, DeepFP can also employ


an approximate best response oracle (BROp) for player p. The oracle can be a non-differentiable approximation algorithm employing Linear Programming (LP) or Mixed Integer Programming (MIP), and since it will not be trained, it can also be deterministic. In many security games, mixed-integer programming based algorithms have been proposed to compute best responses, and our algorithm provides a novel way to incorporate them as subroutines in a deep learning framework, as opposed to most existing works which require end-to-end differentiable policy networks and cannot utilize non-differentiable solutions even when available.

3.3 DeepFP

Algorithm 1 shows the DeepFP pseudocode. DeepFP randomly initializes any best response networks and game model network (if needed) and declares an empty memory (mem) to store players’ actions and rewards [lines 1–2].

Algorithm 1. DeepFP
Data: max games, batch sizes (m1, m2, mG), memory size E, game simulator and oracle BROp for players with no gradient
Result: Final belief densities σ̄*p in mem ∀ players p
1: Initialize all network parameters (θ1, θ2, φ) randomly;
2: Create empty memory mem of size E;
3: for game ∈ {1, ..., max games} do
     /* Obtain best responses */
4:   for each player p do
5:     if grad avlbl(p) then
6:       Sample zp ∼ N(0, I);
7:       Approx. best response up = BRp(zp; θp);
8:     else
9:       up = BROp(σ̄−p) with σ̄−p from mem;
     /* Play game and update memory */
10:  Play with u = {u1, u2} to get r = {r1, r2};
11:  Store sample {u, r} in mem;
     /* Train shared game model net */
12:  if grad avlbl(p) for any p ∈ {1, 2} then
13:    Draw samples {u^i, r^i}_{i=1:mG} from mem;
14:    φ := Adam.min(L_MSE, φ, {u^i, r^i}_{i=1:mG});
     /* Train best response nets */
15:  for each player p with grad avlbl(p) do
16:    Draw samples {u^i}_{i=1:mp} from mem;
17:    θp := Adam.min(L_{rp}, θp, {u^i_{−p}}_{i=1:mp});
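For readability, the outer loop of Algorithm 1 can also be written as the Python skeleton below; the simulator, memory, oracle, and update routines are hypothetical stubs standing in for components the paper describes but does not list as code.

```python
def deep_fp(max_games, simulator, game_model, players, mem, m_G, batch_sizes):
    """Skeleton of Algorithm 1. players[p] is assumed to expose grad_avlbl,
    best_response() (a BR-net forward pass on z ~ N(0, I)), oracle(mem), and
    update_best_response(game_model, mem, m_p); all of these are hypothetical stubs."""
    for _ in range(max_games):
        # Obtain best responses (lines 4-9).
        u = {}
        for p, player in players.items():
            u[p] = player.best_response() if player.grad_avlbl else player.oracle(mem)
        # Play the game and update memory (lines 10-11).
        r = simulator.play(u)
        mem.store(u, r)
        # Train the shared game model network (lines 12-14).
        if any(pl.grad_avlbl for pl in players.values()):
            game_model.fit_minibatch(mem.sample(m_G))
        # Train the best response networks (lines 15-17).
        for p, player in players.items():
            if player.grad_avlbl:
                player.update_best_response(game_model, mem, batch_sizes[p])
    return mem  # the stored actions are the final belief densities sigma_bar*_p
```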

Then it iteratively makes both players best respond to the belief density of their opponent. This best response can be computed per player p via a forward


pass of the best response network BRp, or via a provided oracle BROp if gradients are not available [lines 4–9]. The best response moves and the rewards obtained by playing them are stored in mem [lines 10–11]. Samples from the exact belief density σ̄ of both players are available from mem. The game model network is also trained simultaneously to learn a differentiable reward model of the game [lines 12–14]. It takes all players' actions u as input and predicts the game rewards r̂(u; φ) for all players. Its parameters φ can be learnt by minimizing the mean square error loss over a minibatch of samples {u^i}_{i=1:mG} from mem, using any optimizer (we use Adam [16]):

$$L_{MSE}(\phi) = \frac{1}{2 m_G} \sum_{p \in \{1,2\}} \sum_{i=1}^{m_G} \left( \hat{r}_p(u^i; \phi) - r_p^i \right)^2.$$

The advantage of estimating this differentiable reward model independently of the playing strategies is that it can be trained from the data in the replay memory without requiring importance sampling, hence it can be used as a proxy for the game simulator to train the best response networks. An alternative could be to replay past actions of players using the game simulator as done by [15], but it is much slower (see Sect. 4.2). Finally, each player updates their best response network to keep it a reasonable approximation to the best response to his opponent's belief density [lines 15–17]. For this, each player p maximizes his expected predicted reward r̂p (or minimizes the expected −r̂p) against the opponent's belief density σ̄−p (see Fig. 1c) using any optimizer (we use Adam):

$$L_{r_p}(\theta_p) = -E_{z_p \sim N(0,I),\, u_{-p} \sim \bar{\sigma}_{-p}} \left[ \hat{r}_p(BR_p(z_p; \theta_p), u_{-p}; \phi) \right].$$

The expectation is approximated using a minibatch of samples {u^i_{−p}}_{i=1:mp} drawn from mem and {z^i_p}_{i=1:mp} independently sampled from a standard normal distribution. In this optimization, φ is held constant, the gradient is only evaluated w.r.t. θp, and the updates are applied to the best response network. In this sense, the game model network acts like a critic to evaluate the best responses of player p (actor) against his opponent's belief density σ̄−p, similar to actor-critic methods [22]. However, unlike actor-critic methods, we train the best response and the game model networks in separate decoupled steps, which potentially allows replacing them with pre-trained models or approximate oracles, while skipping their respective learning steps.
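A compact PyTorch sketch of these two decoupled updates is given below. The paper's implementation uses TensorFlow v1.5 with different architectures (see the appendix), so the layer sizes, optimizer settings, and names here are illustrative assumptions only.

```python
import torch
import torch.nn as nn

def mlp(sizes):
    layers = []
    for a, b in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(a, b), nn.ReLU()]
    return nn.Sequential(*layers[:-1])  # drop the final ReLU: linear output

d_z, d_u = 8, 2                        # assumed noise and action dimensions
br_p  = mlp([d_z, 64, 64, d_u])        # best response network BR_p(z_p; theta_p) for one player
game  = mlp([2 * d_u, 128, 128, 2])    # game model: (u_p, u_-p) -> (r_p_hat, r_-p_hat)
opt_br   = torch.optim.Adam(br_p.parameters(), lr=1e-3)
opt_game = torch.optim.Adam(game.parameters(), lr=1e-3)

def train_game_model(u_batch, r_batch):
    # u_batch: (m_G, 2*d_u) joint actions from mem; r_batch: (m_G, 2) observed rewards.
    r_hat = game(u_batch)
    loss = ((r_hat - r_batch) ** 2).mean()   # equals L_MSE with its 1/(2 m_G) normalization
    opt_game.zero_grad(); loss.backward(); opt_game.step()

def train_best_response(u_opp_batch):
    # u_opp_batch: (m_p, d_u) opponent actions sampled from the belief memory.
    z = torch.randn(u_opp_batch.shape[0], d_z)
    u_p = br_p(z)                             # implicit-density sample u_p = BR_p(z_p; theta_p)
    r_hat_p = game(torch.cat([u_p, u_opp_batch], dim=1))[:, 0]
    loss = -r_hat_p.mean()                    # L_{r_p}: maximize predicted reward
    # phi is held fixed here: only theta_p receives an update.
    opt_br.zero_grad(); loss.backward(); opt_br.step()
```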

3.4 Connections to Boltzmann Actor-Critic and Convergence of DeepFP

DeepFP is closely related to the Boltzmann actor-critic process proposed by Generalized Weakened Fictitious Play (GWFP) [20], which converges to the NE under certain assumptions. But it differs in two crucial aspects: (i) GWFP requires assuming explicit probability densities and involves weakened ε-best


responses which are updated via a Boltzmann actor-critic process. Since we store the empirical belief densities and best responses as implicit densities, a Boltzmann-style strategy update is infeasible; (ii) GWFP also requires the best responses to eventually become exact (i.e., when lim_{n→∞} ε_n → 0). Since we are approximating stochastic best responses via generative neural networks (or with approximate oracles), this assumption may not always hold exactly. Nevertheless, with our approximate best responses and with the one-step gradient updates to the best response networks, we empirically observed that DeepFP converges for multiple games with continuous reward functions wherever GWFP converges. At convergence, the belief density σ̄* in mem is a non-parametric approximation to a NE density for both players.

4 Experimental Evaluation

4.1 Simple Games

We first evaluate DeepFP on two simple games where traditional fictitious play is known to converge, as potential sanity checks and to demonstrate convergence to Nash equilibrium.

Concave-Convex Game: Two players 1 and 2 with scalar actions x, y ∈ [−2, 2] respectively play to maximize their rewards: r1(x, y) = −2x² + 4xy + y² − 2x − 3y + 1 and r2(x, y) = −r1(x, y). The game is concave w.r.t. x and convex w.r.t. y and admits a pure strategy NE which can be computed using standard calculus. The NE strategies are x = 1/3, y = 5/6, the expected equilibrium rewards are r1* = −r2* = −7/12, and the best responses of players to each other's average strategies are BR1(ȳ) = ȳ − 1/2 and BR2(x̄) = 3/2 − 2x̄.

Cournot Game: This is a classic game [7] with two competing firms (1 and 2) producing quantities (q1 ≥ 0 and q2 ≥ 0 resp.) of a product. The price of the product is p(q1, q2) = a − q1 − q2 and the cost of manufacturing quantity q is C(q) = cq, where c, a > 0 are constants. The reward for firm p is Rp(q1, q2) = (a − q1 − q2)qp − cqp, p ∈ {1, 2}, and the best response against the competing firm's choice q−p can be analytically computed as BRp(q−p) = (a − c − q−p)/2. The NE strategy can be computed as q1* = q2* = (a − c)/3. We use a = 2 and c = 1 for our experiment so that q1* = q2* = 1/3. A short fictitious-play check of this game is sketched below.

Figure 2 shows the results of DeepFP on these games and its convergence to the NE for all variants, i.e., when both, exactly one, or no player uses the best response oracle. Note that both players using the best response oracle (bottom case in all subfigures) is the same as exact fictitious play and converges very fast as opposed to the other cases (top and mid in all subfigures), since the latter variants require estimating the best responses from repeated gameplay.
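As a sanity check of the Cournot best response formula above, a few lines of NumPy running exact fictitious play (each firm best responding to the opponent's empirical average quantity) converge to the NE value q* = 1/3 for a = 2, c = 1:

```python
import numpy as np

a, c = 2.0, 1.0

def br(q_opp_avg):
    # Analytic best response BR_p(q_-p) = (a - c - q_-p) / 2, clipped at zero.
    return max(0.0, (a - c - q_opp_avg) / 2.0)

h1, h2 = [0.0], [0.0]              # histories of played quantities
for _ in range(200):
    q1 = br(np.mean(h2))           # best respond to the opponent's empirical average
    q2 = br(np.mean(h1))
    h1.append(q1); h2.append(q2)

print(np.mean(h1), np.mean(h2))    # both approach the NE value 1/3
```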

4.2 Forest Protection Game

For a large application of DeepFP, we choose the forest protection game as proposed by [15] with a Defender (D) and an Adversary (A).


Fig. 2. DeepFP on simple games under three settings: when both players learn BR nets (top), player 1 uses BR oracle (mid), and when both players use BR oracle (bottom); (a) and (b): expected reward of player 1 converges to the true equilibrium value (shown by dashed line) for the concave-convex and Cournot games respectively; (c) and (d): final empirical density for player 1 approaches the NE strategy for both games (shown by blue triangle on horizontal axis).

Fig. 3. Forest game with trees (green dots), guards (blue dots), guard radii Rg (blue circles), lumberjacks (red dots), lumberjack chopping radii Rl (red circles), lumberjacks' paths (red lines) and black polygons (top weighted capture-sets for guards): (a) with m = n = 3, (b) best response oracle for 3 guards and 15 lumberjacks. (Color figure online)

Consider a circular forest with an arbitrary tree distribution. The adversary has n lumberjacks who cross the boundary and move straight towards the forest center. They can stop at any point on their path, chop trees in a radius Rl around the stopping point and exit back from their starting location. The adversary's action is then all the stopping points for lumberjacks (which fully specifies their trajectories). The defender has m guards whose locations in the forest can be chosen to ambush the lumberjacks. A lumberjack whose trajectory comes within Rg distance from any guard's location is considered ambushed and loses all his chopped trees and bears a penalty rpen. The final reward for the adversary (rA ∈ R) is the number of trees jointly stolen by the lumberjacks plus the total negative penalty incurred. The defender's reward is rD = −rA. A full game is shown in Fig. 3a. In our experiments we use the following settings for the game: rpen = 4.0, Rg = 0.1, Rl = 0.04.

Approximate Best Response Oracle: Note that if guards' locations do not overlap significantly with those of lumberjacks then changing them by a small amount does not affect the rewards for either player since no extra lumberjacks are ambushed. Hence, the gradient of reward w.r.t. defender's parameters (∇θD r) ≈ 0 over most of the action space. But the gradients for the adversary are continuous and non-zero because of the dense tree distribution. Hence we apply DeepFP to this game with a best response network for the adversary and an approximate domain-specific best response oracle for the defender. Devising a defender's best response to the adversary's belief distribution is non-trivial for this game, and we propose a greedy approximation for it¹. Briefly, the oracle algorithm involves creating capture-sets for lumberjack locations l encountered so far in mem and intersecting these capture-sets to find those which cover multiple lumberjacks. Then it greedily allocates guards to the top m such capture-sets one

¹ The full oracle algorithm and the involved approximations are detailed in the appendix to keep the main text concise and continuous.


at a time, while updating the remaining capture-sets simultaneously to account for the lumberjacks ambushed by the current guard allocation. We illustrate an oracle best response in Fig. 3b.

Baselines: Since the forest protection game involves arbitrary tree density patterns, the ground truth equilibria are intractable. So we evaluate DeepFP by comparing it with OptGradFP [15] and with another approximate discrete linear programming method (henceforth called DLP).

DLP Baseline: We propose DLP, which discretizes the action space of players and solves a linear programming problem to solve the game approximately (but only for small m and n). The DLP method discretizes the action space in cylindrical coordinates with 20 radial bins and 72 angular bins, which gives a joint action space of size (72 × 20)^{m+n}. For even a single guard and lumberjack, this implies about 2 million pure strategies. Hence, though DLP gives the approximate ground truth for m = n = 1 due to our fine discretization, going beyond m or n > 1 is infeasible with DLP. The DLP baseline proceeds in two steps:

1. We generate 72 × 20 = 1440 cylindrically discretized bins and compute a matrix R ∈ R^{1440×1440} where R_{ij} characterizes the defender's reward with a guard in the i-th bin and a lumberjack in the j-th bin. Each entry R_{ij} is computed by averaging the game simulator's reward over 20 random placements of the guard and lumberjack inside the bins.

2. Next we solve the following optimization problem for the defender:

$$\sigma^*, \chi^* = \arg\max_{\sigma \ge 0,\, \chi} \; \chi \quad \text{s.t.} \quad \sigma^T R_{:j} \ge \chi \;\; \forall j, \qquad \sum_{i=1}^{1440} \sigma_i = 1$$

Note that χ represents the defender's reward, σ_i is the i-th element of σ ∈ [0, 1]^{1440}, i.e., the probability of placing the guard in the i-th bin, and R_{:j} is the j-th column of R corresponding to the adversary taking action j. The above problem maximizes the defender's reward subject to the constraints that σ has all non-negative elements summing to 1 (since it is a distribution over all bins) and the defender's reward χ is least exploitable regardless of the adversary's placement in any bin j. Solving it gives us the optimal defender distribution σ* over all bins to place the guard and the equilibrium reward for the defender χ* when m = n = 1.

Fixed Hyperparameters: We set max games = E = 40000 to provide enough iterations to DeepFP and OptGradFP for convergence. The batch sizes for DeepFP are set to mD = 3 (kept small to have a fast oracle), mA = 64, mG = 128 (large for accurate gradient estimation). For the full neural network architectures used, please refer to the appendix.
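The maximin problem in the DLP formulation above is a standard linear program once R has been estimated; a sketch using scipy.optimize.linprog is shown below, with a random placeholder standing in for the simulated reward matrix R.

```python
import numpy as np
from scipy.optimize import linprog

n_bins = 1440
R = np.random.default_rng(0).normal(size=(n_bins, n_bins))  # placeholder for the simulated reward matrix

# Decision variables x = [sigma (n_bins entries), chi]; maximize chi <=> minimize -chi.
cost = np.zeros(n_bins + 1)
cost[-1] = -1.0

# Constraints sigma^T R[:, j] >= chi for all j  <=>  -R^T sigma + chi <= 0.
A_ub = np.hstack([-R.T, np.ones((n_bins, 1))])
b_ub = np.zeros(n_bins)

# sigma must be a probability distribution over bins.
A_eq = np.hstack([np.ones((1, n_bins)), np.zeros((1, 1))])
b_eq = np.array([1.0])
bounds = [(0.0, None)] * n_bins + [(None, None)]   # sigma >= 0, chi unbounded

res = linprog(cost, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=bounds, method="highs")
sigma_star, chi_star = res.x[:n_bins], res.x[-1]
```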


Table 1. Results on four representative forests for m = n = 1. Green dots: trees, blue dots: guard locations sampled from defender’s strategy, red dots: lumberjack locations sampled from adversary’s strategy. The exploitability metric shows that DLP which is approximately the ground truth NE strategy is the least exploitable followed by DeepFP, while OptGradFP’s inflexible explicit strategies make it heavily exploitable.

Exploitability Analysis: Since direct computation of the ground truth equilibrium is infeasible for a forest, we compare all methods by evaluating the exploitability of the defender’s final strategy as NE strategies are least exploitable. For this, we designed an evolutionary algorithm to compute the adversary’s best response to the defender’s final strategy. It maintains a population (size 50) of adversary’s actions and iteratively improves it by selecting


the best 10 actions, duplicating them four-fold, perturbing the duplicate copies with Gaussian noise (whose variance decays over iterations) and re-evaluating the population against the defender's final strategy. This evolutionary procedure is independent of any discretization or neural network and outputs the adversary action which exploits the defender's final strategy most heavily. We denote the reward achieved by the top action in the population as the exploitability ε and report the exploitability of the defender's strategy averaged across 5 distinct runs of each method (differing only in the initial seed). Since rewards can differ across forests due to the number of trees in the forest and their distribution, the exploitability of each forest can differ considerably. Also, since the evolutionary algorithm requires 150K–300K game plays per run, it is quite costly and only feasible for a single accurate post-hoc analysis rather than for computing best responses within DeepFP.
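A minimal sketch of this evolutionary search is given below; the payoff function is a placeholder for evaluating an adversary action against the defender's fixed final strategy, and the initial action bounds are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def evolve_exploit(payoff, dim, pop_size=50, keep=10, iters=200, sigma0=0.3):
    """payoff(action) -> adversary reward against the defender's fixed final strategy."""
    pop = rng.uniform(-1.0, 1.0, size=(pop_size, dim))         # initial adversary actions (placeholder bounds)
    for it in range(iters):
        scores = np.array([payoff(a) for a in pop])
        elite = pop[np.argsort(scores)[-keep:]]                # keep the 10 best actions
        sigma = sigma0 * (1.0 - it / iters)                    # perturbation noise decays over iterations
        children = np.repeat(elite, 4, axis=0)                 # duplicate the elites four-fold
        children += rng.normal(0.0, sigma, size=children.shape)
        pop = np.vstack([elite, children])                     # 10 elites + 40 perturbed copies = 50
    scores = np.array([payoff(a) for a in pop])
    best = pop[np.argmax(scores)]
    return best, scores.max()                                  # scores.max() estimates the exploitability
```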

Single Resource Case: Table 1 shows results on four representative forests when m = n = 1. We observe that both DLP and DeepFP find strategies which intuitively cover dense regions of the forest (central forest patch for F1, nearly the whole forest for uniform forest F2, dense arch of trees for F3 and ring for the forest F4 with a tree-less sector). On the uniform forest F2, the expected NE strategy is a ring at a suitable radius from the center, as outputted by DeepFP. However, DLP has a fine discretization and is able to sense minute deviations from uniform tree structure induced due to the sampling of trees from a uniform distribution, hence it forms a circular ring broken and placed at different radii. A similar trend is observed on F4. On F3, DeepFP finds a strategy strongly covering the dense arch of trees, similar to that of DLP. Note that sometimes DeepFP even finds less exploitable strategies than DLP (e.g. on F1), since DLP, while being close to the ground truth, still involves an approximation due to discretization. Overall, as expected, DLP is in general the least exploitable method and is the closest to the NE, followed by DeepFP. OptGradFP is more exploitable than DeepFP for nearly uniform tree densities (F2 and F4) and heavily exploitable for forests with concentrated tree densities (F1 and F3), since unlike DeepFP, it is unable to approximate arbitrary strategy shapes.

Table 2. Results on forest F3 for m = n = {2, 3} (strategy plots omitted). Green dots: trees, blue dots: guard locations sampled from defender's strategy, red dots: lumberjack locations sampled from adversary's strategy. DeepFP is always less exploitable than OptGradFP. DeepFP (m=n=2): ε = 135.49 ± 15.24; DeepFP (m=n=3): ε = 137.53 ± 8.63; OptGradFP (m=n=2): ε = 186.58 ± 23.71; OptGradFP (m=n=3): ε = 190.00 ± 23.63.


Fig. 4. (a) Adversary's average reward with memory size E as a fraction of total games played. Even for a 1% fraction of memory size, i.e. γ = 0.01, the average rewards are close to the γ = 1 case. (b) Time per iteration vs. players' resources. DeepFP is orders of magnitude faster than OptGradFP (y-axis has log scale).


Multiple Resource Case: Since DLP cannot scale for m or n > 1, we compute the strategies and exploitability for m = n = {2, 3} on F3 in Table 2 for DeepFP and OptGradFP only (more forests in the appendix). We consistently observe that DeepFP accurately covers the dense forest arch of F3 while OptGradFP spreads both players out more uniformly (due to explicit strategies). For the m = n = 3 case, DeepFP also allots a guard to the central patch of F3. Overall, DeepFP is substantially less exploitable than OptGradFP.

Effect of Memory Size: In Algorithm 1, we stored and best responded to all games in the replay memory. Figure 4a shows the expected reward (E[rA]) achieved by the adversary's final strategy against the defender's final strategy, when the replay memory size E is varied as a fraction γ of max games. Only the most recent γ fraction of max games are stored and best responded to, and the previous ones are deleted from mem. We observe that DeepFP is fairly robust to memory size and even permits significantly smaller replay memories (up to 0.01 times max games) without significant deterioration in average rewards.

Running Time Analysis: Given the same total number of iterations, we plot the time per iteration for DeepFP and OptGradFP in Fig. 4b with increasing m and n (y-axis has log scale). OptGradFP's training time increases very fast with increasing m and n due to high game replay time. With our approximate best-response oracle and estimation of payoffs using the game model network, DeepFP is orders of magnitude faster. For a total of 40K iterations, training DeepFP takes about 0.64 ± 0.34 h (averaged over values of m and n) as opposed to 22.98 ± 8.39 h for OptGradFP.

5 Conclusion

We have presented DeepFP, an approximate fictitious play algorithm for games with continuous action spaces. DeepFP implicitly represents players' best responses via generative neural networks without prior shape assumptions and optimizes them using a learnt game-model network with gradient-based training. It can also utilize approximate best response oracles whenever available, thereby harnessing prowess in approximation algorithms from discrete planning and operations research. DeepFP provides significant speedup in training time and scales well with a growing number of resources. DeepFP can be easily extended to multi-player applications, with each player best responding to the joint belief density over all other players using an oracle or a best response network. Like most gradient-based optimization algorithms, DeepFP and OptGradFP can sometimes get stuck in local Nash equilibria (see the appendix for experiments). While DeepFP gets stuck less often than OptGradFP, principled strategies to mitigate local optima for gradient-based equilibrium finding methods remain an interesting direction for future work.


Acknowledgments. This research was supported in part by NSF Research Grant IIS-1254206, NSF Research Grant IIS-1850477 and MURI Grant W911NF-11-1-0332.

A Appendix

A.1 Approximate Best Response Oracle for Forest Protection Game

Algorithm 2. Approximate best response oracle
Data: mem, batch size mD, game simulator, m, n
Result: Guard assignments approximating BROD(σ̄A)
1: Draw batch of adversary actions {u^i_A}_{i=1:mD} from σ̄A (stored in mem);
2: Extract all mD × n lumberjack locations l ∈ {u^i_A}_{i=1:mD};
   /* Capture-set for each lumberjack */
3: Initialize empty capture-set list S;
4: for l ∈ {u^i_A}_{i=1:mD} do
5:   Create a capture-set s(l) (approximated by a convex polygon), i.e., as the set of all guard locations which are within radius Rg from any point on the trajectory of the lumberjack stopping at l;
6:   Query reward w(l) of ambushing at l (using simulator);
7:   Append (s, w, l) to S.
   /* Output max reward capture-sets */
8: Find all possible intersections of sets s ∈ S while assigning a reward w′ = Σ_j w_j and lumberjacks l′ = ∩_j l_j to s′ = ∩_j s_j, and append all new (s′, w′, l′) triplets to S;
9: Pop the top m maximum reward sets in S one at a time and assign a single guard to each, while updating all remaining sets' weights to remove lumberjacks covered by the guard allotment;
10: Output the guard assignments.

Devising a defender’s best response to the adversary’s belief distribution is non-trivial for this game. So we propose a greedy approximation to the best response. (see Algorithm 2). We define a capture-set for a lumberjack location l as the set of all guard locations within a radius Rg from any point on the trajectory of the lumberjack. The algorithm involves creating capture-sets for lumberjack locations l encountered so far in mem and intersecting these capture-sets to find those which cover multiple lumberjacks. Then it greedily allocates guards to the top m such capture-sets one at a time, while updating the remaining capture-sets simultaneously to account for the lumberjacks ambushed by the current guard allocation. Our algorithm involves the following approximations: 1. Mini-batch approximation: Since it is computationally infeasible to compute the best response to the full set of actions in mem, we best-respond to a small mini-batch of actions sampled randomly from mem to reduce computation (line 1).


2. Approximate capture-sets: Initial capture-sets can have arbitrary arc-shaped boundaries which can be hard to store and process. Instead, we approximate them using convex polygons for simplicity (line 5). Doing this ensures that all subsequent intersections also result in convex polygons.

3. Bounded number of intersections: Finding all possible intersections of capture-sets can be reduced to finding all cliques in a graph with capture-sets as vertices and pairwise intersections as edges. Hence it is an NP-hard problem with complexity growing exponentially with the number of polygons. We compute intersections in a pairwise fashion while adding the newly intersected polygons to the list. This way the k-th round of intersection produces up to all (k + 1)-polygon intersections, and we stop after k = 4 rounds of intersection to maintain polynomial time complexity (implemented for line 8, but not shown explicitly in Algorithm 2).

4. Greedy selection: After forming capture-set intersections, we greedily select the top m sets with the highest rewards (line 9).
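A simplified sketch of the capture-set construction and greedy allocation, using shapely for the polygon operations, is shown below. It assumes a unit-disc forest, straight-line trajectories from the boundary to the stopping points, and performs only one round of pairwise intersections (the oracle above uses up to k = 4 rounds); all helper names are ours.

```python
import numpy as np
from shapely.geometry import LineString

def capture_set(stop_xy, Rg=0.1):
    # All guard locations within Rg of any point on the straight path from the
    # boundary entry point to the stopping point (unit-disc forest assumed).
    entry = np.asarray(stop_xy) / (np.linalg.norm(stop_xy) + 1e-9)
    return LineString([tuple(entry), tuple(stop_xy)]).buffer(Rg)

def greedy_allocate(stops, rewards, m, Rg=0.1):
    # Build (polygon, covered-lumberjack-set) pairs, do one round of pairwise
    # intersections, then greedily place m guards on the highest-reward sets,
    # discounting lumberjacks already covered by earlier guards.
    sets = [(capture_set(s, Rg), {i}) for i, s in enumerate(stops)]
    for i in range(len(stops)):
        for j in range(i + 1, len(stops)):
            inter = sets[i][0].intersection(sets[j][0])
            if not inter.is_empty:
                sets.append((inter, {i, j}))
    guards, covered = [], set()
    for _ in range(m):
        poly, members = max(sets, key=lambda s: sum(rewards[k] for k in s[1] - covered))
        guards.append(poly.representative_point().coords[0])  # place a guard inside the set
        covered |= members
    return guards
```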

A.2 Supplementary Experiments with m, n > 1

Table 3. More results on forests F1 and F4 for m = n = 2 (strategy plots omitted).
F1: DeepFP (m=n=2) ε = 153.21 ± 50.87; OptGradFP (m=n=2) ε = 212.92 ± 27.95.
F4: DeepFP (m=n=2) ε = 53.70 ± 3.85; OptGradFP (m=n=2) ε = 49.00 ± 3.68.

Table 3 shows more experiments for DeepFP and OptGradFP with m,n>1. We see that DeepFP is able to cover regions of importance with the players’ resources but OptGradFP suffers from the zero defender gradients issue due to logit-normal strategy assumptions which often lead to sub-optimal results and higher exploitability.


Table 4. Demonstrating getting stuck in locally optimal strategies (forest F5; strategy plots omitted). Configurations: C1: DLP (m=n=1); C2: OptGradFP (m=n=1); C3: OptGradFP (m=n=1); C4: DeepFP (m=n=1); C5: DeepFP (m=n=1); C6: OptGradFP (m=n=3); C7: DeepFP (m=n=3).

A.3 Locally Optimal Strategies

To further study the issue of getting stuck in locally optimal strategies, we show experiments with another forest F5 in Table 4. F5 has three dense tree patches, while its other parts are very sparse and mostly empty. The optimal defender's strategy computed by DLP for m = n = 1 is shown in C1. In such a case, due to the tree density being broken into patches, gradients for both players would be zero at many locations and hence both algorithms are expected to get stuck in locally optimal strategies depending upon their initialization. This is confirmed by configurations C2, C3, C4 and C5, which show strategies for OptGradFP and DeepFP with m = n = 1 covering only a single forest patch. Once the defender gets stuck on a forest patch, the probability of coming out of it is small since the tree density surrounding the patches is negligible. However, with more resources for the defender and the adversary (m = n = 3), DeepFP is mostly able to break out of the stagnation and both players eventually cover more than a single forest patch (see C7), whereas OptGradFP is only able to cover additional ground due to random initialization of the 3 player resources but otherwise remains stuck around a single forest patch (see C6). DeepFP is partially able to break out because the defender's best response does not rely on gradients but rather comes from a non-differentiable oracle. This shows how DeepFP can break out of local optima even in the absence of gradients if a best response oracle is provided, whereas OptGradFP relies purely on gradients and cannot overcome such situations.


A.4 Neural Network Architectures

All our models were trained using TensorFlow v1.5 on a Ubuntu 16.04 machine with 32 CPU cores and a Nvidia Tesla K40c GPU.

Cournot Game and Concave-Convex Game. Best response networks for the Cournot game and the Concave-convex game consist of a single fully connected layer with a sigmoid activation, directly mapping the 2-D input noise z ∼ N([0, 0], I2) to a scalar output qp for player p. Best response networks are trained with the Adam optimizer [16] and a learning rate of 0.05. To estimate payoffs, we use exact reward models for the game model networks. Maximum games were limited to 30,000 for the Cournot game and 50,000 for the Concave-convex game.

Forest Protection Game. The action up of player p contains the cylindrical coordinates (radii and angles) for all resources of that player. So, the best response network for the Forest protection game maps ZA ∈ R^64 to the adversary action uA ∈ R^{n×2}. It has 3 fully connected hidden layers with {128, 64, 64} units and ReLU activations. The final output comes from two parallel fully connected layers with n (number of lumberjacks) units each: (a) the first with sigmoid activations outputting n radii ∈ [0, 1], and (b) the second with linear activations outputting n angles ∈ [−∞, ∞], which are modulo-ed to be in [0, 2π] everywhere. All layers are L2-regularized with coefficient 10⁻²:

$$x_A = \text{relu}(FC_{64}(\text{relu}(FC_{64}(\text{relu}(FC_{128}(Z_A))))))$$
$$u_{A,rad} = \sigma(FC_n(x_A)); \quad u_{A,ang} = FC_n(x_A)$$

The game model takes all players' actions as inputs (i.e., matrices uD, uA of shapes (m, 2) and (n, 2) respectively) and produces two scalar rewards rD and rA. It internally converts the angles in the second columns of these inputs to the range [0, 2π]. Since the rewards should be invariant to permutations of the defender's and adversary's resources (guards and lumberjacks resp.), we first pass the input matrices through non-linear embeddings to interpret their rows as sets rather than ordered vectors (see Deep Sets [29] for details). These non-linear embeddings are shared between the rows of the input matrix and are themselves deep neural networks with three fully connected hidden layers containing {60, 60, 120} units and ReLU activations. They map each row of the matrices into a 120-dimensional vector and then add all these vectors. This effectively projects the action of each player into a 120-dimensional action embedding representation invariant to the ordering of the resources. The players' embedding networks are trained jointly as a part of the game model network. The players' action embeddings are further passed through 3 hidden fully connected layers with {1024, 512, 128} units and ReLU activations. The final output rewards are produced by a last fully connected layer with 2 hidden units and linear activation. All layers are L2-regularized with coefficient 3 × 10⁻⁴:

$$\text{emb}_p = \sum_{\text{dim}=\text{row}} \text{DeepSet}_{60,60,120}(u_p) \quad \forall p \in \{D, A\}$$
$$\hat{r}_D, \hat{r}_A = FC_2(\text{relu}(FC_{128}(\text{relu}(FC_{512}(\text{relu}(FC_{1024}(\text{emb}_D, \text{emb}_A)))))))$$

The models are trained with the Adam optimizer [16]. Note that the permutation invariant embeddings are not central to the game model network and only help to incorporate an inductive bias for this game. We also tested the game model network without the embedding networks and achieved similar performance with about 2x increase in the number of iterations since the game model would need to infer permutation invariance from data.
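A PyTorch reconstruction of this permutation-invariant game model is sketched below; it follows the layer sizes described above but is our own reading of the architecture, not the authors' released TensorFlow code, and omits the L2 regularization and angle pre-processing.

```python
import torch
import torch.nn as nn

class DeepSetEmbed(nn.Module):
    # Shared row-wise MLP followed by a sum over rows (resources), giving a
    # 120-dim embedding invariant to the ordering of a player's resources.
    def __init__(self):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(2, 60), nn.ReLU(),
                                 nn.Linear(60, 60), nn.ReLU(),
                                 nn.Linear(60, 120), nn.ReLU())

    def forward(self, u):           # u: (batch, resources, 2) cylindrical coordinates
        return self.phi(u).sum(dim=1)

class GameModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed_D, self.embed_A = DeepSetEmbed(), DeepSetEmbed()
        self.head = nn.Sequential(nn.Linear(240, 1024), nn.ReLU(),
                                  nn.Linear(1024, 512), nn.ReLU(),
                                  nn.Linear(512, 128), nn.ReLU(),
                                  nn.Linear(128, 2))   # -> (r_D_hat, r_A_hat)

    def forward(self, u_D, u_A):
        emb = torch.cat([self.embed_D(u_D), self.embed_A(u_A)], dim=-1)
        return self.head(emb)
```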

References 1. Amin, K., Singh, S., Wellman, M.P.: Gradient methods for stackelberg security games. In: UAI, pp. 2–11 (2016) 2. Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., Graepel, T.: The mechanics of n-player differentiable games. In: International Conference on Machine Learning (2018) 3. Basilico, N., Celli, A., De Nittis, G., Gatti, N.: Coordinating multiple defensive resources in patrolling games with alarm systems. In: Proceedings of the 16th Conference on Autonomous Agents and Multiagent Systems, pp. 678–686 (2017) 4. Behnezhad, S., Derakhshan, M., Hajiaghayi, M., Seddighin, S.: Spatio-temporal games beyond one dimension. In: Proceedings of the 2018 ACM Conference on Economics and Computation, pp. 411–428 (2018) 5. Cerm´ ak, J., Boˇsansk´ y, B., Durkota, K., Lis´ y, V., Kiekintveld, C.: Using correlated strategies for computing stackelberg equilibria in extensive-form games. In: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI 2016, pp. 439–445 (2016) 6. Fang, F., Jiang, A.X., Tambe, M.: Optimal patrol strategy for protecting moving targets with multiple mobile resources. In: AAMAS, pp. 957–964 (2013) 7. Ferguson, T.S.: Game Theory, vol. 2 (2014). https://www.math.ucla.edu/∼tom/ Game Theory/Contents.html 8. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400 (2017) 9. Gan, J., An, B., Vorobeychik, Y., Gauch, B.: Security games on a plane. In: AAAI, pp. 530–536 (2017) 10. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290 (2018) 11. Haskell, W., Kar, D., Fang, F., Tambe, M., Cheung, S., Denicola, E.: Robust protection of fisheries with compass. In: IAAI (2014) 12. Heinrich, J., Lanctot, M., Silver, D.: Fictitious self-play in extensive-form games. In: International Conference on Machine Learning, pp. 805–813 (2015) 13. Johnson, M.P., Fang, F., Tambe, M.: Patrol strategies to maximize pristine forest area. In: AAAI (2012) 14. Kamra, N., Fang, F., Kar, D., Liu, Y., Tambe, M.: Handling continuous space security games with neural networks. In: IWAISe: First International Workshop on Artificial Intelligence in Security (2017)


15. Kamra, N., Gupta, U., Fang, F., Liu, Y., Tambe, M.: Policy learning for continuous space security games using neural networks. In: AAAI (2018) 16. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 17. Korzhyk, D., Yin, Z., Kiekintveld, C., Conitzer, V., Tambe, M.: Stackelberg vs. Nash in security games: an extended investigation of interchangeability, equivalence, and uniqueness. JAIR 41, 297–327 (2011) 18. Krishna, V., Sj¨ ostr¨ om, T.: On the convergence of fictitious play. Math. Oper. Res. 23(2), 479–511 (1998) 19. Lanctot, M., et al.: A unified game-theoretic approach to multiagent reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 4190–4203 (2017) 20. Leslie, D.S., Collins, E.J.: Generalised weakened fictitious play. Games Econ. Behav. 56(2), 285–298 (2006) 21. Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, O.P., Mordatch, I.: Multi-agent actor-critic for mixed cooperative-competitive environments. In: Advances in Neural Information Processing Systems, pp. 6379–6390 (2017) 22. Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. In: International Conference on Machine Learning, pp. 1928–1937 (2016) 23. Perkins, S., Leslie, D.: Stochastic fictitious play with continuous action sets. J. Econ. Theory 152, 179–213 (2014) 24. Rosenfeld, A., Kraus, S.: When security games hit traffic: optimal traffic enforcement under one sided uncertainty. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-2017, pp. 3814–3822 (2017) 25. Shamma, J.S., Arslan, G.: Unified convergence proofs of continuous-time fictitious play. IEEE Trans. Autom. Control 49(7), 1137–1141 (2004) 26. Wang, B., Zhang, Y., Zhong, S.: On repeated stackelberg security game with the cooperative human behavior model for wildlife protection. In: Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems, AAMAS 2017, pp. 1751–1753 (2017) 27. Yang, R., Ford, B., Tambe, M., Lemieux, A.: Adaptive resource allocation for wildlife protection against illegal poachers. In: AAMAS (2014) 28. Yin, Y., An, B., Jain, M.: Game-theoretic resource allocation for protecting large public events. In: AAAI, pp. 826–833 (2014) 29. Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R.R., Smola, A.J.: Deep sets. In: Advances in Neural Information Processing Systems, pp. 3394– 3404 (2017)

Effective Premium Discrimination for Designing Cyber Insurance Policies with Rare Losses

Mohammad Mahdi Khalili, Xueru Zhang, and Mingyan Liu

University of Michigan, Ann Arbor, MI, USA; {khalili,xueru,mingyan}@umich.edu

Abstract. Cyber insurance like other types of insurance is a method of risk transfer, where the insured pays a premium in exchange for coverage in the event of a loss. As a result of the reduced risk for the insured and the lack of information on the insurer's side, the insured is generally inclined to lower its effort, leading to a worse state of security, a common phenomenon known as moral hazard. To mitigate moral hazard, a widely employed concept is premium discrimination, i.e., an agent/insured who exerts higher effort pays less premium. This, however, relies on the insurer's ability to assess the effort exerted by the insured. In this paper, we study two methods of premium discrimination that rely on two different types of assessment: pre-screening and post-screening. Pre-screening occurs before the insured enters into a contract and can be done at the beginning of each contract period; the result of this process gives the insurer an estimated risk on the insured, which then determines the contract terms. The post-screening mechanism involves at least two contract periods whereby the second-period premium is increased if a loss event occurs during the first period. Prior work shows that both pre-screening and post-screening are generally effective in mitigating moral hazard and increasing the insured's effort. The analysis in this study shows, however, that the conclusion becomes more nuanced when loss events are rare. Specifically, we show that post-screening is not effective at all with rare losses, while pre-screening can be an effective method when the agent perceives them as rarer than the insurer does; in this case pre-screening improves both the agent's effort level and the insurer's profit.

Keywords: Cyber insurance · Premium discrimination · Risk assessment · Rare losses

1 Introduction

Facing increasingly common cyber attacks and data breaches, organizations and businesses big and small have to invest in cyber self-protection against a myriad

(Funding note: This work is supported by the NSF under grants CNS-1616575, CNS-1739517, and ARO W911NF1810208.)


of losses, such as business interruption induced by such incidents. Organizations are also increasingly turning to cyber insurance as a form of protection for mitigating cyber risks by transferring all or part of their risks to the insurer through the purchase of a policy [1,2]. Specifically, a cyber insurance contract is between a risk averse agent and an insurer; the agent pays a premium in exchange for the insurer to provide certain coverage in the event of a loss. Risk aversion on the agent’s part makes him willing to buy insurance from the insurer to undertake the risk, resulting in reduced uncertainty for the insured agent. One of the challenges in offering an insurance contract is the lack of information on the insured, which results in the well-known moral hazard issue. In other words, the insurer is unaware of the agent’s effort in self-protection and therefore the latter’s true risk. This, together with the fact that the agent is now (partially) covered in his loss, typically leads the agent to exert less effort toward securing himself. This results in a worse state of cyber risk as compared to a scenario with uninsured agents [3–6]. The problem of designing cyber insurance policies in the presence of moral hazard has been studied in the literature and it has been shown that the impact of the insurance contract on the state of the network security depends on the insurance market [7–10]. The key idea for mitigating moral hazard is premium discrimination, i.e., the agent who invests more in self-protection gets charged less premium. Yang et al. [9] consider a competitive insurance market and show that insurance cannot improve the state of network security in the presence of moral hazard; on the other hand, in the absence of moral hazard and with observable agents’ actions, the insurer can premium discriminate and the state of security can improve as a result. Hofmann [8] studies a monopoly insurance market in the presence of a welfare-maximizing insurer. In this case, the insurer premium discriminates among high and low risk agents using imperfect information that the insurer has, and the insurance contract can incentivize the agents to exert higher effort as compared to the no-insurance case. In practice premium discrimination can be achieved in a number of ways. Traditional insurance products (e.g., auto, life, home, property) rely on actuarial models that estimate risks based on a variety of inputs obtainable through questionnaires or surveys. For instance, getting an auto insurance policy requires the submission of information on the model/year of the car being insured, the primary driver(s) of the vehicle, their age, gender, marital status, place of residence and so on, a process we refer to as pre-screening throughout the paper. The estimated risk based on this type of input directly determines the premium on the policy or a set of policies (with different choices of premium-deductible combinations) offered by the insurer. Furthermore, when a driver continues to purchase insurance over multiple years, then his/her previous driving and claims record also factor into future-year premium calculation. This latter element of premium discrimination, referred to as post-screening throughout the paper, has been shown to be effective in general: since an agent faces (potentially significantly) higher payments in the future, there is incentive for the agent to act


responsibly (exert high effort) in the present time to avoid a loss event, see e.g., Rubinstein et al. [11]. The above in principle applies to the domain of cyber insurance, but with two challenges. (1) pre-screening is much harder to do for lack of actuarial models, and (2) while cyber attacks are increasingly common collectively, for a single organization it remains a relatively rare occurrence with high losses and damages, which means post-screening may seldom come into effect. In this study we shall examine these two mechanisms separately and attempt to understand under what conditions are these effective in incentivizing the agent to exert higher effort, thereby improving the state of security. Note that rare cyber incidents are different from natural disasters that have been studied in the literature [12,13]; the latter are also rare incidents with high losses but differ from cyber incident in the following sense. The agents/insureds cannot prevent natural disasters by exerting effort. For instance, the authors in [12,13] do not consider the agent’s effort in their models as it does not affect the probability of natural disaster occurrence. On the other hand, an agent can actively and proactively work toward decreasing his chance of being attacked or an attack being successful by investing in security and addressing vulnerability. In this paper, we shall assume that data breach and loss incidents are rare for each agent but the amount of loss from a breach is extremely large. This model is reasonably borne out by recent events such as the Equifax data breach, which affected 143 million American consumers and incurred $68.6 billion in loss for the company [14]; most of these events have been unprecedented in the respective victim’s company history. Our main finding in this paper is that post-screening (which involves at least two contract periods) is not effective at all with rare loss incidents. On the other hand, pre-screening can be an effective method if the agent perceives loss incidents as rarer than the insurer does; in this case sufficiently accurate pre-screening can be effective and improves the state of security as well as the insurer’s profit as compared to not using premium discrimination. The organization of this paper is as follows. In Sect. 2 we introduce the model and contract design problem. Section 3 summarizes prior results (but recast under our model) on designing cyber insurance policies when incidents are not rare. In Sect. 4 we examine the effect of pre-screening and post-screening on both the state of security and the insurer’s profit with rare losses. Section 5 discusses how pre-screening may be used to enable an active policy, as well as dependent cyber risks. Section 6 presents numerical results and Sect. 7 concludes the paper.

2 Model

We consider the cyber insurance design problem, a principal-agent problem, between a profit-maximizing, risk-neutral insurer/principal and a risk-averse insured/agent. The agent exerts effort e toward securing himself, incurring a linear cost c · e.


Let p(e) denote the probability of a loss incident, assumed to be strictly decreasing and strictly convex. The decreasing and convexity properties imply that the initial effort toward security leads to a considerable reduction in the probability of a loss incident, and strict convexity implies that the probability of the loss incident cannot be zero even if the agent exerts high effort [15]. Specifically, we assume that the probability of a loss incident has the following form¹:

$$p(e) = t \cdot \exp\{-\alpha \cdot e\}, \tag{1}$$

where t is the nominal probability of a successful attack to the agent if he exerts zero effort (e = 0) and α is a constant. Larger α implies that investment in security is more efficient and p(.) converges to zero faster. Note that t and α both are constants and cannot be modified by the agent or the insurer. When a loss occurs, the agent suffers the amount of loss l, also a constant. This is obviously a simplification; however, our qualitative conclusions remain the same for a random loss given by a known distribution. The expected utility of the agent without any insurance contract is given by: U (e) = p(e)f (−l − ce) + (1 − p(e))f (−ce),

(2)

where f (.) is a concave function that captures the agent’s risk aversion. To make the analysis concrete, we will further assume f (.) is an exponential function with constant absolute risk aversion γ: f (y) = 1 − exp{−γ · y},

(3)

where γ is referred to as the agent's risk attitude; the higher the risk attitude, the more risk averse the agent.

2.1 Agent's Effort and Utility Without Insurance

Without insurance, the agent exerts an effort level e^o to maximize his utility:

$$e^o = \arg\max_{e \ge 0} U(e). \tag{4}$$

It is easy to see that if γc ≥ α, then e^o = 0. Intuitively, γc ≥ α implies that the cost of effort is higher than its benefit, and the agent is not able to improve his utility by exerting effort. If α > γc, then e^o is given by the first order condition. Together, we have

$$e^o = \begin{cases} 0 & \text{if } \gamma c \ge \alpha \\ \left( \frac{1}{\alpha} \ln\left( t \cdot \frac{(\alpha - \gamma c)(\exp\{\gamma l\} - 1)}{\gamma c} \right) \right)^+ & \text{if } \gamma c < \alpha \end{cases} \tag{5}$$

where (a)^+ = max{0, a}. As a result, the maximum utility of the agent outside the contract is given by

$$u^o = U(e^o) = \begin{cases} 1 - \frac{\alpha}{\alpha - \gamma c} \left( t \cdot \frac{\alpha - \gamma c}{\gamma c} (\exp\{\gamma l\} - 1) \right)^{\frac{\gamma c}{\alpha}} & \text{if } e^o > 0 \\ t \cdot (1 - \exp\{\gamma l\}) & \text{if } e^o = 0. \end{cases} \tag{6}$$
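A quick numerical check of these closed-form expressions (with arbitrary placeholder parameter values) can be done by maximizing U(e) directly:

```python
import numpy as np
from scipy.optimize import minimize_scalar

t, alpha, gamma, c, l = 0.1, 2.0, 0.5, 1.0, 10.0   # placeholder parameters, not from the paper

p = lambda e: t * np.exp(-alpha * e)
f = lambda y: 1.0 - np.exp(-gamma * y)
U = lambda e: p(e) * f(-l - c * e) + (1.0 - p(e)) * f(-c * e)

# Closed form (5): e^o = ((1/alpha) ln(t (alpha - gamma c)(e^{gamma l} - 1) / (gamma c)))^+
e_closed = 0.0
if alpha > gamma * c:
    e_closed = max(0.0, np.log(t * (alpha - gamma * c) * (np.exp(gamma * l) - 1)
                               / (gamma * c)) / alpha)

res = minimize_scalar(lambda e: -U(e), bounds=(0.0, 50.0), method="bounded")
print(e_closed, res.x)   # the closed-form and numerical maximizers should agree
```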

¹ p(e) can be written as t · (exp{−α})^e, which is a function consistent with the exponential probability function introduced in [16].

2.2 Contract Design

We will assume that in the event of a loss, a contract covers the full amount l. This is again a simplification but it allows us to get to the essence of our analysis more straightforwardly without affecting the main qualitative conclusions. Because a loss is covered in full, the agent will exert zero effort after entering an insurance contract. Thus the insurer will have to use premium discrimination to incentivize the insured to exert a higher effort in exchange for a lower premium. We next describe in detail the resulting contract design problem under two different methods of premium discrimination: post-screening and pre-screening.

Post-screening. In this case the contract design problem is framed in a two-period setting where the insurer is able to assess premium in the second period based on what happens in the first period. Such a contract is given by three parameters (π1, π2, π3): π1 is the first-period premium; in the second period, the agent pays premium π2 if a loss happened (and was covered in full) during the first period and pays π3 otherwise. Obviously π3 ≤ π2. In this case, the agent may exert non-zero effort in the first period to decrease the chance of a loss in order to reduce the likelihood of paying a higher premium in the second period. In the second period, on the other hand, the agent will always exert zero effort as the loss is fully covered and he faces no more future punishment.² We assume that when an agent enters such a contract he commits to both periods. The agent's utility inside a contract (π1, π2, π3) with post-screening is thus the summation of his utility in each period:

U^in(e, π1, π2, π3) = f(−π1 − ce) + p(e)·f(−π2) + (1 − p(e))·f(−π3),   (7)

where e is the effort in the first period. The insurer's problem is to maximize her profit subject to the Individual Rationality (IR) constraint and Incentive Compatibility (IC) constraint:

V = max_{π1, π2, π3, e}  π1 − p(e)l + p(e)(π2 − p(0)l) + (1 − p(e))(π3 − p(0)l)
s.t. (IR) U^in(e, π1, π2, π3) ≥ 2 · u^o,
     (IC) e ∈ arg max_{e′ ≥ 0} U^in(e′, π1, π2, π3).   (8)

The (IR) constraint ensures that the agent enters the contract only if he gets no lower utility than his outside option. Note that since the contract covers two periods, the comparison here is between his utility inside the contract over two periods and outside the contracts over two periods. The (IC) constraint suggests 2

Our analysis can be extended to a multi-period setting where the premium of each period depends on the agent’s history of losses, i.e., the agent’s third-period premium depends on his loss events in the first and second periods and so on.


that the agent acts in self-interest: he exerts an effort level maximizing his utility given the policy parameters. Under the contract (π1, π2, π3), by the first-order condition, the agent's optimal effort e^in is given by:

e^in(π1, π2, π3) = ( (1/(α+γc)) · ln( t · (α/(γc)) · (exp{γπ2} − exp{γπ3}) / exp{γπ1} ) )^+ if π2 > π3, and e^in(π1, π2, π3) = 0 if π2 ≤ π3.   (9)

For notational convenience, we use e^in instead of e^in(π1, π2, π3), while noting the dependency. We have the following lemma on the (IR) constraint.

Lemma 1. The (IR) constraint in the optimization problem (8) is binding.

The above lemma implies that at the optimal solution, the agent is indifferent between entering vs. not entering the contract, as expected.

Pre-screening. We now turn to the case of pre-screening. We assume the insurer can conduct a risk assessment prior to determining the contract terms; the determination mechanism is known to the agent so this is again a game of perfect information. We assume the outcome of the pre-screening is given by an assessment S = e + N, where N is a zero-mean Gaussian noise with variance σ².³ There are various ways to achieve pre-screening in practice, using surveys, penetration tests, or advanced Internet measurement techniques, see e.g., [17]. The insurer then offers the agent a contract given by two parameters (π, β), where π is the base premium and β is the assessment-dependent discount factor: the agent pays π − βS in exchange for full coverage in the event of a loss. The agent's total cost inside the contract (π, β) while exerting effort e is:

X^in = π − β · S + c · e.   (10)

As X^in follows a Gaussian distribution, using the moment-generating function the agent's expected utility under the contract is given by:

U^in(π, β, e) = E(f(−X^in)) = 1 − exp{ γπ + γ(c − β)e + γ²β²σ²/2 }.   (11)

Therefore, the insurer's design problem using pre-screening is as follows:

max_{π, β, e}  E{π − βS} − p(e) · l
s.t. (IR) U^in(π, β, e) ≥ u^o,
     (IC) e ∈ arg max_{e′ ≥ 0} U^in(π, β, e′).   (12)

Similar as in Lemma 1, we can show that the (IR) constraint is binding in this case. Thus we have the following relation between optimal contract parameters (w^o = (1/γ) ln(1 − u^o)):

π = w^o + βe − ce − γβ²σ²/2.   (13)

The analysis can be extended to other noise distributions.
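A small worked example of (11) and (13) may help: plugging the binding-(IR) premium from (13) back into (11) returns exactly the outside option u^o. The Python sketch below uses assumed parameter values (γ, c, σ, u^o and the chosen β, e are illustrative, not from the paper).

import numpy as np

gamma, c, sigma = 1.0, 1.0, 0.1        # assumed values
u_o = -0.5                             # assumed outside-option utility
w_o = (1.0 / gamma) * np.log(1.0 - u_o)

def agent_utility(pi, beta, e):
    """Agent's expected utility under contract (pi, beta) with effort e, Eq. (11)."""
    return 1.0 - np.exp(gamma * pi + gamma * (c - beta) * e + (gamma ** 2) * (beta ** 2) * (sigma ** 2) / 2.0)

def binding_ir_premium(beta, e):
    """Base premium from Eq. (13) that makes the (IR) constraint bind."""
    return w_o + beta * e - c * e - gamma * (beta ** 2) * (sigma ** 2) / 2.0

beta, e = c, 1.0                        # e.g. full marginal discount beta = c, assumed effort
pi = binding_ir_premium(beta, e)
print(f"pi = {pi:.4f}, U^in = {agent_utility(pi, beta, e):.4f}, u^o = {u_o}")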


Using (13), the insurer's problem can be simplified as follows:

V(σ) = max_{β, e}  w^o − ce − γβ²σ²/2 − p(e)l
s.t. (IC) e ∈ arg min_{e′ ≥ 0}  (c − β)e′ + γβ²σ²/2.   (14)
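The structure of (14) already suggests how the insurer chooses between premium discrimination and a flat premium: the (IC) constraint leaves only β = 0 (the agent exerts zero effort) or β = c (any effort level is incentive compatible, so the insurer can pick the effort that maximizes her objective) as candidates. A minimal numerical sketch, with assumed parameter values, compares the two candidates.

import numpy as np

gamma, c, alpha, t, l, sigma = 1.0, 1.0, 1.5, 0.5, 3.0, 0.2   # assumed values
w_o = 0.0                                                      # normalization assumption

def objective(beta, e):
    """Insurer's objective in (14)."""
    return w_o - c * e - gamma * beta ** 2 * sigma ** 2 / 2.0 - t * np.exp(-alpha * e) * l

# Candidate 1: no effective discount (beta = 0), agent exerts zero effort
v_no_screen = objective(0.0, 0.0)

# Candidate 2: beta = c; choose e maximizing the objective over a grid
grid = np.linspace(0.0, 10.0, 100001)
vals = objective(c, grid)
v_screen, e_star = vals.max(), grid[vals.argmax()]

print(f"beta=0: {v_no_screen:.4f}; beta=c: {v_screen:.4f} at e = {e_star:.4f}")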

We next summarize (known) results on these two types of premium discrimination in terms of their effectiveness in incentivizing efforts.

3 State of Security and Optimal Contract When Losses Are Not Rare

Post-screening: Post-screening has been studied in the literature. Rubinstein et al. in [11] showed that post-screening can improve the agent's effort inside the contract compared to the one-period contract without post-screening. This can be similarly observed in our model. In particular, in Theorem 1 below we introduce a sufficient condition under which the agent exerts non-zero effort in the first period of a contract with post-screening. In Sect. 6, we also provide an example where the agent inside a contract with post-screening exerts higher effort as compared to the no-insurance scenario.

Theorem 1. Let (π̂1, π̂2, π̂3, ê) be the solution of the optimization problem (8). Suppose that t = 1 and (α−γc)(exp{γl}−1)/(γc) > 1. Then ê > 0.

Theorem 1 suggests that post-screening can be an effective mechanism to incentivize non-zero effort. Note that the condition (α−γc)(exp{γl}−1)/(γc) > 1 in Theorem 1 can be satisfied if the loss l is sufficiently large.

Pre-screening: Our previous work [4] shows that pre-screening can simultaneously incentivize the agent to exert non-zero effort and improve the insurer's utility. This is characterized for the present model in the following theorem.

Theorem 2. Pre-screening incentivizes non-zero effort if and only if c < α·t·l and σ² ≤ (2/(γc²)) · (t·l − (c/α)(1 + ln(α·t·l/c))).

Theorem 3. The insurer's optimal profit under pre-screening is non-increasing in the assessment noise: if σ < σ′, then V(σ′) ≤ V(σ).

4 Optimal Contracts with Rare Losses

We now turn to rare loss incidents: the nominal loss probabilities perceived by the agent and by the insurer, ta and tp, go to zero while the loss l grows, such that the perceived expected losses la and lp remain finite, with ta < tp as l → ∞. Therefore, the agent thinks the loss is rarer than the insurer does.

4.1 Post-screening

With the above rare loss assumptions, we have the following theorem on post-screening.

Theorem 4. Using post-screening and given t → 0,
1. the agent always exerts zero effort inside the contract, and
2. at the optimal contract we have π1 = π3 = (1/γ) ln[1 − u^o] and π2 ∈ R^+.

Theorem 4 implies that premium discrimination in the second period based on the first period is not at all effective, and the insurer is not able to improve the agent's effort or her utility by post-screening as compared to a contract without premium discrimination.

By assuming that t goes to zero, the entire probability of a loss incident (i.e., p(e) = t·exp{−α·e}) goes to zero. If the agent exerts effort e, then la·exp{−α·e} and lp·exp{−α·e} are the perceived expected losses from the agent's and the insurer's perspectives, respectively.
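The first claim of Theorem 4 can be illustrated numerically: for any fixed contract with π2 > π3, the first-period best response in (9) collapses to zero as the nominal loss probability vanishes. A minimal Python sketch, with assumed contract and parameter values:

import numpy as np

gamma, c, alpha = 1.0, 1.0, 1.5        # assumed values
pi1, pi2, pi3 = 0.4, 0.9, 0.3          # assumed contract with pi2 > pi3

def e_in(t):
    """Agent's first-period effort under (pi1, pi2, pi3), Eq. (9)."""
    arg = t * (alpha / (gamma * c)) * (np.exp(gamma * pi2) - np.exp(gamma * pi3)) / np.exp(gamma * pi1)
    return max(0.0, (1.0 / (alpha + gamma * c)) * np.log(arg))

for t in [1.0, 1e-1, 1e-2, 1e-3]:
    print(f"t = {t:>6}: e_in = {e_in(t):.4f}")

Once the argument of the logarithm drops below one, the best response is pinned at zero, regardless of how large the premium penalty π2 − π3 is.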

4.2 Pre-screening

For pre-screening, it turns out perception asymmetry makes a difference. The following theorem characterizes the optimal contract and introduces a sufficient condition under which pre-screening can incentivize the agents to exert non-zero effort inside the optimal contract.

Theorem 5. Pre-screening can incentivize non-zero effort under the rare loss model, if and only if c < α·lp and σ² ≤ (2/(γc²)) · (lp − (c/α)(1 + ln(α·lp/c))).

Theorem 6. Consider the active policy that adds a second pre-screening during the contract term (optimization problem (23), with b ≤ c). The optimal discount factors satisfy β = c and β′ = b if ē > 0; otherwise they are β = β′ = 0. Moreover, if ē > 0, then ê = ē. Lastly, we have V(σ) ≤ R(σ), where V(σ) is obtained from (14) by assuming there is only one pre-screening and the agent does not lower his effort afterward, with equality achieved if b = c.

The last part of the theorem above suggests that performing the second screening helps the insurer to improve profit even when the agent may be assumed not to lower his effort. This is because the second pre-screening decreases the variance and uncertainty in the agent's utility. Therefore, a risk-averse agent is willing to pay more premium when the uncertainty and variance on his side decrease.

5.2 Insuring Interdependent Agents

So far we have assumed that the probability of a loss incident is solely determined by the effort of the agent. On the other hand, risk dependency is a unique feature of cyber risks: the incident probability for an agent may depend on the effort levels of other agents (the former’s vendors or service providers, etc.). In our previous work [4], we considered a cyber insurance market in the presence of risk dependency, and showed that the insurer can achieve higher profit as compared to a network of independent agents; moreover, pre-screening in such a case increases the agents’ efforts as compared to the no insurance scenario. If we introduce security dependency into our rare loss model, it can be shown that post-screening is not able to incentivize non-zero effort while pre-screening can. Table 1 summarizes the role of dependency and rare loss on the agents’ effort, where (∗) indicates the associated result holds under certain conditions.


Table 1. Comparing agent's effort inside (e^in) and outside (e^o) a contract

                                      Pre-screening       Post-screening
Rare loss, independent agents         e^in > e^o (∗)      e^in = 0
Rare loss, dependent agents           e^in > e^o (∗)      e^in = 0
Frequent loss, independent agents     e^in ≥ e^o (∗)      e^in > e^o (∗)
Frequent loss, dependent agents       e^in > e^o          e^in > e^o (∗)

6 Numerical Results

We show a number of numerical examples with the following parameters: γ = c = 1, α = 1.5.

6.1 Frequent Losses: Post-screening

Our first example shows when post-screening may be effective in incentivizing the agent to exert higher effort as compared to the no-insurance scenario. Consider a scenario where the nominal probability of attack is t = 1. Figure 1 illustrates the agent's effort in the first period as a function of loss l. We note that post-screening can be an effective mechanism to incentivize the agent to exert non-zero effort inside a contract with full coverage. In this example, the agent exerts higher effort as compared to the no-insurance scenario when l ≤ 0.7. This is because the loss is relatively low; even without insurance the agent is not willing to exert substantial effort, as the cost of effort is higher than the actual loss. Within a contract, the insurer is able to incentivize the agent to exert higher effort by imposing a large penalty (a much higher premium in the second period).


Fig. 1. Post-screening: agent’s effort vs. loss (l)
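Only the no-insurance benchmark in Fig. 1 has a closed form available in the text; a minimal Python sketch (using the stated parameters γ = c = 1, α = 1.5 and t = 1) evaluates the outside effort e^o(l) from (5). The effort inside the contract additionally requires solving the contract design problem (8) and is not reproduced here.

import numpy as np

gamma, c, alpha, t = 1.0, 1.0, 1.5, 1.0

def e_outside(l):
    """No-insurance effort e^o(l) from Eq. (5)."""
    arg = t * (alpha - gamma * c) * (np.exp(gamma * l) - 1.0) / (gamma * c)
    return max(0.0, (1.0 / alpha) * np.log(arg)) if arg > 0 else 0.0

for l in [0.25, 0.5, 0.7, 1.0, 1.5, 2.0]:
    print(f"l = {l:>4}: e^o = {e_outside(l):.4f}")

Consistent with the discussion above, the outside effort is zero for small losses and only becomes positive once l is large enough.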

6.2 Rare Losses: Pre-screening

Our second example examines the effect of pre-screening on the agent's effort. Consider a scenario where ta, tp go to zero and l goes to infinity. Moreover, assume la = 5 and σ = 0.1. Figure 2 illustrates the agent's effort inside and outside the insurance contract with pre-screening. We see that the agent exerts non-zero effort inside the insurance contract and the effort increases as lp increases. Note that outside a contract the agent's effort is a function of his perceived loss la and does not change with lp. On the other hand, inside the contract, as the insurer's perceived loss lp increases, the insurer incentivizes the agent to increase his effort using premium discrimination (high premium for low pre-screening outcomes).

Fig. 2. Pre-screening: agent's effort vs. loss (lp)

Figure 3 illustrates the insurer's utility as a function of lp. This figure implies that the insurer's utility is negative for lp ≥ 85. Therefore, she does not insure the agent if lp ≥ 85. Also, as expected, the insurer's utility decreases as the perceived expected loss lp increases. The reason is that as the perceived expected loss increases, the insurer expects to pay more coverage and make less profit.

Fig. 3. Insurer's utility vs. loss (lp)

7 Conclusion

We studied the problem of designing cyber insurance contracts between a single profit-maximizing, risk-neutral insurer and a risk-averse agent. We showed that a multi-period contract is an effective method of premium discrimination if loss incidents are frequent. We then considered rare but severe losses, which are a common theme of cyber risks. In this case, we showed that a multi-period contract is not effective in improving the agent's effort: the agent exerts zero effort inside a contract with full coverage. By contrast, pre-screening is shown to allow the insurer to assess the agent's state of security and to discriminate premiums properly so as to incentivize effort by the agent within the contract. We further discussed how the pre-screening result enables a type of active policy where periodic pre-screening within the same contract term can not only ensure the agent does not lower his effort after the initial assessment but also allows the insurer to improve her profit.

Appendix

Proof (Lemma 1). Proof by contradiction. Let (π̂1, π̂2, π̂3) be the solution of optimization problem (8), and assume that the (IR) constraint is not binding at the optimal contract (π̂1, π̂2, π̂3). Because the (IR) constraint is not binding, the insurer can increase her utility by increasing π̂2, π̂3 while she keeps exp{γπ̂2} − exp{γπ̂3} fixed. Therefore, based on (9) the agent's effort inside the contract does not change, but the insurer's profit increases. As a result, (π̂1, π̂2, π̂3) is not an optimal contract. This is the contradiction implying that the (IR) constraint is binding. □

Proof (Theorem 1). Proof by contradiction: assume that ê = 0 and t = 1 and (α−γc)(exp{γl}−1)/(γc) > 1. First we show that under these assumptions, π̂1 = π̂2 = (1/γ) ln(1 − u^o) := w^o. Because ê = 0 and t = 1, the optimization problem for finding (π̂1, π̂2, π̂3) is as follows,

max_{π1, π2, π3}  π1 + π2 − 2l
s.t. (IR) 1 − exp{γπ1} + 1 − exp{γπ2} = 2u^o,
     (IC) 0 = e^in(π1, π2, π3).   (24)

By the (IR) constraint we have

(1/γ) ln(2 − 2u^o − exp{γπ1}) = π2.   (25)


Therefore, we re-write the optimization problem (24) as follows,

max_{π1, π2, π3}  π1 + (1/γ) ln(2 − 2u^o − exp{γπ1}) − 2l
s.t. (IC) 0 = e^in(π1, π2, π3),  (1/γ) ln(2 − 2u^o − exp{γπ1}) = π2.   (26)

Because π3 does not appear in the objective function, we first find π1 and π2 such that they maximize the objective function. Then, we pick π3 such that the (IC) constraint is satisfied. By the first-order optimality condition for the objective function, we have

π̂1 = π̂2 = (1/γ) ln(1 − u^o).   (27)

Without loss of generality, we set π̂3 = (1/γ) ln( ((α−γc)/α) · (1 − u^o) ). By (9), ê = 0 (notice that (α/(γc)) · (exp{γπ̂2} − exp{γπ̂3}) / exp{γπ̂1} = 1, and a slight decrease in π̂3 increases the agent's effort based on (9)). Now we show that the decrease in π̂3 increases the insurer's payoff. Notice that a slight decrease in π̂3 increases the agent's effort (based on (9)) and improves the agent's utility, so the (IR) constraint is not violated. We write the insurer's objective function as a function of π3. Therefore, we have (derivatives in the following equation are left derivatives),

h(π3) = π̂1 − p(e^in(π̂1, π̂2, π3))·(l − π̂2) + (1 − p(e^in(π̂1, π̂2, π3)))·π3 − l,

∂h/∂π3 |_{π3=π̂3} = ∂p(e^in(π̂1, π̂2, π3))/∂π3 · (π̂2 − l) − ∂p(e^in(π̂1, π̂2, π3))/∂π3 · π3 + (1 − p(e^in(π̂1, π̂2, π3)))
                 = ∂p(e^in(π̂1, π̂2, π3))/∂π3 |_{π3=π̂3} · (−l + π̂2 − π̂3) + (1 − p(e^in(π̂1, π̂2, π̂3))).

Because (α−γc)(exp{γl}−1)/(γc) > 1, (5) implies that e^o is not zero and π̂2 = (1/γ) ln(1 − u^o) < l. Moreover, ∂p(e^in(π̂1, π̂2, π3))/∂π3 |_{π3=π̂3} > 0 implies that ∂h/∂π3 |_{π3=π̂3} < 0. Therefore, the decrease in π̂3 increases the insurer's payoff. This is a contradiction, and the agent exerts non-zero effort in the optimal contract under the given assumptions. □

Proof (Theorem 2). By (14), the agent exerts non-zero effort in a contract if β = c. If the discount factor β = c, then any positive number satisfies the (IC) constraint. Therefore, if β = c, then the desired effort maximizes the insurer's utility. By (14), we have

e = arg max_e  w^o − ce − t·l·exp{−α·e} − γc²σ²/2.   (28)


By the first-order condition of optimality, the solution of the above optimization problem is e = ((1/α) ln(α·t·l/c))^+. Moreover, if e > 0, then the maximum insurer's profit using pre-screening (i.e., β = c) is given by

w^o − (c/α) ln(α·t·l/c) − c/α − γc²σ²/2.   (29)

Without pre-screening (i.e., β = 0), the agent exerts zero effort and the insurer's profit is given by

w^o − t·l.   (30)

Therefore, the insurer uses pre-screening if and only if

(1/α) ln(α·t·l/c) > 0  and  w^o − (c/α) ln(α·t·l/c) − c/α − γc²σ²/2 ≥ w^o − t·l.   (31)

In other words, the insurer uses pre-screening and the agent exerts non-zero effort if and only if

α·t·l/c > 1  and  σ² ≤ (2/(γc²)) · (t·l − (c/α)(1 + ln(α·t·l/c))).   (32)  □
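As a sanity check on the proof above, the comparison of (29) and (30) can be verified to coincide with condition (32). The following Python sketch uses assumed parameter values.

import numpy as np

gamma, c, alpha, t, l, sigma, w_o = 1.0, 1.0, 1.5, 0.5, 3.0, 0.2, 0.0   # assumed values

profit_pre = w_o - (c / alpha) * np.log(alpha * t * l / c) - c / alpha - gamma * c ** 2 * sigma ** 2 / 2.0  # Eq. (29)
profit_no = w_o - t * l                                                                                      # Eq. (30)

cond_32 = (alpha * t * l / c > 1.0) and (
    sigma ** 2 <= (2.0 / (gamma * c ** 2)) * (t * l - (c / alpha) * (1.0 + np.log(alpha * t * l / c)))
)

print(f"(29) = {profit_pre:.4f}, (30) = {profit_no:.4f}, "
      f"prefer pre-screening: {profit_pre >= profit_no}, condition (32): {cond_32}")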

Proof (Theorem 3). Assume σ < σ′. Let g(β, e, σ) = w^o − ce − γβ²σ²/2 − p(e)l. It is easy to see that g(β, e, σ′) ≤ g(β, e, σ). Therefore, we have

max_{β, e: (IC)} g(β, e, σ′) ≤ max_{β, e: (IC)} g(β, e, σ).

Therefore, V(σ′) ≤ V(σ). □

Proof (Theorem 4).
– By (9), the agent exerts zero effort if ta · (α/(γc)) · (exp{γπ2} − exp{γπ3}) / exp{γπ1} ≤ 1. Because ta goes to zero, ta · (α/(γc)) · (exp{γπ2} − exp{γπ3}) / exp{γπ1} also goes to zero. Therefore, the agent exerts zero effort under any insurance contract.
– Because the agent exerts zero effort inside the optimal contract, his utility is given by

U^in(0, π1, π2, π3) = −exp{γπ1} − ta·exp{γπ2} − (1 − ta)·exp{γπ3}.

(IR) is binding and ta → 0  ⇒  1 − exp{γπ1} + 1 − exp{γπ3} = 2u^o.   (33)


Therefore, the insurer's problem (8) can be written as follows,

max_{π1, π2, π3}  π1 + π3 − 2·lp   s.t.  exp{γπ1} + exp{γπ3} = 2 − 2u^o,   (34)

or

max_{π1}  π1 + (1/γ) ln(2 − 2u^o − exp{γπ1}) − 2·lp.   (35)

The optimal solution for the above optimization problem is π1 = π3 = (1/γ) ln(1 − u^o), and the value of π2 does not affect the insurer's or the agent's utility and can be any positive value. □

Proof (Theorem 5). The proof is similar to the proof of Theorem 2 except that we should substitute lp for t·l. □

Proof (Theorem 6). As the (IR) constraint is binding in (23), similar to (14) we can re-write optimization problem (23) as follows,

R(σ) = max_{β, e, β′, e′}  w^o − ce + b(e − e′) − γ·((β−β′)²σ² + (β′)²σ²)/2 − p(e′)l   (36)
s.t. (IC) (e, e′) ∈ arg min_{ẽ ≥ ẽ′}  γ(c − b + β′ − β)ẽ + γ(−β′ + b)ẽ′.

First we show that ê = ê′. Proof by contradiction. Assume ê > ê′ ≥ 0. Then, β′ − β = b − c, since otherwise ê = ∞ or ê = 0. As b ≤ c, the objective function of (36) can be improved by decreasing ê without violating the (IC) constraint. This contradiction shows that ê = ê′. By the (IC) constraint, it is easy to see that if ê = ê′ > 0, then β = c and β′ = b.
Let β = β̄, e = ē be the solution to (14). According to the (IC) constraint of (14), two cases can happen:
(i) β̄ = 0 and ē = 0. Then, (β = β′ = e = e′ = 0) satisfies the (IC) constraint in (36) and is a feasible point. We have

w^o − cē − γβ̄²σ²/2 − p(ē)l = w^o − ce + b(e − e′) − γ·((β−β′)² + (β′)²)σ²/2 − p(e′)l.   (37)

(ii) β̄ = c. Then (β = c, β′ = b, e = e′ = ē) is a feasible point for (36) and satisfies the (IC) constraint. We have

w^o − cē − γc²σ²/2 − p(ē)l ≤ w^o − c·ē + b(ē − ē) − γ·((c−b)² + b²)σ²/2 − p(ē)l.   (38)

Note that in this case (β = c, β′ = b, e = e′ = ē) is the solution to (36). By (37) and (38) we have V(σ) ≤ R(σ). Notice that if b = c, then (36) and (14) are equivalent and V(σ) = R(σ) as ê = ê′. □
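The unconstrained step in the proof of Theorem 4, namely that the maximizer of (35) is π1 = (1/γ) ln(1 − u^o), can also be confirmed numerically. The sketch below uses assumed values for γ, u^o and lp.

import numpy as np

gamma, u_o, l_p = 1.0, -0.5, 2.0          # assumed values

def g(pi1):
    """Objective of Eq. (35), with a guard against the log of a non-positive number."""
    inner = 2.0 - 2.0 * u_o - np.exp(gamma * pi1)
    return np.where(inner > 0, pi1 + (1.0 / gamma) * np.log(np.maximum(inner, 1e-12)) - 2.0 * l_p, -np.inf)

grid = np.linspace(-2.0, 1.0, 300001)
pi1_star = grid[np.argmax(g(grid))]
print(f"grid maximizer = {pi1_star:.4f}, closed form = {np.log(1.0 - u_o) / gamma:.4f}")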


References
1. Tosh, D.K., et al.: Three layer game theoretic decision framework for cyber-investment and cyber-insurance. In: Rass, S., An, B., Kiekintveld, C., Fang, F., Schauer, S. (eds.) GameSec 2017. LNCS, vol. 10575, pp. 519–532. Springer International Publishing, Cham (2017). https://doi.org/10.1007/978-3-319-68711-7_28
2. Vakilinia, I., Sengupta, S.: A coalitional cyber-insurance framework for a common platform. IEEE Trans. Inf. Forensics Secur. 14(6), 1526–1538 (2018)
3. Lelarge, M., Bolot, J.: Economic incentives to increase security in the internet: the case for insurance. In: Proceedings of IEEE INFOCOM, pp. 1494–1502 (2009)
4. Khalili, M.M., Naghizadeh, P., Liu, M.: Designing cyber insurance policies: the role of pre-screening and security interdependence. IEEE Trans. Inf. Forensics Secur. PP(99), 1 (2018)
5. Shetty, N., Schwartz, G., Walrand, J.: Can competitive insurers improve network security? In: Acquisti, A., Smith, S.W., Sadeghi, A.-R. (eds.) Trust 2010. LNCS, vol. 6101, pp. 308–322. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-13869-0_23
6. Schwartz, G., Shetty, N., Walrand, J.: Cyber-insurance: missing market driven by user heterogeneity (2010). www.eecs.berkeley.edu/nikhils/SecTypes.pdf
7. Zhang, R., Zhu, Q., Hayel, Y.: A bi-level game approach to attack-aware cyber insurance of computer networks. IEEE J. Sel. Areas Commun. 35(3), 779–794 (2017)
8. Hofmann, A.: Internalizing externalities of loss prevention through insurance monopoly: an analysis of interdependent risks. Geneva Risk Insur. Rev. 32(1), 91–111 (2007)
9. Yang, Z., Lui, J.C.: Security adoption and influence of cyber-insurance markets in heterogeneous networks. Perform. Eval. 74, 1–17 (2014)
10. Khalili, M.M., Liu, M., Romanosky, S.: Embracing and controlling risk dependency in cyber insurance policy underwriting. In: The Annual Workshop on the Economics of Information Security (WEIS) (2018)
11. Rubinstein, A., Yaari, M.E.: Repeated insurance contracts and moral hazard. J. Econ. Theory 30(1), 74–97 (1983). http://www.sciencedirect.com/science/article/pii/0022053183900947
12. Slovic, P., Fischhoff, B., Lichtenstein, S., Corrigan, B., Combs, B.: Preference for insuring against probable small losses: insurance implications. J. Risk Insur. 44(2), 237–258 (1977). http://www.jstor.org/stable/252136
13. Raschky, P.A., Weck-Hannemann, H.: Charity hazard - a real hazard to natural disaster insurance? Environ. Hazards 7(4), 321–329 (2007). http://www.sciencedirect.com/science/article/pii/S174778910700049X
14. Cox, J.: Equifax stung with multibillion-dollar class-action lawsuit after massive data breach (2017). http://www.thedailybeast.com/equifax-stung-with-multibillion-dollar-class-action-lawsuit-after-massive-data-breach
15. Jiang, L., Anantharam, V., Walrand, J.: How bad are selfish investments in network security? IEEE/ACM Trans. Netw. 19(2), 549–560 (2010)
16. Gordon, L.A., Loeb, M.P.: The economics of information security investment. ACM Trans. Inf. Syst. Secur. 5(4), 438–457 (2002). https://doi.org/10.1145/581271.581274
17. Liu, Y., et al.: Cloudy with a chance of breach: forecasting cyber security incidents. In: Proceedings of the 24th USENIX Security Symposium (2015)

Analyzing Defense Strategies Against Mobile Information Leakages: A Game-Theoretic Approach

Kavita Kumari¹, Murtuza Jadliwala¹, Anindya Maiti², and Mohammad Hossein Manshaei³

¹ Department of Computer Science, University of Texas at San Antonio, San Antonio, TX 78249, USA ({kavita.kumari,murtuza.jadliwala}@utsa.edu)
² Institute for Cyber Security, University of Texas at San Antonio, San Antonio, TX 78249, USA ([email protected])
³ Department of Electrical and Computer Engineering, Isfahan University of Technology, 84156-83111 Isfahan, Iran ([email protected])

Abstract. Abuse of zero-permission sensors (e.g., accelerometers and gyroscopes) on-board mobile and wearable devices to infer users' personal context and information is a well-known privacy threat, and has received significant attention in the literature. At the same time, efforts towards relevant protection mechanisms have been ad hoc, mainly focusing on threat-specific approaches that are not very practical, thus garnering limited adoption within popular mobile operating systems. It is clear that privacy threats that take advantage of unrestricted access to these sensors can be prevented if they are effectively regulated. However, the importance of these sensors to all applications operating on the mobile platform, including the dynamic sensor usage and requirements of these applications, makes designing effective access control/regulation mechanisms difficult. Moreover, this problem is different from classical intrusion detection, as these sensors have no system- or user-defined policies that define their authorized or correct usage. Thus, to design effective defense mechanisms against such privacy threats, a clean-slate approach that formalizes the problem of sensor access (to zero-permission sensors) on mobile devices is first needed. The paper accomplishes this by employing game theory, specifically signaling games, to formally model the strategic interactions between mobile applications attempting to access zero-permission sensors and an on-board defense mechanism attempting to regulate this access. Within the confines of such a formal game model, the paper then outlines conditions under which equilibria can be achieved between these entities on a mobile device (i.e., applications and defense mechanism) with conflicting goals. The game model is further analyzed using numerical simulations, and also extended in the form of a repeated signaling game.

1 Introduction

Modern mobile and wearable devices, equipped with state-of-the-art sensing and communication capabilities, enable a variety of novel context-based applications such as social networking, activity tracking, wellness monitoring and home automation. The presence of a diverse set of on-board sensors, however, also provide an additional attack surface to applications intending to infer personal user information in an unauthorized fashion. In order to thwart such privacy threats, most modern mobile operating systems (including, Android and iOS) have introduced stringent access controls on front-end or user-accessible sensors, such as microphone, camera and GPS. As a result, the focus of adversarial applications has now shifted to employing on-board sensors that are not guarded by strong user or system-defined access control policies. Examples of such back-end or userinaccessible sensors include accelerometer, gyroscope, power meter and ambient light sensor, and we refer to these as zero-permission sensors. As all installed applications have access to them by default, and that they cannot be actively disengaged by users on an application-specific basis, these zero-permission sensors pose a significant privacy threat to mobile device users, as it has been extensively studied in the security literature [1,5,6,8,10–19,21,23–26]. At the same time, development of efficient and effective protection mechanisms against such privacy threats is still an open problem [2]. One of the main reasons why zero-permission sensors have limited or no access control policies associated with them is because they are required by a majority of applications (accessed by means of a common set of libraries or APIs) primarily for efficient and user-friendly operation on the device’s small and constrained form factor and display. For instance, gyroscope data is used by applications to re-position front-ends (or GUIs) depending device orientation, while an ambient light sensor is used to update on-screen brightness. Thus, a straightforward approach of completely blocking access or reducing the frequency at which applications can sample data from these sensors is not feasible, as it will significantly impact their usability. Alternatively, having a static access control policy for each application is also not practical as it will become increasingly complex for users to manage these policies. Moreover, such an approach will not protect against applications that gain legitimate access to these sensors (based on such static policies). Given that all applications (with malicious intentions or not) can request access to these sensors without violating any system security policy, an important challenge for a defense mechanism is to differentiate between authentic sensor access requests and requests that could be potentially misused. In order to begin addressing this long-standing open problem, we take a clean-slate approach by first formally (albeit, realistically) modeling the strategic interactions between (honest or potentially malicious) mobile applications and an on-board defense mechanism that cannot differentiate between their (sensor access) requests. We employ game-theory as a vehicle for modeling and analyzing these interactions. Specifically, we model the following scenario. A defense mechanism on a mobile operating system receives requests to access zero-permission sensors from two different types of applications: honest and malicious. Each of


these applications could send either a normal or a suspicious request for access to on-board zero-permission sensors. A request could be classified as suspicious or normal (non-suspicious) based on the context, frequency or amount of requested sensor data. Although honest applications would typically make normal requests, they could also make suspicious requests depending on application- or context-specific operations and requirements to improve overall application performance and usability. The goal of malicious applications, on the other hand, is to successfully infer private user data from these requests. Normal requests would give them some (probably not enough) data to carry out these privacy threats; however, suspicious requests could give them additional critical data either to amplify or increase the success probability of their attacks. The defense mechanism, on receiving the request, has one of the following two potential responses: (i) accept the request and release the requested sensor data, or (ii) block the request, preventing any data being released to the requesting application. It should be noted that the defense mechanism does not know the type of the application (i.e., honest or malicious) sending a particular request (i.e., suspicious or non-suspicious), as all mobile applications can currently request zero-permission sensor data without raising a flag or violating any policy. In other words, the defense mechanism has imperfect information on the type of application sending the request. The requesting application, on the other hand, has perfect information about its type and the potential strategies of the defense mechanism. Given this scenario, the following are the main technical contributions of this paper:

1. We first formally model the strategic interactions between mobile applications and a defense mechanism (outlined above) using a two-player, imperfect-information game, called the signaling game [3]. We refer to it as the Sensor Access Signaling Game.
2. Next, we solve the Sensor Access Signaling Game by deriving both the pure- and mixed-strategy Perfect Bayesian Nash Equilibria (PBNE) strategy profiles possible in the game.
3. Finally, by means of numerical simulations, we examine how the obtained game solutions or equilibria evolve with respect to different system (or game) parameters in both the single-stage and repeated (more practical) scenarios.

Our game-theoretic model, and the related preliminary results, is the first clean-slate attempt to formally model the problem of protecting zero-permission sensors on mobile platforms against privacy threats from strategic applications and adversaries (with unrestricted access to them). Our hope is that this model will act as a good starting point for designing efficient, effective and incentive-compatible strategies for protecting against such threats.

2 Sensor Access Signaling Game

System Model. Our system (Fig. 1a) comprises two key entities residing on a user's (mobile) device. The first is applications (APP) that utilize, and thus need access to, data from zero-permission sensors. We consider two types


of applications: Honest ( HA) and Malicious ( M A). Honest applications provide some useful service to the end-user with the help of zero-permission sensor data, while malicious applications would like to infer personal/private information about the user in the guise of offering some useful service. Both honest and malicious applications can request sensor data in a manner which may look normal/non-suspicious or suspicious (details next), regardless of their intentions or use-cases. The second entity is a sensor access regulator, which we refer to as the Defense Mechanism ( DM ). All sensor access requests (by all applications) must pass through and processed by the DM . The ideal functionality that the DM would like to achieve is to block sensor requests coming from M As, while allowing requests from HAs. As noted earlier, the DM itself does not know the type (i.e., honest or malicious) of application requesting sensor access - otherwise the job of the DM is trivial. This is also a practical assumption as currently all applications can access these sensors without violating any system/user-defined policy (to clarify, there is currently no way to set access control policies for zero-permission sensors on most mobile platforms). As the DM has no way of certainly knowing an application’s true intentions (and thus, its type), it must rely on the received request (suspicious or non-suspicious, as described next) and its belief about the requesting application’s type to determine whether it poses a threat to user privacy or not. Suspicious and Non-suspicious Requests. Zero-permission sensor access requests by the applications (to the DM ) can be classified as either suspicious (S) or non-suspicious (N S). Such a classification (generally, system-defined) can be accomplished using contextual information available to both the applications and the defense mechanism, such as, frequency, time, sampling rate, and relevance (according to the advertised type of service offered by the application) of these requests. Although there are several efforts in the literature in the direction of determining sensor over-privileges in mobile platforms [4,7], we abstract away this detail to keep our model general. We, however, assume that malicious applications are able to masquerade themselves perfectly as honest applications (in terms of the issued sensor requests), which is easy to accomplish when the target of these applications is zero-permission sensors. Other System Parameters. The strategic interactions between the (honest or malicious) AP P and DM can be characterized using several system parameters which we summarize in Table 1. In addition to identifying these parameters, we also establish the relationship between these parameters by considering realistic network and system constraints as discussed next. For example, if the cost of an application processing a successful S request (i.e., cS ) or N S request (i.e., cN S ) is expressed in terms of the CPU utilization (of the application), then it is clear that cS ≥ cN S because suspicious requests would usually solicit finegrained (high sampling rate) sensor data compared to non-suspicious requests, thus requiring more processing time. By a similar rationale, ψ S ≥ ψ N S , where ψ S and ψ N S are the costs to a DM (or the system) for processing a S or N S request, respectively. Now, the cost to the HA in terms of loss in usability when its request is blocked by DM (i.e., γ) and benefit for the HA in terms of gain in


(b) Extensive form of the Sensor Access Signaling Game GD =< P, T, S, A, U, θ, (p, q) >.

Fig. 1. Overview of the system and game models.

usability when its request is allowed by the DM (i.e., σ) are inversely proportional (γ ∝ 1/σ). Similarly, benefit to the M A when it’s request is allowed by DM (α) can be expressed in terms of monetary gains. An acute example would be if M A is able to successfully infer user’s banking credentials using sensor data [10,12,23,25], and uses it for theft. A more clement example of monetary gain could be through selling contextual data (inferred from sensor data) to advertising companies, without user’s consent. Accordingly, M A is set back with a proportional cost (τ ) if its request is rejected by DM , i.e., α ∝ τ . On the other hand, DM ’s cost of allowing a M A’s request (φ) versus benefit to the DM for blocking M A’s request (β) are also inversely proportional (φ ∝ 1/β). DM ’s cost of allowing a M A’s request is essentially borne by the user, but since the DM is working in the best interest of the user, we combine their costs and benefits. Consequently, in case DM blocks an HA’s request, it incurs a cost (κ) representing loss of utility/usability for the user. Lastly, we also capture the difference in benefits for M A and HA, in case they send out a S versus N S request, as u


Table 1. System entities and parameters.

Symbol   Definition
DM       Defense Mechanism
HA       Honest Application
MA       Malicious Application
θ        Probability that nature selects MA
S        Suspicious sensor request
NS       Non-suspicious sensor request
q        Belief probability of the DM that the requester is of type MA on receiving a S request
p        Belief probability of the DM that the requester is of type MA on receiving a NS request
B        DM response to block a sender request
A        DM response to allow a sender request
c^S      Cost of an application processing a successful S request
c^NS     Cost of an application processing a successful NS request
γ        Cost to the HA when its request is blocked by DM
ψ^S      Cost of a DM processing a S request
ψ^NS     Cost of a DM processing a NS request
φ        Cost to the DM when MA's request is allowed
τ        Cost to the MA when its request is blocked by the DM
κ        Cost to the DM when HA's request is blocked
α        Benefit to the MA when its request is allowed by the DM
β        Benefit to the DM for blocking MA's request
σ        Benefit to the HA when its request is allowed by the DM
u        Benefit difference to MA for sending S instead of NS
v        Benefit difference to HA for sending S instead of NS
m        Probability with which MA plays the S strategy
n        Probability with which HA plays the S strategy
x        Probability with which DM plays the B strategy on receiving a NS request
y        Probability with which DM plays the B strategy on receiving a S request

and v, respectively. In essence, u denotes the gain in benefit due to M A’s better inference accuracy caused by sensor data obtained from S, and v denotes the improvement of HA’s utility/usability due to sensor data obtained from S. We also assume that these different (discrete) costs and benefits are appropriately scaled and normalized such that their absolute values lie in the same range of real values. Next, we outline the signaling game formulation to capture the strategic interaction between the mobile applications (requesting zero-permission sensor access) and the defense mechanism (attempting to regulating these requests). Game Model. A classical signaling game [3] is a sequential two-player incomplete information game in which Nature starts the game by choosing the type of the first player or player 1. Player 1 is the more informed out of the two players since it knows the choice of Nature and can send signals to the less informed player, i.e., player 2. Player 2 is uncertain about the type of player 1, and must decide its strategic response solely based on the signal received from player 1. In other words, player 2 must decide its best response to player 1’s signal without any knowledge about the type of player 1. Both players receive some utility (payoff) depending on the signal, type of player 1 and the response by player 2 (to player 1’s signal). Both the players are assumed to be rational and are interested in solely maximizing their individual payoffs.


Given the above generic description of the signaling game, let us briefly describe how our zero-permission sensor access scenario naturally lends itself as a single-stage signaling game. We refer to this game as the Sensor Access Signaling Game and is formally represented as GD = P, T, S, A, U, θ, (p, q), where P is the set of players, T is the set of player 1 types, S is the set of player 1 signals, A is the set of player 2 actions, U is the payoff/utility function, θ is the Nature’s probability distribution function, and (p, q) are player 2’s belief functions about player 1’s type. Each sensor access request by an application can be modeled as a single stage of the above signaling game. In each such stage, P contains two players, i.e., AP P which is player 1 and the DM which is player 2. As there are two types of applications (or player 1), i.e., honest (HA) and malicious (M A), T ≡ {HA, M A}. As applications can send two types of signals (or requests), i.e., suspicious (S) and non-suspicious (N S), S ≡ {S, N S}. As the DM (or player 2) takes two types of actions depending on the received signal from player 1, i.e., Allow (A) or Block (B), A ≡ {A, B}. The utility function U : T × S × A → (R, R) assigns a real-valued payoff to each player (at the end of the stage) based on the benefit received and the cost borne by each player, and is outlined in the extensive form of the game depicted in Fig. 1b. The first utility in the pair is the AP P ’s utility denoted as UAP P , while the second utility in the pair is the DM ’s utility denoted as UDM . Lastly, let ΓAP P ={μAP P |∀ti ∈ T, λ∈S μAP P (λ|ti ) = 1; ∀ti ∈ T} and ΓDM = {μDM |∀λ ∈ S, a∈A μDM (a|λ) = 1; ∀λ ∈ S} be the strategy spaces for AP P and DM , respectively. A strategy μAP P for the AP P and μDM for the DM can be either pure or mixed, as identified by parameters m, n, y and x in Fig. 1b. For pure strategies m, n, y, x ∈ {0, 1}, while for mixed strategies 0 < m, n, y, x < 1. Moreover, let us represent each of the DM ’s belief functions by conditional (posterior) probability distributions as q = P r(M A|S) and p = P r(M A|N S), which also imply that 1 − q = P r(HA|S) and 1 − p = P r(HA|N S). Now, let’s characterize the set of equilibrium strategies in GD , i.e., a set of strategy pairs that are mutual best responses to each other and no player has any incentive to move away from their strategy in that pair. In order to determine mutual best responses, we need to evaluate the actions (or strategies) of each player at each information set of the game. AP P ’s information set comprises of a single decision point (i.e., to select a signal λ ∈ {S, N S}) after Nature makes its selection of the type (HA or M A) and reveals it to AP P . DM ’s information set, on the other hand, comprises of two decision points because of its incomplete information about the type of AP P chosen by Nature. Thus, DM ’s strategy is to select an action a ∈ {A, B} depending on its belief P r(ti |λ) about the type  ti ∈ T of AP P in that information set. Moreover, for each λ ∈ {S, N S}, ti P r(ti |λ) = 1. Our goal is to determine the existence of Perfect Bayesian Nash Equilibria (or PBNE) in GD , where strategies are combined with beliefs to determine the mutual best responses of each player at the end of each stage. A PBNE of the Sensor Access Signaling Game GD is a strategy profile μ∗ = (μ∗AP P , μ∗DM ) and posterior probabilities (or beliefs of the DM ) P r(ti |λ) such that:




μ*APP ∈ arg max_{μAPP ∈ ΓAPP} UAPP(μAPP, μDM, ti);  ∀ti ∈ T,

where UAPP(·) is the utility or payoff of APP for a particular pure or mixed strategy μAPP against DM's best response to it, when the type ti is selected by Nature, and, ∀λ ∈ S = {S, NS}:

μ*DM ∈ arg max_{μDM ∈ ΓDM}  Σ_{ti ∈ T} P r(ti|λ) · UDM(λ, μDM, ti),

where, UDM (.) is the payoff of DM for a particular pure or mixed strategy μDM against the signal (λ) received from the AP P , when the type ti selected by Nature. Moreover, the DM ’s belief P r(ti |λ) about the AP P ’s type given a received signal λ should satisfy Bayes’ theorem, i.e., P r(ti |λ) =

P r(λ|ti) · P r(ti) / P r(λ) = μAPP(λ|ti) · P r(ti) / P r(λ).

Four categories of PBNE can exist for a signaling game such as GD : – Separating PBNE: This category comprises of strategy profiles where player 1 or AP P of different types dominantly send different or contrasting types of signals λ ∈ {S, N S}. This allows DM to infer AP P ’s type with certainty. For instance, in a separating strategy profile {(S, N S), μ∗DM }, AP P of M A type always selects S (i.e., m = 1) while HA always selects the N S (i.e., n = 0). – Pooling PBNE: This category comprises of strategy profiles where player 1 or AP P of different types dominantly send the same type of signal λ. Here DM cannot infer AP P ’s type with certainty, but needs to update its belief (about AP P ’s type) based on the observed λ. For instance, in a pooling strategy profile {(S, S), μ∗DM }, both M A and HA types always select S (i.e., m, n = 1). – Hybrid PBNE: This category comprises of strategy profiles where one player 1 or AP P type dominantly sends one type of signal, but the other type randomizes its sent signal. For instance, in a hybrid strategy profile {(S, (S, N S)), μ∗DM }, M A always selects S (i.e., m = 1), whereas HA randomizes between S and N S (i.e., 0 < n < 1). – Mixed PBNE: Finally, this equilibrium comprises of strategy profiles where all player 1 or AP P types send signals λ only in a probabilistic fashion (i.e., 0 < m, n < 1).
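Since Fig. 1b is not reproduced here, the stage payoffs can be read off the expected-utility expressions used in Sect. 3. The following Python sketch records them in a small table and computes the DM's best response to a signal given its belief; all numeric parameter values are assumptions chosen only for illustration.

# payoff[(type, signal, action)] = (U_APP, U_DM); entries follow the expressions in Sect. 3.
# Hypothetical parameter values (assumptions):
alpha, tau, u = 5.0, 4.0, 1.0          # MA benefit, MA blocking cost, S-vs-NS benefit gap for MA
sigma_b, gamma_c, v = 3.0, 2.0, 0.8    # HA benefit (sigma), HA blocking cost (gamma), gap for HA
beta, phi, kappa = 2.0, 6.0, 1.5       # DM blocking benefit, DM cost of allowing MA, DM cost of blocking HA
c_S, c_NS, psi_S, psi_NS = 0.6, 0.3, 0.4, 0.2

payoff = {
    ("MA", "S",  "A"): (alpha + u - c_S,   -phi - psi_S),
    ("MA", "S",  "B"): (-tau,              beta - psi_S),
    ("MA", "NS", "A"): (alpha - c_NS,      -phi - psi_NS),
    ("MA", "NS", "B"): (-tau,              beta - psi_NS),
    ("HA", "S",  "A"): (sigma_b + v - c_S, -psi_S),
    ("HA", "S",  "B"): (-gamma_c,          -kappa - psi_S),
    ("HA", "NS", "A"): (sigma_b - c_NS,    -psi_NS),
    ("HA", "NS", "B"): (-gamma_c,          -kappa - psi_NS),
}

def dm_best_response(signal, belief_ma):
    """DM's best response to a signal given its belief Pr(MA | signal)."""
    def eu(action):
        return (belief_ma * payoff[("MA", signal, action)][1]
                + (1.0 - belief_ma) * payoff[("HA", signal, action)][1])
    return max(("A", "B"), key=eu)

# Example: threshold behaviour on a suspicious request
for q in (0.1, 0.2, 0.5):
    print(f"q = {q}: best response to S is {dm_best_response('S', q)}")

With these (assumed) numbers the DM switches from Allow to Block once its belief q exceeds κ/(β + κ + φ), which is exactly the threshold that reappears in the equilibrium analysis below.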

3 Game Analysis

In this section, we find the PBNE for the sensor access signaling game GD . We begin by evaluating the existence of pure strategy equilibria (i.e., separating, pooling and hybrid), including conditions and regimes for achieving these equilibria. Following that we determine the mixed strategy equilibria for GD .


Theorem 1. There does not exist a separating equilibrium in the game GD.

Proof. There can be two possible separating strategy profiles for APP: (S, NS) and (NS, S). First, let us analyze the existence of an equilibrium on (S, NS), which means MA (malicious type) always selects S (i.e., m = 1) while HA (honest type) always selects NS (i.e., n = 0). DM's beliefs can be calculated using Bayes' theorem as follows:

P r(MA|S) = q = P r(S|MA) × P r(MA) / P r(S) = P r(S|MA) × P r(MA) / (P r(S|MA) × P r(MA) + P r(S|HA) × P r(HA)) = m×θ / (m×θ + n×(1−θ)) = 1×θ / (1×θ + 0×(1−θ)) = 1.

Therefore, P r(HA|S) = 1 − q = 0. Similarly, we can show that p = 0 and 1 − p = 1. With these beliefs, the best response of DM can be calculated as follows. The DM's expected utilities/payoffs (EUDM) from playing B or A if MA or HA selects S are:

EUDM(B, S) = 1 × (β − ψ^S) + 0 × (−κ − ψ^S) = β − ψ^S,
EUDM(A, S) = 1 × (−φ − ψ^S) + 0 × (−ψ^S) = −φ − ψ^S.

As EUDM(B, S) > EUDM(A, S), the DM's best response in this case is to play Block, i.e., BRDM(S) = B. Similarly, the DM's expected utilities/payoffs from playing B or A if MA or HA selects NS are:

EUDM(B, NS) = 0 × (β − ψ^NS) + 1 × (−κ − ψ^NS) = −κ − ψ^NS,
EUDM(A, NS) = 0 × (−φ − ψ^NS) + 1 × (−ψ^NS) = −ψ^NS.

In this case, as EUDM (B, N S) < EUDM (A, N S), the DM ’s best response is to play Allow, i.e., BRDM (N S) = A. In summary, if M A or HA plays S then DM ’s best response is B, and if M A or HA plays N S then DM ’s best response is A. Check for Equilibrium: HA and M A will follow the strategy along the equilibrium path as long as the payoff along that path is higher than the payoff it will get if it deviates. There can be two scenarios: first if the M A deviates and plays N S and second if the HA deviates and plays S. Let us first analyze the case where M A deviates and plays N S. The DM ’s beliefs do not change, and so, if it sees M A or HA playing N S, it will still always respond with it’s best response, i.e., A. M A will receive a payoff of −τ if it plays S and will receive a payoff of α − cN S if it plays N S. Thus, M A has an incentive to deviate from the equilibrium path. Although it can be shown that HA does not have an incentive to deviate, equilibrium does not exist in this case because at least one AP P (player 1) type has an incentive to deviate. Next, let us analyze the existence of a separating equilibrium on (N S, S), which means M A always selects N S (i.e., m = 0) and HA always selects S (i.e., n = 1). As before, the belief functions for the DM can be calculated as:

P r(MA|NS) = p = P r(NS|MA) × P r(MA) / P r(NS) = 1×θ / (1×θ + 0×(1−θ)) = 1.

Therefore, P r(HA|NS) = 1 − p = 0. Similarly, we can also show that q = 0 and 1 − q = 1. Thus, the DM's expected utilities/payoffs from playing B or A if MA or HA selects S are:

EUDM(B, S) = 0 × (β − ψ^S) + 1 × (−κ − ψ^S) = −κ − ψ^S,
EUDM(A, S) = 0 × (−φ − ψ^S) + 1 × (−ψ^S) = −ψ^S.

In this case, as EUDM(B, S) < EUDM(A, S), the DM's best response is to play Allow, i.e., BRDM(S) = A. And, DM's expected utilities from playing B or A if MA or HA selects NS are:

EUDM(B, NS) = 1 × (β − ψ^NS) + 0 × (−κ − ψ^NS) = β − ψ^NS,
EUDM(A, NS) = 1 × (−φ − ψ^NS) + 0 × (−ψ^NS) = −φ − ψ^NS.

As EUDM (B, N S) > EUDM (A, N S), in this case the DM ’s best response is to Block, i.e., BRDM (N S) = B. In summary, if M A or HA plays S, then DM ’s best response is A and if M A or HA plays N S, then DM ’s best response is B. Check for Equilibrium: If M A deviates and plays S, DM will respond with it’s best response A. As a result, M A will receive a payoff of −τ if it plays N S and will receive a payoff of α + u − cS if it plays S. Thus, M A has an incentive to deviate from the equilibrium path. Again, although it can be shown that HA does not have an incentive to deviate, equilibrium does not exist in this case either because at least one AP P (player 1) type has incentive to deviate. Thus, neither of the separating strategy profiles {(S, N S), (B, A), p, q} and {(N S, S), (A, B), p, q} is a PBNE. Theorem 2. There exists pooling equilibria on AP P strategies of (S, S) and (N S, N S) in the game GD . Proof. An AP P strategy profile (S, S) means both M A and HA types always select S (i.e., m, n = 1). DM ’s beliefs in this strategy profile can be calculated as: P r(M A|S) = q =

1×θ P r(S|M A) × P r(M A) = =θ P r(S) 1 × θ + 1 × (1 − θ)

Therefore, P r(HA|S) = 1 − q = 1 − θ. Accordingly, expected payoff for DM from playing B or A if either M A or HA selects S are: S

S

EUDM (B, S) = θ × (β − ψ ) + (1 − θ) × (−κ − ψ ) = θ(β + κ) − κ − ψ S

S S

EUDM (A, S) = θ × (−φ − ψ ) + (1 − θ) × (−ψ ) = −φ × θ − ψ

S

286

K. Kumari et al.

Now, DM ’s best response to the AP P ’s pooling strategy of (S, S) would be to select B (over A) if and only if the following condition holds: θ(β + κ) − κ − ψ

S

≥ −φ × θ − ψ

S

≡θ≥

κ β+κ+φ

To analyze the existence of an equilibrium at the AP P ’s strategy of (S, S), given the DM ’s best response, we must check if AP P of either type (M A or HA) has an incentive to deviate and play N S. Here, if HA or M A deviate and play N S and DM chooses A, HA gains a payoff of σ − cN S compared to −γ if it plays S, while M A gains a payoff of α − cN S compared to −τ if it plays S. Thus, in this case both HA and M A have an incentive to deviate and play N S and there is no equilibrium. Here, if HA or M A deviate and play N S and DM chooses B, HA will receive a payoff of −γ, same as if it plays S, while M A will receive a payoff of −τ , same as if it plays S. Thus, in this case, both HA and M A do not have any incentive to switch to N S and an equilibrium exists. In summary, an κ . equilibrium on the AP P ’s pooling strategy of (S, S) exists when θ ≥ β+κ+φ Inversely, the DM ’s best response to AP P ’s pooling strategy of (S, S) would be to select A (over B) if and only if the following holds: θ(β + κ) − κ − ψ

S

≤ −φ × θ − ψ

S

≡θ≤

κ β+κ+φ

Here, if HA or M A deviate and play N S and DM chooses A, HA will receive a payoff of σ − cN S if it plays N S and will receive a payoff of σ + v − cS if it plays S. On the other hand, M A will receive a payoff of α − cN S if it plays N S and will receive a payoff of α + u − cS if it plays S. Thus, in this case, there will be a pooling equilibrium if and only if: S

σ+v−c

NS

≥σ−c S

α+u−c

NS

≥α−c

S

≡v≥c

−c S

≡u≥c

NS

−c

, and

NS

Here, if HA or M A deviate and play N S and DM chooses B, HA will receive a payoff −γ compared to σ + v − cS if it plays S, while M A will receive a payoff of −τ compared to α + u − cS if it plays S. Thus, in this particular case, HA and M A do not have any incentive to deviate as well. In summary, an equilibrium κ . on AP P ’s pooling strategy of (S, S) also exists when θ ≤ β+κ+φ As the proof of a pooling equilibrium on the AP P strategy of (N S, N S) follows an analogous methodology, it is omitted to conserve space. In summary, κ , equilibrium on the AP P ’s pooling strategy of (N S, N S) exists when θ ≥ β+κ+φ κ or when θ ≤ β+κ+φ . For complete proofs, please refer to [9]. Theorem 3. There exists hybrid equilibria on the AP P strategy profiles (S, (S, N S)), (N S, (S, N S)), ((S, N S), S) and ((S, N S), N S), in the game GD . Proof. An AP P strategy profile (S, (S, N S)) means that M A always selects S (i.e., m = 1), whereas HA selects S with some probability n and N S with

Analyzing Defense Strategies Against Mobile Information Leakages

287

probability 1 − n where (0 < n < 1). DM ’s beliefs in this strategy profile can thus be calculated as: P r(M A|S) = q =

1×θ θ P r(S|M A) × P r(M A) = = P r(S) 1 × θ + n × (1 − θ) θ(1 − n) + n

P r(M A|N S) = p =

P r(N S|M A) × P r(M A) 0×θ = =0 P r(N S) 0 × θ + (1 − n) × (1 − θ)

Now, let’s compute the DM ’s best response for each of the strategies S and N S of AP P . In order to determine that, we need to first compute the expected utilities/payoffs obtained by DM for playing B or A if AP P (M A or HA) selects N S or S, which is given by: EUDM (B, N S) = p × (β − ψ

NS

) + (1 − p) × (−κ − ψ

EUDM (A, N S) = p × (−φ − ψ

NS

NS

) + (1 − p) × (−ψ

) = −κ − ψ

NS

) = −ψ

S

NS

NS

S

EUDM (B, S) = q × (β − ψ ) + (1 − q) × (−κ − ψ ) S

S

EUDM (A, S) = q × (−φ − ψ ) + (1 − q) × (−ψ )

It is clear from these expected utilities obtained by the DM in this strategy profile that it will always plays A (i.e., A always dominates B) when the AP P plays N S. On the contrary, there are two possibilities in terms of the DM ’s best response to an application’s strategy of S. The first possibility is for the DM to always Block or B, i.e., B would dominate A. This, however, holds only if the following is true: S

S

S

S

q(β − ψ ) + (1 − q)(−κ − ψ ) ≥ q(−φ − ψ ) + (1 − q)(−ψ ) ≡ q ≥

(1 − q)κ β+φ

Now, as DM always plays A for N S, HA has more incentive to play N S because it will gain σ − cN S compared to −γ if it plays S. Also, M A has more incentive to play N S since it will gain α − cN S compared to −τ if it plays S. In other words, AP P is not indifferent between playing S and N S when q ≥ (1−q)κ β+φ , and strongly prefers playing N S. Thus, there is no hybrid equilibria at (S, (S, N S)) when q ≥ (1−q)κ β+φ . The second possibility, in terms of the DM ’s best response to an AP P ’s strategy of S, is for the DM to Accept or A (i.e., A dominates B) which is true if q ≤ (1−q)κ β+φ . This combined with the fact that the DM always plays A for N S, it is clear that when q ≤ (1−q)κ β+φ , DM invariantly plays A for both the S and N S strategies of the AP P . In this case, if M A deviates and plays N S it will gain α − cN S compared to α + u − cS if it plays S. Similarly, HA will gain σ − cN S instead of σ + v − cS if it plays S. Therefore, in order to make AP P indifferent between playing S and N S so that a hybrid equilibrium can be achieved at (S, (S, N S)), the following conditions must be satisfied: NS

α−c

NS

σ−c

S

≡c −c

S

NS

u

S

S

NS

v

α+u−c σ+v−c

≡c

−c


In summary, a hybrid equilibrium is possible at (S, (S, N S)) if and only if the above conditions hold. To conserve space, we omit the proofs for hybrid equilibria on the AP P strategies (N S, (S, N S)), ((S, N S), S) and ((S, N S), N S). The proofs for these three strategies follow an analogous methodology, and each results in a hybrid equilibrium under certain conditions. Table 2 summarizes these equilibrium conditions for all the hybrid equilibria in the game GD . Theorem 4. There exists a mixed strategy PBNE in the game GD . Proof. First, let us determine the conditions for each AP P type to randomize (or be indifferent) between its choices. Assume DM plays the mixed strategy (yB, (1 − y)A) for S (i.e., suspicious requests) and (xB, (1 − x)A) for N S (i.e., non-suspicious requests). Then, for the AP P type M A, the expected utilities/payoffs of playing S and N S are:

EUM A (S) = y × (−τ ) + (1 − y) × (α + u − cS)

EUM A (N S) = x × (−τ ) + (1 − x) × (α − cNS)

M A is indifferent between playing S and N S if EUM A (S) = EUM A (N S), which gives:

y(τ + α + u − cS) − x(τ + α − cNS) = u − cS + cNS    (1)

Similarly, for the AP P type HA, the expected utilities/payoffs of playing S and N S are:

EUHA (S) = y × (−γ) + (1 − y) × (σ + v − cS)

EUHA (N S) = x × (−γ) + (1 − x) × (σ − cNS)

HA is indifferent between playing S and N S if EUHA (S) = EUHA (N S), which gives:

y(γ + σ + v − cS) − x(γ + σ − cNS) = v − cS + cNS    (2)

Solving Eqs. 1 and 2 for x and y, we get the DM ’s mixed strategy for which each AP P type is indifferent between playing S and N S. Let this solution be x = x∗ and y = y∗. Now, let us determine the conditions for DM to randomize (or be indifferent) between its choices. First, if DM observes that AP P (M A or HA) played S, its expected payoffs from playing B and A are:

EUDM (B) = q × (β − ψS) + (1 − q) × (−κ − ψS)

EUDM (A) = q × (−φ − ψS) + (1 − q) × (−ψS)


Table 2. List of PBNEs.

Conditions | Range of θ | PBNE profiles
−− | θ ≥ κ/(β + κ + φ) | PBNE = {(S, S), (B, B), p, q}
v ≥ cS − cNS, u ≥ cS − cNS | θ ≤ κ/(β + κ + φ) | PBNE = {(S, S), (A, A), p, q}
−− | θ ≤ κ/(β + κ + φ) | PBNE = {(S, S), (A, B), p, q}
−− | θ ≥ κ/(β + κ + φ) | PBNE = {(N S, N S), (B, B), p, q}
v ≤ cS − cNS, u ≤ cS − cNS | θ ≤ κ/(β + κ + φ) | PBNE = {(N S, N S), (A, A), p, q}
−− | θ ≤ κ/(β + κ + φ) | PBNE = {(N S, N S), (B, A), p, q}
cS − cNS = u, cS − cNS = v | q ≤ (1 − q)κ/(β + φ) | PBNE = {(S, (S, N S)), (A, A), p, q}
cS − cNS = u, cS − cNS = v | p ≤ (1 − p)κ/(β + φ) | PBNE = {(N S, (S, N S)), (A, A), p, q}
−− | q ≥ (1 − q)κ/(β + φ) | PBNE = {((S, N S), S), (B, B), p, q}
−− | p ≥ (1 − p)κ/(β + φ) | PBNE = {((S, N S), N S), (B, B), p, q}
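As a quick sanity check of the pooling rows of Table 2, the snippet below (our own illustrative helper, not from the paper; all numeric values are hypothetical) tests the two (S, S) pooling conditions derived in the proofs above:

```python
def pooling_SS_equilibria(theta, kappa, beta, phi, u, v, c_S, c_NS):
    """List the (S, S) pooling PBNEs of Table 2 that hold for the given parameters."""
    threshold = kappa / (beta + kappa + phi)
    out = []
    if theta >= threshold:
        out.append("{(S,S), (B,B), p, q}")
    if theta <= threshold and v >= c_S - c_NS and u >= c_S - c_NS:
        out.append("{(S,S), (A,A), p, q}")
    return out

print(pooling_SS_equilibria(theta=0.2, kappa=1.0, beta=2.0, phi=1.0,
                            u=1.5, v=1.5, c_S=1.0, c_NS=0.5))
# -> ['{(S,S), (A,A), p, q}']  (the belief threshold is 0.25 here)
```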

Now, DM is indifferent between playing B and A on seeing S if EUDM (B) = EUDM (A), which gives:

q = κ / (κ + β + φ) = q∗

Similarly, DM ’s expected utilities/payoffs from playing B and A when it sees N S are:

EUDM (B) = p × (β − ψNS) + (1 − p) × (−κ − ψNS)

EUDM (A) = p × (−φ − ψNS) + (1 − p) × (−ψNS)

DM is indifferent between playing B and A on seeing N S if EUDM (B) = EUDM (A), which gives:

p = κ / (κ + β + φ) = p∗

Now, we determine the AP P (M A or HA) randomization (mixed strategy) that is consistent with DM ’s beliefs. For that, we use Bayes’ rule to calculate the DM ’s beliefs q and p as:

q = q∗ = (m × θ) / (m × θ + n × (1 − θ))    (3)

p = p∗ = ((1 − m) × θ) / ((1 − m) × θ + (1 − n) × (1 − θ))    (4)

We can solve Eqs. 3 and 4 for m and n to obtain M A’s and HA’s mixed strategies for which they are indifferent between playing S and N S, consistent with the DM ’s beliefs. It is easy to show that there exists a system of (cost/benefit) parameters for which such a solution exists. Let these solutions be represented as m∗ and n∗. Then, the mixed strategy PBNE μ∗ occurs at:
μ∗AP P : M A plays (m∗ S + (1 − m∗ )N S) and HA plays (n∗ S + (1 − n∗ )N S)


μ∗DM : DM plays (y∗ B + (1 − y∗ )A) in response to S and (x∗ B + (1 − x∗ )A) in response to N S
DM ’s beliefs: q = P r(M A|S) = q∗ and p = P r(M A|N S) = p∗
Example of a mixed equilibrium: Substituting θ = 1/2, q = 1/4 and p = 3/4 in Eqs. 3 and 4, and solving for m and n, results in m = 1/4 and n = 3/4.
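To make the example concrete, the following minimal Python sketch (ours; it assumes only NumPy) solves the Bayes-consistency conditions (3) and (4) for m and n at the values used above:

```python
import numpy as np

# Reproduce the mixed-equilibrium example: given theta and the DM's indifference
# beliefs q* and p*, Eqs. 3 and 4 are linear in (m, n) and can be solved directly.
theta, q_star, p_star = 0.5, 0.25, 0.75

# Eq. 3: q*(m*theta + n*(1-theta)) = m*theta
# Eq. 4: p*((1-m)*theta + (1-n)*(1-theta)) = (1-m)*theta
A = np.array([[theta * (1 - q_star), -q_star * (1 - theta)],
              [-theta * (1 - p_star), p_star * (1 - theta)]])
b = np.array([0.0, p_star * (1 - theta) - theta * (1 - p_star)])
m_star, n_star = np.linalg.solve(A, b)
print(m_star, n_star)  # -> 0.25 0.75, i.e., m = 1/4 and n = 3/4 as in the text
```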

4 Numerical Analysis

We perform numerical simulations to analyze how the various PBNEs in our Sensor Access Signaling Game GD evolve with respect to the various game and system parameters. Specifically, we evaluate the M A’s payoff, HA’s payoff and DM ’s expected utility (EUDM ) in a representative separating strategy profile (S, N S), a pooling strategy profile (S, S), a hybrid strategy profile ((S, N S), S) and a mixed strategy profile, by varying the value of θ (Nature’s selection probability). The results are outlined in Fig. 2, and the set of system parameters chosen for the numerical simulations is summarized in Fig. 2f. The parameter values used in our numerical analysis were primarily chosen to showcase the trends observable in different strategy profiles. They may or may not be reflective of real-life values, but we did our best to establish the inequalities between parameters as completely as possible. Separating Strategy (S, N S). As proved earlier, there is no equilibrium in any of the separating strategy profiles, and the same can also be observed in Fig. 2a. We observe that EUDM is linearly increasing, which implies that DM is blocking suspicious requests from M A, as the only way DM can increase its utility is by playing B. Both M A’s and HA’s payoffs are linearly decreasing because DM plays B more often than A. Pooling Strategy (S, S). In Fig. 2b we observe that HA’s payoff and DM ’s expected utility initially decrease while M A’s payoff increases, for increasing values of θ. However, beyond a certain value of θ the trend reverses, i.e., HA’s payoff and DM ’s expected utility increase linearly while M A’s payoff decreases. Hybrid Strategy ((S, N S), S). In this strategy profile (Fig. 2c), EUDM is affected by random signals coming from M A. However, we can also observe that as θ increases, EUDM gradually increases. EUDM also stabilizes for higher values of θ. On the other hand, HA’s and M A’s payoffs decrease with increasing θ, as expected (Fig. 2d). Mixed Strategy. In Fig. 2e we observe the effect of a mixed strategy on each player’s payoff/utility. The payoffs and utilities are highly unstable as m, n, x and y are all drawn from a random distribution for the mixed strategy. In summary, our numerical evaluations validate our game-theoretic results.


(a) Separating Strategy (S, N S)

(b) Pooling Strategy (S, S)

(c) Hybrid Strategy ((S, N S), S)

(d) Hybrid Strategy ((S, N S), S)

(e) Mixed Strategy

(f) Simulation Parameters


Fig. 2. (a–e) Effect of θ on different strategy profiles. Each point is an average of 500 iterations. (f) Default simulation parameters.

5

Repeated Game

So far, we have outlined PBNE results and related numerical analysis for the Sensor Access Signaling Game GD in the single stage (or single-shot) scenario. In practice, however, the game GD will be repeated several times (possibly, as


long as the system is running). Thus, it is important to analyze how the game GD will evolve in a repeated scenario.

5.1 Background

Before proceeding, let us provide some technical background on repeated games. There are two broad categories of repeated games: (i) Finite Repeated Games: Here, a stage game is repeated a finite number of times. Repeated games can support strategy profiles (also known as reward and punishment strategies) that deviate from stage game Nash Equilibria through cooperation. Players can cooperate and play a reward strategy (also referred to as a Subgame Perfect Equilibrium (SPE)) that is not a Nash Equilibrium strategy, if the expected utility of every player is strictly greater than the expected utility from the Nash Equilibrium strategy [20]. Due to its lower expected utility, the Nash Equilibrium strategy becomes the punishment strategy, which is applied if any of the players deviates from the SPE. However, if a finite repeated game consists of stage games that each have a unique Nash Equilibrium, then the repeated game also has a unique SPE of playing the stage game Nash Equilibrium in each stage. This can be explained by unraveling from the last stage, where players must play the unique Nash Equilibrium. In the second-to-last stage, as players cannot condition on the future (i.e., the last stage) outcomes, again they must play the unique Nash Equilibrium for optimal expected utility. This backward induction continues until the first stage of the game, implying that players must always play the Nash Equilibrium strategy to ensure overall optimal expected utility. This (players not cooperating on a reward strategy) is a limitation of finite repeated games with a unique Nash Equilibrium, which can be overcome if the game is repeated infinitely. (ii) Infinite Repeated Games: In a repeated game with an infinite (or unknown) number of stages, players can condition their present actions upon the unknown future. Without a known end stage, players will be more inclined to cooperate on a mutually beneficial reward strategy, rather than a static Nash Equilibrium as in a finite repeated game. The payoff/utility for a player i in an infinite repeated game can be computed by discounting the expected utilities in future stages using a discount factor δ (0 ≤ δ ≤ 1) as:

u_i = u_i^1 + δ u_i^2 + δ^2 u_i^3 + . . . + δ^(t−1) u_i^t + . . . = Σ_{t=1}^∞ δ^(t−1) u_i^t

And the average (normalized) expected utility for player i is (1 − δ) Σ_{t=1}^∞ δ^(t−1) u_i^t. In an infinitely repeated game, players can effectively employ a reward-and-punishment strategy, but to do so each player must maintain a history of the past actions taken by all players. Let Ht denote the set of all possible histories (ht) of length t and let H = ∪_{t=1}^∞ Ht be the set of all possible histories. A pure strategy ωi for player i is a mapping ωi : H → Ωi that maps histories (H) into


player actions (Ωi ) of the stage game. In an infinitely repeated game G(t, δ) of n players, a strategy profile ω = (ω1 , ..., ωn ) is a Subgame Perfect Equilibrium (SPE) if and only if there is no player i and no single history ht−1 for which player i would gain by deviating from ωi (ht−1 ). Next, let us analyze the Sensor Access Signaling Game GD for the infinite repeated scenario.
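As a small illustration of the discounting above, the sketch below (ours; the per-stage utilities are arbitrary placeholders) accumulates a player’s discounted utility and its normalized average over a finite horizon approximating the infinite sum:

```python
# Discounted cumulative utility u_i = sum_t delta**(t-1) * u_i^t and its
# normalized average (1 - delta) * u_i, truncated at T stages.
delta = 0.6
stage_utilities = [2.0, 2.0, -1.0, 2.0, 2.0]   # hypothetical per-stage payoffs u_i^t

discounted = sum(delta ** (t - 1) * u for t, u in enumerate(stage_utilities, start=1))
normalized = (1 - delta) * discounted
print(round(discounted, 4), round(normalized, 4))
```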

5.2 Repeated GD with History: A Case Study

Let us analyze one possible scenario of an infinitely repeated game GD (t), where we assume {(S, N S), N S, (B, A), q, p} as the reward strategy and {(S, N S), N S, (B, B), q, p} as the punishment strategy. In this scenario, HA may start sending S at a later point in the game in order to increase its payoff from σ − cNS to σ + v − cS. However, as each player maintains a history of the action sets of every player, as soon as HA deviates from the SPE, DM will enforce the punishment strategy profile, thus blocking all incoming requests, whether S or N S. M A randomizes between S and N S according to the feasible reward strategy profile, so it does not matter to DM whether M A deviates or not. It is not logical to assume that DM will deviate, as it is DM ’s responsibility to keep a check on the deviations of AP P . Moreover, each stage in the game GD (t) is a sequential game, where DM reacts to AP P ’s signal in every stage of the game. After each stage of the game, the set of actions of player AP P and the corresponding responses of player DM become known to all players. Players may change their strategy after a certain period or stage, based on the history information up to that stage. Figure 3a shows the effect of history on the repeated games. We observe that HA’s utility fluctuates whenever it deviates from the cooperative reward strategy. With a strategy reset interval of 100 stages, we observe that HA’s utility follows an up-down pattern in every interval, reflecting a start with the reward strategy, HA’s subsequent deviation from the reward strategy, and DM ’s switch to the punishment strategy. Overall, M A’s cumulative payoff is lower than HA’s cumulative payoff, which is desired in our system as we want the DM to thwart M A while allowing HA to function normally. We also study the effect of the discount factor δ (on the game GD (t, δ)), which determines players’ patience. If the value of δ is high, then there is a high chance that the game will progress to the next stage, prompting players to cooperate on the reward strategy for longer. In Fig. 3b, we initially observe HA’s utility increasing and M A’s utility decreasing as per the reward strategy. However, as the game progresses, the cumulative utilities converge because (i) the utilities are heavily discounted, and (ii) players switch to the Nash Equilibrium strategy as a result of the discounted utility.


(a) Utilities with history.

(b) Utilities with history and discount factor.

Fig. 3. Cumulative utilities for DM , M A, and HA in repeated games.

6

Related Work

Several recent works demonstrated the feasibility of side-channel inference attacks using mobile [1,5,6,8,14–19,21,24] and wearable [10–13,23,25,26] device sensors. Some of these works also propose defense mechanism against the specific type of attack that was demonstrated. For example, Miluzzo et al. [17] proposed to drastically reduce the maximum allowed sensor sampling rate, in order to prevent keystroke inference attacks on mobile keypads using mobile device motion sensors. However, reducing the sensor sampling rate for all applications may cause certain applications to malfunction, leading to poor user experience. To minimize unnecessary regulation of sensors at all times, Maiti et al. [12] proposed an activity recognition-based defense framework. In their framework, the defense mechanism continuously monitors user’s current activity (using smartwatch motion sensors data), and regulates third party applications’ access to motion sensor only when typing activity is detected (in order to prevent keystroke inference). However, while such ad-hoc defense approaches are effective in preventing a specific type of attack, they may not be effective against other types of side-channel attacks. In this work, we generalize the problem of side-channel attacks using mobile and wearable sensors, by modeling all different types of attacks as a Bayesian signaling game between a mobile application and a defense mechanism [22].

7

Conclusion

In this paper, we modeled the problem of zero-permission sensor access control for mobile applications using game theory. By means of a formal and practical signaling game model, we proved conditions under which equilibria can be achieved between entities with conflicting goals in this setting, i.e., honest and malicious applications who are requesting sensor access to maximize their utility and attack goals, respectively, and the defense mechanism who wants to protect


against attacks without compromising system utility. By means of numerical simulations, we further studied how the different theoretically derived equilibria will evolve in terms of the payoffs received by the application and the defense mechanism. Our results in this paper have helped shed light on how a defense mechanism can act in a strategically optimal manner to protect the mobile system against malicious applications that take advantage of zero-permission sensors to leak private user information and are impossible to detect otherwise. Acknowledgment. Research reported in this publication was supported by the Division of Computer and Network Systems (CNS) of the National Science Foundation (NSF) under award number 1828071 (originally 1523960).

References 1. Cai, L., Chen, H.: TouchLogger: inferring keystrokes on touch screen from smartphone motion. In: HotSec (2011) 2. Cai, L., Machiraju, S., Chen, H.: Defending against sensor-sniffing attacks on mobile phones. In: ACM MobiHeld, pp. 31–36 (2009) 3. Cho, I.K., Kreps, D.M.: Signaling games and stable equilibria. Q. J. Econ. 102(2), 179–221 (1987) 4. Felt, A.P., Chin, E., Hanna, S., Song, D., Wagner, D.: Android permissions demystified. In: ACM CCS, pp. 627–638 (2011) 5. Felt, A.P., Finifter, M., Chin, E., Hanna, S., Wagner, D.: A survey of mobile malware in the wild. In: ACM SPSM (2011) 6. Gao, X., Firner, B., Sugrim, S., Kaiser-Pendergrast, V., Yang, Y., Lindqvist, J.: Elastic pathing: Your speed is enough to track you. In: ACM UbiComp (2014) 7. Hammad, M., Bagheri, H., Malek, S.: Determination and enforcement of leastprivilege architecture in android. In: IEEE ICSA, pp. 59–68 (2017) 8. Han, J., Owusu, E., Nguyen, L., Perrig, A., Zhang, J.: ACComplice: location inference using accelerometers on smartphones. In: ACM COMSNETS (2012) 9. Kumari, K., Jadliwala, M., Maiti, A.: Analyzing Defense Strategies Against Mobile Information Leakages: A Game-Theoretic Approach (Full Report) (2019). https:// sprite.utsa.edu/art/defender. Accessed 30 Apr 2019 10. Liu, X., Zhou, Z., Diao, W., Li, Z., Zhang, K.: When good becomes evil: keystroke inference with smartwatch. In: ACM CCS, pp. 1273–1285 (2015) 11. Maiti, A., Jadliwala, M., He, J., Bilogrevic, I.: Side-channel inference attacks on mobile keypads using smartwatches. IEEE Trans. Mob. Comput. 17(9), 2180–2194 (2018) 12. Maiti, A., Armbruster, O., Jadliwala, M., He, J.: Smartwatch-based keystroke inference attacks and context-aware protection mechanisms. In: ACM AsiaCCS (2016) 13. Maiti, A., Heard, R., Sabra, M., Jadliwala, M.: Towards inferring mechanical lock combinations using wrist-wearables as a side-channel. In: ACM WiSec, pp. 111–122 (2018) 14. Marquardt, P., Verma, A., Carter, H., Traynor, P.: (sp)iPhone: decoding vibrations from nearby keyboards using mobile phone accelerometers. In: ACM CCS (2011) 15. Michalevsky, Y., Boneh, D., Nakibly, G.: Gyrophone: recognizing speech from gyroscope signals. In: USENIX Security (2014)


16. Michalevsky, Y., Nakibly, G., Veerapandian, G.A., Boneh, D., Nakibly, G.: PowerSpy: location tracking using mobile device power analysis. In: USENIX Security (2015) 17. Miluzzo, E., Varshavsky, A., Balakrishnan, S., Choudhury, R.R.: TapPrints: your finger taps have fingerprints. In: ACM MobiSys (2012) 18. Narain, S., Vo-Huu, T.D., Block, K., Noubir, G.: Inferring user routes and locations using zero-permission mobile sensors. In: IEEE S&P (2016) 19. Nguyen, L., Cheng, H., Wu, P., Buthpitiya, S., Zhang, Y.: PnLUM: system for prediction of next location for users with mobility. In: Nokia Mobile Data Challenge Workshop (2012) 20. Osborne, M.J., Rubinstein, A.: A Course in Game Theory. MIT Press, Cambridge (1994) 21. Owusu, E., Han, J., Das, S., Perrig, A., Zhang, J.: Accessory: password inference using accelerometers on smartphones. In: ACM HotMobile (2012) 22. Rahman, M.A., Manshaei, M.H., Al-Shaer, E.: A game-theoretic approach for deceiving remote operating system fingerprinting. In: 2013 IEEE Conference on Communications and Network Security (CNS), pp. 73–81. IEEE (2013) 23. Sabra, M., Maiti, A., Jadliwala, M.: Keystroke inference using ambient light sensor on wrist-wearables: a feasibility study. In: ACM WearSys (2018) 24. Schlegel, R., Zhang, K., Zhou, X., Intwala, M., Kapadia, A., Wang, X.: SoundComber: a stealthy and context-aware sound Trojan for smartphones. In: NDSS (2011) 25. Wang, C., Guo, X., Wang, Y., Chen, Y., Liu, B.: Friend or foe?: Your wearable devices reveal your personal pin. In: ACM AsiaCCS (2016) 26. Wang, H., Lai, T.T.T., Roy Choudhury, R.: MoLe: motion leaks through smartwatch sensors. In: ACM MobiCom (2015)

Dynamic Cheap Talk for Robust Adversarial Learning
Zuxing Li and György Dán
School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, 10044 Stockholm, Sweden
[email protected]

Abstract. Robust adversarial learning is considered in the context of closed-loop control with adversarial signaling in this paper. Due to the nature of incomplete information of the control agent about the environment, the belief-dependent signaling game formulation is introduced in the dynamic system and a dynamic cheap talk game is formulated with belief-dependent strategies for both players. We show that the dynamic cheap talk game can further be reformulated as a particular stochastic game, where the states are beliefs of the environment and the actions are the adversarial manipulation strategies and control strategies. Furthermore, the bisimulation metric is proposed and studied for the dynamic cheap talk game, which provides an upper bound on the difference between values of different initial beliefs in the zero-sum equilibrium.

Keywords: Cheap talk signaling game · Stochastic game · Bisimulation metric

1 Introduction

Adversarial machine learning has received significant attention lately, mainly due to recent results on targeted and non-targeted attacks against deep neural networks (DNNs), as DNNs are expected to find a large variety of applications [1,5]. Most works focus on generating adversarial examples for degrading classification [6], while others aim to develop countermeasures against existing attacks [7]. A prominent example of DNN applications is deep reinforcement learning (DRL), which has been shown to be an efficient solution for learning near-optimal control policies in dynamically changing environments [8]. Adversarial examples in DRL result in the manipulation of the perceived states of the environment, and mislead the agent to take suboptimal actions [4]. To mitigate adversarial attacks on DRL, some recent works have explored introducing random perturbations in the training phase, while others considered training the agent with adversarial examples [2]. Nonetheless, the existing works do not capture the strategic interaction between the adversary and the control agent in its entirety. For a complete treatment of the interaction, adversarial reinforcement learning should address the ability of the agent to reason about the presence of an adversary, which aims to degrade the control performance by manipulating the environment states. By doing so, the agent, which aims to maximize its reward by taking actions, should not act on the observed manipulated states, but on its belief about the environment state, constructed based on the manipulated observations.
This work was partly funded by MSB through the CERCES project, and by SSF through the CLAS project (RIT17-0046).


In this paper, we approach adversarial learning as a dynamic system with strategic signaling. Our approach bridges previous works on adversarial learning with works on strategic information transmission, where signaling is used for introducing a bias in the values reconstructed by a decoder [17]. The proposed approach is based on subsequent cheap talk signaling games, in which the control agent’s prior is determined by the previous prior and the received, possibly compromised, observation. At the same time, the adversary constructs the compromised observation with respect to the prior of the control agent. We propose a bisimulation metric, which is helpful to decide aggregation of states to tradeoff the computational complexity and approximation error, and provide numerical results that illustrate the importance of maintaining a belief about the environment for robustness to adversarial input. The rest of the paper is organized as follows: Sect. 2 defines the problem; Sect. 3 introduces the belief structure; Sect. 4 provides a dynamic cheap talk game formulation; Sect. 5 proposes a bisimulation metric; Sect. 6 provides numerical results; and Sect. 7 concludes the paper.

2

Problem Statement

We consider the problem of learning under adversarial signaling as shown in Fig. 1. Our model is an extension of a Markov decision process: The environment state is defined on the finite set S; the control action is defined on the finite set A; the current environment state si is updated depending on the previous state si−1 and control action ai−1 following the conditional pmf pSi |Si−1 ,Ai−1 ; the mapping from the environment state si to the control action ai is now determined by the adversarial strategy Gi , which manipulates the environment state si to sˆi , and the control strategy Fi , which chooses the control action ai on observing sˆi ; w.l.o.g., we assume that the manipulated observation sˆi is also defined on S such that the control agent cannot detect the existence of the adversary directly; the instantaneous reward of the control action depends on the state and action as C (si , ai ). Compared with the classical stochastic game, the play between the adversary and the control agent in each stage is a Stackelberg game instead of a static game and the environment state is not available for the control agent.

Fig. 1. The considered problem of learning with adversarial signaling.


Given a sequence of adversarial strategies G = [G1, G2, . . .] and a sequence of control strategies F = [F1, F2, . . .], the discounted reward of the control agent over an infinite time horizon is

R(G, F) = Σ_{i∈Z+} λ^(i−1) E_{G,F}(C(Si, Ai)),    (1)

where λ ∈ (0, 1) denotes the discount factor and E_{G,F} denotes the expectation induced by the implemented strategies G and F. Under the zero-sum game framework, the reward of the adversary is −R(G, F). The task of adversarial reinforcement learning is to learn the strategies of the two players, G∗ and F∗, in an equilibrium, i.e.,

G∗ = arg max_{G∈G} −R(G, F∗),    (2)

F∗ = arg max_{F∈F} R(G∗, F).    (3)

3

Belief of Environment State

In the i-th stage, we have a signaling game where the adversary is the leader and the control agent is the follower with incomplete information about the environment state si. Let bi = pSi denote the belief of the current state. Then, the equilibrium of this myopic (single-stage) game consists of

∀si ∈ S, G′i = arg max_{Gi∈Gi} − Σ_{ŝi∈S, ai∈A} C(si, ai) F′i(ai|ŝi, bi) Gi(ŝi|si, bi),    (4)

∀ŝi ∈ S, F′i = arg max_{Fi∈Fi} Σ_{si∈S, ai∈A} C(si, ai) Fi(ai|ŝi, bi) [bi(si) G′i(ŝi|si, bi)] / [Σ_{s′∈S} bi(s′) G′i(ŝi|s′, bi)].    (5)

Note that the myopic strategies G′i and F′i depend on the belief bi implicitly, since a different belief of the environment state will generally lead to a different equilibrium of the myopic signaling game. To stay consistent with the rewards used in the original problem, it can be shown that

G′i = arg max_{Gi∈Gi} −E_{Gi,F′i}(C(Si, Ai)),    (6)

F′i = arg max_{Fi∈Fi} E_{G′i,Fi}(C(Si, Ai)).    (7)

Based on the signaling game formulation, the original problem can be recast as a repeated game with belief-dependent strategies. To formulate such a game, both the adversary and the control agent need to have access to the belief. The update from the current belief bi to the next belief bi+1 is determined by the adversarial strategy G′i and control strategy F′i as

∀s′ ∈ S, bi+1(s′) = Σ_{si,ŝi∈S, ai∈A} p_{Si+1|Si,Ai}(s′|si, ai) F′i(ai|ŝi, bi) G′i(ŝi|si, bi) bi(si).    (8)


Therefore, the belief in each stage can always be updated (observed) by both players if they keep track of all the previously played strategies of each other and the initial belief, which is satisfied under the assumption of completely rational players.

4

Dynamic Cheap Talk Game

We reformulate the original problem as a dynamic cheap talk game by imposing the adversarial and control strategies to depend on the belief in addition to their own observations:

Gi : B × S × S → [0, 1],    (9)

Fi : B × S × A → [0, 1].    (10)

The dynamic cheap talk game between the adversary and the control agent is played as follows: The initial belief b1 is revealed to both players; in the i-th stage, the adversary randomly manipulates the observed environment state si to ŝi based on the belief bi with probability Gi(ŝi|si, bi); the control agent randomly takes an action ai based on the observed manipulated state ŝi and belief bi with probability Fi(ai|ŝi, bi); the instantaneous control reward is C(si, ai); the belief update from bi to bi+1 is determined by the adversarial strategy Gi and control strategy Fi as in (8). Compared with the classical stochastic game, the players in the dynamic cheap talk game share a common observation of the belief but have individual observations of the environment state and the manipulated state. Note that an adversarial manipulation strategy Gi can be equivalently implemented in two steps: The adversary first randomly selects a deterministic manipulation strategy φi : S → S based on the belief bi with probability gi(φi|bi); and then the adversary manipulates the environment state si to ŝi = φi(si). There are |S|^|S| deterministic manipulation strategies in Φ. Given Gi, gi is designed to satisfy

∀si, ŝi ∈ S, bi ∈ B, Gi(ŝi|si, bi) = Σ_{φi∈Φ} gi(φi|bi) 1_{ŝi=φi(si)}.    (11)

Similarly, a control strategy Fi can be equivalently implemented in two steps: The control agent first randomly selects a deterministic control strategy θi : S → A based on the belief bi with probability fi(θi|bi); and then the control agent takes the action ai = θi(ŝi) based on the observation ŝi and the selected deterministic control strategy θi. There are |A|^|S| deterministic control strategies in Θ. Given Fi, fi is designed to satisfy

∀ŝi ∈ S, ai ∈ A, bi ∈ B, Fi(ai|ŝi, bi) = Σ_{θi∈Θ} fi(θi|bi) 1_{ai=θi(ŝi)}.    (12)


Based on these settings and observations, given bi, gi, fi, the i-th stage reward in (1) can be specified as

R(bi, gi, fi) = E_{bi,gi,fi}(C(Si, Ai)) = Σ_{si,ŝi∈S, ai∈A, φi∈Φ, θi∈Θ} C(si, ai) gi(φi|bi) 1_{ŝi=φi(si)} fi(θi|bi) 1_{ai=θi(ŝi)} bi(si);    (13)

and the update to the next-stage belief, denoted bi+1 = T(bi, gi, fi), is

∀s′ ∈ S, bi+1(s′) = T(bi, gi, fi)(s′) = Σ_{si,ŝi∈S, ai∈A, φi∈Φ, θi∈Θ} p_{Si+1|Si,Ai}(s′|si, ai) gi(φi|bi) 1_{ŝi=φi(si)} fi(θi|bi) 1_{ai=θi(ŝi)} bi(si).    (14)

Applying the results of stochastic game [9], the value function of the control agent then is Vc (b) = max min Qc (b, g, f ) g

f



= max min R(b, g, f ) + λ g

f

b ∈B

Vc (b )1b =T (b,g,f ) .

(16)

Given an arbitrary initial valuation function V1 : B → R for the control agent, construct a sequence of valuation functions V1 , V2 , V3 , . . . as: For all b ∈ B and n ≥ 2,  Vn (b) = max min R(b, g, f ) + λ f

g

b ∈B

Vn−1 (b )1b =T (b,g,f ) .

(17)

It follows from [9] that Vc = limn→∞ Vn , i.e., the sequence of valuation functions will converge to the value function in the zero-sum equilibrium.

Dynamic Cheap Talk for Robust Adversarial Learning

5

303

Bisimulation Metric

Note that there are an infinite number of beliefs in B, which leads to an infinite number of fix-point equations and makes the solution of the considered problem of closed-loop control with adversarial signaling not feasible. In practice, quantization or aggregation of the states are commonly-used methods to reduce the complexity. At the same time, these methods will result in new problems which are approximations of the original problem. Therefore, it is important to evaluate the approximation error. To this end, we study the bisimulation metric for the considered problem. Given a semi-metric on belief d : B × B → R≥0 , it satisfies the following elements: 1. b = b ⇒ d(b, b ) = 0; 2. ∀b, b ∈ B, d(b, b ) = d(b , b); 3. ∀b, b , b ∈ B, d(b, b ) ≤ d(b, b ) + d(b , b ). Let μ(d) : B → R denote a valuation of belief such that μ(d)(b) − μ(d)(b ) ≤ d(b, b ) for all b, b ∈ B. Define F (d)(b, b ) = max min max min |R(b, g, f ) − R(b , g  , f  )| g g f f  μ(d)(t)(1t=T (b,g,f ) − 1t=T (b ,g ,f  ) ), + λ max μ(d)

(18)

t∈B

where the second term is the Kantorovich metric and has an equivalent dual problem as:  max μ(d)(t)(1t=T (b,g,f ) − 1t=T (b ,g ,f  ) ) μ(d) t∈B    = min lt,t d(t, t ), s.t., lt,t = 1t=T (b,g,f ) , lt,t = 1t =T (b ,g ,f  ) . lt,t ≥0

t,t ∈B

t

t

Proposition 1. Given any belief semi-metric d, F (d) is also a belief semimetric. Proof. In the following, F (d) is shown to satisfy the elements of a semi-metric. Note that |R(b, g, f ) − R(b , g  , f  )| ≥ 0 for all b, b ∈ B. From the dual problem of the Kantorovich metric,  max μ(d)(t)(1t=T (b,g,f ) − 1t=T (b ,g ,f  ) ) ≥ 0. μ(d)



t∈B

Therefore, F (d)(b, b ) ≥ 0 for all b, b ∈ B. When b = b , the minimization over f  and g can always be achieved by using  f = f and g = g  , which will lead to F (d)(b, b ) = 0.

304

Z. Li and G. D´ an

Given b, b ∈ B, F (d)(b, b ) = F (d)(b , b). That is because: Given any f , f  , g  , and g,     μ(d)(t)(1t=T (b,g,f ) − 1t=T (b ,g ,f  ) ) |R(b, g, f ) − R(b , g , f )| + λ max μ(d) t∈B  = |R(b , g  , f  ) − R(b, g, f )| + λ max μ(d)(t)(1t=T (b ,g ,f  ) − 1t=T (b,g,f ) ). μ(d)

t∈B

Given any f  and g  , we have F (d)(b, b ) = max min max min |R(b, g, f ) − R(b , g  , f  )| g f f  g   + λ max μ(d)(t)(1t=T (b,g,f ) − 1t=T (b ,g ,f  ) ) μ(d)

t∈B

≤ max min max min |R(b, g, f ) − R(b , g  , f  )| g f f  g   + λ max μ(d)(t)(1t=T (b,g,f ) − 1t=T (b ,g ,f  ) ) μ(d) 



t∈B 

+ |R(b , g , f ) − R(b , g  , f  )|  + λ max μ(d)(t)(1t=T (b ,g ,f  ) − 1t=T (b ,g ,f  ) ). μ(d)

t∈B

Therefore, F (d)(b, b ) + F (d)(b , b ) = max min max min max min max min |R(b, g, f ) − R(b , g1 , f1 )| g f f  g  f1 f2 g2 g1  + λ max μ(d)(t)(1t=T (b,g,f ) − 1t=T (b ,g1 ,f1 ) ) μ(d)

t∈B

+ |R(b , g2 , f2 ) − R(b , g  , f  )|  + λ max μ(d)(t)(1t=T (b ,g2 ,f2 ) − 1t=T (b ,g ,f  ) ) μ(d)

t∈B

≥ max min min max min min |R(b, g, f ) − R(b , g2 , f1 )| g f f  g  f1 g2  + λ max μ(d)(t)(1t=T (b,g,f ) − 1t=T (b ,g2 ,f1 ) ) μ(d)

t∈B

+ |R(b , g2 , f1 ) − R(b , g  , f  )|  + λ max μ(d)(t)(1t=T (b ,g2 ,f1 ) − 1t=T (b ,g ,f  ) ) μ(d)

t∈B

≥ F (d)(b, b ).  Define the pointwise ordering for the belief semi-metrics as: d ≤ d iff d(b, b ) ≤ d (b, b ) for all b, b ∈ B.

Dynamic Cheap Talk for Robust Adversarial Learning

305

Theorem 1. The semi-metric transform F is continuous. Proof. Given an ω-chain of belief semi-metric {dn },   F sup{dn } (b, b ) n≥1

= max min max min |R(b, g, f ) − R(b , g  , f  )| g g f f    +λ max μ sup{dn } (t)(1t=T (b,g,f ) − 1t=T (b ,g ,f  ) ) n≥1 μ(supn≥1 {dn }) t∈B (a)

= max min max min |R(b, g, f ) − R(b , g  , f  )| g g f f   + λ sup max μ(dn )(t)(1t=T (b,g,f ) − 1t=T (b ,g ,f  ) ) n≥1

μ(dn )

t∈B

= sup max min max min |R(b, g, f ) − R(b , g  , f  )| g g f n≥1 f  μ(dn )(t)(1t=T (b,g,f ) − 1t=T (b ,g ,f  ) ) + λ max μ(dn ) t∈B   = sup{F (dn )} (b, b ), n≥1

where the equality (a) has been proved in [14]. Therefore F is continuous.  From Theorem 1, the least fix-point df of the transform F can be used as the bisimulation metric and can be constructed through Picard iteration: Give d1 such that d1 (b, b ) = 0 for all b, b ∈ B; for n ≥ 2, the semi-metric is updated as dn = F (dn−1 ); the iteration will converge to the least fix-point df = limn→∞ dn . Furthermore, the least fix-point semi-metric can be used to bound the difference of value functions of beliefs. Theorem 2. Given any b, b ∈ B, |Vc (b) − Vc (b )| ≤ df (b, b ).

(19)

Proof. Let V1 be the initial valuation function of belief such that V1 (b) = 0 for all b ∈ B. A sequence of valuation functions {Vn }n≥1 are constructed through (17). We first show by induction that Vn (b) − Vn (b ) ≤ dn (b, b ) for all b, b ∈ B, and n ≥ 1. When n = 1, V1 (b) − V1 (b ) ≤ d1 (b, b ) for all b, b ∈ B. Given n ≥ 2, for all b, b ∈ B, we assume that

306

Z. Li and G. D´ an

Vn (b) − Vn (b ) max min R(b, g, f ) − R(b , g  , f  ) = max min g g f f  +λ Vn−1 (t)(1t=T (b,g,f ) − 1t=T (b ,g ,f  ) ) t∈B

≤ dn (b, b ). Then we have Vn+1 (b) − Vn+1 (b ) = max min max min R(b, g, f ) − R(b , g  , f  ) g g f f  +λ Vn (t)(1t=T (b,g,f ) − 1t=T (b ,g ,f  ) ) t∈B

≤ max min max min |R(b, g, f ) − R(b , g  , f  )| g g f f  + λ max μ(dn )(t)(1t=T (b,g,f ) − 1t=T (b ,g ,f  ) ) μ(dn )

t∈B

= dn+1 (b, b ). From the induction, for all b, b ∈ B, and n ≥ 1, Vn (b) − Vn (b ) ≤ dn (b, b ). Similarly, for all b, b ∈ B, and n ≥ 1, it can be shown that Vn (b ) − Vn (b) ≤ dn (b , b) = dn (b, b ). Note that Vc = limn→∞ Vn and df = limn→∞ dn . Therefore, for all b, b ∈ B, |Vc (b) − Vc (b )| = lim |Vn (b) − Vn (b )| ≤ lim dn (b, b ) = df (b, b ). n→∞

n→∞

 Consider the following quantization of beliefs and valuation update as: Divide

K K B into K subsets {Bj }j=1 such that j=1 Bj = B and for all j = j  , Bj ∩ Bj  = ∅; let qj ∈ Bj be the quantization output of the j-th belief subset; set initial valuation function of the quantization output as U1 (qj ) = 0 for all 1 ≤ j ≤ K; K the valuation update of {qj }j=1 is implemented as Un (qj ) = max min  f

min R(qj , gj,j  ,f , f ) + λUn−1 (qj  ),

1≤j ≤K gj,j  ,f

where the adversarial strategy gj,j  ,f satisfies T (qj , gj,j  ,f , f ) ∈ Bj  . The approximation error of using this belief quantization and corresponding valuation update can be bounded as follows. Theorem 3. For all 1 ≤ j ≤ K and b ∈ Bj , lim sup |Un (qj ) − Vn (b)| ≤ n→∞

λ max max df (qj  , b ) + df (qj , b). 1 − λ 1≤j  ≤K b ∈Bj 

(20)

The proof idea is similar to that used in [14] and the proof is omitted due to the limited space. Theorem 3 relates the quantization and the resulted approximation error, and therefore is useful to design the quantization subject to a certain error constraint.

Dynamic Cheap Talk for Robust Adversarial Learning

6

307

Numerical Results

Here we illustrate the idea of dynamic cheap talk by a simple numerical experiment of robust closed-loop control with adversarial signaling. The experiment settings are summarized in Table 1. Note that we use a quantization of beliefs. Table 1. Experiment settings. Parameter Value/setting

Parameter

Value/setting

S

{s(1) , s(2) }

q4

pS1 (s(1) ) = 0.875

A

{a(1) , a(2) }

C(s(1) , a(1) )

2

B1

pS1 (s(1) ) ∈ [0, 0.25]

C(s(1) , a(2) )

15

B2

pS1 (s(1) ) ∈ (0.25, 0.5] C(s(2) , a(1) )

15

B3

pS1 (s(1) ) ∈ (0.5, 0.75] C(s(2) , a(2) )

2

B4

pS1 (s(1) ) ∈ (0.75, 1]

pSi+1 |Si ,Ai (s(1) |s(1) , a(1) ) 0.3

q1

pS1 (s(1) ) = 0.125

pSi+1 |Si ,Ai (s(1) |s(1) , a(2) ) 0.9

q2

pS1 (s(1) ) = 0.375

pSi+1 |Si ,Ai (s(1) |s(2) , a(1) ) 0.9

q3

pS1 (s(1) ) = 0.625

pSi+1 |Si ,Ai (s(1) |s(2) , a(2) ) 0.3

The benchmark method considered here is repeated myopic cheap talk game, where the adversarial and control strategies in each stage are used in the myopic zero-sum equilibrium without taking into account the impact on the next belief. Let Ui denote the accumulated discounted reward of control agent achieved by the repeated myopic cheap talk game until the i-th stage. With the discount factor set as λ = 0.6 and running N = 30 valuation updates, the comparison of Ui and UN for each initial belief is shown in Fig. 2. From the numerical results, the control reward is improved by solving the proposed dynamic cheap talk game for the initial beliefs q1 and q4 . It can also be noted that there is no 35

control reward

30 25 20

UN(q1),UN(q4) Ui(q1),Ui(q4)

15

UN(q2),UN(q3) Ui(q2),Ui(q3)

10

1

2

3

4

5

6

7

8

9

10

i

Fig. 2. Comparison of control rewards of solving the dynamic cheap talk game and repeated myopic cheap talk game.

308

Z. Li and G. D´ an

improvement for the initial beliefs q2 and q3 , which might result from the brutal belief quantization.

7

Conclusion

We study the robust adversarial learning in the context of closed-loop control with adversarial signaling and formulate a dynamic cheap talk game. We also propose a bisimulation metric, which is useful to study the trade-off between the computation complexity and approximation error for belief quantization. The future works are to take into account the adversarial manipulation cost, and to design computation-efficient algorithms for the adversarial reinforcement learning and the bisimulation metric.

References 1. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016) 2. Pinto, L., Davidson, J., Sukthankar, R., Gupta, A.: Robust adversarial reinforcement learning. In: Proceedings of the 34th International Conference on Machine Learning, pp. 1–10 (2017) 3. Pan, X., Seita, D., Gao, Y., Canny, J.: Risk averse robust adversarial reinforcement learning. In: Proceedings of the IEEE International Conference on Robotics and Automation (2019) 4. Huang, S., Papernot, N., Goodfellow, I., Duan, Y., Abbeel, P.: Adversarial attacks on neural network policies. arXiv:1702.02284 (2016) 5. Szegedy, C., et al.: Intriguing properties of neural networks. arXiv:1312.6199v4 (2013) 6. Dong, Y., et al.: Boosting adversarial attacks with momentum. In: Proceedings of the 2018 Conference on Computer Vision and Pattern Recognition, pp. 9185–9193 (2018) 7. Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z.B., Swami, A.: The limitations of deep learning in adversarial settings. In: Proceedings of IEEE European Symposium on Security and Privacy (2016) 8. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015) 9. Shapley, L.S.: Stochastic games. Proc. Natl. Acad. Sci. 39(10), 1095–1100 (1953) 10. He, X., Dai, H.: Dynamic Games for Network Security. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-75871-8 11. He, X., Dai, H., Ning, P.: Faster learning and adaptation in security games by exploiting information asymmetry. IEEE Trans. Sig. Process. 64(13), 3429–3443 (2016) 12. Busoniu, L., Babuska, R., Schutter, B.: A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Syst. Man Cybern. - Part C 38(2), 156–172 (2008) 13. Hor´ ak, K., Boˇsansk´ y, B., Pˇechouˇcek, M.: Heuristic search value iteration for onesided partially observable stochastic games. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence, pp. 558–564 (2017)

Dynamic Cheap Talk for Robust Adversarial Learning

309

14. Ferns, N., Panangaden, P., Precup, D.: Metrics for finite Markov decision processes. In: Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 162–169 (2004) 15. Alfaro, L., Majumdar, R., Raman, V., Stoelinga, M.: Game relations and metrics. In: Proceedings of the 22nd Annual IEEE Symposium on Logic in Computer Science, pp. 99–108 (2007) 16. Chatterjee, K., Alfaro, L., Majumdar, R., Raman, V.: Algorithms for game metrics. Logic. Methods Comput. Sci. 6(3:13), 1–27 (2010) 17. Crawford, V.P., Sobel, J.: Strategic information transmission. Econometrica 50(6), 1431–1451 (1982)

Time-Dependent Strategies in Games of Timing Jonathan Merlevede1 , Benjamin Johnson2 , Jens Grossklags2(B) , and Tom Holvoet1 1

imec-DistriNet, KU Leuven, Leuven, Belgium {jonathan.merlevede,tom.holvoet}@cs.kuleuven.be 2 Chair for Cyber Trust, Technical University of Munich, Munich, Germany [email protected],[email protected]

Abstract. Timing, a central aspect of decision-making in security scenarios, is a subject of growing academic interest; frequently in the context of stealthy attacks, or advanced persistent threats (APTs). A key model in this research landscape is FlipIt [1]. However, a limiting simplifying assumption in the FlipIt literature is that costs and gains are not subject to discounting, which contradicts the typical treatment of decision-making over time in most economically relevant contexts. Our recent work [2] introduces an adaptation of the FlipIt model that applies time-based exponential discounting to the value of a protected resource, while allowing players to choose from among the same canonical strategies as in the original game. This paper extends the study of games of timing by introducing two new classes of strategies that are fundamentally motivated by a time-discounted world view. Within our game model, we compute player utilities, best responses and give a partial characterization of the game’s Nash equilibria. Our model allows us to re-interpret the APT model using a finite total valuation, and a finite time horizon. By applying time-based discounting to the entire decision-making framework, we increase the level of realism as well as applicability to organizational security management.

1

Introduction

Defense against stealthy advanced persistent threats (APTs) through perfectly effective preventative investments is an impossible goal in most contexts. In fact, data about reported security incidents reveal that organizations need on average about 200 days to merely detect successful attacks [3]. As such, additional emphasis needs to be placed on the optimization of mitigation strategies against stealthy threats such as the scheduling of investigative in-depth security audits. To make strategic decisions about the mitigation of stealthy attacks, a central consideration must be the notion of time, which has been the focus of the games of timing literature stemming from the cold war period (see, for example, [4]). This research field received a new influx of work on the so-called FlipIt game c Springer Nature Switzerland AG 2019  T. Alpcan et al. (Eds.): GameSec 2019, LNCS 11836, pp. 310–330, 2019. https://doi.org/10.1007/978-3-030-32430-8_19

Time-Dependent Strategies in Games of Timing

311

beginning with research by Dijk et al. [1] focused on the competitive dynamics to control a contested resource in a limited information environment. While the majority of these studies assume that players are indifferent between costs and gains now in comparison to those in the (distant) future, our recent work [2] has begun to apply notions of time-based discounting to the game of FlipIt. These initial efforts have been focused on two canonical classes of timing strategies, the so-called periodic and exponential strategies (introduced in [1]), for which the expected time between strategic actions is constant. In a discounted environment, constant expected time between actions implies a decreasing expected valuation between actions, which calls into question the rational appropriateness of this class of strategies in a discounted setting. In this work, we conduct game-theoretic analysis of infinite timing games with time-based discounting, using two new classes of strategies (discounted periodic and discounted exponential), constructed so that the strategic timing is aligned with the discounted resource valuation. We determine player utilities for all combinations of strategies within the same class, and provide numerical illustrations for each player’s best response strategy, concluding with a partial characterization of the game’s Nash equilibria. Our results differ in several aspects from those of non-discounted games of timing as well as discounted games of timing in which only non-discounted canonical strategies are considered. For example, enacting strategies from the revised classes always results in a finite total number of actions; and the cumulative effect of discounting the cost of action at lower rates is limited to at most doubling the total cost, implying that costs and resources may be time-discounted at substantially different rates without affecting the structure of results.

2

Related Work

Our discussion of related work focuses on games of timing, and specifically on research designed to capture key aspects of stealthy APT attacks, related to the FlipIt game [1,5]. This literature has grown considerably, such that there now exist many adapted and extended versions of the original game. The exponential discounting extension in our previous work [2] represents one such adaptation; and serves as the primary motivation for the current analysis. In the following, we briefly review additional literature. Laszka et al. [6,7] have investigated the influence of including non-targeted attackers in the FlipIt model. Feng et al. [8] and Hu et al. [9] modified the game by considering insider threat actors. Feng et al. [8] accomplish this by adding a third player, an insider, to the model. The insider derives gains from the resource, when it is under control of the defender, by selling information to the attacker who will learn about ways to decrease the cost of attacks. In the basic FlipIt game, moves by both the attacker and defender are assumed to be instantaneous and always successful. Farhang and Grossklags [10] introduce the idea of imperfect defensive moves with a quality level α ∈ [0, 1] that expresses the fraction of the resource that remains under the control of the

312

J. Merlevede et al.

attacker after a flip by the defender. Zhang et al. [11] and Laszka et al. [6,7] capture the realistic notion that attacks are complex and take a random amount of time before taking effect. Johnson et al. [12] redefine the probability of success of an attack as a function of time. They also consider that the cost of flipping may be time-dependent. In the basic FlipIt game, the game has an infinite time horizon and players compete for the resource forever. Zhang et al. [11] and Johnson et al. [12] assume that the game ends at a fixed pre-defined point in time. Pham and Cid [13] propose a variation of the FlipIt game in which each action makes it more costly for the opponent to take over the resource again; effectively reducing the game to a finite version. Laszka et al. [14] consider two ways of composing resources: one where the attacker receives gain when she is in control of at least one resource (OR-model) and one where she receives gain only when in control of all of the resources (ANDmodel). Leslie et al. [15] generalize this to a model where the attacker has to compromise a threshold fraction of the defender’s resources before receiving any gain. Zhang et al. [11] also consider multiple resources, but model no interaction between them except through a resource constraint imposed on players in the form of a maximum play frequency that is shared across resources. Much of the follow-up work on FlipIt has made changes to the assumption of perfect stealthiness. Often the defender is assumed to be completely overt [6,7]. [10,11] Besides the conceptual difference, overtness also allows for a different characterization of the FlipIt game as a convex optimization problem [11]. Pham and Cid [13] add a new audit action to the game, which allows a player to query the current owner of the resource. The insider player introduced by Hu et al. [9] is at risk of being caught when selling information, which is integrated into her utility function. Johnson et al. [12] consider a discretized version of a timing game similar to FlipIt, in which players are only allowed to make decisions at discrete points in time. Discretization of time is especially relevant for defender moves, which often have to be performed according to some schedule so as not to interrupt business operations (e. g. only at night) [16]. Zhang et al. [11] impose budget constraints on players that limit the maximum flip frequency, a practical consideration that is ignored in other treatments of FlipIt. Pawlick et al. [17] define a meta-game that consists of a signalling game and a FlipIt game. The parameters of the FlipIt game are defined by the outcome of the signalling game and vice versa. Beyond our previous research (Merlevede et al. [2]), we are unaware of any studies that investigate the impact of discounting in FlipIt game models.

3

Model Definition

In this section, we introduce our model for stealthy timing-based security games with discounted costs, discounted resource valuations, and discounting-inspired strategies. In this two-player game, a defender (D) and an attacker (A) are vying for control over a central resource. To obtain control, either player i ∈ {A, D}

Time-Dependent Strategies in Games of Timing

313

can choose to pay a fixed instantaneous cost ci to immediately assume control of the resource. The resource is always controlled exclusively by the last player to execute such a move. The controlling player accrues utility at a rate which decreases exponentially over time. The cost to execute a move is also time-discounted, albeit at a potentially-different discount rate. Control is always stealthy in the sense that neither player knows who controls the resource until the moment that they initiate an instantaneous ‘flip’. The remainder of this section formalizes the game. 3.1

Player Strategies

For player i ∈ {D, A} , define ti = (ti,0 , ti,1 , ti,2 , . . .) to be a strictly increasing sequence of times at which player i moves. (A player can move at most once in a particular instance of time.) The length of ti can be finite or infinite. A player strategy in this game is defined completely by a probability distribution over a set of possible ti . 3.2

Player Control

An outcome of the game is a pair of move sequences (tD , tA ). Times that occur in both vectors complicate a smooth analysis, so for this and other reasons1 , we assume that tD ∩ tA = ∅. Let t = tD ∪ tA = (t0 , t1 , t2 , . . .) be the strictly increasing sequence of player move times. Then for any time t ≥ t0 , we may define the latest flip time function by LFT(t) = max{tk ∈ t : tk ≤ t}. From time t = 0 until the time of the first flip t0 , the defender has control of the resource. We may thus define the player control function for any time t > 0 by  D if t < t0 or LFT(t) ∈ tD PC : t → A if LFT(t) ∈ tA . The asymmetry of the player control function shows that the defender has an advantage due to starting off the game in control of the resource. We also define a player control indicator function PCi (t) = 1PC(t)=i , which can be used for integrating. This function tells us, for a given player i ∈ {D, A} and time t > 0, whether that player controls the resource at that time. 1

For each of the strategy distributions that we analyze, the set of all outcomes with non-disjoint strategy vectors has probability zero, so in our setting this assumption is benign. We could also address overlapping sequences by defining the resource control function so that any occurrence of simultaneous moves leaves control of the resource unaffected; and so doing would accomplish the same effect as our assumption, but with a more complicated logical underpinning.

3.3 Player Gains

Players achieve gains by controlling the resource. Gains initially accrue value at some rate of V dollars per unit of time. Meanwhile, the resource decreases in value over time at a discount rate of ρ. The gain for player i may be determined by computing the expected exponentially weighted integral of PCi(t) over all of time, normalized with respect to the total (discounted) value of the resource:
$$G_i = \frac{E\left[\int_{\tau=0}^{+\infty} PC_i(\tau)\, V e^{-\rho\tau}\, d\tau\right]}{\int_{\tau=0}^{+\infty} V e^{-\rho\tau}\, d\tau} = \rho\cdot E\left[\int_{\tau=0}^{+\infty} PC_i(\tau)\, e^{-\rho\tau}\, d\tau\right]. \tag{1}$$
The expectation is taken over the distributions involved in defining the player strategies. Normalization allows comparing player gains for different discount rates and the interpretation of total gain as a fraction of total achievable gain.
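
As a concrete illustration, the following minimal Python sketch (ours, not code from the paper) evaluates the normalized discounted gain of Eq. (1) for a single deterministic outcome, i.e., for fixed and disjoint move sequences tD and tA; the expectation over strategies would be taken on top of this.

```python
# Minimal sketch (our own illustration): the normalized discounted gain of
# Eq. (1) for two fixed, disjoint move sequences t_D and t_A.  Control belongs
# to the defender before the first flip and to the last mover afterwards; each
# controlled interval [a, b) contributes
# rho * integral_a^b e^(-rho*tau) d tau = e^(-rho*a) - e^(-rho*b).
import math

def gains(t_D, t_A, rho):
    """Return (G_D, G_A) for deterministic move sequences (no expectation)."""
    flips = sorted([(t, 'D') for t in t_D] + [(t, 'A') for t in t_A])
    gain = {'D': 0.0, 'A': 0.0}
    owner, start = 'D', 0.0          # defender controls the resource at t = 0
    for t, player in flips:
        gain[owner] += math.exp(-rho * start) - math.exp(-rho * t)
        owner, start = player, t
    gain[owner] += math.exp(-rho * start)   # last owner keeps control forever
    return gain['D'], gain['A']

# Example: the two gains always sum to 1 because of the normalization.
g_D, g_A = gains(t_D=[2.0, 5.0], t_A=[3.5], rho=0.5)
print(g_D, g_A, g_D + g_A)   # g_D + g_A == 1.0 (up to floating point)
```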

3.4 Player Costs

When players perform a move, this comes at a fixed instantaneous cost of ci > 0. Costs are exponentially discounted with discount rate ρ^c_i, which can be different from the discount rate for the resource, ρ. A player's (total) costs are defined as the expected sum of exponentially discounted instantaneous costs, normalized with respect to the total (discounted) value of the resource:
$$C_i = \frac{E\left[\sum_{\tau\in t_i} c_i\, e^{-\rho^c_i \tau}\right]}{\int_{\tau=0}^{+\infty} V e^{-\rho\tau}\, d\tau} = \frac{\rho}{V}\cdot E\left[\sum_{\tau\in t_i} c_i\, e^{-\rho^c_i \tau}\right]. \tag{2}$$
As with gains, the expectation is taken with respect to the distribution used to define ti. Scaling gains and costs by the same factor makes the normalization operation neutral with respect to the behavior of rational players. Finally, since we only deal with normalized costs and gains and since ci and V are both free parameters, we can without loss of generality assume that V = 1. This assumption allows us to think of the instantaneous cost ci = ci/V as a unitless value, expressing a fraction of the initial rate at which the resource accrues value per unit of time.

3.5 Player Utilities

Player utilities are equal to the difference of player gains and player costs: ui = Gi − Ci .

3.6 Discounted Strategies

Because the full strategy space for this game involves probability distributions over countable sequences of real numbers, a useful analysis requires us to restrict our attention to "reasonable" sub-classes of such strategies. A review of research into timing games has identified two canonical classes of strategies for this purpose: the class of periodic strategies, in which the time between moves is constant (with the first move being randomized), and the class of exponential strategies, for which the time between moves is exponentially distributed (as is the time of the first move). In this section, we present adaptations of these two classes of strategies that are specifically motivated by our game's time-based discounting factor ρ: instead of defining strategies in terms of inter-arrival times, we define them in terms of the value generated by the resource between subsequent moves. To do this, we first introduce the concept of a compressed timeline, where the compression is adapted to the rate of exponential discounting of the resource.

Compressed Timeline. To begin, we define a transformation T that maps time t ∈ [0, ∞) onto a compressed time x ∈ [0, 1), such that the total value of the resource up until t equals x:
$$T: [0, +\infty) \to [0, 1),\qquad t \mapsto \rho\int_{\tau=0}^{t} e^{-\rho\tau}\, d\tau = 1 - e^{-\rho t}.$$

We can map compressed time x onto real time t using the inverse transformation T⁻¹:
$$T^{-1}: [0, 1) \to [0, +\infty),\qquad x \mapsto -\frac{\ln(1-x)}{\rho}.$$

Discounted Exponential Strategy. A discounted exponential strategy is a strategy for which all compressed inter-arrival times, as well as the time of the first move, are drawn from the same exponential distribution: Δi,n ∼ Exp(νi) and xi,0 ∼ Exp(νi). Here Δi,n = xi,n+1 − xi,n is the difference between the timing of the nth and (n+1)th move as observed on a compressed timeline. Parameter ν is the flip rate, move rate or play rate of the discounted exponential strategy. The expected compressed time between two moves is constant and equal to 1/ν. The expected real time between moves is therefore time-dependent and increasing. The expected total number of moves is finite and equal to ν.
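
The following short Python sketch (our own illustration; the function name is ours) samples the move times of a discounted exponential strategy by drawing exponential inter-arrival times on the compressed timeline and mapping them back through T⁻¹; every realization contains finitely many moves.

```python
# A sketch (not code from the paper): sampling the move times of a discounted
# exponential strategy with play rate nu and resource discount rate rho.
# Inter-arrival times are exponential on the compressed timeline [0, 1);
# compressed times that reach 1 never correspond to a real move.
import math, random

def discounted_exponential_moves(nu, rho, rng=random):
    x, moves = 0.0, []
    while True:
        x += rng.expovariate(nu)          # compressed inter-arrival ~ Exp(nu)
        if x >= 1.0:                      # beyond the compressed horizon
            return moves
        moves.append(-math.log(1.0 - x) / rho)   # map back: T^{-1}(x)

random.seed(1)
print(discounted_exponential_moves(nu=2.0, rho=1.0))
```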


Discounted Periodic Strategy. A discounted periodic strategy is a strategy where the compressed inter-arrival times are equal to a constant value δ. We refer to δ as the period of a discounted periodic strategy and to the inverse of the period, ν = 1/δ, as a strategy's play rate. Discounted periodic strategies with random phase are those periodic strategies where the compressed time of the first move or phase ϕ is drawn from the positive values smaller than δ: Δi,n = δi and xi,0 = ϕi ∼ U[0, δi]. If δ is chosen to be greater than one, and the randomly-drawn phase also turns out to be greater than one, the realized move time is not on the compressed timeline, implying that the player never moves. Constant compressed inter-arrival times imply increasing real inter-arrival times. The total number of flips that a player performs when executing a discounted periodic strategy is always between νi and νi + 1. See Fig. 1 for an illustration.

Fig. 1. A game outcome in which each player uses a discounted periodic strategy, shown on both the normal and the compressed timeline. Dots indicate defender and attacker moves. Shaded areas are proportional to the gain obtained. Thick dotted lines are proportional to the cost of moving. (Color figure online)

Interpretation. Figure 1 provides an illustration for the discounted periodic strategy. On the normal timeline, moves occur less and less frequently over time, which makes strategic sense because the resource is valued less and less over time. While the strategy has a simple periodic representation only on the compressed timeline, the value generated by the resource between two moves (the area under the curve) is constant on both timelines. In moving to a discounting-aware strategy, we add some complexity to our automation processes because the strategy must be implemented in normal time; but we gain in exchange an improved alignment with our valuation.
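
The analogous sampler for a discounted periodic strategy with random phase looks as follows (again a sketch of ours, mirroring the definitions above); it also makes explicit that the strategy is periodic only in compressed time.

```python
# A sketch (ours): the move times of a discounted periodic strategy with
# play rate nu = 1/delta and random phase.  Compressed move times are
# phi, phi + delta, phi + 2*delta, ... with phi ~ U[0, delta]; only those
# below 1 correspond to real moves.
import math, random

def discounted_periodic_moves(nu, rho, rng=random):
    delta = 1.0 / nu
    phi = rng.uniform(0.0, delta)
    moves, x = [], phi
    while x < 1.0:
        moves.append(-math.log(1.0 - x) / rho)   # T^{-1}(x)
        x += delta
    return moves

random.seed(2)
print(discounted_periodic_moves(nu=3.0, rho=1.0))
```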

4 Analysis

4.1 Player Gains

We begin our analysis by determining the player gains in our model.

Anonymous Gains. We first note that due to the normalization, the total value of the resource is exactly 1. Since at any point in time one of the two players is gaining revenue from the resource, the sum of the gains of the two players must also equal one. Let us define the anonymous gain of player i to be the (total expected) gain for that player after the first flip (by either player) has occurred. We use the word anonymous because, for both (discounted) exponential and periodic strategies, the calculation of this quantity is symmetric with respect to the two player identities (A and D). We sometimes refer to the total anonymous gain, by which we mean the sum of the anonymous gains of the two players. Note that the anonymous gain of the attacker is the total expected gain of the attacker, because if any value accrues at all for the attacker, it does so after her first flip. Since the gain of the defender is one minus the gain of the attacker, we may easily convert expressions involving the anonymous gains into expressions for player gains. This is useful because it allows us to express player gains in a uniform notation even though the structure of gains is identity-dependent.

Compressed Time. Note that the time compression used in the presentation of our discounted strategies was defined so as to preserve player gains over time. (For example, the areas of each shaded region in Fig. 1 are matched across the timelines.) Because the strategies are defined on a compressed timeline, it is easier to compute the gains by integrating over compressed time. To do this, we need to first apply the appropriate transformations so that an integration over the player control function in normal time can be performed on the compressed coordinate system. To accomplish this, let $\widetilde{PC}_i := PC_i \circ T^{-1}$. We can then write:
$$G_i = E\left[\lim_{T\to+\infty} \rho\int_{t=0}^{T} PC_i(t)\cdot e^{-\rho t}\, dt\right] = E\left[\lim_{T\to+\infty} \rho\int_{x=T(0)}^{T(T)} PC_i(T^{-1}(x))\cdot e^{-\rho\, T^{-1}(x)}\cdot\frac{dt}{dx}(x)\, dx\right]$$
$$= E\left[\rho\int_{x=0}^{1} \widetilde{PC}_i(x)\cdot e^{\ln(1-x)}\cdot\frac{1}{(1-x)\rho}\, dx\right] = E\left[\int_{x=0}^{1} \widetilde{PC}_i(x)\, dx\right].$$


Exponential Play. Now, we assume that both players are playing a discounted exponential strategy, using play rates νD and νA.

Lemma 1. For exponential play, each player i obtains a fraction of the total anonymous gain equal to
$$\frac{\nu_i}{\nu_i + \nu_j}.$$

Proof. At any moment in time, the probability that player i is the next player to move is equal to
$$p_i = \int_{\tau_i=0}^{+\infty} \nu_i e^{-\nu_i\tau_i} \int_{\tau_j=\tau_i}^{+\infty} \nu_j e^{-\nu_j\tau_j}\, d\tau_j\, d\tau_i = \frac{\nu_i}{\nu_i+\nu_j}.$$
Probability pi is, therefore, also equal to the probability that any flip by either player after time t0 is made by player i. Consider the set of all intervals between flips together with the interval between the last flip and the end of the game. For each such interval, with probability pi player i is the player who receives gain over the entire interval; and with probability 1 − pi she receives nothing. Her expected gain over each interval is, therefore, pi times the length of the interval. By linearity of expectation, her total expected gain over this set of intervals is therefore pi times the total combined duration of the intervals. □

Lemma 2. If νi + νj = 0, then the total anonymous gain is zero. Otherwise, the expected total anonymous gain for exponential play is
$$1 - \frac{1 - e^{-(\nu_i+\nu_j)}}{\nu_i+\nu_j}. \tag{3}$$

Proof. Let Xi be the time of i's first flip, and define random variable Z = min{Xi, Xj} as the time until the first flip by either player. Then
$$F_Z(z) = \Pr[Z\le z] = 1 - \Pr[X_i\ge z]\cdot\Pr[X_j\ge z] = 1 - \left(1 - (1 - e^{-\nu_i z})\right)\left(1 - (1 - e^{-\nu_j z})\right) = 1 - e^{-(\nu_i+\nu_j)z},$$
that is, Z is distributed exponentially with rate parameter νi + νj. We can then express the expected total anonymous gain as
$$\int_{z=0}^{1} \int_{\tau=z}^{1} (\nu_i+\nu_j)\, e^{-(\nu_i+\nu_j)z}\, d\tau\, dz = \int_{z=0}^{1} (\nu_i+\nu_j)\, e^{-(\nu_i+\nu_j)z}\, (1-z)\, dz,$$
which evaluates to Eq. 3. □


Note that the expected total anonymous gain never quite reaches one. There are two reasons:
– The expected gain before t0 is not part of the anonymous gain.
– There is a probability of e^{−νi} + e^{−νj} − e^{−(νi+νj)} that neither player ever flips.

An expression for the anonymous gain of player i now follows easily from Lemmas 1 and 2.

Lemma 3 (Anonymous gain for discounted exponential play). Player i's anonymous gain for discounted exponential play is
$$G_i = \frac{\nu_i}{\nu_i+\nu_j} - \nu_i\cdot\frac{1-e^{-(\nu_i+\nu_j)}}{(\nu_i+\nu_j)^2}.$$

Periodic Play. Next, we assume that both players are playing a discounted periodic strategy using play rates νD and νA.

Lemma 4 (Anonymous gain for discounted periodic play). Player i's anonymous gain for discounted periodic play is
$$G_i = \begin{cases} 1 - \dfrac{1+\nu_j}{2\nu_i} + \dfrac{\nu_j}{3\nu_i^2} & \text{if } \nu_i \ge 1 \text{ and } \nu_i \ge \nu_j,\\[4pt] \dfrac{\nu_i}{2} - \dfrac{\nu_i\nu_j}{6} & \text{if } \nu_i \le 1 \text{ and } \nu_j \le 1,\\[4pt] \dfrac{\nu_i}{2\nu_j} - \dfrac{\nu_i}{6\nu_j^2} & \text{otherwise.} \end{cases} \tag{4}$$

Proof Outline. By linearity of expectation, player i's total anonymous gain equals the sum of the expected gains over the following intervals:
– The time before player i's first flip.
– The time after player i's last flip.
– The time in between.
Player i's expected anonymous gain before her first flip is always zero. Appendix A lists derivations for the expected gains over the other two intervals.

Illustrations of Anonymous Gain. Figure 2 shows a contour plot of the anonymous gain as a function of the players' play rates. Note that the anonymous gain is monotone in both play rates. For both types of strategy configurations, the anonymous gain of player i is increasing in player i's own play rate, and decreasing in player j's play rate.
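
For readers who want to reproduce the qualitative picture of Fig. 2, the closed-form anonymous gains of Lemmas 3 and 4 are straightforward to evaluate; the following sketch (ours) does so and spot-checks the monotonicity claims above.

```python
# A sketch evaluating the closed-form anonymous gains of Lemmas 3 and 4
# (our own helper functions, mirroring Fig. 2).
import math

def gain_exponential(nu_i, nu_j):
    s = nu_i + nu_j
    return nu_i / s - nu_i * (1.0 - math.exp(-s)) / s**2

def gain_periodic(nu_i, nu_j):
    if nu_i >= 1.0 and nu_i >= nu_j:
        return 1.0 - (1.0 + nu_j) / (2.0 * nu_i) + nu_j / (3.0 * nu_i**2)
    if nu_i <= 1.0 and nu_j <= 1.0:
        return nu_i / 2.0 - nu_i * nu_j / 6.0
    return nu_i / (2.0 * nu_j) - nu_i / (6.0 * nu_j**2)

# Increasing one's own play rate increases the anonymous gain (cf. Fig. 2):
print(gain_exponential(1.0, 1.0) < gain_exponential(2.0, 1.0))   # True
print(gain_periodic(1.0, 1.0) < gain_periodic(2.0, 1.0))         # True
```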

4.2 Player Costs

Next, we determine the total player costs in our model.


Fig. 2. Contour plot of the anonymous gain of player i for (a) exponential and (b) periodic play. The dotted line illustrates where the different cases of Eq. 4 apply. (Color figure online)

Lemma 5. The cost of performing an exponential or periodic strategy with rate parameter νi is
$$C_i = \frac{c_i\,\nu_i\,\rho^2}{\rho + \rho^c_i}.$$

Proof. For both the periodic and the exponential strategy, the compressed probability density of flipping at any specific moment in time on the compressed timeline is constant and equal to νi. With respect to real time, the probability density of a move at time t is therefore
$$\nu_i\,\frac{dx}{dt} = \nu_i\,\frac{dT(t)}{dt} = \nu_i\,\frac{d}{dt}\left(1 - e^{-\rho t}\right) = \nu_i\cdot\rho\cdot e^{-\rho t}.$$
The discounted instantaneous cost of performing a flip at time t is $c_i\cdot e^{-\rho^c_i t}$. The total cost of all flips becomes
$$\rho\int_{t=0}^{+\infty} \left(\nu_i\cdot\rho\cdot e^{-\rho t}\right)\cdot\left(c_i\cdot e^{-\rho^c_i t}\right) dt = \frac{c_i\cdot\nu_i\cdot\rho^2}{\rho + \rho^c_i}. \qquad\square$$
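
A quick numerical sanity check of Lemma 5 (our own sketch, with arbitrarily chosen parameter values) approximates the integral above by a midpoint Riemann sum and compares it with the closed form.

```python
# Sketch: verify C_i = c*nu*rho^2 / (rho + rho_c) numerically.
import math

def cost_numeric(c, nu, rho, rho_c, horizon=40.0, steps=200_000):
    dt = horizon / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt        # midpoint rule
        total += (nu * rho * math.exp(-rho * t)) * (c * math.exp(-rho_c * t)) * dt
    return rho * total

c, nu, rho, rho_c = 0.3, 2.0, 1.0, 0.5
print(cost_numeric(c, nu, rho, rho_c))       # ~ 0.4 (numerical)
print(c * nu * rho**2 / (rho + rho_c))       # 0.4  (closed form of Lemma 5)
```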

4.3 Player Utilities

The utility of a player is simply her expected gains minus her expected costs. The costs are provided above. The total gain for the attacker is the same as the anonymous gain, and the total gain for the defender is 1 minus that. Therefore, this section is just an exercise in translating the results from the previous two sections. Here, we provide the explicit formulation of player utilities for exponential play. The expression of utilities for periodic play may be determined similarly, but has been omitted due to space considerations.


Theorem 1 (Utility for discounted exponential play). Player utilities for discounted exponential play are:
$$u_A = \frac{\nu_A}{\nu_A+\nu_D} - \nu_A\cdot\frac{1-e^{-(\nu_A+\nu_D)}}{(\nu_A+\nu_D)^2} - \frac{c_A\cdot\nu_A\cdot\rho^2}{\rho+\rho^c_A}$$
$$u_D = 1 - \left(\frac{\nu_A}{\nu_A+\nu_D} - \nu_A\cdot\frac{1-e^{-(\nu_A+\nu_D)}}{(\nu_A+\nu_D)^2}\right) - \frac{c_D\cdot\nu_D\cdot\rho^2}{\rho+\rho^c_D}$$

Proof. This follows from Lemmas 3 and 5 and the fact that the total gain for the attacker and defender are $G_A$ and $1 - G_A$. □
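
Written out as code, Theorem 1 amounts to the following pair of formulas (a sketch of ours; the parameter names are our own).

```python
# The closed-form utilities of Theorem 1 for discounted exponential play.
import math

def utilities_exponential(nu_D, nu_A, c_D, c_A, rho, rho_c_D, rho_c_A):
    s = nu_A + nu_D
    g_A = nu_A / s - nu_A * (1.0 - math.exp(-s)) / s**2   # attacker gain (Lemma 3)
    u_A = g_A - c_A * nu_A * rho**2 / (rho + rho_c_A)
    u_D = (1.0 - g_A) - c_D * nu_D * rho**2 / (rho + rho_c_D)
    return u_D, u_A

print(utilities_exponential(nu_D=2.0, nu_A=1.5, c_D=0.23, c_A=0.3,
                            rho=1.0, rho_c_D=1.0, rho_c_A=1.0))
```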

4.4 Player Incentives

With player utilities in hand, we may now ask what players want to do. For both discounted exponential strategies and discounted periodic strategies, each player i has the choice of selecting one real parameter νi. We are especially interested in the behavior of the partial derivative of a player's utility with respect to her own play rate, which, for a player i, may be expressed in standard notation as ∂ui/∂νi, and to which we refer in the following discussion as the incentive of player i. The following lemmas provide us with information that we need about incentives to determine the best response strategies.

Lemma 6. For discounted exponential play, each player's incentive is strictly decreasing in her own play rate.

Lemma 7. For discounted periodic play, each player's incentive is independent of her own play rate if she is the slower player or if her play rate is smaller than 1. It is strictly decreasing in her own play rate otherwise.

Lemmas 6 and 7 tell us that given an opponent's play rate, a player's incentive is upper-bounded by her incentive when not playing. We will refer to this incentive as her base incentive. Figures 3 and 4 display player incentives with each cost parameter ci set to zero. With zero costs, increasing play rate νi implies increasing utility ui, so that the incentives are always strictly positive. Increasing the cost ci decreases the incentive, but only by a constant amount, since the derivative of a player's costs with respect to her own play rate is constant as a function of play rates. This means that adding cost to the figures will only change the color labels of the graphs. We can verify that νi and νj are the only variables impacting the rate of change of the incentive with respect to νi. Figures 3 and 4, therefore, provide strong support for the claims made in Lemmas 6 and 7.

Fig. 3. Player incentives for exponential play and ci = 0: (a) defender incentive; (b) attacker incentive. (Color figure online)

Fig. 4. Player incentives for periodic play and ci = 0: (a) defender incentive; (b) attacker incentive. (Color figure online)

4.5 Player Best Responses

This subsection characterizes the best-response strategies for the attacker and defender. We begin with a discussion of non-participatory responses and characterize when they are optimal. These results apply equally to both strategy regimes. We then discuss properties of participatory best responses for the exponential and periodic strategy regimes. Non-participatory Best Responses. The following results apply generally, without restricting players to specific strategy classes. Lemma 8. The unique best response by a defender to a non-participatory attacker is not to play. Lemma 9. If the attacker’s best response to a non-participatory defender includes not playing, then not playing is a best response to any participatory defender. These two lemmas follow directly from the properties of the incentive functions (Lemmas 6 and 7).


Lemma 10. For exponential and periodic discounted play, not playing is a best response to a non-participatory defender iff
$$c_A \ge \frac{\rho + \rho^c_A}{2\rho^2}.$$
If the inequality is strict, then there cannot be any other best responses.

Proof. We can verify that $\lim_{\nu_A\to 0}\lim_{\nu_D\to 0}\, \partial u_A/\partial\nu_A$ has a single root at $c_A = \frac{\rho+\rho^c_A}{2\rho^2}$ for both exponential and periodic discounted play. The attacker's incentive is strictly non-increasing in her own play rate (Lemmas 6 and 7) and strictly decreasing in cA. It is, therefore, sufficient to show that the attacker's incentive is non-increasing in νA provided that the defender does not move. We do this by showing that the limits of the second order partial derivatives are negative. For exponential play, we can compute
$$\lim_{\nu_D\to 0}\frac{\partial^2 u_A}{\partial\nu_A^2} = \frac{2e^{-\nu_A}}{\nu_A^3}\left(1+\nu_A+\frac{\nu_A^2}{2}-e^{\nu_A}\right) = -\frac{2e^{-\nu_A}}{\nu_A^3}\sum_{i=3}^{+\infty}\frac{\nu_A^i}{i!}.$$
For periodic play, the attacker's incentive is independent of her own play rate if νA < 1. For νA > 1, we can compute
$$\lim_{\substack{\nu_D\to 0\\ \nu_D\le\nu_A}}\left.\frac{\partial^2 u_A}{\partial\nu_A^2}\right|_{\nu_A>1} = \frac{-1}{\nu_A^3}. \qquad\square$$

Participatory Best-Responses. Our first two results for participating players are deemed corollaries because they follow immediately from properties of the players' incentive functions (Lemmas 6 and 7), and the definition of a best response.

Corollary 1. Best-responses for exponential play are single-valued.

Corollary 2. Player i's best response for periodic play to an opponent play rate ν̄j can be characterized in terms of her base incentive as follows.
– If her base incentive is strictly negative, then her unique best response is not to play.
– If her base incentive is zero, then moving at any play rate αi ∈ [0, max{ν̄j, 1}] is a best response.
– If her base incentive is strictly positive, then her unique best response is to play at the rate αi > max{ν̄j, 1} for which her incentive is zero.


While an algebraic characterization of the best response functions is cumbersome to present, a numerical computation of best responses is straightforward. Moreover, because each player's incentive is non-increasing, each player is playing a best response precisely when her incentive is zero. This gives rise to an alternative interpretation of Figs. 3 and 4 as best-response curves. This interpretation is valid exactly when the value (color) of the incentive function is exactly equal to the effective cost $\partial C_i/\partial\nu_i = \frac{c_i\,\rho^2}{\rho + \rho^c_i}$. (An expression for total costs was provided in Lemma 5.)
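
As an illustration of such a numerical computation (our own sketch, not the authors' implementation), the attacker's best response under discounted exponential play can be found by bisection on her incentive, which Lemma 6 guarantees to be strictly decreasing in her own play rate; the finite-difference step and the search bracket are arbitrary choices of ours.

```python
# Sketch: numerical best response of the attacker for discounted exponential
# play.  The best response is 0 if the base incentive is non-positive and
# otherwise the unique root of the incentive d u_A / d nu_A.
import math

def u_attacker(nu_A, nu_D, c_A, rho, rho_c):
    s = nu_A + nu_D
    if s == 0.0:
        return 0.0
    gain = nu_A / s - nu_A * (1.0 - math.exp(-s)) / s**2
    return gain - c_A * nu_A * rho**2 / (rho + rho_c)

def incentive(nu_A, nu_D, c_A, rho, rho_c, h=1e-6):
    return (u_attacker(nu_A + h, nu_D, c_A, rho, rho_c)
            - u_attacker(nu_A, nu_D, c_A, rho, rho_c)) / h

def best_response_attacker(nu_D, c_A, rho, rho_c, hi=50.0, iters=60):
    lo = 1e-9
    if incentive(lo, nu_D, c_A, rho, rho_c) <= 0.0:
        return 0.0                       # base incentive non-positive: do not play
    for _ in range(iters):               # bisection on a decreasing incentive
        mid = 0.5 * (lo + hi)
        if incentive(mid, nu_D, c_A, rho, rho_c) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(best_response_attacker(nu_D=2.0, c_A=0.3, rho=1.0, rho_c=1.0))
```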

4.6 Nash Equilibria

Finally, with basic properties of our game's best responses characterized, we may extend this characterization to important properties of the game's Nash equilibria.

Non-participatory Nash Equilibria. Our first two results follow directly from the characterization of non-participatory best responses given in Lemmas 8 and 9.

Theorem 2. There is never a Nash equilibrium in which the defender plays, but the attacker does not.

Theorem 3. If the attacker is playing a discounted periodic or exponential strategy, there is a Nash equilibrium in which neither player moves iff
$$c_A \ge \frac{\rho + \rho^c_A}{2\rho^2}.$$

Participatory Nash Equilibria. Our last result describes the necessary conditions for there to be a Nash equilibrium in the discounted periodic regime. This result follows from the best response characterization for periodic play provided by Corollary 2. Nash equilibria for the exponential and periodic cases are exemplified numerically in Fig. 5.

Corollary 3. In any Nash equilibrium where both players play discounted periodic strategies at non-zero rates, the faster player f plays at a rate αf that is a root of the slower player's base incentive function. Player s is then indifferent between playing at any rate in [0, αf].

5 Discussion

This section discusses results from the previous sections and their practical impacts. Specifically, we consider the total number of player actions (Sect. 5.1) and the limited impact of time-based discounting of costs (Sect. 5.2).


Fig. 5. Defender and attacker best response curves and Nash equilibria (•) for (a) exponential and (b) periodic play; the two panels use the parameter settings cD = 0.23, cA = 0.3, ρ = ρ^c_i = 1 and cD = 0.3, cA = 0.32, ρ = ρ^c_i = 1. (Color figure online)

5.1 Finite Number of Player Actions

One interesting feature of our model with compressed strategies is that the total number of player actions is always finite. This contrasts with the non-discounted models of APTs, in which the total number of enacted actions in an outcome of a strategy is generally infinite. Our model exhibits this feature because we apply the periodic or exponential paradigm to a timeline which has been exponentially compressed. If we think about our resources and expenditures in a discounted sense, it makes sense that we would not want to keep playing forever. At some point the value of the resource will be extremely small, in which case there comes a point in time where further attacks and further expenditures on security are largely pointless. We consider it a useful feature that our modeling framework captures this dynamic.

5.2 Limited Impact of Cost Discounting

Our illustrations of best response strategies involve costs which are discounted at the same rate as the value of the resource. However, the formula for costs (see Lemma 5) is expressed in terms of notation that can apply different cost-discounting rates for each player. If we consider any fixed play rate, the impact of changing the cost-discounting rate from the resource-discounting rate down to zero is to double the total cost. This effect is substantially different from the regime in which players choose strategies from an exponential or periodic strategy on a non-discounted timeline. In evaluating a non-discounted strategy, the discount factor for costs could be infinitely important. But for the strategies considered in this paper, varying the cost-discounting factor has an effect more similar to a rounding error. An implication of this result is that it offers an additional interpretation to time-based discounting that involves a shorter duration of time. Here, we might presume that a resource being protected were discounted not merely because


of economic considerations (which also apply to costs), but rather because the very nature of the resource was short-lived. For example, the resource could involve a private key or token, with a fixed duration of validity. Such a resource would become less valuable the closer to its expiration time, although the costs to attack or defend the same resource might not decrease at all in this short time frame. The fact that our model exhibits a relatively small effect for cost discounting (compared to discounting the value of the resource) means that it could be applied in cases where time-based discounting of resource valuation were justified even though time-based discounting of costs were not.
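
The doubling effect mentioned at the start of this subsection is immediate from Lemma 5; a two-line check (ours, with arbitrary parameter values):

```python
# With the cost formula of Lemma 5, moving the cost-discounting rate from rho
# down to 0 exactly doubles the total cost for any fixed play rate.
c, nu, rho = 0.3, 2.0, 1.0
cost = lambda rho_c: c * nu * rho**2 / (rho + rho_c)
print(cost(rho), cost(0.0), cost(0.0) / cost(rho))   # 0.3  0.6  2.0
```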

6 Conclusion

The timing of security decisions is an aspect of policy-making that is generally under-appreciated. The overwhelming majority of existing research that does investigate the timing of offensive and defensive actions does not consider how the passing of time can affect the value of the resource. Of the very small number of studies that do consider this aspect of timing, each allows players to choose strategies from classes that are only well-motivated in an environment in which security artifacts retain the same value over time. In this paper, we consider the full gamut of time-dependent considerations for making strategic security-based investment decisions for stealthy resources. Our costs and resources are valued in a time-dependent manner. The individual strategies employed by our players involve making investments over time; and the strategy spaces from which players may choose are motivated by a time-discounted worldview. Discounting the value of a resource and the costs of defending it already have important implications for interpreting the security landscape involving persistent threats. When applying exponential discounting to a resource and its defense costs, its total value over time becomes finite, as does the total cost of implementing a given strategy. This fact already provides significantly more realism over less time-sensitive models, because the costs and valuation for any real-world security decision are of finite magnitude. The time-discounted regime also allows for the possibility of achieving perfect security by raising the costs of an attack, something that would not be possible if the resource were considered to have infinite value. This important consequence of exponential discounting was discussed extensively in our previous work [2] and also holds true in the regime of restricted strategies used in this paper. When we focus our attention on revised canonical strategies that are motivated by a discounted time horizon, we move further toward reality. More than simply having a finite valuation for our resource, we now consider only reasonable strategies that exhibit a finite number of actions to attack or defend it. This finite time window offers a simpler framework for understanding good responses, which can be incredibly useful in communicating policy decisions. With this work, we advance the study of time-based aspects of security decisions; but it still has a long way to go. Future work might further extend time-based discounting to methods beyond exponential discounting, and additional


classes of attack and defense strategies could be useful to consider. In the meantime, advanced persistent security threats will remain, tempered only by the certainty that the value of every protected resource is bounded; the time horizon for each new threat is finite; and every strategy with an infinite number of plans will cease to be well-motivated long before its completion. Acknowledgments. We thank the anonymous reviewers for their constructive comments and feedback. This work was partially supported by the German Institute for Trust and Safety on the Internet (DIVSI) and the Research Fund KU Leuven.

A Anonymous Gains for Periodic Play

Lemma 11. Player i expects gain after her last move:
– (Case 1) If δi ≤ 1 and δi ≤ δj, then she expects δi/2 − δi²/(6δj).
– (Case 2) If δi ≤ 1 and δi ≥ δj, then she expects δj/2 − δj²/(6δi).
– (Case 3) If δi ≥ 1 and δj ≤ 1, then she expects δj/(2δi) − δj²/(6δi).
– (Case 4) If δi ≥ 1 and δj ≥ 1, then she expects 1/(2δi) − 1/(6δiδj).

328

J. Merlevede et al.

(Case 4). There are four possible outcomes: – Neither player moves. Player i receives no gain. – Player i moves, player j does not. Player i receives an expected gain of 1/2. This outcome occurs with probability 1/δi (1 − 1/δj ). – Player j moves, player i does not. Player i receives no gain. – Both players move. Player i receives the same expected gain as a player with strategy δi = 1 against a player with strategy δj = 1. From any of the previous cases, we know that the expected gain in this scenario is 1/2 − 1/6 = 1/3. This outcome occurs with probability 1/δi δj . Taking expectations yields the stated result: 1/δi (1 −

1 1 δj ) /2

+ 1/δi δj 1/3.

 

Lemma 12. Player i expects gain after her first and before her last move: – (Case 1) If δi ≥ 1, then she expects 0. – (Case 2) If δi ≤ 1 and δi ≥ δj , then she expects δj/2δi − δj/2. – (Case 3) If δi ≤ 1 and δi ≤ δj , then she expects (1 − δi )(1 − δi/2δj ). Proof. Let Ii be an arbitrary instance of the interval between player i’s first and last move. The duration of interval Ii equals the duration of the entire game, minus the time before the first and the time after the last flip. These both have an expected duration of δi /2, so the expected duration of interval Ii is equal to 1 − δi (assuming δi ≤ 1). (Case 1). Player i flips either once or not at all, implying that Ii is always empty. Player i’s expected gain over an empty interval is zero. (Case 2; i = s, j = f ). Partition Is into sub-intervals of length δs . At the beginning of any such sub-interval, player s is in control of the resource. In expectation (over player f ’s strategy), player s remains in control for a duration of δf /2 at the beginning of every sub-interval. It follows that player s’s gain over the course of interval Is is δf/2δs times its duration. Taking expectations over player s’s strategy yields (1 − δs )δf/2δs = δf/2δs − δf/2. (Case 3; i = f , j = s). Partition If into sub-intervals of length δf . Consider any sub-interval. With probability δf /δs , the slower player moves once over the course of the sub-interval at a time that is uniformly distributed over the subinterval, yielding an expected gain of δf /2. With probability 1−δf /δs , the slower player does not move, yielding a gain of δf . Player f ’s expected gain over the sub-interval is therefore δf/δs δf/2 + (1 − δf/δs )δf = δf (1 − δf/2δs ), and player f ’s expected gain over the course of interval If is (1 − δf/2δs ) times its duration. Taking expectations over player f ’s strategy yields the result stated above.  

Time-Dependent Strategies in Games of Timing

329

References 1. van Dijk, M., Juels, A., Oprea, A., Rivest, R.L.: FlipIt: the game of “Stealthy Takeover”. J. Cryptol. 26, 655–713 (2012) 2. Merlevede, J., Johnson, B., Grossklags, J., Holvoet, T.: Exponential discounting in security games of timing. In: Workshop on the Economics of Information Security (WEIS), June 2019 3. Farhang, S., Grossklags, J.: When to invest in security? Empirical evidence and a game-theoretic approach for time-based security. In: Workshop on the Economics of Information Security (WEIS), June 2017 4. Radzik, T.: Results and problems in games of timing. Lect. Notes-Monogr. Ser. 30, 269–292 (1996) 5. Bowers, K.D., et al.: Defending against the unknown enemy: applying FlipIt to system security. In: Grossklags, J., Walrand, J. (eds.) GameSec 2012. LNCS, vol. 7638, pp. 248–263. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3642-34266-0 15 6. Laszka, A., Johnson, B., Grossklags, J.: Mitigating covert compromises. In: Chen, Y., Immorlica, N. (eds.) WINE 2013. LNCS, vol. 8289, pp. 319–332. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-45046-4 26 7. Laszka, A., Johnson, B., Grossklags, J.: Mitigation of targeted and non-targeted covert attacks as a timing game. In: Das, S.K., Nita-Rotaru, C., Kantarcioglu, M. (eds.) GameSec 2013. LNCS, vol. 8252, pp. 175–191. Springer, Cham (2013). https://doi.org/10.1007/978-3-319-02786-9 11 8. Feng, X., Zheng, Z., Hu, P., Cansever, D., Mohapatra, P.: Stealthy attacks meets insider threats: a three-player game model. In: 2015 IEEE Military Communications Conference (MILCOM), October 2015 9. Hu, P., Li, H., Fu, H., Cansever, D., Mohapatra, P.: Dynamic defense strategy against advanced persistent threat with insiders. In: 2015 IEEE Conference on Computer Communications (INFOCOM), April 2015 10. Farhang, S., Grossklags, J.: FlipLeakage: a game-theoretic approach to protect against stealthy attackers in the presence of information leakage. In: Zhu, Q., Alpcan, T., Panaousis, E., Tambe, M., Casey, W. (eds.) GameSec 2016. LNCS, vol. 9996, pp. 195–214. Springer, Cham (2016). https://doi.org/10.1007/978-3-31947413-7 12 11. Zhang, M., Zheng, Z., Shroff, N.B.: A game theoretic model for defending against stealthy attacks with limited resources. In: Khouzani, M.H.R., Panaousis, E., Theodorakopoulos, G. (eds.) GameSec 2015. LNCS, vol. 9406, pp. 93–112. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25594-1 6 12. Johnson, B., Laszka, A., Grossklags, J.: Games of timing for security in dynamic environments. In: Khouzani, M.H.R., Panaousis, E., Theodorakopoulos, G. (eds.) GameSec 2015. LNCS, vol. 9406, pp. 57–73. Springer, Cham (2015). https://doi. org/10.1007/978-3-319-25594-1 4 13. Pham, V., Cid, C.: Are we compromised? Modelling security assessment games. In: Grossklags, J., Walrand, J. (eds.) GameSec 2012. LNCS, vol. 7638, pp. 234–247. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34266-0 14 14. Laszka, A., Horvath, G., Felegyhazi, M., Butty´ an, L.: FlipThem: modeling targeted attacks with FlipIt for multiple resources. In: Poovendran, R., Saad, W. (eds.) GameSec 2014. LNCS, vol. 8840, pp. 175–194. Springer, Cham (2014). https:// doi.org/10.1007/978-3-319-12601-2 10

330

J. Merlevede et al.

15. Leslie, D., Sherfield, C., Smart, N.P.: Threshold FlipThem: when the winner does not need to take all. In: Khouzani, M.H.R., Panaousis, E., Theodorakopoulos, G. (eds.) GameSec 2015. LNCS, vol. 9406, pp. 74–92. Springer, Cham (2015). https:// doi.org/10.1007/978-3-319-25594-1 5 16. Rass, S., K¨ onig, S., Schauer, S.: Defending against advanced persistent threats using game-theory. PloS One 12, e0168675 (2017) 17. Pawlick, J., Farhang, S., Zhu, Q.: Flip the cloud: cyber-physical signaling games in the presence of advanced persistent threats. In: Khouzani, M.H.R., Panaousis, E., Theodorakopoulos, G. (eds.) GameSec 2015. LNCS, vol. 9406, pp. 289–308. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25594-1 16

Tackling Sequential Attacks in Security Games Thanh H. Nguyen1(B) , Amulya Yadav2 , Branislav Bosansky3 , and Yu Liang2 1

3

University of Oregon, Eugene, USA [email protected] 2 Pennsylvania State University, State College, USA {amulya,luy70}@psu.edu Department of Computer Science, Faculty of Electrical Engineering, Czech Technical University in Prague, Prague, Czech Republic [email protected]

Abstract. Many real-world security problems exhibit the challenge of sequential attacks (i.e., the attacker carries out multiple attacks in a sequential manner) on important targets. Security agencies have to dynamically allocate limited security resources to the targets in response to these attacks, upon receiving real-time observations regarding them. This paper focuses on tackling sequential attacks using Stackelberg security games (SSGs), a well-known class of leader-follower games, which have been applied for solving many real-world security problems. Previous work on SSGs mainly considers a myopic attacker who attacks one or multiple targets simultaneously against each defense strategy. This paper introduces a new sequential-attack game model (built upon the Stackelberg game model), which incorporates real-time observations, the behavior of sequential attacks, and strategic plans of non-myopic players. Based on the new game model, we propose practical game-theoretic algorithms for computing an equilibrium in different game settings. Our new algorithms exploit intrinsic properties of the equilibrium to derive compact representations of both game state history and strategy spaces of players (which are exponential in number in the original representations). Finally, our computational experiments quantify benefits and losses to the attacker and defender in the presence of sequential attacks.

1

Introduction

In many real-world security domains, security agencies often have to protect important targets such as critical infrastructure from sequential attacks carried out by human attackers, given information about these attacks is revealed over time. In fact, an attacker can exploit sequential nature of attacks by settingup first a decoy attack to attract the attention, following by a more severe attack. Such sophisticated attacks could mislead security agencies to allocate a majority of security resources to handle attacks that have happened, leaving other important targets less protected and thus vulnerable to subsequent attacks. This raises an important question of how to effectively assign limited security c Springer Nature Switzerland AG 2019  T. Alpcan et al. (Eds.): GameSec 2019, LNCS 11836, pp. 331–351, 2019. https://doi.org/10.1007/978-3-030-32430-8_20

332

T. H. Nguyen et al.

resources among targets in response to sequential attacks, considering real-time observations regarding these attacks. We propose to use a Stackelberg security game model (SSG) to represent these sequential-attack security scenarios. SSGs have been widely used to model the strategic interaction between the defender and attacker in security domains [1, 3,10,13,15]. In standard SSGs, the attacker is assumed to be a myopic player who only attacks once by choosing a single target (or simultaneously choosing a subset of targets [8]) against a strategy of the defender. In the sequential attack scenario, the attacker, however, is a non-myopic player who considers all future possibilities when deciding on each attack. The attacker can update its belief about the security level at each target after each attack and adapt its next attacks accordingly by leveraging the attacker’s prior knowledge of the defender’s strategy and partial real-time observations of the resources of the defender. Our work focuses on addressing the challenge of sequential attacks in security games, with the following main contributions. First, we introduce a new security game model to represent the security problems with sequential attacks. This new model incorporates real-time observations, the behavior of sequential attacks, and strategic plans of players. In the model, the non-myopic attacker carries out multiple attacks in a sequential manner. The attacker strategically adapts its actions based on real-time (partial) observations of defense activities. The defender, on the other hand, has a full observation of previous game states (i.e., which targets were attacked and/or protected) and determines an effective strategic movements of security resources among targets accordingly. Second, we propose new practical game-theoretic algorithms for computing an equilibrium in two game settings with two rounds of attacks, sorted by the defender’s capability of moving security resources after each attack. (i) In the no-resource-movement setting, the defender does not move resources when the attacker performs its attacks. This scenario reflects the worst-case situation in which the defender is unable to respond as quickly as attacks happen. (ii) In the resource-movement setting, the defender can quickly move resources among all targets after each attack. The main computational challenge of finding an equilibrium in these game settings comes from an exponential number of state histories and strategies of players used in the optimization formulation. Our algorithms address this challenge by exploiting intrinsic properties of the equilibrium to derive compact representations of both state histories and strategies. Finally, we conduct extensive experiments in various game settings to evaluate our proposed algorithm to handle sequential attacks. Our results show that the defender and attacker receive significant loss and benefit respectively if the defender does not address sequential attacks. By taking into account sequential attacks, such loss and benefit is reduced drastically.

2

Background

Stackelberg security games (SSGs) refer to a class of leader-follower games. In standard SSGs, there is a set of important targets N = {1, 2, . . . , N }. A defender

Tackling Sequential Attacks in Security Games

333

has to allocate a limited number of security resources K < N to protect these targets. A pure strategy of the defender is an allocation of these resources over the targets, denoted by s, where each resource protects exactly one target. We denote by S the set of all these pure strategies. In this work, we consider no scheduling constraint on the defender’s resource allocations and all resources are homogeneous and indistinguishable to the players. A mixed strategy of the defender is a randomization over all pure strategies, denoted by x = {x(s)} where x(s) is the probability the defender plays s. We denote by X = {x :  x(s) = 1, 0 ≤ x(s) ≤ 1} the set of all mixed strategy of the defender. On the s other hand, there is an attacker who is aware of the defender’s mixed strategy and aim at attacking one or multiple targets simultaneously in response. When the attacker attacks a target i that is protected by the defender, the attacker receives a penalty Pia while the defender obtains a reward of Rid . Conversely, the attacker gets a reward of Ria > Pia and the defender receives a penalty of Pid < Rid . We denote by S(i) = {s ∈ S : i ∈ s} which consists of all pure defense strategies that cover target i. The defender and attacker’s expected utility at i is computed respectively as follows:   x(s) (Rid − Pid ) + Pid U d (x, i) = s∈S(i)   U a (x, i) = x(s) (Pia − Ria ) + Ria s∈S(i)

In this work, we call standard SSGs as simultaneous-attack SSGs (i.e., siSSGs) to distinguish from our games with sequential attacks. Denote by L < T the number of targets the attacker can attack. Then a simultaneous-attack strategy of the attacker, denoted by asi , is a subset of L targets. Given a mixed strategy of the defender x, if the attacker plays asi , then the attacker and defender’s expected utility for playing (x, asi ) is computed as follows:  U a (x, i) U a (x, asi ) = i∈asi  U d (x, asi ) = U d (x, i) i∈asi

A pair of strategies (x∗si , a∗si ) of players form a simultaneous-attack Strong Stackelberg Equilibrium (siSSE) if and only if: x∗si = argmaxx U d (x, a∗si (x)) a∗si (x) = argmaxasi U a (x, asi ) Our work uses siSSE as a baseline to compare with our algorithms for tackling sequential attacks. The comparison results are presented in Sect. 7.

3

Related Work

Security games [7,14] are a well-known class of resource allocation games where the defender allocates scarce resources to protect selected targets. There are

334

T. H. Nguyen et al.

many variants of security games, however, none of them can be used to solve the sequential security game proposed in this paper. The main distinguishing characteristic is a combination of (1) attacker’s ability to attack multiple targets and (2) the ability of the attacker (and the defender) to execute their plans sequentially while observing and reacting to the strategy of the opponent. There are several works that consider scenarios where the attacker can attack multiple targets (defined in this paper as simultaneous-attack SSGs). In the first work, Korzhyk et al. [8] show that computing an SSE in such simultaneous-attack setting is NP-hard, however, their main goal is to design a polynomial-time for finding a Nash equilibrium. The authors, however, do not study the sequential case. Among the follow-up works, the goal of the research is typically allowing more complex scenarios (e.g., allowing dependency among the targets that is not necessarily additive [18]). In theory, one could use such generalized security games where the attacker can attack multiple targets even for modeling the sequential scenario, however, at the cost of an exponential increase in the number of strategies compared to our approach (this is similar to using a normal-form representation in order to solve a sequential game). On the other hand, in several variants of security games, the players are executing sequential actions. However, the sequential actions are often performed by the defender and not by the attacker. This is the case for applications of green security games [4,12] or applications against urban crime [5,19]. In these models, the defender moves sequentially, however, the defender is not able to observe and condition the chosen actions based on the actions of the attacker. This subclass of sequential games has been later generalized for other than security games and domain independent algorithms were provided for the zero-sum case [2,11]. This is in contrast with our game model where the defender is aware that certain target was attacked in the first step and depending on which target was attacked, the defender can choose different strategy for the next time step. Finally, security games with sequential attacks can be modeled as general extensive-form games. While computing an SSE in an extensive-form game of this class (with imperfect information) is NP-hard [9], there are several domainindependent algorithms for computing an SSE in extensive-form games that can be, in theory, used for solving security games with sequential attacks [2,6,16, 17]. However, extensive-form games do not allow compact representations of strategies and, moreover, all existing algorithms compute solution for strategies with perfect memory. Hence even in a small game with five resources and 10 targets, the defender has 252 possibilities and the game with only two possible attack steps has more than 6 · 105 states. This size corresponds to maximal sizes of games solvable by existing state-of-the-art algorithms for computing a Stackelberg equilibrium in extensive-form games. Contrary, since our novel algorithm is specifically tailored for security games, we are able to achieve much better scalability and we are able to find solutions for more than 20 targets.

Tackling Sequential Attacks in Security Games

4

335

Sequential-Attack Game Model

Our sequential-attack SSG (i.e., seSSG) model is built upon the standard SSG model. Initially, the defender randomly allocates security resources to targets according to a mixed strategy (as defined in siSSGs). At the execution time, the defender employs a pure strategy which is sampled from that strategy. The attacker is aware of the mixed strategy of the defender but does not know which targets the defender is protecting at the execution time. In our game, we assume that when the attacker attacks a target i, it can discover if the defender is covering that target or not. Nevertheless, the attacker is still unaware of the current protection status at other targets. By attacking targets sequentially, the attacker is able to explore which targets are being protected by the defender. Based on observations from previous attacks, the attacker can update its belief about the defender’s strategy and then decide on targets to attack next that would benefit the attacker the most. In this work, we study the attack scenario in which the attacker can carry out L > 1 rounds of attacks and attack one target at each round. The defender has to move security resources among targets in response to such sequential attacks. We assume that when a target is attacked, the damage caused by the attack (if any) to the target is already done. Thus, this target will not be considered in future attack rounds. In addition, if there is a security resource at the attacked target, the resource has to resolve that attack and thus the defender can no longer use that resource for future defense. 4.1

Players’ Strategies

State and Observation History. At each attack round l ∈ {1, 2, ..., L}, we denote by odl = {s1 , i1 , s2 , i2 , ..., sl−1 , il−1 } the state history of the game. In particular, sl is the deployment of security resources and il is the attacked target at round l < l. The defender knows odl while the attacker only has partial observations of the game states. The attacker’s observation history is denoted by oal = {(i1 , c(i1 )), (i2 , c(i2 )), ..., (il−1 , c(il−1 ))} where il is the target attacked at round l and c(il ) ∈ {0, 1} represents if the defender is protecting il (i.e., c(il ) = 1) or not (i.e., c(il ) = 0). At round 1 specifically, od1 ≡ ∅ and oa1 ≡ ∅. Example: As an example, consider a security game with three targets, two defender resources, and two attack rounds. A possible state history at round l = 2 is od2 = {(1, 2), 2} in which the defender protects targets s1 = (1, 2) and the attacker attacks target i1 = 2 at round 1. The corresponding observation history of the attacker is oa2 = {(2, c(2) = 1)} in which the attacker attacks target i1 = 2 and the defender protects that target (i.e., c(i1 ) = 1) at round 1. Defender Strategy. At each round l, given a state history odl , the defender (re-)distributes active security resources (i.e., resources at targets which have not been attacked) to the targets according to some constraints on resource movements. In the previous example of the 3-target games, in the state history

336

T. H. Nguyen et al.

od2 = {(1, 2), 2}, the defender was protecting target i1 = 2 when the attacker attacks that target at round 1. Therefore, at round l = 2, the only remaining active security resource of the defender is located at target 1. Suppose that the defender is able to move that security resource to target 3 at round 2, the defender now has to choose either keep that resource at target 1 or move the resource to target 3. We denote by Sl (odl ) the set of all possible feasible deployments of security resources at round l given the state history odl . A behavior strategy of the defender at odl , denoted by xl = {xl (sl | odl )}, is a probability distribution over Sl (odl ) in which xl (sl | odl ) is the probability the defender plays the deployment sl at round l given the state history odl . For example, in the 3-target game, given od2 = {(1, 2), 2}, there are two feasible deployments at round l = 2: s2 = (1) and s2 = (3). An example of a behavior strategy of the defender is to protect target 1 and 3 with a probability of 0.6 and 0.4, respectively. At round l, the defender executes a deployment sl which is randomly drawn from the strategy xl . At round 1 specifically, x1 ∈ X is equivalent to a mixed strategy of the defender in the corresponding siSSG. Attacker Strategy. (Stackelberg assumption) We assume the attacker is aware of the defender’s behavior strategies, i.e., the probability of each resource deployment of the defender given a state history odl , but not the actual deployments sl . The attacker decides to attack a target il based on its observation history oal . Bayesian Update. At each round l ∈ {1, 2, ..., L}, given an observation history oal , the attacker can update its belief regarding the defender’s strategy using Bayesian update, formulated as follows:  β(odl | oal )xl (sl | odl ) β(sl | oal ) = d ol

β(odl β(odl

| |

oal ) oal )

= 0 if odl and oal are not consistent ∝ P (odl ) = P (odl−1 , sl−1 , il−1 ) = xl−1 (sl−1 | odl−1 )P (odl−1 ), otherwise

where β(sl | oal ) and β(odl | oal ) are the updated belief of the attacker. In particular, β(sl | oal ) is the probability the defender plays sl at round l and β(odl | oal ) is the probability the state history is odl given oal . Finally, odl and oal are consistent if they share the same attack sequence (i1 , i2 , . . . , il−1 ) and / sl−1 , otherwise. Based on the belief update il−1 ∈ sl−1 if c(il−1 ) = 1 and il−1 ∈ β(sl | oal ), the attacker will choose next target to attack accordingly. 4.2

Players’ Utility

Suppose that the defender plays xse = {xl (sl | odl )} (which consists of all behavior strategies of the defender at all of state histories) and the attacker plays ase = {il (oal )} (which consist of all choices of targets to attack at all of observation histories of the attacker), then players’ utility at each round can be computed using backward induction as follows:1 1

Sometimes we omit oal and odl when the context is clear.

Tackling Sequential Attacks in Security Games

337

At round L, given an observation history oaL , the attacker’s total expected utility for attacking a target iL (oaL ) (shorten by iL ) is computed as follows:     β(sL | oaL ) PiaL + β(sL | oaL ) RiaL U a (iL (oaL )) = sL :iL ∈sL

sL :iL ∈s / L

On the other hand, given a state history odL , the defender’s total expected utility when the attacker attacks a target iL (odL ) is computed as follows:     xL (sL | odL ) RidL + xL (sL | odL ) PidL U d (iL (odL )) = sL :iL ∈sL

sL :iL ∈s / L

where iL (odL ) ≡ iL (oaL ) with the attacker’s observation history oaL is consistent with the state history odL . At round l < L, given an observation history oal , attacker’s total expected utility for attacking a target il (oal ) (shorten by il ) is computed as follows:    β(sl | oal ) Pial + U a (il+1 (oal , (il , c(il ) = 1))) U a (il (oal )) = sl :il ∈sl    β(sl | oal ) Rial + U a (il+1 (oal , (il , c(il ) = 0))) + sl :il ∈s / l

which comprises of (i) the immediate expected utility for current round and (ii) future expected utility as a result of the current attack. On the other hand, given a state history odl with a positive probability, the defender’s total expected utility when the attacker attacks a target il (odl ) is computed as follows:     xl (sl | odl ) Ridl + xl (sl | odl ) Pidl U d (il (odl )) = sl :il ∈sl sl :il ∈s / l  d d d d + xl (sl | ol )U (il+1 (ol , sl , il (ol ))) sl

where il (odl ) ≡ il (oal ) with the observation history of the attacker oal is consistent with the state history odl . Finally, players’ total expected utility for playing (xse , ase ) is determined as:2 U d (xse , ase ) = U d (i1 ) U a (xse , ase ) = U a (i1 ) 4.3

Sequential-Attack SSE

A pair of strategies of players (x∗se = {x∗l (sl | odl )}, a∗se (x∗ ) = {i∗l (oal )}) forms a sequential-attack SSE (seSSE) if and only if: x∗se = argmax U d (xse , a∗se (xse )) xse

a∗se (xse ) = argmax U a (xse , ase ). ase

2

For the sake of presentation, we will omit ∅ (i.e., i1 = i1 (∅)) from all contexts.

338

T. H. Nguyen et al.

In this paper, we study seSSGs with the focus on computing an seSSE of the game in two game settings, sorted by the defender’s capability of moving security resources after each attack. (i) In the resource-movement setting, the defender can move security resource from a target to any other target after each attack. This setting captures the situation in which the defender has a full capability of re-allocating resources in response to every attack. (i) In the no-resourcemovement setting, the defender does not move resources when attacks happens. This setting reflects the worst security scenario in which the defender is unable to react (by moving resources) as quickly as attacks happening. Analyzing and finding an seSSE in general is computationally expensive which involves exponentially many strategies of both the attacker and the defender, as well as exponentially many state histories. Therefore, in this paper, we focus on developing efficient game-theoretic algorithms to compute an seSSE in these game settings in which the attacker attacks two targets sequentially, i.e., L = 2. In fact, as we shown later, even in this 2-round game scenario, the search space is still exponential. We provide efficient algorithms to compute an seSSE, by (i) exploiting underlying characteristics of seSSGs to compactly represent spaces of state histories and strategies of players; and (ii) applying optimization techniques such as cutting plane to scale up the computation of an seSSE.

5 The Resource-Movement Setting

In the resource-movement game setting with two (sequential) attacks, the defender has to determine: (i) how to allocate resources before any attack happens; and (ii) how to move resources in response to the first attack.

5.1 Compact Representation

Solving the 2-round seSSE is computationally expensive since it involves an exponential number of resource allocations and movements of the defender. In fact, the exponentially many resource allocations at the first round also lead to an exponential number of state histories at the second round. In the following, we introduce an equivalent compact representation of the players' strategies and state histories, leveraging the resource-movement property. We first present a characteristic of the seSSE in Theorem 1.

Theorem 1. In 2-round seSSGs with resource movements, the defender's strategy at the second round in the seSSE can be compactly represented such that the compact representation only depends on which target is attacked at the first round and whether the defender is protecting that target or not.

Proof. Consider the seSSE of the game $(x^*_{se}, a^*_{se})$. According to the definition of the seSSE in Sect. 4.3, given the players' strategy at the first round $(\{x^*_1(s_1)\}, i^*_1)$ in the seSSE, the defender's strategy at the second round is the optimal solution of the following optimization problem:

$$\max_{x_2} \sum_{s_1 : i^*_1 \in s_1} x^*_1(s_1) \sum_{s_2} x_2(s_2 \mid s_1, i^*_1)\, V^d\big(s_2, i^*_2(i^*_1, c(i^*_1)=1)\big) \qquad (1)$$
$$+ \max_{x_2} \sum_{s_1 : i^*_1 \notin s_1} x^*_1(s_1) \sum_{s_2} x_2(s_2 \mid s_1, i^*_1)\, V^d\big(s_2, i^*_2(i^*_1, c(i^*_1)=0)\big) \qquad (2)$$

which maximizes the defender's expected utility at the second round. The term $V^d(s_2, i^*_2(i^*_1, c(i^*_1)=1))$ is equal to the defender's reward at target $i^*_2(i^*_1, c(i^*_1)=1)$ if the deployment of defender resources $s_2$ covers that target. Otherwise, it is equal to the defender's penalty at that target. The first optimization component in (1) can be equivalently represented as follows:

$$\Big[\sum_{s_1 : i^*_1 \in s_1} x^*_1(s_1)\Big] \max_{x_2} \sum_{s_2,\ s_1 : i^*_1 \in s_1} \frac{x^*_1(s_1)}{\sum_{s'_1 : i^*_1 \in s'_1} x^*_1(s'_1)}\, x_2(s_2 \mid s_1, i^*_1)\, V^d\big(s_2, i^*_2(i^*_1, c(i^*_1)=1)\big)$$

Also, the attacker's second attack $i^*_2(i^*_1, c(i^*_1)=1)$ is an optimal solution of:

$$\max_{i_2 \neq i^*_1} \sum_{s_2,\ s_1 : i^*_1 \in s_1} \beta(s_1 \mid i^*_1, c(i^*_1)=1)\, x_2(s_2 \mid s_1, i^*_1)\, V^a(s_2, i_2)$$

which maximizes the attacker's expected utility at the second round. The term $V^a(s_2, i_2)$ is equal to the attacker's reward at $i_2$ if the target is not covered by the defender's deployment $s_2$. Otherwise, it is equal to the attacker's penalty at that target. The attacker's updated belief is computed using a Bayesian update:

$$\beta(s_1 \mid i^*_1, c(i^*_1)=1) = \frac{x^*_1(s_1)}{\sum_{s'_1 : i^*_1 \in s'_1} x^*_1(s'_1)}, \quad \text{if } i^*_1 \in s_1$$

We introduce the following new variables:

$$y^{i^*_1,1}(s_2) = \sum_{s_1 : i^*_1 \in s_1} \frac{x^*_1(s_1)}{\sum_{s'_1 : i^*_1 \in s'_1} x^*_1(s'_1)}\, x_2(s_2 \mid s_1, i^*_1), \quad \forall s_2$$

Since the defender can move a resource from one target to any other target without any constraint, any values of $y^{i^*_1,1} = \{y^{i^*_1,1}(s_2)\}$ such that

$$\sum_{s_2} y^{i^*_1,1}(s_2) = 1, \quad y^{i^*_1,1}(s_2) \in [0,1],\ \forall s_2$$

are equivalent to a strategy of the defender $\{x_2(s_2 \mid s_1, i^*_1)\}$ at the second round (i.e., we can simply assign $x_2(s_2 \mid s_1, i^*_1) = y^{i^*_1,1}(s_2)$ for all $s_2$). Therefore, for the rest of this section, we consider $y^{i^*_1,1} = \{y^{i^*_1,1}(s_2)\}$ as a strategy of the defender at the second round, given that the attacker attacks target $i^*_1$ at the first round and the defender is covering that target. The first optimization component in (1) is now equivalent to the following optimization problem:

$$\max_{y^{i^*_1,1}} \quad \sum_{s_2} y^{i^*_1,1}(s_2)\, V^d\big(s_2, i^*_2(i^*_1, c(i^*_1)=1)\big)$$
$$\text{s.t.} \quad i^*_2(i^*_1, c(i^*_1)=1) = \arg\max_{i_2 \neq i^*_1} \sum_{s_2} y^{i^*_1,1}(s_2)\, V^a(s_2, i_2)$$
$$\qquad \sum_{s_2} y^{i^*_1,1}(s_2) = 1, \quad y^{i^*_1,1}(s_2) \in [0,1]$$


which results in a siSSE of a siSSG which consists of (i) the target set $N \setminus \{i^*_1\}$; (ii) $K-1$ security resources; and (iii) one attack. Similarly, we can also show that the second optimization component in (2) corresponds to a siSSE of the similar game but with $K$ security resources. These siSSEs only depend on which target is attacked and whether the defender protects that target at the first round. □

Based on Theorem 1, each strategy of the defender can now be compactly represented as having two components: (i) $\{x_1(s_1)\}$, where $x_1(s_1)$ is the probability the defender plays $s_1 \in S$ before any attacks; and (ii) $(\{y^{i,1}(s_2)\}, \{y^{i,0}(s_2)\})$, where $y^{i,1}(s_2)$ is the probability the defender plays $s_2 \in S_{K-1}(-i)$ if the first attack is toward target $i$ and the defender is protecting that target. The set $S_{K-1}(-i)$ is the set of subsets of $K-1$ targets (there are $K-1$ resources left) excluding target $i$. In addition, $y^{i,0}(s_2)$ is the probability the defender plays $s_2 \in S_K(-i)$ if the first attack is toward target $i$ while the defender is not protecting this target (the defender allocates $K$ resources to targets excluding target $i$).

Note that this compact representation of the defender's strategies still involves all possible deployments of the defender's security resources at each round, the number of which is exponential. We therefore provide Proposition 1 (its proof is in Online Appendix A)³, showing that the defender's strategies at each round are equivalent to compact marginal coverage probabilities at every target.

Proposition 1. In the sequential-attack game with resource movements, the defender's strategies can be compactly represented as follows:

$$x_j = \sum_{s_1 \ni j} x_1(s_1), \qquad \sum_j x_j = K, \ \forall j$$
$$y^{i,1}_j = \sum_{s_2 \ni j} y^{i,1}(s_2), \qquad \sum_{j \neq i} y^{i,1}_j = K-1, \ \forall i,\ j \neq i$$
$$y^{i,0}_j = \sum_{s_2 \ni j} y^{i,0}(s_2), \qquad \sum_{j \neq i} y^{i,0}_j = K, \ \forall i,\ j \neq i$$

where $x_j$ is the marginal probability the defender protects target $j$ before any attack happens. In addition, $y^{i,1}_j$ and $y^{i,0}_j$ are the marginal probabilities the defender protects target $j \neq i$ after the attacker attacked target $i$ while the defender was protecting and was not protecting that target, respectively.

³ Online appendix: https://www.dropbox.com/s/hjyjabfg69llyn3/Appendix.pdf?dl=0.
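As an illustration of Proposition 1, the short Python sketch below (with an arbitrary toy strategy, not taken from the paper) converts a first-round mixed strategy $\{x_1(s_1)\}$ into the marginal coverage probabilities $x_j$ and checks that they sum to $K$.

```python
# Sketch: marginal coverage probabilities from a mixed strategy (Proposition 1).
from itertools import combinations

def marginal_coverage(mixed_strategy, targets):
    """mixed_strategy: dict mapping frozenset (deployment) -> probability."""
    x = {j: 0.0 for j in targets}
    for s, prob in mixed_strategy.items():
        for j in s:
            x[j] += prob
    return x

# Toy example: 4 targets, K = 2 resources, uniform over all deployments.
targets = range(4)
K = 2
deployments = [frozenset(c) for c in combinations(targets, K)]
mixed = {s: 1.0 / len(deployments) for s in deployments}
x = marginal_coverage(mixed, targets)
print(x, sum(x.values()))  # the marginals sum to K = 2
```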

5.2 Mixed Integer Linear Program (MILP) Representation

According to Theorem 1, given the attacked target $i^*_1$ at round 1, the players' equilibrium strategies in round 2 are equivalent to a siSSE of a siSSG, which consists of (i) a target set $N \setminus \{i^*_1\}$; (ii) $K$ defender resources (if the defender is not protecting $i^*_1$ in the original seSSG) or $K-1$ resources (otherwise); and (iii) one attack. This siSSE can be computed in advance. Therefore, we introduce the following MILP to compute the seSSE of the 2-round seSSG based on these pre-computed siSSEs and the compact representation described in Proposition 1:

$$\max \ v$$
$$\text{s.t.} \quad v \le x_i(R^d_i - P^d_i) + P^d_i + (1-x_i)\,U^d_{siSSE}(i,0) + x_i\,U^d_{siSSE}(i,1) + (1-h_i)M, \ \forall i$$
$$\qquad r \ge x_i(P^a_i - R^a_i) + R^a_i + (1-x_i)\,U^a_{siSSE}(i,0) + x_i\,U^a_{siSSE}(i,1), \ \forall i$$
$$\qquad r \le x_i(P^a_i - R^a_i) + R^a_i + (1-x_i)\,U^a_{siSSE}(i,0) + x_i\,U^a_{siSSE}(i,1) + (1-h_i)M, \ \forall i$$
$$\qquad \sum_i h_i = 1, \quad h_i \in \{0,1\}$$

where $U^d_{siSSE}(i,0)$ and $U^d_{siSSE}(i,1)$ are the defender's utilities in the siSSEs of the resulting siSSGs after the attacker attacks $i$ while the defender is not protecting and is protecting the target, respectively. Similarly, $U^a_{siSSE}(i,0)$ and $U^a_{siSSE}(i,1)$ are the attacker's equilibrium utilities in the resulting siSSEs. In addition, $v$ is the defender's total expected utility, which we aim to maximize, and $r$ is the attacker's total expected utility. The binary variable $h_i$ represents whether the attacker attacks target $i$ ($h_i = 1$) or not ($h_i = 0$) at the first round.
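The following PuLP sketch shows one way this MILP could be set up; it is a hedged example rather than the authors' implementation. The payoff values and the pre-computed siSSE utilities $U^d_{siSSE}(i,\cdot)$, $U^a_{siSSE}(i,\cdot)$ are placeholders, and the marginal-feasibility constraints $\sum_i x_i = K$, $x_i \in [0,1]$ from Proposition 1 are written out explicitly since they are only implicit in the displayed program.

```python
# Hedged sketch: MILP of Sect. 5.2 with placeholder payoffs and siSSE values.
import pulp

n, K, M = 3, 1, 1e4
Rd = [2.0, 4.0, 3.0]; Pd = [-3.0, -5.0, -2.0]            # defender reward/penalty
Ra = [3.0, 5.0, 2.0]; Pa = [-2.0, -4.0, -3.0]            # attacker reward/penalty
Ud_si = {(i, c): -1.0 for i in range(n) for c in (0, 1)}  # placeholder siSSE utilities
Ua_si = {(i, c): 1.0 for i in range(n) for c in (0, 1)}

prob = pulp.LpProblem("seSSE_resource_movement", pulp.LpMaximize)
v = pulp.LpVariable("v")
r = pulp.LpVariable("r")
x = [pulp.LpVariable(f"x_{i}", lowBound=0, upBound=1) for i in range(n)]
h = [pulp.LpVariable(f"h_{i}", cat="Binary") for i in range(n)]

prob += v                                   # maximize the defender's utility
prob += pulp.lpSum(x) == K                  # marginal coverage (Proposition 1)
prob += pulp.lpSum(h) == 1                  # exactly one first-round attack
for i in range(n):
    att = x[i] * (Pa[i] - Ra[i]) + Ra[i] + (1 - x[i]) * Ua_si[i, 0] + x[i] * Ua_si[i, 1]
    dfd = x[i] * (Rd[i] - Pd[i]) + Pd[i] + (1 - x[i]) * Ud_si[i, 0] + x[i] * Ud_si[i, 1]
    prob += v <= dfd + (1 - h[i]) * M       # defender utility at the attacked target
    prob += r >= att                        # r upper-bounds every attack option
    prob += r <= att + (1 - h[i]) * M       # ...and is attained at the attacked one
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.value(v), [pulp.value(hi) for hi in h])
```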

6 The No-Resource-Movement Setting

In this section, we study the problem in which the defender does not move resources when the attacker performs its attacks. This setting reflects the response-delayed security scenario in which the defender is unable to react (by moving resources) as quickly as attacks happen. In this scenario, the defender's goal is to optimize his randomization over resource allocations before such sequential attacks happen. Therefore, we use the same notation $x = \{x(s)\}$ as in the simultaneous-attack games to represent the defender's mixed strategy. An attack strategy of the attacker is denoted by $a_{se}$, which is defined in the same way as in the unconstrained resource-movement setting. The seSSE is still denoted by $(x^*_{se}, a^*_{se})$.

6.1 Equilibrium Analysis

We first provide Theorem 2, showing the benefit and loss of the attacker and defender for playing sequentially instead of simultaneously in zero-sum games.

Theorem 2. In zero-sum games, the attacker obtains a higher utility while the defender gets a lower utility from sequential attacks than from simultaneous attacks. In particular, we have:

$$U^d(x^*_{se}, a^*_{se}(x^*_{se})) \le U^d(x^*_{si}, a^*_{si}(x^*_{si})) \qquad (3)$$
$$U^a(x^*_{se}, a^*_{se}(x^*_{se})) \ge U^a(x^*_{si}, a^*_{si}(x^*_{si})) \qquad (4)$$

Proof. (i) First, given any defense strategy x, the attacker always obtains a higher total expected utility for playing sequentially than simultaneously. Indeed,


we denote the simultaneous-attack best response $a^*_{si} = (i^*, j^*)$. The corresponding expected utility of the attacker is $U^a(x, i^*) + U^a(x, j^*)$. On the other hand, the sequential-attack best response satisfies:

$$\max_{a_{se}} U^a(x, a_{se}) \ge U^a(x, i^*) + U^a(x, j^*).$$

Essentially, the attacker's simultaneous-attack best response $(i^*, j^*)$ corresponds to a feasible sequential-attack response in which the attacker attacks $i^*$ with the first attack and then attacks $j^*$ with the second attack regardless of the outcome of the first attack. Therefore, $U^a(x^*_{se}, a^*_{se}(x^*_{se})) \ge U^a(x^*_{se}, a^*_{si}(x^*_{se}))$. (ii) In addition, based on the definition of the siSSE, we have the defender's utility $U^d(x^*_{se}, a^*_{si}(x^*_{se})) \le U^d(x^*_{si}, a^*_{si}(x^*_{si}))$. This inequality is equivalent to $U^a(x^*_{se}, a^*_{si}(x^*_{se})) \ge U^a(x^*_{si}, a^*_{si}(x^*_{si}))$ (according to the zero-sum property). Based on (i) and (ii), we have $U^a(x^*_{se}, a^*_{se}(x^*_{se})) \ge U^a(x^*_{si}, a^*_{si}(x^*_{si}))$. Since this game is zero-sum, we obtain $U^d(x^*_{se}, a^*_{se}(x^*_{se})) \le U^d(x^*_{si}, a^*_{si}(x^*_{si}))$. □

Computing an seSSE is computationally expensive due to an exponential number of pure strategies of the defender. In the following, we present our MILP formulation to exactly compute an seSSE. We then provide a scalable algorithm based on the compact representation and a cutting-plane method to overcome this computational challenge. We denote by $S(i)$ and $S(-i)$ the sets of pure strategies of the defender which cover and do not cover target $i$, respectively. We denote by $S(i,j)$ the set of pure strategies which cover both $i$ and $j$.

6.2 Equilibrium Computation: MILP Formulation

We first present Lemma 1, showing the linear relationship between the players' total expected utilities and the defender's mixed strategy. This result serves as a basis to develop an MILP to exactly compute an seSSE.

Lemma 1. Given an attack strategy $a_{se}$, both the attacker's and the defender's total expected utilities are linear functions of the defender's mixed strategy $x$.

Proof. Based on the computation of the players' utilities described in Sect. 4, we can represent the attacker's total expected utility as follows:

$$U^a(x, a_{se}) = \sum_{s \in S(i_1)} x(s)\,(P^a_{i_1} - R^a_{i_1}) + R^a_{i_1} \qquad (5)$$
$$+ \Big[\sum_{s \in S(i_1)} x(s)\Big]\Big[\sum_{s \in S(i_1, i_2(i_1,1))} \beta(s \mid i_1, c(i_1)=1)\,(P^a_{i_2(i_1,1)} - R^a_{i_2(i_1,1)}) + R^a_{i_2(i_1,1)}\Big]$$
$$+ \Big[\sum_{s \in S(-i_1)} x(s)\Big]\Big[\sum_{s \in S(-i_1, i_2(i_1,0))} \beta(s \mid i_1, c(i_1)=0)\,(P^a_{i_2(i_1,0)} - R^a_{i_2(i_1,0)}) + R^a_{i_2(i_1,0)}\Big]$$

where the attacker's updated belief is computed using the Bayesian update:

$$\beta(s \mid i_1, c(i_1)=1) = \frac{x(s)}{\sum_{s' \in S(i_1)} x(s')}, \quad \text{if } s \in S(i_1)$$


The updated belief, $\beta(s \mid i_1, c(i_1)=0)$, is computed similarly. Therefore, the attacker's total expected utility can be formulated as follows:

$$U^a(x, a_{se}) = \sum_{s \in S(i_1)} x(s)\,(P^a_{i_1} - R^a_{i_1}) + R^a_{i_1} \qquad (6)$$
$$+ \sum_{s \in S(i_1, i_2(i_1,1))} x(s)\,(P^a_{i_2(i_1,1)} - R^a_{i_2(i_1,1)}) + \Big[\sum_{s \in S(i_1)} x(s)\Big] R^a_{i_2(i_1,1)}$$
$$+ \sum_{s \in S(-i_1, i_2(i_1,0))} x(s)\,(P^a_{i_2(i_1,0)} - R^a_{i_2(i_1,0)}) + \Big[\sum_{s \in S(-i_1)} x(s)\Big] R^a_{i_2(i_1,0)}$$

which is a linear function of $x$. Finally, we apply the same computation process to show that the defender's total expected utility is also linear in $x$. □

In the seSSE, our goal is to find an optimal strategy for the defender that maximizes the defender's total expected utility. Based on Lemma 1, this problem can be represented as an MILP, formulated as follows:

$$\max_{x,\, a_{se}} \ v \qquad (7)$$
$$\text{s.t.} \quad v \le U^d(x, a_{se}) + (3 - h_i - q^1_j - q^0_k)M, \ \forall i \neq j, k \qquad (8)$$
$$r \ge U^a(x, a_{se}), \ \forall i \neq j, k \qquad (9)$$
$$r \le U^a(x, a_{se}) + (3 - h_i - q^1_j - q^0_k)M, \ \forall i \neq j, k \qquad (10)$$
$$\sum_i h_i = 1, \quad h_i \in \{0,1\} \qquad (11)$$
$$\sum_j q^0_j = 1, \quad \sum_j q^1_j = 1, \quad q^0_j, q^1_j \in \{0,1\} \qquad (12)$$
$$h_i + q^0_i \le 1, \quad h_i + q^1_i \le 1. \qquad (13)$$

where $a_{se}$ is defined as $i_1 = i$, $i_2(i_1,1) = j$, and $i_2(i_1,0) = k$ (i.e., the attacker will attack target $j$ at round 2 if the defender protects target $i$ at round 1, and attack target $k$ otherwise). We introduce three binary variables: (i) $h_i = 1$ if the attacker attacks target $i$ at the first round and $h_i = 0$ otherwise; (ii) $q^0_j = 1$ if the attacker attacks target $j$ at the second round given that the defender is not protecting the target attacked at the first round, and $q^0_j = 0$ otherwise; and (iii) $q^1_j \in \{0,1\}$ is defined analogously given that the defender is protecting the target attacked at the first round. In the MILP, $v$ and $r$ are the defender's and attacker's total expected utilities. Constraint (8) determines the defender's utility given a best response of the attacker (when $h_i = 1$, $q^1_j = 1$, $q^0_k = 1$). Constraints (9–10) ensure that $r$ is the attacker's utility with respect to the attacker's best response (when $h_i = 1$, $q^1_j = 1$, $q^0_k = 1$). Constraints (11–12) imply that the attacker attacks one target at each round. Finally, constraint (13) imposes that if the attacker already attacks a target $i$, it will not attack $i$ again. Here, $M$ is a large constant which ensures that the associated constraints are effective only when $h_i = 1$, $q^1_j = 1$, $q^0_k = 1$. While the MILP provides an exact seSSE, it has limited scalability due to the exponential number of pure strategies of the defender in $S$. Therefore, we introduce a new, efficient algorithm to overcome this computational issue.
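To make Lemma 1 and Eq. (6) concrete, the sketch below evaluates $U^a(x, a_{se})$ for a small game by enumerating pure defender strategies; the payoffs are illustrative placeholders and the code is an interpretation of Eq. (6), not the authors' implementation.

```python
# Sketch: U^a(x, a_se) per Eq. (6) over an enumerated pure-strategy space.
from itertools import combinations

def attacker_utility(x, a_se, Ra, Pa):
    """x: dict frozenset(coverage) -> prob; a_se = (i1, j, k) with j = i2(i1,1), k = i2(i1,0)."""
    i1, j, k = a_se
    p_cov_i1 = sum(p for s, p in x.items() if i1 in s)          # Pr(i1 covered)
    u = p_cov_i1 * (Pa[i1] - Ra[i1]) + Ra[i1]                    # first-round term
    u += sum(p for s, p in x.items() if i1 in s and j in s) * (Pa[j] - Ra[j]) \
         + p_cov_i1 * Ra[j]                                      # second attack, i1 covered
    u += sum(p for s, p in x.items() if i1 not in s and k in s) * (Pa[k] - Ra[k]) \
         + (1 - p_cov_i1) * Ra[k]                                # second attack, i1 uncovered
    return u

# Toy 3-target, K = 1 game with a uniform mixed strategy.
targets, K = [0, 1, 2], 1
S = [frozenset(c) for c in combinations(targets, K)]
x = {s: 1.0 / len(S) for s in S}
Ra = {0: 4.0, 1: 6.0, 2: 2.0}; Pa = {0: -3.0, 1: -5.0, 2: -1.0}
print(attacker_utility(x, (1, 2, 0), Ra, Pa))   # attack 1, then 2 if covered, else 0
```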

6.3 Equilibrium Computation: Scalable Algorithm

Our new algorithm comprises two ideas: (i) a compact representation and (ii) a cutting-plane method. For the first idea, we provide a relaxed compact representation of the defender's mixed strategies. This compact representation may not be equivalent to a feasible mixed strategy. Yet, the number of compact defense strategies is significantly smaller (i.e., $O(|N|^2)$) compared to the original representation, in which the number of strategies is $\binom{N}{K}$. For the second idea, we apply the cutting-plane method to gradually refine the compact strategy space. This iterative process stops when the optimal strategy of the defender found in the refined compact space is feasible in the original space, which means an optimal mixed strategy of the defender has been found.

Compact Representation. We propose the following compact representation:

$$x_{i,j} = \sum_{s \in S(i,j)} x(s)$$

where $x_{i,j}$ represents the probability the defender protects both target $i$ and target $j$ simultaneously. The following resource constraint is associated with $\{x_{i,j}\}$:

$$\sum_i \sum_{j \neq i} x_{i,j} = 2\sum_{i < j} x_{i,j} = K \times (K-1) \qquad (14)$$

1. $\sum_{i \neq j} \bar{x}_{ij}(y^*_{ij} - z^*_{ij}) + \lambda^* > 0$.
2. $\sum_{i \neq j} x_{ij}(y^*_{ij} - z^*_{ij}) + \lambda^* \le 0$, for all feasible $\{x_{ij}\}$.

Proof. Since $\{\bar{x}_{ij}\}$ is infeasible, $\sum_{i \neq j} \bar{x}_{ij}(y_{ij} - z_{ij}) + \lambda > 0$ for all feasible $\{y_{ij}, z_{ij}, \lambda\}$ of the dual program. Therefore, we have $\sum_{i \neq j} \bar{x}_{ij}(y^*_{ij} - z^*_{ij}) + \lambda^* > 0$. For all feasible $\{x_{ij}\}$, the optimal objective of the dual program with respect to $\{x_{ij}\}$, denoted by $\{y'_{ij}, z'_{ij}, \lambda'\}$, must be equal to zero. Thus, we obtain: $\sum_{i \neq j} x_{ij}(y^*_{ij} - z^*_{ij}) + \lambda^* \le \sum_{i \neq j} x_{ij}(y'_{ij} - z'_{ij}) + \lambda' = 0$. □

Based on Proposition 3, the new linear constraint to add to problem (7–13) is $\sum_{i \neq j} x_{ij}(y^*_{ij} - z^*_{ij}) + \lambda^* \le 0$, which refines the search space for compact strategies $\{x_{ij}\}$. We now aim at solving (22–25) to find the cutting plane. This dual program involves an exponential number of constraints, since there is an exponential number of pure strategies for the defender (Constraint 23). Therefore, we propose to use an incremental constraint generation approach (i.e., column generation). That is, we solve the relaxed dual program with respect to a small subset of pure defense strategies in $S$. We then gradually add new constraints until we obtain the optimal solution (i.e., no violated constraint is found). The main part of this approach is to find a maximally violated constraint given the current optimal solution $\{y^*_{ij}, z^*_{ij}, \lambda^*\}$ of the relaxed problem (22–25). This is equivalent to finding a pure strategy $s \in S$ such that $\sum_{i,j \in s} y^*_{ij} - \sum_{i,j \in s} z^*_{ij} + \lambda^*$ is maximum. Intuitively, we want to find $s$ such that constraint (23) is violated the most, which can be represented as the following MILP:

$$\max \ \sum_{i,j} y^*_{ij} h_{ij} - \sum_{i,j} z^*_{ij} h_{ij} + \lambda^*$$
$$\text{s.t.} \quad h_{ij} \le h_i, \ h_{ij} \le h_j, \qquad h_i \in \{0,1\}, \qquad \sum_i h_i = K$$

where $h_i$ is a binary variable which indicates whether $s$ covers target $i$ ($h_i = 1$) or not ($h_i = 0$), and $h_{i,j}$ is a binary variable which indicates whether $s$ covers both $i$ and $j$.
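The separation step can be prototyped directly from the MILP above; the PuLP sketch below does so under placeholder dual values $(y^*, z^*, \lambda^*)$, mirroring the paper's formulation with constraints $h_{ij} \le h_i$, $h_{ij} \le h_j$ and $\sum_i h_i = K$.

```python
# Hedged sketch: find the most-violated dual constraint (column generation step).
import pulp

n, K = 5, 2
y_star = {(i, j): 0.1 * (i + j) for i in range(n) for j in range(n) if i != j}
z_star = {(i, j): 0.05 for i in range(n) for j in range(n) if i != j}
lam_star = -0.3                                    # placeholder dual values

prob = pulp.LpProblem("max_violated_constraint", pulp.LpMaximize)
h = [pulp.LpVariable(f"h_{i}", cat="Binary") for i in range(n)]
hij = {(i, j): pulp.LpVariable(f"h_{i}_{j}", cat="Binary")
       for i in range(n) for j in range(n) if i != j}

prob += pulp.lpSum((y_star[i, j] - z_star[i, j]) * hij[i, j]
                   for (i, j) in hij) + lam_star
prob += pulp.lpSum(h) == K                         # the strategy covers exactly K targets
for (i, j), var in hij.items():
    prob += var <= h[i]                            # h_ij can be 1 only if i is covered
    prob += var <= h[j]                            # ...and only if j is covered

prob.solve(pulp.PULP_CBC_CMD(msg=False))
s = [i for i in range(n) if pulp.value(h[i]) > 0.5]
print("most violated pure strategy:", s, "violation:", pulp.value(prob.objective))
```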

7 Experimental Evaluation

We evaluate the solution quality and runtime of our algorithms on games generated using GAMUT⁴. All our experiments were run on a 2.8 GHz Intel Xeon E5-2680v2 processor with 256 GB of RAM, using CPLEX 12.8 for solving LPs/MILPs. We set the covariance value $r \in [0.0, 1.0]$ in GAMUT with step size $\lambda = 0.2$ to control the correlation between attacker and defender payoffs. All results are averaged over 120 instances (20 games per covariance value). All comparison results with our algorithms are statistically significant under bootstrap-t ($\alpha = 0.05$).

⁴ See http://gamut.stanford.edu/.


Algorithms and Baselines. We show simulation results for two algorithms: (i) sequential attacks with resource movement (URM); and (ii) sequential attacks with no resource movement (NRM). In addition, we use siSSE as a baseline in order to show that the defender can suffer arbitrary losses if he does not take sequential attacks into account in his resource allocation problem. We test all our algorithms and the baseline against a sequential-move attacker. Figures 1a and b show solution qualities (i.e., expected game utilities) in the resource-movement setting for the defender and attacker (respectively), whereas Figs. 2a, 3a and 2b, 3b show the same solution qualities in the no-resource-movement setting. The x-axis in Figs. 1 and 2 shows an increasing number of targets. The x-axis in Fig. 3 shows an increasing number of resources. The y-axis in all these figures shows the expected utility for the defender and attacker.

Scaling up the Number of Targets. Figures 1a and 2a show that against sequential attacks, the expected defender utility achieved by our algorithms (i.e., URM and NRM) is significantly higher than that achieved by siSSE in both the resource-movement and no-resource-movement settings. On the other hand, Figs. 1b and 2b show that the attacker achieves significantly lower utility against our algorithms compared to siSSE. This shows the importance of taking sequential attacks into account in the defender's optimization problem, and shows that our algorithms successfully outperform the baseline.

Fig. 1. Difference in the utility of the players with increasing number of targets in the resource-movement setting: (a) defender utility; (b) attacker utility (expected utility vs. number of targets, URM vs. siSSE).

Scaling up the Number of Resources. Next, we show how the solution quality of our algorithms varies with an increasing number of resources. The number of targets is set to 10. Figures 3a and b show that in the no-resource-movement setting, the defender solution quality increases with an increasing number of security resources, whereas the attacker solution quality decreases.

Runtime Results. Figures 4a and b show the runtime of our algorithms with an increasing number of targets and resources (respectively). The x-axis shows the increasing number of targets (resources), and the y-axis shows the runtime (in seconds). These figures show that our NRM algorithm runs significantly slower than URM.

Fig. 2. Difference in the utility of the players with increasing number of targets in the no-resource-movement setting: (a) defender utility; (b) attacker utility (expected utility vs. number of targets, NRM vs. siSSE).

Fig. 3. Difference in the utility of the players with increasing number of resources in the no-resource-movement setting: (a) defender utility; (b) attacker utility (expected utility vs. number of resources, NRM vs. siSSE).

This makes sense, since the NRM algorithm needs to solve multiple MILPs and LPs to find cutting planes until it reaches the optimal solution, which is time consuming compared to the URM algorithm. As expected, the siSSE algorithm runs quicker than both our algorithms, but as shown in Figs. 1, 2 and 3, it performs significantly worse than our algorithms in terms of solution quality. This establishes the superiority of our algorithms' performance over the state-of-the-art baseline in tackling sequential attacks. Finally, we analyze the impact of limiting the maximum number of cutting planes that the NRM algorithm can generate on its runtime and solution quality. Figures 5a and b show the variation in solution quality and runtime (respectively) of the NRM algorithm with increasing limits on the number of cutting planes that can be added to the MILP. The number of targets is fixed to 10. The x-axis shows the increasing number of cutting planes and the y-axis shows the solution quality (and runtime). These figures show that beyond three cutting planes, the solution quality of NRM shows diminishing returns, whereas the running time of the algorithm increases at a roughly linear rate with the number of cutting planes. This suggests that in practice, NRM can be run with a limit on the number of cutting planes, beyond which there are only marginal increases in solution quality.

Fig. 4. Comparison of the computation time: (a) scaling up the number of targets; (b) scaling up the number of resources (runtime in seconds, NRM vs. URM vs. siSSE).

3 4 5 Number of Cutting Planes

(a) Solution Quality

6

160 140 120 100 80 60 40 20 0

NRM MILP 1

2

3 4 5 Number of Cutting Planes

6

(b) Runtime

Fig. 5. The impact of cutting planes on solution quality and runtime performance in the no-resource-movement setting

8 Summary

This paper studies the security problem in which the attacker can attack multiple targets in a sequential manner. We introduce a new sequential-attack game model (built upon the Stackelberg game model), which incorporates real-time observations, the behavior of sequential attacks, and the strategic plans of non-myopic players. We then propose practical game-theoretic algorithms for computing an equilibrium in different game settings. Our new algorithms exploit intrinsic properties of the equilibrium to derive compact representations of both the state-history and strategy spaces of the players (which are exponential in size in the original representations). Finally, our computational experiments show that the defender suffers a significant loss, and the attacker obtains a significant benefit, if the defender does not address sequential attacks. By taking sequential attacks into account, this loss and benefit are reduced drastically.

Acknowledgment. This work was supported in part by the Czech Science Foundation (no. 19-24384Y).


A Framework for Joint Attack Detection and Control Under False Data Injection

Luyao Niu and Andrew Clark

Worcester Polytechnic Institute, Worcester, MA 01609, USA
{lniu,aclark}@wpi.edu

Abstract. In this work, we consider an LTI system with a Kalman filter, detector, and Linear Quadratic Gaussian (LQG) controller under false data injection attack. The interaction between the controller and adversary is captured by a Stackelberg game, in which the controller is the leader and the adversary is the follower. We propose a framework under which the system chooses time-varying detection thresholds to reduce the effectiveness of the attack and enhance the control performance. We model the impact of the detector as a switching signal, resulting in a switched linear system. A closed-form solution for the optimal attack is first computed using the proposed framework, as the best response to any detection threshold. We then present a convex program to compute the optimal detection threshold. Our approach is evaluated using a numerical case study.

Keywords: False data injection attacks · Control system · Detection threshold · LQG control · K-L divergence · Stealthiness

1 Introduction

Distributed sensors provide control systems with rich data. However, open and insecure communication between the sensors and the plant exposes the system to threats from false data injection attacks. Control systems are vulnerable to false data injection attacks for the following reasons. Sensors might be physically unprotected and hence vulnerable to attacks. Compared to directly attacking the plant, the adversary incurs very low cost when attacking the sensors. Moreover, the false measurements can bias the decisions of the controller and hence degrade the system performance [3] and cause safety risks [17]. The main challenge of mitigating false data injection attacks initiated by intelligent adversaries is that such attacks are fundamentally different from stochastic disturbances, whose distributions are typically assumed to be given and independent of the control policy [3]. The adversary, however, is strategic, and hence its attack action will be tailored to the estimation and control policies that are used by the targeted system.

This work was supported by NSF grant CNS-1656981.


Due to the adversary's strategic response, designing a detection and control mechanism with fixed parameters could result in a degradation of control performance. An alternative approach is to develop a time-varying detection and randomized measurement selection strategy in order to increase the uncertainty of the adversary and thus reduce the impact of the attack. This approach is in the spirit of moving target defense [20], which has recently been proposed for control and cyber-physical systems. To the best of our knowledge, however, such detection strategies have not been proposed in the LQG setting.

In this paper, we focus on a system equipped with a Kalman filter, a detector, and an LQG controller under false data injection attacks. We adopt the Stackelberg setting to capture the interplay between the controller and adversary. The adversary aims at degrading the LQG control performance by introducing false measurements to a subset of sensors while remaining stealthy. The set of possibly compromised sensors is known to the controller, since some sensors (e.g., GPS signals [17]) are easier to tamper with, while others are more difficult to manipulate (e.g., an inertial measurement unit serves as a backup against GPS spoofing attacks [14]). The controller computes a detection threshold at each time step to minimize the LQG cost function. Given the time-varying threshold, the controller computes the control law at each time step by randomly using either all measurements or the measurements from the secured sensors, to eliminate the impact of the false data injected by the adversary. The proposed framework jointly models the attack, detection, and LQG control, and consequently improves the system's resilience. We make the following specific contributions:

– We model the interaction between the system and adversary using a zero-sum Stackelberg game, in which the controller is the leader and the adversary is the follower. A switched linear system is used to model the system behavior, where switches between modes occur due to attack detection.
– We formulate a convex optimization problem for the attacker to compute the optimal attack sequence. By solving the convex program, we show that the optimal attack for the attacker is a zero-mean Gaussian noise. We derive a closed-form solution for the covariance matrix of the optimal attack.
– We generalize our analysis of the optimal attack from the single-stage case to the multi-stage case. We show that the optimal attack sequence is a sequence of zero-mean Gaussian noises and give the closed-form solution for the optimal covariance at each time step. We formulate a convex optimization problem to compute the optimal detection thresholds for the controller.
– A numerical case study is used to evaluate the proposed approach. The results show that the proposed approach outperforms an attack detector designed with fixed parameters.

The remainder of this paper is organized as follows. Section 2 presents related work. Section 3 gives the system model and problem formulation. Section 4 presents the proposed solution. Section 5 contains an illustrative case study. Section 6 concludes the paper.

2 Related Work

False data injection attacks have been extensively analyzed from the adversary's perspective in the existing literature. False data injection attacks against networked control systems equipped with Kalman filters and against power systems are analyzed in [12] and [10], respectively. In [7], the worst-case stealthy false data injection attack strategy against a Kullback-Leibler (K-L) divergence based detector is proposed. In this work, we consider the interaction between the detector design and the adversary. Hence we not only give the optimal attack strategy, but also present a game-theoretic approach to analyze how to design a resilient detector to counter false data injection attacks, which is absent in [7].

Resilient control in adversarial environments has been extensively studied. Robust control and secure state estimation against false data injection attacks have been studied in [4,5,13,15] and references therein. In this work, we focus on the detection of false data injection attacks in the LQG setting. One alternative approach to thwart false data injection initiated by an adversary that is knowledgeable in the system model and the detection and control strategies is to limit its information by committing to a time-varying detection and control mechanism. A randomized detection threshold for K-L divergence based detectors is proposed in [9]. While [9] focuses on minimizing estimation error, in this work we fill the gap between LQG control in adversarial environments and the detection strategy under false data injection attack. Moving target defense has been applied in the literature to limit the adversary's knowledge of the system model, e.g., the system dynamics [20]. The idea of [20] is to change the system dynamics randomly to limit the knowledge of the adversary, while this work aims at designing a time-varying detection threshold with fixed system dynamics. Moreover, the metric in this work is the LQG cost function, while contributions [19,20] focus on information metrics (e.g., the Fisher information matrix) from the adversary's perspective. A resilient LQG control under false data injection attacks has been proposed for LTI systems in [4]. In [4], a resilient control strategy is proposed so that the worst-case damage introduced by the adversary is limited. However, no detection mechanism is considered in [4].

In addition to control-theoretic approaches, game theory has also been used to study the interaction between the system and adversary [1,11,21]. These models consider the Nash setting; in this paper we consider a Stackelberg setting, which is applicable to a variety of CPS domains [16]. Contribution [22] focuses on picking a pre-designed detector from a configuration library. In this work, we investigate the problem of jointly modeling the attack, detection, and LQG control performance. The Stackelberg setting is adopted in [6,18] to compute a detector tuning. A fixed detection threshold is considered in [18], while a time-varying threshold is considered in this work. While an exhaustive-search-based approach is given in [6] to select an adaptive detection threshold, we consider the LQG control performance in this work and show that the time-varying detection threshold can be obtained by solving a convex program.

3 System Model and Problem Formulation

In this section, we present the system model and problem formulation.

3.1 System Model

Consider a discrete-time LTI system with time index $k = 1, 2, \cdots$, as follows:

$$x_{k+1} = A x_k + B u_k + w_k, \qquad y_k = C x_k + v_k$$

where $x_k \in \mathbb{R}^n$ is the system state, $u_k \in \mathbb{R}^p$ is the input, $y_k \in \mathbb{R}^m$ is the output, and $w_k \in \mathbb{R}^n$ and $v_k \in \mathbb{R}^m$ are i.i.d. stochastic disturbances with distributions $w_k \sim N(0, \Sigma_w)$ and $v_k \sim N(0, \Sigma_v)$, respectively. Matrices $A$, $B$, and $C$ have appropriate dimensions. The initial state $x_0$ is assumed to follow a distribution $x_0 \sim N(0, \Sigma_x)$. The disturbances $w_k$ and $v_k$ are assumed to be independent of each other and independent of the historical values of $w$, $v$, $u$, and $y$.

The state estimate $\hat{x}_k$ is computed using a Kalman filter [8] as $\hat{x}_{0|-1} = 0$, $\hat{x}_{k+1|k} = A\hat{x}_k + Bu_k$, $P_{0|-1} = \Sigma_x$, $P_{k+1|k} = A P_k A^T + \Sigma_w$, $K_k = P_{k|k-1} C^T (C P_{k|k-1} C^T + \Sigma_v)^{-1}$, $\hat{x}_k = \hat{x}_{k|k-1} + K_k (y_k - C\hat{x}_{k|k-1})$, and $P_k = P_{k|k-1} - K_k C P_{k|k-1}$. The Kalman filter is assumed to be in steady state, and the error covariance and Kalman gain are hence represented as $P = \lim_{k\to\infty} P_{k|k-1}$, $K = P C^T (C P C^T + \Sigma_v)^{-1}$ [8]. Denote the residue at each time step $k$ as $z_{k+1} = y_{k+1} - C(A\hat{x}_k + Bu_k)$. Then the state estimate can be rewritten as $\hat{x}_{k+1} = A\hat{x}_k + Bu_k + K[y_{k+1} - C(A\hat{x}_k + Bu_k)]$.
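For concreteness, the NumPy sketch below iterates the Riccati recursion above to obtain the steady-state error covariance $P$ and Kalman gain $K$; the system matrices are illustrative, not the paper's.

```python
# Sketch: steady-state Kalman quantities via the Riccati recursion.
import numpy as np

def steady_state_kalman(A, C, Sigma_w, Sigma_v, iters=500):
    P_pred = Sigma_w.copy()                        # P_{k|k-1}
    for _ in range(iters):
        S = C @ P_pred @ C.T + Sigma_v
        K = P_pred @ C.T @ np.linalg.inv(S)
        P_upd = P_pred - K @ C @ P_pred            # P_k
        P_pred = A @ P_upd @ A.T + Sigma_w         # P_{k+1|k}
    return P_pred, P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + Sigma_v)

A = np.array([[1.0, 0.0], [1.0, 1.0]])
C = np.eye(2)
Sigma_w = 0.1 * np.eye(2)
Sigma_v = 0.2 * np.eye(2)
P, K = steady_state_kalman(A, C, Sigma_w, Sigma_v)
print(P, K, sep="\n")
```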

Adversary Model

We consider an intelligent adversary that can corrupt a subset of sensors by injecting false measurements. The injected false measurements provide the sys˜ k at each time k and hence misleads the controller. At each tem biased outputs y time step k, the measurements perceived by the system can be characterized as ˜ k = Cxk + vk + ak , where ak is an arbitrary measurement injected follows: y by the adversary. The residue under false data injection attack is computed as ˜ k+1 − C(Aˆ ˜k+1 = y xk + Buk ). z The adversary can perform false data injection attacks on a certain set of sensors Υ . Thus the support of injected false measurements supp(ak ) ⊆ Υ . We assume that Υ is fixed and known to both the system and adversary. The reason that the adversary can only corrupt the measurements of sensors in Υ is that the adversary might only be co-located with a subset of sensors, and it is only capable of corrupting the sensors that are exposed and unattended (e.g., distributed sensors in networked control system [12,17]). For instance, in [14], inertial measurements are used as a secure backup in the event of a GPS spoofing attack. For notation simplicity, we denote the set of sensors Υ as compromised sensors and the set of sensors outside Υ as the secured sensors. Denote the information available to the adversary at time k as IkA . Then the information set is represented as IkA = {u0 , · · · , uk } ∪ {y0A , · · · , ykA } ∪

356

L. Niu and A. Clark

{x0 , · · · , xk }, where ykA is the measurement from the compromised sensors. The information IkA captures the worst-case adversary model. Denote the set of all possible information set available to the adversary as IkA . An attack policy for the adversary τk : IkA → Rsupp(ak ) is a function mapping the set of possible information to the set of false measurements at time k. Let τ = {τk : k = 0, 1, · · · , } be the sequence of attack policies over time. 3.3

Controller Model

Assume the matrices A, B, C, Σw and Σv are known to the system and adversary. Denote the information available to the system at each time step k as Ik . The system knows the control inputs up to time k and the outputs up to time k. Therefore, the information set Ik is represented as Ik = ˜1, · · · , y ˜ k }. Denote the set of all possible Ik at time k y0 , y {u0 , u1 , · · · , uk } ∪ {˜ as Ik . ˆ k to minimize a cost funcThe system implements LQG control uk = −Lk x tion in quadratic form as follows: N    J =E xTk Qk xk + uTk Rk uk , (1) k=0

where Qk and Rk are symmetric positive definite matrices for all k, respectively, ˆ k is determined by the and Lk is the controller gain. The state estimation x measurements that the system uses. Based on the detection result, which is further jointly determined by the system’s strategy and adversary’s strategy, the system decides if it will only consider the measurements from secured sensors or it will consider measurements from all sensors. Taking the detection result as a switching signal, we model the system’s choices over sensors by formulating the following switched linear system: xk+1 = Axk + Buk + wk , yk = Cθk xk + vk , where θk ∈ Θ = {0 , 1 } is the mode index defined as θk = 0 if no alarm is triggered from the detector and θk = 1 if an alarm is triggered from the detector. Matrix Cθk models the selection of the sensor measurements for each time step k. Let C[i] denote the ith row of matrix C. Then for each row i and time k, matrix Cθk is defined as follows: Cθk [i] = 0n if θk = 0 and Cθk [i] = C[i] if θk = 1 , with 0n being zero vector of length n. Using matrix Cθk , the jump between modes of the switched linear system captures the system’s choice over sensors. If no alarm is sounded, then the output yk is computed using the measurements from all sensors. In this mode, the control performance could be potentially degraded since the adversary can inject false measurements and bias the system’s control decision. If an alarm is triggered by the detector, then the output yk is computed using the measurements obtained from the subset of sensors that are secured. In this mode, although the false measurements injected by the adversary are eliminated, system performance would degrade under benign environment since ˆ might be inaccurate when only using measurements from the state estimation x a subset of sensors. Thus, the system needs to carefully design its detection threshold and henceforth determine the mode θk at each time step.


Let $D(\tilde{z}_k \| z_k) = \int f_{\tilde{z}}(t) \log \frac{f_{\tilde{z}}(t)}{f_z(t)}\, dt$ be the K-L divergence between the compromised residue $\tilde{z}_k$ and the residue $z_k$ [9,21]. The K-L divergence represents how the realized residue under attack differs from the expected residue without attack. From the adversary's perspective, the K-L divergence should be small in order to fool the controller, who cannot distinguish the deviation caused by the measurement noise $v_k$ from that caused by the attack $a_k$. Given a detection threshold $\gamma_k$ at time step $k$, an alarm is triggered by the detector if $D(\tilde{z}_k \| z_k) > \gamma_k$, and correspondingly the operation mode is $\theta_k = 1$. Otherwise no alarm is triggered and $\theta_k = 0$. Thus the mode at each time $k$ is determined as $\theta_k = 0$ if $D(\tilde{z}_k \| z_k) \le \gamma_k$ and $\theta_k = 1$ if $D(\tilde{z}_k \| z_k) > \gamma_k$. To implement the detector, we need to evaluate the K-L divergence $D(\tilde{z}_k \| z_k)$, which requires the probability distributions of the residue $z_k$ of the legitimate system and of the residue $\tilde{z}_k$ under false data injection attack. The distribution of the residue $z_k$ is identical to that of the additive noise $v_k$, i.e., $z_k \sim N(0, \Sigma_v)$. The probability distribution of the compromised residue $\tilde{z}_k$ can be evaluated numerically by observing the historical residue [2]. While the residues of the sensors under attack have an unknown distribution, we can leverage Theorem 1 in Sect. 4, which states that the optimal attack strategy is a zero-mean ergodic Gaussian random process. Hence, without loss of generality (since we assume that the adversary chooses the optimal strategy), we assume that the residue sequence $\{\tilde{z}_k\}$ is ergodic and the K-L divergence can be computed using known detection algorithms [2].

A deterministic control policy $\mu_k : \mathcal{I}_k \to \mathbb{R}$ at each time step $k$ is a function mapping the set of information $\mathcal{I}_k$ to a detection threshold $\gamma_k$. Let $\mu = \{\mu_k : k = 0, 1, \cdots\}$ be the sequence of control policies over time. Then the objective of the system is to compute a sequence of control policies $\mu$ such that the cost function (1) is minimized.
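Assuming, as established later in Sect. 4, that both residues are zero-mean Gaussian ($z_k \sim N(0, \Sigma_v)$ and $\tilde{z}_k$ with covariance $\Lambda$), the detector rule above can be sketched as follows; the covariance values are illustrative placeholders.

```python
# Hedged sketch: K-L divergence detector for zero-mean Gaussian residues.
import numpy as np

def kl_gaussian_residue(Lambda, Sigma_v):
    m = Sigma_v.shape[0]
    _, logdet_v = np.linalg.slogdet(Sigma_v)
    _, logdet_l = np.linalg.slogdet(Lambda)
    # 0.5 * [tr(Sigma_v^{-1} Lambda) - m + log(det(Sigma_v)/det(Lambda))]
    return 0.5 * (np.trace(np.linalg.solve(Sigma_v, Lambda)) - m + logdet_v - logdet_l)

def mode(Lambda, Sigma_v, gamma):
    return 0 if kl_gaussian_residue(Lambda, Sigma_v) <= gamma else 1   # 0: no alarm

Sigma_v = 0.2 * np.eye(2)
Lambda_hat = np.array([[0.5, 0.1], [0.1, 0.3]])   # empirical residue covariance (placeholder)
print(kl_gaussian_residue(Lambda_hat, Sigma_v), mode(Lambda_hat, Sigma_v, gamma=1.0))
```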

3.4 Problem Formulation

The problem we investigate is formulated as $\min_\mu \max_\tau E\big[\sum_{k=0}^{N} (x_k^T Q_k x_k + u_k^T R_k u_k)\big]$, where the expectation is over $w_k$, $v_k$, $\theta_k$, and $x_0$. The formulation can be interpreted as a two-player zero-sum Stackelberg game, in which the system computes a sequence of detection thresholds, and the adversary chooses the set of false measurements to inject. In the following section, we solve the problem by computing the Stackelberg equilibrium.

4 Solution Approach

In this section, we give the proposed solution approach. First, we rewrite the system dynamics for the state vector $\bar{x}_k = [x_k, \hat{x}_k]^T \in \mathbb{R}^{2n}$ as $\bar{x}_{k+1} = \bar{A}_k \bar{x}_k + W_k$, where

$$\bar{A}_k = \begin{bmatrix} A & -BL_k \\ KC_{\theta_{k+1}}A & A - BL_k - KC_{\theta_{k+1}}A \end{bmatrix}, \qquad W_k = \begin{bmatrix} w_k \\ KC_{\theta_{k+1}}w_k + Kv_{k+1} + K(1 - I_{\theta_{k+1}})a_{k+1} \end{bmatrix},$$

where the indicator function satisfies $I_{\theta_{k+1}} = 0$ when $\theta_{k+1} = 0$ and $I_{\theta_{k+1}} = 1$ when $\theta_{k+1} = 1$. Denote the matrices $\bar{A}_k$ and $W_k$ when $\theta_{k+1} = \theta$ as $\bar{A}^\theta_k$ and $W^\theta_k$, respectively. Let $\Sigma_k = E\{\bar{x}_k \bar{x}_k^T\}$. Given $u_k = -L_k\hat{x}_k$, the cost function (1) can be rewritten as

$$J = E\Big[\sum_{k=0}^{N} \bar{x}_k^T H_k \bar{x}_k\Big] = \sum_{k=0}^{N} \mathrm{tr}(H_k \Sigma_k),$$

where $H_k = \begin{bmatrix} Q_k & 0 \\ 0 & L_k^T R_k L_k \end{bmatrix}$ and $\mathrm{tr}(\cdot)$ is the trace operator. The evolution of the matrix $\Sigma_k$ is given by

$$\Sigma_{k+1} = E\big[(\bar{A}_k \bar{x}_k + W_k)(\bar{A}_k \bar{x}_k + W_k)^T\big] = G(\Sigma_k) + \bar{W}_k + F(\Lambda_{k+1}),$$

where

$$G(\Sigma_k) = E\{\bar{A}_k \bar{x}_k \bar{x}_k^T \bar{A}_k^T\} = p^0_k\big(q^{01}_{k+1}\bar{A}^1_k\Sigma_k(\bar{A}^1_k)^T + q^{00}_{k+1}\bar{A}^0_k\Sigma_k(\bar{A}^0_k)^T\big) + p^1_k\big(q^{11}_{k+1}\bar{A}^1_k\Sigma_k(\bar{A}^1_k)^T + q^{10}_{k+1}\bar{A}^0_k\Sigma_k(\bar{A}^0_k)^T\big),$$
$$\bar{W}_k = E\{W_k W_k^T\} = p^0_k\big(q^{01}_{k+1}W^1 + q^{00}_{k+1}W^0\big) + p^1_k\big(q^{11}_{k+1}W^1 + q^{10}_{k+1}W^0\big), \qquad F(\Lambda_{k+1}) = \big(p^0_k q^{00}_{k+1} + p^1_k q^{10}_{k+1}\big)\begin{bmatrix} 0 & 0 \\ 0 & K\Lambda_{k+1}K^T \end{bmatrix},$$
$$W^1 = \begin{bmatrix} \Sigma_w & (P^1)^T \\ P^1 & \bar{P}^1 \end{bmatrix}, \qquad W^0 = \begin{bmatrix} \Sigma_w & (P^0)^T \\ P^0 & \bar{P}^0 \end{bmatrix},$$

$P^1 = KC^1\Sigma_w(C^1)^TK^T$, $\bar{P}^1 = P^1 + K\Sigma_v K^T$, $P^0 = KC^0\Sigma_w(C^0)^TK^T$, $\bar{P}^0 = P^0 + K\Sigma_v K^T$, $p^\theta_k$ is the probability of the system being at mode $\theta$ at time $k$, $q^{\theta\theta'}_{k+1}$ is the transition probability from mode $\theta$ to mode $\theta'$, and $\Lambda_{k+1}$ is the covariance matrix of the injected false measurement $a_{k+1}$.

In the following, we derive the optimal attack strategy and the controller's strategy. At time step $k'$, the adversary solves the following problem:

ak :N

subject to

N 

tr (Hk Σk )

(2a)

k=k

¯ k + F (Λk ), ∀k = k  , · · · , N Σk = G(Σk−1 ) + W D(˜ zk ||zk ) ≤ γk , ∀k = k  , · · · , N

(2b) (2c)

Σk 0, ∀k = k  , · · · , N

(2d)

The objective of the adversary is to maximize the cost function $J$. Constraint (2b) models the evolution of the matrix $\Sigma$. Constraint (2c) requires the adversary to design its attack signal such that the system stays in mode 0, so that the injected false measurements can bias the system. Constraint (2d) guarantees that the covariance matrix $\Sigma$ is well defined. Substituting constraint (2b) into the objective function (2a), we observe that the adversary maximizes the cost $J$ in two ways: (i) increasing the probability of being at mode 0, and (ii) increasing the covariance matrix $\Lambda$. The following theorem characterizes the optimal attack [7].

Theorem 1. The optimal attacks $a^*_{0:N} = [a^*_0, \cdots, a^*_N]^T$ are zero-mean and Gaussian.

Given Theorem 1, the K-L divergence $D(\tilde{z} \| z)$ can be represented as $D(\tilde{z} \| z) = \frac{1}{2}\big[\mathrm{tr}(\Sigma_v^{-1}\Lambda) - m + \log(\det(\Sigma_v)/\det(\Lambda))\big]$. Assume the mode switching reaches a stationary probability distribution. Then we simply denote the probabilities of being at mode 0 and mode 1 at time $k$ as $p^0_k$ and $p^1_k$, respectively. Substituting constraint (2b) into (2a), we can convert problem (2) into the convex program

$$\max_{\Lambda_{k':N}} \quad p^0_{k'}\,\mathrm{tr}\big(H_{k'}\bar{\Lambda}_{k'}\big) + p^0_{k'+1}p^0_{k'}\,\mathrm{tr}\big(H_{k'+1}\bar{A}^0_{k'}\bar{\Lambda}_{k'}(\bar{A}^0_{k'})^T\big) + p^1_{k'+1}p^1_{k'}\,\mathrm{tr}\big(H_{k'+1}\bar{A}^1_{k'}\bar{\Lambda}_{k'}(\bar{A}^1_{k'})^T\big) + J_{k'} \qquad \text{(3a)}$$
$$\text{subject to} \quad \frac{1}{2}\Big[\mathrm{tr}\big(\Sigma_v^{-1}\Lambda_k\big) - m + \log\frac{\det(\Sigma_v)}{\det(\Lambda_k)}\Big] \le \gamma_k, \ \forall k = k', \cdots, N \qquad \text{(3b)}$$
$$\qquad \Lambda_k \succeq 0, \ \forall k = k', \cdots, N \qquad \text{(3c)}$$

where $J_{k'}$ contains the terms that are independent of $\Lambda_{k'}$. Solving optimization problem (3), the covariance of the optimal attack at time $k$ is characterized by the following theorem.

Theorem 2. The covariance $\Lambda^*_k$ of the optimal attack $a^*_k$ is computed as

$$\Lambda^*_k = \Big(-\frac{1}{\beta_k}\Phi_k + \Sigma_v^{-1}\Big)^{-1}, \qquad (4)$$

where $A^0 = A - BL_k - KC^0A$, $A^1 = A - BL_k - KC^1A$,

$$\Phi_k = 2p^0_k K^T L_k^T R_k L_k K + 2p^0_{k+1}p^0_k K^T\big(L_k^T B^T Q_{k+1} B L_k + (A^0)^T L_{k+1}^T R_{k+1} L_{k+1} A^0\big)K + 2p^1_{k+1}p^1_k K^T\big(L_k^T B^T Q_{k+1} B L_k + (A^1)^T L_{k+1}^T R_{k+1} L_{k+1} A^1\big)K,$$

$p^\theta_k$ is the probability of the system being in mode $\theta$ at time $k$, and $\beta_k$ satisfies

$$\sum_i \Big[\frac{\beta_k}{\beta_k - \Lambda_{k,i}} + \log\frac{\beta_k - \Lambda_{k,i}}{\beta_k}\Big] = m + 2\gamma_k, \qquad (5)$$

where the $\Lambda_{k,i}$ are the eigenvalues of $\Phi_k\Sigma_v$.

Before proving Theorem 2, we first present a preliminary proposition.

Proposition 1. Let $\{\Lambda_i : \Lambda_i \ge 0\}$ be a set of non-negative real numbers sorted in descending order. Consider the function $g(\beta) : (-\infty, 0) \cup (\Lambda_1, +\infty) \to (0, +\infty)$ defined as $g(\beta) = \sum_i \big[\frac{\beta}{\beta - \Lambda_i} + \log\frac{\beta - \Lambda_i}{\beta}\big]$. Then given any positive number $\bar{g}$, there exists some $\beta > 0$ such that $g(\beta) = \bar{g}$.

Proof. Proposition 1 follows from the fact that the function $g(\beta)$ is continuous and monotonically decreasing with respect to $\beta$. □

Proof (of Theorem 2). We prove the result by backward induction. First, (4) and (5) hold for time $k = N$ since $H_{k+1} = 0$. Next, suppose (4) and (5) hold up to time $k' + 1$. We then induct one time step backwards and prove that (4) and (5) hold at time $k'$ by verifying the KKT conditions of (3) at time $k'$.


We start with the stationarity condition. The Lagrangian is represented as

$$\mathcal{L}_{k'} = -p^0_{k'}p^0_{k'+1}\,\mathrm{tr}\big(H_{k'+1}\bar{A}^0_{k'}\bar{\Lambda}_{k'}(\bar{A}^0_{k'})^T\big) - p^0_{k'}p^1_{k'+1}\,\mathrm{tr}\big(H_{k'+1}\bar{A}^1_{k'}\bar{\Lambda}_{k'}(\bar{A}^1_{k'})^T\big) - p^0_{k'}\,\mathrm{tr}\big(H_{k'}\bar{\Lambda}_{k'}\big) + J_{k'} + \frac{\beta_{k'}}{2}\Big[\mathrm{tr}\big(\Sigma_v^{-1}\Lambda_{k'}\big) - m + \log\frac{\det(\Sigma_v)}{\det(\Lambda_{k'})} - 2\gamma_{k'}\Big],$$

where $\beta_{k'}$ is the Lagrange multiplier associated with the constraint $D(\tilde{z}_{k'} \| z_{k'}) \le \gamma_{k'}$. Taking the partial derivative of $\mathcal{L}_{k'}$ with respect to $\Lambda_{k'}$ and setting it to zero, we obtain $\Phi_{k'} + \beta_{k'}\Sigma_v^{-1} - \beta_{k'}\Lambda_{k'}^{-1} = 0$. When $-\frac{1}{\beta_{k'}}\Phi_{k'} + \Sigma_v^{-1}$ is positive definite, (4) holds. By (4), $\Lambda^*_{k'}$ is symmetric. Moreover, $\Lambda^*_{k'}$ can be rewritten as $\big(-\Phi_{k'}/\beta_{k'} + \Sigma_v^{-1}\big)^{-1} = \Sigma_v\big(I - \Phi_{k'}\Sigma_v/\beta_{k'}\big)^{-1}$, which is a product of two positive definite matrices. Hence, $\Lambda^*_{k'}$ defined in (4) is positive definite, implying that primal feasibility defined by (3c) is satisfied.

We then verify dual feasibility $\beta_{k'} \ge 0$. First, we show that $\beta_{k'} \neq 0$. Suppose $\beta_{k'} = 0$. Then the derivative of the Lagrangian implies $-p^0 K^T L^T R L K = 0$. Since $K^T L^T R L K \succ 0$, we must have $p^0 = 0$. Therefore the system stays in mode 1 forever, implying that constraint (2c) is violated. By Proposition 1, given any $\gamma_{k'} \ge 0$, there exists a unique $\beta_{k'} > 0$ such that (5) is satisfied. Hence, (5) guarantees dual feasibility $\beta \ge 0$.

We finally verify primal feasibility and complementary slackness. We take the partial derivative of $\mathcal{L}_{k'}$ with respect to $\beta_{k'}$ and set it to zero. Then we have $\mathrm{tr}\big(\Sigma_v^{-1}\Lambda_{k'}\big) + \log\det(\Sigma_v) - \log\det(\Lambda_{k'}) = m + 2\gamma_{k'}$. Substituting (4) into this equation, we have that (5) holds. □
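A hedged numerical sketch of Theorem 2: given $\Phi_k$, $\Sigma_v$, and $\gamma_k$ (placeholders below, not values from the paper), $\beta_k$ can be found by bisection using the monotonicity in Proposition 1, and $\Lambda^*_k$ then follows from Eq. (4).

```python
# Sketch: optimal attack covariance via Eq. (4) with beta_k from Eq. (5).
import numpy as np

def optimal_attack_covariance(Phi, Sigma_v, gamma, tol=1e-10):
    m = Sigma_v.shape[0]
    eigs = np.linalg.eigvals(Phi @ Sigma_v).real      # the Lambda_{k,i} in Eq. (5)
    lam_max = max(eigs.max(), 0.0)

    def g(beta):                                      # left-hand side of Eq. (5)
        return sum(beta / (beta - e) + np.log((beta - e) / beta) for e in eigs)

    lo, hi = lam_max * (1 + 1e-9) + 1e-12, lam_max + 1.0
    while g(hi) > m + 2 * gamma:                      # g decreases toward m as beta grows
        hi *= 2.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if g(mid) > m + 2 * gamma else (lo, mid)
    beta = 0.5 * (lo + hi)
    return np.linalg.inv(-Phi / beta + np.linalg.inv(Sigma_v)), beta   # Eq. (4)

Phi = np.array([[0.8, 0.2], [0.2, 0.5]])              # placeholder Phi_k
Sigma_v = 0.2 * np.eye(2)
Lambda_star, beta = optimal_attack_covariance(Phi, Sigma_v, gamma=1.0)
print(beta, Lambda_star, sep="\n")
```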

By Theorem 2, the controller can compute the best response of the adversary, i.e., given any detection threshold $\gamma$, it can estimate the covariance matrix selected by the adversary. By Theorem 1, the mode-switch probability follows a $\chi^2$ distribution. Using the tail bounds of a $\chi^2$ random variable to approximate the mode-switch probability, we have $\Pr\big(D(\tilde{z}_k \| z_k) \le \gamma\big) \le \exp\big(-\frac{(m-\gamma)^2}{4m}\big)$. We remark that the cost obtained using the tail bound is an upper bound, and hence models the worst-case cost. Although Theorem 1 coincides with the result reported in [7], Theorem 2 differs from [7] since the adversary's objective is different and its strategy is not restricted to a linear attack strategy. In the following, we also show how the system computes the detection threshold to optimize the LQG control performance, which is not reported in [7].

Given the optimal attacks characterized by Theorems 1 and 2, in the following we derive the optimal mode-switch thresholds. The system solves the following optimization problem:

$$\min_{\gamma_{0:N}} \quad \sum_{k=0}^{N} \mathrm{tr}(H_k\Sigma_k) \qquad \text{(6a)}$$
$$\text{subject to} \quad \Sigma_k = G(\Sigma_{k-1}) + \bar{W}_k + F(\Lambda_k), \ \forall k \qquad \text{(6b)}$$
$$\qquad \Sigma_k \succeq 0, \ \forall k \qquad \text{(6c)}$$

Theorem 3. The optimal mode-switch thresholds can be obtained by solving a convex program.

Proof. By Theorem 2, problem (6) can be expressed as follows:

$$\min_{\gamma_{0:N},\,\Psi} \quad \Psi \qquad \text{(7a)}$$
$$\text{subject to} \quad \max_{\Lambda_{0:N}}\Big\{\sum_{k=0}^{N} \mathrm{tr}(H_k\Sigma_k) : \Lambda_k \text{ satisfies } D(\tilde{z}_k \| z_k) \le \gamma_k\Big\} \le \Psi \qquad \text{(7b)}$$
$$\qquad \Sigma_k = G(\Sigma_{k-1}) + \bar{W}_k + F(\Lambda_k), \ \forall k \qquad \text{(7c)}$$
$$\qquad \Sigma_k \succeq 0, \ \forall k \qquad \text{(7d)}$$

Constraint (7b) is linear with respect to $\Psi$ and logarithmically convex with respect to $\gamma_k$. Thus problem (7) is jointly convex with respect to $\gamma_k$ and $\Psi$. □

5 Simulation

In this section, we present a case study to demonstrate our proposed method. The proposed approach is evaluated using Matlab. We consider a robot moving along a straight line [12]. The state of the robot contains its position and velocity, which are measured by two sensors. The dynamics is given by

$$x_{k+1} = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix} x_k + \begin{bmatrix} 1 \\ 0.5 \end{bmatrix} u_k + w_k, \qquad y_k = x_k + v_k + a_k.$$

The adversary can compromise the measurements from the position sensor, while it cannot tamper with the measurements from the velocity sensor. Therefore, the output model is expressed using a switched system as

$$y_k = x_k + v_k + a_k \ \text{ if } \theta_k = 0, \qquad y_k = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix} x_k + \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix} v_k \ \text{ if } \theta_k = 1.$$

Three scenarios are considered in our simulation: (i) LQG control in a benign environment; (ii) designing the detection threshold using the proposed approach; and (iii) randomly generating one detection threshold without considering the presence of the adversary, with the adversary optimally responding to it. We demonstrate the effectiveness of the proposed approach by evaluating the LQG control performance in each scenario, as shown in Fig. 1a. In the first scenario, the system does not need to switch between different modes and the cost incurred is the optimal LQG cost. This scenario gives the minimum cost among the three scenarios. Using the proposed approach, although the system still incurs additional cost compared to the LQG cost incurred in a benign environment due to the presence of the adversary, the cost increase is limited. In the third scenario, the system does not consider the presence of the adversary and simply fixes a mode-switch threshold $\gamma = 3.3$ for all time instants. This scheme gives the highest cost among all scenarios: the strategic adversary can introduce a much higher cost compared to our proposed approach.

We illustrate the relationship between the cost function and $\gamma$ in Fig. 1b. We consider the single-time-step case and choose $\gamma$ from 1 to 4. When $\gamma$ is close to the single-stage optimal solution $\gamma^* = 2.1$, the cost is minimized. When the threshold deviates from the optimal value $\gamma^*$, the system incurs more cost.
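A small NumPy sketch of the case-study output model as reconstructed above; the state value, noise level, and injected false measurement are illustrative assumptions, not values reported in the paper.

```python
# Sketch: the case-study switched output model (compromised first sensor).
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.0], [1.0, 1.0]])
B = np.array([[1.0], [0.5]])
C0 = np.eye(2)                               # theta = 0: all sensors used
C1 = np.array([[0.0, 0.0], [0.0, 1.0]])      # theta = 1: only the secured sensor

x = np.array([[1.0], [0.5]])                 # current state (placeholder value)
v = rng.normal(scale=0.1, size=(2, 1))       # measurement noise (placeholder level)
a = np.array([[2.0], [0.0]])                 # false data injected on the first sensor

y_mode0 = C0 @ x + v + a                     # biased output when no alarm is raised
y_mode1 = C1 @ x + C1 @ v                    # attack eliminated after an alarm
print(y_mode0.ravel(), y_mode1.ravel())
```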

Fig. 1. (a) Cost functions incurred in different scenarios (quadratic cost vs. time step): the optimal LQG cost in a benign environment, the cost incurred using the proposed mode switch, and the cost incurred using a fixed mode-switch threshold chosen without considering the presence of the adversary. (b) The relationship between the single-stage quadratic cost and the value of the detection threshold γ.

6 Conclusion

In this paper, we focused on a control system performing LQG control under false data injection attacks. Using the signal issued by a detector whose detection threshold is carefully designed as a switching signal, the system was modeled as a switched linear system with two modes. We investigated the optimal attack strategy and gave a closed-form solution for the covariance of the optimal attack. Furthermore, we showed that the optimal detection thresholds, and the corresponding optimal mode-switch policy, can be computed using a convex program. The proposed approach was evaluated using a numerical case study.

References

1. Alpcan, T., Basar, T.: An intrusion detection game with limited observations. In: International Symposium on Dynamic Games and Applications (2006)
2. Bai, C.Z., Gupta, V.: On Kalman filtering in the presence of a compromised sensor: fundamental performance bounds. In: American Control Conference (ACC), pp. 3029–3034. IEEE (2014)
3. Cárdenas, A.A., Amin, S., Sastry, S.: Research challenges for the security of control systems. In: Summit on Hot Topics in Security (HotSec). USENIX (2008)
4. Clark, A., Niu, L.: Linear quadratic Gaussian control under false data injection attacks. In: American Control Conference (ACC), pp. 5737–5743. IEEE (2018)
5. Fawzi, H., Tabuada, P., Diggavi, S.: Secure estimation and control for cyber-physical systems under adversarial attacks. IEEE Trans. Autom. Control 59(6), 1454–1467 (2014)
6. Ghafouri, A., Abbas, W., Laszka, A., Vorobeychik, Y., Koutsoukos, X.: Optimal thresholds for anomaly-based intrusion detection in dynamical environments. In: Zhu, Q., Alpcan, T., Panaousis, E., Tambe, M., Casey, W. (eds.) GameSec 2016. LNCS, vol. 9996, pp. 415–434. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-47413-7_24
7. Guo, Z., Shi, D., Johansson, K.H., Shi, L.: Worst-case stealthy innovation-based linear attack on remote state estimation. Automatica 89, 117–124 (2018)
8. Kalman, R.E.: A new approach to linear filtering and prediction problems. ASME J. Basic Eng. 82(1), 35–45 (1960)
9. Kung, E., Dey, S., Shi, L.: Optimal stealthy attack under KL divergence and countermeasure with randomized threshold. 20th IFAC World Congr. 50(1), 9496–9501 (2017)
10. Liu, Y., Ning, P., Reiter, M.K.: False data injection attacks against state estimation in electric power grids. ACM Trans. Inf. Syst. Secur. (TISSEC) 14(1), 13 (2011)
11. Miao, F., Zhu, Q.: A moving-horizon hybrid stochastic game for secure control of cyber-physical systems. In: Conference on Decision and Control (CDC), pp. 517–522. IEEE (2014)
12. Mo, Y., Garone, E., Casavola, A., Sinopoli, B.: False data injection attacks against state estimation in wireless sensor networks. In: Conference on Decision and Control (CDC), pp. 5967–5972. IEEE (2010)
13. Pajic, M., et al.: Robustness of attack-resilient state estimators. In: ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS), pp. 163–174. IEEE (2014)
14. Psiaki, M.L., Humphreys, T.E.: GNSS spoofing and detection. Proc. IEEE 104(6), 1258–1270 (2016)
15. Shoukry, Y., Nuzzo, P., Puggelli, A., Sangiovanni-Vincentelli, A.L., Seshia, S.A., Tabuada, P.: Secure state estimation for cyber-physical systems under sensor attacks: a satisfiability modulo theory approach. IEEE Trans. Autom. Control 62(10), 4917–4932 (2017)
16. Tambe, M.: Security and Game Theory: Algorithms, Deployed Systems, Lessons Learned. Cambridge University Press, Cambridge (2011)
17. Tippenhauer, N.O., Pöpper, C., Rasmussen, K.B., Capkun, S.: On the requirements for successful GPS spoofing attacks. In: ACM Conference on Computer and Communications Security, pp. 75–86. ACM (2011)
18. Umsonst, D., Sandberg, H.: A game-theoretic approach for choosing a detector tuning under stealthy sensor data attacks. In: 2018 IEEE Conference on Decision and Control (CDC), pp. 5975–5981. IEEE (2018)
19. Weerakkody, S., Sinopoli, B.: Detecting integrity attacks on control systems using a moving target approach. In: 54th IEEE Conference on Decision and Control (CDC), pp. 5820–5826. IEEE (2015)
20. Weerakkody, S., Sinopoli, B.: A moving target approach for identifying malicious sensors in control systems. In: Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1149–1156. IEEE (2016)
21. Zhang, R., Venkitasubramaniam, P.: A game theoretic approach to analyze false data injection and detection in LQG system. In: Conference on Communications and Network Security (CNS), pp. 427–431. IEEE (2017)
22. Zhu, Q., Başar, T.: Dynamic policy-based IDS configuration. In: Conference on Decision and Control, pp. 8600–8605. IEEE (2009)

QFlip: An Adaptive Reinforcement Learning Strategy for the FlipIt Security Game

Lisa Oakley and Alina Oprea

Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA
{oakley.l,a.oprea}@northeastern.edu

Abstract. A rise in Advanced Persistent Threats (APTs) has introduced a need for robustness against long-running, stealthy attacks which circumvent existing cryptographic security guarantees. FlipIt is a security game that models attacker-defender interactions in advanced scenarios such as APTs. Previous work extensively analyzed non-adaptive strategies in FlipIt, but adaptive strategies arise naturally in practical interactions as players receive feedback during the game. We model the FlipIt game as a Markov Decision Process and introduce QFlip, an adaptive strategy for FlipIt based on temporal difference reinforcement learning. We prove theoretical results on the convergence of our new strategy against an opponent playing with a Periodic strategy. We confirm our analysis experimentally by extensive evaluation of QFlip against specific opponents. QFlip converges to the optimal adaptive strategy for Periodic and Exponential opponents using associated state spaces. Finally, we introduce a generalized QFlip strategy with composite state space that outperforms a Greedy strategy for several distributions including Periodic and Uniform, without prior knowledge of the opponent's strategy. We also release an OpenAI Gym environment for FlipIt to facilitate future research.

Keywords: Security games · FlipIt · Reinforcement learning · Adaptive strategies · Markov decision processes · Online learning

1 Introduction

Motivated by sophisticated cyber-attacks such as Advanced Persistent Threats (APT), the FlipIt game was introduced by van Dijk et al. as a model of cyber-interactions in APT-like scenarios [3]. FlipIt is a two-player cybersecurity game in which the attacker and defender contend for control of a sensitive resource (for instance a password, cryptographic key, computer system, or network). Compared to other game-theoretical models, FlipIt has the unique characteristic of stealthiness, meaning that players are not notified about the exact state of the resource during the game. Thus, players need to schedule moves during the game with minimal information about the opponent's strategy. The challenge of


determining the optimal strategy is in finding the best move times to take back resource control, while at the same time minimizing the overall number of moves (as players pay a cost upon moving). FlipIt is a repeated, continuous game, in which players can move at any time and benefits are calculated according to the asymptotic control of the resource minus the move cost. The original FlipIt paper performed a detailed analysis of non-adaptive strategies in which players move according to a renewal process selected at the beginning of the game. Non-adaptive strategies are randomized, but do not benefit from feedback received during the game. In the real world, players naturally get information about the game and the opponent’s strategy as play progresses. For instance, if detailed logging and monitoring is performed in an organization, an attacker might determine the time of the last key rotation or machine refresh upon system takeover. van Dijk et al. defined adaptive strategies that consider various amounts of information received during gameplay, such as the time since the last opponent move. However, analysis and experimentation in the adaptive case has remained largely unexplored. In a theoretical inspection, van Dijk et al. prove that the optimal Last Move adaptive strategy against Periodic and Exponential opponents is a Periodic strategy. They also introduce an adaptive Greedy strategy that selects moves to maximize local benefit. However, the Greedy strategy requires extensive prior knowledge about the opponent (the exact probability distribution of the renewal process), and does not always result in the optimal strategy [3]. Other extensions of FlipIt analyzed modified versions of the game [5,6,10,13,19], but mostly considered non-adaptive strategies. In this paper, we tackle the challenge of analyzing the two-player FlipIt game with one adaptive player and one non-adaptive renewal player. We limit the adaptive player’s knowledge to the opponent’s last move time and show how this version of the game can be modeled as an agent interacting with a Markov Decision Process (MDP). We then propose for the first time the use of temporal difference reinforcement learning for designing adaptive strategies in the FlipIt game. We introduce QFlip, a Q-Learning based adaptive strategy that plays the game by leveraging information about the opponent’s last move times. We explore in this context the instantiation of various reward and state options to maximize QFlip’s benefit against a variety of opponents. We start our analysis by considering an opponent playing with the Periodic with random phase strategy, also studied by [3]. We demonstrate for this case that QFlip with states based on the time since opponent’s last move converges to the optimal adaptive strategy (playing immediately after the opponent with the same period). We provide a theoretical analysis of the convergence of QFlip against this Periodic opponent. Additionally, we perform detailed experiments in the OpenAI Gym framework, demonstrating fast convergence for a range of parameters determining the exploration strategy and learning decay. Next, we perform an analysis of QFlip against an Exponential opponent, for which van Dijk et al. determined the optimal strategy [3]. We show experimentally that QFlip with states based on the player’s own move converges to the optimal strategy and the time to convergence depends largely on the adaptive player’s move


cost and the Exponential player’s distribution parameters. Finally, we propose a generalized, composite QFlip instantiation that uses as state the time since last moves for both players. We show that composite QFlip converges to the optimal strategy for Periodic and Exponential. Remarkably, QFlip has no prior information about the opponent strategy at the beginning of the game, and most of the time outperforms the Greedy algorithm (which leverages information about the opponent strategy). For instance, QFlip achieves average benefit between 5% and 50% better than Greedy against Periodic and 15% better than Greedy against a Uniform player. The implications of our findings are that reinforcement learning is a promising avenue for designing optimal learning-based strategies in cybersecurity games. Practically, our results also reveal that protecting systems against adaptive adversaries is a difficult task and defenders need to become adaptive and agile in face of advanced attackers. To summarize, our contributions in the paper are: – We model the FlipIt game with an adaptive player competing against a renewal opponent as an MDP. – We propose QFlip, a versatile generalized Q-Learning based adaptive strategy for FlipIt that does not require prior information about the opponent strategy. – We prove QFlip converges to the optimal strategy against a Periodic opponent. – We demonstrate experimentally that QFlip converges to the optimal strategy and outperforms the Greedy strategy for a range of opponent strategies. – We release an OpenAI Gym environment for FlipIt to aid future researchers. Paper Organization. We start with surveying the related work in Sect. 2. Then we introduce the FlipIt game in Sect. 3 and describe our MDP modeling of FlipIt and the QFlip strategy in Sect. 4. We analyze QFlip against a Periodic opponent theoretically in Sect. 5. We perform experimental evaluation of Periodic and Exponential strategies in Sect. 6. We evaluate generalized composite QFlip against four distributions in Sect. 7, and conclude in Sect. 8.

2 Related Work

FlipIt, introduced by van Dijk et al. [3], is a non-zero-sum cybersecurity game where two players compete for control over a shared resource. The game distinguishes itself by its stealthy nature, as moves are not immediately revealed to players during the game. Finding an optimal (or dominant) strategy in FlipIt implies that a player can schedule its defensive (or attack) actions most effectively against stealthy opponents. van Dijk et al. proposed multiple non-adaptive strategies and proved results about their strongly dominant opponents and Nash Equilibria [3]. They also introduce the Greedy adaptive strategy and show that it results in a dominant strategy against Periodic and Exponential players, but it is not always optimal. The original paper left many open questions about designing general adaptive strategies for FlipIt. van Dijk et al. [1] analyzed the applications of the game in real-world scenarios such as password and key management.


Additionally, several FlipIt extensions have been proposed and analyzed. These extensions focus on modifying the game itself, adding additional players [6,10], resources [13], and move types [19]. FlipLeakage considers a version of FlipIt in which information leakage is gradual and ownership of the resource is obtained incrementally [5]. Zhang et al. consider limited resources with an upper bound on the frequency of moves and analyze Nash Equilibria in this setting [25]. Several games study human defenders playing against automated attackers using Periodic strategies [8,18,20]. All of this work uses exclusively non-adaptive players, often limiting analysis solely to opponents playing periodically. The only previous work that considers adaptive strategies is by Laszka et al. [14,15], but in a modification of the original game with non-stealthy defenders. QFlip can generalize to play adaptively in these extensions, which we leave to future work.

Reinforcement learning (RL) is an area of machine learning in which an agent takes action on an environment, receiving feedback in the form of a numerical reward, and adapting its action policy over time in order to maximize its cumulative reward. Traditional methods are based primarily on Monte Carlo and temporal difference Q-Learning [22]. Recently, approximate methods based on deep neural networks have proved effective at complex games such as Backgammon, Atari and AlphaGo [17,21,23]. RL has emerged in security games in recent years. Han et al. use RL for adaptive cyber-defense in a Software-Defined Networking setting and consider adversarial poisoning attacks against the RL training process [9]. Hu et al. propose the idea of using Q-Learning as a defensive strategy in a cybersecurity game for detecting APT attacks in IoT systems [11]. Motivated by HeartBleed, Zhu et al. consider an attacker-defender model in which both parties synchronously adjust their actions, with limited information on their opponent [26]. Other RL applications include network security [4], spatial security games [12], security monitoring [2], and crowdsensing [24]. Markov modeling for moving target defense has also been proposed [7,16]. To the best of our knowledge, our work presents the first application of RL to stealthy security games, resulting in the most effective adaptive FlipIt strategy.

3 Background on the FlipIt Game

FlipIt is a two-player game introduced by van Dijk et al. to model APT-like scenarios [3]. In FlipIt, players move at any time to take control of a resource. In practice, the resource might correspond to a password, cryptographic key, or computer system that both attacker and defender wish to control. Upon moving, players pay a move cost (different for each player). Success in the game is measured by player benefit, defined as the asymptotic amount of resource control (gain) minus the total move cost as described in Fig. 2. The game is infinite, and we consider a discrete version of the game in which players can move at discrete time ticks. Figure 1 shows an example of the FlipIt game, and Fig. 2 provides relevant notation we will use in the paper. An interesting aspect of FlipIt is that the players do not automatically learn the opponent’s moves in the game. In other words, moves are stealthy, and


Fig. 1. Example of FlipIt game between Last Move adaptive Player 1 and Player 0 using a Renewal strategy. Rounded arrows indicate player moves. The first move of Player 1 is flipping, and the second move is consecutive. τt is the time since Player 0's last known move at time t and LMi is Player i's actual last move at time t. Due to the stealthy nature of the game, τt ≥ t − LM0.

Fig. 2. FlipIt notation (left) and QFlip notation (right):

FlipIt notation — t: time step (tick); ki: Player i's move cost; Γi: Player i's total gain (time in control); ni: Player i's total moves; βi: Player i's total benefit, βi = Γi − ki · ni; τt: time since opponent's last known move at time t; LMi: Player i's actual last move time at time t; ρ: Player 0's average move time.

QFlip notation — st: observation; at: action; rt: reward; γ: future discount; α: count of at in st; ε: exploration parameter; d: exploration discount; p: new move probability.
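As a concrete illustration of the scoring in Fig. 2, the Python sketch below computes each player's gain and benefit from two lists of move times over a finite horizon. The per-tick normalization, the helper name, and the tie-breaking rule (a simultaneous mover loses control to Player 1) are presentation choices for this example, not part of the paper's definitions.

def benefit(moves0, moves1, k0, k1, T):
    """Return (beta0, beta1) averaged per tick over the horizon 0..T-1.

    moves0, moves1: move times of Player 0 and Player 1.
    Player 0 is assumed to control the resource at t = 0.
    """
    owner = 0
    gain = [0, 0]
    m0, m1 = set(moves0), set(moves1)
    for t in range(T):
        if t in m1:          # assumed tie-break: Player 1 takes control on simultaneous moves
            owner = 1
        elif t in m0:
            owner = 0
        gain[owner] += 1
    beta0 = (gain[0] - k0 * len(moves0)) / T
    beta1 = (gain[1] - k1 * len(moves1)) / T
    return beta0, beta1

# Player 0 moves every 50 ticks, Player 1 right after, with k1 = 25.
print(benefit(list(range(0, 1000, 50)), list(range(1, 1000, 50)), 1, 25, 1000))

For these parameters the sketch returns a Player 1 benefit of 0.48, matching the optimal value quoted for δ = 50 and k1 = 25 in Sect. 6.2.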

players need to move without knowing the state of the resource. There are two main classes of strategies defined for FlipIt: Non-adaptive Strategies. Here, players do not receive any feedback upon moving. Non-adaptive strategies are determined at the beginning of the game, but they might employ randomization to select the exact move times. Renewal strategies are non-adaptive strategies that generate the intervals between consecutive moves according to a renewal process. The inter-arrival times between moves are independent and identically distributed random variables chosen from a probability density function (PDF). Examples of renewal strategies include: – Periodic with random phase (Pδ ): The player first moves uniformly at random with phase Rδ ∈ (0, δ), with each subsequent move occurring periodically, i.e., exactly at δ time units after the previous move. – Exponential: The inter-arrival time is distributed according to an exponential (memoryless) distribution Eλ with rate λ. The probability density function for Eλ is fEλ (x) = λe−λx , for x > 0, and 0 otherwise. – Uniform: The inter-arrival time is distributed according to an uniform distribution Uδ,u with parameters δ and u. The probability density function for Uδ,u is fUδ,u (x) = 1/u, for x ∈ [δ − u/2, δ + u/2], and 0 otherwise.


– Normal: The inter-arrival time is distributed according to a normal distribution Nμ,σ with mean μ and standard deviation σ. The probability density function for Nμ,σ is fNμ,σ(x) = (1/√(2πσ²)) e^(−(x−μ)²/(2σ²)), for x ∈ R.

Adaptive Strategies. In these strategies, players receive feedback during the game and can adaptively change their subsequent moves. In Last Move (LM) strategies, players receive information about the opponent's last move upon moving in the game. This is the most restrictive and therefore most challenging subset of adaptive players, so we only focus on LM adaptive strategies here. Theoretical analysis of the optimal LM strategy against specific Renewal strategies is given in [3]. For the Periodic strategy, the optimal LM strategy is to move right after the Periodic player (whose moves can be determined from the LM feedback received during the game). The memoryless property of the exponential distribution implies that the probability of moving at any time is independent of the time elapsed since the last player's move. Thus, an LM player that knows the Exponential opponent's last move time has no advantage over a non-adaptive player. Accordingly, the dominant LM strategy against an Exponential player is still a Periodic strategy, with the period depending on the player's move cost and the rate of the Exponential opponent.

Greedy Strategy. To the best of our knowledge, the only existing adaptive strategy against general Renewal players is the "Greedy" strategy [3]. Greedy calculates the "local benefit" L(z) of a given move time z as

L(z) = (1/z) [ ∫_{x=0}^{z} x f̂0(x) dx + z ∫_{x=z}^{∞} f̂0(x) dx − k1 ],   (1)

where f̂0(x) = f0(τ + x)/(1 − F0(τ)), f0 is the probability density function (PDF) of the opponent's strategy, F0 is the corresponding cumulative density function (CDF), and τ is the interval since the opponent's last move. Greedy finds the move time ẑ which maximizes this local benefit, and schedules a move at ẑ if the maximum local benefit is positive. In contrast, if the local benefit is negative, Greedy chooses not to move, dropping out of the game. Although the Greedy strategy is able to compete with any Renewal strategy, it is dependent on prior knowledge of the opponent's strategy. van Dijk et al. showed that Greedy can play optimally against Periodic and Exponential players [3]. However, they showed a strategy for which Greedy is not optimal. This motivates us to look into other general adaptive strategies that achieve higher benefit than Greedy and require less knowledge about the opponent's strategy.
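The local benefit in Eq. (1) is easy to prototype numerically; the sketch below evaluates L(z) for a given opponent PDF and elapsed time τ and searches a grid for the best move time. The opponent parameters, the grid, and the function names are illustrative assumptions for this example.

import numpy as np
from scipy.integrate import quad
from scipy.stats import uniform

def local_benefit(z, tau, f0, F0, k1):
    """Greedy's local benefit L(z) from Eq. (1), conditioned on elapsed time tau."""
    denom = 1.0 - F0(tau)
    if z <= 0 or denom <= 0:
        return -np.inf
    fhat = lambda x: f0(tau + x) / denom              # conditional PDF fhat_0
    first, _ = quad(lambda x: x * fhat(x), 0, z)      # first integral in Eq. (1)
    second = (1.0 - F0(tau + z)) / denom              # equals the integral of fhat_0 from z to infinity
    return (first + z * second - k1) / z

# Example: Uniform renewal opponent U_{delta,u} with delta = 100, u = 50, cost k1 = 10,
# and tau = 30 ticks since the opponent's last observed move (all illustrative values).
delta, u, k1, tau = 100, 50, 10, 30
dist = uniform(loc=delta - u / 2, scale=u)
zs = np.linspace(1, delta + u, 1000)
vals = np.array([local_benefit(z, tau, dist.pdf, dist.cdf, k1) for z in zs])
z_hat = zs[vals.argmax()]
print(f"Greedy schedules a move at z = {z_hat:.1f} (local benefit {vals.max():.3f})")

A grid search is used here for robustness; Sect. 6.1 notes that the authors' implementation instead maximized L(z) with scipy.optimize.minimize (BFGS).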

4 New Adaptive Strategy for FlipIt

Our main insight is to apply traditional reinforcement learning (RL) strategies to the FlipIt security game to create a Last Move adaptive strategy that outperforms existing adaptive strategies. We find that modeling FlipIt as a Markov Decision Process (MDP ) and defining an LM Q-Learning strategy is non-trivial, as the


stealthy nature of the game resists learning. We consider the most challenging setting, in which the adaptive player has no prior knowledge of the opponent's strategy. In this section, we present QFlip, a strategy which is able to overcome these challenges and elegantly compete against any Renewal opponent.

4.1 Modeling FlipIt as an MDP

Correctly modeling the game of two-player FlipIt as an MDP is as important to our strategy's success as the RL algorithm itself. In our model, player 1 is an agent interacting with an environment defined by the control history of the FlipIt resource as depicted in Fig. 3.

[Fig. 3 diagram: the agent (adaptive player) (1) estimates the value of at in st, (2) stores the estimated value in Q(st, at), and (3) chooses at+1 with exploration (see Algorithm 1 for details); the environment (FlipIt game history) receives action at ∈ {0, 1} and returns state st+1 and reward rt+1.]

Fig. 3. Modeling FlipIt as an MDP.

We consider the infinite but discrete version of FlipIt and say that at every time step (tick), t ∈ {1, 2, . . .}, the game is in some state st ∈ S where S is a set of observed state values dependent on the history of the environment. At each time t, the agent chooses an action at ∈ A = {0, 1} where 0 indicates waiting, and 1 indicates moving. The environment updates accordingly and sends the agent state st+1 and reward rt+1 defined in Table 1 and Eq. (3), respectively. Defining optimal state values and reward functions is essential to generating an effective RL algorithm. In a stealthy game with an unknown opponent, this is a non-trivial task that we will investigate in the following paragraphs.

Modeling State. At each time step t, the LM player knows two main pieces of information: its own last move time (LM1), and the time since the opponent's last known move (τt). The observed state can therefore depend on one or both of these values. We define three observation schemes in Table 1. We compare these observation schemes against various opponents in the following sections.

Table 1. Observation schemes

Scheme     | s0                    | st for t > 0
oppLM      | −1                    | τt if LM1 > Player 0's first move, otherwise −1
ownLM      | 0                     | t − LM1
composite  | (s0^ownLM, s0^oppLM)  | (st^ownLM, st^oppLM)
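A minimal sketch of the three observation schemes in Table 1, written as a helper the agent could call each tick; the function name and the convention of encoding "no opponent move observed yet" as None are assumptions of this example.

def observe(scheme, t, own_last_move, opp_last_known_move):
    """Return the agent's observation under the oppLM, ownLM, or composite scheme.

    own_last_move: Player 1's last move time, or None if it has not moved yet.
    opp_last_known_move: Player 0's last move time as revealed by LM feedback,
        or None if no opponent move has been observed yet.
    """
    if scheme == "ownLM":
        return t - own_last_move if own_last_move is not None else 0
    if scheme == "oppLM":
        # tau_t is only available once Player 1 has moved after Player 0's first move.
        if opp_last_known_move is None:
            return -1
        return t - opp_last_known_move
    if scheme == "composite":
        return (observe("ownLM", t, own_last_move, opp_last_known_move),
                observe("oppLM", t, own_last_move, opp_last_known_move))
    raise ValueError(f"unknown scheme: {scheme}")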


Modeling Reward. Temporal difference learning algorithms leverage incremental updates from rewards at each time tick, and therefore require the environment to transmit meaningful reward values. A good reward should be flexible, independent of prior knowledge of the opponent strategy, and most importantly promote the ultimate goal of achieving high benefit. We divide actions into three resulting categories: flipping, consecutive, and no-play, based on the type of action and the state of the environment as depicted in Table 2.

Table 2. Player 1 action categories

Move type   | at | Env state  | Cost | Outcome              | Explanation
Flipping    | 1  | LM0 > LM1  | −k1  | τt+1 = t − LM0 + 1   | Player 1 takes control
Consecutive | 1  | LM0 ≤ LM1  | −k1  | τt+1 = τt + 1        | Player 1 moves while in control
No-play     | 0  | any        | 0    | τt+1 = τt + 1        | Player 1 takes no action
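The transitions in Table 2 can be captured by a small environment step function; this Python sketch is a simplified reading of the table (the names and the −1 sentinel for "Player 1 has not moved yet" are assumptions), not the released gym-flipit implementation.

def env_step(t, action, LM0, LM1, tau, k1):
    """Apply Player 1's action at tick t and return (move_type, cost, new_tau, new_LM1).

    LM0: Player 0's actual last move time (known to the environment, not the agent).
    LM1: Player 1's last move time (use -1 before its first move).
    tau: time since Player 0's last *known* move.
    """
    if action == 1 and LM0 > LM1:
        # Flipping move: Player 1 retakes control and learns Player 0's last move time.
        return "flipping", -k1, t - LM0 + 1, t
    if action == 1:
        # Consecutive move: Player 1 was already in control, so the move is wasted.
        return "consecutive", -k1, tau + 1, t
    # No-play: nothing changes except the clock.
    return "no-play", 0, tau + 1, LM1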

The most straightforward reward after each action is the resulting benefit

β1^(at,st) = Γ1^(at,st) − k1,   (2)

where Γ1^(at,st) is the resulting gain, or Player 1's additional time in control between time t and the opponent's next move as a result of taking action at in state st. These rewards would sum to equal Player 1's total benefit over the course of the game, therefore exactly matching the goal of maximized benefit. For consecutive moves, β1^(at,st) = −k1, as Player 1 is already in control, and therefore attains no additional gain, resulting in a wasted move. For no-plays, β1^(at,st) = 0, as not moving guarantees no additional gain and incurs no cost. The main challenge here comes from determining the reward for flipping moves. Consider the case where the opponent plays more than once between two of the agent's moves. Here, it is impossible to calculate an accurate β1^(at,st), as the agent cannot determine the exact time it lost control. Moreover, there is no way to calculate the future gain from any move against a randomized opponent, as the opponent's next move time is unknown. We acknowledge a few ineffective responses to these challenges. The first rewards Player 1 for playing soon after the opponent (higher reward for lower resulting τt+1 values). This works against a Periodic opponent, but does not work against an Exponential opponent as it does not reward optimal play. Another approach is a reward based on prior gain, rather than resulting gain. This is difficult to calculate and rewards previous moves, rather than the current action. We determined experimentally that the best reward for at against an unknown opponent is a fixed constant related to the opponent's move frequency, as follows:


rt+1 =
  0            if no-play at time t,
  −k1          if consecutive move at time t,
  (ρ − k1)/c   if flipping move at time t,        (3)

where ρ is an estimate of Player 0's average move frequency and c is a constant determined before gameplay (for normalization). Playing often toward the beginning of the game and keeping track of the observed move times can provide a rough estimate of ρ. This reward proves highly effective, while maintaining the flexibility to play against any opponent without any details of their strategy.
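A direct transcription of Eq. (3) into Python, reusing the move-type labels from Table 2; ρ would be estimated online as described above (here it is simply passed in), and the function name is an assumption of this sketch.

def reward(move_type, k1, rho, c):
    """Environment reward of Eq. (3) for the action just taken."""
    if move_type == "no-play":
        return 0.0
    if move_type == "consecutive":
        return -k1
    if move_type == "flipping":
        return (rho - k1) / c
    raise ValueError(f"unknown move type: {move_type}")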

4.2 The QFlip Strategy

In this section we present a new, highly effective LM adaptive strategy, QFlip, based on existing temporal difference reinforcement learning techniques. QFlip plays within our FlipIt model from Sect. 4.1. Though optimized to play against Renewal opponents, a QFlip player can compete against any player, including other adaptive opponents. To the best of our knowledge, QFlip is the first adaptive strategy which can play FlipIt against both Renewal and non-Renewal opponents without any prior knowledge about their strategy.

Value Estimation. QFlip uses feedback attained from the environment after each move and the information gathered during gameplay to estimate the value of action at in state st. Player 1 has no prior knowledge of the opponent's strategy, and therefore must learn an optimal strategy in real time. We adopt an online temporal difference model of value estimation where Q(st, at) is the expected value of taking action at in state st as in [22]. We start by defining the actual value of an action at in state st as a combination of the immediate reward and potential future value:

V_{st,at} = r_{t+1} + γ · max_{a′∈A} Q(s_{t+1}, a′),   (4)

where 0 ≤ γ ≤ 1 is a constant discount on the estimated future value, and r_{t+1} is the environment-provided reward from Eq. (3). After each tick, we update our value estimate by a discounted difference between the estimated move value and the actual move value as follows:

Q_{α+1}(st, at) = Q_α(st, at) + (1/(α + 1)) (V_{st,at} − Q_α(st, at)),   (5)

where α is the number of times action a has been performed in state s, and 1/α is the step-size parameter, which discounts the change in estimate proportionally to the number of times this estimate has been modified. This update policy uses the estimate error to step toward the optimal value estimate at each state. Note that, if γ = 0, we play with no consideration of the future, and Qα (s, a) is just an average of the environment-provided rewards over all times action a was performed in state s.


Action Choice (Exploration). A key element of any reinforcement learning algorithm is balancing exploitation of the learned value estimates with exploration of new states and actions to avoid getting stuck in local maxima. We employ a modified decaying ε-greedy exploration strategy from [22]:

at = a uniformly random action from A with probability ε · e^(−d·v), and at = argmax_{a∈A} Q(st, a) otherwise,   (6)

where ε and d are constant exploration and decay parameters with 0 ≤ ε, d < 1, and v is the number of times QFlip has visited state st. If Q(st, 0) = Q(st, 1), we choose at = 0 with probability p, and at = 1 with probability 1 − p.

Algorithm Definition. We then define the agent's policy in Algorithm 1. This is a temporal difference Q-Learning based algorithm. The algorithm first estimates the opponent move rate ρ by playing several times. This step is important to determine whether it should continue playing or drop out (when k1 ≥ ρ), and to fix the environment reward. If the agent decides to play, it proceeds to initialize the Q table of estimated rewards in each state and action to 0. The agent's initial state is set according to Table 1. The action choice is based on exploration, as previously discussed. Once an action is selected, the agent receives the reward and new state from the environment and updates Q according to Eq. (5).

Algorithm 1
1: Estimate rate of play ρ of opponent and drop out if k1 ≥ ρ
2: Initialize 2D table Q with all zeros
3: Initialize s0 according to observation type (see Section 4.1)
4: for t ∈ {1, 2, . . .} do
5:   if Q(st, 0) = Q(st, 1) then
6:     at ← 0 with probability p, else at ← 1
7:   else
8:     Choose action at according to Equation (6)
9:   Simulate action at on environment, and observe st+1, rt+1
10:  Update Q(st, at) according to Equation (5)
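The following self-contained Python rendering of Algorithm 1 plays QFlip (oppLM states, γ = 0) against a Periodic opponent with random phase. The inlined opponent model, the use of the true period as the ρ estimate, and the tie-breaking on simultaneous moves are simplifications of this sketch, so it should be read as an illustration rather than the released implementation.

import math
import random
from collections import defaultdict

def qflip_vs_periodic(delta=50, k1=25, T=200_000,
                      eps=0.5, d=0.05, p=0.7, c=5, seed=0):
    """Minimal QFlip (oppLM states, gamma = 0) vs. a Periodic opponent with random phase."""
    rng = random.Random(seed)
    rho = float(delta)                       # Alg. 1, line 1 (true rate used here for brevity)
    if k1 >= rho:
        return 0.0                           # dropping out is optimal
    Q = defaultdict(lambda: [0.0, 0.0])      # Q[s] = [value of waiting, value of moving]
    counts = defaultdict(lambda: [0, 0])
    visits = defaultdict(int)
    phase = rng.randrange(delta)
    s, LM1, owner = -1, None, 0              # Player 0 controls the resource at t = 0
    gain1, moves1 = 0, 0
    for t in range(1, T):
        lm0 = phase + ((t - phase) // delta) * delta if t >= phase else 0
        if t >= phase and (t - phase) % delta == 0:
            owner = 0                        # the opponent flips at its periodic move times
        visits[s] += 1
        # epsilon-greedy action choice with decay (Eq. (6)), ties broken by p
        if Q[s][0] == Q[s][1]:
            a = 0 if rng.random() < p else 1
        elif rng.random() < eps * math.exp(-d * visits[s]):
            a = rng.randrange(2)
        else:
            a = 0 if Q[s][0] > Q[s][1] else 1
        # environment transition (Table 2) and reward (Eq. (3))
        if a == 1 and (LM1 is None or lm0 > LM1):
            r, s_next, LM1, owner, moves1 = (rho - k1) / c, t - lm0 + 1, t, 1, moves1 + 1
        elif a == 1:
            r, s_next, LM1, moves1 = -k1, s + 1, t, moves1 + 1
        else:
            r, s_next = 0.0, (s + 1 if s != -1 else -1)
        # tabular Q update (Eq. (5) with gamma = 0)
        alpha = counts[s][a]
        Q[s][a] += (r - Q[s][a]) / (alpha + 1)
        counts[s][a] += 1
        gain1 += (owner == 1)
        s = s_next
    return (gain1 - k1 * moves1) / T         # Player 1's average benefit

print(qflip_vs_periodic())                   # should approach the optimal 0.48 for these parameters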

5 Theoretical Analysis for Periodic Opponent

We consider first an opponent playing periodically with random phase. Previous work has focused primarily on analyzing Pδ strategies in a non-adaptive context [3,6,14,15]. We first show theoretically that QFlip eventually learns to play optimally against a Periodic opponent when the future discount γ is set at 0. We employ this restriction because, when γ > 0, the actual value of at in st (Vst ,at ) depends on the maximum estimated value in st+1 , which changes concurrently


with Q(st, at). Additionally, Sect. 6.2 shows experimentally that changing γ does not have much effect on benefit. In the discrete version of FlipIt a player using the Periodic strategy Pδ plays first at some uniformly random Rδ ∈ {0, . . . , δ}, then plays every δ ticks for all subsequent moves. In this case, the optimal LM strategy is to play immediately after the opponent. Our main result is the following theorem, showing that QFlip converges to the optimal strategy with high probability.

Theorem 1. Playing against a Pδ opponent with γ = 0 and k1 < δ, QFlip using the oppLM observation scheme converges to the optimal LM strategy as t → ∞.

We will prove this theorem by first showing that QFlip visits state δ + 1 infinitely often, then claiming that QFlip will eventually play once in state δ + 1. We conclude by proving QFlip eventually learns to play in state δ + 1 and no other state. This is exactly the optimal strategy of playing right after the opponent. Because the Pδ strategy is deterministic after the random phase, we can model Player 1's known state and transitions according to the actual state of the game as in Fig. 4. We prove several lemmas and finally the main theorem below.

[Fig. 4 state chain: −1 → 2 → 3 → 4 → · · · → δ + 1 → · · · → m → · · ·]

Fig. 4. QFlip using oppLM states against Pδ . Arrows indicate state transitions. Red is at = 1, green is at = 0, and yellow is either at = 0 or at = 1. (Color figure online)

Lemma 1. If t > Rδ and at = 1, QFlip will visit state δ + 1 in at most δ additional time steps. Proof. We want to show that, for any st , with t > Rδ , choosing at = 1 means that QFlip will visit state δ + 1 again in at most δ additional time steps. Case 1 : Assume 1 < st < δ + 1. We see from Fig. 4 that st+1 = st + 1, for all at when −1 < st < δ + 1. Therefore, QFlip will reach state δ + 1 in δ + 1 − st < δ additional time steps. Case 2 : Assume st ≥ δ + 1. The opponent is Pδ , so LM1 < LM0 at time t. Given at = 1, Table 2 gives st+1 = t − LM0 + 1 < δ + 1, returning to Case 1. Case 3 : Assume st = −1 and QFlip chooses at = 1. From Tables 1 and 2, we   have that st+1 = t − LM0 < δ + 1, returning again to Case 1. Lemma 2. Playing against a Pδ opponent with γ = 0, 0 ≤ p < 1 and k1 < δ, QFlip visits state δ + 1 infinitely often.


Proof. We prove by induction on the number of visits, n, to state δ + 1 that QFlip visits state δ + 1 infinitely often.

Base of Induction: We show that, starting from s0 = −1, QFlip will reach st = δ + 1 with probability converging to 1. If QFlip chooses at = 1 when Rδ < t ≤ δ, this is a flipping move, putting QFlip in state st+1 < δ + 1. If not, t > δ ≥ Rδ. When st = −1, QFlip chooses at = 0 with probability p and at = 1 with probability 1 − p. Therefore

P[QFlip not flipping after v visits to s = −1] = p^v.   (7)

This implies that QFlip will flip with probability 1 − p^v → 1 as t = v → ∞. By Lemma 1, QFlip will reach state δ + 1 in finitely many steps with probability 1.

Inductive Step: Assume QFlip visits state δ + 1 n times. Because we are considering an infinite game, at any time t there are infinitely many states that have not been visited. Therefore ∃m ≥ δ + 1 such that ∀s > m, Q(s) = (0, 0). If QFlip flips at state st ∈ {δ + 1, . . . , m}, it will reach δ + 1 again in a finite number of additional steps by Lemma 1. If QFlip does not flip at state st ∈ {δ + 1, . . . , m}, we have

P[QFlip does not move after z steps] = p^z,   (8)

since the probability of not moving when Q(s) = (0, 0) is p, and not moving implies st+1 = st + 1 > m. Therefore, as t = z → ∞, QFlip will flip again with probability 1 − p^z → 1. By mathematical induction, state δ + 1 is visited infinitely often with probability converging to 1.

Lemma 3. If γ = 0 and k1 < δ, QFlip will eventually choose to move in state δ + 1 with probability 1.

Proof. If QFlip flips in any visit to state δ + 1, the conclusion follows. Assume QFlip does not flip in state s = δ + 1. Since γ = 0, from Eq. (4), Vst,at = 0 for at = 0. Therefore Q(s) = (0, 0) and we have

P[QFlip does not flip after v visits to state δ + 1] = p^v.   (9)

By Lemma 2, we know that QFlip visits state δ + 1 infinitely often. Therefore, the probability that QFlip moves in state δ + 1 is 1 − p^v → 1 as v → ∞.

Proof of Theorem 1. We will now prove the original theorem, using these lemmas. To prove that QFlip plays optimally against Pδ, we must show that it will eventually (1) play at s = δ + 1 at each visit and (2) not play at any state s ≠ δ + 1. Assuming γ = 0, we have from Sect. 2.5 of [22] that

Q_{α+1}(s, a) = Q_α(s, a) + (1/(α + 1)) · (r_{α+1} − Q_α(s, a)) = (1/(α + 1)) Σ_{i=1}^{α+1} r_i.   (10)


Here we denote by ri the reward obtained the i-th time state s was visited and action a was taken. Additionally, Pδ plays every δ time steps after the random phase. Therefore we derive from Eq. (3):

ri = 0 if ai = 0;  ri = −k1 if ai = 1 and 1 ≤ si ≤ δ;  ri = (δ − k1)/c if ai = 1 and si ≥ δ + 1.   (11)

By Eqs. (10) and (11), we have for all states s,

Q_α(s, 0) = (1/α) Σ_{i=1}^{α} 0 = 0.   (12)

First we show that QFlip will eventually choose at = 0 in all states 1 < st < δ + 1. Consider some st such that 1 < s < δ + 1. If QFlip never chooses at = 1 in this state, we are done. Assume QFlip plays at least once in this state, α > 0; then

Q_α(s, 1) = (1/α) Σ_{i=1}^{α} (−k1) = −k1 < 0 = Q_α(s, 0),   (13)

since k1 > 0. Therefore argmax_{a∈A} Q(s, a) = 0. Because the exploration probability is ε · e^(−d·v) with 0 ≤ ε, d < 1, it tends to 0 as v → ∞. Therefore P[QFlip does not play at s] → 1 for 1 ≤ s ≤ δ, as desired.

Next we show that QFlip will eventually play at state δ + 1 at each visit. From Lemma 3, we know that QFlip will play once at s = δ + 1 with probability 1, meaning α > 0 with probability 1. By Eqs. (10) and (11) we have, for α > 0 and s = δ + 1,

Q_α(s, 1) = (1/α) Σ_{i=1}^{α} (δ − k1)/c = (δ − k1)/c > 0 = Q_α(s, 0).   (14)

Now argmax_{a∈A} Q(s, a) = 1, so as the exploration probability tends to 0, P[QFlip plays at s = δ + 1] → 1. If QFlip plays at state δ + 1, it will not reach states s > δ + 1, and thus cannot play in those states. Therefore, as t → ∞, P[QFlip plays optimally] → 1.

6 QFlip Against Pδ and Eλ Opponents

In this section we show experimentally that QFlip learns an optimal strategy against Pδ and Eλ opponents. Prior to starting play, we allow QFlip to choose its observation scheme based on the opponent's strategy. QFlip chooses the oppLM observation against Pδ and the ownLM observation against Eλ, but sets its other parameters identically. This reflects the theoretical analysis from [3], which states that an optimal adaptive strategy against a Pδ opponent depends on τ, while τ is irrelevant in optimal strategies against an Eλ opponent. In the next section, we generalize QFlip to play with no knowledge of the opponent strategy.

6.1 Implementation

All simulations (https://github.com/lisaoakley/flipit-simulation) are written in Python 3.5 with a custom OpenAI Gym environment for FlipIt (https://github.com/lisaoakley/gym-flipit). We ran each experiment over a range of costs and Player 0 parameters within the constraint that k1 < ρ, as other values have an optimal drop-out strategy. For consistency, we calculated ρ from the distribution parameters before running simulations. We report averages across multiple runs, choosing the number of runs and run duration to ensure convergence. Integrals are calculated using the scipy.integrate.quad function. For Greedy's maximization step we used scipy.optimize.minimize with the default "BFGS" algorithm. QFlip can run against a variety of opponents with minimal configuration, so we set all QFlip hyper-parameters identically across experiments, namely γ = 0.8, ε = 0.5, d = 0.05, c = 5, and p = 0.7, unless otherwise noted.
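Since ρ is computed from the opponent's distribution parameters, a small helper along the following lines suffices; the function name and the dictionary of parameter keywords are assumptions of this sketch, not the repository's API.

def average_move_time(dist, **params):
    """Opponent's mean inter-arrival time rho for the renewal strategies of Sect. 3."""
    if dist == "periodic":
        return params["delta"]
    if dist == "exponential":
        return 1.0 / params["lam"]
    if dist == "uniform":
        return params["delta"]      # mean of U[delta - u/2, delta + u/2] is delta
    if dist == "normal":
        return params["mu"]
    raise ValueError(f"unknown distribution: {dist}")

# Player 1 should only enter the game when its move cost is below rho (otherwise drop out).
k1 = 25
assert k1 < average_move_time("exponential", lam=1 / 100)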

(a) 8,000 ticks   (b) 16,000 ticks   (c) 64,000 ticks

Fig. 5. Player 1's average benefit for different ε and p parameters at three time ticks. Here QFlip plays with oppLM, γ = 0, and k1 = 25 against a Pδ opponent with δ = 50, averaged over 10 runs. Darker purples mean higher average benefit. QFlip converges to optimal (0.48), improving more quickly with low exploration.

6.2 QFlip vs. Periodic

In Sect. 5 we proved that QFlip will eventually play optimally against Pδ when future discount γ is 0 and k1 < δ. In this section, we show experimentally that most configurations of QFlip using oppLM quickly lead to an optimal strategy. Additionally, we show there is little difference in benefit when γ > 0. Learning Speed. When γ = 0, Eq. (4) reduces to Vst ,at = rt+1 . In this case we verify that QFlip learns an optimal strategy for all exploration parameters, but that lower exploration rates cause QFlip to reach optimal benefit more quickly. Against a Periodic opponent, Qα (s, a) is constant after QFlip takes action a in state s at least once (α > 0). Thus, exploring leads to erroneous moves. When the probability of moving for the first time in state s is low (p is high), QFlip makes fewer costly incorrect moves in states s < δ + 1 leading to higher benefit.


When p = 1, QFlip never plays and β1 = 0. Figure 5 displays Player 1's average benefit for different values of p and ε. QFlip achieves close to optimal benefit after 64,000 ticks with high exploration rates and high probability of playing in new states (1 − p), and in as little as 8,000 ticks with no exploration (ε = 0) and low 1 − p.

Varied Configurations. When γ > 0, Vst,at factors in the estimated value of state st+1, allowing QFlip to attain positive reward for choosing not to move in state δ + 1. The resulting values in the Q table can negatively impact learning. Figure 6 shows the average benefit over time (left) and the number of non-optimal runs (right) for different ε and γ values. We observe that QFlip performs non-optimally on 38% of runs with γ = 0.8 and ε = 0, but has low benefit variation between runs. However, increasing ε even to 0.1 compensates for this and allows QFlip to play comparably on average with future estimated value (γ > 0) as with no future estimated value (γ = 0). This result allows us flexibility in configuring QFlip, which we will leverage to maintain hyper-parameter consistency against all opponents. In the rest of the paper we set γ = 0.8, ε = 0.5, and p = 0.7.

(a) Learning over time averaged over 50 runs per configuration. Optimal benefit vs. Pδ with δ = 50 and k1 = 25 in a discrete game is 0.48.

(b) Statistics over 50 runs ("non-optimal" runs have average benefit > 0.02 less than optimal (0.48) after 500,000 ticks):

γ    | ε    | # non-optimal runs | min benefit | max benefit
0    | 0    | 0                  | 0.477       | 0.478
0    | 0.1  | 0                  | 0.474       | 0.476
0    | 0.25 | 0                  | 0.471       | 0.473
0    | 0.5  | 0                  | 0.465       | 0.468
0.8  | 0    | 19                 | 0.417       | 0.478
0.8  | 0.1  | 3                  | 0.455       | 0.476
0.8  | 0.25 | 1                  | 0.452       | 0.473
0.8  | 0.5  | 0                  | 0.465       | 0.468

Fig. 6. QFlip using oppLM with k1 = 25 playing against Pδ with δ = 50.

Comparison to Greedy. Assuming the Greedy strategy against Pδ plays first at time δ, it will play optimally with probability 1 − k1 /δ. However, with probability k1 /δ, Greedy will drop out after its first adaptive move [3]. We compare QFlip and Greedy against Pδ for δ = 50 across many costs in Fig. 7. QFlip consistently achieves better average benefit across runs, playing close to optimally on average. Additionally, Player 0 with k0 = 1 attains more benefit on average against a Greedy opponent as a result of these erroneous drop-outs. For k1 < 45, QFlip attains benefit between 5% and 50% better than Greedy on average.

(a) Player 1 average benefit by cost   (b) Player 0 average benefit by cost

Fig. 7. Player 1 and Player 0’s average benefit for QFlip and Greedy across Player 1 costs. QFlip with oppLM playing against Pδ with fixed k0 = 1 and δ = 50 for 250,000 ticks, averaged over 100 runs.

6.3 QFlip vs. Exponential

The optimal LM strategy against an Eλ opponent is proven in [3] to be Pδ with δ dependent on k1 and λ. The exponential distribution is memoryless, so the optimal δ is independent of the time since the opponent's last move. Optimal QFlip ignores τ and moves δ steps after its own last move. QFlip therefore prefers the ownLM observation space from Table 1, rather than the oppLM scheme used against Periodic.
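The dependence of the best period on k1 can be checked with a quick Monte Carlo sweep: simulate a player that flips every δ ticks against an Exponential opponent and pick the δ with the highest average benefit. The horizon, the discretization of the exponential inter-arrival times, the grid of candidate periods, and the tie-breaking rule are assumptions of this sketch, so the result is only an approximate check of the optimal periods quoted below.

import numpy as np

def periodic_vs_exponential_benefit(delta, lam, k1, T=100_000, seed=0):
    """Average benefit of a player flipping every delta ticks vs. an E_lambda opponent."""
    rng = np.random.default_rng(seed)
    # Opponent move times: renewal process with exponential inter-arrival times, rounded to ticks.
    opp, t = [], 0.0
    while t < T:
        t += rng.exponential(1.0 / lam)
        opp.append(int(t))
    opp = set(m for m in opp if m < T)
    owner, gain1, moves1 = 0, 0, 0
    for t in range(T):
        if t in opp:
            owner = 0
        if t > 0 and t % delta == 0:          # Player 1 flips periodically
            owner, moves1 = 1, moves1 + 1
        gain1 += (owner == 1)
    return (gain1 - k1 * moves1) / T

lam = 1 / 100
for k1 in (10, 90):
    deltas = range(20, 600, 20)
    best = max(deltas, key=lambda d: periodic_vs_exponential_benefit(d, lam, k1))
    print(f"k1={k1}: best period found near delta={best}")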

(a) QFlip average benefit versus cost. QFlip becomes near-optimal for all costs when duration is high.

(b) Ratio of QFlip and Optimal average benefit versus time. QFlip achieves optimal benefit quickly when costs are low.

Fig. 8. QFlip with ownLM playing against Eλ with λ = 1/100.

For QFlip to learn any Pδ strategy, it must visit states s < δ many times. When playing against an Exponential opponent, the optimal δ grows quickly as k1 increases. For instance, against an Eλ opponent with λ = 1/100 the optimal Pδ strategy is δ = 53 for k1 = 10 and δ = 389 for k1 = 90 [3]. As a result, the optimal Periodic strategy takes longer to learn as k1 grows. Figure 8(a) shows the average benefit versus cost after running the algorithm for up to 4.096 million ticks for rate of the Exponential distribution λ = 1/100. For small costs,


QFlip learns to play optimally within a very short time (16,000 ticks). As the move cost increases, QFlip naturally takes longer to converge. We verified this for other values of λ as well. Figure 8(b) shows how the benefit varies by time for various move costs. Given enough time, QFlip converges to a near-optimal Periodic strategy for all costs (even as high as k1 = 100, which results in drop-out for λ = 1/100).

7 Generalized QFlip Strategy

Previous sections show that QFlip converges to optimal using the oppLM and ownLM observation schemes for the Pδ and Eλ opponents, respectively. In this section we show that QFlip using a composite observation scheme can play optimally against Pδ and Eλ, and perform well against other Renewal strategies without any knowledge of the opponent's strategy. The composite strategy uses as states both Player 1's own last move time (LM1) and the time since the opponent's last known move (τt), as described in Table 1. Composite QFlip is the first general adaptive FlipIt strategy that has no prior information on the opponent.

7.1 Composite QFlip Against Pδ and Eλ Opponents

Figure 9 shows that QFlip’s average benefit eventually converges to optimal against both Eλ and Pδ when using a composite observation scheme. We note that it takes significantly longer to converge to optimal when using the composite scheme. This is natural, as QFlip has an enlarged state space (quadratic compared to oppLM and ownLM observation schemes) and now visits each state less frequently. We leave approximation methods to expedite learning to future work.

(a) QFlip against Pδ with δ = 50.

(b) QFlip against Eλ with λ = 1/100.

Fig. 9. QFlip with k1 = 10 averaged over 10 runs. QFlip converges to optimal against Pδ and Eλ using oppLM and ownLM observation schemes respectively, and plays close to optimally against both with composite observation scheme.


(a) QFlip vs. Uδ,u with δ = 100, u = 50   (b) QFlip vs. Nμ,σ with μ = 100, σ = 10

Fig. 10. QFlip’s average benefit by time for Uniform (left) and Normal (right) distributions with k1 = 10. QFlip with oppLM and composite observations outperforms Greedy against Uδ,u averaged over 10 runs. Against both opponents, composite converges to oppLM as time increases.

7.2 Composite QFlip Against Other Renewal Opponents

The composite strategy results in flexibility against multiple opponents. We evaluate QFlip using composite observations against Uniform and Normal Renewal opponents in Fig. 10. QFlip attains 15% better average benefit than Greedy against Uδ,u . Figure 10 also shows that QFlip attains a high average benefit of 0.76 against a Nμ,σ opponent. We do not compare Nμ,σ to Greedy as the numerical packages we used were unable to find the maximum local benefit from Eq. (1). QFlip using composite attains average benefit within 0.01 of QFlip using oppLM (best performing observation scheme) against both opponents after 10 million ticks.

8 Conclusions

We considered the problem of playing adaptively in the FlipIt security game by designing QFlip, a novel strategy based on temporal difference Q-Learning instantiated with three different observation schemes. We showed theoretically that QFlip plays optimally against a Periodic Renewal opponent using the oppLM observation. We also confirmed experimentally that QFlip converges against Periodic and Exponential opponents, using the ownLM observation scheme in the Exponential case. Finally, we showed that general QFlip with a composite observation scheme performs well against Periodic, Exponential, Uniform, and Normal Renewal opponents. Generalized QFlip is the first adaptive strategy which can play against any opponent with no prior knowledge. We performed detailed experimental evaluation of our three observation schemes for a range of distribution parameters and move costs. Interestingly, we showed that certain hyper-parameter configurations for the amount of exploration (ε and d), future reward discount (γ), and probability of moving in new


states (1 − p) are applicable against a range of Renewal strategies. Thus, QFlip has the advantage of requiring minimal configuration. Additionally, we released an OpenAI Gym environment for FlipIt to aid future researchers. In future work, we plan to consider extensions of the FlipIt game, such as multiple resources and different types of moves. We are interested in analyzing other non-adaptive strategies besides the class of Renewal strategies. Finally, approximation methods from reinforcement learning have the potential to make our composite strategy faster to converge and we plan to explore them in depth. Acknowledgements. We would like to thank Ronald Rivest, Marten van Dijk, Ari Juels, and Sang Chin for discussions about reinforcement learning in FlipIt. We thank Matthew Jagielski, Tina Eliassi-Rad, and Lucianna Kiffer for discussing the theoretical analysis. This project was funded by NSF under grant CNS-1717634. This research was also sponsored by the U.S. Army Combat Capabilities Development Command Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-13-2-0045 (ARL Cyber Security CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Combat Capabilities Development Command Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.

References

1. Bowers, K.D., et al.: Defending against the unknown enemy: applying FlipIt to system security. In: Proceedings of the Conference on Decision and Game Theory for Security (GameSec) (2012)
2. Chung, K., Kamhoua, C.A., Kwiat, K.A., Kalbarczyk, Z.T., Iyer, R.K.: Game theory with learning for cyber security monitoring. In: 2016 IEEE 17th International Symposium on High Assurance Systems Engineering (HASE), pp. 1–8, January 2016. https://doi.org/10.1109/HASE.2016.48
3. van Dijk, M., Juels, A., Oprea, A., Rivest, R.L.: FlipIt: the game of stealthy takeover. J. Cryptol. 26, 655–713 (2013)
4. Elderman, R., Pater, L.J.J., Thie, A.S., Drugan, M.M., Wiering, M.: Adversarial reinforcement learning in a cyber security simulation. In: ICAART (2017)
5. Farhang, S., Grossklags, J.: FlipLeakage: a game-theoretic approach to protect against stealthy attackers in the presence of information leakage. In: Zhu, Q., Alpcan, T., Panaousis, E., Tambe, M., Casey, W. (eds.) GameSec 2016. LNCS, vol. 9996, pp. 195–214. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-47413-7_12
6. Feng, X., Zheng, Z., Hu, P., Cansever, D., Mohapatra, P.: Stealthy attacks meets insider threats: a three-player game model. In: IEEE Military Communications Conference (MILCOM 2015), pp. 25–30, October 2015. https://doi.org/10.1109/MILCOM.2015.7357413
7. Feng, X., Zheng, Z., Mohapatra, P., Cansever, D.: A Stackelberg game and Markov modeling of moving target defense. In: Rass, S., An, B., Kiekintveld, C., Fang, F., Schauer, S. (eds.) Decision and Game Theory for Security, pp. 315–335. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68711-7_17


8. Grossklags, J., Reitter, D.: How task familiarity and cognitive predispositions impact behavior in a security game of timing. In: 2014 IEEE 27th Computer Security Foundations Symposium, pp. 111–122, July 2014. https://doi.org/10.1109/CSF.2014.16
9. Han, Y.: Reinforcement learning for autonomous defence in software-defined networking. In: Bushnell, L., Poovendran, R., Başar, T. (eds.) GameSec 2018. LNCS, vol. 11199, pp. 145–165. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01554-1_9
10. Hu, P., Li, H., Fu, H., Cansever, D., Mohapatra, P.: Dynamic defense strategy against advanced persistent threat with insiders. In: 2015 IEEE Conference on Computer Communications (INFOCOM), pp. 747–755, April 2015. https://doi.org/10.1109/INFOCOM.2015.7218444
11. Hu, Q., Lv, S., Shi, Z., Sun, L., Xiao, L.: Defense against advanced persistent threats with expert system for internet of things. In: Ma, L., Khreishah, A., Zhang, Y., Yan, M. (eds.) WASA 2017. LNCS, vol. 10251, pp. 326–337. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-60033-8_29
12. Klíma, R., Tuyls, K., Oliehoek, F.A.: Markov security games: learning in spatial security problems (2016)
13. Laszka, A., Horvath, G., Felegyhazi, M., Buttyán, L.: FlipThem: modeling targeted attacks with FlipIt for multiple resources. In: Poovendran, R., Saad, W. (eds.) GameSec 2014. LNCS, vol. 8840, pp. 175–194. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-12601-2_10
14. Laszka, A., Johnson, B., Grossklags, J.: Mitigating covert compromises: a game-theoretic model of targeted and non-targeted covert attacks. In: 9th International Conference on Web and Internet Economics (WINE) (2013)
15. Laszka, A., Johnson, B., Grossklags, J.: Mitigation of targeted and non-targeted covert attacks as a timing game. In: Das, S.K., Nita-Rotaru, C., Kantarcioglu, M. (eds.) GameSec 2013. LNCS, vol. 8252, pp. 175–191. Springer, Cham (2013). https://doi.org/10.1007/978-3-319-02786-9_11
16. Maleki, H., Valizadeh, S., Koch, W., Bestavros, A., van Dijk, M.: Markov modeling of moving target defense games. In: Proceedings of the 2016 ACM Workshop on Moving Target Defense, MTD 2016, pp. 81–92. ACM, New York (2016). https://doi.org/10.1145/2995272.2995273
17. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M.A.: Playing Atari with deep reinforcement learning. CoRR abs/1312.5602 (2013)
18. Nochenson, A., Grossklags, J.: A behavioral investigation of the FlipIt game. In: 12th Workshop on the Economics of Information Security (WEIS) (2013)
19. Pham, V., Cid, C.: Are we compromised? Modelling security assessment games. In: Grossklags, J., Walrand, J. (eds.) GameSec 2012. LNCS, vol. 7638, pp. 234–247. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34266-0_14
20. Reitter, D., Grossklags, J., Nochenson, A.: Risk-seeking in a continuous game of timing. In: 13th International Conference on Cognitive Modeling (ICMM) (2013)
21. Silver, D., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529, 484–503 (2016)
22. Sutton, R.S., Barto, A.G.: Introduction to Reinforcement Learning, 1st edn. MIT Press, Cambridge (1998)
23. Tesauro, G.: Temporal difference learning and TD-Gammon. Commun. ACM 38(3), 58–68 (1995). https://doi.org/10.1145/203330.203343


24. Xiao, L., Li, Y., Han, G., Dai, H., Poor, H.V.: A secure mobile crowdsensing game with deep reinforcement learning. IEEE Trans. Inf. Forensics Secur. 13(1), 35–47 (2018). https://doi.org/10.1109/TIFS.2017.2737968
25. Zhang, M., Zheng, Z., Shroff, N.B.: Stealthy attacks and observable defenses: a game theoretic model under strict resource constraints. In: 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 813–817, December 2014. https://doi.org/10.1109/GlobalSIP.2014.7032232
26. Zhu, M., Hu, Z., Liu, P.: Reinforcement learning algorithms for adaptive cyber defense against Heartbleed. In: Proceedings of the First ACM Workshop on Moving Target Defense, MTD 2014, pp. 51–58. ACM, New York (2014). https://doi.org/10.1145/2663474.2663481

Linear Temporal Logic Satisfaction in Adversarial Environments Using Secure Control Barrier Certificates

Bhaskar Ramasubramanian1, Luyao Niu2, Andrew Clark2, Linda Bushnell1, and Radha Poovendran1

1 Department of Electrical and Computer Engineering, University of Washington, Seattle, WA 98195, USA
{bhaskarr,lb2,rp3}@uw.edu
2 Department of Electrical and Computer Engineering, Worcester Polytechnic Institute, Worcester, MA 01609, USA
{lniu,aclark}@wpi.edu

Abstract. This paper studies the satisfaction of a class of temporal properties for cyber-physical systems (CPSs) over a finite-time horizon in the presence of an adversary, in an environment described by discrete-time dynamics. The temporal logic specification is given in safe-LTLF, a fragment of linear temporal logic over traces of finite length. The interaction of the CPS with the adversary is modeled as a two-player zero-sum discrete-time dynamic stochastic game with the CPS as defender. We formulate a dynamic programming based approach to determine a stationary defender policy that maximizes the probability of satisfaction of a safe-LTLF formula over a finite time-horizon under any stationary adversary policy. We introduce secure control barrier certificates (S-CBCs), a generalization of barrier certificates and control barrier certificates that accounts for the presence of an adversary, and use S-CBCs to provide a lower bound on the above satisfaction probability. When the dynamics of the evolution of the system state has a specific underlying structure, we present a way to determine an S-CBC as a polynomial in the state variables using sum-of-squares optimization. An illustrative example demonstrates our approach.

Keywords: Linear temporal logic · safe-LTLF · Dynamic programming · Secure control barrier certificate · Sum-of-squares optimization

1

Introduction

Cyber-physical systems (CPSs) use computing devices and algorithms to inform the working of a physical system [8]. These systems are ubiquitous, and vary in This work was supported by the U.S. Army Research Office, the National Science Foundation, and the Office of Naval Research via Grants W911NF-16-1-0485, CNS1656981, and N00014-17-S-B001 respectively. c Springer Nature Switzerland AG 2019  T. Alpcan et al. (Eds.): GameSec 2019, LNCS 11836, pp. 385–403, 2019. https://doi.org/10.1007/978-3-030-32430-8_23

386

B. Ramasubramanian et al.

size and scale from energy systems to medical devices. The wide-spread influence of CPSs such as power systems and automobiles makes their safe operation critical. Although distributed algorithms and systems allow for more efficient sharing of information among parts of the system and across geographies, they also make the CPS vulnerable to attacks by an adversary who might gain access to the distributed system via multiple entry points. Attacks on distributed CPSs have been reported across multiple application domains [20,43,44,46]. In these cases, the damage to the CPS was caused by the actions of a stealthy, intelligent adversary. Thus, methods designed to only account for modeling and sensing errors may not meet performance requirements in adversarial scenarios. Therefore, it is important to develop ways to specify and verify properties that a CPS must satisfy that will allow us to provide guarantees on the operation of the system while accounting for the presence of an adversary. In order to verify the behavior of a CPS against a rich set of temporal specifications, techniques from formal methods can be used [9]. Properties like safety, stability, and priority can be expressed as formulas in linear temporal logic (LTL) [19]. These properties can then be verified using off-the-shelf model solvers [15,28] that take these formulas as inputs. If the state space and the actions available to the agents are both finite and discrete, then the environment can be represented as a Markov decision process (MDP) [38] or a stochastic game [11]. These representations have also been used as abstractions of continuousstate continuous action dynamical system models [10,32]. However, a significant shortcoming is that the computational complexity of abstracting the underlying system grows exponentially with the resolution of discretization desired [14,21]. The method of barrier certificates (or barrier functions), which are functions of the states of the system was introduced in [36]. Barrier functions provide a certificate that all trajectories of a system starting from a given initial set will not enter an unsafe region. The use of barrier functions does not require explicit computation of sets of reachable states, which is known to be undecidable for general dynamical systems [29], and moreover, it allows for the analysis of general nonlinear and stochastic dynamical systems. The authors of [36] further showed that if the states and inputs to the system have a particular structure, computationally efficient methods can be used to construct a barrier certificate. Barrier certificates were used to determine probabilistic bounds on the satisfaction of an LTL formula by a discrete-time stochastic system in [22]. A more recent work by the same authors [23] used control barrier certificates to synthesize a policy in order to maximize the probability of satisfaction of an LTL formula. Prior work that uses barrier certificates to study temporal logic satisfaction assumes a single agent, and does not study the case when the CPS is operating in an adversarial environment. To the best of our knowledge, this paper is the first to use barrier certificates to study temporal logic satisfaction for CPSs in adversarial environments. We introduce secure barrier certificates (S-CBCs), and use it to determine probabilistic bounds on the satisfaction of an LTL formula

Linear Temporal Logic Satisfaction in Adversarial Environments

387

under any adversary policy. Further, definitions of barrier certificates and control barrier certificates in prior work can be recovered as special cases of S-CBCs. 1.1

Contributions

In this paper, we consider the setting when there is an adversary whose aim is to ensure that the LTL formula is not satisfied by the CPS (defender). The temporal logic specification is given in saf e − LT LF , a fragment of LTL over traces of finite length. We make the following contributions: – We model the interaction between the CPS and adversary as a two-player dynamic stochastic game with the CPS as defender. The two players take their actions simultaneously, and these jointly influence the system dynamics. – We present a dynamic programming based approach to determine a stationary defender policy to maximize the probability of satisfaction of an LTL formula over a finite time-horizon under any stationary adversary policy. – In order to determine a lower bound on the above satisfaction probability, we define a new entity called secure control barrier certificates (S-CBCs). SCBCs generalize barrier certificates and control barrier certificates to account for the presence of an adversary. – When the evolution of the state of the dynamic game can be expressed as polynomial functions of the states and inputs, we use sum-of-squares optimization to compute an S-CBC as a polynomial function of the states. – We present an illustrative example demonstrating our approach. 1.2

Outline of Paper

We summarize related work on control barrier certificates and temporal logic satisfaction in Sect. 2. Section 3 gives an overview of temporal logic and gametheoretic concepts that will be used to derive our results. The problem that is the focus of this paper is formulated in Sect. 4. Our solution approach is presented in Sect. 5, where we define a dynamic programming operator to synthesize a policy for the defender in order to maximize the probability of satisfaction of the LTL formula under any adversary policy. We define a notion of secure control barrier certificates to derive a lower bound on the satisfaction probability, and are able to explicitly compute an S-CBC under certain assumptions. Section 6 presents an illustrative example, and we conclude the paper in Sect. 7.

2

Related Work

The method of barrier functions was introduced in [36] to certify that all trajectories of a continuous-time system starting from a given initial set do not enter an unsafe region. Control barrier functions (CBFs) were used to provide guarantees on the safety of continuous-time nonlinear systems with affine inputs for an adaptive cruise control application in [6]. The notion of input-to-state CBFs that

388

B. Ramasubramanian et al.

ensured the safety of nonlinear systems under arbitrary input disturbances was introduced in [24], and safety was characterized in terms of the invariance of a set whose computation depended on the magnitude of the disturbance. The authors of [45] relaxed the supermartingale condition that a barrier certificate had to satisfy in [36] in order to provide finite-time guarantees on the safety of a system. The verification and control of a finite-time safety property for continuous-time stochastic systems using barrier functions was recently presented in [41]. Barrier certificates were used to verify LTL formulas for a deterministic, continuous-time nonlinear dynamical system in [49]. Time-varying CBFs were used to accomplish tasks specified in signal temporal logic in [30]. A survey of the use of CBFs to design safety-critical controllers is presented in [5]. The use of barrier certificates or CBFs in these works were all for continuous time dynamical systems and did not consider the effect of the actions of an adversarial player. Barrier certificates in the discrete-time setting were used to analyze the reachable belief space of a partially observable Markov decision process (POMDP) with applications to verifying the safety of POMDPs in [2], and for privacy verification in POMDPs in [3]. The use of barrier certificates for the verification and synthesis of control policies for discrete-time stochastic systems to satisfy an LTL formula over a finite time horizon was presented in [22] and [23]. These papers also assumed a single agent, and did not account for the presence of an adversary. The authors of [33] used barrier functions to solve a reference tracking problem for a continuous-time linear system subject to possible false data injection attacks by an adversary, with additional constraints on the safety and reachability of the system. Probabilistic reachability over a finite time horizon for discrete-time stochastic hybrid systems was presented in [1]. This was extended to a dynamic stochastic game setting when there were two competing agents in [18], and to the problem of ensuring the safety of a system that was robust to errors in the probability distribution of a disturbance input in [50]. These papers did not assume that a temporal specification had to be additionally satisfied. Determining a policy for an agent in order to maximize the probability of satisfying an LTL formula in an environment specified by an MDP was presented in [19]. This setup was extended to the case when there were two agentsa defender and an adversary- who had competing objectives to ensure the satisfaction of the LTL formula in an environment specified as a stochastic game in [32]. These papers assume that the states of the system are completely observable, which might not be true in every situation. The satisfaction of an LTL formula in partially observable environments represented as POMDPs was studied in [42] and the extension to partially observable stochastic games with two competing agents, each with its own observation of the state of the system, was formulated in [39].

3

Preliminaries

In this section, we give a brief introduction to linear temporal logic and discrete-time dynamic stochastic games. Wherever appropriate, we consider a

Linear Temporal Logic Satisfaction in Adversarial Environments

389

probability space (Ω, F , P). We write (X, B(X)) to denote the measurable space X equipped with the Borel σ−algebra, and R≥0 to denote the set of non-negative real numbers. 3.1

Linear Temporal Logic

Temporal logic frameworks enable the representation and reasoning about temporal information on propositional statements. Linear temporal logic (LTL) is one such framework, where the progress of time is ‘linear’. An LTL formula [9] is defined over a set of atomic propositions A P, and can be written as: φ := T|σ|¬φ|φ ∧ φ|Xφ|φUφ, where σ ∈ A P, and X and U are temporal operators denoting the next and until operations. The semantics of LTL are defined over (infinite) words in 2A P . The syntax of linear temporal logic over finite traces, denoted LT LF [17], is the same as that of LTL. The semantics of LT LF is expressed in terms of finite-length words in 2A P . We denote a word in LT LF by η, write |η| to denote the length of η, and ηi , 0 < i < |η|, to denote the proposition at the ith position of η. We write (η, i) |= φ when the LT LF formula φ is true at the ith position of η. Definition 1 (LT LF Semantics). The semantics of LT LF can be recursively defined in the following way: 1. 2. 3. 4. 5. 6.

(η, i) |= T; (η, i) |= σ iff σ ∈ ηi ; (η, i) |= ¬φ iff (η, i) |= φ; (η, i) |= φ1 ∧ φ2 iff (η, i) |= φ1 and (η, i) |= φ2 ; (η, i) |= Xφ iff i < |η| − 1 and (η, i + 1) |= φ; (η, i) |= φ1 Uφ2 iff ∃j ∈ [i, |η|] such that (η, j) |= φ2 and for all k ∈ [i, j), (η, k) |= φ1 .

Finally, we write η |= φ if and only if (η, 0) |= φ. Moreover, the logic admits derived formulas of the form: (i) φ1 ∨ φ2 := ¬(¬φ1 ∧ ¬φ2 ); (ii) φ1 ⇒ φ2 := ¬φ1 ∨ φ2 ; (iii) Fφ := TUφ (eventually); (iv) Gφ := ¬F¬φ (always). The set L (φ) comprises the language of finite-length words associated with the LT LF formula φ. In this paper, we focus on a subset of LT LF called saf e − LT LF [40], that explicitly considers only safety properties [26]. Definition 2 (saf e − LT LF Formula). An LT LF formula is a saf e − LT LF formula if it can be written in positive normal form (PNF)1 , using the temporal operators X (next) and G (always). 1

In PNF, negations occur only adjacent to atomic propositions.

390

B. Ramasubramanian et al.

Next, we define an entity that will serve as an equivalent representation of an LT LF formula, and will allow us to check if the LT LF formula is satisfied or not. Definition 3 (Deterministic Finite Automaton). A deterministic finite automaton (DFA) is a quintuple A = (Q, Σ, δ, q0 , F ) where Q is a nonempty finite set of states, Σ is a finite alphabet, δ : Q × Σ → Q is a transition function, q0 ∈ Q is the initial state, and F ⊆ Q is a set of accepting states. Definition 4 (Accepting Runs). A run of A of length n is a finite sequence σn−1 σ0 σ1 q1 −→ . . . −−−→ qn such that qi ∈ δ(qi−1 , σi−1 ) for all of (n + 1) states q0 −→ i ∈ [1, n] and for some σ0 , . . . , σn−1 ∈ Σ. The run is accepting if qn ∈ F . We write L (A ) to denote the set of all words accepted by A . Every LT LF formula φ over A P can be represented by a DFA Aφ with Σ = 2A P that accepts all and only those runs that satisfy φ, that is, L (φ) = L (Aφ ) [16]. The DFA Aφ can be constructed by using a tool like Rabinizer4 [25]. 3.2

Discrete-Time Dynamic Stochastic Games

We model the interaction between the CPS (defender) and adversary as a twoplayer dynamic stochastic game that evolves according to some known (discretetime) dynamics [7]. The evolution of the state of the game at each time step is affected by the actions of both players. Definition 5 (Discrete-time Dynamic Stochastic Game). A discrete-time dynamic stochastic game (DDSG) is a tuple G = (X, W, Ud , Ua , f, N , A P, L), where X ⊆ Rn and W are Borel-measurable spaces representing the state-space and uncertainty space of the system, Ud ⊆ Rd and Ua ⊆ Ra are compact Borel spaces that denote the action sets of the defender and adversary, f : X ×Ud ×Ua × W → X is a Borel-measurable transition function characterizing the evolution of the system, N = {0, 1, . . . , N − 1} is an index-set denoting the stage of the game, A P is a set of atomic propositions, and L : X → 2A P is a labeling function that maps states to a subset of atomic propositions that are satisfied in that state. The evolution of the state of the system is given by: x(k + 1) = f (x(k), ud (k), ua (k), w(k));

x(0) = x0 ∈ X;

k∈N,

(1)

where {w(k)} is a sequence of independent and identically distributed (i.i.d.) random variables with zero mean and bounded covariance. In this paper, we focus on the Stackelberg setting with the defender as leader and adversary as follower. The leader selects its inputs anticipating the worstcase response by the adversary. We assume that the adversary can choose its action based on the action of the defender [18], and further, restrict our focus to stationary strategies for the two players. Due to the asymmetry in information available to the players, equilibrium strategies for the case when the game is zero-sum can be chosen to be deterministic strategies [13].

Linear Temporal Logic Satisfaction in Adversarial Environments

391

Definition 6 (Defender Strategy). A stationary strategy for the defender is (d) (d) a sequence μ(d) := {μk }k∈N of Borel-measurable maps μk : X → Ud . Definition 7 (Adversary Strategy). A stationary strategy for the adversary (a) (a) is a sequence μ(a) := {μk }k∈N of Borel-measurable maps μk : X × Ud → Ua .

4

Problem Formulation

For a DDSG G , recall that the labeling function L indicates which atomic propositions are true in each state. Assumption 1. We restrict our attention to labeling functions of the form L : X → A P. Then, if A P = (a1 , . . . , ap ), A P and L will partition the state space as X := ∪pi=1 Xi , where Xi := L−1 (ai ). We further assume that Xi = ∅ for all i. Remark 1. Through the remainder of the paper, we interchangeably use xk or x(k) to denote the state at time k. Given a sequence of states xN := (x0 , x1 , . . . , xN −1 ), using Assumption 1, if ηk = L(xk ) for all k ∈ N , then we can write L(xN ) = (η0 , η1 , . . . , ηN −1 ). Definition 8 (LTL Satisfaction by DDSG). For a DDSG G and a saf e − LT LF formula φ, we write Pxμ0(d) ,μ(a) {L(xN ) |= φ} to denote the probability that the evolution of the DDSG starting from x(0) = x0 under player policies μ(d) and μ(a) satisfies φ over the time horizon N = {0, 1, . . . , N − 1}. We are now ready to formally state the problem that this paper seeks to solve. Problem 1. Given a discrete-time dynamic game G = (X, W, Ud , Ua , f, N , A P, L) that evolves according to the dynamics in Eq. (1) and a saf e−LT LF formula φ, determine a policy for the defender, μ(d) , that maximizes the probability of satisfying φ over the time horizon N = {0, 1, . . . , N − 1} under any adversary policy μ(a) for all x0 ∈ L−1 (aj ) for some aj ∈ A P. That is, compute: sup inf Pxμ0(d) ,μ(a) {L(xN ) |= φ} μ(d)

5

μ(a)

(2)

Solution Approach

In this section, we present a dynamic programming approach to determine a solution to Problem 1. Our analysis is motivated by the treatment in [18] and [50]. We then introduce the notion of secure control barrier certificates (S-CBCs), and use these to provide a lower bound on the probability of satisfaction of the saf e − LT LF formula φ for a defender policy under any adversary policy

392

B. Ramasubramanian et al.

in terms of the accepting runs of length less than or equal to the length of the time-horizon of interest of a DFA associated with φ. For systems whose evolution of states can be written as a polynomial function of states and inputs, we present a sum-of-squares optimization approach in order to compute an S-CBC. S-CBCs generalize barrier certificates [22] and control barrier certificates [23] to account for the presence of an adversary. A difference between the treatment in this paper and that of [22,23] is that we define S-CBCs for stochastic dynamic games, while the latter papers focus on stochastic systems with a single agent. 5.1

Dynamic Programming for saf e − LT LF Satisfaction

We introduce a dynamic programming (DP) operator that will allow us to recursively solve a Bellman equation related to Eq. (2) backward in time. First, observe that we can write the satisfaction probability in Definition 8 as:  1(L(xk ) |= φ)|x(0) = x0 }, (3) Pxμ0(d) ,μ(a) {L(xN ) |= φ} = Eμ(d) ,μ(a) { k∈N

where Eμ(d) ,μ(a) is the expectation operator under the probability measure Pμ(d) ,μ(a) induced by agent policies μ(d) and μ(a) . 1(·) is the indicator function, which takes value 1 if its argument is true, and 0 otherwise. Assume that V : X → [0, 1] is a Borel-measurable function. A DP operator T can then be characterized in the following way: V (xN −1 ) = 1(L(xN −1 ) |= φ) (T V )(xk ) := sup inf 1(L(xk ) |= φ) ud

ua

(4)

 V (f (xk , ud , ua , w))dxk+1 ,

(5)

X

where dxk+1 ≡ (dxk+1 |xk , ud , ua ) is a probability measure on the Borel space (X, B(X)). The following results adapts Theorem 1 of [18] to the case of temporal logic formula satisfaction over a finite time-horizon. Theorem 1. Assume that the DDSG G has to satisfy a saf e − LT LF formula φ over horizon N . Let the DP operator T be defined as in Eq. (5). Additionally, if dxk ≡ (dxk+1 |xk , ud , ua ) is continuous, then, sup inf Pxμ0(d) ,μ(a) {L(xN ) |= φ} = (T N V )(x0 ), μ(d)

μ(a)

(6)

where T N := T ◦ T ◦ · · · ◦ T (N times) is the repeated composition of the operator T . Proof. Consider a particular pair of stationary agent policies μ(d) and μ(a) . (d) (a) For these policies, define measurable functions Vkμ ,μ : X → [0, 1], k = 0, 1, . . . , N − 1:

Linear Temporal Logic Satisfaction in Adversarial Environments (d)

(a)

VNμ−1,μ

(d)

Vkμ

(xN −1 ) := 1(L(xN −1 ) |= φ)

,μ(a)

N −1 

(xk ) := Eμ(d) ,μ(a) {

393

(7)

1(L(xi ) |= φ)|x(k) = xk }, k = 0, 1, . . . , N − 2

i=k

(8) (d)

(a)

Therefore, we have Pxμ0(d) ,μ(a) {L(xN ) |= φ} = V0μ ,μ (x0 ). Now, consider strategies of the agents at a stage k. Define the operator Tμ(d) ,μ(a) : k

k

 (Tμ(d) ,μ(a) V )(xk ) := 1(L(xk ) |= φ) k

k

V (f (xk , ud , ua , w))dxk+1

(9)

X

Expanding Eq. (8) using the definition of the expectation operator will allow us (d)

to write Vkμ

,μ(a)

(x) = (Tμ(d)

(a) k+1 ,μk+1

V )(x).

The result follows by an induction argument which uses the fact that Tμ(d) ,μ(a) k k is a monotonic operator. We refer to [18] for details. Further, this procedure also guarantees the existence of a defender policy that will maximize the probability of satisfaction of φ under any adversary policy.   5.2

Secure Control Barrier Certificates

Definition 9. A continuous function B : X → R≥0 is a secure control barrier certificate (S-CBC) for the DDSG G if for any state x ∈ X and some constant c ≥ 0, inf sup Ew [B(f (x, ud , ua , w)|x] ≤ B(x) + c. ud

(10)

ua

Intuitively, for some defender action ud , the increase in the value of an S-CBC is bounded from above along trajectories of G under any adversary action ua . Remark 2. S-CBCs generalize control barrier certificates and barrier certificates seen in prior work. If f (x, ud , ua1 , w) ∼ f (x, ud , ua2 , w) for every ua1 , ua2 ∈ Ua , then we recover the definition of a control barrier certificate [23]. The definition of a barrier certificate [22,36] is got by additionally requiring that f (x, ud1 , ua1 , w) ∼ f (x, ud2 , ua2 , w) for every ud1 , ud2 ∈ Ud and ua1 , ua2 ∈ Ua . Here ∼ denotes stochastic equivalence of the respective stochastic processes [35]. In the latter case, when c = 0, the function B is a super-martingale. For this case, along with some additional assumptions on the system dynamics, asymptotic guarantees on the satisfaction of properties over the infinite time-horizon can be established [36]. Remark 3. Although our definition of S-CBCs in Definition 9 bears resemblance to the notion of a worst-case barrier certificate introduced in [36], there are some distinctions. While the entity in [36] considers a dynamical system with a single

394

B. Ramasubramanian et al.

disturbance input, our setting considers three terms that influence the evolution of the state of the system: we want to find a defender input that will allow the barrier function to satisfy a certain property under any adversary input and disturbance. A second point of difference is that while [36] focuses on asymptotic analysis, we consider properties over a finite time horizon. We limit our attention to stationary strategies for both players. Studying the effects of other strategies is left as future work. The following preliminary result will be used subsequently to determine a bound on the probability of reaching a subset of states under particular agent policies over a finite time-horizon. Lemma 1. Consider a DDSG G and let B : X → R≥0 be an S-CBC as in Definition 9. Then, for some constants λ > 0 and c ≥ 0, initial state x0 ∈ X, and a stationary defender policy, μ(d) : X → Ud , the following holds under any stationary adversary policy μ(a) : X × Ud → Ua : B(x0 ) + cN λ

inf sup Pxμ0(d) ,μ(a) [ sup B(x(k)) ≥ λ] ≤

μ(d)

μ(a)

0≤k 1 for all x ∈ X1 , then the DDSG G starting from x0 ∈ X0 is (δ + cN )−reachable with respect to X1 . Proof. Observe that X1 ⊆ {x ∈ X : B(x) ≥ 1}. Therefore, Pxμ0(d) ,μ(a) [∃k ∈ N : x(k) ∈ X1 ] ≤ Pxμ0(d) ,μ(a) [B(x(k)) ≥ 1]. Since this should be true for arbitrary k, we have: x0 { μ(d) ,μ(a)

sup P[xk ∈ X1 ] ≤ P

k∈N

sup B(x(k)) ≥ 1} ≤ inf k∈N

μ(d)

x0 { μ(d) ,μ(a)

sup P

μ(a)

sup B(x(k)) ≥ 1} k∈N

≤ B(x0 ) + cN ≤ δ + cN

The second line of the above system of inequalities follows by setting λ = 1 in   Lemma 1, and the fact that B(x) ≤ δ for all x ∈ X0 .

Linear Temporal Logic Satisfaction in Adversarial Environments

5.3

395

Automaton-Based Verification

In order to verify that {L(xN ) |= φ} under agent policies μ(d) and μ(a) , we need to establish that (η0 , η1 , . . . , ηN −1 ) ⊆ L (Aφ ). To do this, we first construct a DFA A¬φ , that accepts all and only those words over A P that do not satisfy the saf e − LT LF formula φ. We have the following result: Lemma 2. [9] For L(xN ) = (η0 , η1 , . . . , ηN −1 ) and a DFA Aφ , the following is true: (η0 , η1 , . . . , ηN −1 ) ⊆ L (Aφ ) ⇔ (η0 , η1 , . . . , ηN −1 ) ∩ L (A¬φ ) = ∅ The construction of A¬φ can also be carried out in Rabinizer4 [25]. The accepting runs of A¬φ of length less than or equal to N can be computed using a depth-first search algorithm [47]. For the purposes of this section, it is important to understand that the accepting runs of A¬φ of length less than or equal to N will give a bound on the probability that a particular pair of agent policies (μ(d) , μ(a) ) will not satisfy φ over the time horizon N . Using Definition 4 and following the treatment of [22] and [23] define the following terms (the reader is also referred to these works for an example that offers a detailed treatment of the procedure): RN (A¬φ ) := {q = (q0 , . . . , qn ) ∈ L (A¬φ ) : n ≤ N, qi = qi+1 ∀i < n}

(12)

a

a RN (A¬φ )

:= {q = (q0 , . . . , qn ) ∈ RN (A¬φ ) : a ∈ A P and q0 − → q1 } (13)  a {(qi , qi+1 , qi+2 , T (q, qi+1 )) : 0 ≤ i ≤ n − 2} q ∈ RN (A¬φ ), |q| > 2 P a (q) := ∅ otherwise 

T (q, qi+1 ) :=

(14) N + 2 − |q| 1

a

∃a ∈ A P : qi+1 − → qi+1 otherwise

(15)

Intuitively, RN (A¬φ ) is the set of accepting runs in A¬φ of length not greater than N , and without counting any self-loops in the states of the DFA. The set a (A¬φ ) is the set of runs in RN (A¬φ ) with the first state transition labeled by RN a (A¬φ ), P a (q) defines the set of paths of length a ∈ A P. For an element of RN 3 augmented with a ‘loop-bound’. The ‘loop-bound’ T (q, qi+1 ) is an indicator of the number of ‘self-loops’ the run in the DFA can make at state qi+1 while still keeping its length less than or equal to N . We assume that T (q, qi+1 ) = 1 when the run cannot make a self-loop at qi+1 . 5.4

Satisfaction Probability Using S-CBCs and A¬φ

In this section, we show that an accepting run of A¬φ of length less than or equal to N gives a lower bound on the probability that a particular pair of agent policies will not satisfy the saf e − LT LF formula φ. We use this in conjunction with the S-CBC to derive an upper bound on the probability that φ will be

396

B. Ramasubramanian et al.

satisfied for a particular choice of defender policy under any adversary policy. Specifically, we use Theorem 2 over each accepting run of A¬φ of length less than or equal to N to give a bound on the overall satisfaction probability. Theorem 3. Assume that the DDSG G has to satisfy a saf e − LT LF formula φ over horizon N . Let A¬φ be the DFA corresponding to the negation of φ, and for this DFA, assume that the quantities in Eqs. (12)–(14) have been computed. Then, for some aj ∈ A P and all x0 ∈ L−1 (aj ) the maximum value of the probability of satisfaction of φ for a defender policy μ(d) under any adversary policy μ(a) satisfies the following inequality:   sup inf Pxμ0(d) ,μ(a) {L(xN ) |= φ} ≥ 1 − (δρ + cρ T ), μ(a)

μ(d)

a a q∈R Nj (A ¬φ ) ρ∈P j (q)

where ρ = (q, q  , q  , T ) ∈ P aj (q) is the set of paths of length 3 with loop bound T for aj ∈ A P in an accepting run of length N in A¬φ . a

Proof. For aj ∈ A P, consider q ∈ RNj (A¬φ ) (Eq. (13)) and the set P aj (q) (Eqs. (14) and (15)). Consider an element ρ = (q, q  , q  , T ) ∈ P aj (q). From Theorem 2, for some stationary defender policy μ(d) , the probability that a trajectory σ σ → q  ) and reaching x1 ∈ L−1 (σ : q  − → q  ) of G starting from x0 ∈ L−1 (σ : q − under stationary adversary policy μ(a) over the time horizon T is at most δρ +cρ T . Therefore, the probability of an accepting run in A¬φ of length at most N starting from x0 ∈ L−1 (aj ) is upper bounded by:   (δρ + cρ T ) inf sup Pxμ0(d) ,μ(a) {L(xN ) |= ¬φ} ≤ μ(d)

μ(a)

a a q∈R Nj (A ¬φ ) ρ∈P j (q)

Now consider Eq. (2) of Problem 1. We have the following set of equivalences and inequalities: sup inf Pxμ0(d) ,μ(a) {L(xN ) |= φ} = sup (− sup (−Pxμ0(d) ,μ(a) {L(xN ) |= φ})) μ(d)

μ(a)

μ(d)

= − inf sup μ(d)

μ(a)

(−Pxμ0(d) ,μ(a) {L(xN

μ(a)

) |= φ})

= − inf sup (−1 + Pxμ0(d) ,μ(a) {L(xN ) |= ¬φ}) μ(d)

μ(a)

≥ 1 − inf sup Pxμ0(d) ,μ(a) {L(xN ) |= ¬φ} μ(d)

≥1−

μ(a)



a q∈R Nj (A ¬φ )

 ρ∈P

aj

(δρ + cρ T ) (q)

  Theorem 3 generalizes Theorem 5.2 of [23] to provide a lower bound for a stationary defender policy that maximizes the probability that the saf e−LT LF formula is satisfied by the DDSG G over the time horizon N , starting from x0 ∈ L−1 (aj ) for some aj ∈ A P for any stationary adversary policy.

Linear Temporal Logic Satisfaction in Adversarial Environments

5.5

397

Computing an S-CBC

The use of barrier functions will circumvent the need to explicitly compute sets of reachable states, which is known to be undecidable for general dynamical systems [29]. However, computationally efficient methods can be used to construct a barrier certificate if the system dynamics can be expressed as a polynomial [36]. This will allow for determining bounds on the probability of satisfaction of the LTL formula without discretizing the state space. In contrast, if the underlying state space is continuous, computing the satisfaction probability and the corresponding agent policy using dynamic programming will necessitate a discretization of the state space in order to approximate the integral in Eq. (5). We propose a sum-of-squares (SOS) optimization [34] based approach that will allow us to compute an S-CBC if the evolution of the state of the DDSG has a specific structure. The key insight is that if a function can be written as a sum of squares of different polynomials, then it is non-negative. Assumption 2. The sets X, Ud , Ua in the DDSG G are continuous, and f (x, ud , ua , w) in Eq. (1) can be written as a polynomial in x, ud , ua for any w. Further, the sets Xi = L−1 (ai ) in Assumption 1 can be represented by polynomial inequalities. Proposition 1. Under the conditions of Assumption 2, suppose that sets X0 := {x ∈ X : g0 (x) ≥ 0}, X1 := {x ∈ X : g1 (x) ≥ 0}, and X := {x ∈ X : g(x) ≥ 0}, where the inequalities are element-wise. Assume that there is an SOS polynomial B(x), constants δ ∈ [0, 1] and c, SOS (vector) polynomials s0 (x), s1 (x), and s(x), and polynomials sdui (x) corresponding to the ith entry in ud , such that: − B(x) − s0 (x)g0 (x) + δ

B(x) − s1 (x)g1 (x) − 1

∀ua ∈ Ua : − Ew [B(f (x, ud , ua , w)|x] + B(x) −

(16) 

(17) (udi − sdui (x))− s (x)g(x)+ c

i

(18) are all SOS polynomials. Then, B(x) satisfies the conditions of Theorem 2, and udi = sdui (x) is the corresponding defender policy. Proof. The proof of this result follows in a manner similar to Lemma 7 in [49] and Lemma 5.6 in [23], and we do not present it here.   The authors of [23] discuss an alternative approach in the case when the input set has finite cardinality. A similar treatment is beyond the scope of the present paper, and will be an interesting future direction of research.

6

Example

We present an example demonstrating our solution approach to Problem 1.

398

B. Ramasubramanian et al.

Example 1. Let the dynamics of the DDSG G with X = W = R2 , Ud is a compact subset of R, Ua = [−1, 1], and w1 (k), w2 (k) ∼ U nif [−1, 1] (and i.i.d.) be given by: x1 (k + 1) = −0.5x1 (k)x2 (k) + w1 (k) x2 (k + 1) = x1 (k)x2 (k) +

0.1x22 (k)

+ ud (k) + 0.6ua (k) + w2 (k)

(19) (20)

Let A P = {a0 , a1 , a2 , a3 , a4 }, and sets X0 , X1 , X2 , X3 , X4 such that for x ∈ Xi , L(x) = ai . The sets Xi are defined by: X0 := {(x1 , x2 ) : x21 + x22 ≤ 0.9}, X1 := {(x1 , x2 ) : (2 ≤ x1 ≤ 6) ∧ (−2 ≤ x2 ≤ 2)}, X2 := {(x1 , x2 ) : x21 + (x2 − 10)2 ≤ 4}, X3 := {(x1 , x2 ) : (−10 ≤ x1 ≤ −3) ∧ (−4 ≤ x2 ≤ −2)},  Xi . X4 := X \ i

The aim for an agent is to determine a sequence of inputs {ud } such that starting from X0 , for any sequence of adversary inputs {ua }, it avoids obstacles in its environment, defined by the sets X1 , X2 , and X3 for 10 units of time. The corresponding saf e − LT LF formula is φ = [a0 ∧ G¬(a1 ∨ a2 ∨ a3 )]. The DFA that accepts ¬φ is shown in Fig. 1. Suppose we are interested in determining a bound on the probability of φ being satisfied for a time-horizon of length 10. Using Eqs. (12)–(15), we have P a0 (q0 , q1 , q2 ) = {(q0 , q1 , q2 , 9)}, and P aj = ∅ for j = 1, 2, 3, 4.

Fig. 1. The DFA that accepts ¬φ for the saf e − LT LF formula φ = [a0 ∧ G¬(a1 ∨ a2 ∨ a3 )] and A P = {a0 , a1 , a2 , a3 , a4 }.

We use a sum-of-squares optimization toolbox, SOSTOOLS [37] along with SDPT3 [48], a semidefinite program solver. The barrier function B(x) = B(x1 , x2 ) was assumed to be a polynomial of degree-two. For the case c = 0, we determine the smallest value of δ that will satisfy the conditions in Proposition 1 to compute an S-CBC. The output of the program was an S-CBC given by B(x) = 0.1915x21 + 0.1868x1 x2 − 0.144x1 + 0.1201x22 + 0.1239x2

Linear Temporal Logic Satisfaction in Adversarial Environments

399

The environment and the obstacles denoted by the sets X1 , X2 , X3 and the contours of the S-CBC is shown in Fig. 2. We observe that B(x) is less than 1 in some part of X1 . A possible reason is that when solving for the second condition in Proposition 1, we work with the union of the sets X1 , X2 , and X3 , which may lead to a conservative estimate of the S-CBC.

Fig. 2. The regions X0 , X1 , X2 , X3 , X4 along with the computed secure control barrier certificate (S-CBC): B(x) = 0.1915x21 + 0.1868x1 x2 − 0.144x1 + 0.1201x22 + 0.1239x2 . The regions with red boundaries (X1 , X2 , X3 ) denote obstacles in the environment. X0 is the set from which the agent starts at time 0. The contours show the values of the S-CBC of degree 2 ranging from 1 to 100. (Color figure online)

From Theorem 2 and the computed value of δ, we have that sup inf Pxμ0(d) ,μ(a) {L(xN ) |= φ} ≥ 0.9922. μ(d)

μ(a)

This bound is conservative in the sense that we consider defender inputs ud for only the extreme values of ua = −1 and ua = 1. However, for the dynamics in Eq. (20), if the last inequality in Proposition 1 is non-negative for both ua = −1 and ua = 1, then for any ua ∈ [−1, 1], this quantity will be non-negative. Determining methods to explicitly compute a defender policy and considering S-CBCs of higher degree is an area of future research.

400

7

B. Ramasubramanian et al.

Conclusion

This paper introduced a new class of barrier certificates to provide probabilistic guarantees on the satisfaction of temporal logic specifications for CPSs that may be affected by the actions of an intelligent adversary. We presented a solution to the problem of maximizing the probability of satisfying a temporal logic specification in the presence of an adversary. The interaction between the CPS and adversary was modeled as a discrete-time dynamic stochastic game with the CPS as defender. The evolution of the state of the game was influenced jointly by the actions of both players. A dynamic programming based approach was used to synthesize a policy for the defender in order to maximize this satisfaction probability under any adversary policy. We introduced secure control barrier certificates, an entity that allowed us to determine a lower bound on the satisfaction probability. The S-CBC was explicitly computed for a certain class of dynamics using sum-of-squares optimization. An example illustrated our approach. Our example may have resulted in conservative bounds for the satisfaction probabilities since we restrict our focus to barrier certificates that are second degree polynomials and to stationary policies for the two agents. Future work will seek to study conditions under which possibly more effective non-stationary agent policies and higher degree S-CBCs can be deployed to solve the problem. A second interesting problem over a finite time-horizon is to investigate if explicit time bounds can be enforced on the temporal logic formula. An example of such a property is that the agent is required to reach a subset of states of the system between 3 and 5 min. This formula cannot be encoded in LTL, but there are other temporal logic frameworks like metric interval temporal logic [4] or signal temporal logic [31] that will allow us to express it. We propose to study the case when the system will have to satisfy other kinds of timed temporal specifications [12] in the presence of an adversary in dynamic environments.

References 1. Abate, A., Prandini, M., Lygeros, J., Sastry, S.: Probabilistic reachability and safety for controlled discrete time stochastic hybrid systems. Automatica 44(11), 2724–2734 (2008) 2. Ahmadi, M., Jansen, N., Wu, B., Topcu, U.: Control theory meets POMDPs: A hybrid systems approach. arXiv preprint arXiv:1905.08095 (2019) 3. Ahmadi, M., Wu, B., Lin, H., Topcu, U.: Privacy verification in POMDPs via barrier certificates. In: IEEE Conference on Decision and Control, pp. 5610–5615 (2018) 4. Alur, R., Feder, T., Henzinger, T.A.: The benefits of relaxing punctuality. J. ACM 43(1), 116–146 (1996) 5. Ames, A.D., Coogan, S., Egerstedt, M., Notomista, G., Sreenath, K., Tabuada, P.: Control barrier functions: Theory and applications. In: European Control Conference, pp. 3420–3431 (2019)

Linear Temporal Logic Satisfaction in Adversarial Environments

401

6. Ames, A.D., Xu, X., Grizzle, J.W., Tabuada, P.: Control barrier function based quadratic programs for safety critical systems. IEEE Trans. Autom. Control 62(8), 3861–3876 (2016) 7. Ba¸sar, T., Olsder, G.J.: Dynamic Noncooperative Game Theory, vol. 23. SIAM, Philadelphia (1999) 8. Baheti, R., Gill, H.: Cyber-physical systems. Impact Control Technol. 12(1), 161– 166 (2011) 9. Baier, C., Katoen, J.P.: Principles of Model Checking. MIT Press, Cambridge (2008) 10. Belta, C., Yordanov, B., Aydin Gol, E.: Formal Methods for Discrete-Time Dynamical Systems. SSDC, vol. 89. Springer, Cham (2017). https://doi.org/10.1007/9783-319-50763-7 11. Bertsekas, D.P.: Dynamic Programming and Optimal Control, Volumes I and II, 4th edn. Athena Scientific, Nashua (2015) 12. Bouyer, P., Laroussinie, F., Markey, N., Ouaknine, J., Worrell, J.: Timed temporal logics. In: Aceto, L., Bacci, G., Bacci, G., Ing´ olfsd´ ottir, A., Legay, A., Mardare, R. (eds.) Models, Algorithms, Logics and Tools. LNCS, vol. 10460, pp. 211–230. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-63121-9 11 13. Breton, M., Alj, A., Haurie, A.: Sequential Stackelberg equilibria in two-person games. J. Optim. Theory Appl. 59(1), 71–97 (1988) 14. Chow, C.S., Tsitsiklis, J.N.: An optimal one-way multigrid algorithm for discretetime stochastic control. IEEE Trans. Autom. Control 36(8), 898–914 (1991) 15. Cimatti, A., Clarke, E., Giunchiglia, F., Roveri, M.: NuSMV: A new symbolic model verifier. In: Halbwachs, N., Peled, D. (eds.) CAV 1999. LNCS, vol. 1633, pp. 495–499. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48683-6 44 16. De Giacomo, G., Vardi, M.: Synthesis for LTL and LDL on finite traces. Int. Joint Conf. Artif. Intell. 15, 1558–1564 (2015) 17. De Giacomo, G., Vardi, M.Y.: Linear temporal logic and linear dynamic logic on finite traces. In: International Joint Conference on Artificial Intelligence, pp. 854– 860 (2013) 18. Ding, J., Kamgarpour, M., Summers, S., Abate, A., Lygeros, J., Tomlin, C.: A stochastic games framework for verification and control of discrete time stochastic hybrid systems. Automatica 49(9), 2665–2674 (2013) 19. Ding, X., Smith, S.L., Belta, C., Rus, D.: Optimal control of MDPs with linear temporal logic constraints. IEEE Trans. Autom. Control 59(5), 1244–1257 (2014) 20. Farwell, J.P., Rohozinski, R.: Stuxnet and the future of cyber war. Survival 53(1), 23–40 (2011) 21. Gordon, G.J.: Approximate solutions to Markov decision processes. School of Computer Science, Carnegie-Mellon University, Pittsburgh, PA, Technical report (1999) 22. Jagtap, P., Soudjani, S., Zamani, M.: Temporal logic verification of stochastic systems using barrier certificates. In: Lahiri, S.K., Wang, C. (eds.) ATVA 2018. LNCS, vol. 11138, pp. 177–193. Springer, Cham (2018). https://doi.org/10.1007/ 978-3-030-01090-4 11 23. Jagtap, P., Soudjani, S., Zamani, M.: Formal synthesis of stochastic systems via control barrier certificates. arXiv preprint arXiv:1905.04585 (2019) 24. Kolathaya, S., Ames, A.D.: Input-to-state safety with control barrier functions. Control Syst. Lett. 3(1), 108–113 (2018) 25. Kˇret´ınsk´ y, J., Meggendorfer, T., Sickert, S., Ziegler, C.: Rabinizer 4: From LTL to your favourite deterministic automaton. In: Chockler, H., Weissenbacher, G. (eds.) CAV 2018. LNCS, vol. 10981, pp. 567–577. Springer, Cham (2018). https://doi. org/10.1007/978-3-319-96145-3 30

402

B. Ramasubramanian et al.

26. Kupferman, O., Vardi, M.Y.: Model checking of safety properties. In: Halbwachs, N., Peled, D. (eds.) CAV 1999. LNCS, vol. 1633, pp. 172–183. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48683-6 17 27. Kushner, H.J.: Stochastic Stability and Control. Academic Press, Cambridge (1967) 28. Kwiatkowska, M., Norman, G., Parker, D.: PRISM 4.0: Verification of probabilistic real-time systems. In: Gopalakrishnan, G., Qadeer, S. (eds.) CAV 2011. LNCS, vol. 6806, pp. 585–591. Springer, Heidelberg (2011). https://doi.org/10.1007/9783-642-22110-1 47 29. Lafferriere, G., Pappas, G.J., Yovine, S.: Symbolic reachability computation for families of linear vector fields. J. Symb. Comput. 32(3), 231–253 (2001) 30. Lindemann, L., Dimarogonas, D.V.: Control barrier functions for signal temporal logic tasks. IEEE Control Syst. Lett. 3(1), 96–101 (2019) 31. Maler, O., Nickovic, D.: Monitoring temporal properties of continuous signals. In: Lakhnech, Y., Yovine, S. (eds.) FORMATS/FTRTFT -2004. LNCS, vol. 3253, pp. 152–166. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-302063 12 32. Niu, L., Clark, A.: Secure control under LTL constraints. In: IEEE American Control Conference, pp. 3544–3551 (2018) 33. Niu, L., Li, Z., Clark, A.: LQG reference tracking with safety and reachability guarantees under false data injection attacks. In: IEEE American Control Conference, pp. 2950–2957 (2019) 34. Parrilo, P.A.: Semidefinite programming relaxations for semialgebraic problems. Math. Programm. 96(2), 293–320 (2003) 35. Pola, G., Manes, C., van der Schaft, A.J., Di Benedetto, M.D.: Bisimulation equivalence of discrete-time stochastic linear control systems. IEEE Trans. Autom. Control 63(7), 1897–1912 (2017) 36. Prajna, S., Jadbabaie, A., Pappas, G.J.: A framework for worst-case and stochastic safety verification using barrier certificates. Trans. Autom. Control 52(8), 1415– 1428 (2007) 37. Prajna, S., Papachristodoulou, A., Parrilo, P.A.: Introducing SOSTOOLS: A general purpose sum of squares programming solver. In: IEEE Conference on Decision and Control, vol. 1, pp. 741–746 (2002) 38. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, Hoboken (2014) 39. Ramasubramanian, B., Clark, A., Bushnell, L., Poovendran, R.: Secure control under partial observability under temporal logic constraints. In: IEEE American Control Conference, pp. 1181–1188 (2019) 40. Saha, I., Ramaithitima, R., Kumar, V., Pappas, G.J., Seshia, S.A.: Automated composition of motion primitives for multi-robot systems from safe LTL specifications. In: Proceedings of International Conference on Intelligent Robots and Systems, pp. 1525–1532 (2014) 41. Santoyo, C., Dutreix, M., Coogan, S.: Verification and control for finite-time safety of stochastic systems via barrier functions. In: IEEE Conference on Control Technology and Applications (2019) 42. Sharan, R., Burdick, J.: Finite state control of POMDPs with LTL specifications. In: IEEE American Control Conference, pp. 501–508 (2014) 43. Shoukry, Y., Martin, P., Tabuada, P., Srivastava, M.: Non-invasive spoofing attacks for anti-lock braking systems. In: Bertoni, G., Coron, J.-S. (eds.) CHES 2013. LNCS, vol. 8086, pp. 55–72. Springer, Heidelberg (2013). https://doi.org/10.1007/ 978-3-642-40349-1 4

Linear Temporal Logic Satisfaction in Adversarial Environments

403

44. Slay, J., Miller, M.: Lessons learned from the maroochy water breach. In: Goetz, E., Shenoi, S. (eds.) ICCIP 2007. IIFIP, vol. 253, pp. 73–82. Springer, Boston, MA (2008). https://doi.org/10.1007/978-0-387-75462-8 6 45. Steinhardt, J., Tedrake, R.: Finite-time regional verification of stochastic non-linear systems. Int. J. Robot. Res. 31(7), 901–923 (2012) 46. Sullivan, J.E., Kamensky, D.: How cyber-attacks in Ukraine show the vulnerability of the US power grid. Electr. J. 30(3), 30–35 (2017) 47. Tarjan, R.: Depth-first search and linear graph algorithms. SIAM J. Comput. 1(2), 146–160 (1972) 48. Toh, K.C., Todd, M.J., T¨ ut¨ unc¨ u, R.H.: SDPT3: A MATLAB software package for semidefinite programming. Optim. Methods Softw. 11, 545–581 (1999) 49. Wongpiromsarn, T., Topcu, U., Lamperski, A.: Automata theory meets barrier certificates: Temporal logic verification of nonlinear systems. IEEE Trans. Autom. Control 61(11), 3344–3355 (2015) 50. Yang, I.: A dynamic game approach to distributionally robust safety specifications for stochastic systems. Automatica 94, 94–101 (2018)

Cut-The-Rope: A Game of Stealthy Intrusion Stefan Rass1(B) , Sandra K¨ onig2 , and Emmanouil Panaousis3 1

Institute of Applied Informatics, Universitaet Klagenfurt, Klagenfurt, Austria [email protected] 2 Center for Digital Safety & Security, Austrian Institute of Technology, Vienna, Austria [email protected] 3 Department of Computer Science, University of Surrey, Guildford, UK [email protected]

Abstract. A major characteristic of Advanced Persistent Threats (APTs) is their stealthiness over a possibly long period, during which the victim system is being penetrated and prepared for the finishing blow. We model an APT as a game played on an attack graph G, and consider the following interaction pattern: the attacker chooses an attack path in G towards its target v0 , and step-by-step works its way towards the goal by repeated penetrations. In each step, it leaves a backdoor for an easy return to learn how to accomplish the next step. We call this return path the “rope”. The defender’s aim is “cutting” this rope by cleaning the system from (even unknown) backdoors, e.g., by patching systems or changing configurations. While the defender is doing so in fixed intervals governed by working hours/shifts, the attacker is allowed to take any number of moves at any point in time. The game is thus repeated, i.e., in discrete time, only for the defender, while the second player (adversary) moves in continuous time. It also has asymmetric information, since the adversary is stealthy at all times, until the damage causing phase of the APT. The payoff in the game is the attacker’s chance to reach this final stage, while the defender’s goal is minimizing this likelihood (risk). We illustrate the model by a numerical example and open access implementation in R. Keywords: Advanced persistent threats · Security Cyber physical system · Attack graph · Attack tree

1

· Cyber defense ·

Introduction

Contemporary APTs exhibit some similarities to human diseases: there is a phase of infection (where the attacker makes the initial contact, e.g., by sending a successful spam or phishing email), a phase of incubation (where the attacker penetrates the system as deep as it can; often slowly and stealthy to avoid c Springer Nature Switzerland AG 2019  T. Alpcan et al. (Eds.): GameSec 2019, LNCS 11836, pp. 404–416, 2019. https://doi.org/10.1007/978-3-030-32430-8_24

Cut-The-Rope: A Game of Stealthy Intrusion

405

detection), and a phase of outbreak (where the attacker causes the actual damage). The game proposed in this work covers the incubation phase, letting the defender, similar to the human body’s immune system, taking actions to keep the adversary away from vital assets, even without knowing explicitly about its moves, location or even presence. The playground of our penetration game is an attack graph, such as obtained from a topological vulnerability analysis (see, e.g., [10]). We adopt an example from the literature to illustrate our game thereon. The game is hereby designed for ease of application, to account for the expected large diversity of infrastructures on which Cut-The-Rope is playable; nonetheless, the treatment is novel in two aspects: – there is no natural synchronicity in the players taking actions; Particularly, we have a defender that acts in rounds, as working days/hours, time shifts or other organisational regulations prescribe, facing an opponent that can act in continuous time, at any time, as often as s/he likes, and in any pattern (independent or adaptive to the defender’s actions to learn about the current system configuration (leader-follower style), etc.) – the goal is not minimizing the time that an attacker spends in the system, but rather the chances for the attacker to hit vital assets (no matter how long it takes, or attempting to keep it completely outside the system). The second point distinguishes Cut-The-Rope from related games like FlipIt [5], based on the recognition that access to a critical asset even during a short time window may suffice to cause huge damage. For example, the cooling system of a nuclear power plant could be shut down within a short period of time, causing an unstoppable chain reaction. On the contrary, the attacker may spend an arbitrary lot of time with a honeypot, where no damage is possible. Thus, the average or total time spent in a system is not necessarily what counts; what is important is the adversary’s chance to use its time (no matter how short) to cause damage. Consequently, Cut-The-Rope is about minimizing the adversary’s odds to reach a critical area, rather than to keep it out completely or to minimize its time of having parts of the system under control. Related Work. APTs, due to their diverse combination of attacks, hardly admit a single model to capture them; rather, they call for a combination of models tailored to different aspects or characteristics of the attack. The common skeleton identified for “the general” APT incurs the three above mentioned phases, but can be refined into what is called the kill chain [11], consisting of reconnaissance, exploit, command & control, priviledge escalation, lateral movement and objective/target, in the sequential order just given. A proper defense aligns with these phases, and most related work [6] is specific for at least one of them. Notable is the ADAPT project [1], covering a wide spectrum of aspects and phases. Specific defense models include the detection of spying activities [17], tracing information flows [16], detection of malware [12], deception [4] (also via honeypots [13]), attack path prediction [7], and general network defense [2] to name only a few. Our game is in a way similar to that of [15], yet differs from

406

S. Rass et al.

this previous model in not being stochastic, and in using payoffs that are not real-valued. The stochastic element is included in a simpler way in our model. Taking the APT as a long term yet one-shot event, an attack tree can be treated as a (big) game in extensive form. In this view, it is possible to think of the APT as an instance of the induced gameplay, to which Bayesian or subgame perfect equilibria can be sought [9]. More similar to this work, we can treat the APT as a game of inspections, to discover optimal strategies of inspection in different depths of a shell-structured defense [24,27]. A different classification of related work is based on the protection goals. Defenses can be optimized for confidentiality [14], the monetary value of some asset upon theft or damage [27], or the time that an adversary has parts of the system under control [5]. This distinction can be important depending on the context, as industrial production typically puts priority on availability and integrity, with confidentiality as a secondary or tertiary interest. Conversely, whenever personal data is processed, confidentiality becomes the top priority, putting availability more down on the list.

2

The Model

The story evolves around a defender seeking to protect some asset from a stealthy intruder. To this end, the defender maintains an attack graph on which it engages in a game to keep the attacker away from the asset. It does so by taking turns periodically, doing spot-checks on randomly chosen nodes in the graph. A spot check hereby can mean various things, such as a mere change of credentials, a malware check, but also more complex operations such as the deactivation of services or a complete reset or reinstallation of the respective computer from a clean (trusted) reference image. Not all these options may be open for all nodes, e.g., the defender may not be allowed to deactivate certain services (like a secure shell), or a reinstallation may cause undesirably high costs due to the temporal outage of the node (see [23] for a game model including this aspect). 2.1

A Running Example

Let us illustrate Cut-The-Rope using an example attack graph shown in Fig. 1b, computed from a topological vulnerability analysis in the infrastructure as shown in Fig. 1a. In the (simplified) instance of Cut-The-Rope described next, the devices have no distinct resilience against penetration. That is, the attacker has equal chances to take any step along the attack path. We can later drop this assumption easily for the price of an only slightly modified implementation of Cut-The-Rope, as we outline in the conclusions Sect. 4. The attacker works its way from a starting point (not necessarily a fixed position; multiple possibilities for a start are permitted), stepwise towards the asset. It does so by exploiting individual vulnerabilities found at each node along the way. The attack graph (Fig. 1b) gives rise to a set of attack paths as listed in Fig. 1c. Note

ftp, rsh, ssh

File Server (Machine 1)

Router

Firewall

Workstation Machine 0

ftp, rsh Database Server (Machine 2)

(a) Infrastructure from [26] 1

ftp(0,2)

ftp(0,1)

execute(0)

ssh(0,1)

8

5

2 ftp_rhosts(0,1)

ftp_rhosts(0,2)

sshd_bof(0,1)

rsh(0,2)

trust(2,0)

rsh(0,1)

trust(2,0) 9 rsh(0,2)

rsh(0,1)

3

ftp(1,2) execute(1) 4 ftp_rhosts(1,2) rsh(1,2) trust(2,0) rsh(1,2) 6 execute(2) 7 local_bof(2)

full_access(2)

10

initial attacker‘s capability precondition attacker‘s exploit final result

x

location (number corresponds to the vertex vi in the graph on which the game is played. Edges between the vi correspond to the connections shown above.

(b) Attack Graph [26]

407

No. Attack path 1 execute(0) → ftp rhosts(0,1) → rsh(0,1) → ftp rhosts(1,2) → rsh(1,2) → local bof(2) → full access(2) 2 execute(0) → ftp rhosts(0,1) → rsh(0,1) → rsh(1,2) → local bof(2) → full access(2) 3 execute(0) → ftp rhosts(0,2) → rsh(0,2) → local bof(2) → full access(2) 4 execute(0) → rsh(0,1) → ftp rhosts(1,2) → sshd bof(0,1) → rsh(1,2) → local bof(2) → full access(2) 5 execute(0) → rsh(0,1) → rsh(1,2) → local bof(2) → full access(2) 6 execute(0) → rsh(0,2) → local bof(2) → full access(2) 7 execute(0) → sshd bof(0,1) → ftp rhosts(1,2) → rsh(0,1) → rsh(1,2) → local bof(2) → full access(2) 8 execute(0) → sshd bof(0,1) → rsh(1,2) → local bof(2) → full access(2)

Cut-The-Rope: A Game of Stealthy Intrusion

(c) Attack paths in the graph shown in Fig 1b

Fig. 1. Example playground for Cut-The-Rope

that this list is in first place made to be exhaustive, and practically may undergo a cleanup to remove attack paths that are not meaningful. In our example, each path describes a penetration by a sequence of unary or binary predicates access(x) or protocol(x,y), expressing the sort of access

408

S. Rass et al.

gained on machine x or access to machine y by some protocol (see Fig. 1b for examples, e.g., rsh(0,1) means a gain of access to machine 1 from machine 0, via a remote shell exploit). Preconditions (appearing as rectangular boxes in Fig. 1b shows) are omitted from our following analysis for simplicity. For each respective next stage along the overall attack, the adversary will come back over its so-far prepared route, to inspect the next node for vulnerabilities that can be exploited. We can imagine this path as a “rope” along which the adversary “climbs up to the top”, i.e., the target asset. The defender has no means of noticing this activity, since the attacker is stealthy. The defender, repeatedly inspecting nodes, may successfully clean the respective node from any adversarial traces, and thus cuts the adversary’s rope, sending it effectively back to the point immediately before the cut (i.e., the inspected point); hence the game’s name Cut-The-Rope. Cutting at a point that the adversary has not yet reached has no impact, since the adversary will always learn the current configuration before mounting a penetration (it will thus never get stuck). A lateral movement of the attacker is allowed, but modeled as the mere choice of a different attack path. 2.2

Game Definition

An instance of Cut-The-Rope is a tuple (G, v0 , AS1 , AS2 , λ) with the following ingredients: G = (V, E) is an attack graph, containing a designated node v0 ∈ V as the target for the attack(er). The action set AS1 ⊆ V \ {v0 } contains all nodes admissible for spot checking by the defender (excluding v0 to avoid trivialities). The action set AS2 for the attacker contains all attack paths towards v0 in G, from one (or several) “entry nodes” in G. For a set Ω, we write Δ(Ω) to mean the set of all distributions supported on Ω, and we write X ∼ F or X ∼ x ∈ Δ(Ω) to tell that the random variable has the general distribution F or categorical distribution x. We assume |AS2 | to be of manageable size, practically achievable by tool aids to construct the attack graph (often grouping of nodes with similar characteristics regarding vulnerabilities [3,10]). The value λ ∈ R>0 is the attack rate: it specifies an average over how many steps the attacker takes in times when the defender is inactive. Note that we assume no particular cost for the attacker to penetrate here, and for simplicity, we further assume that the attacker is always successful in the penetration (i.e., the idle times of the attacker are used for learning about configurations and exploits, and the learning is always successful; in reality, this assumption may be overly pessimistic for the defender, but can be relaxed almost trivially as we discuss in the conclusions section). Also for simplicity, we assume λ to be constant over time (the generalization towards a nonhomogeneous attack process is beyond the scope of this work, but an interesting open question). This implies the basic assumption that a move of the attacker at any time takes it a random (Poissonian) number of N ∼ P ois(λ) steps further down on the attack path, until the target asset v0 ∈ G. The payoff in the game is the adversary’s random location L ∼ U (x, θ, λ)(V ) that depends on (i) the defender’s probability vector x ∈ Δ(AS1 ) of doing spot checks on the node set

Cut-The-Rope: A Game of Stealthy Intrusion

409

AS1 , (ii) the attacker’s starting point θ ∈ V \ {v0 }, and (iii) the attack rate λ. Since the attacker is stealthy, the defender has no means of knowing where the attacker currently is, i.e., from where it has started to take the next N steps towards v0 . Thus, the defender cannot work out the distribution of L, and can only choose its own randomized spot checking rule x ∈ Δ(AS1 ) with knowledge of λ. We model the defender’s uncertainty about the adversary’s location by considering each starting point as inducing a distinct adversary type θ ∈ Θ, where Θ ⊆ V \ {v0 } is the set of possible starting locations (θ = v0 is assumed to avoid trivialities). This turns the competition into a Bayesian game where the defender faces an adversary of unknown type (location) from the finite set Θ. The game, however, does not qualify as a signalling game, since the attacker remains invisible at all times (until it reaches v0 ). Still, a perfect Bayesian equilibrium (Definition 1) will turn out as a suitable solution concept. Remark 1. The value of λ is common knowledge of both parties, but realistically an assumption made by the defender. As such, it may be subject to estimation errors, and may (in reality) be a presumed range of possible values rather than a fixed value. The implied change to the model, however, amounts to a mere change of the resulting distribution from plain P ois(λ) into something more complex (e.g., a mix of Poisson distributions or other), but nonetheless a distribution over the number of steps being taken. Likewise, assuming a fixed number of attacker steps at any point in (continuous) time yet is describable by yet another distribution over the step number within a fixed time interval. Our model does not anyhow hinge on the shape of the step distribution, so both generalizations are left for reports on future (practical) instances of Cut-The-Rope. For each attack path π ∈ AS2 , starting from the location θ, the attack step number distribution, here P ois(λ), assigns probability masses to the nodes on π. The totality of all attack paths then defines a total probability mass on each node of G = (V, E), which is the distribution U (later made more explicit in the derivation of expressions (2) and (3)). We put the nodes in V in ascending order of (graph-theoretic) distance to v0 (with any order on nodes of equal distance). Then, the mass assigned in the proximity of v0 is the tail mass of U . We stress that replacing the uncertain payoffs by numbers, i.e., taking the mere expected payoff is not meaningful, since we are interested in the attacker not hitting the target, but do not care about its average penetration depth. The latter is uninteresting, since it causes no damage for the defender; only loosing v0 does that! The optimization of both players in the game is then doable by a stochastic tail ordering on the random variable U for the attacker (as given by (2)) and U  for the defender (as given by (3)), where U  differs from U only in the fact that U relates to an attacker of type θ, while U  is the weighted mix of attackers of all types, since the defender does not know which type it is facing (and hence has to adopt a hypothesis on θ; this is where the Bayesian flavour of the game comes in). Our chosen stochastic order is the -relation introduced in [22], which, in the special case of a categorical distribution, is equivalent to a lexicographic ordering

410

S. Rass et al.

on the probability masses. That is, if two distributions U1 = (p1 , . . . , pn ) and U2 = (q1 , . . . , qn ) are given, then U1  U2 if and only if pn < qn or [pn = qn and U1 = (p1 , . . . , pn−1 )  U2 = (q1 , . . . , qn−1 )], with (final) equality and  following in the canonic way. Applying this ordering on the masses that our payoff distributions put on the nodes in V in order of distance to v0 , amounts to the game being about minimizing the attacker’s chances to reach v0 for the defender, while the attacker at the same time pushes towards v0 by -maximizing the tail mass. Formally, the optimization problems are: Defender: -minimize U  over x ∈ Δ(AS1 ), given λ and a hypothesis F (distribution) on the adversary type θ ∼ F (Θ). Attacker: -maximize U , given the defender’s strategy x, from the starting point (type) θ ∈ V . To make these rigorous, let us now work out how U and U  are computed for the players: The formal definition of the adversary’s utility is a simple matter of conditioning the attack steps distribution on the current situation in the attack graph. With Θ determining the random starting point θ ∈ V , the adversary may take one out of several routes πθ,1 , πθ,2 , . . . , πθ,mθ , all going from θ to v0 (their count being denoted as mθ ). A general such path is represented as π = (θ, w1 , w2 , . . . , v0 ) with all wi ∈ {v1 , v2 , . . .} = V . The set of nodes constituting π is V (π). Also, let dπ (u, v) be the graph-theoretic distance counting the edges on the subsection from u to v on the chosen path π. Then, the utility distribution for the attacker assigns to each node v ∈ V the mass Pr(adversary’s location = v|V (π)) =

fP ois(λ) (dπ (θ, v)) , PrP ois(λ) (V (π))

(1)

x

λ −λ in which fP ois(λ) (x) is the density of  the Poisson distribution, and  = x! e PrP ois(λ) (V (π)) = x∈V (π) PrP ois(λ) (dπ (θ, x)) = x∈V (π) fP ois(λ) (dπ (θ, x)) (in a slight abuse of notation). Now, the defender’s action comes in, who hopes to cut the rope behind the attacker. Let c ∈ V be the checked node, then the possibly truncated path is  (θ, w1 , w2 , . . . , wi−1 ), if c = wi for some node wi on π π|c = otherwise. (θ, w1 , . . . , v0 ),

Cutting the rope then means conditioning the distribution of the adversary’s location on the shorter (cut) path π|c . The formula is the same as (1), only with π replaced by π|c now. Since c ∼ x follows the defender’s mixed spot checking strategy (possibly degenerate), and the set of paths π along which the attacker steps forward (at rate λ) is determined by the random starting position θ ∼ Θ, the utility distribution for the attacker is given as the tuple U (x, θ, λ) = (Pr(adversary’s location = v|V (π|c )))v∈V

(2)

Cut-The-Rope: A Game of Stealthy Intrusion

2.3

411

Equilibrium Definition and -Computation

The defender may associate each possible attack starting point in G with a distinct type of adversary. The belief about the adversary’s type is then again Θ, and the ex ante payoff is then  Pr(θ) · U (x, θ, λ) (3) U  (x, λ) = θ∈Θ

Θ

Observe that we can equivalently write U (x, θ, λ)(v) = Prx (adversary’s location = v|θ, λ), since this is by construction the distribution of the attacker’s location, conditional on the starting point θ. In this view, however, (3) is just the law of total probability, turning U  into the distribution of the attacker’s location. It turns out that a perfect Bayesian equilibrium fits nicely for our purpose. We instantiate the definition from [8, Def. 8.1] to our setting. Definition 1. A perfect Bayesian equilibrium (PBE) for a two-player signalling game is a strategy profile such that: (P1 ) Sequential rationality of the attacker: ∀θ, the type-θ-adversary maximizes U (x, θ, λ) over its action space AS2 , for the (fixed, randomized) action x chosen by the defender1 . (P2 ) Sequential rationality for the defender: ∀a2 ∈ AS2 , the defender chooses its action conditional  on the observed action a2 of the attacker (signal) as x∗ ∈ argmina1 ∈AS1 θ Pr(θ|a2 )U (a1 , θ, λ) (B) Bayesian updating of beliefs: the conditional belief is Pr(θ|a2 ) =   Pr(θ ) Pr(a2 |θ ), whenever the denominator is > 0. Pr(θ) · Pr(a2 |θ)  θ ∈Θ Otherwise, any distribution Pr(·|a2 ) is admissible. How does this fit for us? The assumption of a stealthy attacker means that there are no signals, so our game is not truly a signalling game in the strict sense. However, the notion of a perfect equilibrium nonetheless is meaningful, if the defender has other means of updating a belief about which part of the system is infected. Condition (P1 ) says that the attacker, who knows the random spot checking pattern x of the defender, will take the “best” path a2 ∈ {πθ,1 , . . . , πθ,mθ } ⊆ AS2 so as to maximize the probability of hitting v0 (after a Poissonian number of steps). Regarding the defender’s Bayesian updating prescription (B), recall that the attacker is stealthy and hence avoids sending signals. This makes the conditioning in (B) be on an empty set, since we receive no signal, formally meaning a2 = ∅ and implying a zero denominator in (B) because Pr(a2 |θ ) = Pr(∅|θ ) = 0. This allows the defender to just carry over its a priori belief Θ into the posteriori belief Pr(θ|·) := PrΘ (θ), and (B) is automatically satisfied. Intuitively, unbeknownst of the attacker’s location, the defender faces one big information set, on which it can impose the hypothesis Θ that will remain unchanged in absence of any signals (for the same reason, we do not have any separating equilibria; they are all necessarily pooling). Finally, plugging Pr(θ|a2 ) = PrΘ (θ) into 1

The dependence of U on a ∈ AS2 is implicit here, but comes in through the probabilities involved to define the utility; we will come back to this in a moment.

412

S. Rass et al.

condition (P2 ), we end up finding that the defender in fact -minimizes U  as given by (3). Taking yet another angle of view, we can also arrive at condition (3) by considering the competition as one-against-all [25], where the defender simultaneously faces exactly one adversary of type θ for all types θ ∈ Θ. Then, condition (3) is just a scalarization of the resulting multi-criteria security game (cf. [20]). It follows that each PBE in the sense just explained is equal to a multi-goal security strategy (MGSS), and vice versa, enabling the application of algorithms to compute MGSS [19] in order to get a PBE. This is the method applied in the following (using MGSS as a mere technical vehicle).

3

Computational Results

Cut-The-Rope is most conveniently implemented in R, since the basic system already ships with most of the required functions. The full details and explanations of the code are available as a supplementary online resource [18]. First, let us assign consecutive numbers as representatives of the locations in the attack graph, displayed as circled numbers in Fig. 1b. This constitutes the set V = {1, 2, . . . , 10}, with 10 = v0 , the target asset, and v = 1 being the (common) starting point for all attack paths. In our example game, the defender has n = |V | − 1 = 9 strategies, excluding the trivial defense of always (and only) checking the target node v0 . The attacker, 10 This set shrinks in turn, can choose from a total of 8 paths from 1 to v0 = . accordingly when the attack starts from a point θ ∈ V \ {v0 } = {1, 2, . . . , 9}, to contain only the paths that contain the respective θ. For θ = 1, this gives the full set of 8 paths, while, for example, θ = 3 leaves only 4 paths for the attacker. As before, let the number of paths per θ be mθ , then each type of opponent has a n × mθ payoff matrix. The entries therein are indexed by a node i ∈ V being checked, and a path πθ,j . The attacker’s utility distribution (2) is obtained by evaluating the Poisson density fP ois(λ) (x) for x = 0, 1, . . . , |V (πθ,j )|, i (if that node is on the path). and conditioning it on the potential cut at node

Effectively, the conditioning amounts to setting all probability masses on the path from i to v0 to zero, and renormalizing the remainder of the vector (see [18, code lines 30–35]). Now, the defender comes in and cuts the rope. This is merely another conditioning, i.e., zeroing the mass of all nodes that come after i, if i is on the attacker’s residual route (“last mile towards v0 ”). The resulting masses are then assigned to the nodes in V , placing zero mass on all nodes that are never reached, either because the attacker would anyway have not come across the node, or if the rope has been cut before the node has been reached (see [18, code lines 36–39]). The so-constructed distribution (see [18, variable L]) is the vector U (x = ei , θ = j, λ = 2) from (2), with ei being the i-th unit vector (acting as a degenerate distribution for our purposes). Having this, it remains to sum up these with weights according to Θ, in the final utility U  as in (3). To this end, we adopt

Cut-The-Rope: A Game of Stealthy Intrusion

413

a non-informative prior, i.e., a uniform distribution Θ on the possible adversary types, i.e., starting points in V \ {v0 }, and iteratively compute the weighted sum (3) (in [18, code line 44]) from an initially constructed vector U ∈ R|V | of all zeroes. Note that this, unlike the defender’s strategy set, includes the target node v0 . The remaining labour of setting up the game and solving for a multi-criteria security strategy, giving the sought perfect Bayesian pooling equilibrium is aided by the package HyRiM [21] for R ([18, code lines 46–52]). The PBE obtained is in pure strategies, prescribing the defender to periodically patch potential local buffer overflows at machine 2 (optimal pure strategy being local bof(2)), while the attacker is best off by choosing the attack path execute(0) → ftp rhosts(0,2) → rsh(0,2) → full access(2). This matches the intuition of the best strategy being the defense of the target, by avoiding exploits thereon. Since all attack paths intersect at the node local bof(2), this equilibrium is not surprising. Still, it may appear odd to find the equilibrium not being the shortest among all attack paths. The reason lies in our choice of the attack rate to be λ = 2: for that setting, it is equally probable for the adversary to take 1 step (chance fP ois(2) (1) = 0.2706706) or 2 steps (with the same chance fP ois(2) (1) = fP ois(2) (2)), so the attacker is indeed indifferent between these two options, based on its attack rate. The equilibrium utility for the attacker is it to be located at positions V = {1, 2, . . . , 10} with probabilities U ∗ ≈ (0.573, 0, 0, 0, 0, 0.001, 0.111, 0.083, 0.228, 0.001) i.e., the effect of this defense is as desired; the adversary can get close to v0 , but has only a very small chance of conquering it. A different solution is obtained if we restrict the defender’s scope to checking only some FTP connections, remote- and secure shells. Under the so-restricted action space AS1 = {ftp rhosts(0,1), ftp rhosts(0,2), sshd bof(0,1), rsh(1,2), rsh(0,2)}, we obtain a PBE in mixed strategies being as in Table 1. Table 1. Example results attack path no. Pr(attack path) (from Figure 1c) 1 2 3 4 5 6 7 8

0.104 0.119 0.603 0.174 0 0 0 0

(a) attacker’s equilibrium

location v ∈ V Pr(check v) (from Figure 1b) ftp rhosts(0,1) ftp rhosts(0,2) sshd bof(0,1) rsh(1,2) rsh(0,2)

0.504 0 0.181 0.166 0.149

(b) defender’s equilibrium

414

S. Rass et al.

The payoff distribution density obtained under this equilibrium is the attacker being located on the nodes (1, 2, 3, . . . , 10) with probabilities U ∗ ≈ (0.545, 0.017, 0.030, 0.022, 0.012, 0.034, 0.128, 0.021, 0.045, 0.146).

4

Conclusions

Cut-The-Rope has been designed for ease of application, but admits a variety of generalizations and possibilities for analytic studies. Examples include (i) probabilistic success on spot checks, (ii) probabilistic success on exploits, (iii) spot checking with random intervals, (iv) taking a fixed number of steps at any point in time, (v) multiple adversarial targets, etc. Cases (i)–(iv) all amount to a mere substitution of the Poisson distribution by: (i) a mix of distributions (one for the success, and one for the failure of a spot check), (ii) a product of probabilities to describe the chances to penetrate all nodes along a path, (iii) a geometric distribution describing the random number of spot checks between two events with exponentially distributed time in between, or (iv) a mix of distributions to describe the steps taken within the periodicity of the defender’s activity. Using an artificial target node to represent multiple real targets, (v) also boils down to a change in the attack graph model, but no structural change to the game. Generally, Cut-The-Rope opens up an interesting class of games of mixed timing of moves between the actors, unlike as in extensive or normal form games, where players usually take actions in a fixed order. Likewise, and also different to many other game models, Cut-The-Rope has no defined start or finish for the defender (“security is never done”), while only one of the two players knows when the game starts and ends. The model is thus in a way complementary to that of FlipIt, while it allows the attacker to spend any amount of time in the system, as long as the vital asset remains out of reach. This is actually to reflect the reality of security management: we cannot keep the adversary out, we can only try keeping him as far away as possible.

References 1. ADAPT: Analytical Framework for Actionable Defense against Advanced Persistent Threats—UW Department of Electrical & Computer Engineering (2018). https://www.ece.uw.edu/projects/adapt-analytical-framework-foractionable-defense-against-advanced-persistent-threats/ 2. Alpcan, T., Ba¸sar, T.: Network Security: A Decision and Game Theoretic Approach. Cambridge University Press, Cambridge (2010) 3. BSI: IT-Grundschutz International. Bundesamt f¨ ur Sicherheit in der Informationstechnik (2016). https://www.bsi.bund.de/DE/Themen/ITGrundschutz/ ITGrundschutzInternational/itgrundschutzinternational node.html 4. Carroll, T.E., Grosu, D.: A game theoretic investigation of deception in network security. In: 2009 Proceedings of 18th International Conference on Computer Communications and Networks, pp. 1–6. IEEE, San Francisco, August 2009

Cut-The-Rope: A Game of Stealthy Intrusion

415

5. Dijk, M., Juels, A., Oprea, A., Rivest, R.L.: FlipIt: the game of “Stealthy Takeover”. J. Cryptol. 26(4), 655–713 (2013) 6. Etesami, S.R., Ba¸sar, T.: Dynamic games in cyber-physical security: an overview. Dyn. Games Appl. (2019). https://doi.org/10.1007/s13235-018-00291-y. ISSN: 2153-0793 7. Fang, X., Zhai, L., Jia, Z., Bai, W.: A game model for predicting the attack path of APT. In: 2014 IEEE 12th International Conference on Dependable, Autonomic and Secure Computing. pp. 491–495. IEEE, Dalian, August 2014 8. Fudenberg, D., Tirole, J.: Game Theory. MIT Press (1991). ISBN: 978-0262061414 9. Huang, L., Zhu, Q.: Adaptive Strategic Cyber Defense for Advanced Persistent Threats in Critical Infrastructure Networks. arXiv:1809.02227 [cs], September 2018 10. Jajodia, S., Noel, S., Kalapa, P., Albanese, M., Williams, J.: Cauldron missioncentric cyber situational awareness with defense in depth. In: 2011 - MILCOM 2011 Military Communications Conference, pp. 1339–1344. IEEE (2011) 11. Kamhoua, C.A., Leslie, N.O., Weisman, M.J.: Game Theoretic Modeling of Advanced Persistent Threat in Internet of Things. J. Cyber Secur. Inf. Syst. 6(3), 40–46 (2018) 12. Khouzani, M., Sarkar, S., Altman, E.: Saddle-point strategies in malware attack. IEEE J. Sel. Areas Commun. 30(1), 31–43 (2012) 13. La, Q.D., Quek, T.Q.S., Lee, J.: A game theoretic model for enabling honeypots in IoT networks. In: 2016 IEEE International Conference on Communications (ICC), pp. 1–6. IEEE, May 2016 14. Lin, J., Liu, P., Jing, J.: Using signaling games to model the multi-step attackdefense scenarios on confidentiality. In: Grossklags, J., Walrand, J. (eds.) GameSec 2012. LNCS, vol. 7638, pp. 118–137. Springer, Heidelberg (2012). https://doi.org/ 10.1007/978-3-642-34266-0 7 15. Lye, K.W., Wing, J.M.: Game strategies in network security. Int. J. of Inf. Secur. 4, 71–86 (2005) 16. Moothedath, S., et al.: A game theoretic approach for dynamic information flow tracking to detect multi-stage advanced persistent threats. arXiv:1811.05622 [cs], November 2018 17. Qing, H., Shichao, L., Zhiqiang, S., Limin, S., Liang, X.: Advanced persistent threats detection game with expert system for cloud. J. Comput. Res. Dev. 54(10), 2344 (2017) 18. Rass, S., K¨ onig, S., Panaousis, E.: Implementation of cut-the-rope in R. https:// www.syssec.at/de/downloads/papers, supplementary material to this work, July 2019 19. Rass, S., Rainer, B.: Numerical computation of multi-goal security strategies. In: Poovendran, R., Saad, W. (eds.) GameSec 2014. LNCS, vol. 8840, pp. 118–133. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-12601-2 7 20. Rass, S.: On game-theoretic network security provisioning. J. Netw. Syst. Manag. 21(1), 47–64 (2013) 21. Rass, S., K¨ onig, S.: HyRiM: multicriteria risk management using zero-sum games with vector-valued payoffs that are probability distributions. https://cran.rproject.org/web/packages/HyRiM/index.html 22. Rass, S., K¨ onig, S., Schauer, S.: Defending against advanced persistent threats using game-theory. PLoS ONE 12(1), e0168675 (2017) 23. Rass, S., K¨ onig, S., Schauer, S.: On the cost of game playing: how to control the expenses in mixed strategies. In: Rass, S., An, B., Kiekintveld, C., Fang, F., Schauer, S. (eds.) Decision and Game Theory for Security, pp. 494–505. Springer, Heidelberg (2017). https://doi.org/10.1007/978-3-319-68711-7 26

416

S. Rass et al.

24. Rass, S., Zhu, Q.: GADAPT: a sequential game-theoretic framework for designing defense-in-depth strategies against advanced persistent threats. In: Zhu, Q., Alpcan, T., Panaousis, E., Tambe, M., Casey, W. (eds.) GameSec 2016. LNCS, vol. 9996, pp. 314–326. Springer, Cham (2016). https://doi.org/10.1007/978-3-31947413-7 18 25. Sela, A.: Fictitious play in ‘one-against-all’ multi-player games. Econ. Theor. 14(3), 635–651 (1999) 26. Singhal, A., Ou, X.: Security risk analysis of enterprise networks using probabilistic attack graphs. https://doi.org/10.6028/NIST.IR.7788 27. Zhu, Q., Rass, S.: On multi-phase and multi-stage game-theoretic modeling of advanced persistent threats. IEEE Access 6, 13958–13971 (2018)

Stochastic Dynamic Information Flow Tracking Game with Reinforcement Learning Dinuka Sahabandu1(B) , Shana Moothedath1 , Joey Allen2 , Linda Bushnell1 , Wenke Lee2 , and Radha Poovendran1 1

2

Department of Electrical and Computer Engineering, University of Washington, Seattle, WA 98195, USA {sdinuka,sm15,lb2,rp3}@uw.edu College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA [email protected], [email protected]

Abstract. Advanced Persistent Threats (APTs) are stealthy, sophisticated, and long-term attacks that impose significant economic costs and violate the security of sensitive information. Data and control flow commands arising from APTs introduce new information flows into the targeted computer system. Dynamic Information Flow Tracking (DIFT) is a promising detection mechanism against APTs that taints suspicious input sources in the system and authenticates the tainted flows at certain processes according to a well defined security policy. Employing DIFT to defend against APTs in large scale cyber systems is restricted due to the heavy resource and performance overhead introduced on the system. The objective of this paper is to model resource efficient DIFT that successfully detect APTs. We develop a game-theoretic framework and provide an analytical model of DIFT that enables the study of trade-off between resource efficiency and the quality of detection in DIFT. Our proposed infinite-horizon, nonzero-sum, stochastic game captures the performance parameters of DIFT such as false alarms and false-negatives and considers an attacker model where the APT can relaunch the attack if it fails in a previous attempt and thereby continuously engage in threatening the system. We assume some of the performance parameters of DIFT are unknown. We propose a model-free reinforcement learning algorithm that converges to a Nash equilibrium of the discounted stochastic game between APT and DIFT. We execute and evaluate the proposed algorithm on a real-world nation state attack dataset. Keywords: Security of computer systems · Advance persistent threats · Dynamic Information Flow Tracking · Stochastic games Reinforcement learning

·

This work was supported by ONR grant N00014-16-1-2710 P00002 and DARPA TC grant DARPA FA8650-15-C-7556. c Springer Nature Switzerland AG 2019  T. Alpcan et al. (Eds.): GameSec 2019, LNCS 11836, pp. 417–438, 2019. https://doi.org/10.1007/978-3-030-32430-8_25

418

1

D. Sahabandu et al.

Introduction

The chances of government entities and organizations falling victim to digital espionage continue to rise alarmingly due to the recent growth in cyber attacks, both in number and sophistication [14]. In recent years, Advanced Persistent Threats (APTs) have emerged from traditional cyber threats (malware) as a costly threat and a top concern for entities such as national defense and financial organizations that deal with classified information [31]. Unlike malware that execute quick damaging attacks, APTs launch more sophisticated, strategic, stealthy, and extended multi-stage attacks that operate continuously over a long period of time to achieve a specific malicious objective [6]. A typical APT attack consists of one or more of the following steps [6]. APTs initially carry out a reconnaissance on the victim system to identify vulnerabilities and establish the foothold using various techniques such as social engineering and spear-phishing. Then APTs will start to search and gather system information and compromise system resources needed to obtain necessary privilege escalations [6] to reach the final stage of the attack. This phase is often called as the lateral movement of APTs and it consists of multiple stages [6]. In the final stage of the attack APTs will use the resources gathered during the lateral movement phase to achieve a particular goal such as sabotaging critical infrastructures (e.g., Stuxnet [9]) or exfiltrating sensitive information (e.g., Operation Aurora, Duqu, Flame, and Red October [3]). The stealthy, sophisticated and strategic nature of APTs make defending against them challenging using conventional security mechanisms such as firewalls, anti-virus softwares, and intrusion detection systems that heavily rely on the signatures of malware or anomalies observed in the benign behavior of the system. Although the APTs’ activities with system components are kept at low rate to maintain the stealthiness, the interaction between APTs and the system introduce information flows in the victim system. Information flows consist of data and how the data is propagated between the system processes (e.g., an instance of a program execution) [20]. Dynamic Information Flow Tracking (DIFT) is a technique that was initially developed to dynamically track how information flows are used during program execution [20,28]. Operation of DIFT is based on the idea of tainting, tracking, and verifying authenticity of the tainted information flows. Recently, DIFT was used to detect APTs [6,8]. DIFT first taints (tags) all the information flows that are originating from the set of processes that are susceptible to cyber threats [20,28]. Then, based on well-defined tag propagation rules DIFT propagates tags into the output information flows when tagged flows are mixed with untagged flows at the system processes. Finally DIFT verifies the authenticity of the tagged flows at a subset of processes, referred to as traps, using a set of pre-specified tag check rules. Tag propagation rules and tag check rules are defined by system’s security experts and often called security policy of the DIFT. When a tagged (suspicious) information flow is verified as malicious at a trap, DIFT traces back to its entry point in the system and terminate the victimized process in order to protect the system.

Stochastic Dynamic Information Flow Tracking Game

419

Tagging and tracking information flows require allocating additional memory and storage resources of the system to DIFT scheme. Moreover, trapping information flows demand extra processing power from the system. The number of benign information flows exceeds the amount of malicious information flows in a system by a large factor. Therefore, DIFT can incur a tremendous resource and performance overhead to the underlying system and the situation can be worse in large scale systems such as servers used in the data centers [8]. Although there has been software design based suggestions to reduce the resource and performance cost of DIFT [16,28], widespread deployment of DIFT across various cyber systems and platforms is heavily constrained by the aforementioned resource and performance constraints [15,22]. An analytical model that captures the system level interactions between APTs and DIFT would enable deploying resource efficient DIFT as an effective defense mechanism that can detect and prevent threats imposed by the APTs. In a system equipped with DIFT, interactions of the APT with the system to achieve the malicious objective while evading detection depends on the efficiency of the DIFT scheme. On the other hand, determining a resource-efficient policy for DIFT that maximize the detection probability depends on the nature of APT’s interactions with the system. Non-cooperative game theory provides rich set of rules that can model the strategic interactions between two competing agents (APT and DIFT). The feasible interactions among system processes and objects (e.g., files, network sockets) can be abstracted into a graph called information flow graph [16]. Therefore, in this paper we propose a game-theoretic formulation on the underlying system information flow graph to facilitate the study of trade-off between resource efficiency and efficiency of detection in DIFT. We make the following contributions in this paper. • We model the interaction between APT and DIFT as a two-player, nonzerosum, imperfect information, infinite-horizon stochastic game. The state space of the game and the action sets of both players are finite and the players strategize from a stationary policy to maximize their individual payoffs. • We capture the performance evaluation metrics such as false negatives of the DIFT using the stochastic nature of the game. Our attack model allows APT to continuously attack the system while having the capability of relaunching the attack in case it fails to achieve its final objective. • We provide a reinforcement learning based algorithm that converges to a Nash equilibrium (NE) of the APT vs. DIFT stochastic game. The proposed algorithm utilizes the structure of the game and is based on the two-time scale algorithm in [23]. • To evaluate the performance of our approach, we perform experimental analysis of the model and the algorithm on nation state attack data obtained from Refinable Attack INvestigation (RAIN) [16] framework. The remainder of the paper is organized as follows: Sect. 2 presents the related work to this paper. Section 3 details preliminaries of information flow graph, the attacker model, and the DIFT detection system. Section 4 formulates the stochastic game between APTs and DIFT and Sect. 5 describes the notion of

420

D. Sahabandu et al.

equilibria used to analyze the game. Section 6 presents a reinforcement learning based algorithm to calculate a NE of the discounted stochastic game between the players APT and DIFT. Section 7 demonstrates the experimental results when our proposed algorithm in Sect. 6 is used on real-world attack dataset. Section 8 gives the concluding remarks and future directions.

2

Related Work

Stochastic games are widely used to model interaction of competing agents in a dynamic system. In stochastic games the environment is nonstationary to each agent, agents make their decision independently, and each agent’s payoff is affected by the joint decision of all agents. Security games [17], economic games [2], and resilience of cyber-physical systems [32] are modeled as stochastic games in the literature. The environment of security games is the underlying system and the agents are the defense mechanism and the adversary. Stochastic games have been used to model the competitive interaction between malicious attackers and the Intrusion Detection System (IDS) in computer and communication networks [1,21]. In [21], the existence and uniqueness of an equilibrium in an IDS vs. Attacker game is analyzed. On the other hand, [1] considered a zero-sum IDS vs. Attacker game and analyzed the game using techniques from Markov Decision Process (MDP) [10] and Q-learning [12]. Security of a distributed IDS that undergo simultaneous attacks by a number of attackers is modeled as a nonzero-sum stochastic game in [33] and a value-iteration based algorithm is proposed to compute an -NE. Game-theoretic framework was proposed in the literature to model interaction of APTs with the system [13,27]. While [13] modeled a deceptive APT, a mimicry attack by APT was considered in [27]. In our framework, we consider APTs that target certain confidential information in the system and conduct operations to breach that information. We model the detection of APTs using DIFT as a dynamic stochastic game. In our prior work, we have used game theory to model the interaction of APTs and a DIFT-based detection system [18,19,26]. The game models in [18,19,26] are non-stochastic as the notions of false alarms and false-negatives are not considered. Recently, a stochastic model of DIFT-games was proposed in [25] where the notion of conditional branching in programs was addressed. The approach in [25] assume that the transition probabilities of the game are known. However, in certain cases the transitions probabilities may not be known. Multi-agent reinforcement learning (MARL) algorithms are proposed in the literature to obtain NE strategies of stochastic games when the transition probabilities of the game and the payoff functions of the players are unknown. The convergence of these approaches are shown only in the case of zero-sum games. For nonzero-sum games the convergence is guaranteed only for special cases where the NE of the game is unique [12]. Reference [5] introduced two properties, rationality and convergence, that are necessary for convergence and proposed a WOLF-policy hill climbing algorithm which is empirically shown to converge

Stochastic Dynamic Information Flow Tracking Game

421

to NE. A weaker notion of equilibrium in game, referred to as correlated equilibrium, was considered in [11] as the computation of NE is hard. Recently, a two-time scale algorithm to compute an NE of a stochastic game is given in [23]. We exploit the structure of the game and propose a modified version of the two-time scale algorithm in [23].

3

Preliminaries

This section introduces the following components: Information Flow Graph (IFG); Adversary Model of APT; and Dynamic Information Flow Tracking (DIFT). 3.1

Information Flow Graph (IFG)

An information flow graph (IFG) is a directed multigraph that represents the history of a system’s execution in terms of the spatio-temporal relationships between processes and objects (files and network endpoints) [16]. Processes and objects are nodes in the graph and the directed edges describe interactions and information flows between the nodes. Using provenance-enhanced auditing is heavily desired by large enterprises and government agencies due to its ability to answer two key questions, how an attack infiltrated their systems and what are the ramifications of the attack. Unfortunately, classical auditing systems cannot efficiently answer these questions; this is because they cannot effectively embed the causal relationships into uniform records, like provenance-enhanced systems. When causal relationships are embedded into audit logs, security-experts can run provenance-dependent queries to derive the origin of an attack and the ramifications of an attack. Identifying the origins of an attack is completed by doing a backward traversal, which analyzes the ancestral dependencies of the attack. Additionally, forward analysis techniques traverse through the graph in the forward direction, which effectively determines the ramifications of the attack [16]. In order to detect APTs we use the IFG obtained from the system log. Let G = (VG , EG ) represents the IFG of the system. VG = {s1 , . . . , sN } consists of the processes (e.g., an instance of a computer program) and objects in the system and EG ⊆ VG × VG represents the information flows (directed) in the system from one node to the other. We perform our game-theoretic analysis on the IFG of the system. 3.2

Adversary Model: Advanced Persistent Threats

Advanced persistent threats (APTs) are sophisticated attackers, such as groups of experienced cybercriminals, that establish an illicit, long-term presence in a system in order to mine valuable information/intelligence. The targets of APTS, which are very specifically chosen, typically include large enterprises or governmental networks. The attacker spends time and resources to identify the vulnerabilities that it can exploit to gain access into the system, and to design an attack

422

D. Sahabandu et al.

that will likely remain undetected for a long period of time. These attacks are stealthy and differ from the conventional cyber attacks in complexity and their ability to evade the intrusion detection systems by adopting a nominal system behavior. We can break down a successful APT campaign into the following key stages: 1. Initial Compromise: During the initial compromise stage, the attacker’s goal is to gain access to an enterprise’s network. In most cases, the attacker achieves this by exploiting a vulnerability or a social engineering trick, such as a phishing email. 2. Foothold Establishment: Once the attacker has completed the initial compromise, it will establish a persistent presence by opening up a communication channel with their Command & Control (C&C) server. 3. Privilege Escalation: Next, the attacker will try and escalate its privileges which may be necessary in order to access sensitive information, such as proprietary source code or customer information. 4. Internal Reconnaissance: During the reconnaissance phase, the attacker will try to gain information about the system, such as what nodes are accessible on the system and the security defenses, such as an IDS, that are being used. 5. Lateral Movement: The attacker will increase its control of the system by moving laterally to new nodes in the system. 6. Attack Completion: The final goal of the attacker is to deconstruct the attack, hopefully in a way to minimize his footprint in order to evade detection. For example, attackers may rely on removing the system’ss log. Let λ ⊂ VG be the possible entry points of the adversary and Dj ⊂ VG be the set of targets, refereed to ad destinations, of stage j of the attack. The different stages of the attack translate to different stages of the game in our game formulation described in Sect. 4. 3.3

Defender Model: Dynamic Information Flow Tracking (DIFT)

DIFT is a taint analysis system that dynamically monitor the operation of a system. It consists of three components: (i) taint sources, (ii) taint propagation rules, and (iii) taint sinks. Taint (tag) sources are processes and objects in the system that are considered as untrusted sources of information, i.e., λ. All the information flows originating from a taint source are labeled or tagged and then its use is dynamically tracked using the taint propagation rules [30]. Taint propagation rules define how to propagate tags into the output information flows when tagged flows are used with untagged flows at the system processes and objects. Finally the tagged flows undergo security analysis at dynamically generated security check points called as taint sinks (traps) when an unauthorized use is observed. Tag propagation rules and tag check rules at the traps are defined by systems security experts and often called as the security policy of the DIFT. When a tagged (suspicious) information flow is verified for its unauthorized use at a trap process, DIFT marks it as a malicious flow and trace back to its entry point in the system to terminate the victimized process. Although DIFT is a

Stochastic Dynamic Information Flow Tracking Game

423

widely accepted detection mechanism against APTs, during analysis there are chances of generating false alarms (false-positives) and false-negatives [7,28]. In our DIFT model, we consider a DIFT architecture with these features.

4

Problem Formulation: DIFT vs. APT Game

In this section, we model a two-player stochastic game between APTs (PA ) and DIFT (PD ). Let N be the number of nodes in the IFG, G = (VG , EG ), M be the number of stages of the attack, and λ ⊂ VG denote the set of possible entry points for the APT. We introduce a virtual-node s0 and a set of out-going edges E0 from s0 to each node in the set λ to model the entry point of the APT in the system. The resulting IFG is referred to as Modified Information Flow Graph (MIFG), Gˆ = (VGˆ, EGˆ), where VGˆ = VG ∪ s0 and EGˆ = EG ∪ E0 . 4.1

System and Player Models

Consider a system equipped with a DIFT defense scheme (PD ) threatened by an APT (PA ). Recall, Dj ⊂ VG , for j = 1, . . . , M , denote the nodes in the IFG corresponding to the goals of the j th attack stage. We refer to the set Dj as the destination nodes of the APT (PA ) in the stage j. The objective of PA is to sequentially reach at least one node from each set Dj , for all j = 1, . . . , M , without getting detected by the DIFT (PD ). We model the multi-stage attack of the APT as a multi-stage dynamic game between PD and PA . The information flows originating from the set λ are tainted as suspicious flows by the DIFT. In order to detect the presence of the APTs and mitigate the threats imposed by APTs, DIFT inspects the tainted flows at certain processes and objects in the system (i.e., VG ). The cost of performing security analysis at a node of the IFG varies depending on the available resources (e.g., memory, processing power) and the intensity of the traffic (amount of benign and malicious information flows) at each node of the IFG. Hence, the objective of PD is to identify a set of nodes in the IFG to perform security analysis to prevent PA from successfully completing the attack while minimizing the resource cost. The game between the two players PA and PD unfolds in time t ∈ T := {1, 2, . . .} on the MIFG with player PA starting the game at time t = 1 from node s0 in the MIFG. As APTs are persistent attackers, we consider an infinite horizon game where PA is allowed to restart the attack whenever one of the following conditions are satisfied: 1. Player PA sequentially reach at least one node in set Dj , for all j = 1, . . . , M . 2. PA drops out of the game (abandoning the attack). 3. Player PD successfully detect PA before PA ’s objective is achieved. In case 1 the adversary chooses to continue the successful attack, case 2 the attacker abandons the current attack and starts a new attack, and case 3 the attacker starts a new attack as the defender detected the attacker in the current attack.

424

4.2

D. Sahabandu et al.

State Space of the Game

 Let S := {s0 } ∪ {VG × {1, . . . , j}} ∪ {{φ, τ } × {1, . . . , j}} = s0 , sj1 , . . . ,  sjN , φj , . . . , τ j , for all j ∈ {1, . . . , M }, represent the finite state space of the multi-stage game. s1 , . . . , sN indicate the nodes of the IFG, j denotes the stage of the game, and sji denotes a state where tagged flow is at node si in stage j. States φj and τ j are corresponding to PA dropping out of the game and PD successfully detecting PA , respectively, at stage j. Additionally, the states sM i with si ∈ DM ⊂ VG , where i = 1, . . . , N , are associated with PA achieving the final goal. Let the random variable s¯t denote the state of the game at time t ∈ T := {1, 2, . . .}. Let AA and AD denote the action spaces of players PA and PD , respectively. At each time t ∈ T , PA and PD simultaneously take the actions at ∈ AA and dt ∈ AD , respectively. The action sets of players PA and PD at any state s ∈ S, AA (s) and AD (s), can be defined as five cases. Case (i): when s = s1i , where si ∈ λ. Then, AA (s) = {s1i : (si , si ) ∈ EG } ∪ {∅}, adversary can select an action to move to a node si in stage one and AD (s) = 0, defender is not allowed to trap tagged flows. Case (ii): when s = sji , where si ∈ Dj . Then, AA (s) = {s1i : (si , si ) ∈ EG } ∪ {∅} which illustrates player PA choosing to transition to one of the out-neighboring node of si in stage j and action ∅ represents PA dropping out of the game. Also, AD (s) ∈ {0, 1}, where the action dt = “0” and dt = “1” denote PD deciding not to trap tagged flows and deciding to trap tagged flows, respectively. Case (iii): when s = sji , where si ∈ Dj for j = 1, . . . , M − 1. Then, AA (s) = {sj+1 } which represents APT i traversing from stage j of the attack to stage j + 1 and AD (s) = 0. Case (iv): when s = s0 , AA (s) = {s1i : si ∈ λ} and AD (s) = {0}. Case (v): when s = {φ, τ } ∪ {sM : si ∈ DM }. Then, AA (s) = s0 which captures the ability i of APT to comeback and relaunch the attack in the system and AD (s) = 0. We assume state transitions are stationary, i.e., state at time t + 1, s¯t+1 depends only on the current state s¯t and the actions at and dt of both players at the state s¯t . The stationary transition probability from a given state s to a state s under actions a ∈ AA (s) and d ∈ AD (s) is defined as follows: st+1 = s |¯ st = s, at = a, dt = d}, p(s |s, a, d) := P{¯

(1)

for all s, s ∈ S, where p : S × S × AA × AD → [0, 1]. Note that, for given s¯t = s, at = a, and dt = d, the state s¯t+1 = s is assumed to be conditionally independent  of all other random variables at a time less than or equal to t. Further, s ∈S p(s |s, a, d) = 1. Moreover, p(s |s, a, d) = 1 in the following two cases: (i) when d = 0 and a = s = AD (s) and (ii) when d = 1, a = ∅ and s = φ. Consider a state s¯t = s with defender’s action dt = 1 and adversary’s action at = ∅. Then 0 < p(s |s, a, d) = p(s |s, a, 1) < 1 due to the false negatives of the DIFT1 . Let fn (sji ) denote the rate of false negatives generated at node si 1

When d = 1, p(s |s, a, d) is expected to be strictly 1 with s = τ for any choice of a = ∅ if DIFT has no false negative rates. In this paper we assume that DIFT has some nonzero rate of false negatives.

Stochastic Dynamic Information Flow Tracking Game

425



of the IFG in stage j. Then for a state s¯t = sji , action at = sji and dt = 1 the possible next state s¯t+1 are   sji with probability fn (sji ), s¯t+1 = (2) τ with probability 1 − fn (sji ). False negatives of the DIFT scheme arise from the limitations of the security rules that can be deployed at each node of the IFG (i.e., processes and objects in the system). Such limitations are due to variations in the number of rules and the depth of the security analysis2 (e.g., system call level trace, CPU instruction level trace) that can be implemented at each node of the IFG resulting from the resource constraints including memory, storage and processing power imposed by the system on each IFG node. We assume both players know the current state of the game. But PD is unaware whether an incoming tagged (suspicious) flow is malicious or not and PA does not know the chances of getting trapped at different processes and objects in the system. This results in an asymmetry on the information possessed by each players. Hence the game is an imperfect information game. Furthermore, both players are unaware of the transition probabilities which depend on the rate of false negatives generated at the different nodes of the IFG (Eq. (2)). Consequently, player’s do not know the payoff structure of the game and hence the game is an incomplete information game. 4.3

Policies and Payoffs of the Players

Players PA and PD decide their actions based on the current state of the game. Thus, the policies of the players we consider here are restricted to stationary policies3 . At any time t ∈ T , PA and PD select an action from the action set AA and AD , respectively, based on some probability distribution. Hence, the player policies are stochastic policies4 . A stochastic stationary policy of player PA is given by pA : S → [0, 1]|AA | and of player PD is given by pD : S → [0, 1]|AD | . The set of all stationary, stochastic policies of PA and PD are denoted by pA and pD , respectively. Let P(pA , pD ) represent the state transition matrix of the game resulting from pA ∈ pA , pD ∈ pD . Then, P(pA , pD ) = [p(s |s, pA , pD )]s,s ∈S , where p(s |s, pA , pD ) =





p(s |s, a, d)pA (s, a)pD (s, d).

a∈AA (s) d∈AD (s) 2 3 4

Detecting an unauthorized use of tagged flow crucially depends on the path traversed by the information flow. A policy of a player is called as stationary, if the player’s decision of choosing an action in any state s ∈ S is invariant with respect to the time of visit to s [10]. In contrast, a policy pA ∈ pA (pD ∈ pD ) is said to be a pure or deterministic policy if for all s ∈ S, the entries of vectors pD and pA belong to the set {0, 1}.

426

D. Sahabandu et al.

Next define rA (pA , pD ) (rD (pA , pD )) to be the expected reward vector of PA (PD ) under the policies pA ∈ pA and pD ∈ pD . Then for k ∈ {A, D} let   rk (pA , pD ) = rk (s, pA , pD ) where rk (s, pA , pD ) =



s∈S

 T = rk (s0 , pA , pD ), . . . , rk (τ M , pA , pD ) ,

(3)

rk (s, a, d)pA (s, a)pD (s, d) for each s ∈ S. Further-

a∈AA (s) d∈AD (s)

more, rA (s, a, d) and rD (s, a, d) are defined as follows. ⎧ j αA if s = τ j ⎪ ⎪ ⎪ ⎨ β j if s = sj : s ∈ D i j A i rA (s, a, d) = j j ⎪ σ if s = φ A ⎪ ⎪ ⎩ 0 otherwise ⎧ j ⎪ αD if s = τ j ⎪ ⎪ ⎪ j ⎪ ⎪ if s = sji : si ∈ Dj ⎨ βD j rD (s, a, d) = σD if s = φj ⎪ ⎪ ⎪ CD (s) if s = sji , si ∈ {Dj ∪ λ} and d = 1 ⎪ ⎪ ⎪ ⎩0 otherwise

(4)

(5)

Here rk (s, a, d) gives the reward associated with player Pk , where k ∈ {A, D}, when PA selects action a and PD choose action d at the state s. Reward for PA at a state s when each player chooses their respective actions a ∈ AA (s) and j d ∈ AD (s), rA (s, a, d), consists of three components (i) penalty term αA < 0 th 1 if the APT is detected by the defender in the j stage where αA ≤ . . . ≤ M αA , (ii) reward term βAj > 0 for APT reaching a destination of stage j, for j = 1, . . . , M , where βA1 ≤ . . . ≤ βAM , and (iii) penalty term σAj < 0 for APT dropping out of the game in the j th stage where σA1 ≤ . . . ≤ σAM . On the other j hand rD (s, a, d) consists of four components (i) reward term αD > 0 for defender 1 M detecting the APT in the j th stage where αD ≥ . . . ≥ αD , (ii) penalty term j βD < 0 for APT reaching a destination of stage j, for j = 1, . . . , M , where j 1 M βD ≥ . . . ≥ βD , (iii) reward σD > 0 for APT dropping out of the game in the j th 1 M stage where σD ≥ . . . ≥ σD , and (iv) a security cost CD (s) < 0 associated with performing security checks on tagged flows at a state s = sji : si ∈ {Dj ∪ λ}, i.e., cost of choosing node si ∈ {Dj ∪ λ} in stage j ∈ {1, . . . , M } as a tag sink. Remark 1. Security cost, CD (s), at a state, s = {sji : si ∈ {Dj ∪ λ}}, consists of two components: 1. Resource cost for performing security checks on tagged benign flows. Let c1 (j) denote the fixed resource cost incurred to the system if security check is done for tagged information flows reaching each node si in stage j. Define pf (sji ) to be the fraction of tagged flows through node si in stage j. Then the resource cost associated with performing security analysis at a state s can be written as c1 (j)pf (sji ).

Stochastic Dynamic Information Flow Tracking Game

427

2. Cost of false alarms or false positives (cost of identifying a tagged benign flow as a malicious flow). Let fp (sji ) be the false positive rate if DIFT performs security analysis at a node si in the stage j. Also, let c2 (j) denote the fixed cost associated with generation of false alarms in the system at stage j, i.e., if a tagged benign flows at node si during stage j triggers a false alarm. Then the expected cost of false alarm at a state s is given by fp (sji )c2 (j)pf (sji ). Hence, CD (s) = c1 (j)pf (sji ) + fp (sji )c2 (j)pf (sji ). Let the payoffs of players PA and PD at time t be denoted as UA (t) and UD (t), respectively. Then E (UA (t)) and E (UD (t)) characterize the expected s0 ,pA ,pD

s0 ,pA ,pD

payoffs to players PA and PD , respectively,  at time t. Moreover, for k ∈ {A, D}, E (Uk (t)) = Pt (pA , pD )rk (pA , pD ) s0 . Next we define the discounted value

s0 ,pA ,pD

(vk (s0 , pA , pD )), i.e., discounted expected payoff, of the game for players (k ∈ {A, D}) under the initial state s0 and stationary player policies pA and pD . vk (s0 , pA , pD ) =

∞  t=0

 γ

t

E

s0 ,pA ,pD

 (Uk (t))

(6)

Here γ ∈ [0, 1) denote the discount factor of the game. Notice that smaller (close to ‘0’) γ in Eq. (6) implies players are more interested in short-term rewards while larger values (close to ‘1’) mean players are more concerned about long-term rewards. However, APTs are specifically designed to launch long-term stealthy attacks on victim systems. Hence we highlight here that using γ → 1 is essential for approximately capturing the true long-term aspects of the game between APT and DIFT. Please refer to Remark 2 for more information on the choice of using discounting to define the values of players. Let Γ denote the game between PA and PD . Note that Γ is a non-cooperative game where both players try to maximize their respective payoffs defined in Eq. (6). The differences of the components in rA (s, a, d) and rD (s, a, d) makes vA (s0 , pA , pD ) = −vD (s0 , pA , pD ) generally. Hence Γ is a nonzero-sum stochastic game. Moreover, due to the discount factor, γ, used in vk (s0 , pA , pD ), Γ is a discounted stochastic game. Finally let ρ(Γ ) define the set of parameters associated with Γ . Game paramj j j eters include c1 (j), c2 (j), pf (sji ), CD (sji ), fn (sji ), fp (sji ), αA , βAj , σAj , αD , σD and j βD for i ∈ {1, . . . , N } and j ∈ {1, . . . , M }. Some of the game parameters such as fn (sji ) are difficult to know a priori. In Sect. 6 we discuss how our proposed algorithm successfully tackles such scenarios where players lack knowledge on some parameters in ρ(Γ ) (specifically the transition probability structure of the game). Moreover, in Sect. 7 we provide details on extracting cost parameters (e.g., CD (sji )) from system log data and estimating reward/penalty parameters.

428

5

D. Sahabandu et al.

Solution Concept

In this section, we first introduce the notion of equilibrium considered in the paper. Then we briefly discuss the computation of equilibrium policies in infinite-horizon stochastic games. We use the solution concept of Nash equilibrium (NE) to analyze Γ. A stochastic stationary strategy profile (p_A*, p_D*) is an NE if

    v_A(s_0, p_A, p_D*) ≤ v_A(s_0, p_A*, p_D*) for all p_A ∈ p_A,    (7)
    v_D(s_0, p_A*, p_D) ≤ v_D(s_0, p_A*, p_D*) for all p_D ∈ p_D.    (8)

Remark 2. Computation of an NE in infinite-horizon stochastic games depends on the payoff evaluation criterion used by the players. The discounted reward criterion and the limiting average (also referred to as the undiscounted reward criterion, see Footnote 5) are the two payoff evaluation procedures widely used in stochastic games. The existence of a Nash equilibrium for a general two-player, nonzero-sum, undiscounted, infinite-horizon stochastic game under stationary policies remains an open problem. [29] provides an illustrative example of an undiscounted stochastic game where no stationary strategy exists for both players even when a weaker notion of NE is considered. However, Nash equilibria of discounted stochastic games under stationary policies are well studied in the literature (see Proposition 5.1), and there exists a nonlinear program that can be used to calculate stationary NE policies (see Problem 5.2). The following proposition provides the existence of a Nash equilibrium for two-player nonzero-sum discounted stochastic games.

Proposition 5.1 ([10], Chap. 3, Theorem 3.8.1). Every nonzero-sum discounted stochastic game has at least one Nash equilibrium point in stationary strategies.

Let the value vector of a player k ∈ {A, D} be denoted by v_k = [v_k(s)]_{s∈S} = [v_k(s_0) . . . v_k(τ^M)]^T, where each entry represents the expected payoff of the k-th player at some state s ∈ S. We present the following nonlinear program (NLP), adopted from [10], to characterize NE in Γ.

Problem 5.2. Consider the following NLP with the optimization variable z = (v_A, v_D, p_A, p_D):

    min  Σ_{k∈{A,D}} 1^T [ v_k − r_k(p_A, p_D) − γ P(p_A, p_D) v_k ]

Footnote 5: The undiscounted value of the game for player k ∈ {A, D} under the initial state s_0 and stationary player policies p_A and p_D is defined by v_k(s_0, p_A, p_D) = lim_{T→∞} (1/(T+1)) Σ_{t=0}^{T} E_{s_0,p_A,p_D}(U_k(t)).


Subject to:
1. (p_A, p_D) ∈ p_A × p_D,
2. R_A(s) p_D^T(s) + γ L(s, v_A) p_D^T(s) ≤ v_A(s) 1_{m_A(s)},  s ∈ S,
3. p_A(s) R_D(s) + γ p_A(s) L(s, v_D) ≤ v_D(s) 1^T_{m_D(s)},  s ∈ S,

where R_k(s) = [r_k(s, a, d)]_{a∈A_A(s), d∈A_D(s)}, with m_A(s) = |p_A(s)| and m_D(s) = |p_D(s)| denoting the number of actions allowed for P_A and P_D at a state s ∈ S, respectively, and 1_i is a column vector of all ones of size i. Moreover, for a given state s and a value vector v_k of a player k ∈ {A, D},

    L(s, v_k) = [ Σ_{s'∈S} p(s'|s, a, d) v_k(s') ]_{a∈A_A(s), d∈A_D(s)}.
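To make the structure of Problem 5.2 concrete, the sketch below evaluates the objective ψ(z) for given value vectors and policies. The dense reward and transition tensors and their shapes are our own illustrative assumptions; this is not an implementation of the solver in [10].

```python
import numpy as np

def psi(v_A, v_D, R_A, R_D, P, p_A, p_D, gamma):
    """Objective of Problem 5.2:
    psi(z) = sum_k 1^T [v_k - r_k(p_A, p_D) - gamma * P(p_A, p_D) v_k].
    R_k[s, a, d] are reward tensors, P[s, a, d, s'] the transition tensor,
    and p_A[s, a], p_D[s, d] stationary policies (illustrative shapes)."""
    # policy-averaged rewards and transition matrix
    r_A = np.einsum('sad,sa,sd->s', R_A, p_A, p_D)
    r_D = np.einsum('sad,sa,sd->s', R_D, p_A, p_D)
    P_pi = np.einsum('sadt,sa,sd->st', P, p_A, p_D)
    val = 0.0
    for v, r in ((v_A, r_A), (v_D, r_D)):
        val += np.sum(v - r - gamma * P_pi @ v)
    return val   # equals 0 at an NE by Theorem 3.8.2 in [10]
```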

In Problem 5.2, Condition 1 ensures that (p_A, p_D) is a valid stochastic stationary policy pair of Γ, i.e., that it satisfies the basic probability conditions: for all s ∈ S, Σ_{a∈A_A(s)} p_A(s, a) = 1 with p_A(s, a) ≥ 0, and Σ_{d∈A_D(s)} p_D(s, d) = 1 with p_D(s, d) ≥ 0. Conditions 2 and 3 ensure that a valid policy pair (p_A, p_D) is an NE of Γ. Let the objective function of Problem 5.2 be denoted by ψ(z). Theorem 3.8.2 in [10] states that ψ(z) = 0 at an NE. However, if the underlying nonzero-sum game has a local minimum which is not a global minimum, then the NLP presented in Problem 5.2 can get stuck at a local minimum, which prevents ψ(z) from reaching 0 [23]. Hence, in such cases the solution of Problem 5.2 will not converge to an NE. A Reinforcement Learning (RL) algorithm named ON-SGSP is proposed in [23] to solve the NLP in Problem 5.2; it is guaranteed to converge to an NE when the following assumption is satisfied.

Assumption 1. In Γ, the Markov chain induced by the state transition matrix P(p_A, p_D) is irreducible and positive recurrent under all possible player policies p_A ∈ p_A and p_D ∈ p_D.

Let the current state be s̄_t and let a_t (d_t) be the action chosen by P_A (P_D). Let v_A^t(s̄_t) (v_D^t(s̄_t)), ζ_A^t(s̄_t, a_t) (ζ_D^t(s̄_t, d_t)) and p_A^t(s̄_t, a_t) (p_D^t(s̄_t, d_t)) denote the discounted value, the estimated gradient of ψ(z) and the probability of action a_t ∈ A_A(s̄_t) (d_t ∈ A_D(s̄_t)) at time t for player P_A (P_D), respectively. Then the following set of update equations captures the core steps of the ON-SGSP algorithm in [23].

Value updates:
    v_A^{t+1}(s̄_t) = v_A^t(s̄_t) + c(t) ε̂_A(s̄_t, s̄_{t+1}, a_t, d_t)
    v_D^{t+1}(s̄_t) = v_D^t(s̄_t) + c(t) ε̂_D(s̄_t, s̄_{t+1}, a_t, d_t)

Gradient estimations:
    ζ_A^{t+1}(s̄_t, a_t) = ζ_A^t(s̄_t, a_t) + c(t) [ Σ_{k∈{A,D}} ε̂_k(s̄_t, s̄_{t+1}, a_t, d_t) − ζ_A^t(s̄_t, a_t) ]
    ζ_D^{t+1}(s̄_t, d_t) = ζ_D^t(s̄_t, d_t) + c(t) [ Σ_{k∈{A,D}} ε̂_k(s̄_t, s̄_{t+1}, a_t, d_t) − ζ_D^t(s̄_t, d_t) ]

Policy updates:
    p_A^{t+1}(s̄_t, a_t) = Π( p_A^t(s̄_t, a_t) − b(t) p_A^t(s̄_t, a_t) |ε̂_A(s̄_t, s̄_{t+1}, a_t, d_t)| sgn(−ζ_A^{t+1}(s̄_t, a_t)) ),
    p_D^{t+1}(s̄_t, d_t) = Π( p_D^t(s̄_t, d_t) − b(t) p_D^t(s̄_t, d_t) |ε̂_D(s̄_t, s̄_{t+1}, a_t, d_t)| sgn(−ζ_D^{t+1}(s̄_t, d_t)) ),    (9)

where
    ε̂_A(s̄_t, s̄_{t+1}, a_t, d_t) = r_A(s̄_t, a_t, d_t) + γ v_A^t(s̄_{t+1}) − v_A^t(s̄_t),
    ε̂_D(s̄_t, s̄_{t+1}, a_t, d_t) = r_D(s̄_t, a_t, d_t) + γ v_D^t(s̄_{t+1}) − v_D^t(s̄_t),    (10)

for all s̄_t ∈ S and t ∈ T. Here, the function sgn(x) denotes a continuous version of the sign/signum function: it maps any x outside of a very small interval around 0 to +1 or −1 depending on the sign of x [23]. The function Π(·) projects the policy p_A^t(s̄_t) (p_D^t(s̄_t)) onto the simplex defined by p_A^t(s̄_t, a_t) > 0 for all a_t ∈ A_A(s̄_t) (p_D^t(s̄_t, d_t) > 0 for all d_t ∈ A_D(s̄_t)) and Σ_{a_t∈A_A(s̄_t)} p_A^t(s̄_t, a_t) = 1 (Σ_{d_t∈A_D(s̄_t)} p_D^t(s̄_t, d_t) = 1). The terms b(t) and c(t) denote the two time-scale learning rates/step sizes of the RL algorithm, which satisfy lim sup_{t→∞} b(t)/c(t) → 0. This implies that the value update and the gradient estimation occur on a faster time scale compared to the policy update. Further, b(t) and c(t) satisfy the standard step-size conditions required for the convergence of two time-scale algorithms [4], namely Σ_{t=0}^{∞} b(t) = Σ_{t=0}^{∞} c(t) = ∞ and Σ_{t=0}^{∞} b²(t), Σ_{t=0}^{∞} c²(t) < ∞.
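The update equations above can be condensed into a single model-free learning step. The sketch below follows the spirit of Eqs. (9)-(10), with an explicitly written simplex projection standing in for Π(·); the data structures, the use of a discrete sign function, and the projection details are our own assumptions rather than the exact construction of [23].

```python
import numpy as np

def project_simplex(p, eps=1e-6):
    """Euclidean projection onto the probability simplex, then clipped away
    from the boundary so every action keeps strictly positive mass (the role
    of Pi in Eq. (9)).  Standard sort-based projection; details are ours."""
    u = np.sort(p)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(p) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    q = np.maximum(p - theta, eps)
    return q / q.sum()

def on_sgsp_step(st, at, dt, st1, rA, rD, vA, vD, zA, zD, pA, pD,
                 gamma, b_t, c_t):
    """One model-free update in the spirit of Eqs. (9)-(10): TD errors, value
    and gradient-estimate updates on the fast time scale c(t), projected
    policy updates on the slow time scale b(t).  vA/vD map states to floats,
    zA/zD and pA/pD map states to per-action arrays (assumed layout)."""
    # TD (Bellman) error estimates, Eq. (10)
    eA = rA + gamma * vA[st1] - vA[st]
    eD = rD + gamma * vD[st1] - vD[st]
    # value updates (fast time scale)
    vA[st] += c_t * eA
    vD[st] += c_t * eD
    # gradient estimates of psi(z)
    zA[st][at] += c_t * ((eA + eD) - zA[st][at])
    zD[st][dt] += c_t * ((eA + eD) - zD[st][dt])
    # projected policy updates (slow time scale), only the played action moves
    gA = np.zeros_like(pA[st]); gA[at] = abs(eA) * np.sign(-zA[st][at])
    gD = np.zeros_like(pD[st]); gD[dt] = abs(eD) * np.sign(-zD[st][dt])
    pA[st] = project_simplex(pA[st] - b_t * pA[st] * gA)
    pD[st] = project_simplex(pD[st] - b_t * pD[st] * gD)
```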

The construction of the update equations in Eq. (9) is based on the Bellman error. The following equations define the Bellman error ε_A(s, a, p_D) (ε_D(s, d, p_A)) of P_A (P_D) at state s ∈ S when action a ∈ A_A(s) (d ∈ A_D(s)) is used while P_D (P_A) follows the policy p_D (p_A):

    ε_A(s, a, p_D) = Σ_{d∈A_D(s)} [ r_A(s, a, d) + γ Σ_{s'∈S} p(s'|s, a, d) v_A(s') ] p_D(s, d) − v_A(s),
    ε_D(s, d, p_A) = Σ_{a∈A_A(s)} [ r_D(s, a, d) + γ Σ_{s'∈S} p(s'|s, a, d) v_D(s') ] p_A(s, a) − v_D(s).    (11)

Notice that when the policies converge, each ε̂_k(s̄_t, s̄_{t+1}, a_t, d_t) = 0 for both k ∈ {A, D}. In fact, ε̂_k(·) in Eq. (9) is the stochastic approximation of ε_k(·) in Eq. (11) [24]. This implies that when the update equations converge, there is no Bellman error. In deriving the above update equations, the NLP given in Problem 5.2 is decomposed into a set of subproblems associated with each state s ∈ S and each player k ∈ {A, D}. The objective of this set of subproblems is to ensure that there is no Bellman error. In the update equations, the values of the players at each state are first updated using value iteration. Then the policies are updated in the descent direction using the gradient estimates of the objective function ψ(z), to ensure convergence to a set of points that satisfy p_A(s, a) ε_A(s, a, p_D) = 0 and p_D(s, d) ε_D(s, d, p_A) = 0 for all a ∈ A_A(s), d ∈ A_D(s) and s ∈ S. Also notice that the update equations operate in a model-free setting, i.e., free of the transition probabilities p(s'|s, a, d). We refer the reader to [23] for the proof of convergence of the ON-SGSP algorithm.

6 Algorithm to Calculate Equilibrium Policies of the Players

In this section we first identify a set of policies (p̂_A, p̂_D) ⊂ (p_A, p_D) in Γ that violate Assumption 1. Then we prove that (p̂_A, p̂_D) does not form an NE in Γ. We also show that if the ON-SGSP algorithm returns a policy in (p̂_A, p̂_D) at some iteration t, then the algorithm will terminate with such a policy. In order to avoid such policies (p̂_A, p̂_D), we modify the ON-SGSP algorithm to solve for the stationary Nash equilibrium policies of Γ.

First we denote by p_k|_{Ŝ⊂S} a policy of player k restricted to a set of states Ŝ ⊂ S. Then we identify a specific set of stochastic stationary policies in Γ as follows.

Definition 6.1. Define the set of policies (p̂_A, p̂_D) ⊂ (p_A, p_D) that satisfy the following conditions for a set of states Ŝ ⊂ S. For a pair (p̂_A, p̂_D) in this set:
1. p̂_A|_{Ŝ⊂S} induces at least one cycle in the restricted state space Ŝ.
2. p̂_D|_{Ŝ⊂S} = 0_{|Ŝ|}, where 0_{|Ŝ|} represents the all-zeros vector of length |Ŝ|.

Notice that a policy pair (p̂_A, p̂_D) induces a recurrent class consisting of states in Ŝ and a transient class with states s ∈ {S \ Ŝ}. Hence (p̂_A, p̂_D) does not satisfy Assumption 1. In the next theorem we prove three important properties of a policy pair (p̂_A, p̂_D): in particular, we show that (p̂_A, p̂_D) does not form an NE in Γ and that it prevents the ON-SGSP algorithm in [23] from converging to an NE.

Theorem 6.2. A policy pair (p̂_A, p̂_D) satisfies the following properties:
1. It does not form an NE in Γ.
2. It prevents the ON-SGSP algorithm from converging to an NE.
3. Let v_k^t|_{Ŝ⊂S} denote the value vector at time instance t of player k restricted to a set of states Ŝ ⊂ S. Then v_k^t|_{Ŝ} = 0_{|Ŝ|} for k ∈ {A, D}.


Proof. 1. Let p_T denote the probability of P_A getting detected by P_D. Assume P_D performs security analysis at a state s ∈ Ŝ with arbitrarily small probability, i.e., 0 < p_D(s, d = 1) ≪ 1. [...]

Algorithm 6.1.
1: Input: Number of iterations (T > 0), initial value vectors (v_A^0, v_D^0) with v_A^0(s) = 0 and v_D^0(s) = 0 for all s ∈ S, initial gradient values (ζ_A^0, ζ_D^0), and initial policies (p_A^0, p_D^0) with (p_A^0, p_D^0) ∉ (p̂_A, p̂_D).
2: Output: Equilibrium policies (p_A, p_D) ← (p_A^T, p_D^T) and equilibrium payoff vectors (v_A, v_D) ← (v_A^T, v_D^T).
3: Initialization: t ← 1, (v_A^t, v_D^t) ← (v_A^0, v_D^0), (ζ_A^t, ζ_D^t) ← (ζ_A^0, ζ_D^0), (p_A^t, p_D^t) ← (p_A^0, p_D^0), s̄_t ← s_0 and Ŝ ← ∅
4: while t ≤ T do
5:   if s̄_t = s_0 then
6:     Ŝ ← ∅
7:   else
8:     if s̄_t ∉ Ŝ then
9:       Ŝ ← Ŝ ∪ {s̄_t}
10:    else
11:      if v_A^t|_Ŝ = v_D^t|_Ŝ = 0_{|Ŝ|} and conditions 1, 2 in Definition 6.1 hold then
12:        t ← 1, s̄_t ← s_0, (v_A^t|_Ŝ, v_D^t|_Ŝ) ← (v_A^0|_Ŝ, v_D^0|_Ŝ) and (p_A^t|_Ŝ, p_D^t|_Ŝ) ← (p_A^0|_Ŝ, p_D^0|_Ŝ)
13:      end if
14:    end if
15:  end if
16:  P_A and P_D simultaneously play a_t from p_A^t(s̄_t) and d_t from p_D^t(s̄_t) at s̄_t
17:  Next state s̄_{t+1} is revealed
18:  P_A and P_D observe their respective rewards r_A(s̄_t, a_t, d_t) and r_D(s̄_t, a_t, d_t)
19:  Define
       ε̂_A(s̄_t, s̄_{t+1}, a_t, d_t) = r_A(s̄_t, a_t, d_t) + γ v_A^t(s̄_{t+1}) − v_A^t(s̄_t)
       ε̂_D(s̄_t, s̄_{t+1}, a_t, d_t) = r_D(s̄_t, a_t, d_t) + γ v_D^t(s̄_{t+1}) − v_D^t(s̄_t)
20:  Value updates:
       v_A^{t+1}(s̄_t) = v_A^t(s̄_t) + c(t) ε̂_A(s̄_t, s̄_{t+1}, a_t, d_t)
       v_D^{t+1}(s̄_t) = v_D^t(s̄_t) + c(t) ε̂_D(s̄_t, s̄_{t+1}, a_t, d_t)
21:  Gradient estimations:
       ζ_A^{t+1}(s̄_t, a_t) = ζ_A^t(s̄_t, a_t) + c(t) [ Σ_{k∈{A,D}} ε̂_k(s̄_t, s̄_{t+1}, a_t, d_t) − ζ_A^t(s̄_t, a_t) ]
       ζ_D^{t+1}(s̄_t, d_t) = ζ_D^t(s̄_t, d_t) + c(t) [ Σ_{k∈{A,D}} ε̂_k(s̄_t, s̄_{t+1}, a_t, d_t) − ζ_D^t(s̄_t, d_t) ]
22:  Policy updates:
       p_A^{t+1}(s̄_t, a_t) = Π( p_A^t(s̄_t, a_t) − b(t) p_A^t(s̄_t, a_t) |ε̂_A(s̄_t, s̄_{t+1}, a_t, d_t)| sgn(−ζ_A^{t+1}(s̄_t, a_t)) )
       p_D^{t+1}(s̄_t, d_t) = Π( p_D^t(s̄_t, d_t) − b(t) p_D^t(s̄_t, d_t) |ε̂_D(s̄_t, s̄_{t+1}, a_t, d_t)| sgn(−ζ_D^{t+1}(s̄_t, d_t)) )
23:  t ← t + 1
24: end while
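Line 11 is the only non-standard step of Algorithm 6.1. The sketch below shows one possible way to test it, under assumed data structures (a visited set, per-state value estimates, the defender's per-state probability of performing a security check, and a successor map from the IFG); it is not the authors' implementation.

```python
def should_reset(S_hat, vA, vD, pA_support, pD_check, succ, tol=1e-9):
    """Sketch of the reset test in line 11 of Algorithm 6.1: the learned values
    on the set of visited states S_hat are still zero, the defender places
    (numerically) zero probability on performing a security check anywhere in
    S_hat, and every attacker action that carries probability mass leads back
    into S_hat, so the attacker policy induces a cycle inside S_hat
    (Definition 6.1).  pD_check[s] is the probability of checking at s,
    pA_support[s] the attacker actions with positive mass at s, and
    succ[(s, a)] the successor set of action a at s -- assumed representations."""
    zero_vals = all(abs(vA[s]) < tol and abs(vD[s]) < tol for s in S_hat)
    no_checks = all(pD_check[s] < tol for s in S_hat)
    stays_inside = all(succ[(s, a)] <= S_hat
                       for s in S_hat for a in pA_support[s])
    return zero_vals and no_checks and stays_inside
```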

7 Simulations

In this section, we test Algorithm 6.1 on a real-world nation state attack dataset recorded using the Refinable Attack INvestigation (RAIN) architecture [16]. RAIN records system-call events during the runtime of the underlying host computer system. We show that our algorithm successfully converges to an NE of the APT vs. DIFT game Γ corresponding to the IFG extracted from the nation state attack dataset, for three different values of the discount factor γ.


The nation state attack was conducted by the US DARPA red team during an evaluation of the RAIN architecture. The attack runs through three consecutive days, but we use only the day-one system log data for our analysis. The goal of the first-day attack is to exfiltrate sensitive information from a company by establishing a backdoor in a victim computer system. During the first day of the attack we observe four attack stages. The adversary enters the system through a spear phishing attack that leads the victim to a website hosting ads from a malicious web server. The adversary then exploits a vulnerability in the Firefox browser in the first stage of the attack. Next, the adversary fingerprints the compromised system to detect running processes and network information during stage two. In stage three the adversary writes a malicious program to disk. Lastly, in the fourth and final stage, the adversary establishes a backdoor to continuously exfiltrate the company's sensitive data from the victim computer system.

The initial IFG converted from the attack data resulted in a coarse-grained graph with 132,000 nodes and approximately 2 million edges. The coarse graph captures all system data during the recording time, which includes the attack-related data and a large amount of data related to the system's background processes (noise). Hence the coarse graph provides very little security-sensitive (attack-related) information about the underlying system. We pruned the coarse graph to extract the security-sensitive information about the system from the log data [16]. The first step of the pruning uses the upstream, downstream and point-to-point stream techniques presented in [16]. Then we further combine object nodes (e.g., files, net-flow objects) that belong to the same directories or that use the same network socket. The resulting pruned information flow graph consists of 30 nodes (N = 30), out of which 8 nodes are identified as attack destination nodes corresponding to the 4 stages (M = 4) of the day-one nation state attack. One node related to a net-flow object has been identified as the entry point used for the attack (|λ| = 1).

Setting up the reward parameters of the players at each stage depends on the criticality of the resources compromised at that stage. In the nation state attack the adversary, P_A, gathers information in the intermediate stages that is critical to achieving the final goal. For example, the fingerprinting done at stage two of the attack helps the adversary establish a backdoor into the victim system. In order to capture the adversary gaining the information needed for its final goal while passing through each stage of the attack, we set the reward parameters of the players to β_A^1 = 10, β_A^2 = 20, β_A^3 = 50, β_A^4 = 120, β_D^1 = −10, β_D^2 = −20, β_D^3 = −50, β_D^4 = −120, α_A^j = σ_A^j = −200 for j ∈ {1, . . . , 4}, and α_D^j = σ_D^j = 200 for j ∈ {1, . . . , 4}.

Figure 1 shows the convergence of the discounted values v_A^t(s_0) (Fig. 1(a)) and v_D^t(s_0) (Fig. 1(b)) of P_A and P_D for three discount factors, γ = 0.55, 0.75 and 0.95. In all cases considered, Algorithm 6.1 converges to an NE of Γ in finite time under the IFG of the nation state attack dataset and the aforementioned reward values. Furthermore, the results show that the defender achieves better payoff values at the NE when γ → 1. This implies that Algorithm 6.1 can provide a good estimate of the NE of Γ when γ → 1.


Fig. 1. Convergence of the discounted values v_A^t(s_0) and v_D^t(s_0) of P_A and P_D for three discount factors γ = 0.55, 0.75 and 0.95. At t = 1, (v_A^0, v_D^0) = (10, 10) was used as the initial values for the players, and the initial player policies (p_A^0, p_D^0) were set to uniform distributions over the actions at each state.

8 Conclusion

Advanced Persistent Threats (APTs) are sophisticated, strategic, and stealthy attacks consisting of multiple stages that unfold continuously over a long period of time to achieve a specific malicious objective. In this paper, we presented an analytical model that captures the system-level interactions between APTs and DIFT, enabling a resource-efficient mechanism that can detect and prevent the threats posed by APTs. We modeled the interaction between the APT and DIFT as a two-player, nonzero-sum, imperfect-information, infinite-horizon stochastic game. The game consists of multiple stages, where each stage corresponds to a stage in the attack. The transition probabilities of the stochastic game, which depend on the effectiveness of the defense mechanism (DIFT), e.g., its false-negative rates, are assumed to be unknown. We adopted a model-free reinforcement learning approach and proposed a two-time-scale algorithm that converges to a Nash equilibrium (NE) of the APT vs. DIFT stochastic game. The proposed algorithm utilizes the structure of the game and is based on the two-time-scale algorithm in [23]. In order to evaluate the performance of the proposed method, we conducted an experimental analysis of the model and the algorithm on nation state attack data obtained from the Refinable Attack INvestigation (RAIN) [16] framework. In future work, we plan to analyze the average-reward game model in contrast to the discounted game considered in this paper. Further, we also plan to extend the attacker model to a multi-attacker case.

References
1. Alpcan, T., Başar, T.: An intrusion detection game with limited observations. In: International Symposium on Dynamic Games and Applications (2006)
2. Amir, R.: Stochastic games in economics and related fields: an overview. In: Neyman, A., Sorin, S. (eds.) Stochastic Games and Applications. ASIC, vol. 570, pp. 455–470. Springer, Dordrecht (2003). https://doi.org/10.1007/978-94-010-0189-2_30


3. Bencsáth, B., Pék, G., Buttyán, L., Felegyhazi, M.: The cousins of Stuxnet: Duqu, Flame, and Gauss. Future Internet 4(4), 971–1003 (2012)
4. Borkar, V.S.: Stochastic approximation with two time scales. Syst. Control Lett. 29(5), 291–294 (1997)
5. Bowling, M., Veloso, M.: Rational and convergent learning in stochastic games. In: International Joint Conference on Artificial Intelligence, vol. 17, no. 1, pp. 1021–1026 (2001)
6. Brogi, G., Tong, V.V.T.: TerminAPTor: highlighting advanced persistent threats through information flow tracking. In: IFIP International Conference on New Technologies, Mobility and Security, pp. 1–5 (2016)
7. Clause, J., Li, W., Orso, A.: Dytan: a generic dynamic taint analysis framework. In: International Symposium on Software Testing and Analysis, pp. 196–206 (2007)
8. Enck, W., et al.: TaintDroid: an information-flow tracking system for realtime privacy monitoring on smartphones. ACM Trans. Comput. Syst. 32(2), 1–5 (2014)
9. Falliere, N., Murchu, L.O., Chien, E.: W32.Stuxnet Dossier. White paper, Symantec Corp., Security Response, vol. 5, no. 6, p. 29 (2011)
10. Filar, J., Vrieze, K.: Competitive Markov Decision Processes. Springer, New York (2012)
11. Greenwald, A., Hall, K., Serrano, R.: Correlated Q-learning. In: International Conference on Machine Learning (ICML), vol. 3, pp. 242–249 (2003)
12. Hu, J., Wellman, M.P.: Nash Q-learning for general-sum stochastic games. J. Mach. Learn. Res. 4, 1039–1069 (2003)
13. Huang, L., Zhu, Q.: Adaptive strategic cyber defense for advanced persistent threats in critical infrastructure networks. ACM SIGMETRICS Perform. Eval. Rev. 46(2), 52–56 (2019)
14. Jang-Jaccard, J., Nepal, S.: A survey of emerging threats in cybersecurity. J. Comput. Syst. Sci. 80(5), 973–993 (2014)
15. Jee, K., Kemerlis, V.P., Keromytis, A.D., Portokalidis, G.: ShadowReplica: efficient parallelization of dynamic data flow tracking. In: ACM SIGSAC Conference on Computer & Communications Security, pp. 235–246 (2013)
16. Ji, Y., et al.: RAIN: refinable attack investigation with on-demand inter-process information flow tracking. In: ACM SIGSAC Conference on Computer and Communications Security, pp. 377–390 (2017)
17. Lye, K.W., Wing, J.M.: Game strategies in network security. Int. J. Inf. Secur. 4(1–2), 71–86 (2005)
18. Moothedath, S., et al.: A game theoretic approach for dynamic information flow tracking to detect multi-stage advanced persistent threats. ArXiv e-prints arXiv:1811.05622, November 2018
19. Moothedath, S., Sahabandu, D., Clark, A., Lee, S., Lee, W., Poovendran, R.: Multi-stage dynamic information flow tracking game. In: Bushnell, L., Poovendran, R., Başar, T. (eds.) GameSec 2018. LNCS, vol. 11199, pp. 80–101. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01554-1_5
20. Newsome, J., Song, D.: Dynamic taint analysis: automatic detection, analysis, and signature generation of exploit attacks on commodity software. In: Network and Distributed Systems Security Symposium (2005)
21. Nguyen, K.C., Alpcan, T., Başar, T.: Stochastic games for security in networks with interdependent nodes. In: International Conference on Game Theory for Networks, pp. 697–703 (2009)
22. Nightingale, E.B., Peek, D., Chen, P.M., Flinn, J.: Parallelizing security checks on commodity hardware. ACM SIGPLAN Not. 43(3), 308–318 (2008)


23. Prasad, H., Prashanth, L.A., Bhatnagar, S.: Two-timescale algorithms for learning Nash equilibria in general-sum stochastic games. In: International Conference on Autonomous Agents and Multiagent Systems, pp. 1371–1379 (2015)
24. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 400–407 (1951)
25. Sahabandu, D., et al.: A game theoretic approach for dynamic information flow tracking with conditional branching. In: American Control Conference (ACC) (2019, to appear)
26. Sahabandu, D., Xiao, B., Clark, A., Lee, S., Lee, W., Poovendran, R.: DIFT games: dynamic information flow tracking games for advanced persistent threats. In: IEEE Conference on Decision and Control (CDC), pp. 1136–1143 (2018)
27. Sayin, M.O., Hosseini, H., Poovendran, R., Başar, T.: A game theoretical framework for inter-process adversarial intervention detection. In: Bushnell, L., Poovendran, R., Başar, T. (eds.) GameSec 2018. LNCS, vol. 11199, pp. 486–507. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01554-1_28
28. Suh, G.E., Lee, J.W., Zhang, D., Devadas, S.: Secure program execution via dynamic information flow tracking. ACM SIGPLAN Not. 39(11), 85–96 (2004)
29. Vieille, N.: Two-player stochastic games II: the case of recursive games. Israel J. Math. 119(1), 93–126 (2000)
30. Vogt, P., Nentwich, F., Jovanovic, N., Kirda, E., Kruegel, C., Vigna, G.: Cross site scripting prevention with dynamic data tainting and static analysis. In: Network & Distributed System Security Symposium, pp. 1–12 (2007)
31. Watkins, B.: The impact of cyber attacks on the private sector, pp. 1–11 (2014)
32. Zhu, Q., Başar, T.: Robust and resilient control design for cyber-physical systems with an application to power systems. In: IEEE Decision and Control and European Control Conference (CDC-ECC), pp. 4066–4071 (2011)
33. Zhu, Q., Tembine, H., Başar, T.: Network security configurations: a nonzero-sum stochastic game approach. In: American Control Conference (ACC), pp. 1059–1064 (2010)

Adversarial Attacks on Continuous Authentication Security: A Dynamic Game Approach

Serkan Sarıtaş¹, Ezzeldin Shereen¹, Henrik Sandberg², and György Dán¹

¹ Division of Network and Systems Engineering, KTH Royal Institute of Technology, SE-10044 Stockholm, Sweden, {saritas,eshereen,gyuri}@kth.se
² Division of Decision and Control Systems, KTH Royal Institute of Technology, SE-10044 Stockholm, Sweden, [email protected]

Abstract. Identity theft through phishing and session hijacking attacks has become a major attack vector in recent years, and is expected to become more frequent due to the pervasive use of mobile devices. Continuous authentication based on the characterization of user behavior, both in terms of user interaction patterns and usage patterns, is emerging as an effective solution for mitigating identity theft, and could become an important component of defense-in-depth strategies in cyber-physical systems as well. In this paper, the interaction between an attacker and an operator using continuous authentication is modeled as a stochastic game. In the model, the attacker observes and learns the behavioral patterns of an authorized user whom it aims at impersonating, whereas the operator designs the security measures to detect suspicious behavior and to prevent unauthorized access while minimizing the monitoring expenses. It is shown that the optimal attacker strategy exhibits a threshold structure, and consists of observing the user behavior to collect information at the beginning, and then attacking (rather than observing) after gathering enough data. From the operator’s side, the optimal design of the security measures is provided. Numerical results are used to illustrate the intrinsic trade-off between monitoring cost and security risk, and show that continuous authentication can be effective in minimizing security risk.

Keywords: Continuous authentication · Dynamic stochastic game · Markov decision process

This work was partly funded by the Swedish Civil Contingencies Agency (MSB) through the CERCES project and has received funding from the European Institute of Innovation and Technology (EIT). This body of the European Union receives support from the European Union's Horizon 2020 research and innovation programme.

1 Introduction

Online identity theft and session hijacking are widely used for performing cyberattacks against online payment systems. As tools for performing identity theft and session hijacking are becoming widely available, the incidence of such attacks is expected to rise in the future. Furthermore, with the proliferation of bring your own device (BYOD) policies, identity theft and session hijacking could be an important attack vector in compromising not only online transactions but also critical infrastructures. Addressing these attacks is thus crucial in mitigating advanced persistent threats (APTs).

Continuous authentication based on behavioral authentication is emerging as a promising technology for detecting identity theft and session hijacking. Continuous authentication typically relies on a machine learning model trained on recorded user input, e.g., movement patterns of pointing devices, keystroke patterns, or transaction characteristics, which is used for detecting anomalous user input in real time [2,3]. User input that is classified as anomalous is typically rejected, and may result in the need for user re-authentication. Clearly, a high incidence of false positives is detrimental to the usability of the system, and thus it should be kept low. A lower false positive rate at the same time implies a higher false negative rate, i.e., a lower probability of detection. Finding the optimal parameters for continuous authentication is thus a challenging problem, especially if continuous authentication is used in combination with other solutions for incident detection, such as intrusion detection services (IDS).

In this paper we address this problem. We formulate a model of a system that uses an IDS and continuous authentication for mitigating APTs. We then formulate the optimization problems faced by the attacker and by the defender as a dynamic leader-follower game. We characterize the optimal attack strategy, and show that it has a threshold structure. We then provide a characterization of the impact of the parameters of continuous authentication and of the IDS on the cost of the defender, so as to facilitate their joint optimization. We provide numerical results to illustrate the attacker strategy and the impact of the defender's strategy on the attacker's expected cost.

The rest of the paper is organized as follows. After presenting the related literature in Sect. 2, the problem formulation is provided in Sect. 3. The optimal attack and defense strategies are discussed in Sects. 4 and 5, respectively. In Sect. 6, we provide numerical examples and comparative analyses. Section 7 concludes the paper.

2 Background and Related Work

Continuous authentication has received increasing attention lately both from industry and academia. Authors in [3] demonstrated the use of keystroke dynamics, mouse movements, and application usage for continuously authenticating users on workstations. Their results showed that keystroke dynamics proved to be the best indicator of user identity. Continuous authentication for smartphone users and users of other wearable electronic devices was considered recently in [2], based on behavioral information of touch gestures like pressure, location, and timing. Authors in [8] demonstrated the potential of using other behavioral information, like hand movement, orientation and grasp (HMOG) information, for continuously authenticating mobile users. Similarly, authors in [7] demonstrated continuous authentication for wearable glasses, such as Google Glass. Authors in [4] showed that car owners or office workers could be continuously authenticated by sensors on their seats. Similar ideas have been proposed for military and battlefield applications, for continuously authenticating soldiers by their weapons and suits [1].

Related to our work are previous works that used game theoretic approaches for modeling network security problems and for proposing security solutions. Cooperative authentication in Mobile Ad-hoc Networks (MANETs) was considered in [11], where many selfish mobile nodes need to cooperate in authenticating messages from other nodes while not sacrificing their location privacy. In [9], a game was used to model the process of physical layer authentication in wireless networks, where the defender adjusts its detection threshold in hypothesis testing while the attacker adjusts how often it attacks. The problem of secret (password) picking and guessing was modeled in [6] as a game between a defender (the picker) and an attacker (the guesser). Slightly similar to our work is [10], where the authors consider a game between monitoring nodes and monitored nodes in wireless sensor networks, in which the monitoring nodes decide the duration of behavioral monitoring, and the monitored nodes decide when to cooperate and when not to cooperate. Nonetheless, to the best of our knowledge, our work is the first to propose a game theoretic approach to security risk management considering continuous authentication.

3 Problem Formulation

We consider a system that consists of an organization that maintains a corporate network (e.g., a critical infrastructure operator), an employee u of the organization that uses resources on the corporate network, and an attacker denoted by a. Our focus is on the interaction between the organization and the attacker, which we model as a dynamic discrete stochastic game with imperfect information. Following common practice in game theoretic models of security, we assume that the attacker is aware of the strategy of the defender (operator), while the defender (operator) is not aware of the actions taken by the attacker over time, and hence of the attacker's knowledge. In this section, we first describe the system model, then define the actions of the operator and the attacker.

3.1 User Behavior

For ease of exposition, we consider that time is slotted, and use t for indexing time-slots. We focus on a user u that interacts with the operator's resources (e.g., servers, control systems, etc.) through generating data traffic, and focus on one resource (r) for ease of exposition. We denote by Λ_u(t) the amount of traffic generated by user u in time-slot t, and we assume it is Poisson distributed with parameter λ_u (this is equivalent to the common assumption that arrivals can be modeled by a Poisson process with intensity λ_u/ι, where ι is the length of the time-slot). The successful interaction of the user with the resource in a time-slot generates an immediate reward v_r for the operator.

3.2 Intrusion Detection and Continuous Authentication

We consider that the operator maintains or buys an intrusion detection service (IDS) in its infrastructure. Motivated by state-of-the-art IDSs, we consider that the intrusion detection service detects anomalous behavior in hosts and in the network; detection thus requires attacker activity. A detection by the IDS is followed by an investigation by a security threat analyst, which implies that a potential attacker would be detected and eliminated. We denote by m the per time-slot operation cost of the IDS, which determines its ability to detect an attacker (e.g., m determines the number of security threat analysts that can be hired), as discussed later.

In addition to the IDS, in order to mitigate identity theft, e.g., through session hijacking and remote access tool-kits, the operator uses continuous authentication (also referred to as behavioral authentication) for verifying that the traffic received from user u is indeed generated by user u. Behavioral authentication is based on a characterization of the user behavior, e.g., through training a machine learning model. For simplicity, we consider that the user behavior can be described by a Gaussian distribution B_u ~ N(b_u, σ_u), with mean b_u and variance σ_u. While this model is admittedly simple, it allows for analytical tractability.

We consider that continuous authentication is used on a per time-slot basis, that is, the user behavior during the time-slot is verified at the end of every time-slot, and a decision is made based on the match between the user behavior model and the actual behavior of the user during the corresponding time-slot. If the user fails the test then the user is blocked from accessing the resources. We assume that for an appropriate cut-off point c, the test result is positive if B_u > c, and negative otherwise. Note that even if there is no attacker, the user could be blocked due to a false positive (FP). We denote by η_u the false positive rate of the continuous authentication security system. This is equivalent to saying that the system applies a detection threshold of c = Φ_u^{-1}(1 − η_u), where Φ_u is the cumulative distribution function (CDF) of B_u.

Thus, without an attacker, the system S can be in two different states (see Footnote 1): the blocking state (BL) and the unblocking state (UB). In state BL, the user cannot interact with the resources and hence cannot generate the reward v_r, while in state UB, it is authorized to interact with the resources and thus it can generate the reward v_r. If the user fails continuous authentication in a time-slot, then the state of the system switches from UB to BL. Note that this could happen due to an FP or due to a true positive (TP), i.e., input generated by an attacker, as discussed later. Furthermore, to allow productivity, we consider that a user that is blocked in time-slot t is unblocked in time-slot t + 1 with probability q, i.e., Pr(S(t + 1) = UB | S(t) = BL) = q.

The above assumptions imply that if there is no input from the user in time-slot t and the system was in state UB, then it will stay in state UB, as no false alarm is generated in the absence of user activity. Hence, without an attacker, we can model the continuous authentication security system as a discrete time Markov chain with state space {UB, BL}, and the state transition probabilities are

    Pr(S(t+1) = UB | S(t) = UB) = e^{−λ_u} + (1 − e^{−λ_u})(1 − η_u) =: P_uu,
    Pr(S(t+1) = BL | S(t) = UB) = η_u (1 − e^{−λ_u}) = 1 − P_uu,
    Pr(S(t+1) = UB | S(t) = BL) = q,
    Pr(S(t+1) = BL | S(t) = BL) = 1 − q.    (1)

Footnote 1: In the case of an attacker, a third state AD (attacker is detected) is introduced in Sect. 3.4.
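A small sketch of the no-attacker dynamics in Eq. (1), useful for sanity-checking parameter choices; the numbers are arbitrary.

```python
import numpy as np

def no_attacker_chain(lam_u, eta_u, q):
    """Two-state chain {UB, BL} of Eq. (1) when no attacker is present.
    Returns the transition matrix with state order [UB, BL]."""
    P_uu = np.exp(-lam_u) + (1 - np.exp(-lam_u)) * (1 - eta_u)
    return np.array([[P_uu, 1 - P_uu],
                     [q,    1 - q   ]])

P = no_attacker_chain(lam_u=2.0, eta_u=0.01, q=0.8)
# stationary distribution pi with pi P = pi and sum(pi) = 1
A = np.vstack([P.T - np.eye(2), np.ones((1, 2))])
pi = np.linalg.lstsq(A, np.array([0.0, 0.0, 1.0]), rcond=None)[0]
print(P)
print(pi)  # long-run fraction of time in [UB, BL] for the legitimate user
```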

3.3 Attack Model

Motivated by recent security incidents caused by identity theft and session hijacking, we consider an attacker that compromises a system component at cost C_a, e.g., the user's computer, which allows it to observe the traffic generated by user u and to craft packets that appear to originate from user u. We refer to observing the user traffic as listening, and to crafting packets as attacking in the following. In addition, the attacker can decide not to do anything during a time-slot, which we refer to as waiting. Consequently, in every time-slot, the attacker can choose between three actions: wait (l(t) = 0, a(t) = 0), listen (l(t) = 1, a(t) = 0), and attack (l(t) = 0, a(t) = 1), where l(t) = 1 stands for listening and a(t) = 1 stands for attacking. The purpose of listening is to collect behavioral information about the user, so as to learn to imitate legitimate user behavior that would pass continuous authentication. The purpose of attacking is to execute a rogue command on the resource, but in order for the attack to be successful, the system has to be in state UB and the attacker-generated input has to pass continuous authentication. If in time-slot t the attack is successful, then the attacker obtains an immediate reward c_r, which is a penalty for the defender. Motivated by the fact that many attacks have a monetary reward, we consider that the future reward of the attacker is discounted by a discount factor ρ. In what follows we first define the actions of the attacker at time-slot t, then we provide expressions for the attacker's reward in Sect. 4.2.

Listening (l(t) = 1, a(t) = 0). The attacker observes the behavior of the user during a time-slot in order to learn it and imitate the user for a successful attack. Learning during time-slot t is determined by the traffic Λ_u(t) generated by the user and by the learning rate γ. The total amount of observation of the attacker about the user until time-slot t can be expressed as L(t) = Σ_{τ=0}^{t−1} 1_{l(τ)=1} Λ_u(τ), where 1_{D} is the indicator function of an event D. At the same time, since listening requires activity from the attacker, the IDS could detect the attacker in the time-slot. We denote by δ_l(m) the probability that the IDS detects the attacker in a time-slot when it is listening. We make the reasonable assumption that δ_l(m) is a concave function of m, δ_l(0) = 0 and lim_{m→∞} δ_l(m) = 1, where m is the per time-slot operation cost of the IDS, as defined previously.

Attacking (l(t) = 0, a(t) = 1). The attacker generates and sends rogue input to the resource, trying to impersonate the legitimate user. How well the attacker can imitate the user depends on the amount of observation L(t) that it has collected about the user. We consider that given L(t) amount of information the attacker can generate input following a Gaussian distribution, B̂_u(L(t)) ~ N(b̂_u(L(t)), σ̂_u(L(t))), where b̂_u(L(t)) = b_u(1 + e^{−γL(t)}) and σ̂_u(L(t)) = σ_u(1 + e^{−γL(t)}). Since the user behavior is a Gaussian r.v., B_u ~ N(b_u, σ_u), and B̂_u ~ N(b̂_u, σ̂_u) is the random variable generated by the attacker, we can use the binormal method [5] for expressing the Receiver Operating Characteristic (ROC) curve of the continuous authentication security system as

    ROC(η_u, L(t)) = Φ(a + b Φ^{-1}(η_u)),    (2)

where η_u is the FP rate, Φ(·) is the CDF of the standard normal distribution, a = (b̂_u(L(t)) − b_u)/σ̂_u(L(t)), and b = σ_u/σ̂_u(L(t)). Note that ROC(η_u, L(t)) is the TP rate of the detector (i.e., the conditional probability of classifying rogue input as such). By inspecting (2), and substituting ω = L(t), we observe that

    ROC(η_u, ω) = Φ(a + b Φ^{-1}(η_u)) = Φ( b_u/σ_u − (b_u − σ_u Φ^{-1}(η_u)) / (σ_u (1 + e^{−γω})) ) = Φ(ξ_ω),

where ξ_ω denotes the argument of Φ(·) above.
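The TP rate Φ(ξ_ω) can be evaluated directly. The sketch below does so with SciPy, using illustrative parameter values; it only reproduces the closed form above, not any training of a behavioral model.

```python
import numpy as np
from scipy.stats import norm

def tp_rate(eta_u, omega, b_u, sigma_u, gamma):
    """TP rate of continuous authentication against attacker-crafted input:
    ROC(eta_u, omega) = Phi(xi_omega) from Eq. (2) after the substitution."""
    xi = b_u / sigma_u - (b_u - sigma_u * norm.ppf(eta_u)) / (sigma_u * (1 + np.exp(-gamma * omega)))
    return norm.cdf(xi)

# the more the attacker has observed (larger omega), the lower the TP rate
for omega in (0, 5, 20, 100):
    print(omega, round(tp_rate(eta_u=0.01, omega=omega, b_u=3.0, sigma_u=1.0, gamma=0.1), 3))
```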

Normally, one would expect that as the number of observations increases, the attacker can imitate the real behavior of the user more successfully, i.e., its input is harder to distinguish from real user input. Hence, we can safely assume that ROC(η_u, ω) = Φ(ξ_ω) should be a non-increasing function of ω, or equivalently, it must hold that b_u ≥ σ_u Φ^{-1}(η_u). Similar to listening, attacking requires activity from the attacker, and thus the IDS could detect the attacker in the time-slot. We denote by δ_a(m) the probability that the IDS detects the attacker in a time-slot when it is attacking. Similar to δ_l(m), we assume that δ_a(m) is concave, δ_a(0) = 0 and lim_{m→∞} δ_a(m) = 1.

Waiting (l(t) = 0, a(t) = 0). If the attacker chooses to wait, it neither learns nor attacks; hence it cannot be detected, but it cannot learn or obtain a reward either.

3.4 Continuous Authentication Game

We can now informally introduce the continuous authentication game. In the game the defender (the operator) is the leader, and chooses a defense strategy


(m, ηu ). The defense strategy is known to the attacker, i.e., the follower, who in turn decides whether or not to invest in compromising the system at cost Ca , and if it decides to compromise the system, in every time-slot it decides whether to wait, listen, or attack. The game ends when the attacker is detected (AD) by the IDS, i.e., when S(t) = AD. AD is thus an absorbing state. The attacker is interested in maximizing its utility (reward), while the operator is interested in maximizing its average utility. In what follows we formulate the Markov decision process (MDP) faced by the attacker and the optimization problem faced by the defender.

4 Optimal Attack Strategy

We start by describing the state space and the state transitions as a function of the attacker's policy and of the defender's strategy. We then derive the optimal attack policy for a given defender strategy.

4.1 States and Actions

In order to formulate the MDP faced by the attacker, observe that the state of the system from the perspective of the attacker depends on whether the system is blocked (S(t) = BL) or unblocked (S(t) = UB), and on the amount of observations L(t) it has collected so far. Clearly, the state transition probabilities are affected by the actions of the attacker; hence the optimization problem faced by the attacker can be formulated as an MDP. In the following we provide the state transition probabilities, depending on the action chosen by the attacker.

Waiting (l(t) = 0, a(t) = 0). When waiting, the attacker does not observe the user traffic, neither does it attempt to attack; hence the state transition probabilities are determined by the FP rate in state UB, and by unblocking in state BL. As depicted in Fig. 1(a), the state transition probabilities are thus

    Pr(S(t+1) = UB, L(t+1) = ω | S(t) = UB, L(t) = ω, l(t) = 0, a(t) = 0) = P_uu,
    Pr(S(t+1) = BL, L(t+1) = ω | S(t) = UB, L(t) = ω, l(t) = 0, a(t) = 0) = 1 − P_uu,
    Pr(S(t+1) = UB, L(t+1) = ω | S(t) = BL, L(t) = ω, l(t) = 0, a(t) = 0) = q,
    Pr(S(t+1) = BL, L(t+1) = ω | S(t) = BL, L(t) = ω, l(t) = 0, a(t) = 0) = 1 − q.

Although seemingly unimportant, waiting is preferred by the attacker when the system is in state BL: since the user cannot interact with the resources, the attacker cannot increase its total amount of observation about the user, and thus listening is not an optimal action for the attacker due to the possibility of being detected. Similarly, when the system is in state BL, an attack would be blocked, but the attacker could be detected. Therefore, when the system is in state BL, the attacker prefers waiting. Since the attacker is completely passive while waiting, the IDS cannot detect the attacker. Therefore, practically,


Fig. 1. State transitions and corresponding probabilities when the attacker is (a) waiting and (b) listening.

as depicted in Figs. 1(a), (b) and 2, it is not possible to switch to state AD from state BL.

Listening (l(t) = 1, a(t) = 0). If listening, the attacker can be detected by the IDS with probability δ_l(m). If the user does not generate traffic (i.e., N = 0), then an FP cannot be triggered, but the attacker's amount of observation does not change. On the contrary, if the user generates traffic (i.e., N ≥ 1), then the attacker can observe and learn, as long as the user-generated traffic does not cause an FP. Thus, as depicted in Fig. 1(b), the transition probabilities are

    Pr(S(t+1) = AD | S(t) = UB, L(t) = ω, l(t) = 1, a(t) = 0) = δ_l(m),
    Pr(S(t+1) = UB, L(t+1) = ω | S(t) = UB, L(t) = ω, l(t) = 1, a(t) = 0) = (1 − δ_l(m)) e^{−λ_u},
    Pr(S(t+1) = UB, L(t+1) = ω + N | S(t) = UB, L(t) = ω, l(t) = 1, a(t) = 0) = (1 − δ_l(m))(1 − η_u) e^{−λ_u} λ_u^N / N!,  for N = 1, 2, . . . ,
    Pr(S(t+1) = BL, L(t+1) = ω | S(t) = UB, L(t) = ω, l(t) = 1, a(t) = 0) = (1 − δ_l(m)) η_u (1 − e^{−λ_u}).

Attacking (l(t) = 0, a(t) = 1). If attacking, the attacker can be detected by the IDS with probability δ_a(m). If the attacker is not detected, then for a successful attack the attacker-generated input must pass continuous authentication (a false negative) and the user traffic must not cause an FP. If either of these does not hold, the system switches to state BL. As depicted in Fig. 2, we thus have


    Pr(S(t+1) = AD | S(t) = UB, L(t) = ω, l(t) = 0, a(t) = 1) = δ_a(m),
    Pr(S(t+1) = UB, L(t+1) = ω | S(t) = UB, L(t) = ω, l(t) = 0, a(t) = 1) = (1 − δ_a(m)) P_uu (1 − Φ(ξ_ω)),
    Pr(S(t+1) = BL, L(t+1) = ω | S(t) = UB, L(t) = ω, l(t) = 0, a(t) = 1) = (1 − δ_a(m)) (1 − P_uu (1 − Φ(ξ_ω))).
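The listening and attacking kernels from a state (UB, ω) can be enumerated explicitly. The sketch below does so, truncating the Poisson series for listening at an arbitrary n_max; the dictionary representation of successor states is our own convention.

```python
import numpy as np
from scipy.stats import poisson

def listen_transitions(omega, lam_u, eta_u, delta_l, n_max=20):
    """Successor distribution from (UB, omega) when the attacker listens
    (Sect. 4.1); the Poisson tail beyond n_max is dropped for illustration."""
    out = {('AD', None): delta_l,
           ('UB', omega): (1 - delta_l) * np.exp(-lam_u),
           ('BL', omega): (1 - delta_l) * eta_u * (1 - np.exp(-lam_u))}
    for n in range(1, n_max + 1):
        out[('UB', omega + n)] = (1 - delta_l) * (1 - eta_u) * poisson.pmf(n, lam_u)
    return out

def attack_transitions(omega, q_omega, delta_a):
    """Successor distribution from (UB, omega) when the attacker attacks;
    q_omega = P_uu * (1 - Phi(xi_omega)) is the success probability."""
    return {('AD', None): delta_a,
            ('UB', omega): (1 - delta_a) * q_omega,
            ('BL', omega): (1 - delta_a) * (1 - q_omega)}
```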

4.2 Attacker Reward as a Dynamic Programming Recursion

Fig. 2. State transitions and probabilities when the attacker chooses to attack.

Let the total observation of the attacker about the user at the beginning of time-slot t be L(t) = ω. Further, let us denote by J_t(L(t) = ω, S(t) = UB) and J_t(L(t) = ω, S(t) = BL) the total reward of the attacker starting from time-slot t when S(t) = UB and S(t) = BL, respectively. For notational convenience, we will use J(ω, UB) and J(ω, BL) henceforth (see Footnote 2). Clearly, the total reward of the attacker corresponds to J(0, UB). Then, depending on the states, actions and corresponding probabilities described above, and accounting for the discount factor ρ, the dynamic programming recursion of the attacker reward can be established as

    J(ω, BL) = ρ q J(ω, UB) + ρ (1 − q) J(ω, BL)  ⇒  J(ω, BL) = [ρq / (1 − ρ(1 − q))] J(ω, UB),    (3)

    J(ω, UB) = { ρ P_uu J(ω, UB) + ρ (1 − P_uu) J(ω, BL),                l = 0, a = 0,
               { ρ (1 − δ_l(m)) [ O(ω) + η_u (1 − e^{−λ_u}) J(ω, BL) ],   l = 1, a = 0,    (4)
               { ρ (1 − δ_a(m)) A(ω),                                     l = 0, a = 1,

where

    O(ω) = e^{−λ_u} J(ω, UB) + (1 − η_u) Σ_{n=1}^{∞} P(Λ_u = n) J(ω + Λ_u, UB),
    A(ω) = P_uu (1 − Φ(ξ_ω)) [ c_r/ρ + J(ω, UB) ] + (1 − P_uu (1 − Φ(ξ_ω))) J(ω, BL)
         = q_ω c_r/ρ + q_ω J(ω, UB) + (1 − q_ω) J(ω, BL),

with q_ω := P_uu (1 − Φ(ξ_ω)). Note that, since ROC(η_u, ω) = Φ(ξ_ω) is a non-increasing function of ω, the parameter q_ω is a non-decreasing function of ω.

Footnote 2: For ease of exposition, ω denotes L(t) = ω. Notice the time dependency of ω (even though it is not explicitly stated in the notation).


To simplify the expressions and to obtain structural insight, let us substitute (3) into (4); thus for l = 0 and a = 0 we obtain

    J(ω, UB) = ρ [P_uu (1 − ρ) + ρq] / (1 − ρ + ρq) · J(ω, UB).    (5)

This allows us to formulate the following proposition.

Proposition 1. Let ρ < 1. Then waiting cannot be optimal in state UB.

Proof. By (5), if waiting is to be optimal in state UB then its reward must be J(ω, UB) = 0, which cannot be optimal. □

As a consequence, if the system is in state UB then the attacker prefers either listening or attacking during the time-slot. On the contrary, waiting is the optimal action in state BL, as discussed in Sect. 4.1. Let us now consider that the attacker prefers listening in state UB. We can again substitute (3), to obtain for l = 1 and a = 0

    J(ω, UB) = ρ (1 − δ_l(m)) [ e^{−λ_u} J(ω, UB) + (1 − η_u) Σ_{n=1}^{∞} P(Λ_u = n) J(ω + Λ_u, UB) ]
                 + ρ (1 − δ_l(m)) η_u (1 − e^{−λ_u}) [ρq / (1 − ρ(1 − q))] J(ω, UB)
             = U J(ω, UB) + K_ω,    (6)

where
    U = ρ (1 − δ_l(m)) [ e^{−λ_u} + η_u (1 − e^{−λ_u}) ρq / (1 − ρ + ρq) ],
    K_ω = ρ (1 − δ_l(m)) (1 − η_u) Σ_{n=1}^{∞} P(Λ_u = n) J(ω + Λ_u, UB).

Using the same substitution, if the attacker prefers attacking in state UB, i.e., for l = 0 and a = 1, we obtain

    J(ω, UB) = ρ (1 − δ_a(m)) [ q_ω c_r/ρ + q_ω J(ω, UB) + (1 − q_ω) [ρq / (1 − ρ(1 − q))] J(ω, UB) ]
             = T_ω J(ω, UB) + C_ω,    (7)

where
    C_ω = (1 − δ_a(m)) q_ω c_r,
    T_ω = ρ (1 − δ_a(m)) [ q_ω (1 − ρ) + ρq ] / (1 − ρ + ρq).

Based on (6) and (7), the attacker reward in state UB can be expressed as

    J_ω := J(ω, UB) = { U J_ω + K_ω,    l = 1, a = 0        = { K_ω/(1 − U),    l = 1, a = 0,
                      { T_ω J_ω + C_ω,  l = 0, a = 1          { C_ω/(1 − T_ω),  l = 0, a = 1.

Note that, since the attacker aims for the maximum reward J_ω = max{ K_ω/(1 − U), C_ω/(1 − T_ω) }, her optimal policy must satisfy the following:

    J_ω = K_ω/(1 − U)   (l = 1, a = 0)   if K_ω/(1 − U) > C_ω/(1 − T_ω),
    J_ω = C_ω/(1 − T_ω) (l = 0, a = 1)   if K_ω/(1 − U) ≤ C_ω/(1 − T_ω).    (8)

Based on the above analysis, we can summarize the parameters of the attacker reward in (8) as follows.

    K_ω = ρ (1 − δ_l(m)) (1 − η_u) Σ_{n=1}^{∞} (e^{−λ_u} λ_u^n / n!) J_{ω+n},
    U = ρ (1 − δ_l(m)) [ e^{−λ_u} + η_u (1 − e^{−λ_u}) ρq / (1 − ρ + ρq) ],
    C_ω = (1 − δ_a(m)) q_ω c_r,
    T_ω = ρ (1 − δ_a(m)) [ q_ω (1 − ρ) + ρq ] / (1 − ρ + ρq),
    q_ω = P_uu (1 − ROC(η_u, ω)) = P_uu (1 − Φ(ξ_ω)),
    P_uu = e^{−λ_u} + (1 − e^{−λ_u})(1 − η_u) = 1 − η_u + η_u e^{−λ_u},
    ξ_ω = b_u/σ_u − (b_u − σ_u Φ^{-1}(η_u)) / (σ_u (1 + e^{−γω})).
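Since C_ω and T_ω depend only on known quantities, the attacking reward C_ω/(1 − T_ω) appearing in Eq. (8) can be evaluated in closed form. A sketch with illustrative parameter values:

```python
import numpy as np
from scipy.stats import norm

def attack_reward(omega, rho, q, c_r, delta_a, lam_u, eta_u, b_u, sigma_u, gamma):
    """Closed-form attacking reward C_omega / (1 - T_omega) from Eq. (8),
    using the parameter summary at the end of Sect. 4.2."""
    P_uu = 1 - eta_u + eta_u * np.exp(-lam_u)
    xi = b_u / sigma_u - (b_u - sigma_u * norm.ppf(eta_u)) / (sigma_u * (1 + np.exp(-gamma * omega)))
    q_om = P_uu * (1 - norm.cdf(xi))
    C = (1 - delta_a) * q_om * c_r
    T = rho * (1 - delta_a) * (q_om * (1 - rho) + rho * q) / (1 - rho + rho * q)
    return C / (1 - T)

# the attacking reward is non-decreasing in the amount of observation omega
for omega in (0, 10, 50, 200):
    print(omega, round(attack_reward(omega, rho=0.95, q=0.8, c_r=10.0, delta_a=0.05,
                                     lam_u=2.0, eta_u=0.01, b_u=3.0, sigma_u=1.0, gamma=0.05), 2))
```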

4.3 Listening Reward vs. Attacking Reward

Observe that, since q_ω is a non-decreasing function of ω, the attacking reward C_ω/(1 − T_ω) is a non-decreasing function of ω; i.e., more observation is always at least as good for the attacker. Thus, it is interesting for the attacker to analyze the advantage of attacking with more observation.

Proposition 2. Let L_ω := [C_{ω+1}/(1 − T_{ω+1})] / [C_ω/(1 − T_ω)] be the ratio between the attacking rewards of two consecutive amounts of observation. Then L_ω is a monotonically decreasing function of ω and lim_{ω→∞} L_ω = 1.

Proof. L_ω can be expanded as follows:

    L_ω = [ (1 − δ_a(m)) q_{ω+1} c_r / (1 − ρ(1 − δ_a(m)) [q_{ω+1}(1 − ρ) + ρq]/(1 − ρ + ρq)) ]
          / [ (1 − δ_a(m)) q_ω c_r / (1 − ρ(1 − δ_a(m)) [q_ω(1 − ρ) + ρq]/(1 − ρ + ρq)) ]
        = (q_{ω+1}/q_ω) · [1 − ρ(1 − δ_a(m)) (q_ω(1 − ρ) + ρq)/(1 − ρ + ρq)] / [1 − ρ(1 − δ_a(m)) (q_{ω+1}(1 − ρ) + ρq)/(1 − ρ + ρq)].    (9)

Since q_ω is a non-decreasing function of ω, i.e., q_{ω+1} ≥ q_ω, it can be obtained from (9) that L_ω ≥ 1. Furthermore, since lim_{ω→∞} q_ω = (1 − η_u + η_u e^{−λ_u})(1 − η_u), it holds that lim_{ω→∞} L_ω = 1. We note that dL_ω/dω < 0 can be proved analytically for a continuous extension of L_ω, i.e., assuming ω ∈ [0, ∞) rather than ω ∈ {0, 1, . . .}. □

In order to compare the listening and attacking rewards for some amount of observation ω, let us define the incremental observation gain

    χ_ω := [ρ (1 − δ_l(m)) (1 − η_u) / (1 − U)] Σ_{n=1}^{∞} (e^{−λ_u} λ_u^n / n!) Π_{i=0}^{n−1} L_{ω+i}.    (10)

Lemma 1. χ_ω is a decreasing function of ω. Furthermore, lim_{ω→∞} χ_ω < 1.

Proof. The first part of the lemma follows from the fact that L_ω is a decreasing function of ω. To prove the second part of the lemma, since lim_{ω→∞} L_ω = 1, observe that

    lim_{ω→∞} χ_ω = ρ (1 − δ_l(m)) (1 − η_u) (1 − e^{−λ_u}) / [ 1 − ρ(1 − δ_l(m)) ( e^{−λ_u} + η_u (1 − e^{−λ_u}) ρq/(1 − ρ + ρq) ) ] < 1. □

As a consequence of the above result we can state the following.

Corollary 1. If χ_{ω=0} > 1, then there exists a critical value ω̃ such that χ_{ω=ω̃−1} > 1 and χ_{ω=ω̃} ≤ 1. Otherwise, i.e., if χ_{ω=0} ≤ 1, then ω̃ = 0.

Note that ω̃ is independent of time and can be calculated (before the game play) for a given set of parameters (see Footnote 3). We are now ready to prove that the attacker policy is indeed a threshold policy.

Theorem 1. The attacker prefers listening over attacking for ω < ω̃.

Proof. In order to compare the listening reward K_ω/(1 − U) and the attacking reward C_ω/(1 − T_ω), observe that

    K_ω/(1 − U) = [ρ (1 − δ_l(m)) (1 − η_u)/(1 − U)] Σ_{n=1}^{∞} (e^{−λ_u} λ_u^n / n!) J_{ω+n}
                ≥ [ρ (1 − δ_l(m)) (1 − η_u)/(1 − U)] Σ_{n=1}^{∞} (e^{−λ_u} λ_u^n / n!) C_{ω+n}/(1 − T_{ω+n})
                = [C_ω/(1 − T_ω)] · [ρ (1 − δ_l(m)) (1 − η_u)/(1 − U)] Σ_{n=1}^{∞} (e^{−λ_u} λ_u^n / n!) Π_{i=0}^{n−1} L_{ω+i}
                = [C_ω/(1 − T_ω)] χ_ω.    (11)

Thus, listening is preferred over attacking when χ_ω > 1, i.e., for ω < ω̃, which proves the theorem. □

Footnote 3: The sum in (10) can be partitioned into Σ_{n=1}^{C} and Σ_{n=C+1}^{∞} for any arbitrary C, and χ_ω can be approximated from below by utilizing L_ω ≥ 1 in the latter. Then the corresponding ω̃ can be calculated accordingly.
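Following Corollary 1 and Footnote 3, ω̃ can be found by evaluating χ_ω until it drops to 1. The sketch below truncates the Poisson series at an arbitrary n_max (slightly underestimating χ_ω, in the spirit of Footnote 3); the parameter values in the example call are illustrative.

```python
import numpy as np
from scipy.stats import norm, poisson

def q_omega(omega, lam_u, eta_u, b_u, sigma_u, gamma):
    P_uu = 1 - eta_u + eta_u * np.exp(-lam_u)
    xi = b_u / sigma_u - (b_u - sigma_u * norm.ppf(eta_u)) / (sigma_u * (1 + np.exp(-gamma * omega)))
    return P_uu * (1 - norm.cdf(xi))

def chi(omega, rho, q, delta_l, delta_a, lam_u, eta_u, b_u, sigma_u, gamma, n_max=60):
    """Incremental observation gain chi_omega of Eq. (10), with a truncated
    Poisson series."""
    U = rho * (1 - delta_l) * (np.exp(-lam_u)
        + eta_u * (1 - np.exp(-lam_u)) * rho * q / (1 - rho + rho * q))
    def T(w):
        return rho * (1 - delta_a) * (q_omega(w, lam_u, eta_u, b_u, sigma_u, gamma) * (1 - rho) + rho * q) / (1 - rho + rho * q)
    def L(w):   # ratio of attacking rewards at w+1 and w, Eq. (9)
        qw, qw1 = (q_omega(x, lam_u, eta_u, b_u, sigma_u, gamma) for x in (w, w + 1))
        return (qw1 / qw) * (1 - T(w)) / (1 - T(w + 1))
    total, prod = 0.0, 1.0
    for n in range(1, n_max + 1):
        prod *= L(omega + n - 1)
        total += poisson.pmf(n, lam_u) * prod
    return rho * (1 - delta_l) * (1 - eta_u) / (1 - U) * total

def critical_omega(max_omega=10_000, **params):
    """Smallest omega with chi_omega <= 1 (Corollary 1); 0 if chi_0 <= 1."""
    for w in range(max_omega):
        if chi(w, **params) <= 1:
            return w
    return max_omega

print(critical_omega(rho=0.95, q=0.8, delta_l=0.02, delta_a=0.05,
                     lam_u=2.0, eta_u=0.01, b_u=3.0, sigma_u=1.0, gamma=0.05))
```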


Note that after the critical value, i.e., for ω ≥ ω̃, since χ_ω ≤ 1, we cannot compare the attacking and listening rewards based on (11).

4.4 Listening or Attacking (by Value Iteration)

Due to Theorem 1, listening is optimal for ω < ω̃. The optimal strategy, i.e., listening or attacking, for ω ≥ ω̃ is our focus in this part. Note that the attacker gets the immediate reward c_r only when an attack is successful, and gets a (discounted) reward from listening only on account of successful attacks in the future. Therefore, the attacker gets zero reward if she only listens, which implies that there must be some ω ≥ ω̃ at which attacking is optimal.

Since a backward induction through the Bellman optimality equations for the attacker reward is already established in (8), we are ready to apply the value iteration method to obtain the optimal attacker strategy for ω ≥ ω̃ (note that since ρ(1 − δ_l(m))(1 − η_u)/(1 − U) Σ_{n=1}^{∞} e^{−λ_u} λ_u^n / n! < 1, the Bellman update/operator in (8) is a contraction mapping, which guarantees the existence and the uniqueness of an optimal point that is achievable by the value iteration method).

Theorem 2. The attacker prefers attacking over listening for ω ≥ ω̃.

Proof. For the initial values of the rewards, we assign zero reward for every ω, i.e., J_ω^{(0)} = 0 ∀ω. Regarding the first iteration of the value updates, since

    J_{ω,L}^{(1)} = K_ω^{(0)}/(1 − U) = [ρ(1 − δ_l(m))(1 − η_u)/(1 − U)] Σ_{n=1}^{∞} (e^{−λ_u} λ_u^n / n!) J_{ω+n}^{(0)} = 0,    (12)

i.e., all listening rewards are zero, attacking is the optimal choice for every ω. Then J_{ω,L}^{(1)} = 0 and J_{ω,A}^{(1)} = C_ω/(1 − T_ω) hold ∀ω, which implies J_{ω,*}^{(1)} = C_ω/(1 − T_ω) ∀ω.

In the second iteration, the attacking rewards do not change, i.e., J_{ω,A}^{(2)} = C_ω/(1 − T_ω) ∀ω. Regarding the listening rewards, for every ω, similar to (11), we obtain

    J_{ω,L}^{(2)} = K_ω^{(1)}/(1 − U) = [ρ(1 − δ_l(m))(1 − η_u)/(1 − U)] Σ_{n=1}^{∞} (e^{−λ_u} λ_u^n / n!) C_{ω+n}/(1 − T_{ω+n})
               = [C_ω/(1 − T_ω)] · [ρ(1 − δ_l(m))(1 − η_u)/(1 − U)] Σ_{n=1}^{∞} (e^{−λ_u} λ_u^n / n!) Π_{i=0}^{n−1} L_{ω+i}
               = [C_ω/(1 − T_ω)] χ_ω.    (13)

In (13), if χ_ω ≤ 1, then J_{ω,L}^{(2)} ≤ C_ω/(1 − T_ω) = J_{ω,A}^{(2)}, which implies that attacking is the optimal strategy. Since lim_{ω→∞} χ_ω < 1 and χ_ω is a decreasing function of ω, after the critical value, i.e., for ω ≥ ω̃, attacking is always preferred over listening. Similarly, if χ_ω > 1, or equivalently if ω < ω̃, since J_{ω,L}^{(2)} > C_ω/(1 − T_ω) = J_{ω,A}^{(2)}, listening is preferred over attacking. Thus, at the end of the second iteration, the following holds regarding the value update J_{ω,*}^{(2)} and the corresponding strategy:

    J_{ω,L}^{(2)} = K_ω^{(1)}/(1 − U) ∀ω,   J_{ω,A}^{(2)} = C_ω/(1 − T_ω) ∀ω   ⇒   J_{ω,*}^{(2)} = J_{ω,L}^{(2)} for ω < ω̃,   J_{ω,*}^{(2)} = J_{ω,A}^{(2)} for ω ≥ ω̃.

In the third iteration, the attacking rewards are the same again, i.e., J_{ω,A}^{(3)} = C_ω/(1 − T_ω) ∀ω. For the listening rewards, since

    J_{ω,L}^{(3)} = K_ω^{(2)}/(1 − U) = [ρ(1 − δ_l(m))(1 − η_u)/(1 − U)] Σ_{n=1}^{∞} (e^{−λ_u} λ_u^n / n!) J_{ω+n,*}^{(2)}

holds, we have J_{ω,L}^{(3)} = J_{ω,L}^{(2)} for ω ≥ ω̃. Regarding ω < ω̃, observe the following:

    J_{ω,L}^{(3)} = [ρ(1 − δ_l(m))(1 − η_u)/(1 − U)] [ Σ_{n=1}^{ω̃−ω−1} (e^{−λ_u} λ_u^n / n!) J_{ω+n,L}^{(2)} + Σ_{n=ω̃−ω}^{∞} (e^{−λ_u} λ_u^n / n!) J_{ω+n,A}^{(2)} ]
               > [ρ(1 − δ_l(m))(1 − η_u)/(1 − U)] [ Σ_{n=1}^{ω̃−ω−1} (e^{−λ_u} λ_u^n / n!) J_{ω+n,A}^{(2)} + Σ_{n=ω̃−ω}^{∞} (e^{−λ_u} λ_u^n / n!) J_{ω+n,A}^{(2)} ]
               = J_{ω,L}^{(2)} > J_{ω,A}^{(2)} = J_{ω,A}^{(3)},

which implies that, for ω < ω̃, listening remains the optimal strategy, as in the previous iteration. Furthermore, the listening rewards are greater than or equal to the ones from the previous iteration, i.e., J_{ω,L}^{(3)} > J_{ω,L}^{(2)} for ω < ω̃ − 1 and J_{ω,L}^{(3)} = J_{ω,L}^{(2)} for ω = ω̃ − 1.

Note that the optimal attacker reward J_{ω,*} is obtained partially at each iteration. In particular, J_{ω,*} is obtained for ω ≥ ω̃ in the second iteration, J_{ω,*} is obtained for ω = ω̃ − 1 in the third iteration, J_{ω,*} can be obtained for ω = ω̃ − 2 in the fourth iteration, and so on. By iterating further in this way, we observe that listening is optimal for ω < ω̃ and attacking is optimal for ω ≥ ω̃, and we can obtain the optimal reward and strategy of the attacker in at most ω̃ + 2 iterations. Moreover, J_{ω,A}^{(n)} = C_ω/(1 − T_ω) and J_{ω,L}^{(n)} ≥ J_{ω,L}^{(n−1)} hold ∀ω (in particular, J_{ω,L}^{(n)} > J_{ω,L}^{(n−1)} for ω < ω̃ and J_{ω,L}^{(n)} = J_{ω,L}^{(n−1)} for ω ≥ ω̃). □

Based on the results and observations above, the optimal attack strategy can be summarized as follows:
$$ \begin{cases} \text{Waiting } (l(t)=0,\; a(t)=0) & \text{if } S(t)=\mathrm{BL},\ L(t)\ \text{arbitrary} \\ \text{Listening } (l(t)=1,\; a(t)=0) & \text{if } S(t)=\mathrm{UB},\ L(t) < \widehat{\omega} \\ \text{Attacking } (l(t)=0,\; a(t)=1) & \text{if } S(t)=\mathrm{UB},\ L(t) \ge \widehat{\omega} \end{cases} \qquad (14) $$
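The proof above also suggests a direct computational procedure. The following is a minimal value-iteration sketch of it in Python; all model quantities (λ_u, ρ, δ_l(m), η_u, U, and the reward terms C_ω, T_ω) are illustrative placeholders and not values from the model.

```python
import math

# Illustrative placeholders only; NOT taken from the paper's model.
LAM_U = 2.0                                   # rate of the Poisson observation process
RHO, DELTA_L, ETA_U, U = 0.9, 0.2, 0.1, 0.3
COEF = RHO * (1 - DELTA_L) * (1 - ETA_U) / (1 - U)
W_MAX, N_TAIL = 80, 40                        # truncation of omega and of the Poisson tail

def C(w): return 1.0 - math.exp(-0.05 * w)          # placeholder attack-reward term C_w
def T(w): return 0.6 * (1.0 - math.exp(-0.05 * w))  # placeholder continuation term T_w

def poisson(n):
    return math.exp(-LAM_U) * LAM_U ** n / math.factorial(n)

attack = [C(w) / (1 - T(w)) for w in range(W_MAX + N_TAIL)]  # J_{w,A}, fixed across iterations
J = [0.0] * (W_MAX + N_TAIL)                                 # J^(0)_w = 0 for all w
for _ in range(W_MAX + 2):                                   # at most w_hat + 2 sweeps are needed
    listen = [COEF * sum(poisson(n) * J[min(w + n, len(J) - 1)] for n in range(1, N_TAIL))
              for w in range(len(J))]
    J = [max(a, l) for a, l in zip(attack, listen)]          # Bellman update

# Critical amount of observation: smallest w for which attacking is (weakly) preferred.
w_hat = next((w for w in range(W_MAX) if attack[w] >= listen[w]), W_MAX)
print("estimated critical observation threshold w_hat =", w_hat)
```

With placeholder rewards that grow in ω, the computed policy exhibits the threshold structure of (14): listening below the estimated ω̂ and attacking above it.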

5
Optimal Defense Strategy

Due to its leadership role, for any defense strategy (m, η_u), the defender (operator) can anticipate the optimal attacker strategy, i.e., the critical amount of observation ω̂ (as a function of m and η_u) and the corresponding decisions (listening is optimal for ω < ω̂ and attacking is optimal for ω ≥ ω̂). Knowing ω̂, the defender can make use of the transition probabilities described in Sect. 4.1. From the perspective of the defender, the system state consists of the triplets (t, L(t) = ω, S(t)), which evolve over time as an MDP; the corresponding transition probabilities are provided in Sect. 4.1. Note that the evolution of the triplets (t, L(t) = ω, S(t)) starting from (0, 0, UB) can be represented as an infinite directed graph with countably many vertices (t and ω are discrete, and there are three possible states S(t)). We will make use of the vertices as states/rewards and of the edges as the transition probabilities as follows. First, let us define the total defender reward until time-slot t as V_t(ω, UB), V_t(ω, BL), and V_t(ω, AD) when L(t) = ω and S(t) = UB (the unblocking state), S(t) = BL (the blocking state), and S(t) = AD (the attacker is detected by the IDS and the game ends), respectively. Starting from V_0(0, UB) = 0 and for ω ≥ 0,

for any ε > 0 the set is contained in the ε-neighborhood of some finite-dimensional linear subspace. The following proposition shows that a solution exists by (i) showing that the problem can be transformed into an optimization problem over a certain Banach space without loss of generality; then (ii) showing that the constraint set is bounded and flat; and finally (iii) showing that the optimization objective is continuous, which enables us to invoke the Weierstrass Theorem [10] to conclude that a solution exists.

Proposition 1. The infinite-dimensional optimization problem (44) admits a solution.

Proof. Consider the linear vector space $S \subset \prod_{k=1}^{\infty}\mathbb{S}^{2m}$ with norm $\|\cdot\|_S$. In particular, each s ∈ S is an infinite sequence of symmetric matrices, i.e., s := {S₁, S₂, ...}, and

$$ \|s\|_S := \left(\sum_{k=1}^{\infty}\|S_k\|_F^2\right)^{1/2} < \infty. \qquad (46) $$

Note that we can view s ∈ S as a sequence of real numbers with a certain ordering within each symmetric matrix. For example, we can view
$$ s = \big\{[S_1^{i,j}],\,[S_2^{i,j}],\,\ldots\big\}, \qquad (47) $$
where $S_k^{i,j}\in\mathbb{R}$ denotes the entry in the $i$th row and $j$th column of the matrix $S_k\in\mathbb{S}^{2m}$, as
$$ \{S_1^{1,1},\ldots,S_1^{1,2m},\ldots,S_1^{2m,1},\ldots,S_1^{2m,2m},S_2^{1,1},\ldots\}. \qquad (48) $$

The $\ell_2$-norm of (48) is bounded if, and only if, $\|s\|_S$ is bounded. Therefore, S is a subspace of the $\ell_2$-space, i.e., $S \subseteq \ell_2$. Furthermore, given any $l \in \ell_2$, there exists


a unique sequence of symmetric matrices in S. For example, $l := \{l_1, l_2, \ldots\}$ can be transformed into a sequence of symmetric matrices as
$$ \left\{\begin{bmatrix} l_1 & l_2 & l_4 & \cdots \\ l_2 & l_3 & & \\ l_4 & & \ddots & \\ \vdots & & & l_{m(2m+1)} \end{bmatrix},\; \begin{bmatrix} l_{m(2m+1)+1} & \cdots \\ \vdots & \ddots \end{bmatrix},\;\ldots\right\}, $$
and its S-norm is bounded since $l \in \ell_2$. Therefore, the $\ell_2$-space is a subspace of S, i.e., $\ell_2 \subseteq S$. This yields that the normed spaces S and $\ell_2$ are isometric. Correspondingly, S is also a Banach space since $\ell_2$ is Banach.

Next, our goal is to show that the constraint set in (44) is a subset of S, so that we can prove its compactness if we can show that the constraint set is flat in addition to being a bounded subset of the Banach space S. Eventually, by showing that the optimization objective is continuous, we can invoke the Weierstrass Theorem to conclude that a solution exists. However, Ψ as defined in (45) is not a subset of S. By inspecting the optimization objective, we observe that through a change of variable $\bar{S}_k := \alpha_S^k S_k$ for all k, where $\alpha_S := \sqrt{\beta_S}\in(0,1)$, (44) can be transformed into
$$ \min_{\{\bar{S}_k\}\in\bar{\Psi}}\;\lim_{\kappa\to\infty}\sum_{k=1}^{\kappa}\alpha_S^k\operatorname{tr}\{\bar{S}_k V\} + c, \qquad (49) $$
where
$$ \bar{\Psi} := \big\{\{\bar{S}_k\in\mathbb{S}^{2m}\}_{k=1}^{\infty}\;\big|\;\alpha_S^k\Sigma_{z,k}\succeq\bar{S}_k\succeq\alpha_S C\bar{S}_{k-1}C',\;\bar{S}_0 = O_{2m}\big\}. \qquad (50) $$

In order to show that $\bar{\Psi} \subset S$, let us take a look at the norm of any $\bar{s} := \{\bar{S}_k\}\in\bar{\Psi}$, which is given by
$$ \|\bar{s}\|_S^2 = \lim_{\kappa\to\infty}\sum_{k=1}^{\kappa}\|\bar{S}_k\|_F^2 \overset{(a)}{\le} \lim_{\kappa\to\infty}\sum_{k=1}^{\kappa}\|\bar{S}_k\|_*^2 \overset{(b)}{=} \lim_{\kappa\to\infty}\sum_{k=1}^{\kappa}\operatorname{tr}\{\bar{S}_k\}^2 \overset{(c)}{\le} \lim_{\kappa\to\infty}\sum_{k=1}^{\kappa}\alpha_S^{2k}M_z^2\,4m^2 = \frac{4m^2 M_z^2\alpha_S^2}{1-\alpha_S^2} < \infty, \qquad (51) $$
where (a) follows since for a matrix S the Frobenius norm satisfies $\|S\|_F^2 = \sum_i\sigma_i(S)^2$ while the trace norm is $\|S\|_* = \sum_i\sigma_i(S)$, where $\sigma_i(S)$ refers to the $i$th singular value of S; (b) follows since $\bar{S}_k$ is a positive semi-definite matrix and its singular values coincide with its eigenvalues; and (c) follows since we have $\alpha_S^k M_z I_{2m}\succeq\alpha_S^k\Sigma_{z,k}\succeq\bar{S}_k$ by the uniform boundedness condition on $z_k z_k'$, which implies that $\operatorname{tr}\{\bar{S}_k\}\le\alpha_S^k M_z\operatorname{tr}\{I_{2m}\}$. Therefore, we have that $\bar{\Psi}$ is a bounded subset of S.
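As a quick numerical illustration of the bound in (51), the following sketch takes the extreme choice $\bar{S}_k = \alpha_S^k\Sigma_z$ allowed by the upper constraint in (50) and checks that the accumulated squared Frobenius norms stay below $4m^2M_z^2\alpha_S^2/(1-\alpha_S^2)$; the covariance $\Sigma_z$, $\alpha_S$, and $M_z$ are illustrative, not values from the model.

```python
import numpy as np

m, alpha, M_z = 2, 0.8, 3.0
rng = np.random.default_rng(3)
A = rng.normal(size=(2 * m, 2 * m))
Sigma_z = A @ A.T                                    # a random PSD matrix
Sigma_z *= M_z / np.linalg.eigvalsh(Sigma_z).max()   # enforce Sigma_z <= M_z * I

# Accumulate ||alpha^k * Sigma_z||_F^2 and compare with the bound from (51).
total = sum(np.linalg.norm(alpha ** k * Sigma_z, "fro") ** 2 for k in range(1, 500))
bound = 4 * m ** 2 * M_z ** 2 * alpha ** 2 / (1 - alpha ** 2)
print(total, "<=", bound)
```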


In order to show that $\bar{\Psi}$ is flat, let us consider the following finite-dimensional space $F_K := \{\{F_k\}\in S \mid F_k = O_{2m}\ \text{for}\ k > K\}$. Note that $F_K \subset S$. We define the distance of any $\bar{s}\in\bar{\Psi}$ to the space $F_K$ by $d(\bar{s}, F_K) = \inf\{\|\bar{s}-f\|_S,\ f\in F_K\}$, which can also be written as⁹
$$ d(\bar{s}, F_K) = \inf_{f\in F_K}\Big(\sum_{k=1}^{\infty}\|\bar{S}_k - F_k\|_F^2\Big)^{1/2} \overset{(a)}{=} \Big(\sum_{k=K+1}^{\infty}\|\bar{S}_k\|_F^2\Big)^{1/2} \qquad (52) $$
$$ \le \Big(\sum_{k=K+1}^{\infty}\alpha_S^{2k}\,4m^2M_z^2\Big)^{1/2} = \frac{2mM_z\,\alpha_S^{K+1}}{\sqrt{1-\alpha_S^2}}, \qquad (53) $$

where (a) follows since $\{\bar{S}_1,\ldots,\bar{S}_K,O_{2m},\ldots\}\in F_K$. Therefore, we can ensure that the distance between any $\bar{s}\in\bar{\Psi}$ and the finite-dimensional $F_K$ is less than any ε > 0 by selecting K ∈ ℕ such that the upper bound on the distance, i.e., (53), is less than ε. This yields that $\bar{\Psi}$ is flat in addition to being a bounded subset of the Banach space S. Hence $\bar{\Psi}$ is a compact set.

Furthermore, the optimization objective in (49) is a linear functional over S. It is continuous if, and only if, it is bounded. Von Neumann's trace inequality (40) yields that
$$ \sup_{\|\bar{s}\|_S=1}\lim_{\kappa\to\infty}\Big|\sum_{k=1}^{\kappa}\alpha_S^k\operatorname{tr}\{\bar{S}_k V\}\Big| \;\le\; \sup_{\|\bar{s}\|_S=1}\lim_{\kappa\to\infty}\sum_{k=1}^{\kappa}\alpha_S^k\sum_{i=1}^{2m}\sigma_i(\bar{S}_k)\,\sigma_i(V). \qquad (54) $$
Let us introduce two sequences of real numbers:
$$ x := \{\sigma_1(\bar{S}_1),\ldots,\sigma_{2m}(\bar{S}_1),\sigma_1(\bar{S}_2),\ldots,\sigma_{2m}(\bar{S}_2),\ldots\}, $$
$$ y := \{\alpha_S\sigma_1(V),\ldots,\alpha_S\sigma_{2m}(V),\alpha_S^2\sigma_1(V),\ldots,\alpha_S^2\sigma_{2m}(V),\ldots\}. $$

⁹ Note that $\{\bar{S}_k - F_k\}\in S$, which ensures that its S-norm is bounded.


Then, the $\ell_2$-norms of the sequences are given by $\|x\|_2 = 1$, due to the constraint $\|\bar{s}\|_S = 1$, and $\|y\|_2 = \alpha_S\|V\|_*/\sqrt{1-\alpha_S^2}$. With the conventional inner product of the $\ell_2$-Hilbert space, (54) can be written as
$$ \sup_{\|x\|_2=1}(x, y), \qquad (55) $$
while the Cauchy-Schwarz inequality yields that
$$ |(x, y)| \le \|x\|_2\|y\|_2 = \frac{\alpha_S\|V\|_*}{\sqrt{1-\alpha_S^2}}, \qquad (56) $$
and the equality holds if, and only if, $x = \mu y$ for some $\mu\in\mathbb{R}$. Therefore, due to the norm constraint, the maximizing sequence $x$ is given by $x = y/\|y\|_2$. Coming back to the original problem (54), we have
$$ \sup_{\|\bar{s}\|_S=1}\lim_{\kappa\to\infty}\sum_{k=1}^{\kappa}\alpha_S^k\sum_{i=1}^{2m}\sigma_i(\bar{S}_k)\,\sigma_i(V) \le \frac{\alpha_S\|V\|_*}{\sqrt{1-\alpha_S^2}}, \qquad (57) $$
and the equality holds if, and only if, we have
$$ \sigma_i(\bar{S}_k) = \alpha_S^{k-1}\sqrt{1-\alpha_S^2}\,\frac{\sigma_i(V)}{\|V\|_*}. \qquad (58) $$
Hence, we obtain
$$ \sup_{\|\bar{s}\|_S=1}\lim_{\kappa\to\infty}\Big|\sum_{k=1}^{\kappa}\alpha_S^k\operatorname{tr}\{\bar{S}_k V\}\Big| = \frac{\alpha_S\|V\|_*}{\sqrt{1-\alpha_S^2}}, \qquad (59) $$
and the maximizing sequence of symmetric matrices is given by
$$ \bar{S}_k = \frac{\alpha_S^{k-1}\sqrt{1-\alpha_S^2}}{\|V\|_*}\,V. \qquad (60) $$

Therefore, the linear functional is bounded, and correspondingly continuous. This completes the proof. □

Even though a solution for (44) is guaranteed to exist, powerful computational tools to solve SDPs cannot be applied directly, since we would need to compute an infinite sequence of symmetric matrices. In the following theorem, however, we show how to approximate the solution within any approximation error.

Theorem 2. For any given ε > 0, let K ∈ ℕ be such that
$$ M_z\|V\|_*\,\frac{\beta_S^{K+1}}{1-\beta_S} < \epsilon. \qquad (61) $$
Furthermore, let $\{S_1^*,\ldots,S_K^*\}$ be the solution of
$$ \min_{\{S_k\}\in\Psi_K}\;\sum_{k=1}^{K}\beta_S^k\operatorname{tr}\{S_k V\} + c. \qquad (62) $$


Then, we have
$$ \min_{\{S_k\}\in\Psi}\lim_{\kappa\to\infty}\sum_{k=1}^{\kappa}\beta_S^k\operatorname{tr}\{S_k V\} \;\ge\; \sum_{k=1}^{K}\beta_S^k\operatorname{tr}\{S_k^* V\} - \epsilon. \qquad (63) $$

Proof. Note that for any K ∈ ℕ, we can write (44) as
$$ \min_{\{S_k\}\in\Psi}\;\sum_{k=1}^{K}\beta_S^k\operatorname{tr}\{S_k V\} + \lim_{\kappa\to\infty}\sum_{k=K+1}^{\kappa}\beta_S^k\operatorname{tr}\{S_k V\} + c. $$
We seek to provide a bound on the absolute value of the second term. Particularly, for any $\{S_k\}_{k=K+1}^{\infty}$, we have
$$ \lim_{\kappa\to\infty}\Big|\sum_{k=K+1}^{\kappa}\beta_S^k\operatorname{tr}\{S_k V\}\Big| \le \lim_{\kappa\to\infty}\sum_{k=K+1}^{\kappa}\beta_S^k\big|\operatorname{tr}\{S_k V\}\big| \overset{(a)}{\le} \sum_{k=K+1}^{\infty}\beta_S^k M_z\|V\|_* \le M_z\|V\|_*\,\frac{\beta_S^{K+1}}{1-\beta_S}, \qquad (64) $$
where (a) follows by (41). Therefore, if (61) holds, we have (63), which completes the proof. □
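The truncation procedure suggested by Theorem 2 can be sketched as follows, using `cvxpy` to solve the finite SDP. All problem data (β_S, M_z, V, C, Σ_z) and the simplified stand-in for the constraint set Ψ_K are illustrative assumptions, not values from the paper.

```python
import numpy as np
import cvxpy as cp

m = 1                                             # augmented state lives in R^{2m}
beta_S, M_z, eps = 0.8, 4.0, 1e-2
V = np.array([[1.0, 0.4], [0.4, -0.6]])           # hypothetical (indefinite) cost-weight matrix
C = np.array([[0.9, 0.0], [0.2, 0.7]])            # hypothetical state-transition matrix
Sigma_z = M_z * np.eye(2 * m)                     # hypothetical covariance bound on z_k z_k'

# Step 1: pick the truncation horizon K from condition (61).
nuc_V = np.linalg.norm(V, ord="nuc")
K = 1
while M_z * nuc_V * beta_S ** (K + 1) / (1 - beta_S) >= eps:
    K += 1

# Step 2: solve a finite SDP in the spirit of (62); the constraints below are a
# simplified stand-in for Psi_K (S_0 = 0, Sigma_z >= S_k >= C S_{k-1} C').
S = [cp.Variable((2 * m, 2 * m), symmetric=True) for _ in range(K + 1)]
constraints = [S[0] == np.zeros((2 * m, 2 * m))]
for k in range(1, K + 1):
    lower = cp.Variable((2 * m, 2 * m), symmetric=True)  # symmetric slack for C S_{k-1} C'
    constraints += [lower == C @ S[k - 1] @ C.T,
                    Sigma_z - S[k] >> 0,
                    S[k] - lower >> 0]
objective = cp.Minimize(sum(beta_S ** k * cp.trace(S[k] @ V) for k in range(1, K + 1)))
prob = cp.Problem(objective, constraints)
prob.solve(solver=cp.SCS)
print("K =", K, " approximate optimal value =", round(prob.value, 4))
```

By (63), the value of this finite program is within ε of the optimal value of the infinite-horizon problem (up to the constant c).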

5

Illustrative Examples

As an illustrative example, we consider the scenarios where there is only a single stage, and x₁ and θ₁ are scalar random variables.¹⁰ We suppose that x and θ are independent of each other and have zero mean and unit variance. Indeed, the bias θ is a standard normal random variable, i.e., θ ∼ N(0, 1). We, however, consider scenarios where the information of interest x is not necessarily Gaussian, in order to examine the performances attained by the players in a more general setting, different from the previous studies [1,7,13–15,18]. Particularly, the information of interest x is given by
$$ x = b\,x_l + (1 - b)\,x_r \qquad (65) $$
almost everywhere over ℝ, where b, x_l, and x_r are random variables independent of each other and of the bias θ. Let b be a Bernoulli random variable with P{b = 1} = 1/2, and let x_l ∼ N(−μ, σ²) and x_r ∼ N(μ, σ²), where μ ∈ [0, 1] and the variance σ² are such that var{x} = 1. In other words, the information of interest x is a Gaussian mixture with two components, x_l and x_r, at the left and right, respectively. By varying μ ∈ [0, 1], we seek to examine the performances of the players. Note that when μ = 0, x becomes a standard normal random variable, as illustrated in Fig. 1a. Then, the best linear estimator attains the minimum mean

¹⁰ Henceforth, we omit the subscript for notational simplicity.
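For concreteness, the following small sketch generates samples of this information source and of the augmented vector z = [x θ]′; the sample size and the choice μ = 0.7071 are illustrative. Note that var{x} = 1 forces σ² = 1 − μ².

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n = 0.7071, 500
sigma = np.sqrt(1.0 - mu ** 2)                   # so that var{x} = sigma^2 + mu^2 = 1

b = rng.integers(0, 2, size=n)                   # Bernoulli(1/2) component selector
x_left = rng.normal(-mu, sigma, size=n)
x_right = rng.normal(mu, sigma, size=n)
x = np.where(b == 1, x_left, x_right)            # x = b * x_l + (1 - b) * x_r
theta = rng.normal(0.0, 1.0, size=n)             # the sender's private bias

z = np.column_stack([x, theta])                  # augmented vector z = [x, theta]
print("sample covariance of z:\n", np.cov(z, rowvar=False))   # close to I_2 for any mu
```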

[Figure 1: scatter plots of samples of the augmented vector z (θ versus x) for the four cases (a) μ = 0, (b) μ = 0.7071, (c) μ = 0.9806, and (d) μ = 1.]

Fig. 1. Color-coded samples of the augmented vector z for different values of μ ∈ [0, 1]. For the best signaling rule to deceive a linear estimator, the best linear and nonlinear estimates are plotted via a green line and black circles, respectively. (Color figure online)

square error. However, for larger μ > 0, the best linear estimator does not attain the best possible performance. P_R can attain better performance by using a nonlinear filter since the underlying information is no longer Gaussian when μ > 0. Furthermore, when μ = 1, which is the maximum possible value under the constraint that var{x} = 1, the information of interest x becomes a Rademacher random variable, as illustrated in Fig. 1d. Note that standard normal random variables have the maximum entropy while Rademacher random variables have the minimum entropy within the general class of random variables that have zero mean and unit variance [6].

We note that the augmented vector z = [x θ]′ ∈ ℝ² has zero mean and its covariance matrix is I₂, independent of μ ∈ [0, 1]. Correspondingly, for all μ ∈ [0, 1], the solution for (34) is given by S* = uu′, where u = [0.5257 0.8507]′, as shown in [20]. And the optimal signal is s* = [u′z 0]′, almost everywhere over ℝ. The optimal signal can be viewed as the projection onto the direction of the vector u ∈ ℝ². For a linear estimator, the projection, i.e., E(z | u′z) = u u′z, is the best estimate. However, for a nonlinear estimator, the best estimate is given by E{z | u′z}, and it is not equal to the linear estimate in general.

[Figure 2: the sender's and the receiver's expected costs versus the mean μ of the right mixture component, for no signal, full signal, linear estimation, and nonlinear estimation.]

Fig. 2. Players’ performances for null, full, and strategic signaling, and linear and nonlinear estimation strategies.

In Fig. 1, we plot about 500 samples of the augmented vector z for different values of μ ∈ [0, 1] and color-code the samples drawn from the different components of the mixture x. We have conducted 10⁷ independent trials of Monte Carlo simulations [9] in order to compute the best nonlinear estimate E{z | u′z} numerically. In Fig. 1, the best linear and nonlinear estimates for the signal s = u′z are plotted via a green line and black circles, respectively, for different μ ∈ [0, 1]. Note that the best linear and nonlinear estimates match exactly for μ = 0, i.e., for Gaussian information, while the deviation increases for larger μ. This deviation has a different impact on each player's performance. In other words, if P_R could use the best nonlinear estimator while P_S has selected his signaling rule considering the scenario where P_R can only use the best linear estimator, e.g., a Kalman filter, then P_S can incur a larger cost while the best nonlinear estimator can lead to a smaller cost for P_R.

In Fig. 2, we compare the costs of the players for different scenarios: (i) P_S does not share any signal, and correspondingly P_R's best linear/nonlinear estimate would be E{z} = 0; (ii) P_S shares z completely, and correspondingly P_R's best linear/nonlinear estimate would be z; (iii) P_S shares u′z and P_R uses the best linear estimator; (iv) P_S shares u′z yet P_R uses the best nonlinear estimator. We note that the best deceptive signaling rule is neither null nor full information disclosure. And P_S's performance degrades only slightly when he/she selects the signaling rule considering the scenario where P_R uses the best linear estimator while P_R can actually use a nonlinear estimator, even when the information of interest x is a Rademacher random variable as illustrated in Fig. 1d.
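A sketch of this numerical comparison is given below: the nonlinear estimate E{z | u′z} is approximated by binning Monte Carlo samples on the scalar signal u′z and is compared against the linear estimate u(u′z). The sample count and bin width are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = 0.7071
sigma = np.sqrt(1.0 - mu ** 2)
u = np.array([0.5257, 0.8507])

n = 200_000
b = rng.integers(0, 2, size=n)
x = np.where(b == 1, rng.normal(-mu, sigma, n), rng.normal(mu, sigma, n))
theta = rng.normal(size=n)
z = np.column_stack([x, theta])
s = z @ u                                        # scalar signal u'z

# Nonlinear estimate: conditional mean of z within each bin of the signal u'z.
bins = np.linspace(s.min(), s.max(), 81)
idx = np.clip(np.digitize(s, bins), 1, len(bins) - 1)
nonlinear = np.array([z[idx == i].mean(axis=0) if np.any(idx == i) else [np.nan, np.nan]
                      for i in range(1, len(bins))])

# Linear estimate: projection onto u, i.e., E(z | u'z) = u (u'z) for a linear estimator.
centers = 0.5 * (bins[:-1] + bins[1:])
linear = np.outer(centers, u)
print("max |nonlinear - linear| over populated bins:",
      np.nanmax(np.abs(nonlinear - linear)))
```

For μ = 0 the two estimates coincide up to Monte Carlo error, while for μ > 0 the gap grows, mirroring the deviation visible in Fig. 1.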

6

Conclusion

In non-cooperative environments, e.g., adversarial settings, an agent who has access to valuable information could seek to deceive other agents who seek to


make informed decisions. In this paper, we have addressed the optimal deceptive signaling of multivariate distributions over finite or infinite horizon. We have modeled the interaction between the agents under the solution concept of Stackelberg equilibrium, where the agent signaling is the leader. We have shown that the optimal signaling strategy to deceive a Kalman filter is linear within the general class of stochastic kernels over finite or infinite horizons. For problems over finite horizon, we have provided an SDP-based method to compute the optimal signaling numerically. Over the infinite horizon, the corresponding SDP is also infinite dimensional. We have shown the existence of a solution and provided a method to approximate the optimal performance within any given neighborhood. Numerical analysis has shown that the performance of the sender degrades slightly when the receiver uses the best nonlinear estimator even for the scenarios where the information of interest is a Rademacher random variable rather than Gaussian. Some future research questions on this topic include: how much the sender’s cost measure would increase/decrease if the receiver uses a particle filter instead of a Kalman filter; what the optimal signaling strategies are to deceive a particle filter; and what the optimal signaling strategies are for the scenarios with higher order cost measures other than quadratic cost.

A

Proof of Lemma 1

In the following, we show each property one by one. Property (i) follows since $M_{m_a}(b, c) = M_{m_a}(\tilde{b}_\perp, c)$. Property (ii) follows since
$$ \mathrm{E}\{a\,\mathrm{E}(a\mid c)'\} = \mathrm{E}\{a(\mathrm{E}\{ac'\}\mathrm{E}\{cc'\}^{\dagger}c)'\} = \mathrm{E}\{ac'\}\mathrm{E}\{cc'\}^{\dagger}\mathrm{E}\{ac'\}'. \qquad (66) $$
Property (iii) follows since, by Property (i), we have
$$ \mathrm{E}\{\mathrm{E}(a\mid b, c)\,\mathrm{E}(a\mid c)'\} = \mathrm{E}\{\mathrm{E}(a\mid\tilde{b}_\perp, c)\,\mathrm{E}(a\mid c)'\}. \qquad (67) $$
By taking a closer look at the right-hand side, we obtain
$$ \begin{bmatrix}\mathrm{E}\{ac'\} & \mathrm{E}\{a\tilde{b}_\perp'\}\end{bmatrix} \begin{bmatrix}\mathrm{cov}\{c\} & \\ & \mathrm{cov}\{\tilde{b}_\perp\}\end{bmatrix}^{\dagger} \begin{bmatrix}\mathrm{E}\{cc'\} \\ \mathrm{E}\{\tilde{b}_\perp c'\}\end{bmatrix}\mathrm{cov}\{c\}^{\dagger}\mathrm{E}\{ac'\}'. $$
Since $\mathrm{E}\{\tilde{b}_\perp c'\} = O_{m_a\times m_c}$, it is equivalent to
$$ \mathrm{E}\{ac'\}\mathrm{cov}\{c\}^{\dagger}\mathrm{cov}\{c\}\mathrm{cov}\{c\}^{\dagger}\mathrm{E}\{ac'\}' = \mathrm{cov}\{\mathrm{E}(a\mid c)\}, $$
which follows since the pseudo-inverse of a matrix is a weak inverse for the multiplicative semi-group, i.e., $M^{\dagger}MM^{\dagger} = M^{\dagger}$. Property (iv) follows since $\mathrm{cov}\{\mathrm{E}(a\mid b, c)\}$ is equal to $\mathrm{cov}\{\mathrm{E}(a\mid\tilde{b}_\perp, c)\}$. By taking a closer look at the right-hand side, we obtain
$$ \begin{bmatrix}\mathrm{E}\{ac'\} & \mathrm{E}\{a\tilde{b}_\perp'\}\end{bmatrix} \begin{bmatrix}\mathrm{cov}\{c\} & \\ & \mathrm{cov}\{\tilde{b}_\perp\}\end{bmatrix}^{\dagger} \begin{bmatrix}\mathrm{E}\{ac'\}' \\ \mathrm{E}\{a\tilde{b}_\perp'\}'\end{bmatrix} = \mathrm{cov}\{\mathrm{E}(a\mid\tilde{b}_\perp)\} + \mathrm{cov}\{\mathrm{E}(a\mid c)\}. \qquad (68) $$

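As a quick numerical check of the pseudo-inverse identity behind Property (ii), the following sketch compares both sides on synthetic zero-mean jointly Gaussian samples, for which E(a | c) is the linear MMSE estimate used here; the dimensions and sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
ma, mc, N = 3, 2, 200_000
M = rng.normal(size=(ma + mc, ma + mc))
samples = rng.normal(size=(N, ma + mc)) @ M.T    # correlated zero-mean Gaussian samples
a, c = samples[:, :ma], samples[:, ma:]

E_ac = a.T @ c / N                               # E{a c'}
E_cc = c.T @ c / N                               # E{c c'}
P = E_ac @ np.linalg.pinv(E_cc)                  # linear MMSE gain E{ac'} E{cc'}^+
est = c @ P.T                                    # E(a | c) for jointly Gaussian data

lhs = a.T @ est / N                              # E{a E(a|c)'}
rhs = E_ac @ np.linalg.pinv(E_cc) @ E_ac.T       # E{ac'} E{cc'}^+ E{ac'}'
print("max abs difference:", np.abs(lhs - rhs).max())   # ~ Monte Carlo error only
```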

References 1. Akyol, E., Langbort, C., Ba¸sar, T.: Information-theoretic approach to strategic communication as a hierarchical game. Proc. IEEE 105(2), 205–218 (2017) 2. Anderson, B.D.O., Moore, J.B.: Optimal Filtering. Prentice Hall Inc., Upper Saddle River (1979) 3. Ba¸sar, T., Olsder, G.: Dynamic Noncooperative Game Theory. Society for Industrial and Applied Mathematics (SIAM) Series in Classics in Applied Mathematics (1999) 4. Carroll, T.E., Grosu, D.: A game theoretic investigation of deception in network security. Secur. Commun. Nets 4(10), 1162–1172 (2011) 5. Clark, A., Zhu, Q., Poovendran, R., Ba¸sar, T.: Deceptive routing in relay networks. In: Grossklags, J., Walrand, J. (eds.) GameSec 2012. LNCS, vol. 7638, pp. 171–185. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-34266-0 10 6. Cover, T.M., Thomas, J.A.: Elements of Information Theory. Wiley, Hoboken (2006) 7. Farokhi, F., Teixeira, A., Langbort, C.: Estimation with strategic sensors. IEEE Trans. Autom. Control 62(2), 724–739 (2017) 8. Howe, D.G., Nissenbaum, H.: TrackMeNot: resisting surveillance in web search. In: Kerr, I., Lucock, C., Steeves, V. (eds.) On the Identity Trail: Privacy, Anonymity and Identity in a Networked Society. Oxford University Press, Oxford (2009) 9. Kroese, D.P., Brereton, T., Taimre, T., Botev, Z.I.: Why the Monte Carlo method is so important today. Wiley Interdisc. Rev. Comput. Stat. 6(6), 386–392 (2014) 10. Luenberger, D.G.: Optimization by Vector Space Methods. Wiley, Hoboken (1969) 11. Mirsky, L.: A trace inequality of John von Neumann. Monatshefte f¨ ur Mathematik 79(4), 303–306 (1975) 12. Pawlick, J., Colbert, E., Zhu, Q.: A game-theoretic taxonomy and survey of defensive deception for cybersecurity and privacy. ArXiv:1712.05441 (2017) 13. Sarıta¸s, S., Y¨ uksel, S., Gezici, S.: Quadratic multi-dimensional signaling games and affine equilibria. IEEE Trans. Autom. Control 62(2), 605–619 (2017) 14. Sayin, M.O., Akyol, E., Ba¸sar, T.: Hierarchical multi-stage Gaussian signaling games in noncooperative communication and control systems. Automatica 107, 9–20 (2019) 15. Sayin, M.O., Ba¸sar, T.: Secure sensor design for cyber-physical systems against advanced persistent threats. In: Rass, S., An, B., Kiekintveld, C., Fang, F., Schauder, S. (eds.) Proceedings of International Conference on Decision and Game Theory for Security. LNCS, vol. 10575, pp. 91–111. Springer, Heidelberg (2017). https://doi.org/10.1007/978-3-319-68711-7 6 16. Sayin, M.O., Ba¸sar, T.: Dynamic information disclosure for deception. In: Proceedings of 57th IEEE Conference on Decision and Control (CDC), pp. 1110–1117 (2018) 17. Sayin, M.O., Ba¸sar, T.: Deception-as-defense framework for cyber-physical systems. arXiv:1902.01364 (2019) 18. Sayin, M.O., Ba¸sar, T.: Robust sensor design against multiple attackers with misaligned control objectives. arXiv:1901.10618 (2019) 19. Spitzner, L.: Honeypots: Tracking Hackers. Addison-Wesley Professional, Boston (2002) 20. Tamura, W.: A theory of multidimensional information disclosure. Working paper, available at SSRN 1987877 (2014) 21. Zhu, Q., Clark, A., Poovendran, R., Ba¸sar, T.: Deceptive routing games. In: Proceedings of IEEE Conference on Decision and Control, pp. 2704–2711 (2012)

MTDeep: Boosting the Security of Deep Neural Nets Against Adversarial Attacks with Moving Target Defense
Sailik Sengupta¹(B), Tathagata Chakraborti², and Subbarao Kambhampati¹
¹ Arizona State University, Tempe, AZ, USA — {sailiks,rao}@asu.edu
² IBM Research, Cambridge, MA, USA — [email protected]

Abstract. Present attack methods can make state-of-the-art classification systems based on deep neural networks mis-classify every adversarially modified test example. The design of general defense strategies against a wide range of such attacks still remains a challenging problem. In this paper, we draw inspiration from the fields of cybersecurity and multi-agent systems and propose to leverage the concept of Moving Target Defense (MTD) in designing a meta-defense for ‘boosting’ the robustness of an ensemble of deep neural networks (DNNs) for visual classification tasks against such adversarial attacks. To classify an input image at test time, a constituent network is randomly selected based on a mixed policy. To obtain this policy, we formulate the interaction between a Defender (who hosts the classification networks) and their (Legitimate and Malicious) users as a Bayesian Stackelberg Game (BSG). We empirically show that our approach MTDeep, reduces misclassification on perturbed images for various datasets such as MNIST, FashionMNIST, and ImageNet while maintaining high classification accuracy on legitimate test images. We then demonstrate that our framework, being the first meta-defense technique, can be used in conjunction with any existing defense mechanism to provide more resilience against adversarial attacks that can be afforded by these defense mechanisms alone. Lastly, to quantify the increase in robustness of an ensemble-based classification system when we use MTDeep, we analyze the properties of a set of DNNs and introduce the concept of differential immunity that formalizes the notion of attack transferability.

1

Introduction

State-of-the-art systems for image classification based on Deep Neural Networks (DNNs) are used in many important tasks such as recognizing handwritten digits on cheques [10], object classification for automated surveillance [9] and autonomous vehicles [6]. Adversarial attacks to make these classification systems misclassify inputs can lead to dire consequences. For example, in [15], road signs saying ‘stop’ are misclassified, which can make an autonomous vehicle behave

[Figure 1: an input image is classified by one of the constituent networks (CNN, HRNN, MLP) chosen by random selection; a perturbation FGM_h(i) crafted against the HRNN does not transfer to the randomly selected MLP.]
Fig. 1. An attack perturbation crafted for HRNN (FGM_h) is rendered ineffective when MTDeep picks the MLP (at random) for classification at test time.

dangerously. If D̂(i) denotes the class of an image i output by a Deep Neural Network D̂, an adversarial perturbation ε, when added to the image i, tries to ensure that D̂(i) ≠ D̂(i + ε). In addition, attackers try to minimize some norm of ε, which ensures that the changed image i + ε and the original image i are indistinguishable to humans. The effectiveness of an attack method is measured by the accuracy of a classifier on the perturbed images generated by it. Defenses against adversarial examples are designed to be effective against a certain class of attacks by either training the classifier with perturbed images generated by these attacks or making it hard for these attacks to modify some property of the neural network. Some recent works construct defenses that enforce classification of images that are within an ε distance of an image in the training set to the same class. Unfortunately, this has the side effect of bringing down the classification accuracy [21]. In this paper, we take a different view and design a meta-defense that can function both as (1) a first line of defense against new attacks and (2) a second line of defense when used in conjunction with any existing defense mechanism to boost the security gains the latter can provide. We consider a game-theoretic perspective and investigate the use of Moving Target Defense (MTD) [25], in which we randomly select a network from an ensemble of networks when classifying an input image (i.e., strategic randomization at test time), for boosting the robustness against adversarial attacks (see Fig. 1). Our contributions are:
– MTDeep – an MTD-based framework for an ensemble of DNNs.
– A Bayesian Stackelberg Game formulation with two players – MTDeep and the users. The Stackelberg Equilibrium of this game gives us the optimal randomization strategy for the ensemble that maximizes the classification accuracy on regular as well as adversarially modified inputs.
– We show empirically that MTDeep can be used as (1) a standalone defense mechanism to increase the accuracy on adversarial samples by ≈ 24% for MNIST, ≈ 22% for Fashion-MNIST and ≈ 21% for ImageNET data-sets against a variety of well-known attacks, and (2) in conjunction with existing defense mechanisms like Ensemble Adversarial Training, MTDeep increases


the robustness of a classification system (by ≈ 50% for MNIST). We also show that black-box attacks (see related work) on a distilled network are ineffective (in comparison to white-box attacks) against the MTDeep system. – We define the concept of differential immunity, which is (1) the first attempt at defining a robustness measure for an ensemble against attacks and (2) a quantitative metric to capture the notion of attack transferability. Although prior research has shown that effectiveness of attacks can sometimes transfer across networks [19], we show that there is still enough residual disagreement among networks that can be leveraged to design an add-on defense-in-depth mechanism by using MTD. In fact, recent work has demonstrated that it is possible to train models with limited adversarial attack transferability [2], making our meta-level defense approach particularly attractive.

2

Related Work

In this section, we first discuss existing work on crafting adversarial inputs against DNNs (at test-time) and defenses developed against them. Then, we briefly discuss some work in Moving Target Defense that inspires this defense.1 Recent literature has shown multiple ways of crafting adversarial samples for a DNN [11,13,15,19] using the gradient information or by examining the geometric space around an input. These attacks require complete knowledge about the classification network. On the other hand, attacks that craft gradient-based perturbations on distilled networks [14,19] or use zeroth-order optimization [5] can cripple DNNs even when the attacker has no knowledge about the actual classification network (and are thus called black-box attacks). Defense techniques against the two types of attacks described above commonly involve generating adversarial perturbed training images using one (or all) of the attack methods described and then using the generated images along with the correct labels to fine tune the parameters of the DNN. Ensemble adversarial training [20] and stability training [24] are improvements on this defense technique. Unlike us, the former does not use the ensemble at classification time. We do not discuss other defenses further because our proposed framework can be used in conjunction with any of these to improve their security guarantees. Our approach is well supported by findings in previous research works that show introduction of randomized switching makes it harder for any attacker to reverse engineer a classification system [22], which is necessary for constructing effective white-box attacks. Note that ensemble based defenses [1,8] can be viewed as simply adding an extra pooling layer whose weights are equal to the importance given to the votes of the constituent networks. Thus, all attacks on a DNNs are trivially effective against such voting-based ensembles. To this extent, researchers have also shown that an ensemble of vulnerable DNNs cannot result in a classifier robust to attacks [7]. In contrast, MTDeep builds in an implicit mechanism based on randomization at prediction time, making it difficult for an 1

A detailed overview can be found at https://arxiv.org/abs/1705.07213.


Table 1. The actions of the players and the utilities of the two user types – L and A – for the MNIST dataset (entries are the defender's classification accuracy in %).

MTDeep   Legitimate User (L)    Adversarial User (A)
         Classification Image   FGM_m  FGM_c  FGM_h  DF_m   DF_c   DF_h   PGD_m  PGD_c  PGD_h
MLP      99.1                   3.10   20.39  38.93  1.54   89.80  93.83  0.00   49.00  61.00
CNN      98.3                   55.06  10.28  71.39  98.87  0.87   98.55  78.00  0.00   90.00
HRNN     98.7                   25.12  27.24  11.43  95.38  83.17  3.66   23.00  51.00  0.00

adversary to fool the classification system. There has been some previous work that leverage randomization at test time [4] but cannot be used out-of-the-box for DNNs. The authors try to prevent misclassification rate under attack and end-up affecting the classification accuracy on non-adversarial test inputs. Universal perturbations [12], based on the DeepFool attack [13], needs to generate only one “universal” perturbation per network. Authors show that adversarial training is ineffective against this attacks. On the contrary, we show that MTDeep can prove to be an effective defense against these attacks because such attacks are network specific and thus, often have low transferability. Moving Target Defense (MTD) is a paradigm used in software security that tries to reduce the success rate of an attack by pro-actively switching between multiple software configurations [25]. Devising effective switching strategies for MTD systems requires reasoning about attacks in a multi-agent game theoretic fashion in order to provide formal guarantees about the security of such systems [18]. Thus, we model the interaction between an image classification system (an ensemble) and its users, both legitimate and adversarial, as a Bayesian Stackelberg Game, providing provable guarantees on the expected performance on both legitimate and adversarial inputs.

3

MTDeep: MTD for Deep Neural Networks

In our system, the defender has multiple system classifiers for a given task. The attacker has a set of attacks that it can use to cripple the constituent classifiers. Given an input to the system, the defender selects, at random, one of the configurations to run the input and returns the output generated by that system. Since the attacker does not know which system is specifically selected, its attacks are less effective than before (Fig. 1). As stated earlier, randomization in selecting a configuration for classification of each input is paramount. Unfortunately, an MTD framework for classification systems, that leverages randomization, might end up reducing the accuracy of the overall system in classifying non-perturbed images. Thus, in order to retain good classification accuracy and guarantee high security, we model the interaction between MTDeep and its users as a Bayesian Stackelberg Game and show that the equilibrium results in the optimal selection strategy. We now discuss our game-theoretic formulation. Players and Action Sets. The configuration space for the defender, i.e. MTDeep, comprise of various DNNs that are trained on the particular image classification task. The second player in this game is the user of the classification system. The second player has two player types – Legitimate User (L) and


the Adversary (A). L has one action – to input non-perturbed images to the MTDeep system. The adversary A has various attack actions and uses one of these to perturbs an input image. In our threat model, we consider a strong adversary who knows the different constituent architectures in our MTDeep system. This means they can easily generate powerful white-box attacks. Utilities. Existing works that design defense methods against adversarial attacks for DNNs model the problem as a zero-sum game where the attacker tries to maximize the defender’s loss function by coming up with perturbed images that the network misclassifies, whereas the defender tries to reduce the loss on these adversarially perturbed examples [11]. Fine tuning the classifier to have high accuracy on adversarially perturbed inputs often has the side effect of reducing the classification accuracy on non-perturbed inputs from the test set [21]. In this paper, we move away from the zero-sum game assumption and try to ensure that the defender minimizes the loss functions for both types of inputs images– images from the initial test-set and the adversarially perturbed ones. Thus, we want MTDeep to be effective for L (proportional to minimizing the loss on the original test set) and, at the same time, increase the accuracy of classification for the perturbed images (proportional to minimizing the loss against adversarial inputs at test-time), making this a multi-objective optimization problem. The utilities for each player in this game are as follows. – The Legitimate User (L) and the defender both get a reward value equal to the % accuracy of the DNN system. – The Adversary (A) and the defender play a constant(= 100) sum game, where the former’s reward value for an attack against a network is given by the fooling rate and the defender’s reward is the accuracy on perturbed inputs. We also consider a parameter α that defines the probability of the player types A and L. It lets the defender weigh the importance of catering to legitimate test samples vs. correctly classifying adversarial samples. The game-matrix for the MNIST classification task is shown in Table 1.2 MTDeep’s Switching Strategy. Note that the defender D has to play first, i.e. deploy a classification system that either a legitimate user L can use or an adversary A can attack. This imparts a leader-follower paradigm to the formulated Bayesian Game. The defender leads by playing first and then the attacker follows by choosing an attack action having inferred the leader’s (mixed) strategy. Satisfying the multi-objective criterion, mentioned above, is now equivalent to finding the Stackelberg Equilibrium of this game. We find this equilibrium by using the mixed integer quadratic program (MIQP) formulation in [16].
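To make the computation concrete, the following is a minimal sketch that computes the defender's optimal switching strategy for the game in Table 1. It is not the MIQP of [16]: because the legitimate user has a single action and the adversarial sub-game is constant-sum (= 100), the Stackelberg strategy here reduces to a maximin linear program, which the sketch solves with `scipy`. The numeric utilities are the MNIST values from Table 1.

```python
import numpy as np
from scipy.optimize import linprog

acc_legit = np.array([99.1, 98.3, 98.7])          # MLP, CNN, HRNN on clean images
# Rows: MLP, CNN, HRNN; columns: FGM_m/c/h, DF_m/c/h, PGD_m/c/h (accuracy on perturbed inputs).
acc_pert = np.array([
    [ 3.10, 20.39, 38.93,  1.54, 89.80, 93.83,  0.0, 49.0, 61.0],
    [55.06, 10.28, 71.39, 98.87,  0.87, 98.55, 78.0,  0.0, 90.0],
    [25.12, 27.24, 11.43, 95.38, 83.17,  3.66, 23.0, 51.0,  0.0],
])
alpha = 1.0                                       # probability of the adversarial user type
n_nets, n_attacks = acc_pert.shape

# Variables: x (defender mix over networks) and v (worst-case adversarial-type payoff).
c = -np.concatenate([(1 - alpha) * acc_legit, [alpha]])      # maximize -> minimize negation
A_ub = np.hstack([-acc_pert.T, np.ones((n_attacks, 1))])     # v <= sum_n x_n * acc_pert[n, u]
b_ub = np.zeros(n_attacks)
A_eq = np.array([[1.0] * n_nets + [0.0]])                    # mixing weights sum to 1
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, 1)] * n_nets + [(None, None)])
print("defender mix (MLP, CNN, HRNN):", np.round(res.x[:n_nets], 3))
print("worst-case accuracy on adversarial inputs:", round(res.x[n_nets], 2))
```

With α = 1, the resulting worst-case adversarial accuracy is close to the ≈ 24% reported for MNIST in the experiments below; lowering α shifts the mix toward the most accurate network.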

4

Experimental Results

We first compare the effectiveness of MTDeep as a standalone defense mechanism for classifying MNIST, Fashion-MNIST and ImageNet datasets. We then 2

More details and examples of games (for the Fashion-MNIST and Imagenet classification tasks) can be found at https://arxiv.org/abs/1705.07213.

[Figure 2: classification accuracy versus α for MTDeep, MTD-URS, and the individual CNN, MLP, and HRNN networks, on (a) MNIST and (b) Fashion-MNIST.]

Fig. 2. Accuracy of MTDeep with non-adversarially trained networks compared to accuracy of individual constituent networks and uniform random strategy.

show that MTDeep piggybacked onto an existing defense mechanism can help boost the classification accuracy against adversarial attacks by almost 50%. We then analyze the effect of black-box attacks created on a distilled network and introduce the notion of differential immunity for ensembles. We discuss how this metric can capture the informal notion of transferability of attacks and be used to measure the effectiveness of MTDeep. 4.1

MTDeep as a Standalone Defense Technique

We compare the effectiveness of MTDeep with two baselines– the individual networks in the ensemble and the a randomized ensemble that uses Uniform Random Strategy (MTD-URS) to pick one of the constituent networks with equal probability. In contrast, MTDeep uses the Stackelberg equilibrium strategy of the defender to pick a constituent DNN. MNIST and Fashion-MNIST. For each data-set, we trained three classification networks that were built using either Convolution layers (CNN), Multi-layer Perceptrons (MLP) or Hierarchical Recurrent layers (HRNN). The size of the train and test sets were 50000 and 10000 respectively. We considered three attack methods for the attacker – the Fast Gradient Based (FGM) attack (with  = 0.3), the DeepFool (DF) attack (with three classes being considered at each step when searching for an attack perturbation), and the Projected Gradient Descent (PGD) attack (with  : 0.3,  − iter : 0.05). An adversarial example generated using the PGD algorithm on the loss information of the CNN is termed as P GDc in Table 1 (similarly P GDh/m ). We then find the classification accuracy of each network on these adversarial examples to compute the utility values. Note that an adversarial example developed using information about one network may not be as effective for the other networks. We find that this is especially true for attacks like DF that exploit information about a particular network’s classification boundary. On the other hand, attacks that exploit the gradient signals of a particular network are more effective against

[Figure 3: (a) expected accuracy of MTDeep and of the individual ImageNET classifiers (VGG-F, CaffeNet, GoogLeNet, VGG-16, VGG-19, ResNet-152) as α varies from 0 to 1; (b) number of constituent neural networks participating in MTDeep's equilibrium strategy for different values of α.]

Fig. 3. Results on the ImageNET classification task.

the other networks, i.e. have high transferability. We observe this trend for both the MNIST and the Fashion-MNIST data-set. In Fig. 2, we plot the accuracy of a particular classification system as α varies from 0 to 1. When α = 0 and the defender ignores the possibility of playing against an adversary, the optimal strategy for MTDeep boils down to a pure strategy for selecting the most accurate classifier. In contrast, MTD-URS has lower classification accuracy than MTDeep because it chooses the two lessaccurate classifiers with probability 0.33. Given that classification accuracy of the constituent networks are relatively high, the drop in accuracy is small. When α = 1 and the defender only receives adversarial examples as inputs, strong attacks like PGD can fool individual networks 100% of the time for MNIST and 97% for Fashion-MNIST. In contrast, randomized selection of networks at test-time perform much better because an adversarial perturbation developed based on information from one network fails to fool other networks that may be selected at classification time. MTDeep achieves a classification accuracy of 24% for MNIST and 25% for Fashion-MNIST while MTD-URS has a classification accuracy of ≈ 20% for both the data sets. The difference in classification accuracy stems form the fact that MTD-URS picks more vulnerable networks with equal probability. ImageNET. We use six different networks which have excelled on ILSVRC2012s validation set [17] to construct the ensemble for MTDeep3 . Generating attacks like FGM, DF and PGD for ImageNET are time and resource intensive. Thus, we assume the adversary uses Universal Perturbations (UP) developed for each network [12], which are built on top of DF and only one attack mask is generated for each constituent network (as opposed to each test image). Defense mechanisms like adversarial training are ineffective against this type of attack [12]. Furthermore, no other defenses have been shown to be effective against this attack. In such cases, MTDeep is a particularly attractive approach because it increases the robustness of the classification system even when all other defense mechanisms are ineffective. In Fig. 3(a), we plot the expected accuracy for the MTDeep along with the expected accuracy of each of the constituent networks when the probability of 3

More details can be found at https://arxiv.org/abs/1705.07213.


MTDeep      Legitimate User (L)    Adversarial User (A)
            Classification Image   FGM_m  FGM_c  FGM_h  DF_m   DF_c   DF_h   PGD_m  PGD_c  PGD_h
MLP_eat     97.99                  95.06  75.32  70.10  1.50   96.97  95.73  0.00   88.00  69.00
CNN_eat     98.97                  61.44  96.55  68.58  98.36  0.79   96.09  72.00  20.00  81.00
HRNN_eat    97.22                  81.24  84.79  93.10  96.85  95.90  4.41   82.00  71.00  10.00

Fig. 4. The utilities for the players when the adversary uses the aforementioned attacks against the classifiers fine-tuned using Ensemble Adversarial Training (EAT) with FGM attacks.

[Figure 5: classification accuracy versus α for MTDeep and the adversarially trained CNN, MLP, and HRNN.]

Fig. 5. Accuracy gains of MTDeep when constituent classifiers are adversarially trained.

the adversary type α varies. Given there are six constituent networks in the ensemble, to avoid clutter, we don’t plot MTD-URS for brevity but observe that it always has ≈ 4% less accuracy than MTDeep, which is a large accuracy difference in accuracy in the context of ImageNET. At α = 0, MTDeep uses the most accurate network (ResNet-152) to maximize classification accuracy. As adversarial inputs become more ubiquitous and α becomes 1, the accuracy on perturbed inputs drops for all the constituent networks of the ensemble. Thus, to stay protected, MTDeep switches to a mixed policy that utilizes more networks. At α = 1, the accuracy of MTDeep is 42% compared to 20% for the best of the single DNN architectures. The optimal strategy in this case is x = 0.0, 0.171, 0.241, 0.0, 0.401, 0.187 which discards two of the six constituent networks. Note that the 22% accuracy bump for modified images comes despite (i) high misclassification rates of the constituent networks against Universal Perturbations, and (ii) lack of proven defense mechanisms against such attacks. 4.2

MTDeep as an Add-On Defense-in-depth Solution

We study the use of MTDeep on top of a state-of-the-art defense mechanism called Ensemble Adversarial Training (EAT) [20]. EAT is an improvement of adversarial training that uses adversarial examples generated on non-target networks to fine tune weights of the target network. Given MTDeep works with an ensemble, it renders itself naturally to this robustification method. Unfortunately, using EAT can only make the networks robust against attack images generated by the particular attack algorithm. We observe that the individual networks are still vulnerable to stronger (i.e. more computationally intensive) attacks. In Fig. 4, we show that the utility values obtained using the three constituent networks whose parameters are fine-tuned using EAT (which, in turn uses FGM). Note that although there is a boost in overall accuracy against adversarial examples generated using FGM, the other attacks (1) DF, which is generated in a very different manner compared to FGM, and (2) PDG, which


Table 2. Differential immunity of various ensembles and their accuracy (α = 1).

Networks        Differential immunity (δ)   Accuracy of best constituent net   Accuracy of MTDeep   Gain
FashionMNIST    0.11                        3%                                 24.8%                21.8%
MNIST           0.19                        0%                                 23.68%               23.68%
ImageNET        0.34                        22.2%                              42.88%               20.68%
MNIST + EAT     0.78                        4.41%                              54.71%               50.3%

represents a stronger class of attacks, are both still able to cripple the individual constituent networks. Although we do not presently understand why EAT helps in reducing the transferability of the PGD and DF attacks, this phenomenon helps MTDeep, when used in conjunction with EAT, obtain impressive accuracy gains. In Fig. 5, we see that when α = 1 (only adversarially perturbed inputs at test time), the accuracy of the constituent networks is 0–4% while MTDeep achieves an accuracy of ≈ 55%. Thus, we see a gain of more than 50% when classifying only adversarially perturbed images. 4.3

Blackbox Attacks on MTDeep

MTDeep designs a strategy based on a set of known attacks. Once deployed, an attacker can train a substitute network via distillation, i.e. use MTDeep as an oracle to obtain labels for the (chosen-ciphertext like) training set for the substitute network. Given that the distilled network captures information relating to the randomization at test time, we wanted to see how effective such a distillation procedure is in generating an expected network that mimics MTDeep. More specifically, we want to know if adversarial samples generated on this distilled network [14] successfully transfer against the MTDeep ensemble. For this purpose, we used the three networks designed for MNIST data and experimented with α = 1. We notice that MTDeep has higher immunity to blackbox attacks and is able to classify attack inputs ≈ 32% of the time compared to the ≈ 24% accuracy against white-box attacks. Thus, there exists a white-box attack in the attacker’s arsenal that strictly dominates the black-box attack4 . Thus, the defender’s optimal mixed strategy remains unaffected. 4.4

Differential Immunity

If an attack u ∈ U could cripple all the networks n ∈ N , using MTDeep will provide no gains in robustness. In this section, we try to quantify the gains MTDeep can provide. Let E : N × U → [0, 100] denote the fooling rate function 4

Note that even if a blackbox attack proves to be a more effective attack against the ensemble (for a different dataset), this attack is not modeled by the defender in the original game. They may choose to include it in the formulated game.


Number of constituent networks agreeing with the correct label (out of 10000 perturbed MNIST test images):

Agreeing nets   FGM_C   FGM_H   FGM_M   FGM_BB
0               4788    389     1513    2305
1               3641    2728    5790    2569
2               1449    6667    2479    2678
3               118     212     214     2444

Fig. 6. Agreement among constituent networks when classifying perturbed inputs for the MNIST data-set.

[Figure 7: difference in % accuracy from the optimal for MTDeep and MTD-URS as the assumed α deviates from the true α by up to ±50%.]
Fig. 7. Loss in % accuracy when correct α is different from assumed α.

where E(n, u) is the fooling rate when an attack u is used against a network n. Now, the differential immunity δ of an ensemble N against a set of known attacks U can be measured as follows:
$$ \delta(U, N) = \min_{u}\;\frac{\max_n E(n, u) - \min_n E(n, u) + 1}{\max_n E(n, u) + 1}. $$
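A small sketch of this computation from a fooling-rate table E (rows: attacks u, columns: networks n) is shown below; the numeric values are illustrative, not the paper's measurements.

```python
import numpy as np

E = np.array([            # fooling rate (%) of attack u against network n
    [96.9, 44.9, 74.9],   # e.g., a gradient-based attack crafted on network 0
    [1.1, 99.1, 1.5],     # an attack crafted on network 1
    [4.6, 16.8, 96.3],    # an attack crafted on network 2
])

per_attack = (E.max(axis=1) - E.min(axis=1) + 1) / (E.max(axis=1) + 1)
delta = per_attack.min()  # the worst-case (most transferable) attack determines delta
print("differential immunity delta =", round(delta, 3))
```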

If the maximum and minimum fooling rates of u on n differ by a wide margin, then the differential immunity of MTDeep is higher. The denominator ensures that an attack which has high impact (or fooling rate) reduces the differential immunity of a system compared to a low impact attack even when the numerator is the same. The +1 factor the numerator ensures that higher values of maxn E(n, u) reduce the δ when maxn E(n, u) = minn E(n, u). Note that δ ∈ [0, 1]. As per this measure, the differential immunity of the various ensembles used in our experiments are highlighted in Table 2. As per our expectation, we observe a general trend that the differential immunity of an ensemble is proportional to the accuracy gains obtained by MTdeep when compared to the most secure constituent network in the ensemble. Although we notice the lowest gain in case of ImageNET, note that this 20.68% gain in accuracy is substantially better than the ≈ 22% (or t ≈ 24%) gain in accuracy for the Fashion-MNIST (or MNIST) dataset(s) with non-adversarially trained DNNs because the number of classes in ImageNET is 1000 compared to 10 for the latter two datasets.Lastly, existing measures of robustness are mostly designed for a single DNN [3,23] and thus, cannot capture the effect of attack transferability on robustness (of an ensemble). Thus, we propose differential immunity as one of the metrics for evaluating the robustness of ensembles that use any form of randomization at test time. Disagreement Metrics. In Fig. 6, we highlight the number of perturbed test images (total 10000) on which 0–3 constituent DNN’s classification output(s) agree with the correct class label. We conducted these experiments using the non-adversarially trained networks for MNIST classification and for brevity purposes, use only the FGM attack method. F GMC is the strongest attack that can make all the n ∈ N misclassify at least 70% of the images. As generating δ can be costly at times, which needs the fooling rates for each pair (u, n), one can


generate the agreement metrics on a small data set to provide upper bounds for δ. This provides an idea as to how using a MTDeep ensemble can increase the robustness against adversarial samples. In this case, δM N IST ≤ 0.51 because for the strongest attack, every network in the ensemble will misclassify (approx.) 49% of the time. Also, note that a majority based ensemble can will only be able to guarantee an accuracy of ≈ 14% against the FSMC attack because in all the other cases, only net 0 or net 1 is able to predict the correct class. In comparison, MTDeep can obtain an accuracy of 26.8% against FGM attacks. 4.5

Participation of Individual Networks

In Fig. 3(b), we explore the participation of individual networks in the mixed strategy equilibria for MTDeep used to classify ImageNET data. The results clearly show that while it is useful to have multiple networks providing differential immunity (as testified by the improvement of accuracy in adversarial conditions), the leveling-off of the objective function values with more DNNs in the mix does underline that there is much room for research in actively developing DNNs that can provide greater differential immunity. Note that no more than four (out of the six) networks participate in the equilibrium. An ensemble of networks with higher differential immunity equipped with MTD can thus provide significant gains in both security and accuracy. 4.6

Robustness Against Miscalibrated α

If the value of α, which is assumed up-front, turns out to be incorrect, the computed strategy ends up becoming sub-optimal. In Fig. 7, we plot the deviation of the chosen policy (based on the assumed α) from the optimal as the real α is varied ±50% from the one assumed. The BSG-framework remains quite robust (as opposed to a uniform random strategy) i.e. the accuracy is within 0–3% of the optimal accuracy. The robustness to α further highlights the usefulness of MTDeep as a meta-defense meant to work not only against adversarial attacks but also in the context of a deployed classifier that will have to deal with adversaries as well as legitimate users.

5

Conclusion

In this paper, we introduced MTDeep – a framework inspired by Moving Target Defense in cybersecurity – as ‘security-as-a-service’ to help boost the security of existing classification systems based on Deep Neural Networks (DNNs). We modeled the interaction between MTDeep and the users as a Bayesian Stackelberg Game, whose equilibrium gives the optimal solution to the multi-objective problem of reducing the misclassification rates on adversarially modified images while maintaining high classification accuracy on the non-perturbed images. We empirically showed the effectiveness of MTDeep against various classes of attacks for


the MNIST, the Fashion-MNIST and the ImageNet data-sets. Lastly, we demonstrated how using MTDeep in conjunction with existing defense mechanisms for DNNs result in more robust classifiers, thereby highlighting the importance of developing ensembles with higher differential immunity. Acknowledgments. We thank the reviewers for their comments. This research is supported in part by NASA grant NNX17AD06G and ONR grants N00014161-2892, N00014-13-1-0176, N00014-13-1-0519, N00014-15-1-2027. The first author is also supported by an IBM Ph.D. Fellowship.

References 1. Abbasi, M., Gagn´e, C.: Robustness to adversarial examples through an ensemble of specialists. arXiv:1702.06856 (2017) 2. Adam, G.A., Smirnov, P., Goldenberg, A., Duvenaud, D., Haibe-Kains, B.: Stochastic combinatorial ensembles for defending against adversarial examples. arXiv:1808.06645 (2018) 3. Bastani, O., Ioannou, Y., Lampropoulos, L., Vytiniotis, D., Nori, A., Criminisi, A.: Measuring neural net robustness with constraints. In: NIPS (2016) 4. Biggio, B., Fumera, G., Roli, F.: Adversarial Pattern Classification Using Multiple Classifiers and Randomisation. In: da Vitoria, Lobo N. (ed.) SSPR /SPR 2008. LNCS, vol. 5342, pp. 500–509. Springer, Heidelberg (2008). https://doi.org/10. 1007/978-3-540-89689-0 54 5. Chen, P.Y., Zhang, H., Sharma, Y., Yi, J., Hsieh, C.J.: Zoo: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. arXiv:1708.03999 (2017) 6. De La Escalera, A., Moreno, L.E., Salichs, M.A., Armingol, J.M.: Road traffic sign detection and classification. IEEE Trans. Ind. Electron. 44(6), 848–859 (1997) 7. He, W., Wei, J., Chen, X., Carlini, N., Song, D.: Adversarial example defenses: ensembles of weak defenses are not strong. arXiv preprint arXiv:1706.04701 (2017) 8. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167 (2015) 9. Javed, O., Shah, M.: Tracking and object classification for automated surveillance. In: ECCV (2006) 10. Jayadevan, R., Kolhe, S.R., Patil, P.M., Pal, U.: Automatic processing of handwritten bank cheque images: a survey. J. Doc. Anal. Recogn. 15(4), 267–296 (2012). https://doi.org/10.1007/s10032-011-0170-8 11. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017) 12. Moosavi-Dezfooli, S.M., Fawzi, A., Fawzi, O., Frossard, P.: Universal adversarial perturbations. arXiv:1610.08401 (2016) 13. Moosavi-Dezfooli, S.M., Fawzi, A., Frossard, P.: Deepfool: a simple and accurate method to fool deep neural networks. In: CVPR (2016) 14. Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z.B., Swami, A.: Practical black-box attacks against machine learning. In: ACM CCS (2017) 15. Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z.B., Swami, A.: The limitations of deep learning in adversarial settings. In: 2016 IEEE European Symposium on Security and Privacy (EuroS&P) (2016)


16. Paruchuri, P., Pearce, J.P., Marecki, J., Tambe, M., Ordonez, F., Kraus, S.: Playing games for security: an efficient exact algorithm for solving Bayesian stackelberg games. In: AAMAS (2008) 17. Russakovsky, O., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015) 18. Sengupta, S., et al.: A game theoretic approach to strategy generation for moving target defense in web applications. In: AAMAS (2017) 19. Szegedy, C., et al.: Intriguing properties of neural networks. arXiv:1312.6199 (2013) 20. Tram`er, F., Kurakin, A., Papernot, N., Boneh, D., McDaniel, P.: Ensemble adversarial training: attacks and defenses. arXiv:1705.07204 (2017) 21. Tsipras, D., Santurkar, S., Engstrom, L., Turner, A., Madry, A.: Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152 (2018) 22. Vorobeychik, Y., Li, B.: Optimal randomized classification in adversarial settings. In: AAMAS (2014) 23. Weng, T.W., et al.: Evaluating the robustness of neural networks: an extreme value theory approach. arXiv preprint arXiv:1801.10578 (2018) 24. Zheng, S., Song, Y., Leung, T., Goodfellow, I.: Improving the robustness of deep neural networks via stability training. In: CVPR (2016) 25. Zhuang, R., DeLoach, S.A., Ou, X.: Towards a theory of moving target defense. In: Proceedings of the First ACM Workshop on Moving Target Defense, pp. 31–40. ACM (2014)

General Sum Markov Games for Strategic Detection of Advanced Persistent Threats Using Moving Target Defense in Cloud Networks Sailik Sengupta(B) , Ankur Chowdhary, Dijiang Huang, and Subbarao Kambhampati Arizona State University, Tempe, AZ, USA {sailiks,achaud16,dijiang,rao}@asu.edu Abstract. The processing and storage of critical data in large-scale cloud networks necessitate the need for scalable security solutions. It has been shown that deploying all possible detection measures incur a cost on performance by using up valuable computing and networking resources, thereby resulting in Service Level Agreement (SLA) violations promised to the cloud-service users. Thus, there has been a recent interest in developing Moving Target Defense (MTD) mechanisms that helps to optimize the joint objective of maximizing security while ensuring that the impact on performance is minimized. Often, these techniques model the challenge of multi-stage attacks by stealthy adversaries as a single-step attack detection game and use graph connectivity measures as a heuristic to measure performance, thereby (1) losing out on valuable information that is inherently present in multi-stage models designed for large cloud networks, and (2) come up with strategies that have asymmetric impacts on performance, thereby heavily affecting the Quality of Service (QoS) for some cloud users. In this work, we use the attack graph of a cloud network to formulate a general-sum Markov Game and use the Common Vulnerability Scoring System (CVSS) to come up with meaningful utility values in each state of the game. We then show that, for the threat model in which an adversary has knowledge of a defender’s strategy, the use of Stackelberg equilibrium can provide an optimal strategy for placement of security resources. In cases where this assumption turns out to be too strong, we show that the Stackelberg equilibrium turns out to be a Nash equilibrium of the general-sum Markov Game. We compare the gains obtained using our method(s) to other baseline techniques used in cloud network security. Finally, we highlight how the method was used in a real-world small-scale cloud system.

1 Introduction

A cloud service provider provides processing and storage hardware along with networking resources to customers for profit. Although a cloud provider might want to use state-of-the-art security protocols, vulnerabilities in software desired (or used) by customers can put sensitive information stored in or communicated
over the cloud network at risk. Distributed elements such as firewalls, Intrusion Detection Systems (IDS), log-monitoring systems, etc. have been the backbone for detecting malicious traffic in (or stopping it from entering) such systems. Unfortunately, the scale of modern-day cloud systems makes the placement of all possible detection and monitoring mechanisms an expensive solution [12,28,32]; it uses up computing and network resources that could have been better utilized by giving them to customers, which in turn would be better for business. Thus, the question of how one should place a limited number of detection mechanisms to limit the impact on performance, while ensuring that the security of the system is not drastically weakened, becomes a significant one. There has been an effort to answer this question in previous research [28,32]. Researchers have pointed out that static placement of detection systems is doomed to be insecure because an attacker, with reconnaissance on their side (which exists by default for any such cyber-system), will eventually learn this static placement strategy and hence avoid it. Thus, dynamic placement of these detection mechanisms, generally known as Moving Target Defense (MTD) for continuously shifting the detection surface, has become the default. In this method, the set of attacks for which monitoring systems are placed changes in some randomized fashion after every fixed time step. In these works, the cloud system is treated in a way similar to physical security systems, where the primary challenge is to allocate a limited set of security resources to an asset/schedule that needs to be protected [23,29,31].

In the case of real-world cloud systems, these aforementioned solutions lead to three problems. First, only single-step attacks are considered in the game-theoretic modeling, which leads to sub-optimal strategies because such models fail to account for multi-stage attack behavior. For example, strategies generated by prior work may prioritize detecting a high-impact attack on a web-server more than a low-impact attack on a path that leads to an attack on a storage server, which, when exploited, may have major consequences. Second, the threat model assumes that an attacker can launch an attack from any node in the cloud network. This is too strong an assumption, leading to sub-optimal placement strategies in real-world settings. Third, existing methods can come up with placement strategies that allocate multiple detection systems on one sub-net while another sub-net is not monitored. This results in a steep degradation of performance for some customers.

To address these challenges, we need a suitable method for modeling multi-stage attacks. Unfortunately, capturing all possible attack paths can lead to an action-set explosion in previously proposed normal-form games [28]. Thus, we use Markov Games to model such interactions. Specifically, we address these problems by modeling the cloud system as a general-sum Markov Game. We use particular nodes of our system's attack graph to represent the states of our game, while the attacker's actions are modeled after real-world attacks based on the Common Vulnerabilities and Exposures (CVEs) described in the National Vulnerability Database (NVD) [1]. The defender's actions correspond to the placement of detection systems that can detect these attacks. We design the utility values for each player in this game
leveraging (1) the Common Vulnerability Scoring System (CVSS), which provides metrics for each attack, and (2) the cloud designer's quantification of how the placement of a detection system impacts performance. These help us come up with defense strategies that take into account the long-term impact of a multi-stage attack while restricting the defender to a limited number of monitoring actions in each part of the cloud. The latter constraints ensure that the performance impact on a customer, due to the placement of detection measures, is limited.

The popular min-max equilibrium for Markov Games [18] is an optimal strategy for the players in zero-sum games but becomes sub-optimal in a general-sum game. Furthermore, we model an attacker who is aware of the defender's placement strategy at each state of our Markov Game and thus consider the Stackelberg equilibrium of this game. In scenarios where the latter assumption is too strong, we show that the Stackelberg equilibrium of our general-sum game is, depending on the problem structure, a subset of the Nash equilibria and still results in optimal strategies. The key contributions of this research work are as follows:

– We model multi-stage attack scenarios, which are typically employed in Advanced Persistent Threat (APT) campaigns against high-value targets, as a general-sum Markov Game. The cost-benefit analysis based on this two-player game provides strategies for placing detection systems in a cloud network.
– We leverage the attack-graph modeling of cloud networks, the Common Vulnerabilities and Exposures (CVEs) present in the National Vulnerability Database and the Common Vulnerability Scoring System (CVSS) to design the states, the actions and the utility values of our game. In addition, we consider prior work on cloud systems to (1) model the uncertainty of an attack's success and (2) leverage heuristic measures that model the performance impact of placing detection mechanisms on the cloud.
– Our framework considers a threat model where the attacker can infer the defender's detection strategy. Therefore, we design a dynamic programming solution to find the Stackelberg equilibrium of the Markov Game. If an attacker does not have information about the defender's strategy, we show that the Stackelberg equilibrium of the general-sum Markov Game is a subset of the Nash equilibria when a set of properties holds (similar to prior work in extensive-form games [16]). To showcase the effectiveness of our approach, we analyze a synthetic and a real-world cloud system.

2 Background

In this section, we first introduce the reader to the notion of real-world vulnerabilities and exploits present in a cloud system, which we will use throughout the paper. Second, we describe the threat model for our cloud scenario. Lastly, we describe the notion of Attack Graphs (AG), followed by a brief description of Markov games and of some well-known algorithms used to find the optimal policy or strategy for each player. We will use the attack scenario for cloud networks shown in Fig. 1 as a running example in our discussion.


Fig. 1. An example cloud system highlighting its network structure, the attacker and defender (admin) agents and the possible attacks and monitoring mechanisms.

2.1 Vulnerabilities and Exploits

Software security is defined in terms of three characteristics: Confidentiality, Integrity, and Availability [20]. Thus, in a broad sense, a vulnerability (that can be attacked or exploited) in a cloud system can be defined as a security flaw in a software service hosted over a given port. When exploited by a malicious attacker, it can cause loss of Confidentiality, Integrity or Availability (CIA) for that virtual machine (VM). The National Vulnerability Database (NVD) is a public directory of known vulnerabilities and exploits. It assigns each known attack a unique identifier (CVE-id) and describes the affected technology and the attack behavior. Thus, to model the known attacks against our system, we use the Common Vulnerabilities and Exposures (CVEs) listed in the NVD. In the cloud scenario described in Fig. 1, we have three VMs: an LDAP server, an FTP server, and a Web server. Each of these servers has a (set of) vulnerabilities present on it. On the LDAP server, an attacker can use a local privilege escalation to gain root privilege. The other two vulnerabilities, a cross-site scripting (XSS) attack on the Web server and the remote code execution on the FTP server, can only be executed with root access to the LDAP server. We can now describe the threat model for our scenario.

2.2 Threat Model

In the example cloud scenario, the attacker starts with user-level access to the LDAP server. The terminal goal is to compromise the FTP server (which, as we will see later, corresponds to an all-absorbing state in our Markov Game). The attacker can perform actions such as exploit-LDAP, exploit-Web or exploit-FTP. Note that the attacker has two possible paths to reach the goal node, i.e. priv(attacker, (FTP: root)), which are:


– Path 1: exploit-LDAP → exploit-FTP
– Path 2: exploit-LDAP → exploit-Web → exploit-FTP

On the other hand, the (network) Admin, who is the defender in our case, can choose to monitor (1) read-write requests made by services running on a VM using host-based Intrusion Detection Systems (IDS) like auditd, or (2) network traffic along both paths using network-based monitoring agents like Snort. We will denote these IDS systems using the terminology monitor-LDAP, monitor-FTP, etc. We assume that the Admin has a limited budget, i.e., cannot place all possible IDS systems on the cloud network, and thus must try to perform monitoring in an optimized fashion. On the other hand, the attacker will try to perform attacks along some path that minimizes their probability of getting detected. Further, we assume an attacker has knowledge of the defender's placement strategy because of the inherent reconnaissance phase in cyber-security scenarios, thus rendering pure strategies for the placement of detection systems useless. Thus, to come up with a good dynamic placement strategy, we need to model the various multi-stage attack paths and the attacker's strategy. We first discuss the formalism of Attack Graphs, which are a popular way to model the various attacks (and attack paths) in a cloud scenario [4,12], and then give a brief overview of two-player Markov Games.

2.3 Attack Graph Formalism

Attack Graphs (AG) are a representation tool used to model the security scenario of a complex networked system like the cloud. Researchers have shown that AGs can help model multi-stage or multi-hop attack behavior. An Attack Graph is a graph G = (N, E) that consists of a set of nodes N and a set of edges E, where:

– As shown in Fig. 2, nodes can be of four types: the nodes NC represent vulnerabilities (shown as rectangles), e.g. vulExists(LDAP, Local Priv. Escalation); ND represents the attacker's states (shown as diamonds), e.g. priv(attacker, (LDAP: user)); rule nodes NR represent a particular exploit action (shown as ellipses); and finally, root or goal nodes NG represent the goal of an attack scenario (shown using two concentric diamonds), e.g. priv(attacker, (FTP: root)).
– E = Epre ∪ Epost denotes the set of directed edges in the graph. An edge e ∈ Epre goes from a node in ND or NC to a rule node in NR and denotes that a rule can only be triggered if all the conditions of the edges going into n ∈ NR are satisfied (AND-nodes). An edge e ∈ Epost goes from a node in NR to n ∈ ND, indicating the change in the attacker's privilege upon successful execution of the rule.

Note how the two attack paths mentioned in the threat model section become evident by a simple look at the AG. The conditional and cumulative probability values pertaining to the success of an attack path over the AND (conjunct) and OR (disjunct) nodes can be calculated using the probability estimates described by Chung et al. [6].
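To make the AND/OR semantics above concrete, the following minimal Python sketch propagates success probabilities over a toy attack graph loosely modeled on the running example. The node names, the graph shape and the per-exploit success probabilities are illustrative assumptions, and the propagation rule only mirrors the spirit of the conjunct/disjunct combination in [6], not its exact formulas.

# A minimal sketch of AND/OR probability propagation on an attack graph.
# The graph and the per-exploit success probabilities below are illustrative;
# the paper derives such numbers from the CVSS Access Complexity vector [6].

graph = {
    # node: (kind, parents) -- "rule" nodes are AND, "priv" nodes are OR
    "priv-LDAP-user": ("priv", []),
    "rule-exp-LDAP":  ("rule", ["priv-LDAP-user"]),
    "priv-LDAP-root": ("priv", ["rule-exp-LDAP"]),
    "rule-exp-FTP":   ("rule", ["priv-LDAP-root"]),
    "priv-FTP-root":  ("priv", ["rule-exp-FTP"]),   # goal node
}
exploit_success = {"rule-exp-LDAP": 0.6, "rule-exp-FTP": 0.6}

def reach_prob(node, memo=None):
    """Cumulative probability that the attacker satisfies `node`."""
    memo = {} if memo is None else memo
    if node in memo:
        return memo[node]
    kind, parents = graph[node]
    if not parents:                      # entry condition, assumed satisfied
        p = 1.0
    elif kind == "rule":                 # AND: all preconditions hold and the exploit succeeds
        p = exploit_success[node]
        for parent in parents:
            p *= reach_prob(parent, memo)
    else:                                # OR: any incoming rule suffices
        p = 1.0
        for parent in parents:
            p *= 1.0 - reach_prob(parent, memo)
        p = 1.0 - p
    memo[node] = p
    return p

print(reach_prob("priv-FTP-root"))       # cumulative success probability of the goal

Rule (AND) nodes multiply the probabilities of their preconditions with the exploit's own success probability, while privilege (OR) nodes are satisfied if any incoming rule fires.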


Fig. 2. The left figure shows the attack graph of the synthetic cloud scenario shown in Fig. 1. The right figure shows the formulated Markov Game.

2.4 Two-Player Markov Games

We now define a two-player Markov Game and also introduce some notation used later in our modeling. We call the two players of this game the Defender D (the Admin of the cloud system) and the Attacker A (an adversary trying to exploit a vulnerability in the cloud system). With that, we can formally define a Markov Game as follows. A Markov Game for two players D and A is defined by the tuple (S, M, E, τ, U^D, U^A, γ^D, γ^A) [30], where:

– S = {s_1, s_2, s_3, ..., s_k} is the finite set of states of the game,
– M = {m_1, m_2, ..., m_n} is the finite set of monitoring actions for D,
– E = {e_1, e_2, ..., e_n} is the finite set of (exploit) actions available to A,
– τ(s, m_i, e_j, s') is the probability of reaching a state s' ∈ S from the state s ∈ S if D chooses to deploy the monitoring action m_i and A chooses to use the exploit e_j,
– U^i(s, m_i, e_j) is the reward obtained by player i (= A or D) if, in state s, D chooses to deploy the monitoring action m_i and A chooses to use the exploit e_j,
– γ^i ∈ [0, 1) is the discount factor for player i (= A or D).

Table 1. Vulnerability information for the cloud network

VM              | Vulnerability        | CVE           | CIA impact | Attack complexity
LDAP            | Local Priv Esc       | CVE-2016-5195 | 5.0        | MEDIUM
Web Server (WS) | Cross Site Scripting | CVE-2017-5095 | 7.0        | EASY
FTP             | Remote Code Exec.    | CVE-2015-3306 | 10.0       | MEDIUM

In light of recent studies on characterizing attackers based on personality traits [2], one might argue that a defender's perspective on long-term rewards is different from that of an attacker. Given that we did not find a formal model or user study clearly stating how these differ, we will consider γ^A = γ^D = γ going forward. As the solvers for our formulated game work even when γ^A ≠ γ^D, this assumption just helps us simplify the notation.

The concept of an optimal policy in this Markov Game is well-defined for zero-sum games [18], where U^D(s, m_i, e_j) = −U^A(s, m_i, e_j) ∀ s ∈ S, m_i ∈ M, and e_j ∈ E. In these cases, a small modification to the Value Iteration algorithm can be used to compute the min-max strategy for both players. To see this, note that the Q-value update for this Markov Game (for player x) becomes

  Q^x(s, m_i, e_j) = R^x(s, m_i, e_j) + γ · Σ_{s'} τ(s, m_i, e_j, s') · V^D(s')    (1)

where V^D(s') denotes the value function (or reward-to-go) with respect to D in state s'. We will use the notation M(s) (and E(s)) to denote the set of defender actions (and attacker actions) possible in state s. Given this, the mixed policy π(s) for state s over the defender's applicable actions (∈ M(s)) can be computed using the value update

  V^x(s) = max_{π(s)} min_{e_j} Σ_{m_i} Q^x(s, m_i, e_j) · π_{m_i}    (2)

where π_{m_i} denotes the probability of choosing the monitoring action m_i. When the Markov Game has a general-sum reward structure and one player can infer the other player's strategy before making their move, the min-max strategy becomes sub-optimal and one must consider other notions of equilibrium [9,33]. We give an overview of prior work on these lines later in the paper.
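As an illustration of the min-max value update in Eqs. (1) and (2), the following Python sketch solves the per-state matrix game with a linear program. It is not the authors' implementation; the example matrix simply reuses the immediate defender utilities of state s2 from Table 2 as stand-in Q-values (i.e., the γ = 0 case).

# A minimal sketch of the zero-sum min-max value update (Eqs. 1 and 2).
# Q is a |M(s)| x |E(s)| matrix of Q-values for one state.
import numpy as np
from scipy.optimize import linprog

def minimax_value(Q):
    """Return (value, defender mixed strategy) of the matrix game Q (row player maximises)."""
    n_m, n_e = Q.shape
    # Variables: [pi_1, ..., pi_{n_m}, v]; maximise v  <=>  minimise -v.
    c = np.zeros(n_m + 1); c[-1] = -1.0
    # For every attacker action e:  v - sum_m Q[m, e] * pi_m <= 0
    A_ub = np.hstack([-Q.T, np.ones((n_e, 1))])
    b_ub = np.zeros(n_e)
    A_eq = np.zeros((1, n_m + 1)); A_eq[0, :n_m] = 1.0   # mixed strategy sums to one
    b_eq = np.ones(1)
    bounds = [(0, 1)] * n_m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:n_m]

# Usage: defender utilities of state s2 (rows: no-mon, mon-Web, mon-FTP;
# columns: no-act, exp-Web, exp-FTP) as stand-in Q-values.
Q_s2 = np.array([[0., -7., -10.],
                 [-2., 6., -12.],
                 [-3., -10., 5.]])
v, pi = minimax_value(Q_s2)

Embedding this routine in a sweep over all states and iterating until the values converge yields the Value Iteration variant referred to above.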

2.5 Quantifying the Impact of Vulnerabilities

The use of the Common Vulnerability Scoring System (CVSS) for rating the impact of attacks is well studied in cyber-security [10,29]. For (almost all) CVEs listed in the NVD database, we have a six-dimensional CVSS v2 vector, which can be decomposed into multiple measures such as Access Complexity (AC), which models the difficulty of exploiting a particular vulnerability, and the impact on Confidentiality, Integrity, and Availability (the CIA score) gained by exploiting it. The values of AC are categorical, {EASY, MEDIUM, HIGH} (each with a corresponding numerical value), while CIA values are in the range [0, 10]. For the set of vulnerabilities present in our system, the values of these two metrics are shown in Table 1.
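The sketch below shows one way to carry the CVSS metrics of Table 1 into code. The dataclass fields follow the table, but the specific AC-to-probability mapping is an assumption made here for illustration; the paper itself relies on the CVSS score definitions and the function from [6].

# Turning the CVSS metrics in Table 1 into numbers usable by the game model.
# The AC-to-probability mapping is an illustrative assumption.
from dataclasses import dataclass

AC_TO_PROB = {"EASY": 0.9, "MEDIUM": 0.6, "HIGH": 0.3}   # assumed mapping

@dataclass
class Vulnerability:
    vm: str
    cve: str
    cia_impact: float        # CVSS impact sub-score, in [0, 10]
    access_complexity: str   # EASY / MEDIUM / HIGH

    @property
    def success_prob(self) -> float:
        return AC_TO_PROB[self.access_complexity]

vulns = [
    Vulnerability("LDAP", "CVE-2016-5195", 5.0, "MEDIUM"),
    Vulnerability("Web Server", "CVE-2017-5095", 7.0, "EASY"),
    Vulnerability("FTP", "CVE-2015-3306", 10.0, "MEDIUM"),
]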

3 Markov Game Modeling

Before discussing the game-theoretic formulation in detail, we highlight a few important assumptions. Besides the Markovian assumption, we assume that (1) there is a list of attacks known to both the attacker and the defender (which cannot be immediately fixed, either due to lack of resources or due to restrictions imposed by third-party customers who host their code on the cloud system [11,28]) and (2) the attacker may reside in the system but will remain undetected until it attempts to exploit an existing vulnerability, i.e. a stealthy adversary [32]. These assumptions force our formulation to (1) only deal with known attacks and (2) come up with good (but sub-optimal) strategies for placing detection mechanisms. The latter is a result of ignoring the partial observability inherent in the problem [21,22] in order to come up with scalable solutions, necessary for cloud networks.

States. The states of our Markov Game (MG) are derived using the nodes ND and NG of an Attack Graph (AG) (the blue diamond-shaped nodes are mapped to the blue circular nodes in Fig. 2). These nodes in the AG represent the state of an attacker in the cloud system, i.e. the attacker's position and their privilege level. The goal or root nodes NG of an AG are mapped to terminal self-absorbing states in our MG, while the other nodes ND represent non-terminal game states. Note that the location of an attacker on multiple physical servers in the cloud network can map to a single state of the AG (and therefore, a single state in our MG). Among the non-terminal states there exists a set of states Si that represent the external-facing entry points to the cloud network, the initial states of any multi-stage attack. For the cloud scenario shown in Fig. 1, we have four states. The state s0 corresponds to the goal node A-{FTP:root} and is a terminal state of the MG. The state s1 (∈ Si) corresponds to the state where an attack originates, while the two states s2 and s3 correspond to nodes where the attacker has different access privileges on the three servers (LDAP, WS and FTP). Given that we use a ternary range to denote an adversary's privilege on each server (no-access, user or root-user), there can be a maximum of nine states (# servers × # access-levels) in the AG and hence in our MG. Note that, given a set of known attacks, the number of states is often much smaller (four vs. nine for our scenario) because most of the states are not reachable from the states in Si.

[Footnote 1: Partial observability over the state space can increase the number of states to a power set of this number, i.e. 2^(# servers × # access-levels).]

Players and Pure Strategies. As mentioned above, the players of our game are the Admin (or the defender) D and the attacker A. The pure strategy set of A in state s consists of the exploit actions they can perform with the privilege they have in state s. For example, consider s2, where the attacker has the access vector A-{LDAP:root, FTP:user, WS:user}. With this as the precondition, the attack actions can be represented by the rule nodes NR (shown as ovals) in the AG. Note that there is always a vulnerability node ∈ NC associated with a rule node and thus with each action. The pure strategy set of D in a state s consists of monitoring actions, where each such action corresponds to placing an Intrusion Detection System (IDS) for detecting attacks that can be executed by A in state s. These actions are made possible in real-world scenarios by using two sorts of IDS: (1) host-based IDS like auditd that can notify D about abnormalities in the use of CPU resources or access to files on the server, and (2) network-based IDS like Snort that can observe traffic on the wire and report the use of unexpected bit patterns in the header or body of a packet. Although a pure strategy for D can only detect a single attack in our simple example, it is possible that a set of detection systems, capable of detecting multiple attacks, is considered as a pure strategy. We will see this in the case of the real-world cloud scenario discussed in the experimental section. In the context of Stackelberg Security Games, such groups of actions are often called schedules [15,24] and the pure strategy is defined over these schedules. We note that our modeling supports such representations.

[Footnote 2: In these cases, the Subset of Sets Are Sets (SSAS) property defined in [16] may not hold and thus the Strong Stackelberg Equilibrium will not always be a Nash Equilibrium of the formulated Markov Game (see Lemma 1 later for details).]

To allow for a realistic setting, we add one more action to the pure strategy set of each player: no-act and no-mon. These represent the no-op for each player; they allow an attacker to not attack in a particular state if it feels there is a high risk of getting caught, and similarly allow a defender to not monitor for an attack, thereby saving valuable resources.

Transitions. The transitions of our MG represent, for a particular state and a pair of actions drawn from the joint action space E × M, the probability with which the game reaches a state s', i.e. τ(s, m, e, s'). There exist a few obvious constraints in the context of our MG: (1) τ(s, m, no-act, s) = 1, i.e. if an attacker does not execute an attack, the game remains in the same state; (2) when e ≠ no-act, τ(s, m_e, e, s') = p/|Si| ∀ s' ∈ Si, where p is the probability that e is detected by m_e, the monitoring service deployed to detect e, i.e. when successfully detected, the attacker starts from either of the initial states with equal probability; and (3) τ(s, no-mon, e, s') = 0 if s ∉ Si and s' ∈ Si, i.e. an attacker cannot be detected if the defender does not perform a monitoring action. We highlight a few transitions of our Markov Game in Fig. 2. In state s1, when the defender does not monitor the exp-LDAP attack action, the game moves to the state s2 with probability 0.6 (and remains in s1 with probability 0.4). These probabilities are calculated using the Access Complexity vector and the function defined in [6] for obtaining the probability of success of an attack. This is also done when the defender deploys a monitoring mechanism m_e but the attacker executes another attack e' where e' ≠ e and e' ≠ no-act (see the transition for an example joint action from s2 in Fig. 2). Lastly, the transition from s3 shows a case relating to constraint (3) above. The value of τ(s3, mon-FTP, exp-FTP, s1) = 0.9 because monitoring access to files like /etc/passwd and /etc/shadow can only detect some forms of privilege escalation (such as remote code execution that tries to either create a new user or escalate the privilege of a non-root user), but may not catch a case where an attacker simply obtains access to a root user account.

Table 2. Utilities (U^A, U^D) for state s2

A \ D     | no-mon  | mon-Web | mon-FTP
no-act    | 0, 0    | 0, −2   | 0, −3
exp-Web   | 7, −7   | −8, 6   | 7, −10
exp-FTP   | 10, −10 | 10, −12 | −8, 5

Table 3. Utilities (U^A, U^D) for states s1 and s3

s1: A \ D    | no-mon | mon-LDAP
    no-act   | 0, 0   | 0, −2
    exp-LDAP | 5, −5  | −5, 3

s3: A \ D    | no-mon  | mon-FTP
    no-act   | 0, 0    | 0, −2
    exp-FTP  | 10, −10 | −8, 6

Rewards. The rewards that the players obtain, depending on their strategy in a particular state (except the terminal state s0) of our cloud scenario, are shown in Tables 2 and 3. Most prior works [22,26,32] have not used sensible heuristics to come up with the attacker's utility and/or the defender's resource costs. In our case, the reward values are obtained using multiple metrics: (1) the impact score (IS) (also called the CIA impact score) of a particular attack, (2) the cost of performance degradation (provided by security and engineering experts in the cloud domain, often obtained by running MiniNET simulations) based on the placement of a particular IDS [28], and (3) the number of hops taken by an attacker to reach a particular state, which is often used to measure how advanced an Advanced Persistent Threat (APT) is. Note that the last factor is a non-Markovian part of the overall reward function of our Markov Game; it depends on the path taken by an attacker to reach a particular state. To bypass this issue, we consider all possible paths an attacker can take to reach the particular state and average the path value, which gives us an average of how advanced the APT is. Further, the actual path taken by a stealthy adversary who has been residing in the network for a long time is difficult (if not impossible) to obtain. Thus, an average estimate is a good heuristic for estimating the importance of an APT.

We will now explain how the reward values for the action pair (mon-Web, exp-Web), shown in Table 2, were obtained. First, the impact score of this vulnerability (CVE-2017-5095, see Table 1) is 7. Second, we monitored performance using Nagios [8] to measure the end-to-end network bandwidth, the number of concurrent requests to live web-services and the delay in servicing network requests when mon-Web was deployed. We observed an increase in network delay, a decrease in network bandwidth and a decrease in the number of concurrent requests serviced per second. Based on expert knowledge, we estimated the reward (or rather, the impact on performance) of placing the IDS that monitors this vulnerability to be −2. Finally, given that this vulnerability can only be executed if the attacker has exploited at least one vulnerability before coming to this state, the APT score was calculated to be 1. Thus, the defender's reward for placing the correct IDS that can detect the corresponding attacker action is 7, minus 2 (cost incurred due to reduced performance), plus 1 (for detecting an APT that had already penetrated one hop into the network), totaling 6. On the other hand, the attacker's reward for this action pair is −7 for spending effort in executing a vulnerability of impact 7, plus −1 for losing a vantage point in the cloud network, totaling −8. The other reward values were defined using a similar line of reasoning. Given that the defender's cost of placing an IDS is of no concern to the attacker, when the attacker chooses no-act, A's reward is 0. On the contrary, the defender still incurs a negative reward if it deploys a monitoring system, because doing so impacts the performance of the sub-net.

[Footnote 3: This is a strong reason to move away from the zero-sum reward modeling in [5].]

Algorithm 1. Dynamic Programming for finding SSE in Markov Games
 1: procedure Given (S, M, E, τ, U^D, U^A, γ^D = γ^A = γ),
 2:   Output(V^i(s), π^i(s) ∀ i ∈ {A, D})
 3:   V(s) = 0 ∀ s
 4:   loop: if i == k break;
 5:   // Update Q-values
 6:   Update Q^D(s, m, e) and Q^A(s, m, e) ∀ s ∈ S, m ∈ M(s), e ∈ E(s)
 7:     using U^D, U^A and V(s).
 8:   // Do value and policy computation
 9:   Calculate V^i(s) and π^i(s) for i ∈ {A, D} using the values Q^i(s, m, e) in Eq. 3
10:   i ← i + 1
11:   goto loop.
12: end procedure
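The helper below is a reading of the worked (mon-Web, exp-Web) example in the Rewards paragraph, written as runnable Python. The function name and the idea that every table entry follows this exact arithmetic are our assumptions; the authors only state that the other entries were defined using a similar line of reasoning.

# Sketch of the reward construction for one state, following the worked example above.
def reward_pair(impact, perf_cost, apt_hops, detected, attacked):
    """(attacker_reward, defender_reward) for one joint (monitor, exploit) choice."""
    if not attacked:                       # attacker plays no-act
        return 0, -perf_cost               # defender still pays for the deployed IDS
    if detected:                           # the deployed IDS covers the exploit
        return -impact - apt_hops, impact - perf_cost + apt_hops
    return impact, -impact - perf_cost     # exploit goes unnoticed

# Reproduces the worked example: exp-Web (impact 7, one prior hop) vs. mon-Web (cost 2).
assert reward_pair(7, 2, 1, detected=True, attacked=True) == (-8, 6)
# And the no-act column of Table 2: the defender pays only the performance cost.
assert reward_pair(0, 3, 0, detected=False, attacked=False) == (0, -3)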

3.1 Optimal Placement Strategy

Finding the optimal solution to a two-player general-sum Markov Game is more involved than finding the min-max strategy (in zero-sum settings). Moreover, in our threat model we assume that the attacker A, with reconnaissance efforts, will get to know the strategy of the defender D, imparting to the game the flavor of leader-follower games. We highlight a dynamic programming approach in Algorithm 1. Although this algorithm looks similar to the one used for computing the min-max equilibrium, it has an important difference. In line 9, instead of using Eq. 2 to calculate the optimal value and player strategies, we compute the Strong Stackelberg Equilibrium (SSE) in each state. In each iteration, we first consider the Q-values for that state and the joint actions, represented as a normal-form game matrix. We then find the optimal policy for both players in state s. Since our model, at present, does not consider multiple adversary types, the equilibrium calculation for each state can be done in polynomial time [15]. This type of solution resembles the idea of finding Stackelberg equilibria in discounted stochastic games discussed in [33]. The authors describe an ILP approach over all the states of the Markov Game, which becomes computationally intensive in our case given the large number of states in our formulation. Furthermore, the iterative approach provides an anytime solution which can be stopped at a premature stage (by setting lower values of k in Algorithm 1) to yield a strategy for placement. For completeness, we briefly highlight the optimization problem in [23] that we use to update the value of each state in Algorithm 1:

  max_{π^D, π^A}  Σ_{m∈M} Σ_{e∈E} Q^D(s, m, e) · π^D_m · π^A_e    (3)

  s.t.  Σ_{m∈M} π^D_m = 1,  π^D_m ∈ [0, 1] ∀ m ∈ M      (the defender selects a valid mixed strategy)
        Σ_{e∈E} π^A_e = 1,  π^A_e ∈ {0, 1} ∀ e ∈ E      (the attacker selects a valid pure strategy)
        0 ≤ v − Σ_{m∈M} Q^A(s, m, e) · π^D_m ≤ (1 − π^A_e) · L  ∀ e ∈ E   (the attacker's pure strategy maximizes their reward given the defender's mixed strategy)

where L is a large positive number. The assumption that an attacker, with the inherent advantage of reconnaissance, is aware of the defender's mixed policy in each state can be a strong one in the context of Markov Games. Thus, one might question the optimality of the strategy that we come up with using Algorithm 1. In [17], researchers have shown that SSEs are a subset of the Nash Equilibria for a particular class of problems. Specifically, they show that if the security resource allocation problem (which in our case is allocating IDS to cover vulnerabilities) has a particular property, termed SSAS, then the defender's SSE is also a NE of the game. Given this, we can state the following.

Lemma 1. If the Subset of Sets Are Sets (SSAS) property holds in every state s of a Markov Game, then the SSE is also a NE of the Markov Game.

Proof. We prove the lemma by contradiction. Assume that the SSAS property holds in every state of a Markov Game (MG), but the SSE of the MG is not a NE. First, consider γ = 0 for this MG. Then the SSE and NE strategies for each state can be calculated based only on the utilities of that state. Now, if the SSE is not a NE of this Markov Game, then there is some state in which the SSE strategy is not a NE strategy. But if that is the case, then we would have violated the SSAS theorem in [17] for that state, which cannot be true. For the case γ > 0, the proof still holds because the SSAS property does not depend on the Q-values with which the strategies are computed in the Markov Game. ∎

Note that the lemma holds trivially for our small example because the defender's pure strategy in each state is to deploy a single IDS (and thus, all subsets of schedules are possible schedules).

Fig. 3. Defender's value for each of the four states: s1 (top-left), s2 (top-right), s3 (bottom-left), and s0, the all-absorbing terminal state (bottom-right).
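For readers who prefer code to the MILP above, the following Python sketch computes the per-state SSE with the equivalent multiple-LPs formulation for a single attacker type (one LP per attacker pure strategy, keeping the best). It is a hedged illustration of the computation performed in line 9 of Algorithm 1, not the authors' Gurobi-based implementation.

# Per-state Strong Stackelberg Equilibrium via multiple LPs (single attacker type).
import numpy as np
from scipy.optimize import linprog

def state_sse(Q_D, Q_A):
    """Q_D, Q_A: |M| x |E| Q-value matrices for one state. Returns (V^D, pi^D, best e)."""
    n_m, n_e = Q_D.shape
    best = (-np.inf, None, None)
    for e in range(n_e):
        # maximise sum_m Q_D[m, e] * pi_m  ->  minimise the negation
        c = -Q_D[:, e]
        # best-response constraints: sum_m (Q_A[m, e'] - Q_A[m, e]) * pi_m <= 0 for all e'
        A_ub = (Q_A - Q_A[:, [e]]).T
        b_ub = np.zeros(n_e)
        A_eq = np.ones((1, n_m)); b_eq = np.ones(1)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * n_m)
        if res.success and -res.fun > best[0]:
            best = (-res.fun, res.x, e)
    return best

Taking the maximum over the attacker's pure strategies implicitly applies the usual SSE tie-breaking in favour of the leader.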

4 Experimental Results

In this section, we first compare the effectiveness of the optimal strategies for our general-sum leader-follower Markov Game against the Uniform Random Strategy, which is a popular method often used as a baseline for Moving Target Defense strategies in cyber-security. We then discuss the set-up of a real-world, small-scale cloud scenario and the gains we can obtain using our formulated Markov Game.

[Footnote 4 (to Lemma 1): In the case of multiple attackers, the SSE need not be a NE. Although such scenarios exist in cyber-security settings, we consider a single attacker in this modeling and plan to consider the multiple-attacker setting in the future.]

4.1 Evaluation of Strategies

We first discuss two baseline strategies and briefly explain why we chose them for comparison with the SSE strategy of the general-sum leader-follower Markov Game formulated in the previous section.

• Uniform Random Strategy (URS). Here, the defender samples an action from a uniform random distribution over its pure strategies. For example, in the state s2 shown in Table 2, the defender can choose to monitor the FTP server, the Web server, or neither of them, each with probability 0.33. Researchers have claimed that the choice of which MTD configuration to shift to should be made using a uniform random strategy [34]. Although other methods based on Stackelberg Security Games (SSGs) have shown that such a strategy may be sub-optimal [28], it provides a baseline strategy that can be used in the context of our Markov Game. Adapting the latter strategies proposed in previous work would require us to compile our multi-stage Markov Game into a single-step normal-form game. First, there is no trivial way of doing this conversion, as the rewards in [28] capture the average impact on the network, which is difficult to encode meaningfully in our Markov Game; furthermore, the pure strategy set in [28] would have to incorporate full attack paths as opposed to single attack actions, which would make the strategy computation time-consuming. Second, our work can be seen as a generalization of applying the notion of Stackelberg equilibria, similar to [33], to Markov Games in the context of IDS placement, and thus as a counterpart of the normal-form game solution described in [28]. Hence, we do not consider [28] as a baseline.

• Min-max Strategy. Although our game is a general-sum setting, one might ask how sub-optimal the min-max strategy of a similar zero-sum Markov Game is when we ignore the impact on performance. In essence, the attacker still has the same utilities in the individual states shown in Tables 2 and 3, but the defender's reward values are simply the opposite of the attacker's, making it a zero-sum game. Here, we expect to see that ignoring the impact on performance reduces the defender's overall utility.

Comparison of Strategies. In Fig. 3, we plot the values of the four states (V(s)) of our game for the baselines (URS and Min-max) and the Strong Stackelberg Equilibrium (SSE). To provide empirical support for Lemma 1, we also plot the Nash Equilibrium (NE). On the x-axis, we vary the discount factor and on the y-axis we plot the value of the state with respect to the defender. In the terminal state s0, the defender gains a high negative reward because the attacker was able to exploit all the possible vulnerabilities successfully without getting detected. Thus, for all the states, since there is a non-zero probability of reaching s0, the defender's
value function is negative. Note that as one weighs future rewards more, i.e. as γ approaches 1, the value of the states decreases in magnitude because the negative reward in s0 is given higher weight. As stated above, the SSE of our example game is the same as the NE of our Markov Game, and the curves of both strategies overlap. On the other hand, URS is much worse off than our strategy in all the states with more than a single action, whereas Min-max, although better than URS, is sub-optimal with respect to the SSE in all states except s3 and s0. Since s0, being a terminal state, has only one action for the defender, all the methods are trivially equivalent there and all the curves overlap (bottom-right in Fig. 3). In state s3, which is just a single action away from the terminal state with its high negative reward, the defender always picks the action to monitor for an attack regardless of the performance impact (whose magnitude is small in comparison to the impact of the attack). Thus, even though the Min-max strategy is ignorant of the latter costs, it picks the same strategy as the SSE, and their plots overlap (bottom-left in Fig. 3).

Before discussing the differences between SSE and Min-max in the other two states (s1 and s2), we take a look at the mixed strategy obtained by finding the SSE of our game. This will help us explain the sub-optimality of the Min-max strategy. For a discount factor of γ = 0.8:

  π_MG-SSE(s0): {terminate: 1.0}
  π_MG-SSE(s1): {no-mon: 0.097, mon-LDAP: 0.903}
  π_MG-SSE(s2): {no-mon: 0.0, mon-Web: 0.539, mon-FTP: 0.461}
  π_MG-SSE(s3): {no-mon: 0.0, mon-FTP: 1.0}

Note that, in our example, barring the terminal state s0, every state has only one or two proper detection actions, since no-mon asks the defender not to monitor for any attack. Thus, we expected these no-mon actions to have probabilities almost equal to zero (unless monitoring has a considerable impact on performance). In the case of state s1, the 0.097 probability of picking that action shows that, in states far away from the terminal state, the defender chooses to be a little more relaxed in terms of security and pays more attention to performance. On the contrary, in state s3 an exploit action moves the game to the terminal state with its high negative reward for the defender, and thus the defender is asked to place all attention on security. In general, this implies that an Admin D should pay more attention to security against APTs deep within the network, in states closer to a critical resource, and can choose to reason about performance near the entry points of the cloud system. Thus, in these states, the optimal min-max strategy, oblivious to the performance costs, invests in monitoring resources and thereby becomes sub-optimal with respect to the SSE.

4.2 Case Study on the Cloud

In this section, we perform a case study on a real-world sub-network in a cloud system, briefly highlighting the system setup, the Markov Game formulation, the comparison between URS and SSE for a (cherry-picked) state, and how all these strategies can be implemented with the help of Software Defined Networking.

Fig. 4. A real-world cloud system. A remote attacker (Kali VM, 192.168.101.*/24) reaches the internal network 172.16.0.0/24 through a PANOS firewall (172.16.0.1 / 192.168.101.221; Snort IDS, HTTPS, Proxy, DNS, SSH). The internal hosts are GRU (Windows 2012 R2, 172.16.0.22; AD, DNS), Kevin (CentOS 6, 172.16.0.8; HTTP, HTTPS, FTP, SSH), Helen (Windows 7, 172.16.0.23; Domain Controller), Dave (Debian, 172.16.0.20; HTTP, HTTPS, FTP, SSH), and George (Debian, 172.16.0.21; HTTP, HTTPS, DNS, SSH).

Implementation Details. We utilized the Virtual Machine (VM) images from the Western Regional Collegiate Cyber Defense Competition (WRCCDC) [7]. The competition helps university students (the Blue Team) gain first-hand experience in dealing with real-world attacks. Students are asked to defend a corporate infrastructure against experienced white-hat hackers (the Red Team). In the scenario shown in Fig. 4, a Red Team attacker mounts a multi-step attack following a low-and-slow approach that tries to evade the IDS in place. The goal of the attacker can be either to disrupt the network services or to exfiltrate data out of the private networks. Both of these attacks, if successful, can lead to the loss of mission-critical information (the FTP server files are valuable to the company) and of business (service downtime). We model each of these goal states in the attack graph as states that lead to an all-absorbing terminal state with unit probability, thus ending the game. The set of attacks was discovered using low-intensity network scanning tools like OpenVAS that run over an extended period of time and generate a report of the vulnerabilities present in the system. Due to space considerations, we summarize the vulnerabilities present in our cloud network in Fig. 5. Corresponding to the attacks, we considered the deployment of IDS mechanisms such as Snort IDS, Web Proxy, etc., situated at different levels of the protocol stack.

We used WRCCDC's VM images to create a similar environment in our organization's cloud service. To connect these VMs, we created a flat network with Palo Alto Networks OS (a Next-Generation Firewall) hosted at the gateway of the network (172.16.0.0/24) and had eight host machines in total [3]. The network was connected using an SDN switch with OpenFlow v1.3 for (1) vulnerability scanning to gather knowledge about known attacks in the cloud, (2) computing the Markov Game strategy, and (3) enforcing a particular deployment strategy and switching to a new one after a fixed time period T. For the first step, we used scanning tools like OpenVAS, as described earlier. For the second step, we used our strategy computation algorithm, which solves the optimization problem using the Gurobi solver. For the last step, we used enable and disable (kill) scripts for the different IDS mechanisms and executed them using SDN protocols.

Host        | High | Medium | Low
172.16.0.22 | 4    | 14     | 1
172.16.0.23 | 2    | 3      | 1
172.16.0.8  | 3    | 8      | 3
172.16.0.16 | 0    | 13     | 6
172.16.0.20 | 0    | 2      | 1
172.16.0.11 | 0    | 1      | 2
172.16.0.1  | 0    | 0      | 1
172.16.0.21 | 0    | 0      | 1
Total (8 hosts) | 9 | 41    | 16

Fig. 5. The number of vulnerabilities in our cloud system found via persistent exploration with the OpenVAS vulnerability scanner.

Fig. 6. Defender's value for the state s1 as the discount factor increases from 0.5 to 1.

Results. In the formulated Markov Game, we had eight states, one corresponding to each VM in our cloud system. In Fig. 6, we highlight the defender's value for state s1. The state s1 (in which the attacker has root privilege on the Firewall VM) was a lucrative vantage point for an attacker in the network because all other states were reachable via an array of vulnerabilities accessible from s1. In this state, the defender had pure strategies that deployed a set of IDS mechanisms (as opposed to a single IDS). Each such pure strategy detected a particular set of vulnerabilities, and its components were either easier or more profitable (in terms of resource usage) to deploy together. For example, the deployment of two network-based IDS using Snort is easier for the Admin to configure than the deployment of a host-based and a network-based IDS at the same time. For this study, we did not consider a high negative utility in the terminal state. We also noticed that the magnitudes of the positive defender utility (obtained using the impact score) and of the negative defender utility (calculated using a Mini-NET simulation similar to [28]) were comparable. Thus, the defender's value for all the states (and thus for s1) turned out to be positive. In Fig. 6, a key reason for the high value gains with respect to URS is that URS neither paid more attention to attacks that had higher impact nor cared about performance, both of which were essential given that s1 is a critical vantage point in the cloud network.
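A sketch of the periodic switching loop described above is given below. The enable_ids/disable_ids hooks are hypothetical placeholders for the enable and kill scripts invoked over SDN in the test bed, and the strategy, period and round count in the usage comment are made up.

# Periodically re-sample a pure IDS deployment from a state's mixed strategy.
import random, time

def enable_ids(name): ...    # placeholder for the real enable script
def disable_ids(name): ...   # placeholder for the real kill script

def mtd_loop(mixed_strategy, period_T, rounds):
    """mixed_strategy: dict mapping a pure deployment (tuple of IDS names) to a probability."""
    current = ()
    for _ in range(rounds):
        deployments, probs = zip(*mixed_strategy.items())
        chosen = random.choices(deployments, weights=probs, k=1)[0]
        for ids in current:
            if ids not in chosen:
                disable_ids(ids)
        for ids in chosen:
            if ids not in current:
                enable_ids(ids)
        current = chosen
        time.sleep(period_T)

# e.g. mtd_loop({("snort-web",): 0.54, ("snort-ftp",): 0.46}, period_T=300, rounds=12)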

5 Related Work

Sheyner et al. [12] presented a formal analysis of attacks on a network with cost-benefit analysis that suffers from scalability issues in the context of cloud networks. To solve this problem, the authors of [4] provided a polynomial-time method
for attack-graph construction using distributed algorithms. However, these methods model the security situation from an attacker's point of view and are mostly used to characterize the qualitative impact of an attack. We go beyond this notion and leverage this representation to construct a two-player game that helps the defender come up with strategic defenses.

The authors of [13] introduced the idea of moving secret proxies to new network locations using a greedy algorithm, which they show can thwart brute-force and DDoS attacks. In [35], Zhuang et al. showed that an MTD system designed with intelligent adaptations further improves effectiveness. In [29], the authors show that intelligent strategies based on common intuitions can be detrimental to security and highlight how game-theoretic reasoning can alleviate the problem. Along those lines, Lye and Wing [19] and Sengupta et al. [27,28] use a game-theoretic approach to model the attacker-defender interaction as a two-player game, calculating the optimal response of the players using the Nash and the Stackelberg equilibrium concepts. The flavor of these approaches is similar to that of Stackelberg Security Games (SSGs) for numerous physical security applications highlighted in [23,31]. Although they mention the use of Markov Decision Process (MDP) approaches for MTD, they leave it as future work. On these lines, the authors of [33] show that the SSE of Markov games can be arbitrarily sub-optimal for stochastic discounted path games and provide an LP approximation for calculating the former. In this work, we believe that the Markovian assumption is sufficient to capture the strategy of an attacker and propose a dynamic programming based anytime solution method to find the SSE.

In the context of cloud systems, [25] discussed a risk-aware MTD strategy where the attack surface is modeled as a non-decreasing probability density function and the risk of migrating a VM to a replacement node is estimated using probabilistic inference. In [14], the authors highlight obfuscation as a possible MTD strategy to deal with attacks like OS fingerprinting and network reconnaissance in the SDN environment. Furthermore, they highlight that the trade-off introduced by such random mutations, which may disrupt active services, requires a cost-benefit analysis. In this work, we follow suit and consider the trade-off between the security and the performance of the cloud system.

6 Conclusion and Future Work

A cloud network is composed of heterogeneous network devices and applications interacting with each other. The interaction of these entities over a complex and hierarchical network structure poses a substantial risk to the overall security of the cloud system. At the same time, it makes monitoring ongoing attacks by adversaries located both outside and inside the network a challenging problem. In this paper, we model the concept of Moving Target Defense (MTD) for shifting the detection surface in the cloud system as a Markov Game. This helps us reason about (1) the security impact of multi-stage attacks that are characteristic of Advanced Persistent Threats (APTs) while (2) ensuring that we do not place all possible security mechanisms in the cloud system, thereby hogging
all the valuable cloud resources. The various parameters of our Markov Game are obtained using an array of software tools, prior research and publicly available metrics ubiquitous in the security community. We propose a dynamic programming based anytime algorithm to find the Strong Stackelberg Equilibrium of our Markov Game and highlight its superiority over baseline strategies on an emulated and a small real-world cloud network setup.

Acknowledgment. We want to thank the reviewers for their comments. This research is supported in part by these research grants: Naval Research Lab N00173-15-G017, AFOSR grant FA9550-18-1-0067, the NASA grant NNX17AD06G, ONR grants N00014-16-1-2892, N00014-18-1-2442, N00014-18-12840, NSF-US DGE-1723440, OAC-1642031, SaTC-1528099, 1723440 and NSF-China 61628201 and 61571375. The first author is also supported by an IBM Ph.D. Fellowship.

References
1. National Vulnerability Database. https://nvd.nist.gov. Accessed 25 Sept 2018
2. Basak, A., et al.: An initial study of targeted personality models in the flipit game. In: Bushnell, L., Poovendran, R., Başar, T. (eds.) GameSec 2018. LNCS, vol. 11199, pp. 623–636. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01554-1_36
3. Chowdhary, A., Dixit, V.H., Tiwari, N., Kyung, S., Huang, D., Ahn, G.J.: Science DMZ: SDN based secured cloud testbed. In: IEEE Conference on Network Function Virtualization and Software Defined Networks (2017)
4. Chowdhary, A., Pisharody, S., Huang, D.: SDN based scalable MTD solution in cloud network. In: ACM Workshop on Moving Target Defense (2016)
5. Chowdhary, A., Sengupta, S., Huang, D., Kambhampati, S.: Markov game modeling of moving target defense for strategic detection of threats in cloud networks. In: AAAI Workshop on Artificial Intelligence for Cyber Security (2019)
6. Chung, C.J., Khatkar, P., Xing, T., Lee, J., Huang, D.: NICE: network intrusion detection and countermeasure selection in virtual network systems. IEEE Trans. Dependable Secure Comput. 10(4), 198–211 (2013)
7. Western Regional Collegiate Cyber Defense Competition: WRCCDC (2018). https://archive.wrccdc.org/images/2018/
8. Nagios Enterprises: Nagios (2015)
9. Guerrero, D., Carsteanu, A.A., Huerta, R., Clempner, J.B.: An iterative method for solving Stackelberg security games: a Markov games approach. In: 2017 14th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE), pp. 1–6. IEEE (2017)
10. Houmb, S.H., Franqueira, V.N., Engum, E.A.: Quantifying security risk level from CVSS estimates of frequency and impact. JSS 83(9), 1622–1634 (2010)
11. Jajodia, S., Park, N., Serra, E., Subrahmanian, V.: SHARE: a Stackelberg honey-based adversarial reasoning engine. ACM Trans. Internet Technol. (TOIT) 18, 30 (2018)
12. Jha, S., Sheyner, O., Wing, J.: Two formal analyses of attack graphs. In: Proceedings of the 15th IEEE Computer Security Foundations Workshop, pp. 49–63. IEEE (2002)


13. Jia, Q., Sun, K., Stavrou, A.: MOTAG: moving target defense against internet denial of service attacks. In: 2013 22nd International Conference on Computer Communication and Networks, pp. 1–9. IEEE (2013)
14. Kampanakis, P., Perros, H., Beyene, T.: SDN-based solutions for moving target defense network protection. In: IEEE 15th International Symposium on a World of Wireless, Mobile and Multimedia Networks. IEEE (2014)
15. Korzhyk, D., Conitzer, V., Parr, R.: Complexity of computing optimal Stackelberg strategies in security resource allocation games. In: AAAI (2010)
16. Korzhyk, D., Yin, Z., Kiekintveld, C., Conitzer, V., Tambe, M.: Stackelberg vs. Nash in security games: an extended investigation of interchangeability, equivalence, and uniqueness. J. Artif. Int. Res. 41(2), 297–327 (2011). http://dl.acm.org/citation.cfm?id=2051237.2051246
17. Korzhyk, D., Yin, Z., Kiekintveld, C., Conitzer, V., Tambe, M.: Stackelberg vs. Nash in security games: an extended investigation of interchangeability, equivalence, and uniqueness. J. Artif. Intell. Res. 41, 297–327 (2011)
18. Littman, M.L.: Markov games as a framework for multi-agent reinforcement learning. In: Eleventh International Conference on Machine Learning (1994)
19. Lye, K.W., Wing, J.M.: Game strategies in network security. Int. J. Inf. Secur. 4, 71–86 (2005)
20. McCumber, J.: Information systems security: a comprehensive model. In: Proceedings of the 14th National Computer Security Conference (1991)
21. Miehling, E., Rasouli, M., Teneketzis, D.: Optimal defense policies for partially observable spreading processes on Bayesian attack graphs. In: Proceedings of the Second ACM Workshop on Moving Target Defense, pp. 67–76. ACM (2015)
22. Nguyen, T.H., Wright, M., Wellman, M.P., Singh, S.: Multistage attack graph security games: heuristic strategies, with empirical game-theoretic analysis. In: Security and Communication Networks 2018 (2018)
23. Paruchuri, P., Pearce, J.P., Marecki, J., Tambe, M., Ordonez, F., Kraus, S.: Playing games for security: an efficient exact algorithm for solving Bayesian Stackelberg games. In: AAMAS 2008, pp. 895–902 (2008)
24. Paruchuri, P., Pearce, J.P., Marecki, J., Tambe, M., Ordonez, F., Kraus, S.: Playing games for security: an efficient exact algorithm for solving Bayesian Stackelberg games. In: AAMAS (2008)
25. Peng, W., Li, F., Huang, C.T., Zou, X.: A moving-target defense strategy for cloud-based services with heterogeneous and dynamic attack surfaces. In: IEEE International Conference on Communications (ICC) (2014)
26. Schlenker, A., et al.: Deceiving cyber adversaries: a game theoretic approach. In: Proceedings of the 17th International Conference on Autonomous Agents and Multi Agent Systems, pp. 892–900. International Foundation for Autonomous Agents and Multiagent Systems (2018)
27. Sengupta, S., Chakraborti, T., Kambhampati, S.: MTDeep: boosting the security of deep neural nets against adversarial attacks with moving target defense. In: Workshop on Engineering Dependable and Secure Machine Learning Systems. AAAI (2018)
28. Sengupta, S., Chowdhary, A., Huang, D., Kambhampati, S.: Moving target defense for the placement of intrusion detection systems in the cloud. In: Bushnell, L., Poovendran, R., Başar, T. (eds.) GameSec 2018. LNCS, vol. 11199, pp. 326–345. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01554-1_19
29. Sengupta, S., et al.: A game theoretic approach to strategy generation for moving target defense in web applications. In: AAMAS (2017)


30. Shapley, L.S.: Stochastic games. Proc. Nat. Acad. Sci. 39(10), 1095–1100 (1953)
31. Sinha, A., Nguyen, T.H., Kar, D., Brown, M., Tambe, M., Jiang, A.X.: From physical security to cybersecurity. J. Cybersecur. 1(1), 19–35 (2015)
32. Venkatesan, S., Albanese, M., Cybenko, G., Jajodia, S.: A moving target defense approach to disrupting stealthy botnets. In: Proceedings of the 2016 ACM Workshop on Moving Target Defense, pp. 37–46. ACM (2016)
33. Vorobeychik, Y., Singh, S.: Computing Stackelberg equilibria in discounted stochastic games (corrected version) (2012)
34. Zhuang, R., DeLoach, S.A., Ou, X.: Towards a theory of moving target defense. In: ACM MTD Workshop 2014, pp. 31–40. ACM (2014)
35. Zhuang, R., Zhang, S., Bardas, A., DeLoach, S.A., Ou, X., Singhal, A.: Investigating the application of moving target defenses to network security. In: 6th International Symposium on Resilient Control Systems (ISRCS). IEEE (2013)

Operations over Linear Secret Sharing Schemes

Arkadii Slinko(B)

Department of Mathematics, University of Auckland, Auckland, New Zealand
[email protected]

Abstract. A secret sharing scheme implemented in an organisation is designed to reflect the power structure in that organisation. When two organisations merge, this usually requires a number of substantial changes and, in particular, changes to their secret sharing schemes, which have to be merged in a way that reflects the new role of each of the organisations. This paper looks at the ways a secret sharing scheme can be modified when organisational changes occur. We restrict ourselves to the class of ideal linear secret sharing schemes and describe how the matrices of these linear schemes have to be modified when we take the sum, the product or the composition of two linear access structures.

Keywords: Secret sharing scheme · Access structure · Simple game · Composition of access structures · Dual access structure

1 Introduction

Secret sharing schemes, first introduced by Shamir [15] and Blakley [3] and now widely used in many cryptographic protocols, are a tool designed for securely storing information that is highly sensitive and highly important. Such information includes encryption keys, missile launch codes, and numbered bank accounts. A secret sharing scheme stipulates giving to agents (whoever they may be) 'shares' of the secret so that only authorised coalitions of agents can calculate the secret by putting their shares together. If two organisations merge, they have to figure out how the secrets will be shared in the new organisation. As we will see, there are different ways to do this.

[Arkadii Slinko was supported by Marsden Fund grant 3706352.]

The set of all authorised coalitions of a secret sharing scheme is known as the access structure. It can also be modeled by a simple game, and in his pioneering paper Shamir [15] also suggested (independently of any literature on simple games) to model the seniority of users by assigning weights to them. However, this approach was not actively pursued; instead, [17] introduced the concept of a hierarchical access structure. Such an access structure stipulates that agents are partitioned into m levels, and a sequence of thresholds k1 < k2 < ... < km
is set, so that a coalition is authorised if it has either k1 agents of the first level, or k2 agents of the first two levels, or k3 agents of the first three levels, etc. These hierarchical structures are now called disjunctive, since only one of the m conditions must be satisfied for a coalition to be authorised. If all m conditions must be satisfied, then the hierarchical access structure is called conjunctive [19]. Moreover, if a non-authorised coalition can extract from their shares no information about the secret whatsoever, the scheme is called perfect. The most informationally efficient perfect secret sharing schemes are called ideal. For an ideal scheme, the length of each share in bits is exactly the length of the secret, and it cannot be shorter.

Linear secret sharing schemes, initially called the vector space construction [4], also take their origin in Shamir's original construction. Given a matrix H, the access structure related to it is encoded in the dependency relations of the rows of H, and the secret sharing scheme for this access structure can be straightforwardly implemented. It is important to note that linear secret sharing schemes are always ideal. Moreover, for the time being, the only known general method of proving the ideality of a secret sharing scheme is to prove its linearity. In particular, both disjunctive and conjunctive hierarchical access structures have been proven to be linear, hence ideal [4,19]. So linear access structures play an important role in the theory of secret sharing. Beimel in his survey [1] notes that linearity of the scheme is important in some applications, e.g., for secure multiparty computation (see Sect. 4 of his survey), and thus "studying linear secret sharing schemes and their limitations is important".

Operations over secret sharing schemes are important for both practical and theoretical purposes. Firstly, sometimes reorganisations take place and companies or departments get merged. This may lead to amending the access structure to the secret accordingly. Secondly, the operation of composition has become important in the description of ideal secret sharing schemes [2,6,7]. The reason is that the class of ideal secret sharing schemes is closed under the operation of composition. And, if a class is closed under compositions, then only indecomposable access structures need to be described.

In this paper we investigate under which operations the set of linear access structures is closed. We prove the closedness of this class under the operations of taking products and sums [15], under the operation of composition [2], and also under taking duals. This would hopefully simplify and streamline future proofs of linearity. At the moment each class of games, say hierarchical or tripartite, is proven to be linear in its own way. Closedness under composition allows us to obtain the following result, which seems to be new. It is known that all indecomposable ideal weighted access structures are linear [2,5,6]. It is also known [7] that an arbitrary ideal weighted access structure is a composition of indecomposable ones. Hence our result implies that every weighted ideal access structure is linear.

[Footnote 1: Here we have in mind the original definition of linearity, called the vector space construction, given by [4] and not its extension to monotone span programs [8].]

Operations over Linear Secret Sharing Schemes

515

It is also not too difficult to prove closedness of the class of linear secret sharing schemes under the operation of taking subschemes and reduced schemes (as defined for simple games in [20]). This opens an interesting possibility that we may investigate in the future: it may be possible to define linear secret sharing schemes through the set of forbidden minors as was done by [14] for the larger class of secret schemes that can be obtained as matroid ports.

2 2.1

Preliminaries Access Structure

Suppose n agents from set A = {1, 2, . . . , n} agreed to share a secret. Some coalitions (subsets) of A are authorised to know the secret such subsets form an access structure. An access structure is any subset W ⊆ 2A such that X ∈ W and X ⊆ Y, then Y ∈ W

(1)

reflecting the fact that if a smaller coalition knows the secret, then the larger one will know it too. The access structure is public knowledge and all agents know it. Due to the monotonicity requirement (1) the access structure is completely defined by its minimal authorised coalitions. We assume that every agent participates in at least one minimal authorised coalition. 2.2

Operations over Access Structures

The following operations, apart from the last one, can be found in the standard textbook [20]. The sum and product of access structures are relatively intuitive. Definition 1 (Sum of Access Structures [11]). Let U1 and U2 be disjoint sets of users and let Γ1 and Γ2 be access structures over U1 and U2 respectively. Let U = U1 ∪ U2 . Then the sum of Γ1 and Γ2 is Γ = {X ⊆ U | X ∩ U1 ∈ Γ1 or X ∩ U2 ∈ Γ2 }. Definition 2 (Product of Access Structures [11]). Let U1 and U2 be disjoint sets of users and let Γ1 and Γ2 be access structures over U1 and U2 respectively. Let U = U1 ∪ U2 . Then the product of Γ1 and Γ2 is Γ1 × Γ2 = {X ⊆ U | X ∩ U1 ∈ Γ1 and X ∩ U2 ∈ Γ2 }. The dual games play an important role in game theory [20]. They might play an important role in the theory of secret sharing schemes too. Definition 3 (Dual Access Structure [16]). Let Γ be an access structure on a set of users U . The dual access structure Γ ∗ to Γ is an access structure on the same set of players U with a coalition X being authorised in Γ ∗ if and only if its complement X c is authorised in Γ . Authorised coalitions in Γ ∗ are also called blocking coalitions of Γ , i.e., Γ ∗ = {X ⊆ U | X c ∈ Γ }.

516

A. Slinko

In 2008, Beimel et al. [2] suggested a new kind of description of ideal weighted threshold access structures, which was later used by Farras and Padro [6], in their paper characterising these structures. The description revolved around a method of composition of access structures. The idea is that any access structure is either irreducible, or can be split into a main access structure together with a secondary one, which is substituted into the main one instead of a single player. Definition 4 (Composition of Access Structures [10]). Let U1 and U2 be disjoint sets of users and let Γ1 and Γ2 be access structures over U1 and U2 respectively. Let u ∈ U1 , and set U=U1 ∪ U2 \{u}. Then the composition Γ = Γ1 ◦u Γ2 of Γ1 and Γ2 via u is Γ = {X ⊆ U | X ∩ U1 ∈ Γ1 or (X ∩ U2 ∈ Γ2 and (X ∩ U1 ) ∪ {u} ∈ Γ1 )}. The composition of access structures is natural in real life application. For instance, if in a group of people there is a highly skilled person who is going to retire and is to be replaced by others, then it may turn out that he cannot be replaced by a single person but by a group of several people. 2.3

Linear Access Structures

A linear secret sharing scheme (also called the vector space construction [4]) is defined as follows. Suppose the participants of the scheme are [n] = {1, 2, . . . , n}. Let H be an arbitrary (n + 1) × k matrix with coefficients in a finite field F with rows h0 , h1 , . . . , hn , where h0 = (1, 0 . . . , 0). Let us define the access structure as follows: ΓH = {{i1 , i2 , . . . , ik } ⊆ [n] | h0 ∈ span{hi1 , hi2 , . . . , hik }}. Given such a matrix we can define a secret sharing scheme. We choose t0 , t1 , t2 , . . . , tn randomly, calculate s0 , s1 , s2 , . . . , sn by the equation Ht = s and declare s0 to be the secret and s1 , s2 , . . . , sn the shares that would be given to users 1, 2, . . . , n, respectively. Then this is a perfect secret sharing scheme realising the access structure ΓH . We note that the target vector can be any non-zero vector. Indeed, we can always change the basis to convert our target vector into any other target vector. The most two common target vectors are (1, 0, . . . , 0) and (1, 1, . . . , 1). The Shamir’s secret sharing scheme is obviously linear. On the other hand we do not have to go far to find an access structure which cannot carry an ideal secret sharing scheme. Example 1. The scheme P4 with the set of four agents [4] = {1, 2, 3, 4} and the set of minimal authorised coalitions Wmin = {{1, 2}, {2, 3}, {3, 4}} is not ideal, hence non-linear (see, for example, [18]).

Operations over Linear Secret Sharing Schemes

517

Example 2. The access structure Jn on [n] = {1, 2, . . . , n} with the set of minimal authorised coalitions Wmin = {{1, 2}, {1, 3}, . . . , {1, n}, {2, 3, . . . , n}} is not ideal, hence non-linear. The access structures P4 and Jn are not even matroid ports [14] so they are non-linear in the strong sense.

3

Main Results

Now we will introduce a useful notation. Let H=(H1 , . . . , Hn ) be a matrix represented by its columns H1 , H2 , . . . , Hn . Let H−i denote H without its ith column, i.e., H−i = (H1 , . . . , Hi−1 , Hi+1 , . . . , Hn ). Similarly, H −i will denote H without its ith row. Lemma 1. Consider a linear secret sharing scheme defined by an (n + 1) × k matrix H given by rows h0 , h1 , . . . , hn with a target row h0 = (1, 0, . . . , 0). Let ¯ 1, . . . , h ¯ n be the rows of H−1 . Let X = {i1 , i2 , . . . , ik } be an unauthorised ¯h0 , h coalition. Suppose there are elements of the field a1 , a2 , . . . , ak ∈ F such that ¯ i + . . . + ak h ¯ i = (0, 0, . . . , 0), then also a1 hi + . . . + ak hi = (0, 0, . . . , 0). a1 h 1 1 k k Proof. Indeed suppose on the contrary that a1 hi1 +. . .+ak hik = (b, 0, . . . , 0) with b = 0. But then for ai = ai /b we have a1 hi1 + . . . + ak hik = (1, 0, . . . , 0) = h0 , which is the target vector. This contradicts to the fact that X was assumed not to be authorised. Theorem 1 (Sum of Access Structures). If two access structures Γ1 and Γ2 are linear, then their sum, Γ1 + Γ2 is linear. Proof. Let U1 and U2 be sets of users for Γ1 and Γ2 respectively, with |U1 | = n and |U2 | = m. Let Γ1 and Γ2 be linear P and Q be (n + 1) × k and (m + 1) × r matrices with rows p0 , p1 , . . . , pn and q0 , q1 , . . . , qm , where p0 and q0 be target vectors (1, 0, . . . , 0) of dimension k and r, respectively. Let us construct the following matrix for the sum Γ first. Let M be an (n + m) × (k + r + 1) matrix and set the target vector be t = (1, 0, . . . , 0)   P1 P−1 0 , M= Q1 0 Q−1 We need to show that only authorised coalitions of Γ have rows that span the target vector. Let X be an arbitrary coalition in Γ . Then X = X1 ∪ X2 where X1 ⊆ U1 , X2 ⊆ U2 . Let the corresponding sets of rows be {pi | i ∈ I} and {qj | i ∈ J}, respectively.

518

A. Slinko

IfX is authorised  in ΓM , then its respective rows can span the target vector, i.e., i∈I αi pi + j∈J βj qj = t for some scalars αi and βj . Due to the structure of M , we have   ¯ i = (0, . . . , 0) and ¯ j = (0, . . . , 0). αi p βj q i∈I

j∈J

Hence, by Lemma 1, if neither X1 nor X2 is authorised in Γ1 and Γ2 , respectively, then   αi pi = (0, . . . , 0) and βj qj = (0, . . . , 0). i∈I

so that

j∈J



αi pi +

i∈I



βj qj = (0, . . . , 0) = t.

j∈J

On the other hand, if either X1 or X2 are authorised coalitions in their respective access structures, the coalition can obviously reconstruct the target vector. Hence, we see that ΓM coincides with Γ1 + Γ2 which is then linear. Theorem 2 (Product of Access Structures). If two access structures Γ1 and Γ2 are linear then their product, Γ1 × Γ2 is linear. Proof. Let U1 and U2 be sets of users for Γ1 and Γ2 respectively, with |U1 | = n and |U2 | = m. Let Γ1 and Γ2 be linear P and Q be (n + 1) × k and (m + 1) × r matrices with rows p0 , p1 , . . . , pn and q0 , q1 , . . . , qm , where p0 and q0 be target vectors (1, 0, . . . , 0) of dimension k and r, respectively. Let us construct the following (n + m) × (k + r − 1) matrix   P P1 0 M= , 0 V1 V−1 where P1 and Q1 are the first columns of P and Q, respectively, and let us set the target vector be t=(1, 0, . . . , 0). We need to show that only authorised coalitions of Γ have rows that span the target vector. Let X be a coalition in Γ . Then X = X1 ∪ X2 , where X1 ⊆ U1 and X2 ⊆ U2 . Let the corresponding sets of rows be {pi | i ∈ I} and {qj | i ∈ J}. Suppose X is authorised in ΓM and hence its respective rows can span the target vector. Thus,   αi pi + βj qj = t. i∈I

j∈J

Then due to the structure of the matrix M , we know that X1 is an authorised coalition of Γ1 and the first k entries of the respective rows make  ai pi = (1, 0, . . . , 0). i∈I

Operations over Linear Secret Sharing Schemes

519

 We also see that j∈J βj qj = (0, 0, . . . , 0). If X2 is not authorised in Γ2 , then, by Lemma 1, we know that  βj qj = (0, 0, . . . , 0). j∈J

Since the (k + 1)th column of M is P1 which is also the first column of M , we have   t= αi pi + βj qj = (1, 0, . . . , 0, 1, 0, . . . , 0). i∈I

j∈J

This leads to a contradiction. Hence, for X to be authorised in ΓM , coalition X2 must be authorised in Γ2 and thus X is authorised in Γ1 × Γ2 . Suppose now X is authorised in Γ1 × Γ2 , i.e., both X1 and X2 are authorised in their respective access structures. Then users from X1 , using their rows, can span (1, 0, . . . , 0, 1, 0, . . . , 0) and users from X2 can span (0, . . . , 0, 1, 0, . . . , 0). Jointly they can span the target vector t. Hence, ΓM and Γ1 × Γ2 coincide. Theorem 3 (Composition of Access Structures). If two access structures Γ1 and Γ2 on sets of users U1 and U2 are linear, then the composition Γ = Γ1 ◦u Γ2 via u ∈ U1 is also linear. Proof. Let |U1 | = n and |U2 | = m be cardinalities of the sets of users for Γ1 and Γ2 , respectively. Let Γ1 and Γ2 be linear and P and Q be (n+1)×k and (m+1)×r matrices such that Γ1 = ΓP and Γ2 = ΓQ , where P has rows p0 , p1 , . . . , pn and q0 , q1 , . . . , qm , where p0 and q0 be target vectors (1, 0, . . . , 0) of dimensions k and r, respectively: ⎡ ⎡ ⎤ ⎡ ⎤ ⎤ ⎡ ⎤ 1 0 ... 0 q0 1 0 ... 0 p0 ⎢ q1 ⎥ ⎢ q11 q12 . . . q1r ⎥ ⎢ p1 ⎥ ⎢ p11 p12 . . . p1k ⎥ ⎢ ⎢ ⎥ ⎢ ⎥ ⎥ ⎢ ⎥ ⎢ ⎢ p2 ⎥ ⎢ p21 p22 . . . p2k ⎥ ⎥ ⎢ ⎥ P =⎢ ⎥=⎢ ⎥ , Q = ⎢ q2 ⎥ = ⎢ q21 q22 . . . q2r ⎥ . (2) ⎢ .. ⎥ ⎢ .. ⎢ .. ⎥ ⎢ .. .. .. .. ⎥ .. .. .. ⎥ ⎣ . ⎦ ⎣ . ⎣ . ⎦ ⎣ . . . . ⎦ . . . ⎦ pn

pn1 pn2 . . . pnk

qm

qm1 qm2 . . . qmr

Without loss of generality P and Q can be assumed to have an equal number of columns, i.e., k = r. This is because several columns of zeros of appropriate dimension could be added to any matrix without changing the access structure. Let u be the ith user in U1 , whose corresponding vector is pi = (pi1 , . . . , pik ). Let ⎡ ⎤ pi1 q11 pi2 q11 . . . pik q11 ⎢ pi1 q21 pi2 q21 . . . pik q21 ⎥ ⎢ ⎥ A = [pi1 Q1 , pi2 Q1 , . . . , pik Q1 ] = ⎢ . .. .. ⎥ . ⎣ .. . ··· . ⎦ pi1 qm1 pi2 qm1 . . . pik qm1

520

A. Slinko

with rows a1 , . . . , am . This matrix obviously has rank 1. Moreover, its row space Row(A) is spanned by pi . We build now a matrix for the composition Γ as follows. Let H be the following (n + m − 1) × (2k − 1) matrix:  −i  P 0 H= , A Q−1 and set the target vector be t = (1, 0, . . . , 0). We just need to show that only  S authorised coalitions of Γ have rows that span the target vector. Let H= , R where S consists of the (n − 1) upper rows of H and R consists of the remaining m rows of H. That is, S = (P −i 0) and R = (A Q−1 ). Let X be an authorised coalition in Γ . Then X = X1 ∪ X2 where X1 ⊆ U1 , X2 ⊆ U2 . Let the corresponding sets of rows of P and Q be {pi | i ∈ I} and qj | j ∈ J} being the corresponding rows of {qj | j ∈ J}, respectively, and {¯ Q−1 . Since X is authorised, then there exist a linear combination   αi si + βj rj = t. (3) i∈I

j∈J

Case 1. Suppose X2 is not authorised in Γ2 . From the structure of the   ¯ j = 0. which by Lemma 1 implies that j∈J βj qj = matrix we have j∈J βj q (0, 0, . . . , 0). and hence   βj aj = 0 and βj rj = (0, 0, . . . , 0). j∈J

j∈J

This implies 

αi pi = (1, 0, . . . , 0).

i∈I

and X1 is an authorised coalition of Γ1 . Case  X1 is not. Then in (3) we must  2. Suppose X2 is authorised in Γ2 and ¯ j = 0. Hence, we must have j∈J βj qj = 0 and at the same time j∈J βj q have  βj qj = (1, 0, . . . , 0), j∈J

for some scaled coefficients βj , j ∈ J. This implies   βj qj1 = 1 and βj aj = pi . j∈J

j∈J

Then (3) shows that X1 ∪ {i} ∈ Γ1 . We showed that ΓH ⊆ Γ . The reverse inclusion is straightforward. Thus we have shown that if Γ1 and Γ2 are linear then the composition of Γ1 and Γ2 via u ∈ U1 is linear.

Operations over Linear Secret Sharing Schemes

521

Martinez et al. [11] wrote: “It is not clear when the composition of structures admitting a vector space constructions admits a vector space construction”. They prove a partial result of Theorem 3 when Γ2 is a (t, n)-threshold structure. For consideration of the dual structure we will assume the target vector to be t = (1, 1, . . . , 1). The fact that the dual of a linear secret sharing scheme is also linear follows from the theory of representable matroids and have been mentioned in the iterature (see, e.g., [6]). However, sometimes it is useful to know how the original matrix of a linear secret sharing scheme is related to the matrix of its dual. Theorem 4 (The Dual of an Access Structure). Let Γ be a linear access structure on a set of users U . Then the dual access structure Γ ∗ is also linear. Proof. Let Γ = ΓM be a linear access structure on a set of users U , with |U | = n, defined by an (n + 1) × s matrix M with rows m1 , . . . , mn . Let A = {ui1 , . . . , uik } be an authorised coalition in Γ and I = {i1 , i2 , . . . , ik }. k Then for some α1 , α2 , . . . , αk ∈ F we get j=1 αj mij = t. or cA M = t, where the row vector cA has α1 , α2 , . . . , αk in positions i1 , i2 , . . . , ik and zeros in all other positions. Let now A1 , A2 , . . . , A be all minimal authorised coalitions of Γ . Let us form the matrix M ∗ = [cTA1 , . . . , cTA ]. We will illustrate this construction by an example. Suppose F = Z7 and ⎡ ⎤ 222 ⎢1 0 3⎥ ⎥ M =⎢ ⎣0 3 1⎦. 206 Then ΓM has minimal authorised coalitions A1 = {u1 }, A2 = {u2 , u3 }, A3 = {u3 , u4 }. We have cA1 = (4, 0, 0, 0), So

Let us note that

cA2 = (0, 1, 5, 0),

cA3 = (0, 0, 5, 4).



⎤ 400 ⎢0 1 0⎥ ⎥ M∗ = ⎢ ⎣0 5 5⎦. 004 ⎡

⎤ 1 ··· 1 ⎢ ⎥ (M ∗ )T M = ⎣ ... ... ... ⎦ .

(4)

1 ··· 1 Let us continue the proof of the theorem by proving that Γ ∗ = ΓM ∗ . Suppose X is not authorised in Γ ∗ . Then it is not blocking and X c contains a minimal

522

A. Slinko

authorised coalition A of Γ (which X cannot block). Then the column cA in M ∗ contains only zeros in positions of row vectors corresponding to users from X. As a result those users can never span the target vector t = (1, 1, . . . , 1) and X is not authorised in ΓM ∗ . By the contrapositive, if X is authorised in ΓM ∗ , then it is authorised in Γ ∗ . For the converse we need the following lemma. Lemma 2. If a coalition X ⊆ U is not authorised by ΓM , then for the matrix MX , consisting of rows corresponding to members of X, there exists a column vector a such that t · a = 1 and MX a = 0. Proof. Let MX be the submatrix of M consisting only of the rows corresponding / Row(MX ), so to members of X. If X in ΓM is not authorised, then t ∈   t rank = rank(MX ) + 1. MX Then MX cannot have full column rank, hence the columns of MX are linearly dependent and thus have a column vector a ∈ F s such that MX a = 0. Since the dimension of the  null-space of MX is one more that the dimension of the t null-space of , this vector a can be chosen so that t · a = 0. Scaling vector MX a we can achieve t · a = 1 and MX a = 0. Let us now continue the proof of the theorem. Suppose now that X is authorised in Γ ∗ , hence blocking in Γ , but not authorised in ΓM ∗ . By Lemma 2 there exists a column vector a such that     1 t a = . (5) ∗ 0 MX Due to (4) we have

aT (M ∗ )T M = cM = t,

where, due to (5) vector c has zeros in positions corresponding to members of the coalition X. This means that X c is authorised in Γ and X is not blocking. This contradiction proves that X is also authorised in ΓM ∗ .

4

On the Structure of Weighted Ideal Secret Sharing Schemes

The access structure Γ on the set of participants P is weighted if there exists a weight function w : P → R with non-negative values and a real number q > 0 such that for a coalition X ⊆ P  X∈Γ ⇔ w(x) ≥ q. x∈X

Operations over Linear Secret Sharing Schemes

523

Beimel-Tassa-Weinreb [2] and Farras-Padro [6] proved that every weighted ideal access structure is a composition of indecomposable ones and they described all the indecomposables proving, in particular, that they are linear. It was implicitely assumed that the composition of two weighted access structures is weighted and that the composition of two linear access structures is linear. The first statement, however, appeared to be not true which led Hameed and Slinko [7] to sharpening of the characterisation by showing that the indecomposable access structures which appear in the decomposition of a weighted ideal access structure may be only of a very particular kind. Since all indecomposable structures are linear, Theorem 3 now implies Theorem 5. Any weighted ideal access structure is linear.

5

Related Literature

Shapley [16] was the first to introduce the operation of composition for simple games in full generality, when each player of a game can be replaced with another game. Martin [10] reinvented a particular case of this construction when only one player can be replaced. This variant of Shapley’s construction appeared to be of great value for secret sharing as it was instrumental in description of weighted ideal secret sharing schemes [2,6,7]. Shapley’s construction in full generality was reinvented in [11] and used in [9]. They proved a partial case of Theorem 3. Martinez et al. [11] also define the operations of the sum and the product of access structures used in this paper. Maurer [12,13] used operations on secret sharing schemes in the context of MPC, and, especially made a heavy use of dual structures. Operations over simple games are discussed in details in Taylor and Zwicker monograph [20].

6

Conclusion

When organizations merge, the secret sharing schemes have to be merged too. We considered several ways of doing this. We prove that, if the original schemes were linear, i.e., obtained by Brickell’s vector space construction, then the merged scheme can be also chosen to be linear. In particular, we prove that the composition of two linear secret sharing schemes is linear. We give an explicit expression for the respective matrices that generate the merged schemes.

References 1. Beimel, A.: Secret-sharing schemes: a survey. In: Chee, Y.M., Guo, Z., Ling, S., Shao, F., Tang, Y., Wang, H., Xing, C. (eds.) IWCC 2011. LNCS, vol. 6639, pp. 11–46. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20901-7 2 2. Beimel, A., Tassa, T., Weinreb, E.: Characterizing ideal weighted thresholdsecret sharing. SIAM J. Discret. Math. 22, 360–397 (2008)

524

A. Slinko

3. Blakley, G.R.: Safeguarding cryptographic keys. In: Proceedings of the National Computer Conference, vol. 48, p 313 (1979) 4. Brickell, E.F.: A survey of hardware implementations of RSA. In: Brassard, G. (ed.) CRYPTO 1989. LNCS, vol. 435, pp. 368–370. Springer, New York (1990). https://doi.org/10.1007/0-387-34805-0 34 5. Farr` as, O., Mart´ı-Farr´e, J., Padr´ o, C.: Ideal multipartite secret sharing schemes. J. Cryptol. 25, 434–463 (2012) 6. Farr` as, O., Padr´ o, C.: Ideal hierarchical secret sharing schemes. In: Micciancio, D. (ed.) TCC 2010. LNCS, vol. 5978, pp. 219–236. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-11799-2 14 7. Hameed, A., Slinko, A.: A characterisation of ideal weighted secret sharing schemes. J. Math. Cryptol. 9, 227–244 (2015) 8. Karchmer, M., Wigderson, A.: On span programs. In: Proceedings of the Eighth Annual Structure in Complexity Theory Conference, pp. 102–111. IEEE (1993) 9. M´ arquez-Corbella, I., Mart´ınez-Moro, E., Su´ arez-Canedo, E.: On the composition of secret sharing schemes related to codes. Discret. Math. Algorithms Appl. 6, 1450013 (2014) 10. Martin, K.: New secret sharing schemes from old. J. Comb. Math. Comb. Comput. 14, 65–77 (1993) 11. Mart´ınez-Moro, E., Mozo-Fern´ andez, J., Munuera, C.: Compounding secret sharing schemes. Australas. J. Comb. 30, 277–290 (2004) 12. Maurer, U.: Secure multi-party computation made simple. Discret. Appl. Math. 154, 370–381 (2006) 13. Nikov, V., Nikova, S., Preneel, B.: Multi-party computation from any linear secret sharing scheme unconditionally secure against adaptive adversary: the zero-error case. In: Zhou, J., Yung, M., Han, Y. (eds.) ACNS 2003. LNCS, vol. 2846, pp. 1–15. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45203-4 1 14. Seymour, P.: A forbidden minor characterization of matroid ports. Q. J. Math. 27, 407–413 (1976) 15. Shamir, A.: How to share a secret. Commun. ACM 22, 612–613 (1979) 16. Shapley, L.S.: Simple games: an outline of the descriptive theory. Behav. Sci. 7, 59–66 (1962) 17. Simmons, G.J.: How to (Really) share a secret. In: Goldwasser, S. (ed.) CRYPTO 1988. LNCS, vol. 403, pp. 390–448. Springer, New York (1990). https://doi.org/ 10.1007/0-387-34799-2 30 18. Stinson, D.R.: An explication of secret sharing schemes. Des. Codes Crypt. 2, 357–390 (1992) 19. Tassa, T.: Hierarchical threshold secret sharing. J. Cryptol. 20, 237–264 (2007) 20. Taylor, A., Zwicker, W.: Simple Games. Princeton University Press, Princeton (1999)

Cyber Camouflage Games for Strategic Deception Omkar Thakoor1(B) , Milind Tambe1 , Phebe Vayanos1 , Haifeng Xu2 , Christopher Kiekintveld3 , and Fei Fang4 1

University of Southern California, Los Angeles, CA 90007, USA {othakoor,tambe,phebe.vayanos}@usc.edu 2 University of Virginia, Charlottesville, VA 22904, USA [email protected] 3 University of Texas at El Paso, El Paso, TX 79968, USA [email protected] 4 Carnegie Mellon University, Pittsburgh, PA 15213, USA [email protected]

Abstract. The rapid increase in cybercrime, causing a reported annual economic loss of $600 billion (Lewis 2018), has prompted a critical need for effective cyber defense. Strategic criminals conduct network reconnaissance prior to executing attacks to avoid detection and establish situational awareness via scanning and fingerprinting tools. Cyber deception attempts to foil these reconnaissance efforts by camouflaging network and system attributes to disguise valuable information. Game-theoretic models can identify decisions about strategically deceiving attackers, subject to domain constraints. For effectively deploying an optimal deceptive strategy, modeling the objectives and the abilities of the attackers, is a key challenge. To address this challenge, we present Cyber Camouflage Games (CCG), a general-sum game model that captures attackers which can be diversely equipped and motivated. We show that computing the optimal defender strategy is NP-hard even in the special case of unconstrained CCGs, and present an efficient approximate solution for it. We further provide an MILP formulation accelerated with cut-augmentation for the general constrained problem. Finally, we provide experimental evidence that our solution methods are efficient and effective. Keywords: Game theory

1

· Cyber deception · Optimization

Introduction

The ubiquity of Internet connectivity has spurred a significant increase in cybercrime. Major cyber attacks such as recent data breaches at Equifax (Gutzmer 2007), Yahoo (Goel and Perlroth 2016), as well as government agencies like the Office of Personnel Management (Peterson 2015) are often executed by adept attackers conducting reconnaissance as the first stage for an effective cyber attack (Mandiant 2013; Joyce 2016). Rather than attempting “brute c Springer Nature Switzerland AG 2019  T. Alpcan et al. (Eds.): GameSec 2019, LNCS 11836, pp. 525–541, 2019. https://doi.org/10.1007/978-3-030-32430-8_31

526

O. Thakoor et al.

Fig. 1. The attacker scans virtual machines on the test-bed and the configuration observed via NMAP can be set to differ from the true configuration that NMAP would show without our deployed deception.

force” exploits, scanning tools such as NMap (Lyon 2009), xProbe2 (Arkin and Yarochkin 2003) and fingerprinting techniques such as sinFP (Auffret 2010) are used to identify vulnerabilities and develop specific plans to infiltrate the network while minimizing the risk of detection. To mitigate the reconnaissance abilities of attackers, deception techniques aim to disguise valuable network information to create uncertainty. This can lead attackers to spend more time in reconnaissance activities (increasing the chances of detection), or to attempt infiltration tactics that are less effective. Examples of such techniques include the use of honeypots or decoys (FergusonWalter et al. 2017), real systems using deceptive defenses (De Gaspari et al. 2016), and obfuscated responses to fingerprinting (Berrueta 2003; Rahman et al. 2013). Canary (Thinkst 2015) is an example of a deception-based tool in commercial use, while CyberVAN (Chadha et al. 2016) is a test-bed for simulating various deception algorithms. Figure 1 shows a demonstration of our model on CyberVAN. There are two general factors to consider when deploying cyber deception techniques. First, strategic use of deception is vital due to the significant costs and feasibility constraints that must be considered; e.g., deception via counter-fingerprinting techniques like HoneyD, OSfuscate, and IPMorph typically degrades performance (Rahman et al. 2013). In most deception methods, we must also consider the costs of deploying, and maintaining deceptive strategies which may include both computational resources and developer time. Second, optimizing the effectiveness of deception depends on modeling the preferences and capabilities of the attacker. The attacker’s goals can greatly vary—they may exactly conflict the defender’s, or they could be partially orthogonal. For instance, an economically motivated attacker may find utility primarily in financial records whereas the defender may consider losing national security data as a more critical loss. In many cases, the preferences may be strongly governed by the available exploits. Despite this diversity in real-world adversaries, previous game theoretic models for cyber deception assume zero-sum

Cyber Camouflage Games for Strategic Deception

527

payoffs, implying directly conflicting attacker motives. Hence, to eliminate this fundamental limitation, this paper considers general-sum payoffs in this setting. Furthermore, for situations where there may be uncertainty in the defender’s knowledge of attacker’s payoffs, this work serves as a vital stepping stone to model such uncertainty (more details on related work in Sect. 1.1). The main contributions of this paper are as follows. First, we present Cyber Camouflage Games—a general-sum game model that presents completely distinct computational challenges and insights relative to the previous zero-sum models. Second, we prove that computing an optimal solution is NP-hard even for unconstrained CCG and present a Fully Polynomial Time Approximation Scheme (FPTAS) for this case. Third, for CCG with constraints, we present an MILP formulation to find an optimal solution, harnessing polytopal strategy space for compactness and boosted with cut augmentation. Finally, we experimentally evaluate our algorithms and show substantial improvement in scalability and robustness. 1.1

Related Work

The Cyber Deception Game (CDG) (Schlenker et al. 2018) is a game-theoretic deception model limited to zero-sum settings. It cannot model diversely modeled attackers and only focuses on the challenge of deception being costly and partly infeasible but fails to present the strategic challenge which exists regardless of top-end deception methods, which our model highlights. Since a zero-sum model also implicitly implies perfectly known payoffs which may not always be possible, eliminating this fundamental limitation by considering general-sum payoffs as we do, can allow for a more holistic model in the future that considers uncertainty when making decisions, as many security game models previously have (Kiekintveld et al. 2011, 2013; Nguyen et al. 2014). Other works in cyber defense (Alpcan and Ba¸sar 2010; Laszka et al. 2015; Serra et al. 2015; Schlenker et al. 2017) have adopted game theoretic models, including several that aim to strategically deploy honeypots (Pıbil et al. 2012; Durkota et al. 2015). However, these do not consider camouflaging the network as in our model. De Gaspari et al. (2016) provide a realistic systems architecture for active defense using the same types of deception abilities we consider, but they do not address how to strategically optimize these tactics under practical constraints. Several use moving target defense that mitigate attacker reconnaissance by using movement to adapt and randomize the attack surface of a network or system (Albanese et al. 2014; MacFarland and Shue 2015; Albanese et al. 2016; Achleitner et al. 2016), but this work typically does not model nor optimize against a strategic adversary. Despite being a Stackelberg game for a security domain, CCGs have a very distinct structure in comparison with Stackelberg Security Games (SSG) (Tambe 2011), since the core defensive action of “masking” differs from “defending” targets in SSG in several ways. First, security resources are limited in SSGs, while in CCG every target can be masked. Second, covering a target in SSGs directly improves its security, whereas in CCGs, the effectiveness of masking depends on

528

O. Thakoor et al.

how other machines are masked to alter the attacker’s information state. Finally, SSGs typically focus on mixed strategies and the Strong Stackelberg Equilibria (SSE), whereas CCGs are restricted to pure strategies and therefore need the Weak Stackelberg Equilibrium (WSE) concept. Pita et al. (2010) present a robust approach for sub-optimal attackers that can be adapted for WSE computation in normal-form Stackelberg games, but cannot be directly applied to CCGs due to the exponential strategy space.

2

Cyber Camouflage Games

We refer to a network administrator as the “defender” and a cybercriminal as the “attacker”. CCGs have the components explained as follows. Network Configurations. A network consists of a set of machines indexed in K := {1, . . . , |K|}. Each machine has a true configuration (TC) which is modeled as a tuple of attributes such as [OS Linux, Webserver TomCat 8]. The TC should be a complete description of the security relevant features of the machine, so machines with the same TC are considered identical. Let I index the set of TCs present in the network. The true state of the network (TSN) is defined by a vector n = (ni )i∈I where each ni denotes the number of machines with TC i. Using deception techniques, the defender can disguise each machine by obfuscating some of its attributes. We say the defender “masks” each machine with an observed configuration (OC); J denotes the set of all possible OCs. An OC similarly captures the set of observed attributes, e.g., [OS Linux, Webserver Nginx 1.8], and is assumed to be a complete representation of the information observed by the attacker so machines with the same OC are indistinguishable to the attacker. This framework can directly capture deception via obfuscation of system attributes, and is also applicable to other deception methods such as honeypots by including a “honeypot” as a TC, and the configurations it mimics as OCs. Deception Strategies. The defender’s strategy can be encoded as an integer matrix Φ, where Φij denotes the number of machines with TC i, masked with OC j. The observed state of the network (OSN) is a vector that, unlike the TSN, is a function of the  strategy Φ. We denote an OSN as m(Φ) := (mj (Φ))j∈J , where mj (Φ) = i Φij denotes the number of machines masked by OC j for strategy Φ. Strategy Feasibility and Costs. Achieving deception is often costly and not arbitrarily feasible. We represent feasibility constraints using a (0, 1)-matrix Π, where Πij = 1 if TC i can be masked with OC j. Further, Ji denotes the set of OCs that can mask TC i, i.e., Ji := {j ∈ J | Πij = 1}. Next, we assume that masking a TC i with an OC j, has a cost of cij incurred by the defender—this is relevant only if Πij = 1, and denotes the combined costs from deployment, maintenance, degraded functionality, etc. The defender can afford the total cost of masking up to a budget B. Let F denote the set of strategies that are feasible and affordable—described with linear constraints:

Cyber Camouflage Games for Strategic Deception

529

    Φij ∈ Z≥0 , Φij ≤ Πij ni ∀(i, j) ∈ I × J ,     F = Φ  j∈J Φij = ni ∀i ∈ I, i∈I j∈J Φij cij ≤ B The first and the third constraints follow from the definition of Φ and n resp. The second inequality imposes the feasibility constraints, and the fourth, the budget constraint. Defender and Attacker Valuations. If a machine with TC i is attacked, the attacker gets a utility via —his valuation of TC i. Collectively, these are represented as a vector v a . Analogously, we define valuations v d for the defender; a higher vid reflects a smaller loss when TC i is compromised. Remark 1. It is natural to set v a as positive values and v d as negative ones, however, the problem remains equivalent (as explained momentarily) if the defender valuations (or, independently, the attacker valuations as well) are simultaneously scaled by a positive constant or shifted by a constant, so we do not specify positivity of any values. Game Model. We model the interaction as a Stackelberg game to capture the sequence of decisions between the players. The defender is the leader who knows the TSN n and can deploy a deception strategy Φ. The attacker observes the OSN and chooses an OC to attack. Since the attacker cannot distinguish machines with the same OC, this is interpreted as an attack on a randomly selected machine with this OC. We assume that the defender can only play a pure strategy since it is usually not possible to change the network frequently, making the attacker’s view of the network static. We assume the attacker perfectly knows the defender’s strategy Φ to compute best response, as in CDG (Schlenker et al. 2018), which is justified via insider information leakage or other means of surveillance. When the defender plays a strategy Φ, her expected utility when OC j is attacked (with mj (Φ) > 0), is given by ud (Φ, j) = E[vid |Φ, j] =

 i∈Ij

P(i|Φ, j)vid =

 Φij vd . mj (Φ) i i∈I

Here, E[·] denotes the expectation operator, and P, the probability of TC of the attacked machine, conditioned on its OC j and the defender Φ.  strategy Φ a v Similarly, the attacker’s expected utility in this case is ua (Φ, j) = i∈I mj ij (Φ) i . These utility expressions justify Remark 1. An illustrative example of CCGs is as follows. CCG Example: Consider a CCG with 6 machines, 4 TCs and 3 OCs. Let the TSN be n = (2, 2, 1, 1). Let the valuations be v d = (8, 2, 7, 11) and v a = (7, 2, 5, 11). Let J1 = {1}, J2 = {2}, J3 = {1, 3} and J4 = {2, 3}. Let the costs be c31 = 5, and cij = 1 for all other feasible (i, j) pairs, and let the budget B = 7. Thus, machines with TC 1 and 2 have only 1 choice of OC to mask due to feasibility constraint. Masking TC 3 with OC 1 at cost 5 is too expensive,

530

O. Thakoor et al.

since masking the remaining machines costs at least 3. Thus, due to the budget constraint, TC 3 has OC 3 as the unique choice. Thus, the defender’s strategy space is ⎧ ⎡ ⎡ ⎤ ⎤⎫ 200 200 ⎪ ⎪ ⎪ ⎪ ⎨ ⎢0 2 0⎥ ⎢ ⎥⎬ ⎥ , Φ = ⎢ 0 2 0 ⎥ F = Φ=⎢ ⎣0 0 1⎦ ⎣ 0 0 1 ⎦⎪ ⎪ ⎪ ⎪ ⎩ ⎭ 010 001 If the defender plays Φ, attacker’s best response is to attack OC 1, yielding expected utilities ua (Φ, 1) = 7, and ud (Φ, 1) = 8 for the attacker and the defender, resp. Optimization Problem. Having defined the game model, we now discuss the solution approach. Previous work on general-sum Stackelberg games has typically used Strong Stackelberg equilibria (SSE). This assumes that whenever the follower has multiple best responses, he breaks ties in favor of the leader (i.e., maximizing her payoff), which the defender can induce using mixed strategies. The defender cannot always induce a specific response in a CCG since he is restricted to pure strategies (Guo et al. 2018). Therefore, we consider the robust assumption that the attacker breaks ties against the defender. This worst-case tiebreaking leads to Weak Stackelberg Equilibria (WSE) (Breton et al. 1988). A drawback of WSE is that it may not exist (von Stengel and Zamir 2004). However, it has been shown to exist when the defender can play a finite set of pure strategies as in CCG. We therefore adopt WSE and assume that the attacker chooses a best response to the defender strategy Φ, minimizing the defender utility in case of a tie. Thus, the defender’s utility is umin (Φ) as defined by the following optimization problem (OP): min ud (Φ, j) | ua (Φ, j) ≥ ua (Φ, j  ) ∀j  ∈ J . j

(1)

Hence, the defender needs to choose argmaxΦ umin (Φ). We first study the game without any feasibility or budget constraints. This unconstrained CCG underlines the inherent challenge of strategic deception even when sophisticated techniques are available for arbitrarily masking TCs with any OCs at low costs. Remark 2. Note that, setting all entries of Π to 1 makes all strategies feasible, ni max cij makes all feasible and setting the budget and costs so that B ≥ strategies affordable.

3

i

j∈Ji

Unconstrained CCGs

In this setting, masking any TC with any OC is possible, and every feasible strategy has total cost within budget. First, we prove that

Cyber Camouflage Games for Strategic Deception

531

Theorem 1. Computing the optimal defender strategy in unconstrained CCGs is NP-hard. Proof. We prove the NP-hardness via a reduction from the subset sum problem, denoted as SubsetSum, which is a well-known NP-complete problem. Given a set S of integers, SubsetSum is the decision problem to determine whether there is a non-empty subset of S whose sum is zero. An instance of SubsetSum is specified by a set of N integers {x1 , . . . , xN } = S (w.l.o.g., assume xn = 0 for any n and n∈[N ] xn = 0 since otherwise the problem is trivial). Given such an instance of SubsetSum, we  construct the following unconstrained general-sum CDG. First, let w = − n∈[N ] xn ∈ Z, so that,  i xi + w = 0. Let δ1 , δ2 ∈ (0, 1) be small constants s.t. 1 − δ2 ≥ δ1 ≥ (N + 1)δ2 . We construct a CDG with (N + 2) machines each with a different TC. Thus, K = I = {1, . . . , N, N + 1, N + 2}, and the TSN n is the all-one vector. There are two OCs, i.e., J = {1, 2}. The valuations for the attacker and the defender are defined as v a = (x1 , . . . , xN , w, δ2 ) and v d = (−x1 , . . . , −xN , −w + δ1 , δ2 ), respectively. We remark that, the defender’s and attacker’s valuations are only non-zero-sum on the last two TCs. This completely defines an unconstrained CDG instance. 2 We claim that the defender can achieve utility strictly greater than δN1 +δ +2 in the constructed instance if and only if the SubsetSum instance is a YES instance. As a result, any algorithm for computing the optimal defender utility for general-sum CDGs can be transferred, in polynomial time, to an algorithm for SubsetSum. This implies the NP-hardness of solving unconstrained CDGs. We first show that if the SubsetSum is a YES instance, then the defender δ1 +δ2 there exists a can achieve a utility strictly greater  than N +2 . By assumption,  non-empty set S ⊂ S such that xn ∈S  xn = 0. Let N  = |S  | > 0. Consider the strategy that masks all TCs in I  = {i | xi ∈ S  } to OC 1, and masks TCs in I \ I  to OC 2. By construction, the attacker will have expected utility 0 on OC 1 but a strictly positive utility on OC 2. As a result, the attacker will attack δ1 +δ2 1 +δ2 OC 2, resulting in expected defender utility Nδ+2−N  > N +2 . As a result, the δ1 +δ2 optimal defender utility must be strictly greater than N +2 . Next, we show that if the SubsetSum is a NO instance, then the optimal 2 defender utility is at most δN1 +δ +2 . We consider the following particular masking strategies: 1. If all the  TCs are masked by one OC, then the defender will achieve expected − n xn −w+δ1 +δ2 2 utility = δN1 +δ N +2 +2 2. If machine N + 2 is masked by (say) OC 1 and all other machines are masked by OC 2, then the attacker has a better utlity in attacking OC 1, resulting 2 in defender utility δ2 ≤ δN1 +δ +2 by construction. 3. Otherwise, any other solution to the CDG instance corresponds to a partition of I into two non-empty sets, denoted as I1 , I2 , which correspond to TCs masked as OC 1, 2 respectively. Moreover, w.l.o.g., assume N + 2 ∈ I1 and |I1 | > 1. Then I2 is a strict subset of I \ {N + 2} which all have integer attacker values. Since the SubsetSum is a NO instance, we know that the

532

O. Thakoor et al.

total attacker values in I2 cannot sum up to 0. If they sum up to a positive integer (thus at least 1), then two properties hold: 1. The total attacker value in I1 is at most −1 + δ2 < 0; 2. The total defender value in I2 is at most −1+δ1 < 0. The first property implies that the attacker will attack OC 2 and the second property implies that the defender will get strictly negative utility. Similarly, if the total attacker values in I2 sum up to a negative integer, the attacker will attack OC 1, still resulting in a negative defender utility. To sum 2 up, in this case, the optimal defender utility is at most δN1 +δ +2 . This concludes the proof. Thus, this result is in sharp contrast to unconstrained CDGs where masking all the machines with the same OC is an optimal strategy and thus the computation has constant-time complexity. We now show that despite the NP-hardness, the problem admits a Fully Polynomial Time Approximation Scheme (FPTAS). To that end, we first need the following proposition. Proposition 1. Unconstrained CCGs always have an optimal defender strategy where at most 2 OCs mask all the machines. Proof. Let OC 1, 2 be feasible for all the TCs. Let Φ be any optimal strategy which yields defender utility u. Let J  = argmaxj ua (Φ, j) denote the set of ∗ ∗ attacker’s best response  OCs. Then consider a∗ strategy Φ as follows: Φi1 =  ∗   Φ , Φ = Φ ∀ i ∈ I, and Φ = 0 for all other OC j =  1, 2. i2 ij j  ∈J  ij j  ∈J /  ij Then, Φ∗ induces the attacker to attack OC 1, resulting in defender utility at least u (as the defender utility for every OC in J  is at least u). Thus, Φ∗ is optimal and uses at most 2 OCs to mask all the machines. Next, assume, w.l.o.g., that vid ∈ [0, 1] ∀i (as the problem is equivalent if the valuations are shifted, or, simultaneously scaled by a positive constant). Then, we show an FPTAS: Theorem 2. For any  > 0, there is a O(n3 /) time algorithm that computes a deception strategy with defender utility at most  less than the optimal. Proof. We use dynamic programming (DP) to compute an approximate solution. To start, we first discretize the defender valuations by rounding them down to the closest multiples of . Let integer vi = vid / , so that vi  is the defender valuation rounded down. Note that vi ∈ [0, 1/]. By Proposition 4.4, we can w.l.o.g. focus on strategies using the 2 OCs to mask all the machines. We design the strategy such that OC 1 is the attacker’s best response. Our idea is to compute a 3-dimensional table A, where A[i, k, l] denotes the maximum attacker valuation sum for attacking OC 1, over all the strategies in which OC 1 masks exactly k machines, all from among the first i machines with the defender valuation sum being l for OC 1. By definition, A[i, k, l] satisfies the following recurrence relation: A[i, k, l] = max{A[i − 1, k − 1, l − vi ] + via , A[i − 1, k, l]}

Cyber Camouflage Games for Strategic Deception

533

which follows from considering the two options for machine i—whether to mask it with OC 1 or not. The base cases are A[0, 0, 0] = 0 and A[0, k, l] = −∞ if either k or l are non-zero. After computing table A, we are ready to compute the optimal defender strategy w.r.t. the rounded defender payoffs. In particular, the maximum defender utility of our strategy is the maximum value of l such that ∃k > 0 with attacker’s utility for attacking OC 1, i.e., A[n, k, l]/k, being more than that  of attacking OC 2, or equivalently, more than the average attacker valuation i via ni /|n|. Such a table entry can be found by enumerating A[n, k, l] for different k and l. It is easy to see that the DP computes an optimal defender strategy for defender payoff v. To prove that this strategy is an additive  approximation to the original problem with defender payoffs v d , let U d (u, Φ) denote the defender utility when using defender valuations u and strategy Φ. Let Φ∗ denote the optimal strategy to the original problem and Φˆ be the strategy output by our algorithm. We have ˆ ≥ U d (v, Φ) ˆ ≥ U d (v, Φ∗ ) ≥ U d (v d , Φ∗ ) −  U d (v d , Φ) where the first and third inequalities are due to the rounding down of v d to ˆ This v entry-wise by , and the second inequality follows by optimality of Φ. concludes the proof.

4 4.1

Constrained CCGs Optimal Defender Strategy MILP Formulation

Our goal is to compute the WSE, i.e., to compute max umin (Φ). As umin (Φ) is Φ∈F

given by OP (1), computing WSE is a bilevel OP which cannot ordinarily be reduced to a single-level Mixed Integer Linear Program (MILP) (Sinha et al. 2018). In particular, the single-level reduction has been shown for SSE computation, since the attacker’s tiebreaking aligns with the defender’s objective. However, this does not apply to WSE due to the worst-case tiebreaking. Hence, we first formulate an OP which considers -optimal responses for the attacker (for a small constant ) and assume he selects the one with the least defender utility. This OP (referred to as GS-MIP) is: max

γ

s.t.

α, γ ∈ R, Φ ∈ F, q ∈ {0, 1}|J | q1 + . . . + q|J | ≥ 1 (1 − qj ) ≤ α − ua (Φ, j)

∀j ∈ J

(2a) (2b)

M (1 − qj ) ≥ α − ua (Φ, j)

∀j ∈ J

(2c)

γ ≤ u (Φ, j) + M (1 − qj )

∀j ∈ J

(2d)

qj ≤ mj (Φ)

∀j ∈ J .

(2e)

Φ,q ,γ,α

(2)

d

534

O. Thakoor et al.

The maximization objective γ gives the defender’s optimal utility. The binary variables qj indicate if attacking OC j is an -optimal attacker strategy, of which there is at least one an possibly more, as specified by (2a). (2b) and (2c) make α the optimal attacker utility, and enforce qj = 1 for all the -optimal strategies for the attacker (using a big-M constant). (2e) ensures that only the OCs which actually mask a machine are considered as valid attacker responses. Finally, (2d) captures the worst-case tiebreaking by requiring that γ is least of the utilities the defender can get from a possible -optimal attacker response.1 In reality, an optimal attacker corresponds to having  = 0 by definition. Nevertheless, setting  > 0 is necessary to enforce the worst-case tiebreaking as explained above for constraint (2b); setting  = 0 can be shown to lead to an SSE solution, and not WSE. Despite this challenge, since the number of targets are finite, there must be an  such that only the optimal strategies are -optimal. Then, for such small enough , (2b) would enforce that the attacker can choose from precisely the set of optimal strategies. Hence, we conclude that, Proposition 2. ∃ > 0 s.t. OP (2) computes max umin (Φ). Φ∈F

Remark 3.  should be set to a value that ensures that the second-best attacker utility is at least epsilon less than optimal. It suffices to set it to L/k 2 where L is the bit precision of the valuations and k = |K| is the number of machines. Other works considering -optimal responses include (Tijs 1981) which computes -optimal Nash equilibria and (Pita et al. 2010) which considers robust optimization against boundedly rational opponents in Bayesian normal-form Stackelberg games. Despite similarities, in particular, that of a Stackelberg setting in the latter, their solution methods do not apply here due to key differences in the CCG model, viz., non-Bayesian setting, perfect rationality, restriction to pure strategies, and most importantly, compact input representation via polytopal strategy space (Jiang et al. 2017). This makes it nonviable to enumerate strategies like normal-form games. However, the utility functions ud and ua are linear fractionals, i.e., ratios of expressions that are linear in Φ. This property allows for an MILP formulation of GS-MIP despite the structural complexity of CCGs, as follows. We use an alternate representation of the defender’s strategy with a |K| × |J | machine k is masked with OC j. Then, we (0, 1)-matrix Θ, where Θkj = 1 iff  Θkj , and the player utilities as, can write the OSN m as mj (Θ) = k∈K 

u (Θ, j) = d



i∈I k∈Ki

Θkj vid

mj (Θ)



; u (Θ, j) = a



i∈I k∈Ki

Θkj via

mj (Θ)

,

where, Ki is the set of machines with TC i. Substituting these expressions in constraints of GS-MIP, e.g., say (2b), and multiplying the equation to get rid of fractional expressions yields, 1

The additional constant M can be simply replaced by maxi,i |via − via | and maxi,i |vid − vid | resp. in the 3rd , 4th constraints.

Cyber Camouflage Games for Strategic Deception

(1 − qj )



Θkj ≤ α

k



Θkj −



Θkj via

535

∀j ∈ J .

i∈I k∈Ki

k

The constraint above and the ones obtained by similarly transforming (2c), (2d), (2e), contain bilinear terms that are products of a binary variable with a binary or continuous variable, which can be linearized using standard techniques. The complete resultant MILP can be found in the appendix. 4.2

Cuts to Speed up the MILP Formulation

Symmetry Breaking. Using the alternate representation Θ exponentially blows up the feasibility region of the MILP, since each strategy Φ has many equivalent Θ representations due to machines having the same TC being identical. For instance, a strategy Φ which masks the ni machines having TC i with a different OC each, results in ni ! equivalent Θ representations corresponding to the different permutations of machine assignment to OCs. To break this symmetry, we add constraints to require that the assignment of machines to OCs is lexicographically sorted within a TC, i.e., for machines k, k  with the same TC, masked with different OCs ˆj and j  resp., we must have k < k  ⇔ ˆj < j  . A linear constraint captures this for machines k, k :   jΘkj ≤ jΘk j . (3) j∈J

j∈J

Proposition 3. For any strategy Θ, and for machines k, k  , masked with OCs ˆj and j  as per Θ, (3) ⇔ ˆj < j  . Further, adding constraints (3) to GS-MIP preserves at least one optimal solution while eliminating all of the symmetric solutions.  Proof. For any strategy Θ, for any machine k, we must have Θkj = 1. j∈J

Suppose machines k, k  are masked with different OCs ˆj and j  . Thus, Θk = eˆj , and Θk = ej  (where ej denotes the unit vector with 1 in the j th coordinate   and 0 elsewhere). Hence, jΘkj = ˆj and jΘk j = j  . Thus, it follows that j∈J

 j∈J

jΘkj
0 is the attacker’s penalty when intercepted by the defender, which is the increment of the defender’s utility. The attacker’s reward is −R(s, d , a), i.e., the game is a zero-sum game and we assume that players’ rewards are discounted over time with the discount factor γ < 1. We assume perfect recall, hence both players remember their respective history. A history of the defender is formed by actions he played and observations he received, i.e., (d , o)t and a history of the attacker is (s, d , a, o)t . Therefore, the spaces of the histories of players are (AD , OD )t and (S, AD , AA , O)t for the defender and the attacker, respectively. The strategies σD , σA of players map each of their histories to a distribution over actions. For any given pair of strategies σD , σA  ∈ Σ = ΣD , ΣA , we use uD (σD , σA ) to denote the expected utility of the defender when players follow the strategies σD , σA , respectively, i.e., ∞ uD (σD , σA ) = γ t E[R(s, d , a)]. (4) t=1

A best response of player θ ∈ {D, A} to the opponent’s strategy σ−θ is the strategy σθBR ∈ BR(σ−θ ) where uθ (σθBR , σ−θ ) ≥ uθ (σθ , σ−θ ) for any other σθ ∈ Σθ . We use the Stackelberg equilibrium as our solution concept. A strategy profile σ = σD , σA  forms a Stackelberg equilibrium if i) σA ∈ BR(σD ) and ii)     , σA ) where σA ∈ BR(σD ). With the zero-sum assumption, UD (σD , σA ) ≥ UD (σD the Stackelberg equilibrium can be computed by the minimax formulation maxσD ∈ΣD minσA ∈ΣA uD (σD , σA ).

4

(5)

Applying PG-HSVI to DPOS3G

We present an overview of applying PG-HSVI proposed in [12] to DPOS3G in this section and describe our RITA algorithm in the next section, where RITA 2

Alternatively, we can assume that the target’s value is reduced to vi0 when the attack takes the collecting action at target i, which can also be solved by our algorithm.

When Players Affect Target Values

549

adopts PG-HSVI as a subroutine in building a strategy. The PG-HSVI algorithm approximates the optimal value function of the infinite horizon DPOS3G by considering value-functions of the game with a restricted horizon. The value function of a defender’s strategy σD is defined as VσD : Δ(S) → R which means given the initial action d0 and initial belief b0 , the function returns the expected utility of the defender when following the strategy σD and the optimal value function of the game is V ∗ (b0 ) = supσD VσD (b0 ). Each iteration of PG-HSVI assumes players will play their Stackelberg equilibrium strategies with subsequent utilities (defined below) given by value functions of previous iterations. Value Backup. PG-HSVI performs a Value Backup operation, which is denoted by H, at the belief b to improve the approximation of the value function, which corresponds to solve a stage game, i.e., game with one round, denoted as [HV ](b) where V is the approximated value function obtained by previous iterations. The defender’s and the attacker’s strategies for a stage game are denoted as πD ∈ Δ(AD ) and πA : S → Δ(AA ), respectively. Note that the attacker’s strategy specifies his strategy at each state in S because we assume that the attacker can observe the state explicitly. The utilities of [HV ](b) depend on the immediate reward R and the discounted value of the subsequent game, represented by V . The immediate reward of the defender is    = b(s)πD (d)πA (s, a)R(s, d, a). (6) Rπimm D ,πA s∈S

d∈AD

a∈AA

As the defender knows her action d and the observation o, she can derive the belief of the subsequent game by bπd,o (s ) = A

  1 Ts,d,a (o, s )b(s)πA (a). s∈S a∈AA P r(o|d, πA )

(7)

And the subsequent reward of the game is the expected utility over all actionobservation pairs d, o of the defender for a game starting with a belief bπd,o , A i.e.,   (V ) = πD (d)P r(o|d, πA )V (bπd,o (s )). (8) Rπsuq D ,πA A d∈AD

o∈O

And the Stackelberg equilibrium of [HV ](b) can be computed by

minπA maxπD Rπimm + γRπsuq (V ) . D ,πA D ,πA

(9)

Point-Based Update. PG-HSVI performs point-based updates by sampling the belief space to approximate the value function V ∗ . The lower bound of the value function V is represented by a set of α-vectors Γ and the corresponding upper bound is represented as a lower envelope of a set of points Υ . The algorithm is depicted in Algorithm 1. The lower bound of the value function V (and Γ ) is initialized by the value of the uniform strategy of the defender and the upper bound V (and Υ ) is initialized by solving a perfect information counterpart of the game. In every iteration, some belief point are sampled by the forward search

550

X. Wang et al.

Algorithm 1. PG-HSVI Input: G = S, AD , AA , OD , T, R, γ, initial action d0 , initial belief b0 and desired precision  Output: approximated value function Vˆ Initialize Vˆ = {V , V }; while V (b0 ) − V (b0 ) >  do Explore(b0 , , 0);

1 2 3 4 5 6 7 8 9 10 11 12 13

procedure Explore(b, , t) πA ← optimal strategy of the attacker in [HV ](b); πD ← optimal strategy of the defender in [HV ](b);   d, o ∈ arg max πD (d) · P r(o|d, πA ) · excess(bd,o πA , t + 1) ; if excess (bd,o πA , t + 1) > 0 then Explore (bd,o πA , , t + 1); Γ ← Γ ∪ LΓ (b); Υ ← Υ ∪ U Υ (b);

heuristic (Line 9 in Algorithm 1) which selects the action-observation pair d, o , t + 1) where πD is the such that maximizing πD (d) · P r(o|d, πA ) · excess(bπd,o A defender’s strategy computed by [HV ](b) (Line 8 in Algorithm 1). The excess is defined as

excess(b, t) = V (b) − V (b) − ρ(t) (10) t where ρ(t) = γ −t − i=1 R · γ −i and R is selected to ensure the termination of the exploration. The forward exploration will terminate if the criteria is satisfied (Line 10 in Algorithm 1). After the termination of the forward exploration, PGHSVI performs updates of V and V by adding an α-vector LΓ (b) into Γ and a point U Υ (b) into Υ , respectively3 . For more details, please refer to [12].

5

RITA: Algorithm to Improve Scalability

PG-HSVI can be used for solving DPOS3G, however, the scalability of PG-HSVI is limited in this case due to an exponential number of states and transitions (we show the scalability in Sect. 6). Table 1 shows the number of states and transitions for the number of targets and resources. For example, when N = 8, K = 2 and Mi = 3, ∀i ∈ [N ], the game has 20,412 states and more than 9,700,000 transitions. PG-HSVI algorithm is not able to handle such large-scale games due to memory requirements (similarly to the original HSVI algorithm for POMDPs).

3

We remove the fixing process of Lipschitz continuity of V to speed up the algorithm. The experiments show that the algorithm converges efficiently, though without the theoretical guarantee of the termination.

When Players Affect Target Values

551

Table 1. Numbers of sates and transitions of DPOS3G. For simplicity, all targets have the same number of possible values, i.e., Mi = M, ∀i. Num. of states Num. of transitions K (N, K) CN · M N −K

K 2 (2N + 1) · (CN ) · M N −K

To this end, we propose RITA, displayed in Algorithm 2, which builds a Reduced game by considering key states and the associated transitions to reduce the game size, Iteratively adding the defender’s actions into consideration and Transferring the α-vectors in the game solved in the previous iteration to the next iteration to improve the initialization of the lower bound. As discussed below, PG-HSVI is adopted as a subroutine of RITA. The general procedure of RITA is as follows. RITA first builds a reduced game which includes all states in Sd0 and only key states in Sd , d = d0 (Line 3 of Algorithm 2 and more details are in Sect. 5.1). Next, instead of considering all defender’s actions, RITA first solves the game with an initial subset of defender’s actions (Line 4 of Algorithm 2) and incrementally adds the defender’s actions (Line 14 of Algorithm 2) to build the incremental game (Line 7 of Algorithm 2 and more details are in Sect. 5.2). RITA is terminated when the increment of the defender’s utility of one iteration is less than some threshold (Line 9 of Algorithm 2). RITA is guaranteed to give a lower bound of the optimal defender’s utility of the original game, providing an efficient way to evaluate the solution quality as shown in Sect. 6.

Algorithm 2. RITA 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Input: G, d0 , b0 , , the minimum incremental gap η Output: approximated value function Vˆ G =buildReducedGame(G, d0 ); Initialize actions Ad D ⊆ AD for each Sd (Algorithm 3); V = −∞, Γ = ∅; while true do G =buildIncrementalGame(G , {Ad D }, Γ ) (Algorithm 4); Vˆ ← PG-HSVI(G ,d0 ,b0 , ); if V (b0 ) − V < η then break ; else V = V (b0 ); α-VectorTransfer(Γ, G ); for d ∈ AD do d d ←actionToAdd(G ), Ad D = AD ∪ {d};

552

5.1

X. Wang et al.

Building the Reduced Game

In this section, we propose a flexible reduction scheme to reduce the number of states and transitions in DPOS3G and the game obtained is named as the reduced game G (Line 3 in Algorithm 2). The motivation of this scheme is that the states in Sd , d = d0 are of less importance, i.e., bring less utility to the defender, compared with the states in Sd0 due to the discount factor γ. Therefore, RITA keeps all states in the Sd0 and only a limited number of key states in Sd , d = d0 . Then, RITA rebuilds the transitions in the reduced game. Selecting Key States. To ensure that there always is a state for a transition to transit to, RITA has to keep the worst states in Sd , d = d0 in the reduced game. The worst state in Sd is defined as s# = d, v ∈ Sd , d = d0 such that vi = vi0 if di = 1 and vi = vˆi otherwise, i.e., the unprotected targets of d are with the highest values. Figure 3 presents the most conservative case of the reduced game for the example in Fig. 1 where only the worst state is kept in each Sd , d = d0 . By including the worst state (together with the definition of transitions), it is guaranteed that we obtain a lower bound to the true value of the game. To improve the bound, additional states can be added to the reduced game without any other requirements. We denote the reachable state set in G as S  d .

0, 1; 2, 1, 0, 1; 3, 1, 0, 1; 7, 1

0, 1

1, 0

1, 0

0, 1

0, 1; 7, 1

0, 1

Reduce

1, 0

0, 1

1, 0; 2, 1, 1, 0; 2, 2, 1, 0; 2, 4

1, 0; 2, 1, 1, 0; 2, 2, 1, 0; 2, 4

Original Game

Reduced Game

1, 0

Fig. 3. The reduction of the example in Fig. 1 by keeping the worst states in Sd , d = d0 . The element on the arrow represents the defender’s action. The three values of the two targets are {2, 3, 7} and {1, 2, 4}, respectively. Suppose the defender’s initial action is 1, 0 and Sd0 is depicted by the blue (lower) node where all states are kept in the reduced game. The green (upper) node is the reachable state set of 0, 1. By building the reduced game, we only keep the worst state, i.e., 0, 1; 7, 1, in the green node. All transitions to states in the green node at the original game will transit to the unique state, i.e., 0, 1; 7, 1 at the reduced game. (Color figure online)

Rebuilding the Transitions. Reducing the number of states in G can invalidate transitions of the original game G that lead to these states. Therefore, we need to replace these transitions by defining new transition function. To this

When Players Affect Target Values

end, we introduce a norm to measure the distance from s to s :    i∈[N ] vi − vi , if vi ≥ vi , ∀ i ∈ [N ], d(s, s ) = +∞, otherwise.

553

(11)

This norm ensures that if all targets’ values in s are higher than the values in s, the distance is finite and otherwise the distance is positive infinity. Note that this norm is asymmetric, i.e., the distance from s to s is not equal to the distance from s to s. For the rebuilding of the transitions, suppose that s ∈ S  d is the current state and d , a are the actions of both player. If d = d0 , then the successive state s in the reduced game is the same as the state in the original game, i.e., Ts,d ,a (o, s ) = 1. On the other hand, if d = d0 , the successive state will be s ∈ arg mins∈S  d {d(s , s)} where s ∈ Sd is the successive state in the original game. The basic idea of rebuilding the transitions is that we will use the state in s ∈ S  d which is closest to the successive state in the original game as the successive state in the reduced game. Note that as we always keep the worst states in Sd , d = d0 , there always exists a states with a finite distance from s . For s ∈ Sd , d ∈ AD , when players take d , a, the reward for the defender in the reduced game is the same as the reward in the original game, i.e., R(s, d , a). After specifying the states, transitions and rewards in the reduced game, RITA applies PG-HSVI to solve the reduced game. Proposition 1 proves that RITA can give a lower bound of the original by solving the reduced game. We note that by keeping more key states in Sd , d = d0 , RITA gets a larger reduced game and with a better lower bound to the original game and if all states in Sd , d = d0 are kept in Sd , the reduced is equivalent to the original game. The number of states and transitions of the reduced game is shown in Table 2. Particularly, the number of states and transitions in the most conservative reduced game with N = 8, K = 2 and M = 3 is 756 and 347,004, respectively. Table 2. Numbers of sates and transitions of reduced games of DPOS3G where Q is the number of states in Sd , d = d0 . Num. of states K (N, K, Q) Q · (CN − 1) + M N −K

Num. of transitions K (2N + 1)CN · K [Q · (CN − 1) + M N −K ]

Proposition 1. The lower bound obtained by solving the reduced game is also a lower bound of the original game. Proof. It can be observed from Eq. (11) that when players take d , a at s ∈ Sd in the reduced game G , the successive state s in the reduced game is never better than the successive state s in the original game, i.e., vi ≥ vi , ∀i ∈ [N ]. Additionally, the rewards in the reduced game are equal to the rewards in the original game. Thus, the optimal defender’s utility in the reduced game is never

554

X. Wang et al.

better than the optimal defender’s utility in the original game. Therefore, the lower bound obtained by solving the reduced game is a lower bound of the original game. In essence, by building the reduced game, the states in the reachable state set Sd , d = d0 are abstracted into one or more partitions where each partition is represented by a key state and all transitions to states in partitions will transit to the key states. We emphasize that this reduction scheme is quite flexible which allows the trade-off between the scalability and the solution quality by building different sizes of the reduced games. For the application of our model to the real world, RITA can obtain the defender’s strategy for the first round by solving the reduced game because all states in Sd0 are kept in the reduced game. 5.2

Incrementally Adding Defender’s Actions

Although the reduced game can significantly reduce the number of states and transitions, the reduced game is still very large. It is because both defender and the attacker can take all possible actions where the number of the defender’s K , which grows exponentially against the number of targets. A wildly actions is CN adopted remedy is strategy generation [9,13,18], which starts by solving a smaller game and incrementally adds actions into consideration. Therefore, instead of considering all defender’s actions, RITA selects a subset of the actions initially (Line 4 in Algorithm 2) and incrementally find the defender’s actions (Line 14 in Algorithm 2) to add into the incremental game (Line 7 in Algorithm 2). The reduced game where only a subset of defender’s actions for each reachable state set is considered is named as incremental game.

Algorithm 3. Defender Initial Action selection 1 2 3 4 5 6 7 8 9 10

Input: the defender previous action d, the targets’ possible values Mi , ∀i ∈ [N ], the number of initial actions N um Output: Ad D for i ∈ [N ] do if di = 1 then v˜i = vi0 ;   else v˜i = v∈Mi eβv v/ v∈Mi eβv ; r = mini {˜ vi }/ maxi {˜ vi }; for 1 : N um do  d = arg maxd∈AD \Ad { i di · v˜i }; D

d Ad D = AD ∪ {d}; vi = vi (1 − di + r · di ), ∀i ∈ [N ];

For the defender’s action d and its reachable state set Sd , RITA generates the defender’s initial subset of the action using Algorithm 3. As for each reachable

When Players Affect Target Values

555

state set Sd , if a target is not protected by d, i.e., di = 0, the target can take ˜ are computed by Mi values. Therefore the importance values of targets v 0 vi , if di = 1;  v˜i =  (12) βv βv v∈Mi e v/ v∈Mi e , if di = 0. where β is the parameter such that if β → 0, the importance value is the average value of the different values the target can be and if β → +∞, the importance value is the maximum value of the difference that target can be (Lines 3–5 in Algorithm 3). Then, RITA selects the defender’s initial actions for each reachable state set to ensure that (i) every target can be protected (i.e., for every target there exists at least one action that protects this target) and (ii) there are more actions to protect important targets. To implement this idea, as displayed in Lines 7–10 in Algorithm 3, RITA iteratively selects the action which protect the targets with the highest importance values. After each selection of the action, the targets being protected will multiply a factor r to decrease the importance values (Line 10 in Algorithm 3), which ensures that no target is covered by more than one action until all targets are covered by some action. The factor r is the ratio of the smallest values and the largest importance values of targets (Line 6 in Algorithm 3). After generating Ad D for each Sd , RITA generates the game by only keeping the transitions from each state by actions in Ad D.

Algorithm 4. Build incremental game 1 2 3 4 5 6 7

Input: the reduced game G , {Ad D }, the α-vector set Γ Output: the incremental game G for Ts,d ,a (o, s ) ∈ G do if d ∈ Ad D then G = G ∪ {Ts,d ,a (o, s )}; for LΓ (b) ∈ Γ do G = G ∪ {LΓ (b)};

For each iteration, RITA needs to select the actions to be added into Ad D (Line 14 in Algorithm 2). As our game is with infinite horizon, it is difficult to find the optimal defender’s action for each reachable state set. Therefore, instead of finding the optimal action, for each reachable state set, RITA adds the defender’s action which is the best-response to the attacker’s strategy computed in Line 7 in Algorithm 1. Note that if a reachable state set is never visited by PG-HSVI during the forward explorations, RITA will not add the new action into the action set. After adding the action into the set Ad D for each Sd , RITA includes the transitions associated with these actions into the incremental game (Lines 3– 5 in Algorithm 4). RITA terminates when the increment of the lower bound of the defender’s utility of an iteration is less than η (Line 9 in Algorithm 2),

556

X. Wang et al.

which means adding the actions cannot increase the defender’s utility efficiently. Proposition 2 states that solving the incremental game gives a lower bound of the reduced game. Proposition 2. The lower bound obtained by solving the incremental game is also a lower bound of the reduced game. Proof. The incremental game has the same states with the reduced game but only consider a subset of the defender’s actions in each reachable state set. Therefore, the optimal defender’s strategy obtained by solving the incremental game can also be implemented in the reduced game, i.e., we can obtain a lower bound by solving the reduced game which is at least as good as the lower bound obtained by solving the incremental game. 5.3

Transferring α-vectors

The convergence of PG-HSVI depends on the initialization of both lower and upper bounds. As RITA always adds actions into consideration during the iterations, the α-vectors computed in the j-th iteration can provide a better initialization of the lower bound for the incremental game in the (j + 1)-th iteration. The correctness of this transferring is proved in Proposition 3. Specifically, after solving an incremental game G , RITA extracts the α-vectors into Γ . Transferring α-vectors is displayed in Lines 6–7 in Algorithm 4. It is noteworthy that one cannot easily transfer the representation of upper bounds between incremental games because the upper bound from j-th iteration is not guaranteed to be an upper bound for the game in the (j + 1)-th iteration. Proposition 3. The α-vectors transferred to the incremental game provides a lower bound of the incremental game. Proof. As we always add the defender’s actions into the reachable state set in the incremental game, the forward explorations implemented in the smaller incremental games can also be implemented in the larger one. Therefore, the transferred α-vectors provide a lower bound of the incremental game. 5.4

Early Terminating the Forward Explorations

To further speed up the algorithm and reduce the space used by the algorithm, RITA adopts early termination of the forward explorations. This is motivated by the fact that as RITA only chooses a subset of the defender’s actions at the incremental game, the lower bounds would be worse than the case which considers all defender actions. It may take even more iterations to satisfy the termination criteria V (b0 ) − V (b0 ) > , where most of the iterations only update the upper bound and the lower bound dose not change much. Therefore, RITA will terminate the forward exploration when the number of t is larger than Tˆ. We show that this technique can reduce the process memory of RITA in the experiment section (Sect. 6).

When Players Affect Target Values

6

557

Experimental Evaluation

We now present experimental evaluations of RITA and demonstrate the benefits of key aspects of our novel algorithm. The experiments are performed on randomly generated instances of DPOS3G. The initial value vi0 of each target is randomly sampled and re-scaled to the range of [c, c + 3], where c is the penalty to the attacker, which is 2 for all experiments in this section. The range would influence the defender’s utility, as well as the convergence time and the space needed due to the forward explorations. The cap value of each target is vˆi = (vi0 )3 and the function f (·) is (vi0 )k where k is the number of preparation actions worked on this target, i.e., Mi = {vi0 , (vi0 )2 , vˆi }. We choose the number of initial actions of the defender in each reachable state set as N/K + 1 where N and K are the numbers of targets and resources, respectively. The minimal increment of an iteration of RITA is η = 2 and with a smaller η, RITA can improve the solution quality with loss of the scalability. All computations are performed on a 64-bit PC with 8.0 GB RAM and a 2.3 GHz CPU and all data points are averaged over 30 instances. We build the two variants of the reduced game to demonstrate the trade-off between the scalability and the solution quality: (i) the most conservative game (Reduced# ) and (ii) the reduced game (Reduced## ) which both keeps the worst state and the second worst state in Sd , d = d0 . The second worst state is defined as s## = d, v ∈ Sd , d = d0 such that vi = vi if di = 1 and vi = vˆi otherwise, where f (vi , a(i, +)) = vˆi , i.e., the unprotected target by d is with the second highest value of the target. The six variants of algorithms are tested are (1) PG-HSVI for original game (PG-HSVI), (2) PG-HSVI for the reduced game with both the worst and the second worst states in Sd , d = d0 (PG-HSVI+Reduced## ), (3) PG-HSVI for the reduced game with the worst states in Sd , d = d0 (PG-HSVI+Reduced# ), (4) iteratively solving the incremental games for the original game (Iterative+Original), (5) iteratively solving the incremental games for the reduced game with both the worst and the second worst states in Sd , d = d0 (RITA## ) and (6) iteratively solving the incremental games for the reduced game with the worst states in Sd , d = d0 (RITA# ). The six variants illustrate the influence of different techniques, i.e., building reduced games and incrementally adding defender’s actions, on the scalability and the solution quality. 6.1

Scalability

We first investigate the scalability of PG-HSVI and the two variants of RITA. The results about the runtime and the maximum process memory are displayed in Fig. 4 with different values of γ. The results show that RITA can be significantly faster than PG-HSVI, especially when the number of states are large. Specifically, with a cap of 1800 s, RITA can approximate the game with more than 20000 states and PG-HSVI can only solve the game with less than 2000 states. It can be observed that when γ is larger, both PG-HSVI and RITA need more time and space to solve the game because it may take more forward

558

X. Wang et al.

explorations to meet the termination criteria. Additionally, RITA## takes more runtime and process memory than RITA# because RITA## has more states and transitions. Another observation is that when the game is small, RITA is not necessarily faster than PG-HSVI because it may take several iterations for RITA before termination, which makes RITA take more time than PG-HSVI. This is consistent with the performance of strategy-generation algorithms (e.g., double oracle). 4

10

6

10

5

Process Memory (MB)

Runtime (ms)

10

4

10

3

10

PG-HSVI ## RITA # RITA

2

10

1

10

1

10

2

10

3

10

4

2

10

1

10

PG-HSVI ## RITA # RITA

0

10

5

10

3

10

1

10

10

2

10

3

10

4

5

10

10

Number of states

Number of states

(a) Runtime (γ = 0.60).

(b) Memory (γ = 0.60). 4

10

6

10

5

Process Memory (MB)

Runtime (ms)

10

4

10

3

10

PG-HSVI ## RITA # RITA

2

10

1

10

1

10

2

10

3

10

4

10

Number of states

(c) Runtime (γ = 0.75).

3

10

2

10

1

10

PG-HSVI ## RITA # RITA

0

5

10

10

1

10

2

10

3

10

4

10

5

10

Number of states

(d) Memory (γ = 0.75).

Fig. 4. The comparison of runtime and the space for PG-HSVI and RITA. Both axes are plotted in log scale.

We then investigate the efficiency of the two main components of RITA, the reduced game and the iterative method. The results are shown in Fig. 5 with γ = 0.6 for all six variants. The results indicate that building reduced game is more efficient to speed up the algorithm than the iterative method, as well as to reduce the process memory. Because by building reduced games, we ignore most transitions even before solving the game rather than by solving a smaller game and then incrementally increasing the size.

When Players Affect Target Values

559

4

10

6

10

5

Process Memory (MB)

Runtime (ms)

10

4

10

3

PG-HSVI ## PG-HSVI+Reduced # PG-HSVI+Reduced Iterative+Original ## RITA # RITA

10

2

10

1

10

1

2

10

3

10

4

10

2

10

PG-HSVI ## PG-HSVI+Reduced # PG-HSVI+Reduced Iterative+Original ## RITA # RITA

1

10

0

10

5

10

3

10

1

2

10

10

3

10

4

10

5

10

10

Number of states

Number of states

(a) Runtime.

(b) Process memory.

Fig. 5. The comparison of runtime and space for different variants with γ = 0.6. Table 3. Comparisons of runtime and process memory for RITA# . |S| = 20, 412, γ = 0.75

Runtime Memory

Without trans., without early term 1783.59 s 9.9G Without trans, with early term

1062.64 s 5.6G

With trans., with early term

829.56 s

Number of states (γ=0.60) 54

270

Number of states (γ=0.75) 1215

9

0

0

-20

-20

Lower bound

Lower bound

9

-40 -60 -80 -100

PG-HSVI ## RITA # RITA Initial

(a) γ = 0.60.

5.6G

54

270

1215

-40 -60 -80 -100

PG-HSVI ## RITA # RITA Initial

(b) γ = 0.75.

Fig. 6. Solution quality where y-axis is the lower bounds of the defender’s utility.

We then investigate the two techniques used in RITA, (i) transferring αvectors and (ii) early termination. We test the cases which are (i) without transferring α-vectors and without early termination, (ii) without transferring α-vectors and with early termination and (iii) with transferring α-vectors and with early termination. The results are displayed in Table 3 where both methods can reduce the runtime of the algorithms and the early termination can also reduce the process memory by avoiding unnecessary forward explorations.

560

6.2

X. Wang et al.

Solution Quality

Next we investigate the solution quality of the different methods. We use the initial lower bound of PG-HSVI, which is the defender’s utility for the uniform strategy as the baseline and compare the lower bounds returned by PG-HSVI, RITA## and RITA# . Note that PG-HSVI computes the optimal defender’s utility. The results are displayed in Fig. 6 with γ = 0.6 and 0.75, respectively. The results illustrate RITA’s ability to trade-off between the solution quality and the scalability. With dramatically improving the scalability, both variants of RITA lose the solution quality compared with PG-HSVI. However, we show that with increasing the runtime and the process memory, RITA## achieves a better solution quality than RITA# ; RITA provides the flexibility to maintain more states in the reduced game if the solution quality is more important. The advantage of RITA## over RITA# is increased when the number of states in the game increases and the value of γ increases. Note that even RITA# is still far better than the defender’s utility of the uniform strategy. Another observation is that when γ is small, RITA can give a better approximation to the optimal defender’s utility because that the states in Sd , d = d0 are of less importance.

7

Conclusion

In this work, we propose a novel defender-sided partially observable Stochastic Stackelberg security game (DPOS3G) where the targets’ values are affected by players’ actions and the defender can only partially observe the game. To solve the game, we propose RITA based on PG-HSVI and with three key novelties: (a) building a reduced game with only key states; (b) incrementally adding defender’s actions to further reduce the number of transitions of the game; (c) providing novel heuristics for lower bound initialization. Finally, extensive experimental evaluations show that RITA significantly outperform the PG-HSVI on scalability and allowing for trade off in scalability and solution quality. Acknowledgements. This work was supported by Microsoft AI for Earth, NSF grant CCF-1522054, the Czech Science Foundation (no. 19-24384Y), National Research Foundation of Singapore (no. NCR2016NCR-NCR001-0002) and NAP.

References 1. Basilico, N., Gatti, N., Amigoni, F.: Leader-follower strategies for robotic patrolling in environments with arbitrary topologies. In: AAMAS, pp. 57–64 (2009) 2. Blum, A., Haghtalab, N., Procaccia, A.D.: Learning optimal commitment to overcome insecurity. In: NIPS, pp. 1826–1834 (2014) ´ Rosas, K., Navarrete, H., Ord´ 3. Bucarey, V., Casorr´ an, C., Figueroa, O., on ˜ez, F.: Building real stackelberg security games for border patrols. In: Rass, S., An, B., Kiekintveld, C., Fang, F., Schauer, S. (eds.) GameSec 2017. LNCS, vol. 10575, pp. 193–212. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68711-7 11

When Players Affect Target Values

561

4. Chung, T.H., Hollinger, G.A., Isler, V.: Search and pursuit-evasion in mobile robotics. Auton. Robots 31(4), 299–316 (2011) 5. Fang, F., Jiang, A.X., Tambe, M.: Protecting moving targets with multiple mobile resources. JAIR 48, 583–634 (2013) 6. Fang, F., et al.: Deploying PAWS: field optimization of the protection assistant for wildlife security. In: AAAI, pp. 3966–3973 (2016) 7. Gan, J., An, B., Vorobeychik, Y., Gauch, B.: Security games on a plane. In: AAAI, pp. 530–536 (2017) 8. Gan, J., Elkind, E., Wooldridge, M.: Stackelberg security games with multiple uncoordinated defenders. In: AAMAS, pp. 703–711 (2018) 9. Halvorson, E., Conitzer, V., Parr, R.: Multi-step multi-sensor hider-seeker games. In: IJCAI, pp. 159–166 (2009) 10. Haskell, W.B., Kar, D., Fang, F., Tambe, M., Cheung, S., Denicola, E.: Robust protection of fisheries with compass. In: AAAI, pp. 2978–2983 (2014) 11. Hor´ ak, K., Boˇsansk´ y, B.: A point-based approximate algorithm for one-sided partially observable pursuit-evasion games. In: Zhu, Q., Alpcan, T., Panaousis, E., Tambe, M., Casey, W. (eds.) GameSec 2016. LNCS, vol. 9996, pp. 435–454. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-47413-7 25 12. Hor´ ak, K., Bosansk´ y, B., Pechoucek, M.: Heuristic search value iteration for onesided partially observable stochastic games. In: AAAI, pp. 558–564 (2017) 13. Jain, M., Korzhyk, D., Vanˇek, O., Conitzer, V., Pˇechouˇcek, M., Tambe, M.: A double oracle algorithm for zero-sum security games on graphs. In: AAMAS, pp. 327–334 (2011) 14. Johnson, M.P., Fang, F., Tambe, M.: Patrol strategies to maximize pristine forest area. In: AAAI, pp. 295–301 (2012) 15. Kar, D., et al.: Cloudy with a chance of poaching: adversary behavior modeling and forecasting with real-world poaching data. In: AAMAS, pp. 159–167 (2017) 16. Letchford, J., Conitzer, V., Munagala, K.: Learning and approximating the optimal strategy to commit to. In: Mavronicolas, M., Papadopoulou, V.G. (eds.) SAGT 2009. LNCS, vol. 5814, pp. 250–262. Springer, Heidelberg (2009). https://doi.org/ 10.1007/978-3-642-04645-2 23 17. Marecki, J., Tesauro, G., Segal, R.: Playing repeated Stackelberg games with unknown opponents. In: AAMAS, pp. 821–828 (2012) 18. McMahan, H.B., Gordon, G.J., Blum, A.: Planning in the presence of cost functions controlled by an adversary. In: ICML, pp. 536–543 (2003) 19. Paruchuri, P., Pearce, J.P., Marecki, J., Tambe, M., Ordonez, F., Kraus, S.: Playing games for security: an efficient exact algorithm for solving Bayesian Stackelberg games. In: AAMAS, pp. 895–902 (2008) 20. Pita, J., et al.: Using game theory for Los Angeles airport security. AI Mag. 30(1), 43 (2009) 21. Shieh, E., et al.: PROTECT: a deployed game theoretic system to protect the ports of the united states. In: AAMAS, pp. 13–20 (2012) 22. Tambe, M.: Security and Game Theory: Algorithms, Deployed Systems, Lessons Learned. Cambridge University Press, Cambridge (2011) 23. Tsai, J., Yin, Z., Kwak, J., Kempe, D., Kiekintveld, C., Tambe, M.: Urban security: game-theoretic resource allocation in networked physical domains. In: AAAI, pp. 881–886 (2010) 24. Varakantham, P., Lau, H.C., Yuan, Z.: Scalable randomized patrolling for securing rapid transit networks. In: IAAI, pp. 1563–1568 (2013)

562

X. Wang et al.

25. Vidal, R., Shakernia, O., Kim, H.J., Shim, D.H., Sastry, S.: Probabilistic pursuitevasion games: theory, implementation, and experimental evaluation. IEEE Trans. Robot. Autom. 18(5), 662–669 (2002) 26. Vorobeychik, Y., An, B., Tambe, M., Singh, S.P.: Computing solutions in infinitehorizon discounted adversarial patrolling games. In: ICAPS, pp. 314–322 (2014) 27. Yin, Y., An, B., Jain, M.: Game-theoretic resource allocation for protecting large public events. In: AAAI, pp. 826–834 (2014)

Perfectly Secure Message Transmission Against Independent Rational Adversaries Kenji Yasunaga1(B) 1

2

and Takeshi Koshiba2

Graduate School of Information Science and Technology, Osaka University, Osaka, Japan [email protected] Faculty of Education and Integrated Arts and Sciences, Waseda University, Tokyo, Japan [email protected]

Abstract. Secure Message Transmission (SMT) is a two-party protocol by which the sender can privately transmit a message to the receiver through multiple channels. An adversary can corrupt a subset of channels and makes eavesdropping and tampering over the corrupted channels. Fujita et al. (GameSec 2018) introduced a game-theoretic security notion of SMT, and showed protocols that are secure even if an adversary corrupts all but one of the channels, which is impossible in the standard cryptographic setting. In this work, we study a game-theoretic setting in which all the channels are corrupted by two or more independent adversaries. Specifically, we assume that there are several adversaries who exclusively corrupt subsets of the channels, and prefer to violate the security of SMT with being undetected. Additionally, we assume that each adversary prefers other adversaries’ tampering to be detected. We show that secure SMT protocols can be constructed even if all the channels are corrupted by such rational adversaries. We also study the situation in which both malicious and rational adversaries exist. Keywords: Cryptography · Secure message transmission theory · Rational adversary

1

· Game

Introduction

Cryptography in the traditional sense provides the confidentiality of messages between two parties, Alice and Bob. Symmetric-key cryptography requires to share the key before communication, and the key agreement is still a problem to be resolved. Asymmetric-key cryptography (a.k.a. public-key cryptography) is free from the key agreement problem but must rely on some unproven computational hardness of mathematical problems. In the standard setting, we implicitly assume that there is a single channel between Alice and Bob. Since computer networks nowadays are like a web, we may assume that several channels are c Springer Nature Switzerland AG 2019  T. Alpcan et al. (Eds.): GameSec 2019, LNCS 11836, pp. 563–582, 2019. https://doi.org/10.1007/978-3-030-32430-8_33

564

K. Yasunaga and T. Koshiba

available for communication between Alice and Bob. Secure message transmission (SMT) is a scheme for the communication between Alice and Bob in the environment in which several channels are available. SMT is a two-party cryptographic protocol with n channels by which a sender Alice securely and reliably sends messages to a receiver Bob. SMT also assumes the existence of the adversary who can corrupt t channels out of the n channels. The adversary can eavesdrop messages from the corrupted channels and alter them. We consider privacy and reliability as properties of SMT against the adversaries. The privacy means that the adversary can obtain no information on the messages Alice sends to Bob. The reliability means that a message Bob receives coincides with the message Alice sends. An SMT protocol is said to be perfect if the protocol satisfies both properties in the perfect sense. An SMT protocol is said to be almost-reliable if the protocol satisfies the perfect privacy and allows transmission errors of small probability. The notion of SMT was originally proposed by Dolev, Dwork, Waarts, and Yung [9]. They showed that any 1-round (i.e., non-interactive) perfect SMT must satisfy that t < n/3, and any perfect SMT of at least two rounds must satisfy that t < n/2. Since then, the efficiency of perfect SMT has been improved in the literature [3,25,28,32]. The most efficient 2-round perfect SMT was given by Spini and Z´emor [31]. In the case of almost-reliable SMT, the situation is different from the case of perfect SMT. Franklin and Wright [10] showed an almost-reliable SMT against t < n corruptions by using a public channel in addition to the usual channels. Later, Garay and Ostrovsky [15] and Shi et al. [30] gave the most round-efficient almost-reliable SMT protocols using public channels. In the standard cryptographic setting, adversaries are assumed to be semihonest or malicious. Semi-honest adversaries follow the protocol but try to extract secret information during the protocol execution. Malicious ones deviate from the protocol either to obtain secret information or to obstruct the protocol execution. Especially, malicious adversaries would do anything regardless of their risks. However, some adversaries realistically take their risks into account and rationally behave forward the other participants in the protocol. To incorporate the notion of “rationality” into cryptography, we employ game-theoretic ideas. Halpern and Teague [20] firstly investigated the power and the limitation of rational participants in secret sharing. Since then, rational secret sharing has been investigated in the literature [1,6,11,24]. Besides secret sharing, rational settings have been employed in other cryptographic protocols such as leader election [2,16], agreement protocols [18,21], public-key cryptography [34,35], twoparty computation [5,17], delegated computation [7,19,23], and protocol design [13,14]. In particular, we can overcome the “impossibility barrier” in some cases [4,12,18] by considering that the adversaries rationally behave. Fujita, Yasunaga, and Koshiba [12] studied a game-theoretic security model for SMT. They introduced rational “timid” adversaries who prefer to violate the security requirement of SMT but do not prefer the tampering actions to be detected. They showed that even if the adversary corrupts all but one of

Perfectly SMT Against Independent Rational Adversaries

565

the channels, it is possible to construct perfect SMT protocols against rational timid adversaries. In the standard cryptographic setting, perfect SMT can be constructed only when the adversary corrupts a minority of the channels. This demonstrates a way of circumventing the impossibility results of cryptographic protocols based on a game-theoretic approach. In this paper, we further investigate the game-theoretic security of SMT. In [12], the simplest game-theoretic setting (i.e., 1-player game) was employed. In the 1-player game, the player’s behavior is determined by the strategy of the largest expected utility. In this paper, we consider the case of games for two or more players (i.e., adversaries). We study a game-theoretic setting in which all the channels may be corrupted by two or more independent rational timid adversaries. More specifically, we assume that there are more than one adversaries who exclusively corrupt subsets of the channels, and prefer to violate the security of SMT with being undetected. Additionally, we assume that each adversary prefers other adversaries’ tampering to be detected. Note that if a single adversary corrupts all the channels, we cannot hope for the security of SMT. We show that secure SMT protocols can be constructed even if all the channels are corrupted by such independent rational adversaries. One protocol uses a public channel, and the others do not. – We show that Shi et al.’s almost-reliable SMT protocol (after a minor adaptation) in [30], which uses a public channel, works as a perfect SMT against multiple independent rational adversaries. We assume that there are λ ≥ 2 adversaries, and adversaries i ∈ {1, . . . , λ} exclusively corrupt ti ≥ 1 channels such that t1 + · · · + tλ ≤ n. Since we employ a Nash equilibrium as a solution concept, the result is not surprising. Nash equilibrium requires that no deviation increases the utility, assuming that the other adversaries follow the prescribed strategy. Since the security against a single adversary corrupting n − 1 channels is provided in [12], a similar argument can be applied in our setting, though slightly different utility functions should be considered. – To construct perfect SMT protocols without public channel, we employ the idea of cheater-identifiable secret sharing (CISS), where every player who submits a forged share in the reconstruction phase can be identified. Intuitively, in the setting of rational SMT, timid adversaries will not tamper with shares because the tampering action will be detected with high probability, but the message can be recovered by using other shares. We construct a non-interactive SMT protocol based on the idea of CISS due to Hayashi and Koshiba [22]. Technically, our construction employs pairwise independent (a.k.a. strongly universal) hash functions as hash functions. Since the security requirements of CISS are not sufficient for proving the security of rational SMT, we provide the security analysis of our protocol, not for general CISS-based SMT protocols. – The limitation of CISS is that the number of forged shares should be a minority. Namely, the above construction only works for adversaries who corrupt at most (n − 1)/2 channels. We show that a slight modification of the CISS-

566

K. Yasunaga and T. Koshiba

based protocol gives a perfect SMT protocol against strictly timid adversaries even if one of them may corrupt a majority of the channels. Adversaries are said to be strictly timid if they prefer being tampering undetected to violating the reliability. A similar idea was used in the previous work of [12], where robust secret sharing is employed for the protocol against a strictly timid adversary. Since we consider independent adversaries who prefer other adversaries to be detected, CISS is suitable in this setting. – Finally, we consider the setting in which a malicious adversary exists as well as rational adversaries. Namely, there are several adversaries, all but one behave rationally, but one behaves maliciously. We believe this setting is preferable because the assumption that all of the adversaries are rational may not be realistic. Mixing of malicious and rational adversaries was studied in the context of rational secret sharing [1,24]. We show that a modification of the CISS-based protocol achieves a non-interactive perfect SMT protocol against such adversaries. The protocol is secure as long as a malicious adversary corrupts t∗ ≤ (n − 1)/3 channels, and each rational adversary corrupts at most min{(n − 1)/2 − t∗ , (n − 1)/3} channels. We clarify the differences from the previous work of [12]. In [12], there is only one adversary who corrupts at most n − 1 channels. This setting can be seen as one in which there are two independent adversaries A1 and A2 . While A1 tries to violate the security of the SMT protocol by corrupting at most t ≤ n − 1 channels, the other adversary A2 , who corrupt n − t ≥ 1 channels, does nothing for the protocol. Thus, the setting of [12] can be seen as a weaker setting of independent adversaries. In other words, this work provides stronger results for the problem of SMT protocols against rational adversaries. The mixed setting of malicious and rational adversaries in this work is closest to the traditional cryptographic setting of SMT. Even in this setting, we present a non-interactive protocol against adversaries corrupting in total t < n/2 channels, for which cryptographic SMT requires interaction or a weaker bound t < n/3.

2

Secure Message Transmission

A sender S and a receiver R are connected by n channels, and in addition, they may use an authentic and reliable public channel. Messages sent over the public channel are publicly accessible and correctly delivered to the receiver. We assume that SMT protocols proceed in rounds. In each round, one party can synchronously send messages over the n channels and the public channel. The messages will be delivered before the next round starts. The adversary A can corrupt at most t channels. Such an adversary is referred to as t-adversary. Messages sent over corrupted channels can be eavesdropped and tampered by the adversary. We assume that the adversary cannot delay messages over the corrupted channels. Namely, the tampered messages will be transmitted to the receiver in the same round. We also assume that A is computationally unbounded.

Perfectly SMT Against Independent Rational Adversaries

567

Let M be the message space. In SMT, the sender tries to send a message in M to the receiver by using n channels and the public channel, and the receiver outputs some message after the protocol execution. For an SMT protocol Π, let MS denote the random variable of the message sent by S and MR the message output by R in Π. An execution of Π can be completely characterized by the random coins of all the parties, namely, S, M, and A, and the message MS sent by S. Let VA (m, rA ) denote the view of A when the protocol is executed with MS = m and the random coins rA of A. Specifically, VA (m, rA ) consists of the messages sent over the corrupted channels and the public channel when the protocol is run with MS = m and A’s random coins rA . We formally define the properties of SMT protocols. Definition 1. A protocol between S and R is (ε, δ)-Secure Message Transmission (SMT) against t-adversary if the following three conditions are satisfied against any t-adversary A. – Correctness: For any m ∈ M, if MS = m and A does not corrupt any channels, then Pr[MR = m] = 1, – Privacy: For any m0 , m1 ∈ M and rA ∈ {0, 1}∗ , it holds that SD(VA (m0 , rA ), VA (m1 , rA )) ≤ ε, where SD(X, Y ) denotes the statistical distance between two random variables X and Y over a set Ω, which is defined by SD(X, Y ) =

1 |Pr[X = u] − Pr[Y = u]| , 2 u∈Ω

and – Reliability: For any message m ∈ M, when MS = m, Pr[MR = m] ≤ δ, where the probability is taken over the random coins of S, R, and A. If a protocol achieves (0, 0)-SMT, the protocol is called perfect SMT, and if a protocol achieves (0, δ)-SMT, which admits transmission failures of small probability δ, the protocol is called almost-reliable SMT. For perfect SMT, Dolev et al. [9] showed the below. Theorem 1 ([9]). Perfect SMT protocols against t-adversary are achievable if and only if t < n/2.

3

SMT Against Independent Rational Adversaries

We define our security model of SMT in the presence of independent rational adversaries. Rationality of the adversary is characterized by a utility function

568

K. Yasunaga and T. Koshiba

which represents the preference of the adversary over possible outcomes of the protocol execution. We can consider various preferences of adversaries regarding the SMT protocol execution. The adversaries may prefer to violate the security of SMT protocols without the detection of tampering actions. In addition, they may prefer other adversaries to be detected by tampering actions. Here, we consider the adversaries who prefer (1) to violate the privacy, (2) to violate the reliability, (3) their tampering actions to be undetected, and (4) other adversaries’ actions to be detected. To define the utility function, we specify the SMT game as follows. We assume that there are λ adversaries 1, 2, . . . , λ for λ ≥ 2. Each adversary does not cooperate with other adversaries. We assume that adversary j ∈ {1, . . . , λ} exclusively corrupt at most tj channels out of the n channels for tj ≥ 1, and that λ j=1 tj ≤ n. The SMT Game. First, set parameters suc = 0 and guessj = detectj = 0 for every j ∈ {1, . . . , λ}. Given an SMT protocol Π with the message space M, choose m ∈ M uniformly at random, and run the protocol Π in which the message to be sent is MS = m. In the protocol execution, adversaries j can exclusively corrupt tj channels, and tamper with any messages sent over the corrupted channels. The sender or the receiver may send a special message “DETECT at i” for i ∈ {1, . . . , n}, meaning that some tampering action was detected at channel i. Then, if adversary j corrupts channel i, set detectj = 1. After running the protocol, the receiver outputs MR , and each adversary j outputs Mj for j ∈ {1, . . . , λ}. If MR = MS , set suc = 1.  S , set  For j ∈ {1, . . . , λ}, if Mj = M guessj = 1. The outcome of the game is suc, {guessj  , detectj  }j  ∈{1,...,λ} . The utility of the adversary is defined as the expected utility in the SMT game. Definition 2 (Utility). The utility Uj (A1 , . . . , Aλ , U ) of adversary j when strategy (A1 , . . . , Aλ ) and utility function U are employed is the expected value that maps index j and the outcome out =  E[U (j, out)], where U is a function suc, {guessj  , detectj  }j  ∈{1,...,λ} of the SMT game to real values, and the probability is taken over the random coins of the sender, the receiver, and the adversaries, and a random choice of message MS . Each adversary j ∈ {1, . . . , λ} tries to maximize utility Uj by choosing a strategy Aj . Since the utility depends on other adversaries’ strategies, we use gametheoretic notions in the security definition. We define the security of rational secure message transmission (RSMT). For strategies B1 , . . . , Bλ , Aj , we denote by (Aj , B−j ) the strategy profile (B1 , . . . , Bj−1 , Aj , Bj+1 , . . . , Bλ ). Definition 3 (Security of RSMT). An SMT protocol Π is perfectly secure against rational (t1 , . . . , tλ )-adversaries with utility function U if there are tj adversary Bj for j ∈ {1, . . . , λ} such that for any tj -adversary Aj for j ∈ {1, . . . , λ},

Perfectly SMT Against Independent Rational Adversaries

569

1. Perfect security: Π is (0, 0)-SMT against (B1 , . . . , Bλ ), and 2. Nash equilibrium: Uj (Aj , B−j , U ) ≤ Uj (Bj , B−j , U ) for every j ∈ {1, . . . , λ} in the SMT game. The perfect security guarantees that the strategy profile (B1 , . . . , Bλ ) is harmless. The Nash equilibrium guarantees that no adversary j can gain more utility by changing the strategy from Bj to Aj . Thus, the above security implies that each adversary j has no incentive to deviate from the harmless strategy Bj . In the security proof of our protocol, we will consider the strategy profile (B1 , . . . , Bλ ) in which each adversary j does not corrupt any channels, and outputs Mj by choosing a message uniformly at random from M. For such (B1 , . . . , Bλ ), the perfect privacy and reliability immediately follow if Π satisfies the correctness. Timid Adversaries We construct secure protocols against independent timid adversaries, who do not prefer the tampering actions to be detected, and prefer to violate the reliability. ind be the set of utility functions that Regarding the utility function, let Utimid satisfy the following conditions: 1. U (j, out) > U (j, out ) if suc < suc , guessj = guessj , and detectj = detectj , 2. U (j, out) > U (j, out ) if suc = suc , guessj = guessj , detectj < detectj , and detectk = detectk for every k ∈ {1, . . . , λ} \ {j}, and 3. U (j, out) > U (j, out ) if suc = suc , guessj = guessj , detectk > detectk for some k = j, and detectj  = detectj  for every j  ∈ {1, . . . , λ} \ {k}, 



where out = suc, {guessj , detectj }j∈{1,...,λ} and out = (suc , {guessj , detectj }j∈{1,...,λ} ) are the outcomes of the SMT game. In addition, timid adversaries may have the following property: 4. U (j, out) > U (j, out ) if suc > suc , guessj = guessj , detectj < detectj , and detectk = detectk for every k ∈ {1, . . . , λ} \ {j}. ind be the set of utility functions satisfying the above four conditions. Let Ust-timid ind , and strictly An adversary is said to be timid if his utility function is in Utimid ind timid if the utility function is in Ust-timid . For j ∈ {1, . . . , n} and b ∈ {0, 1}, we write detect−j = b if detectj  = b for every j  ∈ {1, . . . , n} \ {j}. In the analysis of the security of our protocols, we use the following values of utility of adversary j ∈ {1, . . . , λ}.

– u0 is the utility when Pr[guessj = 1] = 1, – u1 is the utility when Pr[guessj = 1] = 0, – u2 is the utility when Pr[guessj = 1] = 0,

1 |M| ,

suc = 0, detectj = 0, detect−j =

1 |M| ,

suc = 0, detectj = 0, detect−j =

1 |M| ,

suc = 1, detectj = 0, detect−j =

570

K. Yasunaga and T. Koshiba

– u3 is the utility when Pr[guessj = 1] = 0, and – u4 is the utility when Pr[guessj = 1] = 0.

1 |M| ,

suc = 0, detectj = 1, detect−j =

1 |M| ,

suc = 1, detectj = 1, detect−j =

ind , it holds that u0 > u1 > max{u2 , u3 } and For any utility function in Utimid ind , it holds that u0 > u1 > u2 > u3 > min{u2 , u3 } > u4 . If the utility is in Ust-timid u4 .

4

Protocol with Public Channel

We show that the SJST protocol of [30] works as a perfect SMT protocol against independent adversaries. See Sect. A.1 for the description of the protocol. More specifically, we slightly modify the SJST protocol such that in the second and the third rounds, if bi = 1 in B or vi = 1 in V for some i ∈ {1, . . . , n}, the special message “DETECT at i” is also sent together. Theorem 2. For any λ ≥ 2, let t1 , . . . , tλ be integers satisfying t1 + · · · + tλ ≤ n and 1 ≤ ti ≤ n − 1 for every i ∈ {1, . . . , λ}. If the parameter  in the SJST protocol satisfies   1 u3 − u4 u1 − u3 , 1 + log2 1 + log2 t + log2 ≥ max u2 − u4 − α t α t∈{t1 ,...,tλ } for some α ∈ (0, u2 − u4 ), then the protocol is perfectly secure against rational ind . (t1 , . . . , tλ )-adversaries with utility function U ∈ Utimid Proof. For each j ∈ {1, . . . , λ}, let Bj be the adversary who does not corrupt any channels and outputs a uniformly random message from M as Mj . Then, the perfect security for (B1 , . . . , Bλ ) immediately follows. We show that (B1 , . . . , Bλ ) is a Nash equilibrium. Since Uj (B1 , . . . , Bλ ) = u2 for j ∈ {1, . . . , λ}, it is sufficient to show that Uj (Aj , B−j ) ≤ u2 for any tj adversary Aj . Note that, since the SJST protocol achieves the perfect privacy against at most n − 1 corruptions, we have that Pr[guessj = 1] = 1/|M| for any tj -adversary Aj . Since messages in the second and the third rounds are sent through the public channel, the adversary Aj can tamper with messages only in the first round. If Aj changes the lengths of ri or Ri of the i-th channel, the tampering will be detected, and hence detectj = 1. Thus, such tampering cannot increase the utility. Suppose that Aj corrupts tj channels in the first round. Namely, there are exactly tj distinct i’s such that (ri , Ri ) = (ri , Ri ). Note that a tampering action such that ri = ri and Ri = Ri does not increase the probability that suc = 0, but may only increase that of detectj = 1. Hence, we assume that Ri = Ri for all the corrupted channels. Also, note that Aj cannot cause detectj  for j  = j since a message “DETECT at i” is sent only when tampering is made by an adversary who corrupts the i-th channel. Thus, the maximum utility of Uj (Aj , B−j ) is u1 . We define the following events:

Perfectly SMT Against Independent Rational Adversaries

571

– E1 : No tampering action is detected in the protocol, – E2 : At least one but not all tampering actions are detected, and – E3 : All tampering actions are detected. Note that these three events are disjoint, and either event should occur. Thus, we have that Pr[E1 ]+Pr[E2 ]+Pr[E3 ] = 1. It follows from the discussion in Sect. A.3 that the probability that the tampering action on one channel is not detected is 21− . Since each hash function hi is chosen independently for each channel i, we have that Pr[E1 ] = 2(1−)tj . Similarly, we obtain that Pr[E3 ] = (1−21− )tj . Note that the utility when E1 occurs is at most u1 . When E2 occurs, some tampering is detected, but not another tampering. Thus, we have suc = 0 and detectj = 1. In the case of E3 , we have suc = 0 and detectj = 0. Hence, the utilities when E2 and E3 occur are at most u3 and u4 , respectively. Therefore, the utility of adversary j is Uj (Aj , B−j ) ≤ u1 · Pr[E1 ] + u3 · Pr[E2 ] + u4 · Pr[E3 ] = u3 + (u1 − u3 ) Pr[E1 ] − (u3 − u4 ) Pr[E3 ]   ≤ u3 + (u1 − u3 ) 2(1−)tj − (u3 − u4 ) 1 − tj 21−   ≤ u3 + α − (u3 − u4 ) 1 − tj 21− ≤ u2 ,

(1) (2)

3 3 −u4 where we use the relations  ≥ 1+ t1j log2 u1 −u and  ≥ 1+log2 tj +log2 u2u−u α 4 −α in (1) and (2), respectively. Thus, the utility of adversary j when playing with (Aj , B−j ) is at most u2 for every j ∈ {1, . . . , λ}, and hence the statement follows. 

5

Protocol for Minority Corruptions

We provide a non-interactive SMT protocol based on secret-sharing and pairwise independent hash functions. The protocol is secure against independent adversaries who only corrupt minorities of the channels. Namely, we assume that each adversary corrupts at most (n − 1)/2 channels. Note that the protocol does not use the public channel as in the protocol in Sect. 4. We describe the construction of our protocol. The protocol can employ any secret-sharing scheme of threshold (n − 1)/2, which may be Shamir’s scheme described in Sect. A.2. Let (s1 , . . . , sn ) be the shares generated by the scheme from the message to be sent. Then, pairwise independent hash functions hi are chosen for each i ∈ {1, . . . , n}. For any j = i, hi (sj ) is computed as an authentication tag for sj . Then, (si , hi , {hi (sj )}j=i ) will be sent through the ith channel. When si is modified to si = si by some adversary, the modification can be detected by the property of pairwise independent hash functions because the adversary cannot modify all tags hj (si ) for j = i. In addition, a random mask ri,j is applied to hi (sj ) to conceal the information of sj in hi (sj ). The masks {rj,i }j=i for si will be sent through the i-th channel so that only the i-th channel

572

K. Yasunaga and T. Koshiba

reveals the information of si . Hence, the message sent through the i-th channel is (si , hi , {hi (sj ) ⊕ ri,j }j=i , {rj,i }j=i ). As long as minorities of the channels are corrupted by each adversary, a single adversary cannot cause erroneous detection of silent adversaries. We give a formal description. Protocol 1. Let (Share, Reconst) be a secret-sharing scheme of threshold (n − 1)/2, where a secret is chosen from M, and the shares are defined over V. Let m ∈ M be the message to be sent by the sender, and H = {h : V → {0, 1} } a class of pairwise independent hash functions in Sect. A.3. 1. The sender does the following: Generate the shares (s1 , . . . , sn ) by Share(m), and randomly choose hi ∈ H for each i ∈ {1, . . . , n}. Also, for every distinct i, j ∈ {1, . . . , n}, choose ri,j ∈ {0, 1} uniformly at random, and i ∈ {1, . . . , n}, send then compute Ti,j = hi (sj ) ⊕ ri,j . Then, for each   through the i-th channel. , {r } mi = si , hi , {Ti,j }j∈{1,...,n}\{i} j,i j∈{1,...,n}\{i}  ˜ i , {T˜i,j }j∈{1,...,n}\{i} , {˜ rj,i }j∈{1,...,n}\{i} on each 2. After receiving m ˜ i = s˜i , h channel i ∈ {1, . . . , n}, the receiver

Theorem 3. For any $\lambda \ge 2$, let $t_1, \dots, t_\lambda$ be integers satisfying $t_1 + \dots + t_\lambda \le n$ and $1 \le t_i \le (n-1)/2$ for every $i \in \{1, \dots, \lambda\}$. If the parameter $\ell$ in Protocol 1 satisfies
$$\ell \ge \log_2 \frac{u_1 - u_4}{u_2 - u_4} + 2 \log_2 (n+1) - 1,$$
then the protocol is perfectly secure against rational $(t_1, \dots, t_\lambda)$-adversaries with utility function $U \in \mathcal{U}^{\mathrm{ind}}_{\mathrm{timid}}$.

Proof. For $k \in \{1, \dots, \lambda\}$, let $B_k$ be the $t_k$-adversary who does not corrupt any channels and outputs a random message as $M_k$. First, note that, for any $i \in \{1, \dots, n\}$, the information of $s_i$ can be obtained only from $m_i$, the message sent over the $i$-th channel. This is because, for any $j \ne i$, $h_j(s_i)$ is masked as $h_j(s_i) \oplus r_{j,i}$, and the random mask $r_{j,i}$ is included only in $m_i$. Also, each $s_i$ is a share of the secret sharing of threshold $(n-1)/2$. Since $B_k$ can obtain at most $(n-1)/2$ shares, $B_k$ learns nothing about the message sent from the sender. Thus, perfect security is achieved for $(B_1, \dots, B_\lambda)$.

Next, we show that $(B_1, \dots, B_\lambda)$ is a Nash equilibrium. For $k \in \{1, \dots, \lambda\}$, let $A_k$ be any $t_k$-adversary. Since $U_k(B_1, \dots, B_\lambda) = u_2$, to increase its utility, $A_k$ needs to achieve either (a) $\mathsf{suc} = 0$, or (b) $\mathsf{suc} = 1$, $\mathsf{detect}_k = 0$, and $\mathsf{detect}_{k'} = 1$ for some $k' \ne k$.

For case (a), $A_k$ tries to change $s_i$ into $\tilde{s}_i \ne s_i$ for some $i \in \{1, \dots, n\}$. Since $A_k$ does not corrupt some channel $i' \in \{1, \dots, n\}$, the index $i$ corrupted by $A_k$ will be included in the list $L_{i'}$ unless $h_{i'}(\tilde{s}_i) \oplus \tilde{r}_{i',i} = T_{i',i}$. Note that $\tilde{s}_i$ and $\tilde{r}_{i',i}$ are included in $\tilde{m}_i$, and thus can be changed, but $h_{i'}$ and $T_{i',i}$ are in $\tilde{m}_{i'}$, and thus remain unchanged. It follows from the property of pairwise independent hash functions that this can happen with probability $2^{1-\ell}$, assuming $\tilde{s}_i \ne s_i$. Thus, $i$ will be included in $L_{i'}$ with probability at least $1 - 2^{1-\ell}$. Since there are at least $n - (n-1)/2 = (n+1)/2$ such indices $i'$, the probability that a majority of the lists contain $i$ is at least $1 - (n+1)/2 \cdot 2^{1-\ell}$. Note that $A_k$ may corrupt $(n-1)/2$ channels in total. The probability that all the corrupted indices are included in a majority of the lists is at least
$$1 - \frac{n-1}{2} \cdot \frac{n+1}{2} \cdot 2^{1-\ell} \ge 1 - (n+1)^2 \cdot 2^{-(\ell+1)}.$$
In that case, the message can be reconstructed from the other shares, and thus we have $\mathsf{suc} = 1$, $\mathsf{detect}_k = 1$, and $\mathsf{detect}_{k'} = 0$ for $k' \ne k$, resulting in the utility $u_4$. Since $A_k$ only corrupts a minority of the channels, it cannot cause $\mathsf{detect}_{k'} = 1$ for $k' \ne k$, so the maximum utility of $A_k$ is $u_1$. Thus, the utility of adversary $k$ when tampering so that $\tilde{s}_i \ne s_i$ is at most
$$U_k(A_k, B_{-k}) \le (n+1)^2 \cdot 2^{-(\ell+1)} \cdot u_1 + \bigl(1 - (n+1)^2 \cdot 2^{-(\ell+1)}\bigr) \cdot u_4,$$
which is at most $u_2$ by the assumption on $\ell$.

For case (b), $A_k$ would need some index $j$ corrupted by $B_{k'}$ with $k' \ne k$ to be included in a majority of the lists, i.e., $\tilde{h}_i(s_j) \oplus r_{i,j} \ne \tilde{T}_{i,j}$ for a majority of indices $i$; since $s_j$ and $r_{i,j}$ are not tampered with, this requires corrupting each such channel $i$, and $A_k$ only corrupts a minority of the channels, so this cannot happen. Therefore, $(B_1, \dots, B_\lambda)$ is a Nash equilibrium. □
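As a quick numerical check of the bound on $\ell$ in Theorem 3, the snippet below computes the smallest admissible integer tag length for some placeholder utility values; the utilities are illustrative only and are not taken from the paper.

```python
import math

def min_tag_length(n, u1, u2, u4):
    """Smallest integer ell satisfying the bound of Theorem 3
    (assumes u1 > u2 > u4, so the logarithm is well defined)."""
    bound = math.log2((u1 - u4) / (u2 - u4)) + 2 * math.log2(n + 1) - 1
    return max(1, math.ceil(bound))

# Illustrative utilities only: a successful undetected attack is worth 1.0,
# keeping silent 0.5, and being the only detected adversary -1.0.
print(min_tag_length(n=7, u1=1.0, u2=0.5, u4=-1.0))  # -> 6
```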

6 Protocol for Majority Corruptions

We present a protocol against adversaries who may corrupt a majority of the channels. We assume that the adversaries are strictly timid in this setting. The protocol is a minor modification of the protocol for minority corruptions. In Protocol 1, the lists $L_i$ of corrupted channels are generated for each channel, and the final list $L$ is determined by majority voting. Thus, if an adversary corrupts a majority of the channels, the result of the majority voting can easily be forged, and hence the protocol does not work for majority corruptions. To cope with majority corruptions, we modify the protocol so that (1) the threshold of the secret sharing is changed from $(n-1)/2$ to $n-1$, and (2) the final list $L$ of corrupted channels is the union of all the lists $L_i$, namely, $L = L_1 \cup \dots \cup L_n$. The threshold $n-1$ can be achieved by Shamir's scheme. Intuitively, this protocol works for strictly timid adversaries because any detected tampering is accepted without voting, and thus such adversaries keep silent so as not to be detected. We give a formal description of the protocol.

Protocol 2. Let (Share, Reconst) be a secret-sharing scheme of threshold $n-1$, where a secret is chosen from $\mathcal{M}$ and the shares are defined over $\mathcal{V}$. Let $m \in \mathcal{M}$ be the message to be sent by the sender, and $\mathcal{H} = \{h : \mathcal{V} \to \{0,1\}^\ell\}$ a class of pairwise independent hash functions as in Sect. A.3.

1. The sender does the following: Generate the shares $(s_1, \dots, s_n)$ by $\mathrm{Share}(m)$, and randomly choose $h_i \in \mathcal{H}$ for each $i \in \{1, \dots, n\}$. Also, for every distinct $i, j \in \{1, \dots, n\}$, choose $r_{i,j} \in \{0,1\}^\ell$ uniformly at random, and then compute $T_{i,j} = h_i(s_j) \oplus r_{i,j}$. Then, for each $i \in \{1, \dots, n\}$, send $m_i = \bigl(s_i, h_i, \{T_{i,j}\}_{j \in \{1,\dots,n\} \setminus \{i\}}, \{r_{j,i}\}_{j \in \{1,\dots,n\} \setminus \{i\}}\bigr)$ through the $i$-th channel.
2. After receiving $\tilde{m}_i = \bigl(\tilde{s}_i, \tilde{h}_i, \{\tilde{T}_{i,j}\}_{j \in \{1,\dots,n\} \setminus \{i\}}, \{\tilde{r}_{j,i}\}_{j \in \{1,\dots,n\} \setminus \{i\}}\bigr)$ on each channel $i \in \{1, \dots, n\}$, the receiver does the following: For every $i \in \{1, \dots, n\}$, compute the list $L_i = \{ j \in \{1, \dots, n\} : \tilde{h}_i(\tilde{s}_j) \oplus \tilde{r}_{i,j} \ne \tilde{T}_{i,j} \}$. Then, set $L = L_1 \cup \dots \cup L_n$. If $L = \emptyset$, reconstruct the message $\tilde{m}$ by $\mathrm{Reconst}(\{(i, \tilde{s}_i)\}_{i \in \{1,\dots,n\}})$ and output $\tilde{m}$. Otherwise, send the messages "DETECT at $i$" for every $i \in L$, and output $\bot$ as the failure symbol.

Theorem 4. For any $\lambda \ge 2$, let $t_1, \dots, t_\lambda$ be integers satisfying $t_1 + \dots + t_\lambda \le n$ and $1 \le t_i \le n-1$ for every $i \in \{1, \dots, \lambda\}$. If the parameter $\ell$ in Protocol 2 satisfies
$$\ell \ge \log_2 \frac{u_0 - u_3}{u_2 - u_3} - 1,$$
then the protocol is perfectly secure against rational $(t_1, \dots, t_\lambda)$-adversaries with utility function $U \in \mathcal{U}^{\mathrm{ind}}_{\mathrm{st\text{-}timid}}$.

Proof. For $k \in \{1, \dots, \lambda\}$, we define $B_k$ as the $t_k$-adversary who does not corrupt any channels and outputs a random message as $M_k$. For the same reason as in the proof of Theorem 3, the protocol is perfectly secure against $(B_1, \dots, B_\lambda)$.

Next, we show that $(B_1, \dots, B_\lambda)$ is a Nash equilibrium. Let $A_k$ be any $t_k$-adversary for $k \in \{1, \dots, \lambda\}$. As in the proof of Theorem 3, $A_k$ needs to achieve either (a) $\mathsf{suc} = 0$, or (b) $\mathsf{suc} = 1$, $\mathsf{detect}_k = 0$, and $\mathsf{detect}_{k'} = 1$ for some $k' \ne k$. For case (a), $A_k$ needs to corrupt the $i$-th channel so that $\tilde{s}_i \ne s_i$. Since there is at least one index $i' \in \{1, \dots, n\}$ that is not corrupted by $A_k$, the index $i$ is included in the list $L_{i'}$ with probability at least $1 - 2^{1-\ell}$. Thus, the utility of adversary $k$ is at most
$$U_k(A_k, B_{-k}) \le 2^{-(\ell+1)} \cdot u_0 + \bigl(1 - 2^{-(\ell+1)}\bigr) \cdot u_3,$$
which is at most $u_2$ by assumption. For case (b), if some index is in the final list $L$, then, since the threshold of the secret sharing is $n-1$, the message is not reconstructed, and we have $\mathsf{suc} = 0$; namely, (b) does not happen. Thus, $(B_1, \dots, B_\lambda)$ is a Nash equilibrium. □
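On the receiver's side, the only change relative to Protocol 1 is the decision rule. The sketch below (using the same abstractions as the Protocol 1 sketch, with a hypothetical `reconst` routine of a threshold-$(n-1)$ scheme) shows the union-based rule of Protocol 2.

```python
def receiver_decide_union(lists, shares, reconst):
    """Protocol 2, step 2: accept only if no channel is suspected by anyone.
    `reconst` is a hypothetical reconstruction routine of a threshold-(n-1)
    scheme, so all n shares are needed to recover the message."""
    L = set().union(*lists)              # L = L_1 ∪ ... ∪ L_n
    if not L:
        return reconst(list(enumerate(shares)))
    for i in sorted(L):
        print(f"DETECT at {i}")
    return None                          # the failure symbol ⊥
```

Because a single suspected channel already aborts the run, a strictly timid adversary gains nothing by tampering, which is the intuition behind Theorem 4.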

7 SMT Against Malicious and Rational Adversaries

In the previous sections, we have discussed SMT against independent rational adversaries, assuming that all the adversaries behave rationally. This assumption may be strong in the sense that it requires every adversary to be characterized by the utility function we defined. In this section, we discuss more realistic situations in which some adversary may behave not rationally but maliciously.

7.1 Rational SMT in the Presence of a Malicious Adversary

Without loss of generality, we assume that there are $\lambda \ge 2$ adversaries, that adversaries $1, \dots, \lambda-1$ are rational, and that adversary $\lambda$ behaves maliciously. We use the same definitions of the SMT game and the utility function as in Sect. 3. We define robust security against rational adversaries; a similar definition appeared in the context of rational secret sharing [1]. For strategies $B_1, \dots, B_{\lambda-1}$, $A_\lambda$, and $A_j$ for $j \in \{1, \dots, \lambda-1\}$, we denote by $(A_j, B_{-j}, A_\lambda)$ the strategy profile $(B_1, \dots, B_{j-1}, A_j, B_{j+1}, \dots, B_{\lambda-1}, A_\lambda)$.

Definition 4 (Security of Robust RSMT). An SMT protocol $\Pi$ is $t^*$-robust perfectly secure against rational $(t_1, \dots, t_{\lambda-1})$-adversaries with utility function $U$ if there are $t_j$-adversaries $B_j$ for $j \in \{1, \dots, \lambda-1\}$ such that for any $t_j$-adversary $A_j$ for $j \in \{1, \dots, \lambda-1\}$ and any $t^*$-adversary $A_\lambda$,

1. Perfect security: $\Pi$ is $(0,0)$-SMT against $(B_1, \dots, B_{\lambda-1}, A_\lambda)$, and
2. Robust Nash equilibrium: $U_j(A_j, B_{-j}, A_\lambda, U) \le U_j(B_j, B_{-j}, A_\lambda, U)$ for every $j \in \{1, \dots, \lambda-1\}$ in the SMT game.

Compared to Definition 3, robust RSMT requires that perfect security is achieved even in the presence of a malicious adversary $A_\lambda$, and that the strategy profile $(B_1, \dots, B_{\lambda-1}, A_\lambda)$ is a Nash equilibrium for every adversary $j \in \{1, \dots, \lambda-1\}$.

7.2 Protocol Against Malicious and Rational Adversaries

We show that a robust RSMT protocol can be constructed based on the protocol for minority corruptions in Sect. 5. For $t^*$-robustness against $(t_1, \dots, t_{\lambda-1})$-adversaries, we assume that $t^* \le (n-1)/3$ and $1 \le t_j \le \min\{(n-1)/2 - t^*, (n-1)/3\}$ for each $j \in \{1, \dots, \lambda-1\}$. Our non-interactive protocol is obtained simply by changing the threshold of the secret sharing in Protocol 1 from $(n-1)/2$ to $(n-1)/3$. This protocol works because, when only the malicious adversary corrupts at most $(n-1)/3$ channels, transmission failure does not occur, due to the error-correction property of the secret sharing. Thus, perfect security is achieved in the presence of a malicious adversary. Even if some rational adversary deviates from the protocol together with the malicious adversary, they can affect at most $t_j + t^* \le (n-1)/2$ votes, and thus any tampering will be identified with high probability by the majority voting. The formal description is given below; an illustrative check of the corruption budgets follows the protocol.

Protocol 3. Let (Share, Reconst) be a secret-sharing scheme of threshold $(n-1)/3$, where a secret is chosen from $\mathcal{M}$, the shares are defined over $\mathcal{V}$, and the secret can be reconstructed even if $(n-1)/3$ out of the $n$ shares are tampered with. Let $m \in \mathcal{M}$ be the message to be sent by the sender, and $\mathcal{H} = \{h : \mathcal{V} \to \{0,1\}^\ell\}$ a class of pairwise independent hash functions as in Sect. A.3.

1. The sender does the following: Generate the shares $(s_1, \dots, s_n)$ by $\mathrm{Share}(m)$, and randomly choose $h_i \in \mathcal{H}$ for each $i \in \{1, \dots, n\}$. For every distinct $i, j \in \{1, \dots, n\}$, choose $r_{i,j} \in \{0,1\}^\ell$ uniformly at random, and then compute $T_{i,j} = h_i(s_j) \oplus r_{i,j}$. For each $i \in \{1, \dots, n\}$, send $m_i = \bigl(s_i, h_i, \{T_{i,j}\}_{j \in \{1,\dots,n\} \setminus \{i\}}, \{r_{j,i}\}_{j \in \{1,\dots,n\} \setminus \{i\}}\bigr)$ through the $i$-th channel.
2. After receiving $\tilde{m}_i = \bigl(\tilde{s}_i, \tilde{h}_i, \{\tilde{T}_{i,j}\}_{j \in \{1,\dots,n\} \setminus \{i\}}, \{\tilde{r}_{j,i}\}_{j \in \{1,\dots,n\} \setminus \{i\}}\bigr)$ on each channel $i \in \{1, \dots, n\}$, the receiver does the following: For every $i \in \{1, \dots, n\}$, compute the list $L_i = \{ j \in \{1, \dots, n\} : \tilde{h}_i(\tilde{s}_j) \oplus \tilde{r}_{i,j} \ne \tilde{T}_{i,j} \}$. If a majority of the lists coincide with a list $L$, reconstruct the message $\tilde{m}$ by $\mathrm{Reconst}(\{(i, \tilde{s}_i)\}_{i \in \{1,\dots,n\}})$, send the messages "DETECT at $i$" for every $i \in L$, and output $\tilde{m}$.
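The corruption budgets assumed above can be checked mechanically. The helper below is illustrative only and reads the bounds $(n-1)/2$ and $(n-1)/3$ as integer floors.

```python
def budgets_ok(n, t_star, ts):
    """Check the corruption budgets assumed for Protocol 3 / Theorem 5:
    t* at most (n-1)/3, every rational adversary limited to
    min{(n-1)/2 - t*, (n-1)/3}, and all budgets summing to at most n.
    Integer division is used, i.e. the bounds are read as floors."""
    cap = min((n - 1) // 2 - t_star, (n - 1) // 3)
    return (0 <= t_star <= (n - 1) // 3
            and all(1 <= t <= cap for t in ts)
            and sum(ts) + t_star <= n)

print(budgets_ok(n=10, t_star=3, ts=[1, 1]))  # True: each t_j <= min(4 - 3, 3) = 1
```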

For the security analysis, we define the following utility values for adversary $j \in \{1, \dots, \lambda-1\}$:

– $u'_1$ is the utility in the same case as $u_1$ except that $\mathsf{detect}_\lambda = 1$,
– $u'_2$ is the utility in the same case as $u_2$ except that $\mathsf{detect}_\lambda = 1$, and
– $u'_4$ is the utility in the same case as $u_4$ except that $\mathsf{detect}_\lambda = 1$.

The values $u_1, u_2, u_4$ are defined for the case that $\mathsf{detect}_{j'} = 0$ for every $j' \in \{1, \dots, \lambda\} \setminus \{j\}$. In the above, the values $u'_1, u'_2, u'_4$ are defined for the case that $\mathsf{detect}_{j'} = 0$ for every $j' \in \{1, \dots, \lambda-1\} \setminus \{j\}$ and $\mathsf{detect}_\lambda = 1$.

Theorem 5. For any $\lambda \ge 2$, let $t_1, \dots, t_{\lambda-1}, t^*$ be integers satisfying $t_1 + \dots + t_{\lambda-1} + t^* \le n$, $0 \le t^* \le (n-1)/3$, and $1 \le t_i \le \min\{(n-1)/2 - t^*, (n-1)/3\}$ for every $i \in \{1, \dots, \lambda-1\}$. If the parameter $\ell$ in Protocol 3 satisfies
$$\ell \ge \max_{(u^*_1, u^*_2, u^*_4) \in \{(u_1, u_2, u_4), (u'_1, u'_2, u'_4)\}} \left( \log_2 \frac{u^*_1 - u^*_4}{u^*_2 - u^*_4} + 2 \log_2 (n+1) - 1 \right),$$
then the protocol is $t^*$-robust perfectly secure against rational $(t_1, \dots, t_{\lambda-1})$-adversaries with utility function $U \in \mathcal{U}^{\mathrm{ind}}_{\mathrm{timid}}$.

Proof. For $k \in \{1, \dots, \lambda-1\}$, let $B_k$ be the $t_k$-adversary who does not corrupt any channels and outputs a random message as $M_k$. Let $A_\lambda$ be any $t^*$-adversary. Note that the information of $s_i$ can be obtained only by seeing $m_i$, since each $h_j(s_i)$ is masked by $r_{j,i}$, which is included only in $m_i$. Since each $s_i$ is a share of the secret sharing of threshold $(n-1)/3$, each adversary $B_k$ and $A_\lambda$ can learn nothing about the original message. Although at most $t^*$ messages may be corrupted by $A_\lambda$, it follows from the property of the underlying secret sharing that the message can be correctly recovered in the presence of $t^* \le (n-1)/3$ corruptions out of the $n$ shares. Thus, the protocol is perfectly secure against $(B_1, \dots, B_{\lambda-1}, A_\lambda)$.

Next, we show that $(B_1, \dots, B_{\lambda-1}, A_\lambda)$ is a Nash equilibrium for any $A_\lambda$. When the strategy profile $(B_1, \dots, B_{\lambda-1}, A_\lambda)$ is employed, we have $\mathsf{suc} = 1$. Hence, to increase the utility of adversary $k$, $A_k$ needs to achieve either (a) $\mathsf{suc} = 0$, or (b) $\mathsf{suc} = 1$, $\mathsf{detect}_k = 0$, and $\mathsf{detect}_{k'} = 1$ for some $k' \ne k$.

For case (a), $A_k$ tries to change $s_i$ into $\tilde{s}_i \ne s_i$ for some $i \in \{1, \dots, n\}$. When playing with $(A_k, B_{-k}, A_\lambda)$, the number of corrupted channels is at most $t_k + t^* \le (n-1)/2$. Hence, a majority of the indices $i'$ are not corrupted by $A_k$ or $A_\lambda$, and for each such $i'$, the tampering on the $i$-th channel will be detected, namely, the list $L_{i'}$ will include $i$ with high probability. By the same argument as in the proof of Theorem 3, any tampering $\tilde{s}_i \ne s_i$ by $A_k$ and $A_\lambda$ is detected with probability at least $1 - (n+1)^2 \cdot 2^{-(\ell+1)}$. Thus, we have
$$U_k(A_k, B_{-k}, A_\lambda) \le (n+1)^2 \cdot 2^{-(\ell+1)} \cdot u^*_1 + \bigl(1 - (n+1)^2 \cdot 2^{-(\ell+1)}\bigr) \cdot u^*_4,$$
which is at most $u^*_2$ by the assumption on $\ell$.

For case (b), $A_k$ would need $j \in L_i$ for a majority of the lists $L_i$, where the $j$-th channel is corrupted by adversary $k'$. However, since $A_k$ and $A_\lambda$ together corrupt only a minority of the channels, this event cannot happen. Thus, we have shown that $(B_1, \dots, B_{\lambda-1})$ is a robust Nash equilibrium. □

8 Conclusions

We have studied the problem of constructing SMT protocols against adversaries who may corrupt all the channels between the sender and the receiver. If all adversaries are malicious, we cannot hope for reliable transmission, because adversaries who interrupt all the messages can cause transmission failure. Also, if a single adversary corrupts all the channels, we cannot achieve privacy, since the adversary obtains the same information as the receiver, who can recover the transmitted message. We have shown that if multiple rational adversaries exclusively corrupt the channels, perfectly secure SMT protocols can be constructed. Our results demonstrate that, even if all the physical resources may be corrupted by adversaries, it is possible to provide secure protocols by taking into account the rationality and independence of each group of adversaries.

Acknowledgments. This work was supported in part by JSPS Grants-in-Aid for Scientific Research Numbers 16H01705, 17H01695, 18K11159, and 19K22849.

A Building Blocks

A.1 The SJST Protocol

We describe an almost-reliable SMT protocol using the public channel, proposed by Shi, Jiang, Safavi-Naini, and Tuhin [30]; we refer to it as the SJST protocol. The protocol is based on a simple protocol for "static" adversaries, in which the sender sends a random key $R_i$ over the $i$-th channel for each $i \in \{1, \dots, n\}$ and the encrypted message $c = m \oplus R_1 \oplus \dots \oplus R_n$ over the public channel. Suppose that the adversary sees the messages sent over the corrupted channels but does not change them. Since the adversary cannot see at least one key $R_j$ when corrupting fewer than $n$ channels, the mask $R_1 \oplus \dots \oplus R_n$ used for the encryption looks random to the adversary. Thus, the message $m$ can be securely encrypted and reliably sent through the public channel. To cope with "active" adversaries, who may change the messages sent over the corrupted channels, the SJST protocol employs a mechanism for detecting the adversary's tampering by using hash functions. Specifically, pairwise independent hash functions (see Sect. A.3) satisfy the following property: when a pair of keys $(r_i, R_i)$ is changed to $(r'_i, R'_i) \ne (r_i, R_i)$, the hash value for $(r'_i, R'_i)$ differs from that for $(r_i, R_i)$ with high probability if the hash function is chosen at random after the tampering occurred. In the SJST protocol, the sender sends a pair of keys $(r_i, R_i)$ over the $i$-th channel. Then, the receiver chooses $n$ pairwise independent hash functions $h_i$ and sends them over the public channel. By comparing the hash values for the pairs $(r_i, R_i)$ sent by the sender with those for the pairs $(r'_i, R'_i)$ received by the receiver, they can identify the channels over which the messages, i.e., the keys, were tampered with. By ignoring the keys sent over such channels, the sender can correctly encrypt a message $m$ with the untampered keys and send the encryption reliably over the public channel. We describe the SJST protocol below; it is a three-round protocol and achieves reliability with $\delta = (n-1) \cdot 2^{1-\ell}$, where $\ell$ is the length of the hash values.

Protocol 4 (The SJST protocol [30]). Let $n$ be the number of channels, $m \in \mathcal{M}$ the message to be sent by the sender $S$, and $\mathcal{H} = \{h : \{0,1\}^k \to \{0,1\}^\ell\}$ a class of pairwise independent hash functions.

1. For each $i \in \{1, \dots, n\}$, $S$ chooses $r_i \in \{0,1\}^\ell$ and $R_i \in \{0,1\}^k$ uniformly at random, and sends the pair $(r_i, R_i)$ over the $i$-th channel.
2. For each $i \in \{1, \dots, n\}$, $R$ receives $(r'_i, R'_i)$ through the $i$-th channel, and then chooses $h_i \leftarrow \mathcal{H}$ uniformly at random. If $|r'_i| \ne \ell$ or $|R'_i| \ne k$, set $b_i = 1$; otherwise, set $b_i = 0$. Then, set $T_i = r'_i \oplus h_i(R'_i)$, and $H_i = (h_i, T_i)$ if $b_i = 0$, and $H_i = \bot$ otherwise. Finally, $R$ sends $(B, H_1, \dots, H_n)$ over the public channel, where $B = (b_1, \dots, b_n)$.
3. $S$ receives $(B, H_1, \dots, H_n)$ through the public channel. For each $i \in \{1, \dots, n\}$ with $b_i = 0$, $S$ computes $T'_i = r_i \oplus h_i(R_i)$, and sets $v_i = 0$ if $T'_i = T_i$, and $v_i = 1$ otherwise. Then, $S$ sends $(V, c)$ over the public channel, where $V = (v_1, \dots, v_n)$ and $c = m \oplus \bigl(\bigoplus_{v_i = 0} R_i\bigr)$.
4. On receiving $(V, c)$, $R$ recovers $m = c \oplus \bigl(\bigoplus_{v_i = 0} R'_i\bigr)$.

Theorem 6 ([30]). The SJST protocol is $(0, (n-1) \cdot 2^{1-\ell})$-SMT against any $t$-adversary for $t < n$.

A complete proof of the above theorem can be found in [30]. For self-containment, we give a brief sketch of the proof after the following illustration of the message flow.
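The Python sketch below walks through the three rounds of the SJST protocol for honest channels. Bit strings are modelled as integers, the pairwise independent hash is abstracted by a hypothetical `sample_hash` helper, and the length-check flags $b_i$ are omitted, so this illustrates the message flow rather than reproducing the protocol of [30] exactly.

```python
import secrets
from functools import reduce

K, ELL = 128, 32   # key length k and tag length ell (illustrative values)

def sjst_honest_run(m, n, sample_hash):
    """One honest run of the SJST protocol with message m < 2**K.
    Bit strings are modelled as integers, so the length-check flags b_i
    of the real protocol are omitted here."""
    # Round 1 (channels): S sends (r_i, R_i) over channel i.
    sent = [(secrets.randbits(ELL), secrets.randbits(K)) for _ in range(n)]
    recv = list(sent)                                # honest channels deliver unchanged

    # Round 2 (public): R hashes what it received and publishes (h_i, T_i).
    hs = [sample_hash() for _ in range(n)]
    T = [recv[i][0] ^ hs[i](recv[i][1]) for i in range(n)]

    # Round 3 (public): S recomputes the tags, flags mismatches in V, and
    # encrypts m with the XOR of the keys on the channels that passed.
    V = [0 if sent[i][0] ^ hs[i](sent[i][1]) == T[i] else 1 for i in range(n)]
    pad_S = reduce(lambda a, b: a ^ b, (sent[i][1] for i in range(n) if V[i] == 0), 0)
    c = m ^ pad_S

    # R decrypts with its own copies of the accepted keys.
    pad_R = reduce(lambda a, b: a ^ b, (recv[i][1] for i in range(n) if V[i] == 0), 0)
    return c ^ pad_R == m
```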

– Privacy: The adversary can get $c = m \oplus \bigl(\bigoplus_{v_i = 0} R_i\bigr)$ through the public channel. Since $m$ is masked by uniformly random $R_i$'s, the adversary has to corrupt all the channels $i$ with $v_i = 0$ to recover $m$. However, since any $t$-adversary can corrupt at most $t$ ($< n$) channels, the adversary can cause $v_i = 1$ for at most $n-1$ indices $i$. Hence, there is at least one $i$ with $v_i = 0$ for which the adversary cannot obtain $R_i$. Thus, the protocol satisfies perfect privacy.
– Reliability: Since the protocol uses the public channel in the second and third rounds, the adversary can tamper with the channels only in the first round. Suppose that the adversary tampers with $(r_i, R_i)$. If $R'_i \ne R_i$ and $T'_i = T_i$, then $R$ would recover a wrong message, but the tampering is not detected. It follows from the property of pairwise independent hash functions (see Sect. A.3) that the probability that this event happens is at most $(n-1) \cdot 2^{1-\ell}$. Thus, the protocol achieves reliability with $\delta = (n-1) \cdot 2^{1-\ell}$.

A.2 Secret Sharing

Secret sharing, introduced by Shamir [29] and Blakley [8], enables us to distribute secret information securely. Let $s \in \mathbb{F}$ be a secret from some finite field $\mathbb{F}$. A (threshold) secret-sharing scheme provides a way of distributing $s$ into $n$ shares $s_1, \dots, s_n$ such that, for some parameter $t > 0$, (1) any $t$ shares give no information about $s$, and (2) any $t+1$ shares uniquely determine $s$.

Definition 5. Let $t, n$ be positive integers with $t < n$. A $(t, n)$-secret-sharing scheme with range $\mathcal{G}$ consists of two algorithms (Share, Reconst) satisfying the following conditions:

– Correctness: For any $s \in \mathcal{G}$ and $I \subseteq \{1, \dots, n\}$ with $|I| > t$,
$$\Pr\bigl[(\tilde{s}, J) \leftarrow \mathrm{Reconst}(\{(i, s_i)\}_{i \in I}) \wedge \tilde{s} = s\bigr] = 1,$$
where $(s_1, \dots, s_n) \leftarrow \mathrm{Share}(s)$, and
– Perfect Privacy: For any $s, s' \in \mathcal{G}$ and $I \subseteq \{1, \dots, n\}$ with $|I| \le t$,
$$\mathrm{SD}\bigl(\{s_i\}_{i \in I}, \{s'_i\}_{i \in I}\bigr) = 0,$$
where $(s_1, \dots, s_n) \leftarrow \mathrm{Share}(s)$ and $(s'_1, \dots, s'_n) \leftarrow \mathrm{Share}(s')$.

Shamir [29] gave a $(t, n)$-secret-sharing scheme based on polynomial evaluation for any $t < n$. Let $\mathbb{F}$ be a finite field of size at least $n$. For a given secret $s \in \mathbb{F}$, the sharing algorithm chooses random elements $r_1, \dots, r_t \in \mathbb{F}$ and constructs the polynomial $f(x) = s + r_1 x + r_2 x^2 + \dots + r_t x^t$ of degree $t$ over $\mathbb{F}$. Then, for a fixed set of $n$ distinct elements $\{a_1, \dots, a_n\} \subseteq \mathbb{F}$, the $i$-th share is $f(a_i)$ for $i \in \{1, \dots, n\}$. Given $\{(i, f(a_i))\}_{i \in I}$ for $|I| > t$, the reconstruction algorithm recovers the polynomial $f$ by polynomial interpolation and outputs $f(0) = s$ as the recovered secret; a sketch of this scheme in code is given below. McEliece and Sarwate [26] observed that Shamir's scheme is closely related to Reed–Solomon codes, and thus the shares can be efficiently recovered even if some of them have been tampered with. We will use the fact that, even if at most $(n-1)/3$ out of the $n$ shares are tampered with, the original secret can be correctly recovered by decoding algorithms for Reed–Solomon codes.
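A minimal sketch of Shamir's scheme over a small prime field follows. The prime and the evaluation points are illustrative, and the error-correcting reconstruction via Reed–Solomon decoding mentioned above is not implemented.

```python
import secrets

P = 2_147_483_647   # a prime larger than n; all arithmetic is over GF(P)

def share(s, t, n):
    """Shamir sharing: any t+1 of the n shares determine the secret s."""
    coeffs = [s % P] + [secrets.randbelow(P) for _ in range(t)]  # f(x) = s + r_1 x + ... + r_t x^t
    return [(a, sum(c * pow(a, e, P) for e, c in enumerate(coeffs)) % P)
            for a in range(1, n + 1)]

def reconst(points):
    """Recover f(0) by Lagrange interpolation from more than t untampered shares."""
    secret = 0
    for x_i, y_i in points:
        num, den = 1, 1
        for x_j, _ in points:
            if x_j != x_i:
                num = num * (-x_j) % P
                den = den * (x_i - x_j) % P
        secret = (secret + y_i * num * pow(den, -1, P)) % P
    return secret

shares = share(123456789, t=2, n=5)
assert reconst(shares[:3]) == 123456789
```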

A.3 Pairwise Independent Hash Functions

Wegman and Carter [33] introduced the notion of pairwise independent (or strongly universal) hash functions and gave a construction. As in the SJST protocol described above, our protocols employ pairwise independent hash functions.

Definition 6. Suppose that a class of hash functions $\mathcal{H} = \{h : \{0,1\}^m \to \{0,1\}^\ell\}$, where $m \ge \ell$, satisfies the following: for any distinct $x_1, x_2 \in \{0,1\}^m$ and any $y_1, y_2 \in \{0,1\}^\ell$,
$$\Pr_{h \in \mathcal{H}}\bigl[h(x_1) = y_1 \wedge h(x_2) = y_2\bigr] \le \gamma.$$
Then $\mathcal{H}$ is called $\gamma$-pairwise independent. In the above, the randomness comes from the uniform choice of $h$ over $\mathcal{H}$.

Here we mention a useful property of almost pairwise independent hash functions, which guarantees the security of some SMT protocols.

Lemma 1 ([30]). Let $\mathcal{H} = \{h : \{0,1\}^m \to \{0,1\}^\ell\}$ be a $\gamma$-almost pairwise independent hash function family. Then, for any $(x_1, c_1) \ne (x_2, c_2) \in \{0,1\}^m \times \{0,1\}^\ell$, we have
$$\Pr_{h \in \mathcal{H}}\bigl[c_1 \oplus h(x_1) = c_2 \oplus h(x_2)\bigr] \le 2^\ell \gamma.$$

In [33], Wegman and Carter constructed a family of $2^{1-2\ell}$-almost pairwise independent hash functions. In particular, their hash function family $\mathcal{H}_{wc} = \{h : \{0,1\}^m \to \{0,1\}^\ell\}$ satisfies
$$\Pr_{h \in \mathcal{H}_{wc}}\bigl[h(x_1) = y_1 \wedge h(x_2) = y_2\bigr] = 2^{1-2\ell}$$
for any distinct $x_1, x_2 \in \{0,1\}^m$ and any $y_1, y_2 \in \{0,1\}^\ell$, and also
$$\Pr_{h \in \mathcal{H}_{wc}}\bigl[c_1 \oplus h(x_1) = c_2 \oplus h(x_2)\bigr] = 2^{1-\ell} \quad (3)$$
for any distinct pairs $(x_1, c_1) \ne (x_2, c_2) \in \{0,1\}^m \times \{0,1\}^\ell$.
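For intuition, a standard pairwise independent family over a prime field can be written down in a few lines. This is not the Wegman–Carter construction of [33]; it is only an illustrative family of the form $h_{a,b}(x) = ax + b \bmod p$.

```python
import secrets

P = 2**61 - 1   # a Mersenne prime; the family maps Z_P to Z_P

def sample_hash():
    """Draw h_{a,b}(x) = (a*x + b) mod P with a, b uniform in Z_P.
    For fixed distinct x1 != x2 the pair (h(x1), h(x2)) is uniform over
    Z_P x Z_P, i.e. the family is pairwise independent with gamma = 1/P**2
    (over the range Z_P rather than over bit strings)."""
    a, b = secrets.randbelow(P), secrets.randbelow(P)
    return lambda x: (a * x + b) % P

h = sample_hash()
print(h(42), h(43))   # two independent-looking tags under the same random h
```

Modulo truncating its output to $\ell$ bits, a function drawn this way could also stand in for the `sample_hash` placeholder used in the earlier protocol sketches.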

References

1. Abraham, I., Dolev, D., Gonen, R., Halpern, J.Y.: Distributed computing meets game theory: robust mechanisms for rational secret sharing and multiparty computation. In: Ruppert, E., Malkhi, D. (eds.) PODC, pp. 53–62. ACM (2006)
2. Abraham, I., Dolev, D., Halpern, J.Y.: Distributed protocols for leader election: a game-theoretic perspective. In: Afek, Y. (ed.) DISC 2013. LNCS, vol. 8205, pp. 61–75. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41527-2_5
3. Agarwal, S., Cramer, R., de Haan, R.: Asymptotically optimal two-round perfectly secure message transmission. In: Dwork, C. (ed.) CRYPTO 2006. LNCS, vol. 4117, pp. 394–408. Springer, Heidelberg (2006). https://doi.org/10.1007/11818175_24
4. Asharov, G., Canetti, R., Hazay, C.: Towards a game theoretic view of secure computation. In: Paterson, K.G. (ed.) EUROCRYPT 2011. LNCS, vol. 6632, pp. 426–445. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20465-4_24
5. Asharov, G., Canetti, R., Hazay, C.: Toward a game theoretic view of secure computation. J. Cryptol. 29(4), 879–926 (2016)
6. Asharov, G., Lindell, Y.: Utility dependence in correct and fair rational secret sharing. J. Cryptol. 24(1), 157–202 (2011)
7. Azar, P.D., Micali, S.: Super-efficient rational proofs. In: Kearns, M., McAfee, R.P., Tardos, É. (eds.) EC 2013, pp. 29–30. ACM (2013)
8. Blakley, G.R.: Safeguarding cryptographic keys. In: Proceedings of the National Computer Conference 1979, vol. 48, pp. 313–317 (1979)
9. Dolev, D., Dwork, C., Waarts, O., Yung, M.: Perfectly secure message transmission. J. ACM 40(1), 17–47 (1993)
10. Franklin, M.K., Wright, R.N.: Secure communication in minimal connectivity models. J. Cryptol. 13(1), 9–30 (2000)
11. Fuchsbauer, G., Katz, J., Naccache, D.: Efficient rational secret sharing in standard communication networks. In: Micciancio [27], pp. 419–436
12. Fujita, M., Yasunaga, K., Koshiba, T.: Perfectly secure message transmission against rational timid adversaries. In: Bushnell, L., Poovendran, R., Başar, T. (eds.) GameSec 2018. LNCS, vol. 11199, pp. 127–144. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01554-1_8
13. Garay, J.A., Katz, J., Maurer, U., Tackmann, B., Zikas, V.: Rational protocol design: cryptography against incentive-driven adversaries. In: FOCS, pp. 648–657. IEEE Computer Society (2013)
14. Garay, J.A., Katz, J., Tackmann, B., Zikas, V.: How fair is your protocol?: a utility-based approach to protocol optimality. In: Georgiou, C., Spirakis, P.G. (eds.) PODC, pp. 281–290. ACM (2015)
15. Garay, J.A., Ostrovsky, R.: Almost-everywhere secure computation. In: Smart, N. (ed.) EUROCRYPT 2008. LNCS, vol. 4965, pp. 307–323. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-78967-3_18
16. Gradwohl, R.: Rationality in the full-information model. In: Micciancio [27], pp. 401–418
17. Groce, A., Katz, J.: Fair computation with rational players. In: Pointcheval, D., Johansson, T. (eds.) EUROCRYPT 2012. LNCS, vol. 7237, pp. 81–98. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-29011-4_7
18. Groce, A., Katz, J., Thiruvengadam, A., Zikas, V.: Byzantine agreement with a rational adversary. In: Czumaj, A., Mehlhorn, K., Pitts, A., Wattenhofer, R. (eds.) ICALP 2012. LNCS, vol. 7392, pp. 561–572. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31585-5_50
19. Guo, S., Hubáček, P., Rosen, A., Vald, M.: Rational arguments: single round delegation with sublinear verification. In: Naor, M. (ed.) ITCS, pp. 523–540. ACM (2014)
20. Halpern, J.Y., Teague, V.: Rational secret sharing and multiparty computation: extended abstract. In: Babai, L. (ed.) Proceedings of the 36th Annual ACM Symposium on Theory of Computing, Chicago, IL, USA, 13–16 June 2004, pp. 623–632. ACM (2004)
21. Halpern, J.Y., Vilaça, X.: Rational consensus: extended abstract. In: Giakkoupis, G. (ed.) PODC, pp. 137–146. ACM (2016)
22. Hayashi, M., Koshiba, T.: Universal construction of cheater-identifiable secret sharing against rushing cheaters based on message authentication. In: 2018 IEEE International Symposium on Information Theory, ISIT 2018, Vail, CO, USA, 17–22 June 2018, pp. 2614–2618. IEEE (2018)
23. Inasawa, K., Yasunaga, K.: Rational proofs against rational verifiers. IEICE Trans. 100-A(11), 2392–2397 (2017)
24. Kawachi, A., Okamoto, Y., Tanaka, K., Yasunaga, K.: General constructions of rational secret sharing with expected constant-round reconstruction. Comput. J. 60(5), 711–728 (2017)
25. Kurosawa, K., Suzuki, K.: Truly efficient 2-round perfectly secure message transmission scheme. IEEE Trans. Inf. Theory 55(11), 5223–5232 (2009)
26. McEliece, R.J., Sarwate, D.V.: On sharing secrets and Reed-Solomon codes. Commun. ACM 24(9), 583–584 (1981)
27. Micciancio, D. (ed.): TCC 2010. LNCS, vol. 5978. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-11799-2
28. Sayeed, H.M., Abu-Amara, H.: Efficient perfectly secure message transmission in synchronous networks. Inf. Comput. 126(1), 53–61 (1996)
29. Shamir, A.: How to share a secret. Commun. ACM 22(11), 612–613 (1979)
30. Shi, H., Jiang, S., Safavi-Naini, R., Tuhin, M.A.: On optimal secure message transmission by public discussion. IEEE Trans. Inf. Theory 57(1), 572–585 (2011)
31. Spini, G., Zémor, G.: Perfectly secure message transmission in two rounds. In: Hirt, M., Smith, A. (eds.) TCC 2016. LNCS, vol. 9985, pp. 286–304. Springer, Heidelberg (2016). https://doi.org/10.1007/978-3-662-53641-4_12
32. Srinathan, K., Narayanan, A., Rangan, C.P.: Optimal perfectly secure message transmission. In: Franklin, M. (ed.) CRYPTO 2004. LNCS, vol. 3152, pp. 545–561. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-28628-8_33
33. Wegman, M.N., Carter, L.: New hash functions and their use in authentication and set equality. J. Comput. Syst. Sci. 22(3), 265–279 (1981)
34. Yasunaga, K.: Public-key encryption with lazy parties. IEICE Trans. 99-A(2), 590–600 (2016)
35. Yasunaga, K., Yuzawa, K.: Repeated games for generating randomness in encryption. IEICE Trans. 101-A(4), 697–703 (2018)

Author Index

Allen, Joey 417
An, Bo 33, 542
Anwar, Ahmed H. 21
Barreto, Carlos 1
Basak, Anjon 21
Başar, Tamer 459
Bhargava, Radhika 45
Bilinski, Mark 65
Bošanský, Branislav 331, 542
Boumkheld, Nadia 85
Bucarey, Victor 97
Bushnell, Linda 385, 417
Canann, Taylor J. 118
Cartwright, Anna 135
Cartwright, Edward 135
Chakraborti, Tathagata 479
Chowdhary, Ankur 492
Clark, Andrew 352, 385
Clifton, Chris 45
Dán, György 297, 439
Desmedt, Yvo 152
Dunstatter, Noah 164
Fang, Fei 238, 525
Ferguson-Walter, Kimberly 65
Fugate, Sunny 65
Gabrys, Ryan 65
Griffin, Christopher 184
Grossklags, Jens 310
Guirguis, Mina 164
Gupta, Umang 238
Gutierrez, Marcus 21
Holvoet, Tom 310
Huang, Dijiang 492
Huang, Linan 196
Huang, Yunhan 217
Jadliwala, Murtuza 276
Johnson, Benjamin 310
Kambhampati, Subbarao 479, 492
Kamhoua, Charles 21
Kamra, Nitin 238
Khalili, Mohammad Mahdi 259
Kiekintveld, Christopher 21, 525
König, Sandra 404
Koshiba, Takeshi 563
Koutsoukos, Xenofon 1
Kumari, Kavita 276
Labbé, Martine 97
Lee, Wenke 417
Li, Zuxing 297
Liang, Yu 331
Liu, Mingyan 259
Liu, Yan 238
Maiti, Anindya 276
Manshaei, Mohammad Hossein 276
Mauger, Justin 65
Merlevede, Jonathan 310
Meyer, Joachim 33
Moothedath, Shana 417
Nguyen, Thanh H. 331
Niu, Luyao 352, 385
Oakley, Lisa 364
Oprea, Alina 364
Panaousis, Emmanouil 85, 404
Panda, Sakshyam 85
Poovendran, Radha 385, 417
Rajtmajer, Sarah 184
Ramasubramanian, Bhaskar 385
Rass, Stefan 85, 404
Sahabandu, Dinuka 417
Sandberg, Henrik 439
Sarıtaş, Serkan 439
Sayin, Muhammed O. 459
Sengupta, Sailik 479, 492
Shereen, Ezzeldin 439
Slinko, Arkadii 152, 513
Souza, Brian 65
Squicciarini, Anna 184
Tahsini, Alireza 164
Tambe, Milind 238, 525, 542
Tešić, Jelena 164
Thakoor, Omkar 525
Umar, Prasanna 184
Vayanos, Phebe 525
Venkatesan, Sridhar 21
Wang, Kai 238
Wang, Xinrun 33, 542
Xu, Haifeng 525
Xue, Lian 135
Yaakov, Yoav Ben 33
Yadav, Amulya 331
Yasunaga, Kenji 563
Zhang, Xueru 259
Zhu, Quanyan 196, 217